Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
Question about swap volume: swap volume's implementation is very similar to live snapshot's - both are implemented via blockRebase - but swap volume doesn't do any libvirt or qemu version check. Should we add a version check for swap_volume now? That would mean swap_volume gets disabled as well.

On 2014-06-26 19:00, Sean Dague wrote: While the Trusty transition was mostly uneventful, it has exposed a particular issue in libvirt, which is generating ~ 25% failure rate now on most tempest jobs. As can be seen here - https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297 ... the libvirt live_snapshot code is something that our test pipeline has never tested before, because it wasn't a new enough libvirt for us to take that path. Right now it's exploding, a lot - https://bugs.launchpad.net/nova/+bug/1334398

Snapshotting gets used in Tempest to create images for testing, so image setup tests are doing a decent number of snapshots. If I had to take a completely *wild guess*, it's that libvirt can't do 2 live_snapshots at the same time. It's probably something that most people haven't hit. The wild guess is based on other libvirt issues we've hit that other people haven't, and they are basically always a parallel-ops-triggered problem.

My 'stop the bleeding' suggested fix is this - https://review.openstack.org/#/c/102643/ - which just effectively disables this code path for now. Then we can get some libvirt experts engaged to help figure out the right long term fix. I think there are a couple:

1) see if newer libvirt fixes this (1.2.5 just came out), and if so mandate some known working version. This would actually take a bunch of work to be able to test a non-packaged libvirt in our pipeline. We'd need volunteers for that.

2) lock snapshot operations in nova-compute, so that we can only do 1 at a time.
Hopefully it's just 2 snapshot operations that is the issue, not any other libvirt op during a snapshot, so serializing snapshot ops in n-compute could put the kid gloves on libvirt and make it not break here. This also needs some volunteers, as we're going to be playing a game of progressive serialization until we get to a point where it looks like the failures go away.

3) Roll back to precise. I put this idea here for completeness, but I think it's a terrible choice. This is one isolated, previously untested (by us) code path. We can't stay on libvirt 0.9.6 forever, so we actually need to fix this for real (be it in nova's use of libvirt, or libvirt itself).

There might be other options as well; ideas welcomed. But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code.

-Sean

___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
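Option 1 above (mandating a known-working libvirt) would presumably reuse the same pattern the driver already applies to live_snapshot: compare the running libvirt against a required version tuple, using libvirt's major * 1000000 + minor * 1000 + micro integer encoding. A minimal sketch - the helper names and the floor version are hypothetical, not Nova's actual code:

```python
# libvirt reports versions as a single integer encoded as
# major * 1000000 + minor * 1000 + micro (see virGetVersion).
def version_to_int(version):
    major, minor, micro = version
    return major * 1000000 + minor * 1000 + micro

# Hypothetical "known good" floor; the real value would come out of
# testing newer libvirt releases in the pipeline.
MIN_LIBVIRT_LIVESNAPSHOT_VERSION = (1, 3, 0)

def has_min_version(running, required=MIN_LIBVIRT_LIVESNAPSHOT_VERSION):
    """Gate a code path on the hypervisor being new enough."""
    return version_to_int(running) >= version_to_int(required)
```

A swap_volume version check would then mean falling back to the safe implementation whenever has_min_version(...) is False, exactly as the live_snapshot path does today.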
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Wed, Jul 09, 2014 at 06:23:27PM -0400, Sean Dague wrote: The libvirt logs needed are huge, so we can't run them all the time. And realistically, I don't think they provided us the info we needed. There has been at least one fail on Dan's log hack patch for this scenario today, so maybe it will be in there.

I did finally get lucky and hit the failure, and the libvirtd.log has provided the info to narrow down the problem in QEMU, I believe. I'm going to be talking with QEMU developers about it based on this info now.

FYI, the logs are approximately 3 MB compressed for a full tempest run. If turned on, this would be either the 3rd or 4th largest log file we'd be collecting, adding 8-10% to the total size of all logs. Currently I had to do a crude hack to enable it:

diff --git a/etc/nova/rootwrap.d/compute.filters b/etc/nova/rootwrap.d/compute.filters
index b79851b..7e4469a 100644
--- a/etc/nova/rootwrap.d/compute.filters
+++ b/etc/nova/rootwrap.d/compute.filters
@@ -226,3 +226,6 @@ cp: CommandFilter, cp, root
 # nova/virt/xenapi/vm_utils.py:
 sync: CommandFilter, sync, root
+apt-get: CommandFilter, apt-get, root
+service: CommandFilter, service, root
+augtool: CommandFilter, augtool, root
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 99edf12..93e60af 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -28,6 +28,7 @@ Supports KVM, LXC, QEMU, UML, and XEN.
@@ -611,6 +619,16 @@ class LibvirtDriver(driver.ComputeDriver):
                     {'type': CONF.libvirt.virt_type, 'arch': arch})

     def init_host(self, host):
+        utils.execute('apt-get', '-y', 'install', 'augeas-tools',
+                      run_as_root=True)
+        utils.execute('augtool',
+                      process_input='set /files/etc/libvirt/libvirtd.conf/log_filters "1:libvirt.c 1:qemu 1:conf 1:security 3:object 3:event 3:json 3:file 1:util"\n'
+                                    'set /files/etc/libvirt/libvirtd.conf/log_outputs "1:file:/var/log/libvirt/libvirtd.log"\n'
+                                    'save\n',
+                      run_as_root=True)
+        utils.execute('service', 'libvirt-bin', 'restart',
+                      run_as_root=True)
+        time.sleep(10)

If we genuinely can't enable it all the time, then I think we really need to figure out a way to let us turn it on selectively per review, in a bit of an easier manner. devstack lets you set the DEBUG_LIBVIRT environment variable to turn this on, but there's no way for people to get that env var set in the gate runs - AFAICT the infra team would have to toggle that globally each time it was needed, which isn't really practical.

Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
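For reference, the augtool hack above boils down to the following two settings in /etc/libvirt/libvirtd.conf (paths as on the Ubuntu gate images); anyone reproducing locally can just edit the file and restart libvirtd rather than go through rootwrap:

```ini
log_filters="1:libvirt.c 1:qemu 1:conf 1:security 3:object 3:event 3:json 3:file 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```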
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On 07/10/2014 05:03 AM, Daniel P. Berrange wrote: On Wed, Jul 09, 2014 at 06:23:27PM -0400, Sean Dague wrote: The libvirt logs needed are huge, so we can't run them all the time. And realistically, I don't think they provided us the info we needed. There has been at least one fail on Dan's log hack patch for this scenario today, so maybe it will be in there. I did finally get lucky and hit the failure, and the libvirtd.log has provided the info to narrow down the problem in QEMU I believe. I'm going to be talking with QEMU developers about it based on this info now. FYI, the logs are approximately 3 MB compressed for a full tempest run. If turned on this would be either the 3rd or 4th largest log file we'd be collecting, adding 8-10% to the total size of all.

It's larger than anything other than the ceilometer logs, which are their own issue. Remember that we are doing 20-30k runs a week. So 3 MB x 20k runs = 60 GB/week. We're currently trying to keep 6 months of logs, so x26 weeks = roughly 1.5 TB of libvirt logs. We're currently limited by having a max of 14 x 1 TB volumes on our log server in Rax. We're hoping to fix that by using swift for log storage; if we get that in place, we could probably do this on every run.

Is it possible to make libvirt log to 2 log files? One that is the normal light load, and an enhanced error log? Then we could maybe make a decision at cleanup time about whether we need the error log saved or not - like, if things failed we'd keep it. This all starts to get more complicated, but might be worth exploring.
-- Sean Dague http://dague.net
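On Sean's two-log-file question: libvirtd's log_outputs setting accepts multiple space-separated outputs at different priorities, so a small always-kept error log could coexist with a large debug log that only gets uploaded when the job fails. A sketch, with invented file names:

```ini
# 3:... = warnings/errors only: small, safe to collect on every run.
# 1:... = full debug: large, kept by the cleanup step only on failure.
log_outputs="3:file:/var/log/libvirt/libvirtd-errors.log 1:file:/var/log/libvirt/libvirtd-debug.log"
```

The cleanup step at the end of the job would then delete the debug file on success and keep both on failure.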
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Tue, Jul 08, 2014 at 02:50:40PM -0700, Joe Gordon wrote: But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code. Agreed, we should merge the hack and treat the bug as a release blocker to be resolved prior to Juno GA. How can we prevent libvirt issues like this from landing in trunk in the first place? If we don't figure out a way to prevent this from landing in the first place, I fear we will keep repeating this same pattern of failure.

Realistically I don't think there was much/any chance of avoiding this problem. Despite many days of work trying to reproduce it by multiple people, no one has managed even a single failure outside of the gate. Even inside the gate it is hard to reproduce. I still have absolutely no clue what is failing after days of investigation and debugging with all the tricks I can think of, because, as I say, it works perfectly every time I try it, except in the gate, where it is impossible to debug.

Regards, Daniel
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Wed, Jul 09, 2014 at 08:58:02AM +1000, Michael Still wrote: On Wed, Jul 9, 2014 at 8:21 AM, Sean Dague s...@dague.net wrote: This is also why I find it unlikely to be a qemu bug, because that's not shared state between guests. If qemu just randomly wedges itself, that would be detectable much more easily outside of the gate. And there have been attempts by danpb to sniff that out, and they haven't worked. Do you think it would help if we added logging of what eventlet threads are running at the time of a failure like this? I can see that it might be a bit noisy, but it might also help nail down what this is an interaction between.

I don't think so. What I really need is more verbose libvirtd daemon logs from the time when it fails. I've done a gross hack with a review I have posted [1] which munges rootwrap to allow me to reconfigure libvirtd and capture logs. Unfortunately I've been unable to get it to fail on the snapshot bug since then - it is always hitting other bugs so far :-(

Regards, Daniel

[1] https://review.openstack.org/#/c/103066/
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Tue, Jul 08, 2014 at 06:21:31PM -0400, Sean Dague wrote: On 07/08/2014 06:12 PM, Joe Gordon wrote: On Tue, Jul 8, 2014 at 2:56 PM, Michael Still mi...@stillhq.com wrote: The associated bug says this is probably a qemu bug, so I think we should rephrase that to we need to start thinking about how to make sure upstream changes don't break nova. Good point. Would running devstack-tempest on the latest upstream release of ? help. Not as a voting job but as a periodic (third party?) job, so that we can hopefully identify these issues early on. I think the big question here is who would volunteer to help run a job like this. Although, I'm familiar with Gate and infra in depth, I can help volunteer debug such issues (as I try to test libvirt/QEMU upstreams and from git quite frequently). The running of the job really isn't the issue. It's the debugging of the jobs when they go wrong. Creating a new test job and getting it lit is really 10% of the work; sifting through the fails and getting to the bottom of things is the hard and time-consuming part.

Very true. For instance, the live snapshot issue[1]: I wish we could get to the logical end of it (without letting it languish) and enable it back in Nova soon. But, as of now, we're not able to pinpoint the root cause, and it's no longer reproducible per Dan Berrange's detailed analysis - neither after a week of tests outside the Gate, nor in tests with some debugging enabled[2] when there's a light load on the Gate; in both cases he didn't hit the issue after multiple test runs. Dan raised on #openstack-nova that there might be some weird I/O issue in HP cloud that's leading to these timeouts, but Sean said a timeout would only be an issue if this (the test in question) sometimes takes 2 minutes and succeeds.
FWIW, from my local tests of the exact Nova invocation of the libvirt blockRebase API to do parallel blockcopy operations followed by an explicit abort (to gracefully end the block operation), I couldn't reproduce it on multiple runs either.

[1] https://bugs.launchpad.net/nova/+bug/1334398 -- libvirt live_snapshot periodically explodes on libvirt 1.2.2 in the gate
[2] https://review.openstack.org/#/c/103066/

The other option is to remove more concurrency from nova-compute. It's pretty clear that this problem only seems to happen when the snapshotting is going on at the same time guests are being created or destroyed (possibly also with a second snapshot going on). This is also why I find it unlikely to be a qemu bug, because that's not shared state between guests. If qemu just randomly wedges itself, that would be detectable much more easily outside of the gate. And there have been attempts by danpb to sniff that out, and they haven't worked. -Sean

-- /kashyap
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On 07/09/2014 03:58 AM, Daniel P. Berrange wrote: On Tue, Jul 08, 2014 at 02:50:40PM -0700, Joe Gordon wrote: But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code. Agreed, we should merge the hack and treat the bug as a release blocker to be resolved prior to Juno GA. How can we prevent libvirt issues like this from landing in trunk in the first place? If we don't figure out a way to prevent this from landing in the first place I fear we will keep repeating this same pattern of failure.

Right, this is where math is against us. If a race shows up 1% of the time, you need 66 runs to have a 50% chance of seeing it. I still haven't calibrated the bugs to an absolute scale, but based on what I remember, this livesnapshot bug was probably a 3-4% bug (per Tempest run). So you'd need 50 Tempest runs to have an 80% chance of seeing it show up again. (Absolute calibration of the bugs is on my todo list for Elastic Recheck; maybe it's time to put that in front of fixing the bugs.)

Realistically I don't think there was much/any chance of avoiding this problem. Despite many days of work trying to reproduce it by multiple people, no one has managed even a single failure outside of the gate. Even inside the gate it is hard to reproduce. I still have absolutely no clue what is failing after days of investigation and debugging with all the tricks I can think of, because as I say, it works perfectly every time I try it, except in the gate where it is impossible to debug it.

Out of curiosity, is your reproducer using eventlet? My expectation is that eventlet's concurrency actually exacerbates this, because when the snapshot starts we're now doing IO, and that means it's exactly the time that other compute work will be triggered.

-Sean

-- Sean Dague http://dague.net
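Sean's arithmetic can be sanity-checked in a couple of lines: for a per-run failure rate p, the number of runs needed to see at least one failure with probability c is ceil(log(1-c) / log(1-p)). A sketch (the helper name is made up):

```python
import math

def runs_to_see(p_fail, confidence):
    """Runs needed so P(at least one failure) >= confidence.

    P(no failure in n runs) = (1 - p_fail) ** n, so we need the
    smallest n with (1 - p_fail) ** n <= 1 - confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

The formula gives 69 runs for a 1% race at 50% confidence and 46 runs for a 3.5% bug at 80% confidence, in the same ballpark as the 66 and ~50 figures quoted above.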
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Wed, Jul 09, 2014 at 08:34:06AM -0400, Sean Dague wrote: [snip] Out of curiosity, is your reproducer using eventlet? My expectation is that eventlet's concurrency actually exacerbates this, because when the snapshot starts we're now doing IO, and that means it's exactly the time that other compute work will be triggered.

I've tried both running the tempest suite itself, and also running a dedicated stress test written against libvirt snapshot APIs in C.
Regards, Daniel
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Wed, Jul 09, 2014 at 05:47:47PM +0530, Kashyap Chamarthy wrote: [snip] I think the big question here is who would volunteer to help run a job like this. Although, I'm familiar Oops, typo: *Not familiar :-) with Gate and infra in depth, I can help volunteer debug such issues (as I try to test libvirt/QEMU upstreams and from git quite frequently).

-- /kashyap
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
Can we get our gate images tweaked to have more verbose libvirt logging on in general? There's been a few times in the last year or so when we've really needed it.

Michael

On Wed, Jul 9, 2014 at 6:01 PM, Daniel P. Berrange berra...@redhat.com wrote: [snip]

-- Rackspace Australia
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
The libvirt logs needed are huge, so we can't run them all the time. And realistically, I don't think they provided us the info we needed. There has been at least one fail on Dan's log hack patch for this scenario today, so maybe it will be in there.

On 07/09/2014 05:44 PM, Michael Still wrote: [snip]

-- Sean Dague http://dague.net
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Thu, Jun 26, 2014 at 4:12 AM, Daniel P. Berrange berra...@redhat.com wrote: On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote: [snip] My 'stop the bleeding' suggested fix is this - https://review.openstack.org/#/c/102643/ which just effectively disables this code path for now. Then we can get some libvirt experts engaged to help figure out the right long term fix. Yes, this is a sensible pragmatic workaround for the short term until we diagnose the root cause and fix it. [snip] 3) Roll back to precise. I put this idea here for completeness, but I think it's a terrible choice. Yep, since we *never* tested this code path in the gate before, rolling back to precise would not even really be a fix for the problem. It would merely mean we're not testing the code path again, which is really akin to sticking our head in the sand. But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code. Agreed, we should merge the hack and treat the bug as a release blocker to be resolved prior to Juno GA.

How can we prevent libvirt issues like this from landing in trunk in the first place? If we don't figure out a way to prevent this from landing in the first place, I fear we will keep repeating this same pattern of failure.

Regards, Daniel
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
The associated bug says this is probably a qemu bug, so I think we should rephrase that to we need to start thinking about how to make sure upstream changes don't break nova. Michael On Wed, Jul 9, 2014 at 7:50 AM, Joe Gordon joe.gord...@gmail.com wrote: On Thu, Jun 26, 2014 at 4:12 AM, Daniel P. Berrange berra...@redhat.com wrote: On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote: While the Trusty transition was mostly uneventful, it has exposed a particular issue in libvirt, which is generating ~ 25% failure rate now on most tempest jobs. As can be seen here - https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297 ... the libvirt live_snapshot code is something that our test pipeline has never tested before, because it wasn't a new enough libvirt for us to take that path. Right now it's exploding, a lot - https://bugs.launchpad.net/nova/+bug/1334398 Snapshotting gets used in Tempest to create images for testing, so image setup tests are doing a decent number of snapshots. If I had to take a completely *wild guess*, it's that libvirt can't do 2 live_snapshots at the same time. It's probably something that most people haven't hit. The wild guess is based on other libvirt issues we've hit that other people haven't, and they are basically always a parallel ops triggered problem. My 'stop the bleeding' suggested fix is this - https://review.openstack.org/#/c/102643/ which just effectively disables this code path for now. Then we can get some libvirt experts engaged to help figure out the right long term fix. Yes, this is a sensible pragmatic workaround for the short term until we diagnose the root cause fix it. I think there are a couple: 1) see if newer libvirt fixes this (1.2.5 just came out), and if so mandate at some known working version. This would actually take a bunch of work to be able to test a non packaged libvirt in our pipeline. We'd need volunteers for that. 
2) lock snapshot operations in nova-compute, so that we can only do 1 at a time. Hopefully it's just 2 snapshot operations that is the issue, not any other libvirt op during a snapshot, so serializing snapshot ops in n-compute could put the kid gloves on libvirt and make it not break here. This also needs some volunteers as we're going to be playing a game of progressive serialization until we get to a point where it looks like the failures go away. 3) Roll back to precise. I put this idea here for completeness, but I think it's a terrible choice. This is one isolated, previously untested (by us), code path. We can't stay on libvirt 0.9.6 forever, so actually need to fix this for real (be it in nova's use of libvirt, or libvirt itself). Yep, since we *never* tested this code path in the gate before, rolling back to precise would not even really be a fix for the problem. It would merely mean we're not testing the code path again, which is really akin to sticking our head in the sand. But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code. Agreed, we should merge the hack and treat the bug as release blocker to be resolve prior to Juno GA. How can we prevent libvirt issues like this from landing in trunk in the first place? If we don't figure out a way to prevent this from landing the first place I fear we will keep repeating this same pattern of failure. 
--
Rackspace Australia

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
Joe,

What about running benchmarks (with a small load) for all major functions (like snapshotting, booting/deleting, ...) on every patch in nova? It could catch a lot of related issues.

Best regards,
Boris Pavlovic

On Wed, Jul 9, 2014 at 1:56 AM, Michael Still mi...@stillhq.com wrote:
> The associated bug says this is probably a qemu bug, so I think we should rephrase that to "we need to start thinking about how to make sure upstream changes don't break nova".
> [snip]
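A per-patch benchmark gate along the lines Boris suggests could, in a very rough hypothetical sketch, time each major operation and flag anything that regresses past a threshold (the names and threshold here are assumptions for illustration, not an existing CI job):

```python
import time

def time_operation(func, threshold_seconds):
    # Hypothetical micro-benchmark: run one operation once, return how
    # long it took and whether it stayed under the per-patch threshold.
    start = time.monotonic()
    func()
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= threshold_seconds
```

A real job would run each operation (snapshot, boot/delete, ...) several times against a devstack node and compare against a rolling baseline rather than a fixed threshold.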
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Tue, Jul 8, 2014 at 2:56 PM, Michael Still mi...@stillhq.com wrote:
> The associated bug says this is probably a qemu bug, so I think we should rephrase that to "we need to start thinking about how to make sure upstream changes don't break nova".

Good point. Would running devstack-tempest on the latest upstream release of ? help? Not as a voting job, but as a periodic (third party?) job, so that we can hopefully identify these issues early on. I think the big question here is who would volunteer to help run a job like this.

On Wed, Jul 9, 2014 at 7:50 AM, Joe Gordon joe.gord...@gmail.com wrote:
> [snip - see the earlier messages in this thread]
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Wed, Jul 9, 2014 at 8:21 AM, Sean Dague s...@dague.net wrote:
> This is also why I find it unlikely to be a qemu bug, because that's not shared state between guests. If qemu just randomly wedges itself, that would be much easier to detect outside of the gate. And there have been attempts by danpb to sniff that out, and they haven't worked.

Do you think it would help if we added logging of which eventlet threads are running at the time of a failure like this? I can see that it might be a bit noisy, but it might also help nail down what this is an interaction between.

Michael

--
Rackspace Australia
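The kind of logging Michael describes could look something like the following minimal sketch. It formats a stack trace for every live native thread via sys._current_frames(); the same idea extends to eventlet greenthreads (e.g. by walking gc.get_objects() for greenlet frames), but the stdlib-only version keeps the sketch self-contained. The helper name is hypothetical, not actual nova code.

```python
import sys
import threading
import traceback

def dump_thread_stacks():
    # Hypothetical debugging helper: format a stack trace for every
    # live thread in this process, e.g. to log at the moment a
    # libvirt operation fails.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (id %s):" % (names.get(ident, "?"), ident))
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)
```

Calling something like this from the exception handler around the snapshot call would show what else the process was doing when libvirt blew up.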
[openstack-dev] [nova] top gate bug is libvirt snapshot
While the Trusty transition was mostly uneventful, it has exposed a particular issue in libvirt, which is generating a ~25% failure rate now on most tempest jobs. As can be seen here - https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297 - the libvirt live_snapshot code is something that our test pipeline has never tested before, because it wasn't a new enough libvirt for us to take that path. Right now it's exploding, a lot - https://bugs.launchpad.net/nova/+bug/1334398

Snapshotting gets used in Tempest to create images for testing, so image setup tests are doing a decent number of snapshots. If I had to take a completely *wild guess*, it's that libvirt can't do 2 live_snapshots at the same time. It's probably something that most people haven't hit. The wild guess is based on other libvirt issues we've hit that other people haven't, and they are basically always a parallel-ops-triggered problem.

My 'stop the bleeding' suggested fix is this - https://review.openstack.org/#/c/102643/ - which effectively disables this code path for now. Then we can get some libvirt experts engaged to help figure out the right long-term fix. I think there are a couple:

1) See if newer libvirt fixes this (1.2.5 just came out), and if so, mandate some known working version. This would actually take a bunch of work to be able to test a non-packaged libvirt in our pipeline. We'd need volunteers for that.

2) Lock snapshot operations in nova-compute, so that we can only do 1 at a time. Hopefully it's just 2 snapshot operations that is the issue, not any other libvirt op during a snapshot, so serializing snapshot ops in n-compute could put the kid gloves on libvirt and make it not break here. This also needs some volunteers, as we're going to be playing a game of progressive serialization until we get to a point where it looks like the failures go away.

3) Roll back to precise. I put this idea here for completeness, but I think it's a terrible choice. This is one isolated, previously untested (by us) code path. We can't stay on libvirt 0.9.6 forever, so we actually need to fix this for real (be it in nova's use of libvirt, or in libvirt itself).

There might be other options as well; ideas welcome. But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code.

-Sean

--
Sean Dague
http://dague.net
Re: [openstack-dev] [nova] top gate bug is libvirt snapshot
On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote:
> [snip]
> My 'stop the bleeding' suggested fix is this - https://review.openstack.org/#/c/102643/ which just effectively disables this code path for now. Then we can get some libvirt experts engaged to help figure out the right long term fix.

Yes, this is a sensible pragmatic workaround for the short term until we diagnose the root cause and fix it.

> 3) Roll back to precise. I put this idea here for completeness, but I think it's a terrible choice. This is one isolated, previously untested (by us), code path. We can't stay on libvirt 0.9.6 forever, so actually need to fix this for real (be it in nova's use of libvirt, or libvirt itself).

Yep, since we *never* tested this code path in the gate before, rolling back to precise would not even really be a fix for the problem. It would merely mean we're not testing the code path again, which is really akin to sticking our head in the sand.

> But for right now, we should stop the bleeding, so that nova/libvirt isn't blocking everyone else from merging code.

Agreed, we should merge the hack and treat the bug as a release blocker to be resolved prior to Juno GA.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
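Option 1 from the thread - mandating a known working libvirt version - would amount to gating the live_snapshot path on a minimum-version check, roughly like the hedged sketch below. The constant and function names are assumptions for illustration, and 1.2.5 is just the release mentioned in the thread, not a verified fix.

```python
# Hypothetical minimum known-good libvirt for the live_snapshot path.
MIN_LIBVIRT_LIVESNAPSHOT_VERSION = (1, 2, 5)

def parse_version(version_string):
    # "1.2.5" -> (1, 2, 5), so versions compare correctly as tuples
    # ("1.10.0" sorts after "1.2.5", unlike a plain string comparison).
    return tuple(int(part) for part in version_string.split("."))

def can_live_snapshot(libvirt_version):
    # Anything older would fall back to the cold-snapshot path.
    return parse_version(libvirt_version) >= MIN_LIBVIRT_LIVESNAPSHOT_VERSION
```

The same shape of check is what the driver code linked above already does for other version-dependent features.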