Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-15 Thread Alex Xu
A question about swap volume: its implementation is very similar to
live snapshot's.
Both are implemented via blockRebase, but swap volume doesn't check the
libvirt or qemu version at all.
Should we add a version check for swap_volume now? That would mean
swap_volume gets disabled as well.
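For context, a version gate like the one live_snapshot uses could be sketched as below. This is an illustrative sketch only: the constant names, thresholds, and helper signatures are assumptions, not nova's actual API.

```python
# Sketch: gate swap_volume on minimum libvirt/QEMU versions, mirroring the
# live_snapshot version check. All names and thresholds here are
# illustrative assumptions, not nova's real values.
MIN_LIBVIRT_SWAP_VOLUME_VERSION = (1, 1, 1)
MIN_QEMU_SWAP_VOLUME_VERSION = (1, 3, 0)

def version_to_int(ver):
    # libvirt reports versions packed as major*1000000 + minor*1000 + micro
    major, minor, micro = ver
    return major * 1000000 + minor * 1000 + micro

def can_swap_volume(libvirt_version, qemu_version):
    """Return True only when both daemon versions meet the minimums."""
    return (libvirt_version >= version_to_int(MIN_LIBVIRT_SWAP_VOLUME_VERSION)
            and qemu_version >= version_to_int(MIN_QEMU_SWAP_VOLUME_VERSION))
```

With a gate like this, swap_volume would fall back to being disabled on old daemons, exactly as Alex describes for live snapshot.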


On 2014-06-26 19:00, Sean Dague wrote:

While the Trusty transition was mostly uneventful, it has exposed a
particular issue in libvirt, which is generating ~ 25% failure rate now
on most tempest jobs.

As can be seen here -
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297


... the libvirt live_snapshot code is something that our test pipeline
has never tested before, because it wasn't a new enough libvirt for us
to take that path.

Right now it's exploding, a lot -
https://bugs.launchpad.net/nova/+bug/1334398

Snapshotting gets used in Tempest to create images for testing, so image
setup tests are doing a decent number of snapshots. If I had to take a
completely *wild guess*, it's that libvirt can't do 2 live_snapshots at
the same time. It's probably something that most people haven't hit. The
wild guess is based on other libvirt issues we've hit that other people
haven't, and they are basically always a parallel ops triggered problem.

My 'stop the bleeding' suggested fix is this -
https://review.openstack.org/#/c/102643/ which just effectively disables
this code path for now. Then we can get some libvirt experts engaged to
help figure out the right long term fix.

I think there are a couple:

1) see if newer libvirt fixes this (1.2.5 just came out), and if so
mandate some known working version. This would actually take a bunch
of work to be able to test a non packaged libvirt in our pipeline. We'd
need volunteers for that.

2) lock snapshot operations in nova-compute, so that we can only do 1 at
a time. Hopefully it's just 2 snapshot operations that are the issue, not
any other libvirt op during a snapshot, so serializing snapshot ops in
n-compute could put the kid gloves on libvirt and make it not break
here. This also needs some volunteers as we're going to be playing a
game of progressive serialization until we get to a point where it looks
like the failures go away.
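Option 2 might look roughly like the following. This is a minimal sketch under assumptions: nova would really use oslo's lockutils.synchronized decorator rather than a bare lock, and the function names here are made up.

```python
import threading

# One module-level lock per nova-compute process: at most one live
# snapshot's blockRebase runs at a time. "do_rebase" stands in for the
# real libvirt call; all names here are illustrative.
_snapshot_lock = threading.Lock()

def serialized_live_snapshot(domain, disk_path, out_path, do_rebase):
    with _snapshot_lock:  # other snapshot requests block here
        return do_rebase(domain, disk_path, out_path)
```

"Progressive serialization" would mean widening the scope of what takes this lock (first snapshots only, then other libvirt ops if needed) until the failures disappear.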

3) Roll back to precise. I put this idea here for completeness, but I
think it's a terrible choice. This is one isolated, previously untested
(by us), code path. We can't stay on libvirt 0.9.6 forever, so actually
need to fix this for real (be it in nova's use of libvirt, or libvirt
itself).

There might be other options as well, ideas welcomed.

But for right now, we should stop the bleeding, so that nova/libvirt
isn't blocking everyone else from merging code.

-Sean



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-10 Thread Daniel P. Berrange
On Thu, Jul 10, 2014 at 08:32:37AM -0400, Sean Dague wrote:
> On 07/10/2014 05:03 AM, Daniel P. Berrange wrote:
> > On Wed, Jul 09, 2014 at 06:23:27PM -0400, Sean Dague wrote:
> >> The libvirt logs needed are huge, so we can't run them all the time. And
> >> realistically, I don't think they provided us the info we needed. There
> >> has been at least one fail on Dan's log hack patch for this scenario
> >> today, so maybe it will be in there.
> > 
> > I did finally get lucky and hit the failure, and the libvirtd.log has
> > provided the info to narrow down the problem in QEMU I believe. I'm
> > going to be talking with QEMU developers about it based on this info
> > now.
> > 
> > FYI, the logs are approximately 3 MB compressed for a full tempest
> > run. If turned on this would be either the 3rd or 4th largest log
> > file we'd be collecting, adding 8-10% to the total size of all.
> 
> It's larger than anything other than the ceilometer logs, which are
> their own issue. Remember that we are doing 20 - 30k runs a week. So 3MB
> * 20k = 60 GB / week. We're currently trying to keep 6 months of logs.
> So * 26 ≈ 1.5 TB of libvirt logs. We're currently limited by having a
> max of 14 x 1 TB volumes on our log server in Rax.

Could we simply expire the libvirtd.log file after 2 weeks while
leaving everything else at 6 months?
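If the log server's retention is file based, a split policy like that could be sketched as below; the helper name and invocation are illustrative, not the infra team's actual tooling.

```shell
# prune_libvirtd_logs DIR: delete libvirtd.log* files older than 14 days,
# leaving every other log under DIR on the normal 6-month retention.
prune_libvirtd_logs() {
    find "$1" -name 'libvirtd.log*' -mtime +14 -delete
}
```

Run periodically (e.g. from cron) against the log root, it would cap libvirtd.log storage at roughly 2 weeks' worth, about 120 GB at the quoted rates.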

> We're hoping to fix that with using swift for log storage, if we get
> that in place, we could probably do that on every run.
> 
> Is it possible to make libvirt log to 2 log files? One that is the
> normal light load, and an enhanced error log? Then we could maybe make a
> decision on cleanup time about if we need the error log saved or not.
> Like if things failed we'd keep it. This all starts to get more
> complicated, but might be worth exploring.

Yes, you can easily do multiple log files at different levels
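For example, libvirtd.conf takes a space-separated list of outputs, each with its own minimum level (1=debug ... 4=error); the debug-file path below is illustrative:

```
# /etc/libvirt/libvirtd.conf
# Warnings and errors to the normal log, full debug to a separate file:
log_outputs="3:file:/var/log/libvirt/libvirtd.log 1:file:/var/log/libvirt/libvirtd-debug.log"
```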

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-10 Thread Sean Dague
On 07/10/2014 05:03 AM, Daniel P. Berrange wrote:
> On Wed, Jul 09, 2014 at 06:23:27PM -0400, Sean Dague wrote:
>> The libvirt logs needed are huge, so we can't run them all the time. And
>> realistically, I don't think they provided us the info we needed. There
>> has been at least one fail on Dan's log hack patch for this scenario
>> today, so maybe it will be in there.
> 
> I did finally get lucky and hit the failure, and the libvirtd.log has
> provided the info to narrow down the problem in QEMU I believe. I'm
> going to be talking with QEMU developers about it based on this info
> now.
> 
> FYI, the logs are approximately 3 MB compressed for a full tempest
> run. If turned on this would be either the 3rd or 4th largest log
> file we'd be collecting, adding 8-10% to the total size of all.

It's larger than anything other than the ceilometer logs, which are
their own issue. Remember that we are doing 20 - 30k runs a week. So 3MB
* 20k = 60 GB / week. We're currently trying to keep 6 months of logs.
So * 26 ≈ 1.5 TB of libvirt logs. We're currently limited by having a
max of 14 x 1 TB volumes on our log server in Rax.
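The arithmetic above works out as quoted; a quick back-of-envelope check (using the lower 20k runs/week figure):

```python
# Check of the log-volume figures quoted above.
mb_per_run = 3          # compressed libvirtd.log per tempest run
runs_per_week = 20_000  # lower bound of the 20-30k quoted
weeks_retained = 26     # ~6 months

gb_per_week = mb_per_run * runs_per_week / 1000    # 60 GB/week
tb_retained = gb_per_week * weeks_retained / 1000  # ~1.56 TB
```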

We're hoping to fix that with using swift for log storage, if we get
that in place, we could probably do that on every run.

Is it possible to make libvirt log to 2 log files? One that is the
normal light load, and an enhanced error log? Then we could maybe make a
decision on cleanup time about if we need the error log saved or not.
Like if things failed we'd keep it. This all starts to get more
complicated, but might be worth exploring.

> 
> Currently I had to do a crude hack to enable it
> 
> diff --git a/etc/nova/rootwrap.d/compute.filters b/etc/nova/rootwrap.d/compute.filters
> index b79851b..7e4469a 100644
> --- a/etc/nova/rootwrap.d/compute.filters
> +++ b/etc/nova/rootwrap.d/compute.filters
> @@ -226,3 +226,6 @@ cp: CommandFilter, cp, root
>  # nova/virt/xenapi/vm_utils.py:
>  sync: CommandFilter, sync, root
>  
> +apt-get: CommandFilter, apt-get, root
> +service: CommandFilter, service, root
> +augtool: CommandFilter, augtool, root
> diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
> index 99edf12..93e60af 100644
> --- a/nova/virt/libvirt/driver.py
> +++ b/nova/virt/libvirt/driver.py
> @@ -28,6 +28,7 @@ Supports KVM, LXC, QEMU, UML, and XEN.
> @@ -611,6 +619,16 @@ class LibvirtDriver(driver.ComputeDriver):
>                    {'type': CONF.libvirt.virt_type, 'arch': arch})
>  
>      def init_host(self, host):
> +        utils.execute("apt-get", "-y", "install", "augeas-tools",
> +                      run_as_root=True)
> +        utils.execute("augtool",
> +                      process_input="""set /files/etc/libvirt/libvirtd.conf/log_filters "1:libvirt.c 1:qemu 1:conf 1:security 3:object 3:event 3:json 3:file 1:util"
> +set /files/etc/libvirt/libvirtd.conf/log_outputs "1:file:/var/log/libvirt/libvirtd.log"
> +save
> +""", run_as_root=True)
> +        utils.execute("service", "libvirt-bin", "restart",
> +                      run_as_root=True)
> +        time.sleep(10)
> +
> 
> 
> 
> If we genuinely can't enable it all the time, then I think we really need
> to figure out a way to let us turn it on selectively per review, in a bit
> of an easier manner. devstack lets you set DEBUG_LIBVIRT environment
> variable to turn this on, but there's no way for people to get that env
> var set in the gate runs - AFAICT infra team would have to toggle that
> globally each time it was needed which isn't really practical.
> 
> Regards,
> Daniel
> 


-- 
Sean Dague
http://dague.net





Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-10 Thread Daniel P. Berrange
On Wed, Jul 09, 2014 at 06:23:27PM -0400, Sean Dague wrote:
> The libvirt logs needed are huge, so we can't run them all the time. And
> realistically, I don't think they provided us the info we needed. There
> has been at least one fail on Dan's log hack patch for this scenario
> today, so maybe it will be in there.

I did finally get lucky and hit the failure, and the libvirtd.log has
provided the info to narrow down the problem in QEMU I believe. I'm
going to be talking with QEMU developers about it based on this info
now.

FYI, the logs are approximately 3 MB compressed for a full tempest
run. If turned on this would be either the 3rd or 4th largest log
file we'd be collecting, adding 8-10% to the total size of all.

Currently I had to do a crude hack to enable it

diff --git a/etc/nova/rootwrap.d/compute.filters b/etc/nova/rootwrap.d/compute.filters
index b79851b..7e4469a 100644
--- a/etc/nova/rootwrap.d/compute.filters
+++ b/etc/nova/rootwrap.d/compute.filters
@@ -226,3 +226,6 @@ cp: CommandFilter, cp, root
 # nova/virt/xenapi/vm_utils.py:
 sync: CommandFilter, sync, root
 
+apt-get: CommandFilter, apt-get, root
+service: CommandFilter, service, root
+augtool: CommandFilter, augtool, root
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 99edf12..93e60af 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -28,6 +28,7 @@ Supports KVM, LXC, QEMU, UML, and XEN.
@@ -611,6 +619,16 @@ class LibvirtDriver(driver.ComputeDriver):
                   {'type': CONF.libvirt.virt_type, 'arch': arch})
 
     def init_host(self, host):
+        utils.execute("apt-get", "-y", "install", "augeas-tools",
+                      run_as_root=True)
+        utils.execute("augtool",
+                      process_input="""set /files/etc/libvirt/libvirtd.conf/log_filters "1:libvirt.c 1:qemu 1:conf 1:security 3:object 3:event 3:json 3:file 1:util"
+set /files/etc/libvirt/libvirtd.conf/log_outputs "1:file:/var/log/libvirt/libvirtd.log"
+save
+""", run_as_root=True)
+        utils.execute("service", "libvirt-bin", "restart",
+                      run_as_root=True)
+        time.sleep(10)
+



If we genuinely can't enable it all the time, then I think we really need
to figure out a way to let us turn it on selectively per review, in a bit
of an easier manner. devstack lets you set DEBUG_LIBVIRT environment
variable to turn this on, but there's no way for people to get that env
var set in the gate runs - AFAICT infra team would have to toggle that
globally each time it was needed which isn't really practical.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Sean Dague
The libvirt logs needed are huge, so we can't run them all the time. And
realistically, I don't think they provided us the info we needed. There
has been at least one fail on Dan's log hack patch for this scenario
today, so maybe it will be in there.

On 07/09/2014 05:44 PM, Michael Still wrote:
> Can we get our gate images tweaked to have more verbose libvirt
> logging on in general? There's been a few times in the last year or so
> when we've really needed it.
> 
> Michael
> 
> On Wed, Jul 9, 2014 at 6:01 PM, Daniel P. Berrange  
> wrote:
>> On Wed, Jul 09, 2014 at 08:58:02AM +1000, Michael Still wrote:
>>> On Wed, Jul 9, 2014 at 8:21 AM, Sean Dague  wrote:
>>>
 This is also why I find it unlikely to be a qemu bug, because that's not
 shared state between guests. If qemu just randomly wedges itself, that
 would be detectable much easier outside of the gate. And there have been
 attempts by danpb to sniff that out, and they haven't worked.
>>>
>>> Do you think it would help if we added logging of what eventlet
>>> threads are running at the time of a failure like this? I can see that
>>> it might be a bit noisy, but it might also help nail down what this
>>> is an interaction between.
>>
>> I don't think so. What I really need is more verbose libvirtd daemon
>> logs from the time when it fails. I've done a gross hack with a review
>> I have posted [1] which munges rootwrap to allow me to reconfigure
>> libvirtd and capture logs. Unfortunately I've been unable to get it
>> to fail on the snapshot bug since then - it is always hitting other
>> bugs so far :-(
>>
>> Regards,
>> Daniel
>>
>> [1] https://review.openstack.org/#/c/103066/
>> --
>> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
>> |: http://libvirt.org  -o- http://virt-manager.org :|
>> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
>> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
>>
> 
> 
> 


-- 
Sean Dague
http://dague.net





Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Michael Still
Can we get our gate images tweaked to have more verbose libvirt
logging on in general? There's been a few times in the last year or so
when we've really needed it.

Michael

On Wed, Jul 9, 2014 at 6:01 PM, Daniel P. Berrange  wrote:
> On Wed, Jul 09, 2014 at 08:58:02AM +1000, Michael Still wrote:
>> On Wed, Jul 9, 2014 at 8:21 AM, Sean Dague  wrote:
>>
>> > This is also why I find it unlikely to be a qemu bug, because that's not
>> > shared state between guests. If qemu just randomly wedges itself, that
>> > would be detectable much easier outside of the gate. And there have been
>> > attempts by danpb to sniff that out, and they haven't worked.
>>
>> Do you think it would help if we added logging of what eventlet
>> threads are running at the time of a failure like this? I can see that
>> it might be a bit noisy, but it might also help nail down what this
>> is an interaction between.
>
> I don't think so. What I really need is more verbose libvirtd daemon
> logs from the time when it fails. I've done a gross hack with a review
> I have posted [1] which munges rootwrap to allow me to reconfigure
> libvirtd and capture logs. Unfortunately I've been unable to get it
> to fail on the snapshot bug since then - it is always hitting other
> bugs so far :-(
>
> Regards,
> Daniel
>
> [1] https://review.openstack.org/#/c/103066/
> --
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
>



-- 
Rackspace Australia



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Kashyap Chamarthy
On Wed, Jul 09, 2014 at 05:47:47PM +0530, Kashyap Chamarthy wrote:
> On Tue, Jul 08, 2014 at 06:21:31PM -0400, Sean Dague wrote:
> > On 07/08/2014 06:12 PM, Joe Gordon wrote:
> > > 
> > > 
> > > 
> > > On Tue, Jul 8, 2014 at 2:56 PM, Michael Still  > > > wrote:
> > > 
> > > The associated bug says this is probably a qemu bug, so I think we
> > > should rephrase that to "we need to start thinking about how to make
> > > sure upstream changes don't break nova".
> > > 
> > > 
> > > Good point.
> > >  
> > > 
> > > Would running devstack-tempest on the latest upstream release of ? help.
> > > Not as a voting job but as a periodic (third party?) job, that we can
> > > hopefully identify these issues early on. I think the big question here
> > > is who would volunteer to help run a job like this.
> 
> Although, I'm familiar 

Oops, typo: *Not familiar :-)

> with Gate and infra in depth, I can help
> volunteer debug such issues (as I try to test libvirt/QEMU upstreams and
> from git quite frequently).
> 

-- 
/kashyap



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Daniel P. Berrange
On Wed, Jul 09, 2014 at 08:34:06AM -0400, Sean Dague wrote:
> On 07/09/2014 03:58 AM, Daniel P. Berrange wrote:
> > On Tue, Jul 08, 2014 at 02:50:40PM -0700, Joe Gordon wrote:
>  But for right now, we should stop the bleeding, so that nova/libvirt
>  isn't blocking everyone else from merging code.
> >>>
> >>> Agreed, we should merge the hack and treat the bug as release blocker
> >>> to be resolved prior to Juno GA.
> >>>
> >>
> >>
> >> How can we prevent libvirt issues like this from landing in trunk in the
> >> first place? If we don't figure out a way to prevent this from landing in the
> >> first place, I fear we will keep repeating this same pattern of failure.
> 
> Right, this is where math is against us. If a race shows up 1% of the
> time, you need 66 runs to have a 50% chance of seeing it. I still haven't
> calibrated the bugs to an absolute scale, but I think based on what I
> remember this livesnapshot bug was probably a 3-4% bug (per Tempest
> run). So you'd need 50 Tempest runs to have an 80% chance of seeing it show up again.
> 
> (Absolute calibration of the bugs is on my todo list for Elastic
> Recheck, maybe it's time to put that in front of fixing the bugs)
> 
> > Realistically I don't think there was much/any chance of avoiding this
> > problem. Despite many days of work trying to reproduce it by multiple
> > people, no one has managed even 1 single failure outside of the gate.
> > Even inside the gate it is hard to reproduce. I still have absolutely
> > no clue what is failing after days of investigation & debugging with
> > all the tricks I can think of, because as I say, it works perfectly
> > every time I try it, except in the gate where it is impossible to
> > debug it.
> 
> Out of curiosity, is your reproduce using eventlet? My expectation is
> that eventlet's concurrency actually exacerbates this because when the
> snapshot starts we're now doing IO, and that means it's exactly the time
> that other compute work will be triggered.

I've tried both running the tempest suite itself, and also running
a dedicated stress test written against libvirt snapshot APIs in C.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Sean Dague
On 07/09/2014 03:58 AM, Daniel P. Berrange wrote:
> On Tue, Jul 08, 2014 at 02:50:40PM -0700, Joe Gordon wrote:
 But for right now, we should stop the bleeding, so that nova/libvirt
 isn't blocking everyone else from merging code.
>>>
>>> Agreed, we should merge the hack and treat the bug as release blocker
>>> to be resolved prior to Juno GA.
>>>
>>
>>
>> How can we prevent libvirt issues like this from landing in trunk in the
>> first place? If we don't figure out a way to prevent this from landing in the
>> first place, I fear we will keep repeating this same pattern of failure.

Right, this is where math is against us. If a race shows up 1% of the
time, you need 66 runs to have a 50% chance of seeing it. I still haven't
calibrated the bugs to an absolute scale, but I think based on what I
remember this livesnapshot bug was probably a 3-4% bug (per Tempest
run). So you'd need 50 Tempest runs to have an 80% chance of seeing it show up again.

(Absolute calibration of the bugs is on my todo list for Elastic
Recheck, maybe it's time to put that in front of fixing the bugs)
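Sean's figures can be sanity-checked against the geometric-trials formula P(seen in n runs) = 1 - (1 - p)^n; his 66 and 50 are rough, and the exact minimums come out slightly higher:

```python
import math

def runs_needed(p_fail, target):
    """Smallest n with 1 - (1 - p_fail)**n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_fail))

# A 1% race needs 69 runs for a 50% chance of at least one hit;
# a 3.5% race needs 46 runs for an 80% chance.
```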

> Realistically I don't think there was much/any chance of avoiding this
> problem. Despite many days of work trying to reproduce it by multiple
> people, no one has managed even 1 single failure outside of the gate.
> Even inside the gate it is hard to reproduce. I still have absolutely
> no clue what is failing after days of investigation & debugging with
> all the tricks I can think of, because as I say, it works perfectly
> every time I try it, except in the gate where it is impossible to
> debug it.

Out of curiosity, is your reproduce using eventlet? My expectation is
that eventlet's concurrency actually exacerbates this because when the
snapshot starts we're now doing IO, and that means it's exactly the time
that other compute work will be triggered.

-Sean

-- 
Sean Dague
http://dague.net





Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Kashyap Chamarthy
On Tue, Jul 08, 2014 at 06:21:31PM -0400, Sean Dague wrote:
> On 07/08/2014 06:12 PM, Joe Gordon wrote:
> > 
> > 
> > 
> > On Tue, Jul 8, 2014 at 2:56 PM, Michael Still  > > wrote:
> > 
> > The associated bug says this is probably a qemu bug, so I think we
> > should rephrase that to "we need to start thinking about how to make
> > sure upstream changes don't break nova".
> > 
> > 
> > Good point.
> >  
> > 
> > Would running devstack-tempest on the latest upstream release of ? help.
> > Not as a voting job but as a periodic (third party?) job, that we can
> > hopefully identify these issues early on. I think the big question here
> > is who would volunteer to help run a job like this.

Although, I'm familiar with Gate and infra in depth, I can help
volunteer debug such issues (as I try to test libvirt/QEMU upstreams and
from git quite frequently).

> The running of the job really isn't the issue.
> 
> It's the debugging of the jobs when they go wrong. Creating a new test
> job and getting it lit is really < 10% of the work, sifting through the
> fails and getting to the bottom of things is the hard and time consuming
> part.

Very true. Take the live snapshot issue[1], for instance -- I wish we could
get to the logical end of it (without letting it languish) and enable it
back in Nova soon. But as of now we're not able to pinpoint the root
cause, and it's no longer reproducible, going by Dan Berrange's detailed
analysis: after a week of tests outside the Gate, and tests with extra
debugging enabled[2] while the Gate was under light load, he didn't hit
the issue in either case despite multiple test runs.

Dan asked on #openstack-nova whether there might be some weird I/O issue in
HP cloud leading to these timeouts, but Sean said a timeout would only be
the issue if the test in question sometimes took 2 minutes and still
succeeded.

FWIW, in my local tests using Nova's exact invocation of the libvirt
blockRebase API to do parallel blockcopy operations followed by an
explicit abort (to gracefully end the block job), I couldn't
reproduce it across multiple runs either.

 
  [1] https://bugs.launchpad.net/nova/+bug/1334398 -- libvirt
  live_snapshot periodically explodes on libvirt 1.2.2 in the gate
  [2] https://review.openstack.org/#/c/103066/
 
> 
> The other option is to remove more concurrency from nova-compute. It's
> pretty clear that this problem only seems to happen when the
> snapshotting is going on at the same time guests are being created or
> destroyed (possibly also a second snapshot going on).
> 
> This is also why I find it unlikely to be a qemu bug, because that's not
> shared state between guests. If qemu just randomly wedges itself, that
> would be detectable much easier outside of the gate. And there have been
> attempts by danpb to sniff that out, and they haven't worked.
> 
>   -Sean
> 


-- 
/kashyap



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Daniel P. Berrange
On Wed, Jul 09, 2014 at 08:58:02AM +1000, Michael Still wrote:
> On Wed, Jul 9, 2014 at 8:21 AM, Sean Dague  wrote:
> 
> > This is also why I find it unlikely to be a qemu bug, because that's not
> > shared state between guests. If qemu just randomly wedges itself, that
> > would be detectable much easier outside of the gate. And there have been
> > attempts by danpb to sniff that out, and they haven't worked.
> 
> Do you think it would help if we added logging of what eventlet
> threads are running at the time of a failure like this? I can see that
> it might be a bit noisy, but it might also help nail down what this
> is an interaction between.

I don't think so. What I really need is more verbose libvirtd daemon
logs from the time when it fails. I've done a gross hack with a review
I have posted [1] which munges rootwrap to allow me to reconfigure
libvirtd and capture logs. Unfortunately I've been unable to get it
to fail on the snapshot bug since then - it is always hitting other
bugs so far :-(

Regards,
Daniel

[1] https://review.openstack.org/#/c/103066/
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-09 Thread Daniel P. Berrange
On Tue, Jul 08, 2014 at 02:50:40PM -0700, Joe Gordon wrote:
> > > But for right now, we should stop the bleeding, so that nova/libvirt
> > > isn't blocking everyone else from merging code.
> >
> > Agreed, we should merge the hack and treat the bug as release blocker
> > to be resolved prior to Juno GA.
> >
> 
> 
> How can we prevent libvirt issues like this from landing in trunk in the
> first place? If we don't figure out a way to prevent this from landing in the
> first place, I fear we will keep repeating this same pattern of failure.

Realistically I don't think there was much/any chance of avoiding this
problem. Despite many days of work trying to reproduce it by multiple
people, no one has managed even 1 single failure outside of the gate.
Even inside the gate it is hard to reproduce. I still have absolutely
no clue what is failing after days of investigation & debugging with
all the tricks I can think of, because as I say, it works perfectly
every time I try it, except in the gate where it is impossible to
debug it.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-08 Thread Michael Still
On Wed, Jul 9, 2014 at 8:21 AM, Sean Dague  wrote:

> This is also why I find it unlikely to be a qemu bug, because that's not
> shared state between guests. If qemu just randomly wedges itself, that
> would be detectable much easier outside of the gate. And there have been
> attempts by danpb to sniff that out, and they haven't worked.

Do you think it would help if we added logging of what eventlet
threads are running at the time of a failure like this? I can see that
it might be a bit noisy, but it might also help nail down what this
is an interaction between.

Michael

-- 
Rackspace Australia



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-08 Thread Sean Dague
On 07/08/2014 06:12 PM, Joe Gordon wrote:
> 
> 
> 
> On Tue, Jul 8, 2014 at 2:56 PM, Michael Still  > wrote:
> 
> The associated bug says this is probably a qemu bug, so I think we
> should rephrase that to "we need to start thinking about how to make
> sure upstream changes don't break nova".
> 
> 
> Good point.
>  
> 
> Would running devstack-tempest on the latest upstream release of ? help.
> Not as a voting job but as a periodic (third party?) job, that we can
> hopefully identify these issues early on. I think the big question here
> is who would volunteer to help run a job like this.

The running of the job really isn't the issue.

It's the debugging of the jobs when they go wrong. Creating a new test
job and getting it lit is really < 10% of the work, sifting through the
fails and getting to the bottom of things is the hard and time consuming
part.

The other option is to remove more concurrency from nova-compute. It's
pretty clear that this problem only seems to happen when the
snapshotting is going on at the same time guests are being created or
destroyed (possibly also a second snapshot going on).

This is also why I find it unlikely to be a qemu bug, because that's not
shared state between guests. If qemu just randomly wedges itself, that
would be detectable much easier outside of the gate. And there have been
attempts by danpb to sniff that out, and they haven't worked.

-Sean

-- 
Sean Dague
http://dague.net





Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-08 Thread Joe Gordon
On Tue, Jul 8, 2014 at 2:56 PM, Michael Still  wrote:

> The associated bug says this is probably a qemu bug, so I think we
> should rephrase that to "we need to start thinking about how to make
> sure upstream changes don't break nova".
>

Good point.


Would running devstack-tempest on the latest upstream release of ? help.
Not as a voting job but as a periodic (third party?) job, that we can
hopefully identify these issues early on. I think the big question here is
who would volunteer to help run a job like this.



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-08 Thread Boris Pavlovic
Joe,

What about running benchmarks (with a small load) for all major operations
(like snapshotting, booting/deleting, ...) on every patch in nova? It could
catch a lot of related issues.


Best regards,
Boris Pavlovic
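A sketch of what such a per-patch check could look like: time each major operation and flag regressions against a baseline. The operation names, baselines, and tolerance below are invented for illustration; real numbers would come from historical gate runs:

```python
import time

# Hypothetical per-operation baselines (seconds) and tolerance.
BASELINE = {"boot": 5.0, "delete": 2.0, "snapshot": 8.0}
TOLERANCE = 1.5  # flag anything more than 50% slower than baseline

def run_benchmark(name, op):
    """Time one operation; return (elapsed seconds, regressed?)."""
    start = time.monotonic()
    op()
    elapsed = time.monotonic() - start
    return elapsed, elapsed > BASELINE[name] * TOLERANCE
```

Even a coarse check like this would catch a snapshot path that suddenly starts hanging or retrying.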



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-08 Thread Michael Still
The associated bug says this is probably a qemu bug, so I think we
should rephrase that to "we need to start thinking about how to make
sure upstream changes don't break nova".

Michael

On Wed, Jul 9, 2014 at 7:50 AM, Joe Gordon  wrote:
> [snip]
>
> How can we prevent libvirt issues like this from landing in trunk in the
> first place? If we don't figure out a way to prevent this from landing the
> first place I fear we will keep repeating this same pattern of failure.



-- 
Rackspace Australia



Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-07-08 Thread Joe Gordon
On Thu, Jun 26, 2014 at 4:12 AM, Daniel P. Berrange 
wrote:

> [snip]
>
> Agreed, we should merge the hack and treat the bug as release blocker
> to be resolve prior to Juno GA.


How can we prevent libvirt issues like this from landing in trunk in the
first place? If we don't figure out a way to prevent this from landing in
the first place, I fear we will keep repeating this same pattern of failure.




Re: [openstack-dev] [nova] top gate bug is libvirt snapshot

2014-06-26 Thread Daniel P. Berrange
On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote:
> While the Trusty transition was mostly uneventful, it has exposed a
> particular issue in libvirt, which is generating ~ 25% failure rate now
> on most tempest jobs.
> 
> As can be seen here -
> https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297
> 
> 
> ... the libvirt live_snapshot code is something that our test pipeline
> has never tested before, because it wasn't a new enough libvirt for us
> to take that path.
> 
> Right now it's exploding, a lot -
> https://bugs.launchpad.net/nova/+bug/1334398
> 
> Snapshotting gets used in Tempest to create images for testing, so image
> setup tests are doing a decent number of snapshots. If I had to take a
> completely *wild guess*, it's that libvirt can't do 2 live_snapshots at
> the same time. It's probably something that most people haven't hit. The
> wild guess is based on other libvirt issues we've hit that other people
> haven't, and they are basically always a parallel ops triggered problem.
> 
> My 'stop the bleeding' suggested fix is this -
> https://review.openstack.org/#/c/102643/ which just effectively disables
> this code path for now. Then we can get some libvirt experts engaged to
> help figure out the right long term fix.

Yes, this is a sensible pragmatic workaround for the short term until
we diagnose the root cause & fix it.

> I think there are a couple:
> 
> 1) see if newer libvirt fixes this (1.2.5 just came out), and if so
> mandate at some known working version. This would actually take a bunch
> of work to be able to test a non packaged libvirt in our pipeline. We'd
> need volunteers for that.
> 
> 2) lock snapshot operations in nova-compute, so that we can only do 1 at
> a time. Hopefully it's just 2 snapshot operations that is the issue, not
> any other libvirt op during a snapshot, so serializing snapshot ops in
> n-compute could put the kid gloves on libvirt and make it not break
> here. This also needs some volunteers as we're going to be playing a
> game of progressive serialization until we get to a point where it looks
> like the failures go away.
> 
> 3) Roll back to precise. I put this idea here for completeness, but I
> think it's a terrible choice. This is one isolated, previously untested
> (by us), code path. We can't stay on libvirt 0.9.6 forever, so actually
> need to fix this for real (be it in nova's use of libvirt, or libvirt
> itself).

Yep, since we *never* tested this code path in the gate before, rolling
back to precise would not even really be a fix for the problem. It would
merely mean we're not testing the code path again, which is really akin
to sticking our head in the sand.

> But for right now, we should stop the bleeding, so that nova/libvirt
> isn't blocking everyone else from merging code.

Agreed, we should merge the hack and treat the bug as a release blocker
to be resolved prior to Juno GA.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
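
As a footnote on option 1 above: gating the live_snapshot path on a known-good version is a small check, since libvirt encodes its version as major * 1,000,000 + minor * 1,000 + micro. A sketch, with an illustrative minimum version:

```python
# libvirt reports versions as a single integer:
#   value = major * 1_000_000 + minor * 1_000 + micro
MIN_LIVE_SNAPSHOT_VERSION = (1, 2, 5)  # illustrative "known good" floor

def version_tuple(raw):
    """Decode libvirt's integer version encoding into (major, minor, micro)."""
    major, rest = divmod(raw, 1_000_000)
    minor, micro = divmod(rest, 1_000)
    return (major, minor, micro)

def live_snapshot_supported(raw_version):
    return version_tuple(raw_version) >= MIN_LIVE_SNAPSHOT_VERSION
```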
