Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate

2017-01-18 Thread Amrith Kumar
Jay,

 

This is the Trove commit: I85364c6530058e964a8eba7fb515d7deadfd5d72
<https://review.openstack.org/#q,I85364c6530058e964a8eba7fb515d7deadfd5d72,n,z>

 

-amrith

 

From: Jim Rollenhagen [mailto:j...@jimrollenhagen.com] 
Sent: Wednesday, January 18, 2017 7:57 AM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate

 

On Tue, Jan 17, 2017 at 6:41 PM, Jay Faulkner <j...@jvf.cc> wrote:

Hi all,

Back in late October, Vasyl added support for devstack to auto-detect kvm and,
when possible, use it to power Ironic gate jobs
(0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run times when
it works, but it has also caused failures. How many? It’s hard to quantify, as
the log messages that show the error don’t appear to be indexed by
Elasticsearch. It’s seen often enough that the issue has become a permanent
staple on our gate whiteboard, and it doesn’t appear to be decreasing in
frequency.

I pushed up a patch, https://review.openstack.org/#/c/421581, which keeps the 
auto detection behavior, but defaults devstack to use qemu emulation instead of 
kvm.

I have two questions:
1) Is there any way I’m not aware of that we could use to quantify the number
of failures this is causing? The key log message, "KVM: entry failed, hardware
error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
2) Are these failures avoidable or visible in any way?

IMO, if we can’t fix these failures, we have to make a change to avoid using
nested KVM altogether. Lower reliability for our jobs is not worth a small
decrease in job run time.

 

+2; especially this late in the cycle, we need our CI to be rock solid.

// jim


Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate

2017-01-18 Thread Jim Rollenhagen
On Tue, Jan 17, 2017 at 6:41 PM, Jay Faulkner  wrote:

> Hi all,
>
> Back in late October, Vasyl added support for devstack to auto-detect kvm
> and, when possible, use it to power Ironic gate jobs
> (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run times
> when it works, but it has also caused failures. How many? It’s hard to
> quantify, as the log messages that show the error don’t appear to be
> indexed by Elasticsearch. It’s seen often enough that the issue has become
> a permanent staple on our gate whiteboard, and it doesn’t appear to be
> decreasing in frequency.
>
> I pushed up a patch, https://review.openstack.org/#/c/421581, which keeps
> the auto detection behavior, but defaults devstack to use qemu emulation
> instead of kvm.
>
> I have two questions:
> 1) Is there any way I’m not aware of that we could use to quantify the
> number of failures this is causing? The key log message, "KVM: entry
> failed, hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> 2) Are these failures avoidable or visible in any way?
>
> IMO, if we can’t fix these failures, we have to make a change to avoid
> using nested KVM altogether. Lower reliability for our jobs is not worth a
> small decrease in job run time.
>

+2; especially this late in the cycle, we need our CI to be rock solid.

// jim


Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate

2017-01-17 Thread Amrith Kumar
Clark is right; Trove does detect and try to use kvm where possible. The
performance improvement has been well worth the change (IMHO).

-amrith

On Jan 17, 2017 6:53 PM, "Clark Boylan"  wrote:

> On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> > Hi all,
> >
> > Back in late October, Vasyl added support for devstack to auto-detect
> > kvm and, when possible, use it to power Ironic gate jobs
> > (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run
> > times when it works, but it has also caused failures. How many? It’s
> > hard to quantify, as the log messages that show the error don’t appear
> > to be indexed by Elasticsearch. It’s seen often enough that the issue
> > has become a permanent staple on our gate whiteboard, and it doesn’t
> > appear to be decreasing in frequency.
> >
> > I pushed up a patch, https://review.openstack.org/#/c/421581, which
> keeps
> > the auto detection behavior, but defaults devstack to use qemu emulation
> > instead of kvm.
> >
> > I have two questions:
> > 1) Is there any way I’m not aware of that we could use to quantify the
> > number of failures this is causing? The key log message, "KVM: entry
> > failed, hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> > 2) Are these failures avoidable or visible in any way?
> >
> > IMO, if we can’t fix these failures, we have to make a change to avoid
> > using nested KVM altogether. Lower reliability for our jobs is not worth
> > a small decrease in job run time.
>
> Part of the problem with nested KVM failures is that in many cases they
> destroy the test nodes in unrecoverable ways. In which case you don't
> get any logs, and zuul will restart the job for you. I think that
> graphite will capture this as a job that resulted in a Null/None status
> though (rather than SUCCESS/FAILURE).
>
> As for collecting info when you do get logs, we don't index the libvirt
> instance logs currently and I am not sure we want to. We already
> struggle to keep up with the existing set of logs when we are busy.
> Instead we might have job cleanup do a quick grep for known nested virt
> problem indicators and then log that to the console log which will be
> indexed.
>
> I think trove has also seen kernel panic type errors in syslog that we
> hypothesized were a result of using nested virt.
>
> The infra team explicitly attempts to force qemu instead of kvm on jobs
> using devstack-gate for these reasons. We know it doesn't work reliably
> and not all clouds support it. Unfortunately my understanding of the
> situation is that base hypervisor cpu and kernel, second level
> hypervisor kernel, and nested guest kernel all come into play here. And
> there can be nasty interactions between them causing a variety of
> problems.
>
> Put another way:
>
> 2017-01-14T00:42:00   if we're talking nested kvm
> 2017-01-14T00:42:04   it's kindof a nightmare
> from
> http://eavesdrop.openstack.org/irclogs/%23openstack-
> infra/%23openstack-infra.2017-01-14.log
>
> Clark
>


Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate

2017-01-17 Thread Clark Boylan
On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> Hi all,
> 
> Back in late October, Vasyl added support for devstack to auto-detect kvm
> and, when possible, use it to power Ironic gate jobs
> (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run times
> when it works, but it has also caused failures. How many? It’s hard to
> quantify, as the log messages that show the error don’t appear to be
> indexed by Elasticsearch. It’s seen often enough that the issue has become
> a permanent staple on our gate whiteboard, and it doesn’t appear to be
> decreasing in frequency.
> 
> I pushed up a patch, https://review.openstack.org/#/c/421581, which keeps
> the auto detection behavior, but defaults devstack to use qemu emulation
> instead of kvm.
> 
> I have two questions:
> 1) Is there any way I’m not aware of that we could use to quantify the
> number of failures this is causing? The key log message, "KVM: entry
> failed, hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> 2) Are these failures avoidable or visible in any way?
> 
> IMO, if we can’t fix these failures, we have to make a change to avoid
> using nested KVM altogether. Lower reliability for our jobs is not worth a
> small decrease in job run time.

Part of the problem with nested KVM failures is that in many cases they
destroy the test nodes in unrecoverable ways. In which case you don't
get any logs, and zuul will restart the job for you. I think that
graphite will capture this as a job that resulted in a Null/None status
though (rather than SUCCESS/FAILURE).

As for collecting info when you do get logs, we don't index the libvirt
instance logs currently and I am not sure we want to. We already
struggle to keep up with the existing set of logs when we are busy.
Instead we might have job cleanup do a quick grep for known nested virt
problem indicators and then log that to the console log which will be
indexed.
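
Roughly the kind of thing I mean, purely as an illustration (nothing like
this exists in devstack-gate today; the indicator string is the one Jay
mentioned, and the log path is an assumption about where libvirt writes the
domain logs on the test node):

  # Illustrative cleanup step only. Grep the libvirt domain logs for a
  # known nested-virt failure indicator and echo any hits so they end
  # up in the console log, which is indexed.
  INDICATOR="KVM: entry failed, hardware error 0x0"
  hits=$(grep -h "$INDICATOR" /var/log/libvirt/qemu/node-*.log 2>/dev/null | wc -l)
  if [ "$hits" -gt 0 ]; then
      echo "NESTED_VIRT_INDICATOR: '$INDICATOR' seen $hits time(s)"
  fi

More indicator strings could be added to that as we identify them.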

I think trove has also seen kernel panic type errors in syslog that we
hypothesized were a result of using nested virt.

The infra team explicitly attempts to force qemu instead of kvm on jobs
using devstack-gate for these reasons. We know it doesn't work reliably
and not all clouds support it. Unfortunately my understanding of the
situation is that base hypervisor cpu and kernel, second level
hypervisor kernel, and nested guest kernel all come into play here. And
there can be nasty interactions between them causing a variety of
problems.
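
For reference, in a plain devstack-based job forcing emulation looks roughly
like this (illustrative only; devstack-gate wires this up through its own
settings rather than a hand-written local.conf):

  # local.conf snippet forcing software emulation for the libvirt
  # guests instead of relying on nested kvm.
  [[local|localrc]]
  LIBVIRT_TYPE=qemu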

Put another way:

2017-01-14T00:42:00   if we're talking nested kvm
2017-01-14T00:42:04   it's kindof a nightmare
from
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-01-14.log

Clark



[openstack-dev] [ironic] [infra] Nested KVM + the gate

2017-01-17 Thread Jay Faulkner
Hi all,

Back in late October, Vasyl added support for devstack to auto-detect kvm and,
when possible, use it to power Ironic gate jobs
(0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run times when
it works, but it has also caused failures. How many? It’s hard to quantify, as
the log messages that show the error don’t appear to be indexed by
Elasticsearch. It’s seen often enough that the issue has become a permanent
staple on our gate whiteboard, and it doesn’t appear to be decreasing in
frequency.

I pushed up a patch, https://review.openstack.org/#/c/421581, which keeps the 
auto detection behavior, but defaults devstack to use qemu emulation instead of 
kvm.
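
To illustrate what I mean by keeping the detection but changing the default,
the logic is roughly along these lines (a simplified sketch only, not the
actual devstack code; the opt-in variable name is made up):

  # Simplified sketch: default to plain qemu emulation, and only use
  # nested kvm when the host exposes it AND the job explicitly opts in.
  # USE_NESTED_KVM is a made-up variable name for illustration.
  VIRT_TYPE="qemu"
  if [[ -c /dev/kvm && -w /dev/kvm ]] && grep -qE 'vmx|svm' /proc/cpuinfo; then
      if [[ "${USE_NESTED_KVM:-False}" == "True" ]]; then
          VIRT_TYPE="kvm"
      fi
  fi
  echo "Selected libvirt domain type: $VIRT_TYPE"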

I have two questions:
1) Is there any way I’m not aware of that we could use to quantify the number
of failures this is causing? The key log message, "KVM: entry failed, hardware
error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz (see the rough grep
sketch below).
2) Are these failures avoidable or visible in any way?
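
For question 1, the closest thing I have today is to pull down the logs for a
set of jobs and grep them locally. A rough sketch of what I mean, assuming
each job’s logs have been downloaded into their own directory under a common
parent (the paths and names here are illustrative, not anything that exists
in our tooling):

  #!/bin/bash
  # Rough sketch: count how many downloaded job log trees contain the
  # nested KVM hardware error. Each job's logs are assumed to live in
  # their own directory under $LOGDIR, mirroring the published layout.
  LOGDIR=${1:-./job-logs}
  total=0
  failed=0
  for job in "$LOGDIR"/*/; do
      total=$((total + 1))
      # The libvirt domain logs are gzipped in the published log tree.
      if zgrep -l "KVM: entry failed, hardware error 0x0" \
            "$job"logs/libvirt/qemu/node-*.txt.gz > /dev/null 2>&1; then
          failed=$((failed + 1))
      fi
  done
  echo "$failed of $total jobs hit the nested KVM hardware error"

Obviously that only covers jobs that got far enough to upload logs at all.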

IMO, if we can’t fix these failures, we have to make a change to avoid using
nested KVM altogether. Lower reliability for our jobs is not worth a small
decrease in job run time.

Thanks,
Jay Faulkner