Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate
Jay,

This is the Trove commit …
<https://review.openstack.org/#q,I85364c6530058e964a8eba7fb515d7deadfd5d72,n,z>
I85364c6530058e964a8eba7fb515d7deadfd5d72

-amrith

From: Jim Rollenhagen [mailto:j...@jimrollenhagen.com]
Sent: Wednesday, January 18, 2017 7:57 AM
To: OpenStack Development Mailing List (not for usage questions) <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate

On Tue, Jan 17, 2017 at 6:41 PM, Jay Faulkner <j...@jvf.cc> wrote:
[snip]

+2, especially this late in the cycle, we need our CI to be rock solid.

// jim
Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate
On Tue, Jan 17, 2017 at 6:41 PM, Jay Faulkner <j...@jvf.cc> wrote:
> [snip]
>
> IMO, if we can't fix these failures, we have to make a change to avoid
> using nested KVM altogether. Lower reliability for our jobs is not worth
> a small decrease in job run time.

+2, especially this late in the cycle, we need our CI to be rock solid.

// jim
Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate
Clark is right, Trove does detect and try to use KVM where possible. The
performance has been well worth the change (IMHO).

-amrith

On Jan 17, 2017 6:53 PM, "Clark Boylan" wrote:
> [snip]
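For context, the auto-detection both Trove and devstack perform comes down
to checking whether the build node actually exposes KVM before choosing a
hypervisor type. Below is a minimal illustrative sketch in Python, not the
projects' actual shell logic; the /dev/kvm permission check is an
assumption about how detection is done, and the variable is named after
devstack's LIBVIRT_TYPE setting.

    import os

    def kvm_available():
        """Best-effort check for usable KVM on this node.

        /dev/kvm exists only when the kvm kernel module is loaded; it must
        also be readable and writable by the user running the hypervisor.
        """
        return os.access("/dev/kvm", os.R_OK | os.W_OK)

    # Fall back to plain qemu emulation when (nested) KVM is unavailable,
    # mirroring the auto-detect-with-fallback behavior described in this
    # thread.
    LIBVIRT_TYPE = "kvm" if kvm_available() else "qemu"
    print("libvirt type: %s" % LIBVIRT_TYPE)

Jay's patch above keeps exactly this detection but flips the default branch,
so qemu is chosen even when the check would permit KVM.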
Re: [openstack-dev] [ironic] [infra] Nested KVM + the gate
On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> [snip]
>
> I have two questions:
> 1) Is there any way I'm not aware of that we can quantify the number of
> failures this is causing? The key log message, "KVM: entry failed,
> hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> 2) Are these failures avoidable or visible in any way?

Part of the problem with nested KVM failures is that in many cases they
destroy the test nodes in unrecoverable ways, in which case you don't get
any logs and zuul will restart the job for you. I think graphite will
capture this as a job that resulted in a Null/None status, though (rather
than SUCCESS/FAILURE).

As for collecting info when you do get logs, we don't currently index the
libvirt instance logs, and I am not sure we want to; we already struggle to
keep up with the existing set of logs when we are busy. Instead, we might
have job cleanup do a quick grep for known nested-virt problem indicators
and log that to the console log, which is indexed.

I think Trove has also seen kernel-panic-type errors in syslog that we
hypothesized were a result of using nested virt.

The infra team explicitly attempts to force qemu instead of kvm on jobs
using devstack-gate for these reasons: we know it doesn't work reliably,
and not all clouds support it. Unfortunately, my understanding of the
situation is that the base hypervisor CPU and kernel, the second-level
hypervisor kernel, and the nested guest kernel all come into play here,
and there can be nasty interactions between them causing a variety of
problems.

Put another way:

    2017-01-14T00:42:00 if we're talking nested kvm
    2017-01-14T00:42:04 it's kindof a nightmare

from http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-01-14.log

Clark
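The cleanup-time scan Clark suggests could look roughly like the following:
walk the libvirt instance logs, search for known nested-virt indicators,
and print hits to stdout so they land in the console log that is indexed.
This is a hedged sketch in Python rather than the actual devstack-gate
cleanup hook; the node-*.txt.gz path and the "KVM: entry failed" string
come from Jay's message, while the second indicator and the output format
are illustrative assumptions.

    import glob
    import gzip

    # Known nested-virt failure indicators. The first string comes from
    # Jay's message; any further patterns are hypothetical examples of
    # what a job might also track (e.g. guest kernel panics).
    INDICATORS = [
        "KVM: entry failed, hardware error 0x0",
        "Kernel panic - not syncing",
    ]

    def scan_libvirt_logs(pattern="logs/libvirt/qemu/node-*.txt.gz"):
        """Print indicator hits to stdout so they reach the console log."""
        for path in sorted(glob.glob(pattern)):
            with gzip.open(path, "rt", errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if any(ind in line for ind in INDICATORS):
                        print("NESTED-VIRT-INDICATOR %s:%d: %s"
                              % (path, lineno, line.rstrip()))

    if __name__ == "__main__":
        scan_libvirt_logs()

Printing a fixed marker string ("NESTED-VIRT-INDICATOR") is the point of the
design: once it appears in the console log, the existing indexing picks it
up and the failures become queryable without indexing the libvirt logs
themselves.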
[openstack-dev] [ironic] [infra] Nested KVM + the gate
Hi all,

Back in late October, Vasyl wrote support for devstack to auto-detect and,
when possible, use KVM to power Ironic gate jobs
(0036d83b330d98e64d656b156001dd2209ab1903). This has lowered some job times
when it works, but it has also caused failures. How many? That's hard to
quantify, as the log messages that show the error don't appear to be
indexed by Elasticsearch. It's seen often enough that the issue has become
a permanent staple on our gate whiteboard, and it doesn't appear to be
decreasing in frequency.

I pushed up a patch, https://review.openstack.org/#/c/421581, which keeps
the auto-detection behavior but defaults devstack to qemu emulation instead
of KVM.

I have two questions:
1) Is there any way I'm not aware of that we can quantify the number of
failures this is causing? The key log message, "KVM: entry failed, hardware
error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
2) Are these failures avoidable or visible in any way?

IMO, if we can't fix these failures, we have to make a change to avoid
using nested KVM altogether. Lower reliability for our jobs is not worth a
small decrease in job run time.

Thanks,
Jay Faulkner
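On question 1, given a local mirror of job logs one could at least tally
affected runs by grepping for the key message. A rough sketch, assuming a
hypothetical one-directory-per-downloaded-run layout; only the
node-*.txt.gz path and the error string come from the message above.

    import glob
    import gzip
    from collections import Counter

    KEY_MESSAGE = "KVM: entry failed, hardware error 0x0"

    def count_affected_runs(root="."):
        """Count job runs whose libvirt node logs contain the KVM error.

        Assumes one directory per downloaded job run under `root`; the
        layout is hypothetical, while the node-*.txt.gz path and the
        message itself come from the thread.
        """
        affected = Counter()
        for path in glob.glob(root + "/*/logs/libvirt/qemu/node-*.txt.gz"):
            run = path.split("/")[1]
            with gzip.open(path, "rt", errors="replace") as fh:
                if any(KEY_MESSAGE in line for line in fh):
                    affected[run] += 1
        return affected

    if __name__ == "__main__":
        runs = count_affected_runs()
        print("%d runs with nested-KVM entry failures" % len(runs))

As Clark notes above, this undercounts: runs where nested KVM destroyed the
node unrecoverably leave no logs to grep at all.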