[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
Hello paelzer,

thanks a lot. The PPA link is
https://launchpad.net/~xtrusia/+archive/ubuntu/sf161119

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1746630

Title:
  virsh api is stuck when vm is down with NFS borken

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1746630/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
Hi Seyeong,
thanks for the analysis and the backport.

The patches look good to me:
- two structural changes with no net effect (to let the backport apply);
  there could have been a backport without those, but I agree that this
  looks clearer
- the actual fix seems OK; skipping if no data is available sounds right
- some cleanups in the patch headers are required, but I can do that on
  upload for you
- these changes have been upstream for a long time, and 4.0 still
  implements this the way the change does

Thanks a lot. Two things:
1. Do you have a PPA with that already that I should run checks against
   (otherwise I'll open one up when really prepping the SRU)?
2. There is a security update in flight that we have to wait for, so I'm
   postponing this fix until that is complete.
[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
** Also affects: libvirt (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Changed in: libvirt (Ubuntu)
       Status: New => Fix Released

** Changed in: libvirt (Ubuntu Xenial)
       Status: New => Triaged
[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
The attachment "lp1746630_xenial.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff.

If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

** Tags added: patch
[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
Launchpad has imported 15 comments from the remote bug at https://bugzilla.redhat.com/show_bug.cgi?id=1337073.

If you reply to an imported comment from within Launchpad, your comment will be sent to the remote bug automatically. Read more about Launchpad's inter-bugtracker facilities at https://help.launchpad.net/InterBugTracking.

On 2016-05-18T08:46:03+00:00 Francesco wrote:

Description of problem:
Short summary: if a QEMU/KVM VM hangs because of unresponsive storage (NFS server unreachable), after a random amount of time virDomainGetControlInfo() stops responding.

Packages:
qemu-kvm-tools-rhev-2.3.0-31.el7_2.14.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
qemu-img-rhev-2.3.0-31.el7_2.14.x86_64
qemu-kvm-common-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-daemon-driver-storage-1.3.4-1.el7.x86_64
libvirt-daemon-driver-interface-1.3.4-1.el7.x86_64
libvirt-debuginfo-1.3.4-1.el7.x86_64
libvirt-daemon-kvm-1.3.4-1.el7.x86_64
libvirt-daemon-config-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-config-network-1.3.4-1.el7.x86_64
libvirt-client-1.3.4-1.el7.x86_64
libvirt-daemon-driver-lxc-1.3.4-1.el7.x86_64
libvirt-lock-sanlock-1.3.4-1.el7.x86_64
libvirt-daemon-1.3.4-1.el7.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
libvirt-devel-1.3.4-1.el7.x86_64
libvirt-daemon-driver-secret-1.3.4-1.el7.x86_64
libvirt-daemon-lxc-1.3.4-1.el7.x86_64
libvirt-nss-1.3.4-1.el7.x86_64
libvirt-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nodedev-1.3.4-1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-driver-network-1.3.4-1.el7.x86_64
libvirt-login-shell-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.3.4-1.el7.x86_64
libvirt-docs-1.3.4-1.el7.x86_64

libvirt recompiled from git, qemu from RHEL

Context:
Vdsm is the node management system of oVirt (http://www.ovirt.org) and uses libvirt to run and monitor VMs. We use QEMU/KVM VMs over shared storage.
Among the calls Vdsm periodically runs to monitor the VM state are:

virConnectGetAllDomainStats
virDomainListGetStats
virDomainGetBlockIoTune
virDomainBlockJobInfo
virDomainGetBlockInfo
virDomainGetVcpus

We know from experience that storage may become unresponsive or unreachable, so QEMU monitor calls can hang, which in turn causes libvirt calls to hang.

Vdsm does the monitoring using a thread pool. Should one of the worker threads become unresponsive, it is replaced. To avoid stalling libvirt and leaking threads indefinitely, Vdsm has one additional protection layer: it inspects the libvirt state before issuing calls that go down to QEMU, using code like

    def isDomainReadyForCommands(self):
        try:
            state, details, stateTime = self._dom.controlInfo()
        except virdomain.NotConnectedError:
            # this method may be called asynchronously by periodic
            # operations. Thus, we must use a try/except block
            # to avoid racy checks.
            return False
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return False
            else:
                raise
        else:
            return state == libvirt.VIR_DOMAIN_CONTROL_OK

Vdsm actually issues the potentially hanging call if and only if the call above returns True (hence the virDomainGetControlInfo() state is VIR_DOMAIN_CONTROL_OK).

When the NFS server is unreachable, the protection layer in Vdsm triggers, and Vdsm avoids sending libvirt calls.
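Since the report is about the readiness probe itself blocking, a caller can additionally bound how long it waits for that probe. The following is only a minimal sketch of that idea and is not taken from Vdsm: `call_with_timeout` and `is_domain_ready` are hypothetical helper names, `control_info` stands for any callable that returns the `(state, details, stateTime)` tuple (for example a bound `controlInfo` method), and `CONTROL_OK` is assumed to mirror the value of `libvirt.VIR_DOMAIN_CONTROL_OK`.

```python
import threading

# Assumption: mirrors libvirt.VIR_DOMAIN_CONTROL_OK, which is 0 in libvirt.
CONTROL_OK = 0


def call_with_timeout(func, timeout, default=None):
    """Run func() in a daemon thread and wait at most `timeout` seconds.

    Returns func()'s result, re-raises its exception, or returns `default`
    if the call has not finished in time. A call that never returns leaves
    its thread behind; this sketch does not try to reclaim it.
    """
    result = {}

    def worker():
        try:
            result["value"] = func()
        except Exception as exc:  # hand the error over to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return default
    if "error" in result:
        raise result["error"]
    return result["value"]


def is_domain_ready(control_info, timeout=5.0):
    """Treat a probe that times out the same way as a non-OK control state."""
    info = call_with_timeout(control_info, timeout)
    if info is None:
        return False
    state = info[0]
    return state == CONTROL_OK
```

This only keeps the caller responsive. The blocked thread, and the libvirt worker serving it, stay stuck until the underlying call returns, so it is a workaround on the client side and not a fix for the hang described here.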
After a while, however, we see virDomainGetControlInfo() calls not responding anymore, like this (full log attached):

2016-05-18 06:01:45.920+: 3069: debug : virThreadJobSet:96 : Thread 3069 (virNetServerHandleJob) is now running job remoteDispatchDomainGetVcpus
2016-05-18 06:01:45.920+: 3069: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a70004070 classname=virDomain
2016-05-18 06:01:45.920+: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+: 3069: debug : virDomainGetVcpus:7733 : dom=0x7f5a70004070, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a70002140, maxinfo=2, cpumaps=0x7f5a70002200, maplen=1
2016-05-18 06:01:45.920+: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930f6f00(name=stack) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930ebdf0(name=none) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.920+: 3069: debug : qem
[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
** Bug watch added: Red Hat Bugzilla #1337073
   https://bugzilla.redhat.com/show_bug.cgi?id=1337073

** Also affects: libvirt via
   https://bugzilla.redhat.com/show_bug.cgi?id=1337073
   Importance: Unknown
       Status: Unknown

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Description changed:

  [Impact]

  virsh commands hang if there is a broken VM on broken NFS

- This affects Xenial
+ This affects Xenial, UCA-Mitaka

  [Test Case]

  1. deploy a VM with NFS storage (running)
  2. block NFS via iptables (on the host machine)
     - iptables -A OUTPUT -d NFS_SERVER_IP -p tcp --dport 2049 -j DROP
  3. virsh blkdeviotune generic hda => hang
  4. virsh domstats => hang
  5. virsh list => hang

  [Regression]

  After the patch, domstats and list return after a short timeout.
  libvirt-bin needs to be restarted, so with many VMs there will be a
  short impact while it restarts.

  [Others]

  This bug is fixed upstream; see the Red Hat bug report [1], the mailing
  list thread [2], and git commits [3][4][5]
  and it was merged upstream in 1.3.5

  https://libvirt.org/git/?p=libvirt.git;a=blobdiff;f=docs/news.html.in;h=1ad8337f5f8443b5ac76450dc3370f95c51503fd;hp=d035f6833fb5eaaced8f5a7010872f3e61b6955b;hb=732bc70dcc3e2d1fe0baa640712efb99e273;hpb=d57e73d06fe5901ac4ab9c025b3531251292b509
-
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1337073
  [2] https://www.redhat.com/archives/libvir-list/2016-May/msg01353.html
  [3] https://libvirt.org/git/?p=libvirt.git;a=commit;h=5d2b0e6f12b4e57d75ed1047ab1c36443b7a54b3
  [4] https://libvirt.org/git/?p=libvirt.git;a=commit;h=3aa5d51a9530a8737ca584b393c29297dd9bbc37
  [5] https://libvirt.org/git/?p=libvirt.git;a=commit;h=71d2c172edb997bae1e883b2e1bafa97d9f953a1

** Patch added: "lp1746630_mitaka.debdiff"
   https://bugs.launchpad.net/cloud-archive/+bug/1746630/+attachment/5046659/+files/lp1746630_mitaka.debdiff
[Bug 1746630] Re: virsh api is stuck when vm is down with NFS borken
** Description changed:

  [Impact]

  virsh commands hang if there is a broken VM on broken NFS

  This affects Xenial

  [Test Case]

  1. deploy a VM with NFS storage (running)
  2. block NFS via iptables (on the host machine)
     - iptables -A OUTPUT -d NFS_SERVER_IP -p tcp --dport 2049 -j DROP
  3. virsh blkdeviotune generic hda => hang
  4. virsh domstats => hang
  5. virsh list => hang

  [Regression]

  After the patch, domstats and list return after a short timeout.
  libvirt-bin needs to be restarted, so with many VMs there will be a
  short impact while it restarts.

  [Others]

  This bug is fixed upstream; see the Red Hat bug report [1], the mailing
  list thread [2], and git commits [3][4][5]
+ and it was merged upstream in 1.3.5
+
+ https://libvirt.org/git/?p=libvirt.git;a=blobdiff;f=docs/news.html.in;h=1ad8337f5f8443b5ac76450dc3370f95c51503fd;hp=d035f6833fb5eaaced8f5a7010872f3e61b6955b;hb=732bc70dcc3e2d1fe0baa640712efb99e273;hpb=d57e73d06fe5901ac4ab9c025b3531251292b509
+
+
  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1337073
  [2] https://www.redhat.com/archives/libvir-list/2016-May/msg01353.html
  [3] https://libvirt.org/git/?p=libvirt.git;a=commit;h=5d2b0e6f12b4e57d75ed1047ab1c36443b7a54b3
  [4] https://libvirt.org/git/?p=libvirt.git;a=commit;h=3aa5d51a9530a8737ca584b393c29297dd9bbc37
  [5] https://libvirt.org/git/?p=libvirt.git;a=commit;h=71d2c172edb997bae1e883b2e1bafa97d9f953a1

** Patch added: "lp1746630_xenial.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1746630/+attachment/5046651/+files/lp1746630_xenial.debdiff

** Tags added: sts-sru-needed
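For anyone verifying the test case, steps 3 to 5 can be checked without tying up a terminal by running each virsh command under a timeout. This is a rough sketch and not part of the SRU: `command_hangs`, `report`, and the 10 second default are made up for illustration, while the command lines are the ones from the test case (domain `generic`, disk `hda`). It assumes the NFS block from step 2 is already in place.

```python
import subprocess

# Commands from steps 3-5 of the test case above.
VIRSH_COMMANDS = [
    ["virsh", "blkdeviotune", "generic", "hda"],
    ["virsh", "domstats"],
    ["virsh", "list"],
]


def command_hangs(argv, timeout):
    """Return True if argv has not exited within `timeout` seconds.

    subprocess.run() kills the child once the timeout expires, so a stuck
    client process does not linger after the check.
    """
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def report(commands, timeout=10.0):
    """Map each command line to True (timed out) or False (returned)."""
    return {" ".join(argv): command_hangs(argv, timeout) for argv in commands}
```

Calling `report(VIRSH_COMMANDS)` on an affected host would be expected to show the commands timing out; with the patched package, `virsh domstats` and `virsh list` should come back, going by the [Regression] notes. The timeout only stops the virsh client, not whatever libvirtd itself is blocked on.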