Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown
On Mon, Aug 6, 2018 at 10:47 AM Daniel P. Berrangé wrote:
> On Mon, Aug 06, 2018 at 07:20:10AM +0200, Christian Ehrhardt wrote:
> > In that case I wonder what the libvirt community thinks of the proposed
> > general "Pid is gone means we can assume it is dead" approach?
>
> The key thing with the shutdown process is that we use the disappearance
> of the PID as the flag to indicate that it is safe to release any
> resources that the PID was using, e.g. the hostdevs are now available for
> another guest to use.
>
> I'd be concerned that if we used /proc/$PID going away as the flag, then
> we would be releasing the hostdevs for reuse before the kernel has cleaned
> them up. In the best case this would result in a second guest failing to
> start because the device was still in use; in the worst case we could
> crash the entire host (though I'd be hopeful vfio prevents that).

Yeah, I agree that resources still being in use could lead to bad and
rather hard to debug problems.

> > An alternative would be to understand on the kernel side why the PID is
> > gone "too early" and fix that so it stays until fully cleaned up.
> > But even then on the libvirt side we would need the extended timeout
> > values.
>
> Yeah, looks like extended timeouts are unavoidable. The only real
> optimization would be to pass an explicit timeout to the kill method,
> increasing it by 2 seconds for each hostdev that is assigned. That way
> we'll scale the timeout up as we need, so we don't have to predict the
> worst case number of assigned devices.

I'd do both:
- extend the KILL path (if force is set) timeout in general, to give bad
  systems a chance
- extend the maximum by 2s per hostdev

I'll submit that in a few minutes as a reply.
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown
On Mon, Aug 06, 2018 at 07:20:10AM +0200, Christian Ehrhardt wrote:
> In that case I wonder what the libvirt community thinks of the proposed
> general "Pid is gone means we can assume it is dead" approach?

The key thing with the shutdown process is that we use the disappearance of
the PID as the flag to indicate that it is safe to release any resources
that the PID was using, e.g. the hostdevs are now available for another
guest to use.

I'd be concerned that if we used /proc/$PID going away as the flag, then we
would be releasing the hostdevs for reuse before the kernel has cleaned
them up. In the best case this would result in a second guest failing to
start because the device was still in use; in the worst case we could crash
the entire host (though I'd be hopeful vfio prevents that).

> An alternative would be to understand on the kernel side why the PID is
> gone "too early" and fix that so it stays until fully cleaned up.
> But even then on the libvirt side we would need the extended timeout
> values.

Yeah, looks like extended timeouts are unavoidable. The only real
optimization would be to pass an explicit timeout to the kill method,
increasing it by 2 seconds for each hostdev that is assigned. That way
we'll scale the timeout up as we need, so we don't have to predict the
worst case number of assigned devices.

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown
On Fri, Aug 3, 2018 at 6:39 PM Alex Williamson wrote:
> On Fri, 3 Aug 2018 08:29:39 +0200 Christian Ehrhardt wrote:
> > Hi,
> > I was recently looking into a case which essentially looked like this:
> > 1. virsh shutdown guest
> > 2. after <1 second the qemu process was gone from /proc/
> > 3. but libvirt spun in virProcessKillPainfully because the process
> >    was still reachable via signals
> > 4. virProcessKillPainfully eventually fails after 15 seconds and the
> >    guest stays in "in shutdown" state forever
> >
> > This is not one of the common cases I've found for
> > virProcessKillPainfully to break:
> > - bad I/O, e.g. NFS gets qemu stuck
> > - CPU overload stalls things to death
> > - qemu not being reaped (by init)
> > All of the above would have the process still available in /proc/ as a
> > zombie or in uninterruptible sleep, but that is not true in my case.
> >
> > It turned out that the case was dependent on the number of hostdev
> > resources passed to the guest. Debugging showed that with 8, and even
> > more so with 16, GPUs passed it took ~18 seconds from SIGTERM to "no
> > longer reachable with signal 0".
> > I haven't conducted many more tests and stayed on the 16 GPU case, but
> > I'm rather sure more devices might make it take even longer.
>
> If it's dependent on device assignment, then it's probably either
> related to unmapping DMA or resetting devices. The former should scale
> with the size of the VM, not the number of devices attached. The
> latter could increase with each device. Typically with physical GPUs
> we don't have a function level reset mechanism, so we need to do a
> secondary bus reset on the upstream bridge to reset the device; this
> requires a 1s delay to let the bus settle after reset. So if we're
> gated by these sorts of resets, your scaling doesn't sound
> unreasonable,

So the scaling makes sense: ~16 * 1s plus a tiny bit of default cleanup
time matches the ~18 seconds I see. Thanks for that explanation!
> though I'm not sure how these factor into the process
> state you're seeing.

Yeah, I'd have thought to still see it in some form, like a zombie or such.
But it really is gone.

> I'd also be surprised if you have a system that
> can host 16 physical GPUs, so maybe this is a vGPU example?

16 physical GPUs it is :-)
See https://www.nvidia.com/en-us/data-center/dgx-2/

> Any mdev
> device should provide a reset callback for roughly the equivalent of a
> function level reset. Implementation of such a reset would be vendor
> specific.

Since it is no classic mdev [1][2], but just 16 physical GPUs, the callback
suggestion would not make sense here, right?

In that case I wonder what the libvirt community thinks of the proposed
general "Pid is gone means we can assume it is dead" approach?

An alternative would be to understand on the kernel side why the PID is
gone "too early" and fix that so it stays until fully cleaned up.
But even then on the libvirt side we would need the extended timeout
values.

[1]: https://libvirt.org/drvnodedev.html#MDEV
[2]: https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt

> Thanks,
> Alex

--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown
Hi,

This series was run against the 'syntax-check' test by patchew.org, which
failed; please find the details below:

Type: series
Message-id: 20180803062941.31641-1-christian.ehrha...@canonical.com
Subject: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

=== TEST SCRIPT BEGIN ===
#!/bin/bash
# Testing script will be invoked under the git checkout with
# HEAD pointing to a commit that has the patches applied on top of "base"
# branch
time bash -c './autogen.sh && make syntax-check'
=== TEST SCRIPT END ===

Updating bcb55ab053bc79561b55d0394490f4b64e0f2d01
Switched to a new branch 'test'
c874968466 process: accept the lack of /proc/ as valid process removal
4c5a51d196 process: wait longer 5->30s on hard shutdown

=== OUTPUT BEGIN ===
Updating submodules...
Submodule 'gnulib' (https://git.savannah.gnu.org/git/gnulib.git/) registered for path '.gnulib'
Submodule 'keycodemapdb' (https://gitlab.com/keycodemap/keycodemapdb.git) registered for path 'src/keycodemapdb'
Cloning into '/var/tmp/patchew-tester-tmp-tugv8oz4/src/.gnulib'...
Cloning into '/var/tmp/patchew-tester-tmp-tugv8oz4/src/src/keycodemapdb'...
Submodule path '.gnulib': checked out '68df637b5f1b5c10370f6981d2a43a5cf74368df'
Submodule path 'src/keycodemapdb': checked out '16e5b0787687d8904dad2c026107409eb9bfcb95'
Running bootstrap...
./bootstrap: Bootstrapping from checked-out libvirt sources...
./bootstrap: consider installing git-merge-changelog from gnulib
./bootstrap: getting gnulib files...
running: libtoolize --install --copy
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'build-aux'.
libtoolize: copying file 'build-aux/config.guess'
libtoolize: copying file 'build-aux/config.sub'
libtoolize: copying file 'build-aux/install-sh'
libtoolize: copying file 'build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
./bootstrap: .gnulib/gnulib-tool --no-changelog --aux-dir=build-aux --doc-base=doc --lib=libgnu --m4-base=m4/ --source-base=gnulib/lib/ --tests-base=gnulib/tests --local-dir=gnulib/local --lgpl=2 --with-tests --makefile-name=gnulib.mk --avoid=pt_chown --avoid=lock-tests --libtool --import ...
Module list with included dependencies (indented):
[several hundred gnulib module names elided; the archived log is truncated at this point]
Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown
On Fri, 3 Aug 2018 08:29:39 +0200 Christian Ehrhardt wrote:
> Hi,
> I was recently looking into a case which essentially looked like this:
> 1. virsh shutdown guest
> 2. after <1 second the qemu process was gone from /proc/
> 3. but libvirt spun in virProcessKillPainfully because the process
>    was still reachable via signals
> 4. virProcessKillPainfully eventually fails after 15 seconds and the
>    guest stays in "in shutdown" state forever
>
> This is not one of the common cases I've found for
> virProcessKillPainfully to break:
> - bad I/O, e.g. NFS gets qemu stuck
> - CPU overload stalls things to death
> - qemu not being reaped (by init)
> All of the above would have the process still available in /proc/ as a
> zombie or in uninterruptible sleep, but that is not true in my case.
>
> It turned out that the case was dependent on the number of hostdev
> resources passed to the guest. Debugging showed that with 8, and even
> more so with 16, GPUs passed it took ~18 seconds from SIGTERM to "no
> longer reachable with signal 0".
> I haven't conducted many more tests and stayed on the 16 GPU case, but
> I'm rather sure more devices might make it take even longer.

If it's dependent on device assignment, then it's probably either related
to unmapping DMA or resetting devices. The former should scale with the
size of the VM, not the number of devices attached. The latter could
increase with each device. Typically with physical GPUs we don't have a
function level reset mechanism, so we need to do a secondary bus reset on
the upstream bridge to reset the device; this requires a 1s delay to let
the bus settle after reset. So if we're gated by these sorts of resets,
your scaling doesn't sound unreasonable, though I'm not sure how these
factor into the process state you're seeing. I'd also be surprised if you
have a system that can host 16 physical GPUs, so maybe this is a vGPU
example? Any mdev device should provide a reset callback for roughly the
equivalent of a function level reset. Implementation of such a reset would
be vendor specific.

Thanks,
Alex