Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

2018-08-06 Thread Christian Ehrhardt
On Mon, Aug 6, 2018 at 10:47 AM Daniel P. Berrangé wrote:

> On Mon, Aug 06, 2018 at 07:20:10AM +0200, Christian Ehrhardt wrote:
> > In that case I wonder what the libvirt community thinks of the proposed
> > general "Pid is gone means we can assume it is dead" approach?
>
> The key thing with the shutdown process is that we use the disappearance of
> the PID as the flag to indicate that it is safe to release any resources
> that the PID was using, e.g. the hostdevs are now available for another
> guest to use.
>
> I'd be concerned that if we treat /proc/$PID going away as the flag, then
> we would be releasing the hostdevs for reuse before the kernel has cleaned
> them up. In the best case this would result in a 2nd guest failing to start
> because the device was still in use; in the worst case we could crash
> the entire host (though I'd be hopeful vfio prevents that).
>

Yeah, I agree that resources still being in use could lead to bad and rather
hard-to-debug problems.

> > An alternative would be to understand on the Kernel side why the PID is
> > gone "too early" and fix that so it stays until fully cleaned up.
> > But even then on the Libvirt side we would need the extended timeout
> > values.
>
> Yeah, looks like extended timeouts are unavoidable. The only real
> optimization would be to pass an explicit timeout to the kill method,
> increasing it by 2 seconds for each hostdev that is assigned. That way
> we'll scale the timeout up as we need, so we don't have to predict the
> worst case number of assigned devices.
>

I'd do both:
- extend the KILL path timeout (if force is set) in general, to give bad
  systems a chance
- extend the maximum by 2s per hostdev
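A minimal sketch of that scaling rule. The base timeout, the helper name, and
the constants are illustrative assumptions, not the actual patch:

```c
/* Hypothetical sketch: scale the hard-shutdown timeout with the number
 * of assigned hostdevs. BASE_KILL_TIMEOUT_SECONDS is an assumed base
 * value; the 2s-per-device increment is the one discussed above. */
#include <stddef.h>

#define BASE_KILL_TIMEOUT_SECONDS 10  /* assumed base value */
#define PER_HOSTDEV_EXTRA_SECONDS 2

static unsigned int
computeKillTimeout(size_t nhostdevs)
{
    /* e.g. 16 passed-through GPUs -> 10 + 16*2 = 42 seconds */
    return BASE_KILL_TIMEOUT_SECONDS +
           (unsigned int)nhostdevs * PER_HOSTDEV_EXTRA_SECONDS;
}
```

That way the timeout grows only on guests that actually carry many hostdevs,
instead of penalizing every shutdown with a worst-case wait.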

I'll submit that in a few minutes as a reply.


> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>


-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

2018-08-06 Thread Daniel P. Berrangé
On Mon, Aug 06, 2018 at 07:20:10AM +0200, Christian Ehrhardt wrote:
> In that case I wonder what the libvirt community thinks of the proposed
> general "Pid is gone means we can assume it is dead" approach?

The key thing with the shutdown process is that we use the disappearance of
the PID as the flag to indicate that it is safe to release any resources that
the PID was using, e.g. the hostdevs are now available for another guest to use.

I'd be concerned that if we treat /proc/$PID going away as the flag, then
we would be releasing the hostdevs for reuse before the kernel has cleaned
them up. In the best case this would result in a 2nd guest failing to start
because the device was still in use; in the worst case we could crash
the entire host (though I'd be hopeful vfio prevents that).
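The ordering concern here can be sketched as follows: treat resources as
releasable only once the child has actually been reaped via waitpid(), not
merely once /proc/<pid> stops being visible. This is a hypothetical helper
for illustration, not libvirt's real code:

```c
/* Hedged sketch: a child is only "fully gone" once waitpid() has
 * collected it. waitpid() returns the pid only after the kernel has
 * finished tearing the process down, including closing the file
 * descriptors that pin vfio/hostdev state. */
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

static bool
childFullyExited(pid_t pid)
{
    int status;
    /* WNOHANG: poll without blocking; returns 0 while the child is
     * still running, the pid once it can be reaped, -1 on error
     * (e.g. ECHILD when the pid is not our child). */
    return waitpid(pid, &status, WNOHANG) == pid;
}
```

Releasing hostdevs only after such a check succeeds avoids the window where
/proc is already empty but the kernel still holds the devices.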

> An alternative would be to understand on the Kernel side why the PID is
> gone "too early" and fix that so it stays until fully cleaned up.
> But even then on the Libvirt side we would need the extended timeout values.

Yeah, looks like extended timeouts are unavoidable. The only real optimization
would be to pass an explicit timeout to the kill method, increasing it by 2
seconds for each hostdev that is assigned. That way we'll scale the timeout
up as we need, so we don't have to predict the worst case number of assigned
devices.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

2018-08-05 Thread Christian Ehrhardt
On Fri, Aug 3, 2018 at 6:39 PM Alex Williamson wrote:

> On Fri,  3 Aug 2018 08:29:39 +0200
> Christian Ehrhardt  wrote:
>
> > Hi,
> > I was recently looking into a case which essentially looked like this:
> >   1. virsh shutdown guest
> >   2. after <1 second the qemu process was gone from /proc/
> >   3. but libvirt spun in virProcessKillPainfully because the process
> >  was still reachable via signals
> >   4. virProcessKillPainfully eventually fails after 15 seconds and the
> >  guest stays in "in shutdown" state forever
> >
> > This is not one of the common cases I've found for
> > virProcessKillPainfully to break:
> > - bad I/O e.g. NFS gets qemu stuck
> > - CPU overload stalls things to death
> > - qemu not being reaped (by init)
> > All of the above would have the process still available in /proc/
> > as Zombie or in uninterruptible sleep, but that is not true in my case.
> >
> > It turned out that the case was dependent on the amount of hostdev
> > resources passed to the guest. Debugging showed that with 8 and even more
> > so with 16 GPUs passed, it took ~18 seconds from SIGTERM to "no longer
> > reachable with signal 0".
> > I haven't conducted many more tests and stayed with the 16 GPU case, but
> > I'm rather sure more devices might make it take even longer.
>
> If it's dependent on device assignment, then it's probably either
> related to unmapping DMA or resetting devices.  The former should scale
> with the size of the VM, not the number of devices attached.  The
> latter could increase with each device.  Typically with physical GPUs
> we don't have a function level reset mechanism so we need to do a
> secondary bus reset on the upstream bridge to reset the device, this
> requires a 1s delay to let the bus settle after reset.  So if we're
> gated by these sorts of resets, your scaling doesn't sound
> unreasonable,


So the scaling makes sense: ~16*1s plus a little default cleanup time
matches the ~18 seconds I see.
Thanks for that explanation!


> though I'm not sure how these factor into the process
> state you're seeing.


Yeah, I'd have expected to still see it in some form, e.g. as a zombie or in
uninterruptible sleep, but it really is gone.
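The probe behind "still reachable via signals" can be sketched like this.
Signal 0 asks the kernel whether the PID can still receive signals, which can
still succeed even after /proc/<pid> is no longer listed. This is a
simplified illustration, not virProcessKillPainfully's actual logic:

```c
/* Hedged sketch of a signal-0 liveness probe: kill(pid, 0) delivers no
 * signal, it only performs the existence and permission checks. */
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

static bool
pidStillReachable(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;           /* the process can still be signalled */
    return errno == EPERM;     /* it exists, but we lack permission */
}
```

The discrepancy in this thread is exactly that this probe keeps returning
true for ~18 seconds after the /proc entry has vanished.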



> I'd also be surprised if you have a system that
> can host 16 physical GPUs, so maybe this is a vGPU example?


16*physical GPU it is :-)
See https://www.nvidia.com/en-us/data-center/dgx-2/


> Any mdev
> device should provide a reset callback for roughly the equivalent of a
> function level reset.  Implementation of such a reset would be vendor
> specific.


Since these are not classic mdev devices [1][2], but 16 physical GPUs, the
callback suggestion would not make sense, right?
In that case I wonder what the libvirt community thinks of the proposed
general "Pid is gone means we can assume it is dead" approach?

An alternative would be to understand on the Kernel side why the PID is
gone "too early" and fix that so it stays until fully cleaned up.
But even then on the Libvirt side we would need the extended timeout values.

[1]: https://libvirt.org/drvnodedev.html#MDEV
[2]: https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt


> Thanks,
> Alex


-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

2018-08-05 Thread no-reply
Hi,

This series was run against the 'syntax-check' test by patchew.org, which
failed; please find the details below:

Type: series
Message-id: 20180803062941.31641-1-christian.ehrha...@canonical.com
Subject: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

=== TEST SCRIPT BEGIN ===
#!/bin/bash
# Testing script will be invoked under the git checkout with
# HEAD pointing to a commit that has the patches applied on top of "base"
# branch
time bash -c './autogen.sh && make syntax-check'
=== TEST SCRIPT END ===

Updating bcb55ab053bc79561b55d0394490f4b64e0f2d01
Switched to a new branch 'test'
c874968466 process: accept the lack of /proc/ as valid process removal
4c5a51d196 process: wait longer 5->30s on hard shutdown

=== OUTPUT BEGIN ===
Updating submodules...
Submodule 'gnulib' (https://git.savannah.gnu.org/git/gnulib.git/) registered for path '.gnulib'
Submodule 'keycodemapdb' (https://gitlab.com/keycodemap/keycodemapdb.git) registered for path 'src/keycodemapdb'
Cloning into '/var/tmp/patchew-tester-tmp-tugv8oz4/src/.gnulib'...
Cloning into '/var/tmp/patchew-tester-tmp-tugv8oz4/src/src/keycodemapdb'...
Submodule path '.gnulib': checked out '68df637b5f1b5c10370f6981d2a43a5cf74368df'
Submodule path 'src/keycodemapdb': checked out '16e5b0787687d8904dad2c026107409eb9bfcb95'
Running bootstrap...
./bootstrap: Bootstrapping from checked-out libvirt sources...
./bootstrap: consider installing git-merge-changelog from gnulib
./bootstrap: getting gnulib files...
running: libtoolize --install --copy
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'build-aux'.
libtoolize: copying file 'build-aux/config.guess'
libtoolize: copying file 'build-aux/config.sub'
libtoolize: copying file 'build-aux/install-sh'
libtoolize: copying file 'build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
./bootstrap: .gnulib/gnulib-tool --no-changelog --aux-dir=build-aux
  --doc-base=doc --lib=libgnu --m4-base=m4/ --source-base=gnulib/lib/
  --tests-base=gnulib/tests --local-dir=gnulib/local --lgpl=2 --with-tests
  --makefile-name=gnulib.mk --avoid=pt_chown --avoid=lock-tests --libtool
  --import ...
Module list with included dependencies (indented):
  [... gnulib module list truncated in the archive ...]

Re: [libvirt] [RFC 0/2] Fix detection of slow guest shutdown

2018-08-05 Thread Alex Williamson
On Fri,  3 Aug 2018 08:29:39 +0200
Christian Ehrhardt  wrote:

> Hi,
> I was recently looking into a case which essentially looked like this:
>   1. virsh shutdown guest
>   2. after <1 second the qemu process was gone from /proc/
>   3. but libvirt spun in virProcessKillPainfully because the process
>  was still reachable via signals
>   4. virProcessKillPainfully eventually fails after 15 seconds and the
>  guest stays in "in shutdown" state forever
> 
> This is not one of the common cases I've found for
> virProcessKillPainfully to break:
> - bad I/O e.g. NFS gets qemu stuck
> - CPU overload stalls things to death
> - qemu not being reaped (by init)
> All of the above would have the process still available in /proc/
> as Zombie or in uninterruptible sleep, but that is not true in my case.
> 
> It turned out that the case was dependent on the amount of hostdev resources
> passed to the guest. Debugging showed that with 8 and even more so with 16
> GPUs passed, it took ~18 seconds from SIGTERM to "no longer reachable with
> signal 0".
> I haven't conducted many more tests and stayed with the 16 GPU case, but
> I'm rather sure more devices might make it take even longer.

If it's dependent on device assignment, then it's probably either
related to unmapping DMA or resetting devices.  The former should scale
with the size of the VM, not the number of devices attached.  The
latter could increase with each device.  Typically with physical GPUs
we don't have a function level reset mechanism so we need to do a
secondary bus reset on the upstream bridge to reset the device, this
requires a 1s delay to let the bus settle after reset.  So if we're
gated by these sorts of resets, your scaling doesn't sound
unreasonable, though I'm not sure how these factor into the process
state you're seeing.  I'd also be surprised if you have a system that
can host 16 physical GPUs, so maybe this is a vGPU example?  Any mdev
device should provide a reset callback for roughly the equivalent of a
function level reset.  Implementation of such a reset would be vendor
specific.  Thanks,

Alex

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list