Re: Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-10-08 Thread Jim Fehlig

On 10/7/21 20:45, Jim Fehlig wrote:

On 10/4/21 08:55, Michal Prívozník wrote:

On 9/30/21 7:15 PM, Jim Fehlig wrote:

On 9/29/21 21:29, Jim Fehlig wrote:

Hi All,

Likely Christian received a bug report that motivated commit
aeda1b8c56, which was later reverted by Michal with commit 72adaf2f10.
In the past, I recall being asked about "internal error: End of file
from qemu monitor" on normal VM shutdown and gave a hand wavy response
using some of Michal's words from the revert commit message.

I recently received a bug report (sorry, but no public link) from a
concerned user about this error and wondered if there is some way to
improve it? I went down some dead ends before circling back to
Christian's patch. When rebased to latest master, I cannot reproduce
the hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
hangs/crashes of libvirtd [2] has now made Christian's patch viable?


Hmm, Nikolay's series improves thread management at daemon shutdown and
doesn't touch VM shutdown logic. But there has been some behavior change
from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
master. At least I don't see libvirtd hanging when running Michal's test
on master + rebased aeda1b8c56.


After reworking the tests a bit, I still don't see libvirtd hanging, but I do 
see VM shutdown stuck in "in shutdown" state. Attaching gdb shows the following 
thread blocked waiting for a response from the monitor that will never come 
since EOF has already occurred on the socket


Thread 21 (Thread 0x7fdb557fa700 (LWP 9110) "rpc-worker"):
#0  0x7fdbc922a4dc in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x7fdbccee2310 in virCondWait (c=0x7fdba403cd40, m=0x7fdba403cd18) at 
../src/util/virthread.c:156
#2  0x7fdb87150a8b in qemuMonitorSend (mon=0x7fdba403cd00, 
msg=0x7fdb557f95b0) at ../src/qemu/qemu_monito

r.c:964
#3  0x7fdb8715fbf1 in qemuMonitorJSONCommandWithFd (mon=0x7fdba403cd00, 
cmd=0x7fdb4015bae0, scm_fd=-1, re

ply=0x7fdb557f9678) at ../src/qemu/qemu_monitor_json.c:327
#4  0x7fdb8715fda0 in qemuMonitorJSONCommand (mon=0x7fdba403cd00, 
cmd=0x7fdb4015bae0, reply=0x7fdb557f967

8) at ../src/qemu/qemu_monitor_json.c:352
#5  0x7fdb87174b71 in qemuMonitorJSONGetIOThreads (mon=0x7fdba403cd00, 
iothreads=0x7fdb557f9798, niothrea

ds=0x7fdb557f9790) at ../src/qemu/qemu_monitor_json.c:7838
#6  0x7fdb8715d059 in qemuMonitorGetIOThreads (mon=0x7fdba403cd00, 
iothreads=0x7fdb557f9798, niothreads=0

x7fdb557f9790) at ../src/qemu/qemu_monitor.c:4083
#7  0x7fdb870e8ae3 in qemuDomainGetIOThreadsMon (driver=0x7fdb6c06a4f0, 
vm=0x7fdb6c063940, iothreads=0x7f

db557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_driver.c:4941
#8  0x7fdb871129bf in qemuDomainGetStatsIOThread (driver=0x7fdb6c06a4f0, 
dom=0x7fdb6c063940, params=0x7fd

b401c0cc0, privflags=1) at ../src/qemu/qemu_driver.c:18292
#9  0x7fdb871130ee in qemuDomainGetStats (conn=0x7fdb9c006760, 
dom=0x7fdb6c063940, stats=1023, record=0x7

fdb557f98d0, flags=1) at ../src/qemu/qemu_driver.c:18504
#10 0x7fdb87113526 in qemuConnectGetAllDomainStats (conn=0x7fdb9c006760, 
doms=0x0, ndoms=0, stats=1023, r

etStats=0x7fdb557f9990, flags=0) at ../src/qemu/qemu_driver.c:18598
#11 0x7fdbcd163e4e in virConnectGetAllDomainStats (conn=0x7fdb9c006760, 
stats=0, retStats=0x7fdb557f9990,

  flags=0) at ../src/libvirt-domain.c:11975
...

So indeed, reporting the error when processing monitor IO is needed to prevent 
other threads from subsequently writing to the socket. One idea I had was to 
postpone reporting the error until someone tries to write to the socket, 
although not reporting an error when it is encountered seems distasteful. I've 
been testing such a hack (attached) and along with squelching the error message, 
I no longer see VMs stuck in the "in shutdown" state after 32 iterations of the 
test. A simple rebase of aeda1b8c56 on current git master never survived more 
than a dozen iterations. I'll let the test continue to run.


The test has now survived 1134 iterations. I adjusted the patch to be more 
upstream friendly and submitted it as an RFC


https://listman.redhat.com/archives/libvir-list/2021-October/msg00484.html

I restarted the tests with this version of the patch and now at iteration 75.

Regards,
Jim




Re: Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-10-08 Thread Jim Fehlig

On 10/4/21 08:55, Michal Prívozník wrote:

On 9/30/21 7:15 PM, Jim Fehlig wrote:

On 9/29/21 21:29, Jim Fehlig wrote:

Hi All,

Likely Christian received a bug report that motivated commit
aeda1b8c56, which was later reverted by Michal with commit 72adaf2f10.
In the past, I recall being asked about "internal error: End of file
from qemu monitor" on normal VM shutdown and gave a hand wavy response
using some of Michal's words from the revert commit message.

I recently received a bug report (sorry, but no public link) from a
concerned user about this error and wondered if there is some way to
improve it? I went down some dead ends before circling back to
Christian's patch. When rebased to latest master, I cannot reproduce
the hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
hangs/crashes of libvirtd [2] has now made Christian's patch viable?


Hmm, Nikolay's series improves thread management at daemon shutdown and
doesn't touch VM shutdown logic. But there has been some behavior change
from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
master. At least I don't see libvirtd hanging when running Michal's test
on master + rebased aeda1b8c56.


After reworking the tests a bit, I still don't see libvirtd hanging, but I do 
see VM shutdown stuck in "in shutdown" state. Attaching gdb shows the following 
thread blocked waiting for a response from the monitor that will never come 
since EOF has already occurred on the socket


Thread 21 (Thread 0x7fdb557fa700 (LWP 9110) "rpc-worker"):
#0  0x7fdbc922a4dc in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x7fdbccee2310 in virCondWait (c=0x7fdba403cd40, m=0x7fdba403cd18) at 
../src/util/virthread.c:156
#2  0x7fdb87150a8b in qemuMonitorSend (mon=0x7fdba403cd00, 
msg=0x7fdb557f95b0) at ../src/qemu/qemu_monito

r.c:964
#3  0x7fdb8715fbf1 in qemuMonitorJSONCommandWithFd (mon=0x7fdba403cd00, 
cmd=0x7fdb4015bae0, scm_fd=-1, re

ply=0x7fdb557f9678) at ../src/qemu/qemu_monitor_json.c:327
#4  0x7fdb8715fda0 in qemuMonitorJSONCommand (mon=0x7fdba403cd00, 
cmd=0x7fdb4015bae0, reply=0x7fdb557f967

8) at ../src/qemu/qemu_monitor_json.c:352
#5  0x7fdb87174b71 in qemuMonitorJSONGetIOThreads (mon=0x7fdba403cd00, 
iothreads=0x7fdb557f9798, niothrea

ds=0x7fdb557f9790) at ../src/qemu/qemu_monitor_json.c:7838
#6  0x7fdb8715d059 in qemuMonitorGetIOThreads (mon=0x7fdba403cd00, 
iothreads=0x7fdb557f9798, niothreads=0

x7fdb557f9790) at ../src/qemu/qemu_monitor.c:4083
#7  0x7fdb870e8ae3 in qemuDomainGetIOThreadsMon (driver=0x7fdb6c06a4f0, 
vm=0x7fdb6c063940, iothreads=0x7f

db557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_driver.c:4941
#8  0x7fdb871129bf in qemuDomainGetStatsIOThread (driver=0x7fdb6c06a4f0, 
dom=0x7fdb6c063940, params=0x7fd

b401c0cc0, privflags=1) at ../src/qemu/qemu_driver.c:18292
#9  0x7fdb871130ee in qemuDomainGetStats (conn=0x7fdb9c006760, 
dom=0x7fdb6c063940, stats=1023, record=0x7

fdb557f98d0, flags=1) at ../src/qemu/qemu_driver.c:18504
#10 0x7fdb87113526 in qemuConnectGetAllDomainStats (conn=0x7fdb9c006760, 
doms=0x0, ndoms=0, stats=1023, r

etStats=0x7fdb557f9990, flags=0) at ../src/qemu/qemu_driver.c:18598
#11 0x7fdbcd163e4e in virConnectGetAllDomainStats (conn=0x7fdb9c006760, 
stats=0, retStats=0x7fdb557f9990,

 flags=0) at ../src/libvirt-domain.c:11975
...

So indeed, reporting the error when processing monitor IO is needed to prevent 
other threads from subsequently writing to the socket. One idea I had was to 
postpone reporting the error until someone tries to write to the socket, 
although not reporting an error when it is encountered seems distasteful. I've 
been testing such a hack (attached) and along with squelching the error message, 
I no longer see VMs stuck in the "in shutdown" state after 32 iterations of the 
test. A simple rebase of aeda1b8c56 on current git master never survived more 
than a dozen iterations. I'll let the test continue to run.


For reference, also attached is the libvirtd-hang-test.sh hack I use to 
continually start/stop VMs and a tweaked version of your stats.py test.



That's sort of expected - back in that era we had one event loop for
everything. Thus things like virGetLastErrorCode() which use thread
local variables did not work as expected and thus we needed to save the
error in mon->lastError. But now we have one additional thread per each
QEMU just for handling monitor IO, thus virGetLastErrorCode() works now
and we could lift some restrictions. Therefore I think we can revisit
Christian's patch again. I too see some concerned users who think it's a
true error.


Since Christian's patch won't work as-is, any other ideas beyond what I 
described above?


Regards,
Jim

libvirtd-hang-test.sh
Description: application/shellscript
import time
import libvirt

conn = libvirt.open("qemu:///system")

while True:
try:
print(conn.getAllDomainStats())
print("")
time.sleep(1)
except 

Re: Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-10-04 Thread Michal Prívozník
On 9/30/21 7:15 PM, Jim Fehlig wrote:
> On 9/29/21 21:29, Jim Fehlig wrote:
>> Hi All,
>>
>> Likely Christian received a bug report that motivated commit
>> aeda1b8c56, which was later reverted by Michal with commit 72adaf2f10.
>> In the past, I recall being asked about "internal error: End of file
>> from qemu monitor" on normal VM shutdown and gave a hand wavy response
>> using some of Michal's words from the revert commit message.
>>
>> I recently received a bug report (sorry, but no public link) from a
>> concerned user about this error and wondered if there is some way to
>> improve it? I went down some dead ends before circling back to
>> Christian's patch. When rebased to latest master, I cannot reproduce
>> the hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
>> hangs/crashes of libvirtd [2] has now made Christian's patch viable?
> 
> Hmm, Nikolay's series improves thread management at daemon shutdown and
> doesn't touch VM shutdown logic. But there has been some behavior change
> from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
> master. At least I don't see libvirtd hanging when running Michal's test
> on master + rebased aeda1b8c56.

That's sort of expected - back in that era we had one event loop for
everything. Thus things like virGetLastErrorCode() which use thread
local variables did not work as expected and thus we needed to save the
error in mon->lastError. But now we have one additional thread per each
QEMU just for handling monitor IO, thus virGetLastErrorCode() works now
and we could lift some restrictions. Therefore I think we can revisit
Christian's patch again. I too see some concerned users who think it's a
true error.

Michal



Re: Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-09-30 Thread Jim Fehlig

On 9/30/21 11:24, Daniel P. Berrangé wrote:

On Thu, Sep 30, 2021 at 11:15:18AM -0600, Jim Fehlig wrote:

On 9/29/21 21:29, Jim Fehlig wrote:

Hi All,

Likely Christian received a bug report that motivated commit aeda1b8c56,
which was later reverted by Michal with commit 72adaf2f10. In the past,
I recall being asked about "internal error: End of file from qemu
monitor" on normal VM shutdown and gave a hand wavy response using some
of Michal's words from the revert commit message.

I recently received a bug report (sorry, but no public link) from a
concerned user about this error and wondered if there is some way to
improve it? I went down some dead ends before circling back to
Christian's patch. When rebased to latest master, I cannot reproduce the
hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
hangs/crashes of libvirtd [2] has now made Christian's patch viable?


Hmm, Nikolay's series improves thread management at daemon shutdown and
doesn't touch VM shutdown logic. But there has been some behavior change
from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
master. At least I don't see libvirtd hanging when running Michal's test on
master + rebased aeda1b8c56.


This particular "eof" error message has been a source of never ending
complaints from people who mistakenly think it indicates a significant
problem.


Nod. I've probably been asked about it more times than I'm remembering. "error" 
and "fail" often catch the attention of users and monitor tools.



I've always been wary of hiding it by default as there are potentially
scenarios it which it is interesting to see. I think I'm coming around
to the idea though that we're better off hiding it by default. The
scenarios which care about it will probably already need the user to
contribute full debug level logs in order to diagnose properly.


I've started a variant of Michal's test on git master + rebased aeda1b8c56. The 
test starts 6 VMs, waits 90sec, shuts them down in parallel, waits for shutdowns 
to finish, then repeats. All while continuously calling GetAllDomainStats from 
another client. I'll see how that holds up before re-proposing Christian's patch.


Regards,
Jim




Re: Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-09-30 Thread Jim Fehlig

On 9/29/21 21:29, Jim Fehlig wrote:

Hi All,

Likely Christian received a bug report that motivated commit aeda1b8c56, which 
was later reverted by Michal with commit 72adaf2f10. In the past, I recall being 
asked about "internal error: End of file from qemu monitor" on normal VM 
shutdown and gave a hand wavy response using some of Michal's words from the 
revert commit message.


I recently received a bug report (sorry, but no public link) from a concerned 
user about this error and wondered if there is some way to improve it? I went 
down some dead ends before circling back to Christian's patch. When rebased to 
latest master, I cannot reproduce the hangs reported by Michal [1]. Perhaps 
Nikolay's series to resolve hangs/crashes of libvirtd [2] has now made 
Christian's patch viable?


Hmm, Nikolay's series improves thread management at daemon shutdown and doesn't 
touch VM shutdown logic. But there has been some behavior change from the time 
aeda1b8c56 was committed (3.4.0 dev cycle) to current git master. At least I 
don't see libvirtd hanging when running Michal's test on master + rebased 
aeda1b8c56.


Regards,
Jim


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1536461
[2] https://listman.redhat.com/archives/libvir-list/2020-July/msg01606.html




Re: Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-09-30 Thread Daniel P . Berrangé
On Thu, Sep 30, 2021 at 11:15:18AM -0600, Jim Fehlig wrote:
> On 9/29/21 21:29, Jim Fehlig wrote:
> > Hi All,
> > 
> > Likely Christian received a bug report that motivated commit aeda1b8c56,
> > which was later reverted by Michal with commit 72adaf2f10. In the past,
> > I recall being asked about "internal error: End of file from qemu
> > monitor" on normal VM shutdown and gave a hand wavy response using some
> > of Michal's words from the revert commit message.
> > 
> > I recently received a bug report (sorry, but no public link) from a
> > concerned user about this error and wondered if there is some way to
> > improve it? I went down some dead ends before circling back to
> > Christian's patch. When rebased to latest master, I cannot reproduce the
> > hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
> > hangs/crashes of libvirtd [2] has now made Christian's patch viable?
> 
> Hmm, Nikolay's series improves thread management at daemon shutdown and
> doesn't touch VM shutdown logic. But there has been some behavior change
> from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
> master. At least I don't see libvirtd hanging when running Michal's test on
> master + rebased aeda1b8c56.

This particular "eof" error message has been a source of never ending
complaints from people who mistakenly think it indicates a significant
problem.

I've always been wary of hiding it by default as there are potentially
scenarios it which it is interesting to see. I think I'm coming around
to the idea though that we're better off hiding it by default. The
scenarios which care about it will probably already need the user to
contribute full debug level logs in order to diagnose properly.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Squelch 'eof from qemu monitor' error on normal VM shutdown

2021-09-30 Thread Jim Fehlig

Hi All,

Likely Christian received a bug report that motivated commit aeda1b8c56, which 
was later reverted by Michal with commit 72adaf2f10. In the past, I recall being 
asked about "internal error: End of file from qemu monitor" on normal VM 
shutdown and gave a hand wavy response using some of Michal's words from the 
revert commit message.


I recently received a bug report (sorry, but no public link) from a concerned 
user about this error and wondered if there is some way to improve it? I went 
down some dead ends before circling back to Christian's patch. When rebased to 
latest master, I cannot reproduce the hangs reported by Michal [1]. Perhaps 
Nikolay's series to resolve hangs/crashes of libvirtd [2] has now made 
Christian's patch viable?


Regards
Jim

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1536461
[2] https://listman.redhat.com/archives/libvir-list/2020-July/msg01606.html