There was another suggestion [1], but with it applied the case still
hangs (after 12 and 1 iteration(s), so not much later than usual).
However, the threads looked slightly different this time:
Id Target Id Frame
* 1 Thread 0x7f1eab8efb40 (LWP 13686) "libvirtd" __lll_lock_wait () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
2 Thread 0x7f1eab434700 (LWP 13688) "libvirtd" futex_wait_cancelable
(private=<optimized out>, expected=0, futex_word=0x557ce654a534) at
../sysdeps/unix/sysv/linux/futex-internal.h:88
[...]
I see only one thread directly in lowlevellock.S:
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f1eaf378945 in __GI___pthread_mutex_lock (mutex=0x7f1e8c0016d0) at
../nptl/pthread_mutex_lock.c:80
#2 0x00007f1eaf4ef095 in virMutexLock (m=<optimized out>) at
../../../src/util/virthread.c:89
#3 0x00007f1eaf580fbc in virChrdevFDStreamCloseCb (st=st@entry=0x7f1e9c0128f0,
opaque=opaque@entry=0x7f1e9c031090) at ../../../src/conf/virchrdev.c:252
#4 0x00007f1eaf48f180 in virFDStreamCloseInt (st=0x7f1e9c0128f0,
streamAbort=<optimized out>) at ../../../src/util/virfdstream.c:742
#5 0x00007f1eaf6bbec9 in virStreamAbort (stream=0x7f1e9c0128f0) at
../../../src/libvirt-stream.c:1244
#6 0x0000557ce5bd83aa in daemonStreamHandleAbort
(client=client@entry=0x557ce65cc650, stream=stream@entry=0x7f1e9c0315b0,
msg=msg@entry=0x557ce65d1e20) at ../../../src/remote/remote_daemon_stream.c:636
#7 0x0000557ce5bd8ee3 in daemonStreamHandleWrite (stream=0x7f1e9c0315b0,
client=0x557ce65cc650) at ../../../src/remote/remote_daemon_stream.c:749
[...]
On a retry this was the same again, so the suggested patch did change
something, but not enough yet.
I need to find out which lock that actually is and, if possible, who
holds it at the moment.
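For the record, one direct way to answer the "who holds it" question from
the hung process is to let glibc's mutex pretty-printer (clearly available
here, see below) show the owner TID of the mutex from frame #1 and match
it against the LWP numbers in the thread list:
(gdb) frame 1
(gdb) p *mutex
(gdb) info threads
The "Owner ID" in that output is the kernel TID (LWP) of the thread that
currently holds the mutex.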
The lock in question is the one taken at
virMutexLock(&priv->devs->lock);
in virChrdevFDStreamCloseCb.
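For context, that callback is roughly shaped like this (a simplified
sketch pieced together from the lines quoted in this comment; the priv
type name and the omitted error handling are approximations, not the
verbatim upstream code):

static void virChrdevFDStreamCloseCb(virStreamPtr st, void *opaque)
{
    virChrdevStreamInfoPtr priv = opaque;

    /* serialize against other users of the per-domain chardev set */
    virMutexLock(&priv->devs->lock);

    /* drop the entry for this console path - this is where the bogus
       hash pointer seen further down would get dereferenced */
    virHashRemoveEntry(priv->devs->hash, priv->path);

    virMutexUnlock(&priv->devs->lock);
}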
(gdb) p priv->devs->lock
$1 = {lock = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust =
Yes, Shared = No, Protocol = Priority protect, Priority ceiling = 0}}
(gdb) p &priv->devs->lock
$2 = (virMutex *) 0x7f4554020b20
Interesting that it lists Status as "Not acquired", even though this
thread is blocked trying to take it.
I wanted to check which path that would be, but the value for the hash seems
wrong:
(gdb) p priv->devs->hash
$6 = (virHashTablePtr) 0x25
The code would usually access the hash, and 0x25 is not a valid address.
The code right after the lock would then have failed in
virHashRemoveEntry(priv->devs->hash, priv->path);
which dereferences the 0x25 pointer here:
nextptr = table->table + virHashComputeKey(table, name);
So we are looking at a structure that has not been fully cleaned up.
Most likely, if this were not a locking issue, it would be a crash instead.
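As a standalone toy (hypothetical names, not libvirt code), this is why a
table pointer of 0x25 turns into a SIGSEGV rather than a hang as soon as
the removal path runs:

#include <stddef.h>

struct entry { struct entry *next; };
struct table { struct entry **buckets; size_t size; };

static struct entry **bucket_for(struct table *t, unsigned long key)
{
    /* with t == (struct table *)0x25, reading t->buckets and t->size
       touches an unmapped page, so the process crashes right here */
    return t->buckets + (key % t->size);
}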
OTOH that might be due to our unlocking with the most recent patch [1],
which allows the struct to go partially away. I dropped the patch and
debugged again to see whether it would be more useful to check there for
the actual lock and path.
I was back at my two backtraces fighting for the lock.
The lock was now in a "better" state:
(gdb) p priv->devs->lock
$9 = {lock = pthread_mutex_t = {Type = Normal, Status = Acquired, possibly with
waiters, Owner ID = 23102, Robust = No, Shared = No, Protocol = None}}
(gdb) p priv->devs->hash
$10 = (virHashTablePtr) 0x7f2928000c00
It is a one-entry list:
(gdb) p priv->devs->hash->table.next
Cannot access memory at address 0x0
(gdb) p (virHashEntry)priv->devs->hash->table
$13 = {next = 0x7f2928000fe0, name = 0xa4b28ee3, payload = 0x3}
Letting the FD stream unlock in between did not help (if anything it
made things worse, leaving a stale, partially torn-down struct that
would crash).
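To spell out the suspected sequence, here is a minimal standalone sketch
with made-up names (not libvirt code), assuming the extra unlock from [1]
lets a second thread tear the shared structure down in between:

#include <pthread.h>
#include <stdlib.h>

struct devs {
    pthread_mutex_t lock;
    void *hash;                   /* stands in for priv->devs->hash */
};

static struct devs *shared_devs;  /* what priv->devs points at */

/* runs once the close path has dropped its own lock */
static void *teardown(void *arg)
{
    struct devs *d = arg;
    pthread_mutex_destroy(&d->lock);
    free(d);                      /* priv->devs is now dangling */
    return NULL;
}

/* simplified close callback: after teardown() both the mutex and the
   hash pointer are read from freed memory - sometimes that hangs on a
   garbage mutex, sometimes it crashes on a garbage hash pointer */
static void close_cb(struct devs *d)
{
    pthread_mutex_lock(&d->lock);
    /* ... would dereference d->hash here ... */
    pthread_mutex_unlock(&d->lock);
}

int main(void)
{
    pthread_t t;

    shared_devs = calloc(1, sizeof(*shared_devs));
    pthread_mutex_init(&shared_devs->lock, NULL);

    /* imagine the close path dropping its lock right here */
    pthread_create(&t, NULL, teardown, shared_devs);
    pthread_join(&t, NULL);

    close_cb(shared_devs);        /* use-after-free, as with the stale
                                     priv->devs seen above */
    return 0;
}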
[1]: https://www.redhat.com/archives/libvir-list/2019-April/msg00207.html