There was another suggestion [1], but with it applied the case still
hangs (after 12 and 1 iteration(s), so not much later than usual).
However, the threads looked slightly different this time:
Id Target Id Frame
* 1 Thread 0x7f1eab8efb40 (LWP 13686) "libvirtd" __lll_lock_wait () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
2 Thread 0x7f1eab434700 (LWP 13688) "libvirtd" futex_wait_cancelable
(private=<optimized out>, expected=0, futex_word=0x557ce654a534) at
../sysdeps/unix/sysv/linux/futex-internal.h:88
[...]
I see only one thread directly in lowlevellock.S:
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f1eaf378945 in __GI___pthread_mutex_lock (mutex=0x7f1e8c0016d0) at
../nptl/pthread_mutex_lock.c:80
#2 0x00007f1eaf4ef095 in virMutexLock (m=<optimized out>) at
../../../src/util/virthread.c:89
#3 0x00007f1eaf580fbc in virChrdevFDStreamCloseCb (st=st@entry=0x7f1e9c0128f0,
opaque=opaque@entry=0x7f1e9c031090) at ../../../src/conf/virchrdev.c:252
#4 0x00007f1eaf48f180 in virFDStreamCloseInt (st=0x7f1e9c0128f0,
streamAbort=<optimized out>) at ../../../src/util/virfdstream.c:742
#5 0x00007f1eaf6bbec9 in virStreamAbort (stream=0x7f1e9c0128f0) at
../../../src/libvirt-stream.c:1244
#6 0x0000557ce5bd83aa in daemonStreamHandleAbort
(client=client@entry=0x557ce65cc650, stream=stream@entry=0x7f1e9c0315b0,
msg=msg@entry=0x557ce65d1e20) at ../../../src/remote/remote_daemon_stream.c:636
#7 0x0000557ce5bd8ee3 in daemonStreamHandleWrite (stream=0x7f1e9c0315b0,
client=0x557ce65cc650) at ../../../src/remote/remote_daemon_stream.c:749
[...]
On a retry this was the same again, so the suggested patch did change
something, but not enough yet.
I need to find out which lock that actually is and, if possible, who
holds it at the moment.
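For the record, one direct way to answer the "who holds it" question from
the hung process is to let glibc's mutex pretty-printer (clearly available
here, see below) show the owner TID of the mutex from frame #1 and match
it against the LWP numbers in the thread list:
(gdb) frame 1
(gdb) p *mutex
(gdb) info threads
The "Owner ID" in that output is the kernel TID (LWP) of the thread that
currently holds the mutex.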
The lock in question is the one taken at
virMutexLock(&priv->devs->lock);
in virChrdevFDStreamCloseCb.
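For context, that callback is roughly shaped like this (a simplified
sketch pieced together from the lines quoted in this comment; the priv
type name and the omitted error handling are approximations, not the
verbatim upstream code):

static void virChrdevFDStreamCloseCb(virStreamPtr st, void *opaque)
{
    virChrdevStreamInfoPtr priv = opaque;

    /* serialize against other users of the per-domain chardev set */
    virMutexLock(&priv->devs->lock);

    /* drop the entry for this console path - this is where the bogus
       hash pointer seen further down would get dereferenced */
    virHashRemoveEntry(priv->devs->hash, priv->path);

    virMutexUnlock(&priv->devs->lock);
}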
(gdb) p priv->devs->lock
$1 = {lock = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust =
Yes, Shared = No, Protocol = Priority protect, Priority ceiling = 0}}
(gdb) p &priv->devs->lock
$2 = (virMutex *) 0x7f4554020b20
Interesting that it lists Status as "Not acquired", even though this
thread is blocked trying to take it.
I wanted to check which path that would be, but the value for the hash seems
wrong:
(gdb) p priv->devs->hash
$6 = (virHashTablePtr) 0x25
The code would usually access the hash, and 0x25 is not a valid address.
The code right after the lock would then have failed in
virHashRemoveEntry(priv->devs->hash, priv->path);
which dereferences the 0x25 pointer here:
nextptr = table->table + virHashComputeKey(table, name);
So we are looking at a structure that has not been fully cleaned up.
Most likely, if this were not a locking issue, it would be a crash instead.
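As a standalone toy (hypothetical names, not libvirt code), this is why a
table pointer of 0x25 turns into a SIGSEGV rather than a hang as soon as
the removal path runs:

#include <stddef.h>

struct entry { struct entry *next; };
struct table { struct entry **buckets; size_t size; };

static struct entry **bucket_for(struct table *t, unsigned long key)
{
    /* with t == (struct table *)0x25, reading t->buckets and t->size
       touches an unmapped page, so the process crashes right here */
    return t->buckets + (key % t->size);
}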
OTOH that might be due to our unlocking with the most recent patch [1],
which allows the struct to go partially away. I dropped the patch and
debugged again to see whether it would be more useful to check there for
the actual lock and path.
I was back at my two backtraces fighting for the lock.
The lock was now in a "better" state:
(gdb) p priv->devs->lock
$9 = {lock = pthread_mutex_t = {Type = Normal, Status = Acquired, possibly with
waiters, Owner ID = 23102, Robust = No, Shared = No, Protocol = None}}
(gdb) p priv->devs->hash
$10 = (virHashTablePtr) 0x7f2928000c00
It is a one-entry list:
(gdb) p priv->devs->hash->table.next
Cannot access memory at address 0x0
(gdb) p (virHashEntry)priv->devs->hash->table
$13 = {next = 0x7f2928000fe0, name = 0xa4b28ee3, payload = 0x3}
Letting the FD stream unlock in between did not help (if anything it
made things worse, leaving a stale, partially torn-down struct that
would crash).
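To spell out the suspected sequence, here is a minimal standalone sketch
with made-up names (not libvirt code), assuming the extra unlock from [1]
lets a second thread tear the shared structure down in between:

#include <pthread.h>
#include <stdlib.h>

struct devs {
    pthread_mutex_t lock;
    void *hash;                   /* stands in for priv->devs->hash */
};

static struct devs *shared_devs;  /* what priv->devs points at */

/* runs once the close path has dropped its own lock */
static void *teardown(void *arg)
{
    struct devs *d = arg;
    pthread_mutex_destroy(&d->lock);
    free(d);                      /* priv->devs is now dangling */
    return NULL;
}

/* simplified close callback: after teardown() both the mutex and the
   hash pointer are read from freed memory - sometimes that hangs on a
   garbage mutex, sometimes it crashes on a garbage hash pointer */
static void close_cb(struct devs *d)
{
    pthread_mutex_lock(&d->lock);
    /* ... would dereference d->hash here ... */
    pthread_mutex_unlock(&d->lock);
}

int main(void)
{
    pthread_t t;

    shared_devs = calloc(1, sizeof(*shared_devs));
    pthread_mutex_init(&shared_devs->lock, NULL);

    /* imagine the close path dropping its lock right here */
    pthread_create(&t, NULL, teardown, shared_devs);
    pthread_join(&t, NULL);

    close_cb(shared_devs);        /* use-after-free, as with the stale
                                     priv->devs seen above */
    return 0;
}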
[1]: https://www.redhat.com/archives/libvir-list/2019-April/msg00207.html