Problem Analysis
================
The dmesg log 'crash/201809061748/dmesg.201809061748' shows the CPU soft lockup
occurs 25 seconds after the 'sdb' virtio-scsi drive is removed.
This seems to indicate the events are related (there's usually an extra 2s-3s
between an event and the report of the 22s or 23s stuck, for some reason).
[ 3002.697474] sd 0:0:2:0: [sdb] Synchronizing SCSI cache
[ 3002.697545] sd 0:0:2:0: [sdb] Synchronize Cache(10) failed: Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3028.294602] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
[id:2887]
CPU 1 is waiting on another CPU (to finish something)
-------------------------------
The locked up 'id' process is in page fault handling stack (which is
OK/normal), calling a function on many (SMP) cpus (see
smp_call_function_many() in the top of the stack), and is specifically
in the call to one CPU (see smp_call_function_single+0xae/0x110() in the
RIP register).
[ 3028.301755] CPU: 1 PID: 2887 Comm: id Not tainted 4.4.0-133-generic
#159~14.04.1-Ubuntu
...
[ 3028.301760] RIP: 0010:[<ffffffff81102b1e>] [<ffffffff81102b1e>]
smp_call_function_single+0xae/0x110
...
[ 3028.301797] Call Trace:
[ 3028.301803] [<ffffffff81071f60>] ? do_kernel_range_flush+0x40/0x40
[ 3028.301805] [<ffffffff81102e9e>] smp_call_function_many+0x22e/0x270
[ 3028.301808] [<ffffffff810723f8>] native_flush_tlb_others+0x48/0x120
[ 3028.301810] [<ffffffff8107256d>] flush_tlb_mm_range+0x9d/0x180
[ 3028.301815] [<ffffffff811cb1e3>] ptep_clear_flush+0x53/0x60
[ 3028.301819] [<ffffffff811b75ed>] wp_page_copy.isra.58+0x29d/0x530
[ 3028.301822] [<ffffffff811b955d>] do_wp_page+0x8d/0x590
[ 3028.301824] [<ffffffff811bb826>] handle_mm_fault+0xd86/0x1ac0
[ 3028.301829] [<ffffffff81192c53>] ? free_pages+0x13/0x20
[ 3028.301835] [<ffffffff810aa754>] ? finish_task_switch+0x244/0x2a0
[ 3028.301840] [<ffffffff8106b0fb>] __do_page_fault+0x19b/0x430
[ 3028.301843] [<ffffffff8106b3b2>] do_page_fault+0x22/0x30
[ 3028.301847] [<ffffffff81822928>] page_fault+0x28/0x30
The smp_call_function_many() address line in the stack trace
(smp_call_function_many+0x22e / ffffffff81102e9e) reflects it's called
smp_call_function_single():
ffffffff81102c70 <smp_call_function_many>:
...
ffffffff81102e99: e8 d2 fb ff ff callq ffffffff81102a70
<smp_call_function_single>
ffffffff81102e9e: 48 83 c4 10 add $0x10,%rsp
... which per the address in the RIP register (smp_call_function_single+0xae
ffffffff81102b1e)
is in the (inlined) call to csd_lock_wait() after generic_exec_single().
csd_lock_wait() spins in the value of csd->flags with cpu_relax() / 'pause'
instruction,
waiting for it to be unlocked (i.e., not have the CSD_FLAG_LOCK flag / 0x1
value in the
flags field / offset 0x18)
ffffffff81102a70 <smp_call_function_single>:
...
ffffffff81102b0f: e8 0c fe ff ff callq ffffffff81102920
<generic_exec_single>
ffffffff81102b14: 8b 55 e8 mov -0x18(%rbp),%edx
ffffffff81102b17: 83 e2 01 and $0x1,%edx
ffffffff81102b1a: 74 de je ffffffff81102afa
<smp_call_function_single+0x8a>
ffffffff81102b1c: f3 90 pause
ffffffff81102b1e: 8b 55 e8 mov -0x18(%rbp),%edx
ffffffff81102b21: 83 e2 01 and $0x1,%edx
ffffffff81102b24: 75 f6 jne ffffffff81102b1c
<smp_call_function_single+0xac>
"""
smp_call_function_single()
...
err = generic_exec_single(cpu, csd, func, info);
if (wait)
csd_lock_wait(csd);
...
"""
"""
static void csd_lock_wait(struct call_single_data *csd)
{
while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
cpu_relax();
}
"""
kernel/smp.c: CSD_FLAG_LOCK = 0x01,
include/linux/smp.h:
struct call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
unsigned int flags;
};
$ pahole -C call_single_data vmlinux-4.4.0-133-generic
...
struct call_single_data {
struct llist_node llist; /* 0 8
*/
smp_call_func_t func; /* 8 8
*/
void * info; /* 16 8
*/
unsigned int flags; /* 24 4
*/
### ^ 24 == 0x18
...
};
So this shows that CPU 1 is spinning/waiting on another CPU to finish
something.
What the (an)other CPU is doing
-------------------------------
This guest has only 2 CPUs (confirmed with the crashdump file
'dump.201809061748'),
so CPU 0 is the only other CPU that could be doing something.
KERNEL: /home/mfo/<...>/vmlinux-4.4.0-133-generic
DUMPFILE: dump.201809061748 [PARTIAL DUMP]
CPUS: 2
The NMI to other CPUs in the dmesg file reveals it's (still -- 20ish seconds
later)
in the SCSI drive removal stack, specifically in the virtio-scsi target
removal/destroy handler
(see virtscsi_target_destroy() in RIP register).
[ 3028.301874] Sending NMI to other CPUs:
[ 3028.302876] NMI backtrace for cpu 0
[ 3028.302882] CPU: 0 PID: 2246 Comm: kworker/0:0 Not tainted
4.4.0-133-generic #159~14.04.1-Ubuntu
...
[ 3028.302887] RIP: 0010:[<ffffffffc0002039>] [<ffffffffc0002039>]
virtscsi_target_destroy+0x19/0x30 [virtio_scsi]
...
[ 3028.302907] Call Trace:
[ 3028.302908] [<ffffffff815ad4f2>] scsi_target_destroy+0x82/0xd0
[ 3028.302909] [<ffffffff815ad575>]
scsi_target_reap_ref_release+0x35/0x40
[ 3028.302910] [<ffffffff815aefac>] scsi_target_reap+0x2c/0x30
[ 3028.302911] [<ffffffff815b14f9>] __scsi_remove_device+0x89/0xe0
[ 3028.302912] [<ffffffff815b1576>] scsi_remove_device+0x26/0x40
[ 3028.302914] [<ffffffffc0002bc7>] virtscsi_handle_event+0x127/0x1e0
[virtio_scsi]
[ 3028.302915] [<ffffffff8109b0b6>] process_one_work+0x156/0x400
[ 3028.302916] [<ffffffff8109ba9a>] worker_thread+0x11a/0x480
[ 3028.302918] [<ffffffff8109b980>] ? rescuer_thread+0x310/0x310
[ 3028.302919] [<ffffffff810a0f78>] kthread+0xd8/0xf0
[ 3028.302920] [<ffffffff8181c092>] ? __schedule+0x2a2/0x820
[ 3028.302921] [<ffffffff810a0ea0>] ? kthread_park+0x60/0x60
[ 3028.302922] [<ffffffff818203b5>] ret_from_fork+0x55/0x80
[ 3028.302924] [<ffffffff810a0ea0>] ? kthread_park+0x60/0x60
Looking at the region pointed to by the RIP register,
virtscsi_target_destroy+0x19,
(it's in '0x39' counting from '0x20', just looking at the kernel module
objdump)..
It's similarly looping/spinning with pause (cpu_relax()), checking for the
'tgt->reqs' counter until it becomes zero (i.e., completion of all requests in
the virtio SCSI target).
0000000000000020 <virtscsi_target_destroy>:
20: e8 00 00 00 00 callq 25
<virtscsi_target_destroy+0x5>
25: 55 push %rbp
26: 48 8b bf 28 03 00 00 mov 0x328(%rdi),%rdi
2d: 48 89 e5 mov %rsp,%rbp
30: 8b 47 04 mov 0x4(%rdi),%eax
33: 85 c0 test %eax,%eax
35: 74 09 je 40
<virtscsi_target_destroy+0x20>
37: f3 90 pause
39: 8b 47 04 mov 0x4(%rdi),%eax
3c: 85 c0 test %eax,%eax
3e: 75 f7 jne 37
<virtscsi_target_destroy+0x17>
40: e8 00 00 00 00 callq 45
<virtscsi_target_destroy+0x25>
45: 5d pop %rbp
46: c3 retq
47: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
4e: 00 00
"""
static void virtscsi_target_destroy(struct scsi_target *starget)
{
struct virtio_scsi_target_state *tgt = starget->hostdata;
/* we can race with concurrent virtscsi_complete_cmd */
while (atomic_read(&tgt->reqs))
cpu_relax();
kfree(tgt);
}
"""
Let's check what that 'tgt->regs' counter value is with the crashdump.
$ crash ~/ddebs/linux-image-4.4.0-133-generic-
dbgsym_4.4.0-133.159~14.04.1_amd64/usr/lib/debug/boot/vmlinux-4.4.0-133-generic
crash/201809061748/dump.201809061748
crash> set -c 0
PID: 2246
COMMAND: "kworker/0:0"
TASK: ffff880035960000 [THREAD_INFO: ffff880212a90000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 2246 TASK: ffff880035960000 CPU: 0 COMMAND: "kworker/0:0"
#0 [ffff88021fc08e38] crash_nmi_callback at ffffffff8104f727
#1 [ffff88021fc08e48] nmi_handle at ffffffff81031407
#2 [ffff88021fc08ea0] default_do_nmi at ffffffff810319e0
#3 [ffff88021fc08ec0] do_nmi at ffffffff81031b19
#4 [ffff88021fc08ee8] end_repeat_nmi at ffffffff81822e17
[exception RIP: virtscsi_target_destroy+25]
RIP: ffffffffc0002039 RSP: ffff880212a93d38 RFLAGS: 00000006
RAX: 0000000000001eb2 RBX: ffff880216be7000 RCX: ffff8802179e0a20
RDX: ffffffffc0002020 RSI: ffff880035bddc28 RDI: ffff880211ea3420
RBP: ffff880212a93d38 R8: 0000000000000000 R9: ffffffff813eb3e5
R10: ffff88021fc1a6c0 R11: 0000000000000000 R12: ffff880035bddc00
R13: ffff880035bddc28 R14: 0000000000000282 R15: 0000000000000002
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#5 [ffff880212a93d38] virtscsi_target_destroy at ffffffffc0002039
[virtio_scsi]
#6 [ffff880212a93d40] scsi_target_destroy at ffffffff815ad4f2
#7 [ffff880212a93d70] scsi_target_reap_ref_release at ffffffff815ad575
#8 [ffff880212a93d90] scsi_target_reap at ffffffff815aefac
#9 [ffff880212a93da0] __scsi_remove_device at ffffffff815b14f9
#10 [ffff880212a93dc0] scsi_remove_device at ffffffff815b1576
#11 [ffff880212a93de0] virtscsi_handle_event at ffffffffc0002bc7
[virtio_scsi]
#12 [ffff880212a93e20] process_one_work at ffffffff8109b0b6
#13 [ffff880212a93e68] worker_thread at ffffffff8109ba9a
#14 [ffff880212a93ec8] kthread at ffffffff810a0f78
#15 [ffff880212a93f50] ret_from_fork at ffffffff818203b5
the stack trace in crash gives us the register values at the time
virtscsi_target_destroy()
was interrupted while running, and RIP (instruction pointer / next instruction
to execute)
shows it's about to reload the EAX register with the value it loops for while
checking.
EAX is tgt->reqs, which loaded as: scsi_target is the first argument, thus in
RDI register,
its 'hostdata' field is in offset 0x328 (which is cast to
virtio_scsi_target_state),
and its 'reqs' field is in offset 0x4:
$ pahole -C scsi_target vmlinux-4.4.0-133-generic | grep hostdata
...
void * hostdata; /* 808 8
*/
### ^ 808 = 0x328
...
$ pahole -C virtio_scsi_target_state
lib/modules/4.4.0-133-generic/kernel/drivers/scsi/virtio_scsi.ko | grep reqs
...
atomic_t reqs; /* 4 4
*/
...
So, let's get the value from memory, based in the RDI register (which at that
point
has the tgt pointer in it):
crash> mod -s virtio_scsi /home/mfo/ddebs/linux-image-4.4.0-133
-generic-
dbgsym_4.4.0-133.159~14.04.1_amd64/usr/lib/debug/lib/modules/4.4.0-133-generic/kernel/drivers/scsi/virtio_scsi.ko
crash> struct virtio_scsi_target_state.reqs ffff880211ea3420
reqs = {
counter = 0x1eb2
}
crash> eval 0x1eb2
hexadecimal: 1eb2
decimal: 7858
And it's indeed a non-zero value, which seems quite high actually,
around ~8k requests.
This is why it's looping.
The value does not seem to decrease, though.
Bug in virtio-scsi driver
-------------------------
It's unlikely that all these were pending and did not finish after all this
time
(~25 seconds since the SCSI removal started, at least +12 seconds since 'sbd'
was
unmounted -- sbd1 and sdc1 are mounted, data is copied, both are unmounted,
sdc1 is mounted).
[ 2215.675484] EXT4-fs (sdb1): mounted filesystem with ordered data
mode. Opts: (null)
[ 2222.563236] EXT4-fs (sdc1): mounted filesystem with ordered data
mode. Opts: (null)
[ 2990.023442] EXT4-fs (sdc1): mounted filesystem with ordered data
mode. Opts: (null)
[ 3002.697474] sd 0:0:2:0: [sdb] Synchronizing SCSI cache
[ 3002.697545] sd 0:0:2:0: [sdb] Synchronize Cache(10) failed: Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3028.294602] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
[id:2887]
So this look a problem in the driver.
The wait/spin loop is introduced by a downstream/SAUCE commit to avoid a race
with virtscsi_complete_cmd().
This is not present upstream, which dropped the tgt-reqs counter altogether.
However, per the commit description, we still need to synchronize with that
function in Xenial.
So, instead of finding an alternative way to sync for now, let's check why the
reqs counter has problems.
Studying the driver code for references to tgt->reqs, we see it's only
changes in 3 sites>
1) decremented in one site, virtscsi_complete_cmd(),
which is a callback fired after a SCSI command is completed
by the virtio-scsi adapter),
2) incremented in the .queuecommand() path, before calling
virtscsi_queuecommand()
in the only two callers of the .queuecommand() implementation
(virtscsi_queuecommand()).
2.1) - virtscsi_pick_vq_mq()
2.2) - virtscsi_pick_vq()
Complete command path:
virtscsi_complete_cmd()
...
sc->scsi_done(sc);
atomic_dec(&tgt->reqs);
}
In the 1-queue path:
virtscsi_queuecommand_single():
...
atomic_inc(&tgt->reqs);
return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
}
In the >1-queues (not blk-mq) path:
virtio_scsi_vq *virtscsi_pick_vq_mq()
...
atomic_inc(&tgt->reqs);
return &vscsi->req_vqs[hwq];
struct virtio_scsi_vq()
...
if (atomic_inc_return(&tgt->reqs) > 1) {
...
... with their only caller being:
virtscsi_queuecommand_multi(struct Scsi_Host *sh,
struct scsi_cmnd *sc)
{
...
if (shost_use_blk_mq(sh))
req_vq = virtscsi_pick_vq_mq(vscsi, tgt, sc);
else
req_vq = virtscsi_pick_vq(vscsi, tgt);
return virtscsi_queuecommand(vscsi, req_vq, sc);
These are the SCSI host templates registering those 2 functions:
static struct scsi_host_template virtscsi_host_template_single = {
...
.queuecommand = virtscsi_queuecommand_single,
...
static struct scsi_host_template virtscsi_host_template_multi = {
...
.queuecommand = virtscsi_queuecommand_multi,
...
Now, observing the implementation of virtscsi_queuecommand(), the problem
happens when the command fails to be kicked to the virtio-scsi adapter:
The function is called with the tgt-reqs counter already incremented.
But if virtscsi_kick_cmd() fails with anything other than -EIO,
the SCSI command will be requeued, and will go through the .queuecommand()
path again, which increments the tgt->reqs counter _again_, for the same
SCSI command (note this may repeat multiple times for the same command,
depending on the virtio-scsi adapter/queue condition/processing/etc).
(The EIO case is not a problem because it completes the SCSI command,
which decrements the tgt-reqs counter).
virtscsi_queuecommand()
...
ret = virtscsi_kick_cmd(req_vq, cmd, req_size,
sizeof(cmd->resp.cmd));
if (ret == -EIO) {
cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
spin_lock_irqsave(&req_vq->vq_lock, flags);
virtscsi_complete_cmd(vscsi, cmd);
spin_unlock_irqrestore(&req_vq->vq_lock, flags);
} else if (ret != 0) {
return SCSI_MLQUEUE_HOST_BUSY;
}
return 0;
}
So, the solution to keep the tgt-reqs counter correct in this case
is to decrement the tgt-reqs counter in that error condition, when
we know the command will be requeued (and thus tgt->reqs will be
incremented _again_).
Synthetic test-case
-------------------
I could successfully reproduce this problem by attaching GDB to the
QEMU GDB server to debug/play with the Linux kernel, and simulated
that the check for the available elements in the virtio ring/queue
were zero, so that error case happened. I'll detail the steps too.
This was also used to verify the proposed fix.
The fix also has been tested with the test-case reported by the
customer, which is mounting the virtio-scsi disk, copying out of
it, unmounting it, and dettaching it from the guest instance.
This test-case is a loop doing that procedure; it usually fails
in the first or second iteration, and with the fix it has been
running for 35 iterations without problems.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1798110
Title:
xenial: virtio-scsi: CPU soft lockup due to loop in
virtscsi_target_destroy()
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798110/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs