[Bug 1662673] Re: systemd-udevd hung in blk_mq_freeze_queue_wait testing unpartitioned NVMe drive

2018-01-15 Thread Steven Haber
I am hitting this bug on Ubuntu 16.04 LTS running kernel 4.13.0-16.
systemd-udevd is stuck in the following stack:

[] blk_mq_freeze_queue_wait+0x4b/0xb0
[] blk_mq_freeze_queue+0x1a/0x20
[] __nvme_revalidate_disk+0x7a/0x3f0 [nvme_core]
[] nvme_revalidate_disk+0x53/0x90 [nvme_core]
[] rescan_partitions+0x8d/0x330
[] __blkdev_reread_part+0x65/0x70
[] blkdev_reread_part+0x23/0x40
[] blkdev_ioctl+0x387/0x910
[] block_ioctl+0x3d/0x50
[] do_vfs_ioctl+0xa1/0x5f0
[] SyS_ioctl+0x79/0x90
[] entry_SYSCALL_64_fastpath+0x1e/0xa9
[] 0x

And the process info:

4 D root       797     1  0  80   0 - 11661 blk_mq 03:04 ?        00:00:02 /lib/systemd/systemd-udevd
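
For anyone trying to spot the same symptom, here is a minimal sketch that lists tasks in uninterruptible sleep (state D, as in the process info above) together with the kernel function they are waiting in. It assumes a Linux /proc layout with per-task `stat` and `wchan` files; the helper names `parse_stat` and `blocked_tasks` are made up for this example and are not part of any tool mentioned in this report.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx])
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm, wchan) for tasks whose state is D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        # No procfs at this path (for example, not running on Linux).
        return []
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except (OSError, ValueError):
            # The task exited mid-scan or the file was unreadable; skip it.
            continue
        found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(pid, comm, wchan)
```

A task that keeps showing up here with a wait channel starting with `blk_mq` would match the hang described in this report. A single sighting is not conclusive, since tasks pass through state D briefly during normal I/O.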

We have a bunch of read-only parted and blockdev jobs backing up behind the
kernel hang (and possibly causing it in the first place):
root  17317      1  0 03:17 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print
root  36839  36832  0 05:39 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print
root  37181  37143  0 05:50 ?  00:00:00 /sbin/blockdev --getsize64 /dev/nvme0n1
root  37340  37333  0 06:00 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print
root  38585  38549  0 08:29 ?  00:00:00 /sbin/blockdev --getsize64 /dev/nvme0n1
root  38742  38735  0 08:39 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print
root  40022  39986  0 11:14 ?  00:00:00 /sbin/blockdev --getsize64 /dev/nvme0n1
root  40184  40177  0 11:24 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print
root  41456  41419  0 13:59 ?  00:00:00 /sbin/blockdev --getsize64 /dev/nvme0n1
root  41615  41608  0 14:09 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print
root  42905  42869  0 16:44 ?  00:00:00 /sbin/blockdev --getsize64 /dev/nvme0n1
root  43062  43054  0 16:54 ?  00:00:00 /sbin/parted.rw -m -s -- /dev/nvme0n1 unit B print

These are NVMe drives with a GPT and two partitions. Let me know if you
need more info.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1662673

Title:
  systemd-udevd hung in blk_mq_freeze_queue_wait testing unpartitioned
  NVMe drive

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1662673/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1791790] Re: Kernel hang on drive pull caused by incomplete backport for bug 1597908

2018-09-10 Thread Steven Haber
Attaching logs gathered by the apport utility. This is for one of our HP
boxes running kernel 4.4.0-131.

** Attachment added: "apport.linux.4RjMFB.apport"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791790/+attachment/5187217/+files/apport.linux.4RjMFB.apport

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1791790

Title:
  Kernel hang on drive pull caused by incomplete backport for bug
  1597908

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791790/+subscriptions


[Bug 1791790] [NEW] Kernel hang on drive pull caused by incomplete backport for bug 1597908

2018-09-10 Thread Steven Haber
Public bug reported:

A bug was introduced when backporting the fix for
http://bugs.launchpad.net/bugs/1597908. This bug exists in all Ubuntu
16.04 LTS 4.4 kernels >= 4.4.0-36, and many other non-LTS kernels.
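
As a rough aid for checking whether a given machine falls in the range stated above, here is a minimal sketch that tests a kernel release string (as printed by `uname -r`) against the 4.4.0-36 lower bound. The function name and the optional `first_fixed_abi` parameter are hypothetical. This report does not state which build first carries the fix, so the caller has to supply that number, and the non-LTS kernels mentioned above are not covered.

```python
import re


def in_affected_4_4_range(release, first_fixed_abi=None):
    """Return True if `release` is an Ubuntu 4.4.0-N kernel with N >= 36.

    `first_fixed_abi` is an assumption supplied by the caller: if given,
    builds with N at or above it are treated as fixed.
    """
    match = re.match(r"^4\.4\.0-(\d+)", release)
    if match is None:
        # Not a 4.4.0-N release string; outside the range this check covers.
        return False
    abi = int(match.group(1))
    if abi < 36:
        return False
    if first_fixed_abi is not None and abi >= first_fixed_abi:
        return False
    return True


if __name__ == "__main__":
    import platform
    print(in_affected_4_4_range(platform.release()))
```

This only looks at the version string. A kernel rebuilt locally with the fix applied would still be reported as affected.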

The backported patch changes the context in which timeout work is scheduled for
block devices in the kernel. Previously, timeout work was executed
directly from the timer callback that fired when a deadline was met.
After the patch, timeout work is scheduled using a background work
queue. This means that by the time the work executes, the device queue
which originally scheduled the work could be torn down. In order to
prevent this, the patch takes a reference on the device queue when
executing the timeout work.

The problem is that the last reference to this queue can be removed
before the timeout work can be executed. During teardown, the block
layer executes a freeze followed by a drain. The freeze drops the last
reference on the queue. The drain tries to clean up any outstanding
work, including timeout work. After a freeze, the timeout work in the
background queue is unable to obtain a reference, and exits early
without completing its work. The request it was meant to time out is
now permanently stuck in the queue and will never be completed, so the
drain in the device teardown path spins indefinitely.
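
The ordering problem described above can be illustrated with a small toy model. This is only a sketch of the sequence of events, not kernel code: `ToyQueue`, its methods, and the bounded `max_spins` loop are all invented for illustration, and the real drain has no such bound, which is why it hangs.

```python
class ToyQueue:
    """Toy stand-in for a device queue with one timed-out request."""

    def __init__(self):
        self.frozen = False
        self.outstanding = 1              # one request past its deadline
        self.timeout_work_pending = True  # work item sits in the background queue

    def try_get_ref(self):
        # After a freeze, no new reference on the queue can be taken.
        return not self.frozen

    def freeze(self):
        self.frozen = True

    def run_timeout_work(self):
        if not self.timeout_work_pending:
            return
        self.timeout_work_pending = False
        if not self.try_get_ref():
            # Early exit: the request is left outstanding.
            return
        self.outstanding = 0              # the timeout handler completes the request

    def drain(self, max_spins=1000):
        """Return True if all work completes within max_spins iterations."""
        for _ in range(max_spins):
            if self.outstanding == 0:
                return True
            self.run_timeout_work()       # give the background worker a turn
        return False


def simulate(work_runs_before_freeze):
    queue = ToyQueue()
    if work_runs_before_freeze:
        queue.run_timeout_work()
    queue.freeze()
    return queue.drain()


if __name__ == "__main__":
    print("timeout work first:", simulate(True))
    print("freeze first:", simulate(False))
```

When the timeout work runs before the freeze, the drain finishes. When the freeze wins the race, the work item exits early and the drain never sees the request complete.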

The bug manifests as a hang that looks like this:
[] schedule+0x35/0x80
[] hpsa_scan_start+0x109/0x140 [hpsa]
[] ? wake_atomic_t_function+0x60/0x60
[] hpsa_rescan_ctlr_worker+0x1d2/0x652 [hpsa]
[] process_one_work+0x165/0x480
[] worker_thread+0x4b/0x4c0
[] ? process_one_work+0x480/0x480
[] kthread+0xd8/0xf0
[] ? kthread_create_on_node+0x1e0/0x1e0
[] ret_from_fork+0x3f/0x70
[] ? kthread_create_on_node+0x1e0/0x1e0

The fix exists upstream. It applies, builds, and runs cleanly on Ubuntu's
most recent 4.4 kernel.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=4e9b6f20828ac880dbc1fa2fdbafae779473d1af

We hit this bug nearly 100% of the time on some of our HP hardware. The
HPSA driver has a tendency to aggressively remove missing devices, so it
widens the race. As a result, we've been building our own kernel with
this patch applied. It would be really nice if we could get it into
mainline Ubuntu.

Let me know what additional information is needed. Thanks!

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New


[Bug 1791790] Re: Kernel hang on drive pull caused by regression introduced by commit 287922eb0b18

2018-10-04 Thread Steven Haber
I tested the most recent proposed kernel (4.4.0-138) using the same
power faulting methodology as before. Everything looks good. I updated
the tag.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial


[Bug 1791790] Re: Kernel hang on drive pull caused by incomplete backport for bug 1597908

2018-09-12 Thread Steven Haber
Hey Joseph! I just ran one of our machines through our drive power
faulting test. It survived 5 hotplug events without crashing. Usually
the crash reproduces 100% of the time. Seems to work. Thanks much!


[Bug 1791790] Re: Kernel hang on drive pull caused by incomplete backport for bug 1597908

2018-09-12 Thread Steven Haber
To clarify -- I ran the testing with all of your kernel packages
installed and live, except for cloud-tools, which we don't use on HP
hardware (haha).
