[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-27 Thread Launchpad Bug Tracker
*** This bug is a duplicate of bug 1784665 *** https://bugs.launchpad.net/bugs/1784665 This bug was fixed in the package linux - 5.2.0-13.14 --- linux (5.2.0-13.14) eoan; urgency=medium * eoan/linux: 5.2.0-13.14 -proposed tracker (LP: #1840261) * NULL pointer dereference

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-15 Thread Ubuntu Kernel Bot
*** This bug is a duplicate of bug 1784665 *** https://bugs.launchpad.net/bugs/1784665 This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-15 Thread Ubuntu Kernel Bot
*** This bug is a duplicate of bug 1784665 *** https://bugs.launchpad.net/bugs/1784665 This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-06 Thread Andrea Righi
*** This bug is a duplicate of bug 1784665 *** https://bugs.launchpad.net/bugs/1784665 ** This bug has been marked a duplicate of bug 1784665 bcache: bch_allocator_thread(): hung task timeout -- You received this bug notification because you are a member of Ubuntu Bugs, which is

Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-05 Thread Ryan Harper
On Mon, Aug 5, 2019 at 1:19 PM Ryan Harper wrote: > > > On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi > wrote: > >> Ryan, I've uploaded a new test kernel with the fix mentioned in the >> comment before: >> >> https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/ >> >> I've

Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-05 Thread Ryan Harper
On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi wrote: > Ryan, I've uploaded a new test kernel with the fix mentioned in the > comment before: > > https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/ > > I've performed over 100 installations using curtin-nvme.sh > (install_count =

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-05 Thread Andrea Righi
Ryan, I've uploaded a new test kernel with the fix mentioned in the comment before: https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/ I've performed over 100 installations using curtin-nvme.sh (install_count = 100), no hung task timeout. I'll run other stress tests to make

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-05 Thread Andrea Righi
Some additional info about the deadlock: crash> bt 16588 PID: 16588 TASK: 9ffd7f332b00 CPU: 1 COMMAND: "bcache_allocato" [exception RIP: bch_crc64+57] RIP: c093b2c9 RSP: ab9585767e28 RFLAGS: 0286 RAX: f1f51403756de2bd RBX: RCX:

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-02 Thread Andrea Righi
After some help from Ryan (on IRC) I've been able to run the last reproducer script and trigger the same trace. Now I should be able to collect all the information that I need and hopefully post a new test kernel (fixed for real...) soon.

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-02 Thread Ryan Harper
Trying the first kernel without the change event sauce also fails: [ 532.823594] bcache: run_cache_set() invalidating existing data [ 532.828876] bcache: register_cache() registered cache device nvme0n1p2 [ 532.869716] bcache: register_bdev() registered backing device vda1 [ 532.994355]

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-02 Thread Ryan Harper
I tried the +3 kernel first, and I got 3 installs and then this hang: [ 549.828710] bcache: run_cache_set() invalidating existing data [ 549.836485] bcache: register_cache() registered cache device nvme1n1p2 [ 549.937486] bcache: register_bdev() registered backing device vdg [ 550.018855]

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-02 Thread Andrea Righi
Ryan, unfortunately the last reproducer script is giving me a lot of errors and I'm still trying to figure out how to make it run to the end (or at least to a point where it starts to run some bcache commands). In the meantime (as anticipated on IRC) I've uploaded a test kernel reverting the

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-01 Thread Ryan Harper
Reproducer script ** Attachment added: "curtin-nvme.sh" https://bugs.launchpad.net/curtin/+bug/1796292/+attachment/5280353/+files/curtin-nvme.sh

Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-01 Thread Ryan Harper
On Thu, Aug 1, 2019 at 10:15 AM Andrea Righi wrote: > Thanks Ryan, this is very interesting: > > [ 259.411486] bcache: register_bcache() error /dev/vdg: device already > registered (emitting change event) > [ 259.537070] bcache: register_bcache() error /dev/vdg: device already > registered

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-01 Thread Andrea Righi
Thanks Ryan, this is very interesting: [ 259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 259.797830] bcache: register_bcache()

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-01 Thread Ryan Harper
ubuntu@ubuntu:~$ uname -r 4.15.0-56-generic ubuntu@ubuntu:~$ cat /proc/version Linux version 4.15.0-56-generic (arighi@kathleen) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #62~lp1796292 SMP Thu Aug 1 07:45:21 UTC 2019 This failed on the second install while running bcache-super-show

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-08-01 Thread Andrea Righi
I've uploaded a new test kernel based on the latest bionic kernel from master-next: https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292/ In addition to that I've backported all the recent upstream bcache fixes and applied my proposed fix for the potential deadlock in

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-29 Thread Chris Gregan
Escalated to Field Critical as it now happens often enough to block our ability to test proposed product releases. We are unable to test openstack-next at the moment because our test runs fail behind this bug.

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-24 Thread Brad Figg
** Tags added: cscc

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-10 Thread Ryan Harper
The newer kernel went about 16 runs and then popped this: [ 2137.810559] md: md0: resync done. [ 2296.795633] INFO: task python3:11639 blocked for more than 120 seconds. [ 2296.800320] Tainted: P O 4.15.0-55-generic #60+lp1796292+1 [ 2296.805097] "echo 0 >
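
When looping a reproducer many times, it can help to scan the captured kernel log for hung-task reports automatically. The following is a minimal sketch, assuming the log lines follow the "INFO: task NAME:PID blocked for more than N seconds." shape quoted in this thread; the names HungTask and find_hung_tasks are hypothetical helpers, not part of curtin or the kernel tooling.

```python
import re
from typing import Iterable, List, NamedTuple

# Pattern modelled on the hung-task line quoted in this thread, e.g.
# "INFO: task python3:11639 blocked for more than 120 seconds."
# The greedy ".+" lets task names that contain colons (kworker/0:2) match.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


class HungTask(NamedTuple):
    """One hung-task report (hypothetical helper type)."""
    comm: str
    pid: int
    seconds: int


def find_hung_tasks(lines: Iterable[str]) -> List[HungTask]:
    """Return every hung-task report found in the given log lines."""
    reports = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            reports.append(
                HungTask(
                    comm=match.group("comm"),
                    pid=int(match.group("pid")),
                    seconds=int(match.group("secs")),
                )
            )
    return reports
```

A test loop could feed the output of dmesg, split into lines, to find_hung_tasks after each install and stop at the first non-empty result.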

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-10 Thread Ryan Harper
Andrea, thanks for the updated kernels. On the first one, I got 23 installs before I ran into an issue; I'll test the newer kernel next. https://paste.ubuntu.com/p/2B4Kk3wbvQ/ [ 5436.870482] BUG: unable to handle kernel NULL pointer dereference at 09b8 [ 5436.873374] IP:

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-10 Thread Andrea Righi
... and, just in case, I've also uploaded a test kernel based on the latest bionic master-next plus a bunch of extra bcache fixes: https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292+1/ If the previous kernel is still buggy it'd be nice to try this one as well.

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-10 Thread Andrea Righi
Hi Ryan, I've uploaded a new test kernel: https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/ This one is based on 4.15.0-54.58 and it addresses specifically the bch_bucket_alloc() problem (with this patch applied: https://lore.kernel.org/lkml/20190710093117.GA2792@xps-13/T/#u).

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-05 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: linux (Ubuntu Bionic) Status: New => Confirmed

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-05 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: linux (Ubuntu Cosmic) Status: New => Confirmed

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-05 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: linux (Ubuntu Disco) Status: New => Confirmed

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-05 Thread Andrea Righi
Good news! I've been able to reproduce the hung task in bch_bucket_alloc() issue locally, using the test case from bug 1784665. I think we're hitting the same problem now. I'll do more tests and will keep you updated.

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-04 Thread Andrea Righi
Thanks tons for the tests Ryan! Well, at least the hung task timeout trace is different, so we're making some progress. With the new kernel it seems that we're stuck in bch_bucket_alloc(). I've identified other upstream fixes that could help to prevent this problem. If you're willing to do few

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-04 Thread Andrea Righi
** Changed in: curtin Assignee: Terry Rudd (terrykrudd) => Andrea Righi (arighi)

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-03 Thread Ryan Harper
Without the patch, I can reproduce the hang fairly frequently, in one or two loops, which fails in this way: [ 1069.711956] bcache: cancel_writeback_rate_update_dwork() give up waiting for dc->writeback_write_update to quit [ 1088.583986] INFO: task kworker/0:2:436 blocked for more than 120

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-03 Thread Ryan Harper
I've set up our integration test that runs the CDO-QA bcache/ceph setup. On the updated kernel I got through 10 loops on the deployment before it stacktraced: http://paste.ubuntu.com/p/zVrtvKBfCY/ [ 3939.846908] bcache: bch_cached_dev_attach() Caching vdd as bcache5 on set

Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-03 Thread Jason Hobbs
This is difficult for us to test in our lab because we are using MAAS, and we hit this during MAAS deployments of nodes, so we would need MAAS images built with these kernels. Additionally, this doesn't reproduce every time, it is maybe 1/4 test runs. It may be best to find a way to reproduce this

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-03 Thread Andrea Righi
From a kernel perspective this big slowness on shutting down a bcache volume might be caused by a locking / race condition issue. If I read correctly this problem has been reproduced in bionic (and in xenial we even got a kernel oops - it looks like caused by a NULL pointer dereference). I would

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-01 Thread Terry Rudd
Canonical kernel team has this item queued in the hotlist to work on. I am assigning to myself to accelerate work ** Changed in: curtin Assignee: (unassigned) => Terry Rudd (terrykrudd)

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-07-01 Thread Brad Figg
** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Eoan) Importance: Undecided Status: Confirmed ** Also affects: linux (Ubuntu Disco) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Cosmic)

Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-06-03 Thread Ryan Harper
On Mon, Jun 3, 2019 at 2:05 PM Andrey Grebennikov < agrebennikov1...@gmail.com> wrote: > Is there an estimate on getting this package in bionic-updates please? > We are starting an SRU of curtin this week. SRUs take at least 7 days from when they hit -proposed, possibly longer depending on test

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-06-03 Thread Andrey Grebennikov
Is there an estimate on getting this package in bionic-updates please?

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-22 Thread Dan Watkins
This bug is believed to be fixed in curtin in version 19.1. If this is still a problem for you, please make a comment and set the state back to New. Thank you. ** Changed in: curtin Status: New => Fix Released

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-14 Thread Pedro Guimarães
This script *should* trigger the issue on Bionic GA: https://pastebin.ubuntu.com/p/WdKGbMWnM6/ Try it with both GA and HWE bionic, the commit on HWE should trigger up.

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-14 Thread Pedro Guimarães
I was looking into kernel commits and I came across this: https://github.com/torvalds/linux/commit/fadd94e05c02afec7b70b0b14915624f1782f578 So, as far as I understood, it actually deals with the issue of manual device detach during a writeback clean-up and causing deadlock. The timeline makes

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-14 Thread Jason Hobbs
** Tags added: cdo-qa foundations-engine

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-13 Thread Andrey Grebennikov
@jhobbs Here is the script that cleans up bcache devices on recommission: https://pastebin.ubuntu.com/p/6WCGvM4Q32/

Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-09 Thread Ryan Harper
On Wed, May 8, 2019 at 11:55 PM Trent Lloyd wrote: > I have been running into this (curtin 18.1-17-gae48e86f- > 0ubuntu1~16.04.1) > > I think this commit basically agrees with my thoughts but I just wanted > to share them explicitly in case they are interesting > > (1) If you *unregister* the

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-09 Thread Ryan Harper
Xenial GA kernel bcache unregister oops: http://paste.ubuntu.com/p/BzfHFjzZ8y/

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-08 Thread Trent Lloyd
I have been running into this (curtin 18.1-17-gae48e86f- 0ubuntu1~16.04.1) I think this commit basically agrees with my thoughts but I just wanted to share them explicitly in case they are interesting (1) If you *unregister* the cache device from the backing device, it first has to purge all
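
As an illustration of the timeout concern in this bug's title, here is a minimal sketch of waiting for a device node or sysfs entry to disappear with a generous deadline and a growing poll interval, rather than a short fixed wait. It assumes the removal can be observed as a filesystem path going away; the function name wait_for_removal and its parameters are hypothetical, and this is not curtin's actual implementation.

```python
import os
import time


def wait_for_removal(path, timeout=60.0, interval=0.5, max_interval=5.0):
    """Poll until `path` no longer exists or `timeout` seconds elapse.

    Returns True if the path disappeared in time, False otherwise.
    The poll interval doubles after each check, capped at `max_interval`.
    """
    deadline = time.monotonic() + timeout
    delay = interval
    while True:
        if not os.path.exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_interval)
```

A caller would request the stop or unregister first and then call wait_for_removal on the path it expects to vanish, treating a False result as a real failure only after the longer deadline has passed.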

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-06 Thread Jason Hobbs
This occurs on a target machine during maas install. Apport is not collected in this case. ** Changed in: linux (Ubuntu) Status: Incomplete => Confirmed

[Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

2019-05-01 Thread Joshua Powers
Adding affects linux package ** Also affects: linux (Ubuntu) Importance: Undecided Status: New