[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

This bug was fixed in the package linux - 5.2.0-13.14

---------------
linux (5.2.0-13.14) eoan; urgency=medium

  * eoan/linux: 5.2.0-13.14 -proposed tracker (LP: #1840261)

  * NULL pointer dereference when Inserting the VIMC module (LP: #1840028)
    - media: vimc: fix component match compare

  * Miscellaneous upstream changes
    - selftests/bpf: remove bpf_util.h from BPF C progs

linux (5.2.0-12.13) eoan; urgency=medium

  * eoan/linux: 5.2.0-12.13 -proposed tracker (LP: #1840184)

  * Eoan update: v5.2.8 upstream stable release (LP: #1840178)
    - scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
    - libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
    - libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
    - ALSA: usb-audio: Sanity checks for each pipe and EP types
    - ALSA: usb-audio: Fix gpf in snd_usb_pipe_sanity_check
    - HID: wacom: fix bit shift for Cintiq Companion 2
    - HID: Add quirk for HP X1200 PIXART OEM mouse
    - atm: iphase: Fix Spectre v1 vulnerability
    - bnx2x: Disable multi-cos feature.
    - drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
    - ife: error out when nla attributes are empty
    - ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
    - ip6_tunnel: fix possible use-after-free on xmit
    - ipip: validate header length in ipip_tunnel_xmit
    - mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
    - mvpp2: fix panic on module removal
    - mvpp2: refactor MTU change code
    - net: bridge: delete local fdb on device init failure
    - net: bridge: mcast: don't delete permanent entries when fast leave
      is enabled
    - net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
    - net: fix ifindex collision during namespace removal
    - net/mlx5e: always initialize frag->last_in_page
    - net/mlx5: Use reversed order when unregister devices
    - net: phy: fixed_phy: print gpio error only if gpio node is present
    - net: phylink: don't start and stop SGMII PHYs in SFP modules twice
    - net: phylink: Fix flow control for fixed-link
    - net: phy: mscc: initialize stats array
    - net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
    - net: sched: Fix a possible null-pointer dereference in dequeue_func()
    - net sched: update vlan action for batched events operations
    - net: sched: use temporary variable for actions indexes
    - net/smc: do not schedule tx_work in SMC_CLOSED state
    - net: stmmac: Use netif_tx_napi_add() for TX polling function
    - NFC: nfcmrvl: fix gpio-handling regression
    - ocelot: Cancel delayed work before wq destruction
    - tipc: compat: allow tipc commands without arguments
    - tipc: fix unitilized skb list crash
    - tun: mark small packets as owned by the tap sock
    - net/mlx5: Fix modify_cq_in alignment
    - net/mlx5e: Prevent encap flow counter update async to user query
    - r8169: don't use MSI before RTL8168d
    - bpf: fix XDP vlan selftests test_xdp_vlan.sh
    - selftests/bpf: add wrapper scripts for test_xdp_vlan.sh
    - selftests/bpf: reduce time to execute test_xdp_vlan.sh
    - net: fix bpf_xdp_adjust_head regression for generic-XDP
    - hv_sock: Fix hang when a connection is closed
    - net: phy: fix race in genphy_update_link
    - net/smc: avoid fallback in case of non-blocking connect
    - rocker: fix memory leaks of fib_work on two error return paths
    - mlxsw: spectrum_buffers: Further reduce pool size on Spectrum-2
    - net/mlx5: Add missing RDMA_RX capabilities
    - net/mlx5e: Fix matching of speed to PRM link modes
    - compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
    - drm/i915/vbt: Fix VBT parsing for the PSR section
    - Revert "mac80211: set NETIF_F_LLTX when using intermediate tx queues"
    - spi: bcm2835: Fix 3-wire mode if DMA is enabled
    - Linux 5.2.8

  * Miscellaneous Ubuntu changes
    - SAUCE: selftests/bpf: do not include Kbuild.include in makefile
    - update dkms package versions

linux (5.2.0-11.12) eoan; urgency=medium

  * eoan/linux: 5.2.0-11.12 -proposed tracker (LP: #1839646)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts

  * Eoan update: v5.2.7 upstream stable release (LP: #1839588)
    - Revert "UBUNTU: SAUCE: Revert "loop: Don't change loop device under
      exclusive opener""
    - ARM: riscpc: fix DMA
    - ARM: dts: rockchip: Make rk3288-veyron-minnie run at hs200
    - ARM: dts: rockchip: Make rk3288-veyron-mickey's emmc work again
    - clk: meson: mpll: properly handle spread spectrum
    - ARM: dts: rockchip: Mark that the rk3288 timer might stop in suspend
    - ftrace: Enable trampoline when rec count returns back to one
    - arm64: dts: qcom: qcs404-evb: fix l3 min voltage
    - soc: qcom: rpmpd: fixup rpmpd set performance state
    - arm64: dts: marvell: mcbin: enlarge PCI memory window
    - soc: i
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-xenial' to 'verification-done-xenial'. If the
problem still exists, change the tag 'verification-needed-xenial' to
'verification-failed-xenial'. If verification is not done by 5 working
days from today, this fix will be dropped from the source code, and
this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  Fix Released
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed
Status in linux source package in Disco:
  Confirmed
Status in linux source package in Eoan:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a
  mass-deployment of 30+ nodes. Upon retrying, the node would usually
  deploy fine. Experimentally I've set the timeout ridiculously high,
  and it seems I'm getting no faults with this. I'm wondering if the
  timeout for removal is set too tight, or might need to be made
  configurable.

  --- curtin/util.py~     2018-05-18 18:40:48.0 +
  +++ curtin/util.py      2018-10-05 09:40:06.807390367 +
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)

  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
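The patch in the bug description above simply lengthens curtin's retry schedule. For context, here is a minimal sketch of the polling pattern being tuned — a hypothetical simplified stand-in, not curtin's actual implementation (the error type and message are illustrative):

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7, 1200, 1200)):
    """Poll until `path` disappears, sleeping for each interval in turn.

    Each entry in `retries` is a sleep (in seconds) taken while the path
    still exists. The reporter's workaround appends two 1200-second
    waits, so a slow bcache teardown gets roughly 40 minutes instead of
    the ~16 seconds allowed by the original [1, 3, 5, 7] schedule.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    for sleep_s in retries:
        if not os.path.exists(path):
            return
        time.sleep(sleep_s)
    if os.path.exists(path):
        raise OSError('Timeout exceeded for removal of %s' % path)
```

Note that this only papers over the kernel-side hang discussed later in the thread: however long the schedule, a truly deadlocked allocator never lets the sysfs entry disappear.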
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-bionic' to 'verification-done-bionic'. If the
problem still exists, change the tag 'verification-needed-bionic' to
'verification-failed-bionic'. If verification is not done by 5 working
days from today, this fix will be dropped from the source code, and
this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-bionic
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

** This bug has been marked a duplicate of bug 1784665
   bcache: bch_allocator_thread(): hung task timeout
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Mon, Aug 5, 2019 at 1:19 PM Ryan Harper wrote:
>
> On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi wrote:
>
>> Ryan, I've uploaded a new test kernel with the fix mentioned in the
>> comment before:
>>
>> https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/
>>
>> I've performed over 100 installations using curtin-nvme.sh
>> (install_count = 100), no hung task timeout. I'll run other stress tests
>> to make sure we're not breaking anything else with this fix, but results
>> look promising so far.
>>
>> It'd be great if you could also do a test on your side. Thanks!
>
> That's excellent news. I'm starting my tests on this kernel now.

I've got 233 consecutive successful installs.
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi wrote:

> Ryan, I've uploaded a new test kernel with the fix mentioned in the
> comment before:
>
> https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/
>
> I've performed over 100 installations using curtin-nvme.sh
> (install_count = 100), no hung task timeout. I'll run other stress tests
> to make sure we're not breaking anything else with this fix, but results
> look promising so far.
>
> It'd be great if you could also do a test on your side. Thanks!

That's excellent news. I'm starting my tests on this kernel now.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Ryan, I've uploaded a new test kernel with the fix mentioned in the
comment before:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/

I've performed over 100 installations using curtin-nvme.sh
(install_count = 100), no hung task timeout. I'll run other stress
tests to make sure we're not breaking anything else with this fix, but
results look promising so far.

It'd be great if you could also do a test on your side. Thanks!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Some additional info about the deadlock (note: the register and address
values below were partially mangled in transit; the symbol names and
call chains are what matter):

crash> bt 16588
PID: 16588  TASK: 9ffd7f332b00  CPU: 1  COMMAND: "bcache_allocato"
    [exception RIP: bch_crc64+57]
    RIP: c093b2c9  RSP: ab9585767e28  RFLAGS: 0286
    RAX: f1f51403756de2bd  RBX:  RCX: 0065
    RDX: 0065  RSI: 9ffd6398  RDI: 9ffd63925346
    RBP: ab9585767e28  R8: c093db60  R9: ab9585739000
    R10: 007f  R11: 1ffef001  R12:
    R13: 0008  R14: 9ffd6390  R15: 9ffd683d
    CS: 0010  SS: 0018
 #0 [ab9585767e30] bch_prio_write at c09325c0 [bcache]
 #1 [ab9585767eb0] bch_allocator_thread at c091bdc5 [bcache]
 #2 [ab9585767f08] kthread at a80b2481
 #3 [ab9585767f50] ret_from_fork at a8a00205

crash> bt 14658
PID: 14658  TASK: 9ffd7a9f  CPU: 0  COMMAND: "python3"
 #0 [ab958380bb48] __schedule at a89ae441
 #1 [ab958380bbe8] schedule at a89aea7c
 #2 [ab958380bbf8] bch_bucket_alloc at c091c370 [bcache]
 #3 [ab958380bc68] __bch_bucket_alloc_set at c091c5ce [bcache]
 #4 [ab958380bcb8] bch_bucket_alloc_set at c091c66e [bcache]
 #5 [ab958380bcf8] __uuid_write at c0931b69 [bcache]
 #6 [ab958380bda0] bch_uuid_write at c0931f76 [bcache]
 #7 [ab958380bdc0] __cached_dev_store at c0937c08 [bcache]
 #8 [ab958380be20] bch_cached_dev_store at c0938309 [bcache]
 #9 [ab958380be50] sysfs_kf_write at a830c97c
#10 [ab958380be60] kernfs_fop_write at a830c3e5
#11 [ab958380bea0] __vfs_write at a827e5bb
#12 [ab958380beb0] vfs_write at a827e781
#13 [ab958380bee8] sys_write at a827e9fc
#14 [ab958380bf30] do_syscall_64 at a8003b03
#15 [ab958380bf50] entry_SYSCALL_64_after_hwframe at a8a00081
    RIP: 7faffc7bd154  RSP: 7ffe307cbc88  RFLAGS: 0246
    RAX: ffda  RBX: 0008  RCX: 7faffc7bd154
    RDX: 0008  RSI: 011ce7f0  RDI: 0003
    RBP: 7faffccb86c0  R8:  R9:
    R10: 0100  R11: 0246  R12: 0003
    R13:  R14: 011ce7f0  R15: 00f33e60
    ORIG_RAX: 0001  CS: 0033  SS: 002b

In this case the task "python3" (pid 14658) gets stuck in a wait that
never completes from bch_bucket_alloc().
The task that should resume "python3" from this wait is
"bcache_allocator" (pid 16588), but the resume never happens, because
bcache_allocator is stuck in this "retry_invalidate" busy loop:

static int bch_allocator_thread(void *arg)
{
	...
retry_invalidate:
	allocator_wait(ca, ca->set->gc_mark_valid &&
		       !ca->invalidate_needs_gc);
	invalidate_buckets(ca);

	/*
	 * Now, we write their new gens to disk so we can start writing
	 * new stuff to them:
	 */
	allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
	if (CACHE_SYNC(&ca->set->sb)) {
		/*
		 * This could deadlock if an allocation with a btree
		 * node locked ever blocked - having the btree node
		 * locked would block garbage collection, but here we're
		 * waiting on garbage collection before we invalidate
		 * and free anything.
		 *
		 * But this should be safe since the btree code always
		 * uses btree_check_reserve() before allocating now, and
		 * if it fails it blocks without btree nodes locked.
		 */
		if (!fifo_full(&ca->free_inc))
			goto retry_invalidate;

		if (bch_prio_write(ca, false) < 0) {
			ca->invalidate_needs_gc = 1;
			wake_up_gc(ca->set);
			goto retry_invalidate;
		}
	}
	...
}

The exact code path is this: bch_prio_write() fails because it calls
bch_bucket_alloc(), which fails (out of free buckets); it wakes up the
garbage collector (trying to free up some buckets) and goes back to
retry_invalidate, but apparently that is not enough: bch_prio_write()
fails over and over again (no buckets available), unable to break out
of the busy loop => deadlock.

Looking more closely at the code, it seems safe to resume the
bcache_allocator main loop when bch_prio_write() fails (still keeping
the wake-up event to the garbage collector), instead of going back to
retry_invalidate. This should give the allocator a better chance to
free up some buckets.
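The control-flow change proposed above — resuming the allocator's main loop when bch_prio_write() fails instead of jumping back to retry_invalidate — can be illustrated with a tiny Python toy model. This is purely a sketch of the loop structure, not kernel code: the `allocator` function, bucket counts, and GC behaviour are all invented for illustration.

```python
def allocator(max_iterations, resume_main_loop_on_failure):
    """Toy model of bch_allocator_thread()'s retry_invalidate loop.

    In this model the garbage collector only reclaims buckets between
    passes of the main loop. The original code spins inside
    retry_invalidate when bch_prio_write() fails, so GC progress is
    never observed (modeled here by returning 'deadlock'); the proposed
    fix breaks back out to the main loop, lets GC run, and succeeds.
    """
    free_buckets = 0
    gc_requested = False

    for _ in range(max_iterations):
        # Top of the main loop: this is where GC work becomes visible.
        if gc_requested:
            free_buckets += 8   # GC reclaimed some buckets
            gc_requested = False

        while True:             # retry_invalidate
            if free_buckets > 0:
                return 'prio_write succeeded'
            # bch_prio_write() failed: wake the GC, then decide
            gc_requested = True
            if resume_main_loop_on_failure:
                break           # proposed fix: resume the main loop
            # original behaviour: goto retry_invalidate forever; keep
            # the model finite by reporting the spin as a deadlock
            return 'deadlock'

    return 'deadlock'
```

With the fix, the second pass of the main loop observes the reclaimed buckets and prio_write succeeds; without it, the inner loop can never make progress no matter how often the GC is woken.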
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Trying the first kernel without the change event sauce also fails:

[  532.823594] bcache: run_cache_set() invalidating existing data
[  532.828876] bcache: register_cache() registered cache device nvme0n1p2
[  532.869716] bcache: register_bdev() registered backing device vda1
[  532.994355] bcache: bch_cached_dev_attach() Caching vda1 as bcache0 on set 21d89237-231d-4af6-a4c8-4b1b8fa5eef5
[  533.051588] bcache: register_bcache() error /dev/vda1: device already registered
[  533.094717] bcache: register_bcache() error /dev/vda1: device already registered
[  533.120063] bcache: register_bcache() error /dev/vda1: device already registered
[  533.142517] bcache: register_bcache() error /dev/vda1: device already registered
[  533.191069] bcache: register_bcache() error /dev/vda1: device already registered
[  533.249877] bcache: register_bcache() error /dev/vda1: device already registered
[  533.282653] bcache: register_bcache() error /dev/vda1: device already registered
[  533.301225] bcache: register_bcache() error /dev/vda1: device already registered
[  533.310505] bcache: register_bcache() error /dev/vda1: device already registered
[  533.318959] bcache: register_bcache() error /dev/vda1: device already registered
[  533.374121] bcache: register_bcache() error /dev/vda1: device already registered
[  533.536920] bcache: register_bcache() error /dev/vda1: device already registered
[  533.581468] bcache: register_bcache() error /dev/vda1: device already registered
[  533.589270] bcache: register_bcache() error /dev/vda1: device already registered
[  533.595986] bcache: register_bcache() error /dev/vda1: device already registered
[  533.602638] bcache: register_bcache() error /dev/vda1: device already registered
[  533.651848] bcache: register_bcache() error /dev/vda1: device already registered
[  533.677836] bcache: register_bcache() error /dev/vda1: device already registered
[  533.712074] bcache: register_bcache() error /dev/vda1: device already registered
[  533.717682] bcache: register_bcache() error /dev/vda1: device already registered
[  533.723354] bcache: register_bcache() error /dev/vda1: device already registered
[  533.728951] bcache: register_bcache() error /dev/vda1: device already registered
[  533.777602] bcache: register_bcache() error /dev/vda1: device already registered
[  553.784393] md: md0: resync done.
[  725.983387] INFO: task python3:413 blocked for more than 120 seconds.
[  725.985099]       Tainted: P           O     4.15.0-56-generic #62~lp1796292+1
[  725.986820] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  725.988649] python3  D 0  413  405  0x
[  725.988652] Call Trace:
[  725.988684]  __schedule+0x291/0x8a0
[  725.988687]  schedule+0x2c/0x80
[  725.988710]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[  725.988722]  ? wait_woken+0x80/0x80
[  725.988726]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[  725.988729]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[  725.988734]  __uuid_write+0x59/0x150 [bcache]
[  725.988738]  ? __write_super+0x137/0x170 [bcache]
[  725.988742]  bch_uuid_write+0x16/0x40 [bcache]
[  725.988746]  __cached_dev_store+0x1d8/0x8a0 [bcache]
[  725.988750]  bch_cached_dev_store+0x39/0xc0 [bcache]
[  725.988758]  sysfs_kf_write+0x3c/0x50
[  725.988759]  kernfs_fop_write+0x125/0x1a0
[  725.988765]  __vfs_write+0x1b/0x40
[  725.988766]  vfs_write+0xb1/0x1a0
[  725.988767]  SyS_write+0x5c/0xe0
[  725.988774]  do_syscall_64+0x73/0x130
[  725.988777]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  725.988779] RIP: 0033:0x7fa71a23b154
[  725.988780] RSP: 002b:7ffec74f2828 EFLAGS: 0246 ORIG_RAX: 0001
[  725.988783] RAX: ffda RBX: 0008 RCX: 7fa71a23b154
[  725.988783] RDX: 0008 RSI: 02a90030 RDI: 0003
[  725.988784] RBP: 7fa71a7366c0 R08: R09:
[  725.988785] R10: 0100 R11: 0246 R12: 0003
[  725.988785] R13: R14: 02a90030 R15: 027c4e60
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
After some help from Ryan (on IRC) I've been able to run the last
reproducer script and trigger the same trace. Now I should be able to
collect all the information that I need and hopefully post a new test
kernel (fixed for real...) soon.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I tried the +3 kernel first, and I got 3 installs and then this hang: [ 549.828710] bcache: run_cache_set() invalidating existing data [ 549.836485] bcache: register_cache() registered cache device nvme1n1p2 [ 549.937486] bcache: register_bdev() registered backing device vdg [ 550.018855] bcache: bch_cached_dev_attach() Caching vdg as bcache3 on set c7abd3ea-f9c9-415a-b8a6-9efeddc3e030 [ 550.074760] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.316246] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.545840] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.565928] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.724285] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 562.352520] md: md0: resync done. [ 725.980380] INFO: task python3:27303 blocked for more than 120 seconds. [ 725.982364] Tainted: P O 4.15.0-56-generic #62~lp1796292+3 [ 725.984228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 725.986284] python3 D0 27303 27293 0x [ 725.986287] Call Trace: [ 725.986319] __schedule+0x291/0x8a0 [ 725.986322] schedule+0x2c/0x80 [ 725.986337] bch_bucket_alloc+0x320/0x3a0 [bcache] [ 725.986359] ? wait_woken+0x80/0x80 [ 725.986363] __bch_bucket_alloc_set+0xfe/0x150 [bcache] [ 725.986367] bch_bucket_alloc_set+0x4e/0x70 [bcache] [ 725.986372] __uuid_write+0x59/0x150 [bcache] [ 725.986377] ? 
__write_super+0x137/0x170 [bcache] [ 725.986382] bch_uuid_write+0x16/0x40 [bcache] [ 725.986386] __cached_dev_store+0x1d8/0x8a0 [bcache] [ 725.986391] bch_cached_dev_store+0x39/0xc0 [bcache] [ 725.986399] sysfs_kf_write+0x3c/0x50 [ 725.986401] kernfs_fop_write+0x125/0x1a0 [ 725.986406] __vfs_write+0x1b/0x40 [ 725.986407] vfs_write+0xb1/0x1a0 [ 725.986409] SyS_write+0x5c/0xe0 [ 725.986416] do_syscall_64+0x73/0x130 [ 725.986419] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 725.986421] RIP: 0033:0x7f2d7c39a154 [ 725.986422] RSP: 002b:7fff80c5e048 EFLAGS: 0246 ORIG_RAX: 0001 [ 725.986426] RAX: ffda RBX: 0008 RCX: 7f2d7c39a154 [ 725.986426] RDX: 0008 RSI: 0247e7d0 RDI: 0003 [ 725.986427] RBP: 7f2d7c8956c0 R08: R09: [ 725.986428] R10: 0100 R11: 0246 R12: 0003 [ 725.986428] R13: R14: 0247e7d0 R15: 021b6e60 -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1796292 Title: Tight timeout for bcache removal causes spurious failures Status in curtin: Fix Released Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Confirmed Status in linux source package in Cosmic: Confirmed Status in linux source package in Disco: Confirmed Status in linux source package in Eoan: Confirmed Bug description: I've had a number of deployment faults where curtin would report Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a mass- deployment of 30+ nodes. Upon retrying the node would usually deploy fine. Experimentally I've set the timeout ridiculously high, and it seems I'm getting no faults with this. I'm wondering if the timeout for removal is set too tight, or might need to be made configurable. 
  --- curtin/util.py~	2018-05-18 18:40:48.0 +
  +++ curtin/util.py	2018-10-05 09:40:06.807390367 +
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)

  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
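[Editorial note: the workaround diff above just appends two very long sleeps to the retry list. The same idea can be expressed as a configurable backoff. The sketch below is illustrative only, not curtin's actual implementation; the `timeout` parameter and the `OSError` message are assumptions that merely mirror the error text quoted in the bug description.]

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7), timeout=None):
    """Wait for a sysfs path to disappear, sleeping per `retries`.

    Sketch only: loosely mirrors the curtin helper in the diff above,
    with an optional overall `timeout` cap added for illustration.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    start = time.monotonic()
    for sleep_s in retries:
        if not os.path.exists(path):
            return
        if timeout is not None and time.monotonic() - start > timeout:
            break
        time.sleep(sleep_s)
    if os.path.exists(path):
        # Same spurious failure the bug reports when the kernel is slow
        # to tear down /sys/fs/bcache/<cset-uuid>.
        raise OSError('Timeout exceeded for removal of %s' % path)
```

With this shape, making the removal timeout configurable is a matter of passing a longer `retries` sequence (or a `timeout`) from the caller, rather than editing a hard-coded list.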
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Ryan, unfortunately the last reproducer script is giving me a lot of
errors and I'm still trying to figure out how to make it run to the end
(or at least to a point where it starts to run some bcache commands).

In the meantime (as anticipated on IRC) I've uploaded a test kernel
reverting the patch "UBUNTU: SAUCE: (no-up) bcache: decouple emitting a
cached_dev CHANGE uevent":

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+1/

As we know, this would re-introduce the problem discussed in bug
1729145, but it'd be interesting to test it anyway, just to see if this
patch is somehow related to the bch_bucket_alloc() deadlock.

In addition to that I've spent some time looking at the last kernel
trace and the code. It looks like bch_bucket_alloc() always releases
the mutex &ca->set->bucket_lock when it goes to sleep (the call to
schedule()), but it doesn't release bch_register_lock, which might also
be held. I was wondering if this could be the reason for this deadlock,
so I've prepared an additional test kernel that does *not* revert our
"UBUNTU SAUCE" patch, but instead releases the mutex bch_register_lock
when bch_bucket_alloc() goes to sleep:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+3/

Sorry for asking all these tests... if I can't find a way to reproduce
the bug on my side, asking you to test is the only way I have to debug
this issue. :)
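[Editorial note: the hazard described above (blocking in schedule() while still holding a second lock that the waking thread needs) can be illustrated outside the kernel. This is a minimal Python sketch of the pattern only; `register_lock` and `free_buckets` are hypothetical stand-ins, not bcache symbols, and the kernel code is not structured this way.]

```python
import threading

register_lock = threading.Lock()     # stand-in for bch_register_lock
bucket_cond = threading.Condition()  # stand-in for the allocator wait queue
free_buckets = 0


def alloc_bucket():
    """Wait for a free bucket, dropping register_lock before sleeping.

    Mirrors the proposed fix: if we slept while still holding
    register_lock, the refill thread below (which also needs that lock)
    could never run, and we would wait forever.  A plain Lock (not
    RLock) is used so it can be released here on the caller's behalf.
    """
    global free_buckets
    with bucket_cond:
        while free_buckets == 0:
            register_lock.release()  # the key step: don't sleep holding it
            bucket_cond.wait()       # releases bucket_cond while sleeping
            register_lock.acquire()  # re-take after wakeup, then re-check
        free_buckets -= 1


def allocator_thread():
    """Refill free buckets; needs register_lock, as in the deadlock scenario."""
    global free_buckets
    with register_lock:              # blocks until alloc_bucket() drops it
        with bucket_cond:
            free_buckets += 1
            bucket_cond.notify()


# Demo: the caller enters with register_lock held, exactly like the
# sysfs-write path in the trace above.
register_lock.acquire()
t = threading.Thread(target=allocator_thread)
t.start()
alloc_bucket()                       # completes instead of hanging
register_lock.release()
t.join()
```

If the `register_lock.release()` inside the wait loop is removed, the demo hangs with the same shape as the hung task in the trace: one thread asleep in the wait, the other blocked on the lock the sleeper still holds.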
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Reproducer script

** Attachment added: "curtin-nvme.sh"
   https://bugs.launchpad.net/curtin/+bug/1796292/+attachment/5280353/+files/curtin-nvme.sh
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Thu, Aug 1, 2019 at 10:15 AM Andrea Righi wrote:
> Thanks Ryan, this is very interesting:
>
> [  259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
> [  259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
> [  259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
> [  259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
>
> It looks like we're trying to register /dev/vdg multiple times as a
> backing device (make-bcache -B). I'm not getting this message during my
> tests, so that might be required to reproduce that particular deadlock.

We carry a specific sauce patch to ensure that if the cacheset is
already online and a backing device shows up later, the kernel emits the
change event to trigger the udev rules that generate the symlink for
/dev/bcache/by-uuid. I don't think the patch we carry is at issue, since
we are just detecting the re-register scenario and emitting a change
uevent:

https://www.spinics.net/lists/linux-bcache/msg05833.html

We may want to resubmit that now to see if they'll take it, or even want
to deal with the scenario in a cleaner way.

> I'll modify my test case to trigger these errors and see if I can
> reproduce the hung task timeout issue.

I can provide you a setup to reproduce this. I'll put together a doc.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Thanks Ryan, this is very interesting:

[  259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)

It looks like we're trying to register /dev/vdg multiple times as a
backing device (make-bcache -B). I'm not getting this message during my
tests, so that might be required to reproduce that particular deadlock.

I'll modify my test case to trigger these errors and see if I can
reproduce the hung task timeout issue.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
ubuntu@ubuntu:~$ uname -r
4.15.0-56-generic
ubuntu@ubuntu:~$ cat /proc/version
Linux version 4.15.0-56-generic (arighi@kathleen) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #62~lp1796292 SMP Thu Aug 1 07:45:21 UTC 2019

This failed on the second install while running bcache-super-show
/dev/vdg. Hung task:

[  259.150347] bcache: run_cache_set() invalidating existing data
[  259.158038] bcache: register_cache() registered cache device nvme1n1p2
[  259.251093] bcache: register_bdev() registered backing device vdg
[  259.379809] bcache: bch_cached_dev_attach() Caching vdg as bcache3 on set 084505ad-5f6c-4666-9e3e-4f1650e8b015
[  259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  271.682662] md: md0: resync done.
[  484.525529] INFO: task python3:11257 blocked for more than 120 seconds.
[  484.528933]       Tainted: P           O     4.15.0-56-generic #62~lp1796292
[  484.532221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.535936] python3         D    0 11257   7974 0x
[  484.535941] Call Trace:
[  484.535952]  __schedule+0x291/0x8a0
[  484.535957]  schedule+0x2c/0x80
[  484.535977]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[  484.535984]  ? wait_woken+0x80/0x80
[  484.535993]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[  484.536002]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[  484.536014]  __uuid_write+0x59/0x150 [bcache]
[  484.536025]  ? __write_super+0x137/0x170 [bcache]
[  484.536035]  bch_uuid_write+0x16/0x40 [bcache]
[  484.536046]  __cached_dev_store+0x1d8/0x8a0 [bcache]
[  484.536057]  bch_cached_dev_store+0x39/0xc0 [bcache]
[  484.536061]  sysfs_kf_write+0x3c/0x50
[  484.536064]  kernfs_fop_write+0x125/0x1a0
[  484.536069]  __vfs_write+0x1b/0x40
[  484.536071]  vfs_write+0xb1/0x1a0
[  484.536075]  SyS_write+0x5c/0xe0
[  484.536081]  do_syscall_64+0x73/0x130
[  484.536085]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  484.536088] RIP: 0033:0x7f2ac7aa7154
[  484.536090] RSP: 002b:7157b628 EFLAGS: 0246 ORIG_RAX: 0001
[  484.536093] RAX: ffda RBX: 0008 RCX: 7f2ac7aa7154
[  484.536095] RDX: 0008 RSI: 01c96340 RDI: 0003
[  484.536096] RBP: 7f2ac7fa26c0 R08: R09:
[  484.536098] R10: 0100 R11: 0246 R12: 0003
[  484.536099] R13: R14: 01c96340 R15: 019c7db0

I'll note that I can run bcache-super-show on the device while this was
hung.

$ sudo bash
root@ubuntu:~# bcache-super-show /dev/vdg
sb.magic		ok
sb.first_sector		8 [match]
sb.csum			C71A896B52F1C486 [match]
sb.version		1 [backing device]

dev.label		osddata5
dev.uuid		11df8370-fb64-4bb0-8171-dadabb47f6b1
dev.sectors_per_block	1
dev.sectors_per_bucket	1024
dev.data.first_sector	16
dev.data.cache_mode	1 [writeback]
dev.data.cache_state	1 [clean]

cset.uuid		084505ad-5f6c-4666-9e3e-4f1650e8b015
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I've uploaded a new test kernel based on the latest bionic kernel from
master-next:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292/

In addition to that I've backported all the recent upstream bcache
fixes and applied my proposed fix for the potential deadlock in
bch_allocator_thread() (https://lkml.org/lkml/2019/7/10/241).

I've tested this kernel both on a VM and on a bare metal box, running
the test case from bug 1784665
(https://launchpadlibrarian.net/381282009/bcache-basic-repro.sh - with
some minor adjustments to match my devices). The tests have been
running for more than 1h without triggering any problem (and they're
still going).

Ryan / Chris: it would be really nice if you could do one more test
with this new kernel... and if you're still hitting issues we can try
to work on a better reproducer. Thanks again!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Escalated to Field Critical as it now happens often enough to block our
ability to test proposed product releases. We are unable to test
openstack-next at the moment because our test runs fail behind this
bug.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Tags added: cscc
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
The newer kernel went about 16 runs and then popped this:

[ 2137.810559] md: md0: resync done.
[ 2296.795633] INFO: task python3:11639 blocked for more than 120 seconds.
[ 2296.800320]       Tainted: P           O     4.15.0-55-generic #60+lp1796292+1
[ 2296.805097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2296.810071] python3         D    0 11639   8301 0x
[ 2296.810075] Call Trace:
[ 2296.810100]  __schedule+0x291/0x8a0
[ 2296.810102]  ? __switch_to_asm+0x34/0x70
[ 2296.810103]  ? __switch_to_asm+0x40/0x70
[ 2296.810105]  schedule+0x2c/0x80
[ 2296.810118]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 2296.810128]  ? wait_woken+0x80/0x80
[ 2296.810132]  __bch_bucket_alloc_set+0x10d/0x160 [bcache]
[ 2296.810137]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 2296.810143]  __uuid_write+0x59/0x170 [bcache]
[ 2296.810148]  ? __write_super+0x137/0x170 [bcache]
[ 2296.810153]  bch_uuid_write+0x16/0x40 [bcache]
[ 2296.810159]  __cached_dev_store+0x1e5/0x8b0 [bcache]
[ 2296.810160]  ? __switch_to_asm+0x34/0x70
[ 2296.810161]  ? __switch_to_asm+0x40/0x70
[ 2296.810163]  ? __switch_to_asm+0x34/0x70
[ 2296.810167]  bch_cached_dev_store+0x46/0x110 [bcache]
[ 2296.810181]  sysfs_kf_write+0x3c/0x50
[ 2296.810182]  kernfs_fop_write+0x125/0x1a0
[ 2296.810185]  __vfs_write+0x1b/0x40
[ 2296.810187]  vfs_write+0xb1/0x1a0
[ 2296.810189]  SyS_write+0x55/0xc0
[ 2296.810193]  do_syscall_64+0x73/0x130
[ 2296.810194]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2296.810196] RIP: 0033:0x7f8e80077154
[ 2296.810197] RSP: 002b:7fffc23855e8 EFLAGS: 0246 ORIG_RAX: 0001
[ 2296.810199] RAX: ffda RBX: 0008 RCX: 7f8e80077154
[ 2296.810200] RDX: 0008 RSI: 00d734b0 RDI: 0003
[ 2296.810201] RBP: 7f8e805726c0 R08: R09:
[ 2296.810202] R10: 0100 R11: 0246 R12: 0003
[ 2296.810203] R13: R14: 00d734b0 R15: 00aa4db0
[ 2417.627259] INFO: task python3:11639 blocked for more than 120 seconds.
[ 2417.632687]       Tainted: P           O     4.15.0-55-generic #60+lp1796292+1
[ 2417.638276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Andrea, thanks for the updated kernels. On the first one, I got 23
installs before I ran into an issue; I'll test the newer kernel next.

https://paste.ubuntu.com/p/2B4Kk3wbvQ/

[ 5436.870482] BUG: unable to handle kernel NULL pointer dereference at 09b8
[ 5436.873374] IP: cache_set_flush+0xf6/0x190 [bcache]
[ 5436.875208] PGD 0 P4D 0
[ 5436.876488] Oops: [#1] SMP PTI
[ 5436.877842] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) nls_utf8 isofs ppdev nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds serio_raw parport_pc parport qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 bcache psmouse nvme nvme_core virtio_blk virtio_net virtio_scsi floppy i2c_piix4 pata_acpi
[ 5436.896104] CPU: 0 PID: 11216 Comm: kworker/0:1 Tainted: P O 4.15.0-54-generic #58+lp1796292+1
[ 5436.899985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 5436.902645] Workqueue: events cache_set_flush [bcache]
[ 5436.904374] RIP: 0010:cache_set_flush+0xf6/0x190 [bcache]
[ 5436.906183] RSP: 0018:ab52826bbe58 EFLAGS: 00010202
[ 5436.909050] RAX: RBX: 94d104aa0cc0 RCX:
[ 5436.911939] RDX: RSI: 2001 RDI:
[ 5436.914448] RBP: ab52826bbe78 R08: 94d13f61ac30 R09: 94d13f342b98
[ 5436.917113] R10: ab52803b3d10 R11: 02c6 R12: 0001
[ 5436.919210] R13: 94d13f622140 R14: 94d104aa0db8 R15:
[ 5436.921401] FS: () GS:94d13f60() knlGS:
[ 5436.923743] CS: 0010 DS: ES: CR0: 80050033
[ 5436.926299] CR2: 09b8 CR3: 38252000 CR4: 06f0
[ 5436.929093] Call Trace:
[ 5436.930818]  process_one_work+0x1de/0x410
[ 5436.932818]  worker_thread+0x32/0x410
[ 5436.935332]  kthread+0x121/0x140
[ 5436.937309]  ? process_one_work+0x410/0x410
[ 5436.939393]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 5436.941263]  ret_from_fork+0x35/0x40
[ 5436.943060] Code: b8 00 00 00 a8 02 74 c8 31 f6 4c 89 e7 e8 43 0e ff ff eb bc 66 83 bb 4c f7 ff ff 00 48 8b 83 58 ff ff ff 74 31 41 bc 01 00 00 00 <48> 8b b8 b8 09 00 00 48 85 ff 74 05 e8 f9 9d 0d d3 0f b7 8b 4c
[ 5436.950188] RIP: cache_set_flush+0xf6/0x190 [bcache] RSP: ab52826bbe58
[ 5436.952796] CR2: 09b8
[ 5436.954567] ---[ end trace b771415397e98c3d ]---
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
... and, just in case, I've uploaded also a test kernel based on the
latest bionic's master-next + a bunch of extra bcache fixes:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292+1/

If the previous kernel is still buggy it'd be nice to try also this
one.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Hi Ryan, I've uploaded a new test kernel:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/

This one is based on 4.15.0-54.58 and it addresses specifically the
bch_bucket_alloc() problem (with this patch applied:
https://lore.kernel.org/lkml/20190710093117.GA2792@xps-13/T/#u).

With this kernel I wasn't able to reproduce the hung task timeout issue
in bch_bucket_alloc() anymore. It would be great if you could repeat
your test also with this kernel. Thanks in advance!
--- curtin/util.py~ 2018-05-18 18:40:48.0 + +++ curtin/util.py 2018-10-05 09:40:06.807390367 + @@ -263,7 +263,7 @@ return _subp(*args, **kwargs) -def wait_for_removal(path, retries=[1, 3, 5, 7]): +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]): if not path: raise ValueError('wait_for_removal: missing path parameter') To manage notifications about this bug go to: https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu Cosmic)
       Status: New => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu Disco)
       Status: New => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu Bionic)
       Status: New => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Good news! I've been able to reproduce the bch_bucket_alloc() hung task issue locally, using the test case from bug 1784665. I think we're hitting the same problem now. I'll do more tests and keep you updated.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Thanks a ton for the tests, Ryan! Well, at least the hung task timeout trace is different, so we're making some progress. With the new kernel it seems that we're stuck in bch_bucket_alloc(). I've identified other upstream fixes that could help prevent this problem. If you're willing to do a few more tests, here's a new test kernel (based on 4.15.0-54-generic plus a set of bcache upstream fixes):

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/

And, just in case, I've also applied the same set of fixes to the latest bionic master-next:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292/

Testing these two kernels should give us more information about the nature of the problem.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Changed in: curtin
     Assignee: Terry Rudd (terrykrudd) => Andrea Righi (arighi)
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Without the patch, I can reproduce the hang fairly frequently, in one or two loops, which fails in this way:

[ 1069.711956] bcache: cancel_writeback_rate_update_dwork() give up waiting for dc->writeback_write_update to quit
[ 1088.583986] INFO: task kworker/0:2:436 blocked for more than 120 seconds.
[ 1088.590330] Tainted: P O 4.15.0-54-generic #58-Ubuntu
[ 1088.595831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.598210] kworker/0:2 D0 436 2 0x8000
[ 1088.598244] Workqueue: events update_writeback_rate [bcache]
[ 1088.598246] Call Trace:
[ 1088.598255]  __schedule+0x291/0x8a0
[ 1088.598258]  ? __switch_to_asm+0x40/0x70
[ 1088.598260]  schedule+0x2c/0x80
[ 1088.598262]  schedule_preempt_disabled+0xe/0x10
[ 1088.598264]  __mutex_lock.isra.2+0x18c/0x4d0
[ 1088.598266]  ? __switch_to_asm+0x34/0x70
[ 1088.598268]  ? __switch_to_asm+0x34/0x70
[ 1088.598270]  ? __switch_to_asm+0x40/0x70
[ 1088.598272]  __mutex_lock_slowpath+0x13/0x20
[ 1088.598274]  ? __mutex_lock_slowpath+0x13/0x20
[ 1088.598276]  mutex_lock+0x2f/0x40
[ 1088.598281]  update_writeback_rate+0x98/0x2b0 [bcache]
[ 1088.598285]  process_one_work+0x1de/0x410
[ 1088.598287]  worker_thread+0x32/0x410
[ 1088.598289]  kthread+0x121/0x140
[ 1088.598291]  ? process_one_work+0x410/0x410
[ 1088.598293]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 1088.598295]  ret_from_fork+0x35/0x40
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I've set up our integration test that runs the CDO-QA bcache/ceph setup. On the updated kernel I got through 10 loops of the deployment before it stacktraced: http://paste.ubuntu.com/p/zVrtvKBfCY/

[ 3939.846908] bcache: bch_cached_dev_attach() Caching vdd as bcache5 on set 275985b3-da58-41f8-9072-958bd960b490
[ 3939.878388] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.904984] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.972715] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.059415] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.129854] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3949.257051] md: md0: resync done.
[ 4109.273558] INFO: task python3:19635 blocked for more than 120 seconds.
[ 4109.279331] Tainted: P O 4.15.0-55-generic #60+lp796292
[ 4109.284767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.288771] python3 D0 19635 16361 0x
[ 4109.288774] Call Trace:
[ 4109.288818]  __schedule+0x291/0x8a0
[ 4109.288822]  ? __switch_to_asm+0x34/0x70
[ 4109.288824]  ? __switch_to_asm+0x40/0x70
[ 4109.288826]  schedule+0x2c/0x80
[ 4109.288852]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.288866]  ? wait_woken+0x80/0x80
[ 4109.288872]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 4109.288876]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 4109.22]  __uuid_write+0x59/0x150 [bcache]
[ 4109.288895]  ? submit_bio+0x73/0x140
[ 4109.288900]  ? __write_super+0x137/0x170 [bcache]
[ 4109.288905]  bch_uuid_write+0x16/0x40 [bcache]
[ 4109.288911]  __cached_dev_store+0x1a1/0x6d0 [bcache]
[ 4109.288916]  bch_cached_dev_store+0x39/0xc0 [bcache]
[ 4109.288992]  sysfs_kf_write+0x3c/0x50
[ 4109.288998]  kernfs_fop_write+0x125/0x1a0
[ 4109.289001]  __vfs_write+0x1b/0x40
[ 4109.289003]  vfs_write+0xb1/0x1a0
[ 4109.289004]  SyS_write+0x55/0xc0
[ 4109.289010]  do_syscall_64+0x73/0x130
[ 4109.289014]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 4109.289016] RIP: 0033:0x7f8d2833e154
[ 4109.289018] RSP: 002b:7ffcda55a4e8 EFLAGS: 0246 ORIG_RAX: 0001
[ 4109.289020] RAX: ffda RBX: 0008 RCX: 7f8d2833e154
[ 4109.289021] RDX: 0008 RSI: 022b7360 RDI: 0003
[ 4109.289022] RBP: 7f8d288396c0 R08: R09:
[ 4109.289022] R10: 0100 R11: 0246 R12: 0003
[ 4109.289026] R13: R14: 022b7360 R15: 01fe8db0
[ 4109.289033] INFO: task bcache_allocato:22317 blocked for more than 120 seconds.
[ 4109.292172] Tainted: P O 4.15.0-55-generic #60+lp796292
[ 4109.295345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.298208] bcache_allocato D0 22317 2 0x8000
[ 4109.298212] Call Trace:
[ 4109.298217]  __schedule+0x291/0x8a0
[ 4109.298225]  schedule+0x2c/0x80
[ 4109.298232]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.298235]  ? wait_woken+0x80/0x80
[ 4109.298241]  bch_prio_write+0x19f/0x340 [bcache]
[ 4109.298246]  bch_allocator_thread+0x502/0xca0 [bcache]
[ 4109.298260]  kthread+0x121/0x140
[ 4109.298264]  ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
[ 4109.298274]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 4109.298277]  ret_from_fork+0x35/0x40
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This is difficult for us to test in our lab because we are using MAAS, and we hit this during MAAS deployments of nodes, so we would need MAAS images built with these kernels. Additionally, this doesn't reproduce every time; it shows up in maybe 1 in 4 test runs. It may be best to find a way to reproduce this outside of MAAS.

On Wed, Jul 3, 2019 at 11:16 AM Andrea Righi wrote:

> From a kernel perspective this big slowness on shutting down a bcache
> volume might be caused by a locking / race condition issue. If I read
> correctly this problem has been reproduced in bionic (and in xenial we
> even got a kernel oops - it looks like caused by a NULL pointer
> dereference). I would try to address these issues separately.
>
> About bionic it would be nice to test this commit (also mentioned by
> @elmo in comment #28):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60
>
> Moreover, even if we didn't get an explicit NULL pointer dereference
> with bionic, I think it would be interesting to test also the following
> fixes:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b732a248d12cbdb46999daf0bf288c011335eb
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f0ffa67349c56ea54c03ccfd1e073c990e7411e
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9951379b0ca88c95876ad9778b9099e19a95d566
>
> I've already backported all of them and applied to the latest bionic
> kernel. A test kernel is available here:
>
> https://kernel.ubuntu.com/~arighi/LP-1796292/
>
> If it doesn't cost too much it would be great to do a test with it. In
> the meantime I'll try to reproduce the problem locally. Thanks in
> advance!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
From a kernel perspective this big slowness on shutting down a bcache volume might be caused by a locking / race condition issue. If I read correctly this problem has been reproduced in bionic (and in xenial we even got a kernel oops - it looks like it was caused by a NULL pointer dereference). I would try to address these issues separately.

About bionic it would be nice to test this commit (also mentioned by @elmo in comment #28):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60

Moreover, even if we didn't get an explicit NULL pointer dereference with bionic, I think it would be interesting to test the following fixes as well:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b732a248d12cbdb46999daf0bf288c011335eb
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f0ffa67349c56ea54c03ccfd1e073c990e7411e
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9951379b0ca88c95876ad9778b9099e19a95d566

I've already backported all of them and applied them to the latest bionic kernel. A test kernel is available here:

https://kernel.ubuntu.com/~arighi/LP-1796292/

If it doesn't cost too much it would be great to do a test with it. In the meantime I'll try to reproduce the problem locally. Thanks in advance!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Eoan)
   Importance: Undecided
       Status: Confirmed

** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
       Status: New
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
The Canonical kernel team has this item queued on the hotlist to work on. I am assigning it to myself to accelerate the work.

** Changed in: curtin
     Assignee: (unassigned) => Terry Rudd (terrykrudd)
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Mon, Jun 3, 2019 at 2:05 PM Andrey Grebennikov < agrebennikov1...@gmail.com> wrote:

> Is there an estimate on getting this package in bionic-updates please?

We are starting an SRU of curtin this week. SRUs take at least 7 days from when they hit -proposed, possibly longer depending on test results. I should have something up in -proposed this week and we'll go from there on testing.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Is there an estimate on getting this package in bionic-updates please?
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This bug is believed to be fixed in curtin in version 19.1. If this is still a problem for you, please make a comment and set the state back to New. Thank you.

** Changed in: curtin
   Status: New => Fix Released
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This script *should* trigger the issue on Bionic GA:
https://pastebin.ubuntu.com/p/WdKGbMWnM6/

Try it with both GA and HWE Bionic; the commit on HWE should clear it up.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I was looking into kernel commits and came across this one:
https://github.com/torvalds/linux/commit/fadd94e05c02afec7b70b0b14915624f1782f578

As far as I understand, it deals with a deadlock caused by manually detaching a device during writeback clean-up. The timeline also makes sense for the Bionic GA kernel: Bionic GA should not include this fix, but HWE should. Could we run the tests again, but focusing on Bionic HWE? AFAIR HWE runs 4.18, which should include this commit.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Tags added: cdo-qa foundations-engine
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
@jhobbs Here is the script that cleans up bcache devices on recommission:
https://pastebin.ubuntu.com/p/6WCGvM4Q32/
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Wed, May 8, 2019 at 11:55 PM Trent Lloyd wrote:
> I have been running into this (curtin 18.1-17-gae48e86f-0ubuntu1~16.04.1)
>
> I think this commit basically agrees with my thoughts but I just wanted
> to share them explicitly in case they are interesting
>
> (1) If you *unregister* the cache device from the backing device, it
> first has to purge all the dirty data back to the backing device. This
> may obviously take a while.
>
> (2) When doing that, I managed to deadlock bcache at least once on
> xenial-hwe 4.15 where it was trying to reclaim memory from XFS, which I
> assume was trying to write to the bcache. Traceback:
> https://pastebin.canonical.com/117528/ - you can't get out of that
> without a reboot

Thanks for capturing those; I've quite a few of my own as well. The unregister path _should_ work, but doesn't, due to various bugs in bcache. I need to attach those oopses to this bug as well.

> (3) However generally I had good luck simply "stop"ing the cache
> devices (it seems perhaps that is what this bug is designed to do,
> switch to stop, instead of unregister?). Specifically though I was
> stopping the backing devices, and then later the cache device. It seems
> like the current commit is the other way around?

Unregister is just not stable, so stopping is what is being done now. I did attempt stopping the bcache devices first and, only once all bcache devices were stopped, then stopping and removing the cacheset; this proved unreliable under our integration testing of various bcache scenarios.

> ** Tags added: sts
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Xenial GA kernel bcache unregister oops: http://paste.ubuntu.com/p/BzfHFjzZ8y/
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I have been running into this (curtin 18.1-17-gae48e86f-0ubuntu1~16.04.1)

I think this commit basically agrees with my thoughts, but I just wanted to share them explicitly in case they are interesting:

(1) If you *unregister* the cache device from the backing device, it first has to purge all the dirty data back to the backing device. This may obviously take a while.

(2) When doing that, I managed to deadlock bcache at least once on xenial-hwe 4.15, where it was trying to reclaim memory from XFS, which I assume was trying to write to the bcache. Traceback: https://pastebin.canonical.com/117528/ - you can't get out of that without a reboot.

(3) However, generally I had good luck simply "stop"ing the cache devices (it seems perhaps that is what this bug is designed to do: switch to stop instead of unregister?). Specifically, though, I was stopping the backing devices first, and then later the cache device. It seems like the current commit is the other way around?

** Tags added: sts
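The stop-rather-than-unregister approach discussed in this thread boils down to writing '1' to the bcache sysfs `stop` files. A hedged sketch follows: the sysfs layout is the standard bcache one (`/sys/block/bcache*/bcache/stop` for backing devices, `/sys/fs/bcache/<set-uuid>/stop` for cache sets), the ordering here (backing devices before cache sets) follows Trent's description rather than what curtin finally settled on, and the `sysfs` parameter is added here purely for testability:

```python
import glob
import os


def stop_bcache(sysfs='/sys'):
    """Stop all bcache backing devices, then all cache sets, by writing
    '1' to their sysfs 'stop' files. Illustrative sketch only; on a
    real system this requires root and the standard bcache sysfs layout.
    The `sysfs` parameter (an assumption, not a curtin API) lets the
    sketch be exercised against a fake directory tree."""
    # Backing devices first: /sys/block/bcache*/bcache/stop
    for stop in glob.glob(os.path.join(sysfs, 'block/bcache*/bcache/stop')):
        with open(stop, 'w') as f:
            f.write('1')
    # Then the cache sets: /sys/fs/bcache/<uuid>/stop
    for stop in glob.glob(os.path.join(sysfs, 'fs/bcache/*/stop')):
        with open(stop, 'w') as f:
            f.write('1')
```

Per Ryan's reply above, this exact ordering (devices first, cache set last) proved unreliable in curtin's integration testing, so treat it as one of the options that was tried, not the final answer.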
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This occurs on a target machine during a MAAS install. Apport data is not collected in this case.

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Adding the linux package as affected.

** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New