[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

This bug was fixed in the package linux - 5.2.0-13.14

---------------
linux (5.2.0-13.14) eoan; urgency=medium

  * eoan/linux: 5.2.0-13.14 -proposed tracker (LP: #1840261)

  * NULL pointer dereference when Inserting the VIMC module (LP: #1840028)
    - media: vimc: fix component match compare

  * Miscellaneous upstream changes
    - selftests/bpf: remove bpf_util.h from BPF C progs

linux (5.2.0-12.13) eoan; urgency=medium

  * eoan/linux: 5.2.0-12.13 -proposed tracker (LP: #1840184)

  * Eoan update: v5.2.8 upstream stable release (LP: #1840178)
    - scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
    - libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
    - libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
    - ALSA: usb-audio: Sanity checks for each pipe and EP types
    - ALSA: usb-audio: Fix gpf in snd_usb_pipe_sanity_check
    - HID: wacom: fix bit shift for Cintiq Companion 2
    - HID: Add quirk for HP X1200 PIXART OEM mouse
    - atm: iphase: Fix Spectre v1 vulnerability
    - bnx2x: Disable multi-cos feature.
    - drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
    - ife: error out when nla attributes are empty
    - ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
    - ip6_tunnel: fix possible use-after-free on xmit
    - ipip: validate header length in ipip_tunnel_xmit
    - mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
    - mvpp2: fix panic on module removal
    - mvpp2: refactor MTU change code
    - net: bridge: delete local fdb on device init failure
    - net: bridge: mcast: don't delete permanent entries when fast leave
      is enabled
    - net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
    - net: fix ifindex collision during namespace removal
    - net/mlx5e: always initialize frag->last_in_page
    - net/mlx5: Use reversed order when unregister devices
    - net: phy: fixed_phy: print gpio error only if gpio node is present
    - net: phylink: don't start and stop SGMII PHYs in SFP modules twice
    - net: phylink: Fix flow control for fixed-link
    - net: phy: mscc: initialize stats array
    - net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
    - net: sched: Fix a possible null-pointer dereference in dequeue_func()
    - net sched: update vlan action for batched events operations
    - net: sched: use temporary variable for actions indexes
    - net/smc: do not schedule tx_work in SMC_CLOSED state
    - net: stmmac: Use netif_tx_napi_add() for TX polling function
    - NFC: nfcmrvl: fix gpio-handling regression
    - ocelot: Cancel delayed work before wq destruction
    - tipc: compat: allow tipc commands without arguments
    - tipc: fix unitilized skb list crash
    - tun: mark small packets as owned by the tap sock
    - net/mlx5: Fix modify_cq_in alignment
    - net/mlx5e: Prevent encap flow counter update async to user query
    - r8169: don't use MSI before RTL8168d
    - bpf: fix XDP vlan selftests test_xdp_vlan.sh
    - selftests/bpf: add wrapper scripts for test_xdp_vlan.sh
    - selftests/bpf: reduce time to execute test_xdp_vlan.sh
    - net: fix bpf_xdp_adjust_head regression for generic-XDP
    - hv_sock: Fix hang when a connection is closed
    - net: phy: fix race in genphy_update_link
    - net/smc: avoid fallback in case of non-blocking connect
    - rocker: fix memory leaks of fib_work on two error return paths
    - mlxsw: spectrum_buffers: Further reduce pool size on Spectrum-2
    - net/mlx5: Add missing RDMA_RX capabilities
    - net/mlx5e: Fix matching of speed to PRM link modes
    - compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
    - drm/i915/vbt: Fix VBT parsing for the PSR section
    - Revert "mac80211: set NETIF_F_LLTX when using intermediate tx queues"
    - spi: bcm2835: Fix 3-wire mode if DMA is enabled
    - Linux 5.2.8

  * Miscellaneous Ubuntu changes
    - SAUCE: selftests/bpf: do not include Kbuild.include in makefile
    - update dkms package versions

linux (5.2.0-11.12) eoan; urgency=medium

  * eoan/linux: 5.2.0-11.12 -proposed tracker (LP: #1839646)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts

  * Eoan update: v5.2.7 upstream stable release (LP: #1839588)
    - Revert "UBUNTU: SAUCE: Revert "loop: Don't change loop device under
      exclusive opener""
    - ARM: riscpc: fix DMA
    - ARM: dts: rockchip: Make rk3288-veyron-minnie run at hs200
    - ARM: dts: rockchip: Make rk3288-veyron-mickey's emmc work again
    - clk: meson: mpll: properly handle spread spectrum
    - ARM: dts: rockchip: Mark that the rk3288 timer might stop in suspend
    - ftrace: Enable trampoline when rec count returns back to one
    - arm64: dts: qcom: qcs404-evb: fix l3 min voltage
    - soc: qcom: rpmpd: fixup rpmpd set performance state
    - arm64: dts: marvell: mcbin: enlarge PCI memory window
    - soc: i
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-xenial' to 'verification-done-xenial'. If the
problem still exists, change the tag 'verification-needed-xenial' to
'verification-failed-xenial'. If verification is not done by 5 working
days from today, this fix will be dropped from the source code, and
this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  Fix Released
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed
Status in linux source package in Disco:
  Confirmed
Status in linux source package in Eoan:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a
  mass-deployment of 30+ nodes. Upon retrying, the node would usually
  deploy fine. Experimentally I've set the timeout ridiculously high,
  and it seems I'm getting no faults with this. I'm wondering if the
  timeout for removal is set too tight, or might need to be made
  configurable.

  --- curtin/util.py~     2018-05-18 18:40:48.0 +
  +++ curtin/util.py      2018-10-05 09:40:06.807390367 +
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)

  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
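The patch in the bug description above simply lengthens curtin's retry schedule. For context, here is a minimal sketch of the polling pattern being tuned — a hypothetical simplified stand-in, not curtin's actual implementation (the error type and message are illustrative):

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7, 1200, 1200)):
    """Poll until `path` disappears, sleeping for each interval in turn.

    Each entry in `retries` is a sleep (in seconds) taken while the path
    still exists. The reporter's workaround appends two 1200-second
    waits, so a slow bcache teardown gets roughly 40 minutes instead of
    the ~16 seconds allowed by the original [1, 3, 5, 7] schedule.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    for sleep_s in retries:
        if not os.path.exists(path):
            return
        time.sleep(sleep_s)
    if os.path.exists(path):
        raise OSError('Timeout exceeded for removal of %s' % path)
```

Note that this only papers over the kernel-side hang discussed later in the thread: however long the schedule, a truly deadlocked allocator never lets the sysfs entry disappear.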
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-bionic' to 'verification-done-bionic'. If the
problem still exists, change the tag 'verification-needed-bionic' to
'verification-failed-bionic'. If verification is not done by 5 working
days from today, this fix will be dropped from the source code, and
this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-bionic
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
*** This bug is a duplicate of bug 1784665 ***
    https://bugs.launchpad.net/bugs/1784665

** This bug has been marked a duplicate of bug 1784665
   bcache: bch_allocator_thread(): hung task timeout
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Mon, Aug 5, 2019 at 1:19 PM Ryan Harper wrote:
>
> On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi wrote:
>
>> Ryan, I've uploaded a new test kernel with the fix mentioned in the
>> comment before:
>>
>> https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/
>>
>> I've performed over 100 installations using curtin-nvme.sh
>> (install_count = 100), no hung task timeout. I'll run other stress tests
>> to make sure we're not breaking anything else with this fix, but results
>> look promising so far.
>>
>> It'd be great if you could also do a test on your side. Thanks!
>
> That's excellent news. I'm starting my tests on this kernel now.

I've got 233 consecutive successful installs.
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi wrote:

> Ryan, I've uploaded a new test kernel with the fix mentioned in the
> comment before:
>
> https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/
>
> I've performed over 100 installations using curtin-nvme.sh
> (install_count = 100), no hung task timeout. I'll run other stress tests
> to make sure we're not breaking anything else with this fix, but results
> look promising so far.
>
> It'd be great if you could also do a test on your side. Thanks!

That's excellent news. I'm starting my tests on this kernel now.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Ryan, I've uploaded a new test kernel with the fix mentioned in the
comment before:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/

I've performed over 100 installations using curtin-nvme.sh
(install_count = 100), no hung task timeout. I'll run other stress
tests to make sure we're not breaking anything else with this fix, but
results look promising so far.

It'd be great if you could also do a test on your side. Thanks!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Some additional info about the deadlock (note: the register and address
values below were partially mangled in transit; the symbol names and
call chains are what matter):

crash> bt 16588
PID: 16588  TASK: 9ffd7f332b00  CPU: 1  COMMAND: "bcache_allocato"
    [exception RIP: bch_crc64+57]
    RIP: c093b2c9  RSP: ab9585767e28  RFLAGS: 0286
    RAX: f1f51403756de2bd  RBX:  RCX: 0065
    RDX: 0065  RSI: 9ffd6398  RDI: 9ffd63925346
    RBP: ab9585767e28  R8: c093db60  R9: ab9585739000
    R10: 007f  R11: 1ffef001  R12:
    R13: 0008  R14: 9ffd6390  R15: 9ffd683d
    CS: 0010  SS: 0018
 #0 [ab9585767e30] bch_prio_write at c09325c0 [bcache]
 #1 [ab9585767eb0] bch_allocator_thread at c091bdc5 [bcache]
 #2 [ab9585767f08] kthread at a80b2481
 #3 [ab9585767f50] ret_from_fork at a8a00205

crash> bt 14658
PID: 14658  TASK: 9ffd7a9f  CPU: 0  COMMAND: "python3"
 #0 [ab958380bb48] __schedule at a89ae441
 #1 [ab958380bbe8] schedule at a89aea7c
 #2 [ab958380bbf8] bch_bucket_alloc at c091c370 [bcache]
 #3 [ab958380bc68] __bch_bucket_alloc_set at c091c5ce [bcache]
 #4 [ab958380bcb8] bch_bucket_alloc_set at c091c66e [bcache]
 #5 [ab958380bcf8] __uuid_write at c0931b69 [bcache]
 #6 [ab958380bda0] bch_uuid_write at c0931f76 [bcache]
 #7 [ab958380bdc0] __cached_dev_store at c0937c08 [bcache]
 #8 [ab958380be20] bch_cached_dev_store at c0938309 [bcache]
 #9 [ab958380be50] sysfs_kf_write at a830c97c
#10 [ab958380be60] kernfs_fop_write at a830c3e5
#11 [ab958380bea0] __vfs_write at a827e5bb
#12 [ab958380beb0] vfs_write at a827e781
#13 [ab958380bee8] sys_write at a827e9fc
#14 [ab958380bf30] do_syscall_64 at a8003b03
#15 [ab958380bf50] entry_SYSCALL_64_after_hwframe at a8a00081
    RIP: 7faffc7bd154  RSP: 7ffe307cbc88  RFLAGS: 0246
    RAX: ffda  RBX: 0008  RCX: 7faffc7bd154
    RDX: 0008  RSI: 011ce7f0  RDI: 0003
    RBP: 7faffccb86c0  R8:  R9:
    R10: 0100  R11: 0246  R12: 0003
    R13:  R14: 011ce7f0  R15: 00f33e60
    ORIG_RAX: 0001  CS: 0033  SS: 002b

In this case the task "python3" (pid 14658) gets stuck in a wait that
never completes from bch_bucket_alloc().
The task that should resume "python3" from this wait is
"bcache_allocator" (pid 16588), but the resume never happens, because
bcache_allocator is stuck in this "retry_invalidate" busy loop:

static int bch_allocator_thread(void *arg)
{
	...
retry_invalidate:
	allocator_wait(ca, ca->set->gc_mark_valid &&
		       !ca->invalidate_needs_gc);
	invalidate_buckets(ca);

	/*
	 * Now, we write their new gens to disk so we can start writing
	 * new stuff to them:
	 */
	allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
	if (CACHE_SYNC(&ca->set->sb)) {
		/*
		 * This could deadlock if an allocation with a btree
		 * node locked ever blocked - having the btree node
		 * locked would block garbage collection, but here we're
		 * waiting on garbage collection before we invalidate
		 * and free anything.
		 *
		 * But this should be safe since the btree code always
		 * uses btree_check_reserve() before allocating now, and
		 * if it fails it blocks without btree nodes locked.
		 */
		if (!fifo_full(&ca->free_inc))
			goto retry_invalidate;

		if (bch_prio_write(ca, false) < 0) {
			ca->invalidate_needs_gc = 1;
			wake_up_gc(ca->set);
			goto retry_invalidate;
		}
	}
	...
}

The exact code path is this: bch_prio_write() fails because it calls
bch_bucket_alloc(), which fails (out of free buckets); it wakes up the
garbage collector (trying to free up some buckets) and goes back to
retry_invalidate, but apparently that is not enough: bch_prio_write()
fails over and over again (no buckets available), unable to break out
of the busy loop => deadlock.

Looking more closely at the code, it seems safe to resume the
bcache_allocator main loop when bch_prio_write() fails (still keeping
the wake-up event to the garbage collector), instead of going back to
retry_invalidate. This should give the allocator a better chance to
free up some buckets.
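The control-flow change proposed above — resuming the allocator's main loop when bch_prio_write() fails instead of jumping back to retry_invalidate — can be illustrated with a tiny Python toy model. This is purely a sketch of the loop structure, not kernel code: the `allocator` function, bucket counts, and GC behaviour are all invented for illustration.

```python
def allocator(max_iterations, resume_main_loop_on_failure):
    """Toy model of bch_allocator_thread()'s retry_invalidate loop.

    In this model the garbage collector only reclaims buckets between
    passes of the main loop. The original code spins inside
    retry_invalidate when bch_prio_write() fails, so GC progress is
    never observed (modeled here by returning 'deadlock'); the proposed
    fix breaks back out to the main loop, lets GC run, and succeeds.
    """
    free_buckets = 0
    gc_requested = False

    for _ in range(max_iterations):
        # Top of the main loop: this is where GC work becomes visible.
        if gc_requested:
            free_buckets += 8   # GC reclaimed some buckets
            gc_requested = False

        while True:             # retry_invalidate
            if free_buckets > 0:
                return 'prio_write succeeded'
            # bch_prio_write() failed: wake the GC, then decide
            gc_requested = True
            if resume_main_loop_on_failure:
                break           # proposed fix: resume the main loop
            # original behaviour: goto retry_invalidate forever; keep
            # the model finite by reporting the spin as a deadlock
            return 'deadlock'

    return 'deadlock'
```

With the fix, the second pass of the main loop observes the reclaimed buckets and prio_write succeeds; without it, the inner loop can never make progress no matter how often the GC is woken.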
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Trying the first kernel without the change event sauce also fails:

[  532.823594] bcache: run_cache_set() invalidating existing data
[  532.828876] bcache: register_cache() registered cache device nvme0n1p2
[  532.869716] bcache: register_bdev() registered backing device vda1
[  532.994355] bcache: bch_cached_dev_attach() Caching vda1 as bcache0 on set 21d89237-231d-4af6-a4c8-4b1b8fa5eef5
[  533.051588] bcache: register_bcache() error /dev/vda1: device already registered
[  533.094717] bcache: register_bcache() error /dev/vda1: device already registered
[  533.120063] bcache: register_bcache() error /dev/vda1: device already registered
[  533.142517] bcache: register_bcache() error /dev/vda1: device already registered
[  533.191069] bcache: register_bcache() error /dev/vda1: device already registered
[  533.249877] bcache: register_bcache() error /dev/vda1: device already registered
[  533.282653] bcache: register_bcache() error /dev/vda1: device already registered
[  533.301225] bcache: register_bcache() error /dev/vda1: device already registered
[  533.310505] bcache: register_bcache() error /dev/vda1: device already registered
[  533.318959] bcache: register_bcache() error /dev/vda1: device already registered
[  533.374121] bcache: register_bcache() error /dev/vda1: device already registered
[  533.536920] bcache: register_bcache() error /dev/vda1: device already registered
[  533.581468] bcache: register_bcache() error /dev/vda1: device already registered
[  533.589270] bcache: register_bcache() error /dev/vda1: device already registered
[  533.595986] bcache: register_bcache() error /dev/vda1: device already registered
[  533.602638] bcache: register_bcache() error /dev/vda1: device already registered
[  533.651848] bcache: register_bcache() error /dev/vda1: device already registered
[  533.677836] bcache: register_bcache() error /dev/vda1: device already registered
[  533.712074] bcache: register_bcache() error /dev/vda1: device already registered
[  533.717682] bcache: register_bcache() error /dev/vda1: device already registered
[  533.723354] bcache: register_bcache() error /dev/vda1: device already registered
[  533.728951] bcache: register_bcache() error /dev/vda1: device already registered
[  533.777602] bcache: register_bcache() error /dev/vda1: device already registered
[  553.784393] md: md0: resync done.
[  725.983387] INFO: task python3:413 blocked for more than 120 seconds.
[  725.985099]       Tainted: P           O     4.15.0-56-generic #62~lp1796292+1
[  725.986820] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  725.988649] python3  D 0  413  405  0x
[  725.988652] Call Trace:
[  725.988684]  __schedule+0x291/0x8a0
[  725.988687]  schedule+0x2c/0x80
[  725.988710]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[  725.988722]  ? wait_woken+0x80/0x80
[  725.988726]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[  725.988729]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[  725.988734]  __uuid_write+0x59/0x150 [bcache]
[  725.988738]  ? __write_super+0x137/0x170 [bcache]
[  725.988742]  bch_uuid_write+0x16/0x40 [bcache]
[  725.988746]  __cached_dev_store+0x1d8/0x8a0 [bcache]
[  725.988750]  bch_cached_dev_store+0x39/0xc0 [bcache]
[  725.988758]  sysfs_kf_write+0x3c/0x50
[  725.988759]  kernfs_fop_write+0x125/0x1a0
[  725.988765]  __vfs_write+0x1b/0x40
[  725.988766]  vfs_write+0xb1/0x1a0
[  725.988767]  SyS_write+0x5c/0xe0
[  725.988774]  do_syscall_64+0x73/0x130
[  725.988777]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  725.988779] RIP: 0033:0x7fa71a23b154
[  725.988780] RSP: 002b:7ffec74f2828 EFLAGS: 0246 ORIG_RAX: 0001
[  725.988783] RAX: ffda RBX: 0008 RCX: 7fa71a23b154
[  725.988783] RDX: 0008 RSI: 02a90030 RDI: 0003
[  725.988784] RBP: 7fa71a7366c0 R08: R09:
[  725.988785] R10: 0100 R11: 0246 R12: 0003
[  725.988785] R13: R14: 02a90030 R15: 027c4e60
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
After some help from Ryan (on IRC) I've been able to run the last
reproducer script and trigger the same trace. Now I should be able to
collect all the information that I need and hopefully post a new test
kernel (fixed for real...) soon.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I tried the +3 kernel first, and I got 3 installs and then this hang: [ 549.828710] bcache: run_cache_set() invalidating existing data [ 549.836485] bcache: register_cache() registered cache device nvme1n1p2 [ 549.937486] bcache: register_bdev() registered backing device vdg [ 550.018855] bcache: bch_cached_dev_attach() Caching vdg as bcache3 on set c7abd3ea-f9c9-415a-b8a6-9efeddc3e030 [ 550.074760] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.316246] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.545840] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.565928] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 550.724285] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event) [ 562.352520] md: md0: resync done. [ 725.980380] INFO: task python3:27303 blocked for more than 120 seconds. [ 725.982364] Tainted: P O 4.15.0-56-generic #62~lp1796292+3 [ 725.984228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 725.986284] python3 D0 27303 27293 0x [ 725.986287] Call Trace: [ 725.986319] __schedule+0x291/0x8a0 [ 725.986322] schedule+0x2c/0x80 [ 725.986337] bch_bucket_alloc+0x320/0x3a0 [bcache] [ 725.986359] ? wait_woken+0x80/0x80 [ 725.986363] __bch_bucket_alloc_set+0xfe/0x150 [bcache] [ 725.986367] bch_bucket_alloc_set+0x4e/0x70 [bcache] [ 725.986372] __uuid_write+0x59/0x150 [bcache] [ 725.986377] ? 
__write_super+0x137/0x170 [bcache] [ 725.986382] bch_uuid_write+0x16/0x40 [bcache] [ 725.986386] __cached_dev_store+0x1d8/0x8a0 [bcache] [ 725.986391] bch_cached_dev_store+0x39/0xc0 [bcache] [ 725.986399] sysfs_kf_write+0x3c/0x50 [ 725.986401] kernfs_fop_write+0x125/0x1a0 [ 725.986406] __vfs_write+0x1b/0x40 [ 725.986407] vfs_write+0xb1/0x1a0 [ 725.986409] SyS_write+0x5c/0xe0 [ 725.986416] do_syscall_64+0x73/0x130 [ 725.986419] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 725.986421] RIP: 0033:0x7f2d7c39a154 [ 725.986422] RSP: 002b:7fff80c5e048 EFLAGS: 0246 ORIG_RAX: 0001 [ 725.986426] RAX: ffda RBX: 0008 RCX: 7f2d7c39a154 [ 725.986426] RDX: 0008 RSI: 0247e7d0 RDI: 0003 [ 725.986427] RBP: 7f2d7c8956c0 R08: R09: [ 725.986428] R10: 0100 R11: 0246 R12: 0003 [ 725.986428] R13: R14: 0247e7d0 R15: 021b6e60 -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1796292 Title: Tight timeout for bcache removal causes spurious failures Status in curtin: Fix Released Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Confirmed Status in linux source package in Cosmic: Confirmed Status in linux source package in Disco: Confirmed Status in linux source package in Eoan: Confirmed Bug description: I've had a number of deployment faults where curtin would report Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a mass- deployment of 30+ nodes. Upon retrying the node would usually deploy fine. Experimentally I've set the timeout ridiculously high, and it seems I'm getting no faults with this. I'm wondering if the timeout for removal is set too tight, or might need to be made configurable. 
  --- curtin/util.py~	2018-05-18 18:40:48.0 +
  +++ curtin/util.py	2018-10-05 09:40:06.807390367 +
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)

  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
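[Editorial note: the workaround diff above just appends two very long sleeps to the retry list. The same idea can be expressed as a configurable backoff. The sketch below is illustrative only, not curtin's actual implementation; the `timeout` parameter and the `OSError` message are assumptions that merely mirror the error text quoted in the bug description.]

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7), timeout=None):
    """Wait for a sysfs path to disappear, sleeping per `retries`.

    Sketch only: loosely mirrors the curtin helper in the diff above,
    with an optional overall `timeout` cap added for illustration.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    start = time.monotonic()
    for sleep_s in retries:
        if not os.path.exists(path):
            return
        if timeout is not None and time.monotonic() - start > timeout:
            break
        time.sleep(sleep_s)
    if os.path.exists(path):
        # Same spurious failure the bug reports when the kernel is slow
        # to tear down /sys/fs/bcache/<cset-uuid>.
        raise OSError('Timeout exceeded for removal of %s' % path)
```

With this shape, making the removal timeout configurable is a matter of passing a longer `retries` sequence (or a `timeout`) from the caller, rather than editing a hard-coded list.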
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Ryan, unfortunately the last reproducer script is giving me a lot of
errors and I'm still trying to figure out how to make it run to the end
(or at least to a point where it starts to run some bcache commands).

In the meantime (as anticipated on IRC) I've uploaded a test kernel
reverting the patch "UBUNTU: SAUCE: (no-up) bcache: decouple emitting a
cached_dev CHANGE uevent":

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+1/

As we know, this would re-introduce the problem discussed in bug
1729145, but it'd be interesting to test it anyway, just to see if this
patch is somehow related to the bch_bucket_alloc() deadlock.

In addition to that I've spent some time looking at the last kernel
trace and the code. It looks like bch_bucket_alloc() always releases
the mutex &ca->set->bucket_lock when it goes to sleep (the call to
schedule()), but it doesn't release bch_register_lock, which might also
be held. I was wondering if this could be the reason for this deadlock,
so I've prepared an additional test kernel that does *not* revert our
"UBUNTU SAUCE" patch, but instead releases the mutex bch_register_lock
when bch_bucket_alloc() goes to sleep:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+3/

Sorry for asking all these tests... if I can't find a way to reproduce
the bug on my side, asking you to test is the only way I have to debug
this issue. :)
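[Editorial note: the hazard described above (blocking in schedule() while still holding a second lock that the waking thread needs) can be illustrated outside the kernel. This is a minimal Python sketch of the pattern only; `register_lock` and `free_buckets` are hypothetical stand-ins, not bcache symbols, and the kernel code is not structured this way.]

```python
import threading

register_lock = threading.Lock()     # stand-in for bch_register_lock
bucket_cond = threading.Condition()  # stand-in for the allocator wait queue
free_buckets = 0


def alloc_bucket():
    """Wait for a free bucket, dropping register_lock before sleeping.

    Mirrors the proposed fix: if we slept while still holding
    register_lock, the refill thread below (which also needs that lock)
    could never run, and we would wait forever.  A plain Lock (not
    RLock) is used so it can be released here on the caller's behalf.
    """
    global free_buckets
    with bucket_cond:
        while free_buckets == 0:
            register_lock.release()  # the key step: don't sleep holding it
            bucket_cond.wait()       # releases bucket_cond while sleeping
            register_lock.acquire()  # re-take after wakeup, then re-check
        free_buckets -= 1


def allocator_thread():
    """Refill free buckets; needs register_lock, as in the deadlock scenario."""
    global free_buckets
    with register_lock:              # blocks until alloc_bucket() drops it
        with bucket_cond:
            free_buckets += 1
            bucket_cond.notify()


# Demo: the caller enters with register_lock held, exactly like the
# sysfs-write path in the trace above.
register_lock.acquire()
t = threading.Thread(target=allocator_thread)
t.start()
alloc_bucket()                       # completes instead of hanging
register_lock.release()
t.join()
```

If the `register_lock.release()` inside the wait loop is removed, the demo hangs with the same shape as the hung task in the trace: one thread asleep in the wait, the other blocked on the lock the sleeper still holds.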
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Reproducer script

** Attachment added: "curtin-nvme.sh"
   https://bugs.launchpad.net/curtin/+bug/1796292/+attachment/5280353/+files/curtin-nvme.sh
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Thu, Aug 1, 2019 at 10:15 AM Andrea Righi wrote:
> Thanks Ryan, this is very interesting:
>
> [  259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
> [  259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
> [  259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
> [  259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
>
> It looks like we're trying to register /dev/vdg multiple times as a
> backing device (make-bcache -B). I'm not getting this message during my
> tests, so that might be required to reproduce that particular deadlock.

We carry a specific sauce patch to ensure that if the cacheset is
already online and a backing device shows up later, the kernel emits the
change event to trigger the udev rules that generate the symlink for
/dev/bcache/by-uuid. I don't think the patch we carry is at issue, since
we are just detecting the re-register scenario and emitting a change
uevent:

https://www.spinics.net/lists/linux-bcache/msg05833.html

We may want to resubmit that now to see if they'll take it, or even want
to deal with the scenario in a cleaner way.

> I'll modify my test case to trigger these errors and see if I can
> reproduce the hung task timeout issue.

I can provide you a setup to reproduce this. I'll put together a doc.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Thanks Ryan, this is very interesting:

[  259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)

It looks like we're trying to register /dev/vdg multiple times as a
backing device (make-bcache -B). I'm not getting this message during my
tests, so that might be required to reproduce that particular deadlock.

I'll modify my test case to trigger these errors and see if I can
reproduce the hung task timeout issue.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
ubuntu@ubuntu:~$ uname -r
4.15.0-56-generic
ubuntu@ubuntu:~$ cat /proc/version
Linux version 4.15.0-56-generic (arighi@kathleen) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #62~lp1796292 SMP Thu Aug 1 07:45:21 UTC 2019

This failed on the second install while running bcache-super-show
/dev/vdg. Hung task:

[  259.150347] bcache: run_cache_set() invalidating existing data
[  259.158038] bcache: register_cache() registered cache device nvme1n1p2
[  259.251093] bcache: register_bdev() registered backing device vdg
[  259.379809] bcache: bch_cached_dev_attach() Caching vdg as bcache3 on set 084505ad-5f6c-4666-9e3e-4f1650e8b015
[  259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[  271.682662] md: md0: resync done.
[  484.525529] INFO: task python3:11257 blocked for more than 120 seconds.
[  484.528933]       Tainted: P           O     4.15.0-56-generic #62~lp1796292
[  484.532221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.535936] python3         D    0 11257   7974 0x
[  484.535941] Call Trace:
[  484.535952]  __schedule+0x291/0x8a0
[  484.535957]  schedule+0x2c/0x80
[  484.535977]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[  484.535984]  ? wait_woken+0x80/0x80
[  484.535993]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[  484.536002]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[  484.536014]  __uuid_write+0x59/0x150 [bcache]
[  484.536025]  ? __write_super+0x137/0x170 [bcache]
[  484.536035]  bch_uuid_write+0x16/0x40 [bcache]
[  484.536046]  __cached_dev_store+0x1d8/0x8a0 [bcache]
[  484.536057]  bch_cached_dev_store+0x39/0xc0 [bcache]
[  484.536061]  sysfs_kf_write+0x3c/0x50
[  484.536064]  kernfs_fop_write+0x125/0x1a0
[  484.536069]  __vfs_write+0x1b/0x40
[  484.536071]  vfs_write+0xb1/0x1a0
[  484.536075]  SyS_write+0x5c/0xe0
[  484.536081]  do_syscall_64+0x73/0x130
[  484.536085]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  484.536088] RIP: 0033:0x7f2ac7aa7154
[  484.536090] RSP: 002b:7157b628 EFLAGS: 0246 ORIG_RAX: 0001
[  484.536093] RAX: ffda RBX: 0008 RCX: 7f2ac7aa7154
[  484.536095] RDX: 0008 RSI: 01c96340 RDI: 0003
[  484.536096] RBP: 7f2ac7fa26c0 R08: R09:
[  484.536098] R10: 0100 R11: 0246 R12: 0003
[  484.536099] R13: R14: 01c96340 R15: 019c7db0

I'll note that I can run bcache-super-show on the device while this was
hung.

$ sudo bash
root@ubuntu:~# bcache-super-show /dev/vdg
sb.magic		ok
sb.first_sector		8 [match]
sb.csum			C71A896B52F1C486 [match]
sb.version		1 [backing device]

dev.label		osddata5
dev.uuid		11df8370-fb64-4bb0-8171-dadabb47f6b1
dev.sectors_per_block	1
dev.sectors_per_bucket	1024
dev.data.first_sector	16
dev.data.cache_mode	1 [writeback]
dev.data.cache_state	1 [clean]

cset.uuid		084505ad-5f6c-4666-9e3e-4f1650e8b015
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I've uploaded a new test kernel based on the latest bionic kernel from
master-next:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292/

In addition to that I've backported all the recent upstream bcache
fixes and applied my proposed fix for the potential deadlock in
bch_allocator_thread() (https://lkml.org/lkml/2019/7/10/241).

I've tested this kernel both on a VM and on a bare metal box, running
the test case from bug 1784665
(https://launchpadlibrarian.net/381282009/bcache-basic-repro.sh - with
some minor adjustments to match my devices). The tests have been
running for more than 1h without triggering any problem (and they're
still going).

Ryan / Chris: it would be really nice if you could do one more test
with this new kernel... and if you're still hitting issues we can try
to work on a better reproducer. Thanks again!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Escalated to Field Critical as it now happens often enough to block our
ability to test proposed product releases. We are unable to test
openstack-next at the moment because our test runs fail behind this
bug.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Tags added: cscc
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
The newer kernel went about 16 runs and then popped this:

[ 2137.810559] md: md0: resync done.
[ 2296.795633] INFO: task python3:11639 blocked for more than 120 seconds.
[ 2296.800320]       Tainted: P           O     4.15.0-55-generic #60+lp1796292+1
[ 2296.805097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2296.810071] python3         D    0 11639   8301 0x
[ 2296.810075] Call Trace:
[ 2296.810100]  __schedule+0x291/0x8a0
[ 2296.810102]  ? __switch_to_asm+0x34/0x70
[ 2296.810103]  ? __switch_to_asm+0x40/0x70
[ 2296.810105]  schedule+0x2c/0x80
[ 2296.810118]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 2296.810128]  ? wait_woken+0x80/0x80
[ 2296.810132]  __bch_bucket_alloc_set+0x10d/0x160 [bcache]
[ 2296.810137]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 2296.810143]  __uuid_write+0x59/0x170 [bcache]
[ 2296.810148]  ? __write_super+0x137/0x170 [bcache]
[ 2296.810153]  bch_uuid_write+0x16/0x40 [bcache]
[ 2296.810159]  __cached_dev_store+0x1e5/0x8b0 [bcache]
[ 2296.810160]  ? __switch_to_asm+0x34/0x70
[ 2296.810161]  ? __switch_to_asm+0x40/0x70
[ 2296.810163]  ? __switch_to_asm+0x34/0x70
[ 2296.810167]  bch_cached_dev_store+0x46/0x110 [bcache]
[ 2296.810181]  sysfs_kf_write+0x3c/0x50
[ 2296.810182]  kernfs_fop_write+0x125/0x1a0
[ 2296.810185]  __vfs_write+0x1b/0x40
[ 2296.810187]  vfs_write+0xb1/0x1a0
[ 2296.810189]  SyS_write+0x55/0xc0
[ 2296.810193]  do_syscall_64+0x73/0x130
[ 2296.810194]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2296.810196] RIP: 0033:0x7f8e80077154
[ 2296.810197] RSP: 002b:7fffc23855e8 EFLAGS: 0246 ORIG_RAX: 0001
[ 2296.810199] RAX: ffda RBX: 0008 RCX: 7f8e80077154
[ 2296.810200] RDX: 0008 RSI: 00d734b0 RDI: 0003
[ 2296.810201] RBP: 7f8e805726c0 R08: R09:
[ 2296.810202] R10: 0100 R11: 0246 R12: 0003
[ 2296.810203] R13: R14: 00d734b0 R15: 00aa4db0
[ 2417.627259] INFO: task python3:11639 blocked for more than 120 seconds.
[ 2417.632687]       Tainted: P           O     4.15.0-55-generic #60+lp1796292+1
[ 2417.638276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Andrea, thanks for the updated kernels. On the first one, I got 23
installs before I ran into an issue; I'll test the newer kernel next.

https://paste.ubuntu.com/p/2B4Kk3wbvQ/

[ 5436.870482] BUG: unable to handle kernel NULL pointer dereference at 09b8
[ 5436.873374] IP: cache_set_flush+0xf6/0x190 [bcache]
[ 5436.875208] PGD 0 P4D 0
[ 5436.876488] Oops: [#1] SMP PTI
[ 5436.877842] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) nls_utf8 isofs ppdev nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds serio_raw parport_pc parport qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 bcache psmouse nvme nvme_core virtio_blk virtio_net virtio_scsi floppy i2c_piix4 pata_acpi
[ 5436.896104] CPU: 0 PID: 11216 Comm: kworker/0:1 Tainted: P O 4.15.0-54-generic #58+lp1796292+1
[ 5436.899985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 5436.902645] Workqueue: events cache_set_flush [bcache]
[ 5436.904374] RIP: 0010:cache_set_flush+0xf6/0x190 [bcache]
[ 5436.906183] RSP: 0018:ab52826bbe58 EFLAGS: 00010202
[ 5436.909050] RAX: RBX: 94d104aa0cc0 RCX:
[ 5436.911939] RDX: RSI: 2001 RDI:
[ 5436.914448] RBP: ab52826bbe78 R08: 94d13f61ac30 R09: 94d13f342b98
[ 5436.917113] R10: ab52803b3d10 R11: 02c6 R12: 0001
[ 5436.919210] R13: 94d13f622140 R14: 94d104aa0db8 R15:
[ 5436.921401] FS: () GS:94d13f60() knlGS:
[ 5436.923743] CS: 0010 DS: ES: CR0: 80050033
[ 5436.926299] CR2: 09b8 CR3: 38252000 CR4: 06f0
[ 5436.929093] Call Trace:
[ 5436.930818]  process_one_work+0x1de/0x410
[ 5436.932818]  worker_thread+0x32/0x410
[ 5436.935332]  kthread+0x121/0x140
[ 5436.937309]  ? process_one_work+0x410/0x410
[ 5436.939393]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 5436.941263]  ret_from_fork+0x35/0x40
[ 5436.943060] Code: b8 00 00 00 a8 02 74 c8 31 f6 4c 89 e7 e8 43 0e ff ff eb bc 66 83 bb 4c f7 ff ff 00 48 8b 83 58 ff ff ff 74 31 41 bc 01 00 00 00 <48> 8b b8 b8 09 00 00 48 85 ff 74 05 e8 f9 9d 0d d3 0f b7 8b 4c
[ 5436.950188] RIP: cache_set_flush+0xf6/0x190 [bcache] RSP: ab52826bbe58
[ 5436.952796] CR2: 09b8
[ 5436.954567] ---[ end trace b771415397e98c3d ]---
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
... and, just in case, I've uploaded also a test kernel based on the
latest bionic's master-next + a bunch of extra bcache fixes:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292+1/

If the previous kernel is still buggy it'd be nice to try also this
one.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Hi Ryan, I've uploaded a new test kernel:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/

This one is based on 4.15.0-54.58 and it addresses specifically the
bch_bucket_alloc() problem (with this patch applied:
https://lore.kernel.org/lkml/20190710093117.GA2792@xps-13/T/#u).

With this kernel I wasn't able to reproduce the hung task timeout issue
in bch_bucket_alloc() anymore. It would be great if you could repeat
your test also with this kernel. Thanks in advance!
--- curtin/util.py~ 2018-05-18 18:40:48.0 + +++ curtin/util.py 2018-10-05 09:40:06.807390367 + @@ -263,7 +263,7 @@ return _subp(*args, **kwargs) -def wait_for_removal(path, retries=[1, 3, 5, 7]): +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]): if not path: raise ValueError('wait_for_removal: missing path parameter') To manage notifications about this bug go to: https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu Cosmic)
       Status: New => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu Disco)
       Status: New => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: linux (Ubuntu Bionic)
       Status: New => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Good news! I've been able to reproduce the bch_bucket_alloc() hung task issue locally, using the test case from bug 1784665. I think we're hitting the same problem now. I'll do more tests and keep you updated.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Thanks a ton for the tests, Ryan! Well, at least the hung task timeout trace is different, so we're making some progress. With the new kernel it seems that we're stuck in bch_bucket_alloc(). I've identified other upstream fixes that could help prevent this problem. If you're willing to do a few more tests, here's a new test kernel (based on 4.15.0-54-generic plus a set of bcache upstream fixes):

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/

And, just in case, I've also applied the same set of fixes to the latest bionic master-next:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292/

Testing these two kernels should give us more information about the nature of the problem.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Changed in: curtin
     Assignee: Terry Rudd (terrykrudd) => Andrea Righi (arighi)
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Without the patch, I can reproduce the hang fairly frequently, in one or two loops, which fails in this way:

[ 1069.711956] bcache: cancel_writeback_rate_update_dwork() give up waiting for dc->writeback_write_update to quit
[ 1088.583986] INFO: task kworker/0:2:436 blocked for more than 120 seconds.
[ 1088.590330] Tainted: P O 4.15.0-54-generic #58-Ubuntu
[ 1088.595831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.598210] kworker/0:2 D0 436 2 0x8000
[ 1088.598244] Workqueue: events update_writeback_rate [bcache]
[ 1088.598246] Call Trace:
[ 1088.598255]  __schedule+0x291/0x8a0
[ 1088.598258]  ? __switch_to_asm+0x40/0x70
[ 1088.598260]  schedule+0x2c/0x80
[ 1088.598262]  schedule_preempt_disabled+0xe/0x10
[ 1088.598264]  __mutex_lock.isra.2+0x18c/0x4d0
[ 1088.598266]  ? __switch_to_asm+0x34/0x70
[ 1088.598268]  ? __switch_to_asm+0x34/0x70
[ 1088.598270]  ? __switch_to_asm+0x40/0x70
[ 1088.598272]  __mutex_lock_slowpath+0x13/0x20
[ 1088.598274]  ? __mutex_lock_slowpath+0x13/0x20
[ 1088.598276]  mutex_lock+0x2f/0x40
[ 1088.598281]  update_writeback_rate+0x98/0x2b0 [bcache]
[ 1088.598285]  process_one_work+0x1de/0x410
[ 1088.598287]  worker_thread+0x32/0x410
[ 1088.598289]  kthread+0x121/0x140
[ 1088.598291]  ? process_one_work+0x410/0x410
[ 1088.598293]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 1088.598295]  ret_from_fork+0x35/0x40
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I've set up our integration test that runs the CDO-QA bcache/ceph setup. On the updated kernel I got through 10 loops of the deployment before it stacktraced: http://paste.ubuntu.com/p/zVrtvKBfCY/

[ 3939.846908] bcache: bch_cached_dev_attach() Caching vdd as bcache5 on set 275985b3-da58-41f8-9072-958bd960b490
[ 3939.878388] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.904984] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.972715] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.059415] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.129854] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3949.257051] md: md0: resync done.
[ 4109.273558] INFO: task python3:19635 blocked for more than 120 seconds.
[ 4109.279331] Tainted: P O 4.15.0-55-generic #60+lp796292
[ 4109.284767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.288771] python3 D0 19635 16361 0x
[ 4109.288774] Call Trace:
[ 4109.288818]  __schedule+0x291/0x8a0
[ 4109.288822]  ? __switch_to_asm+0x34/0x70
[ 4109.288824]  ? __switch_to_asm+0x40/0x70
[ 4109.288826]  schedule+0x2c/0x80
[ 4109.288852]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.288866]  ? wait_woken+0x80/0x80
[ 4109.288872]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 4109.288876]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 4109.22]  __uuid_write+0x59/0x150 [bcache]
[ 4109.288895]  ? submit_bio+0x73/0x140
[ 4109.288900]  ? __write_super+0x137/0x170 [bcache]
[ 4109.288905]  bch_uuid_write+0x16/0x40 [bcache]
[ 4109.288911]  __cached_dev_store+0x1a1/0x6d0 [bcache]
[ 4109.288916]  bch_cached_dev_store+0x39/0xc0 [bcache]
[ 4109.288992]  sysfs_kf_write+0x3c/0x50
[ 4109.288998]  kernfs_fop_write+0x125/0x1a0
[ 4109.289001]  __vfs_write+0x1b/0x40
[ 4109.289003]  vfs_write+0xb1/0x1a0
[ 4109.289004]  SyS_write+0x55/0xc0
[ 4109.289010]  do_syscall_64+0x73/0x130
[ 4109.289014]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 4109.289016] RIP: 0033:0x7f8d2833e154
[ 4109.289018] RSP: 002b:7ffcda55a4e8 EFLAGS: 0246 ORIG_RAX: 0001
[ 4109.289020] RAX: ffda RBX: 0008 RCX: 7f8d2833e154
[ 4109.289021] RDX: 0008 RSI: 022b7360 RDI: 0003
[ 4109.289022] RBP: 7f8d288396c0 R08: R09:
[ 4109.289022] R10: 0100 R11: 0246 R12: 0003
[ 4109.289026] R13: R14: 022b7360 R15: 01fe8db0
[ 4109.289033] INFO: task bcache_allocato:22317 blocked for more than 120 seconds.
[ 4109.292172] Tainted: P O 4.15.0-55-generic #60+lp796292
[ 4109.295345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.298208] bcache_allocato D0 22317 2 0x8000
[ 4109.298212] Call Trace:
[ 4109.298217]  __schedule+0x291/0x8a0
[ 4109.298225]  schedule+0x2c/0x80
[ 4109.298232]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.298235]  ? wait_woken+0x80/0x80
[ 4109.298241]  bch_prio_write+0x19f/0x340 [bcache]
[ 4109.298246]  bch_allocator_thread+0x502/0xca0 [bcache]
[ 4109.298260]  kthread+0x121/0x140
[ 4109.298264]  ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
[ 4109.298274]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 4109.298277]  ret_from_fork+0x35/0x40
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This is difficult for us to test in our lab because we are using MAAS, and we hit this during MAAS deployments of nodes, so we would need MAAS images built with these kernels. Additionally, this doesn't reproduce every time; it shows up in maybe 1 in 4 test runs. It may be best to find a way to reproduce this outside of MAAS.

On Wed, Jul 3, 2019 at 11:16 AM Andrea Righi wrote:

> From a kernel perspective this big slowness on shutting down a bcache
> volume might be caused by a locking / race condition issue. If I read
> correctly this problem has been reproduced in bionic (and in xenial we
> even got a kernel oops - it looks like caused by a NULL pointer
> dereference). I would try to address these issues separately.
>
> About bionic it would be nice to test this commit (also mentioned by
> @elmo in comment #28):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60
>
> Moreover, even if we didn't get an explicit NULL pointer dereference
> with bionic, I think it would be interesting to test also the following
> fixes:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b732a248d12cbdb46999daf0bf288c011335eb
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f0ffa67349c56ea54c03ccfd1e073c990e7411e
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9951379b0ca88c95876ad9778b9099e19a95d566
>
> I've already backported all of them and applied to the latest bionic
> kernel. A test kernel is available here:
>
> https://kernel.ubuntu.com/~arighi/LP-1796292/
>
> If it doesn't cost too much it would be great to do a test with it. In
> the meantime I'll try to reproduce the problem locally. Thanks in
> advance!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
From a kernel perspective this big slowness on shutting down a bcache volume might be caused by a locking / race condition issue. If I read correctly this problem has been reproduced in bionic (and in xenial we even got a kernel oops - it looks like it was caused by a NULL pointer dereference). I would try to address these issues separately.

About bionic it would be nice to test this commit (also mentioned by @elmo in comment #28):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60

Moreover, even if we didn't get an explicit NULL pointer dereference with bionic, I think it would be interesting to test the following fixes as well:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b732a248d12cbdb46999daf0bf288c011335eb
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f0ffa67349c56ea54c03ccfd1e073c990e7411e
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9951379b0ca88c95876ad9778b9099e19a95d566

I've already backported all of them and applied them to the latest bionic kernel. A test kernel is available here:

https://kernel.ubuntu.com/~arighi/LP-1796292/

If it doesn't cost too much it would be great to do a test with it. In the meantime I'll try to reproduce the problem locally. Thanks in advance!
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Eoan)
   Importance: Undecided
       Status: Confirmed

** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
       Status: New
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
The Canonical kernel team has this item queued on the hotlist to work on. I am assigning it to myself to accelerate the work.

** Changed in: curtin
     Assignee: (unassigned) => Terry Rudd (terrykrudd)
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Mon, Jun 3, 2019 at 2:05 PM Andrey Grebennikov < agrebennikov1...@gmail.com> wrote:

> Is there an estimate on getting this package in bionic-updates please?

We are starting an SRU of curtin this week. SRUs take at least 7 days from when they hit -proposed, possibly longer depending on test results. I should have something up in -proposed this week and we'll go from there on testing.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Is there an estimate on getting this package in bionic-updates please?
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This bug is believed to be fixed in curtin in version 19.1. If this is still a problem for you, please make a comment and set the state back to New. Thank you.

** Changed in: curtin
   Status: New => Fix Released
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This script *should* trigger the issue on Bionic GA:
https://pastebin.ubuntu.com/p/WdKGbMWnM6/

Try it with both GA and HWE Bionic; the commit on HWE should clear it up.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I was looking into kernel commits and came across this one:
https://github.com/torvalds/linux/commit/fadd94e05c02afec7b70b0b14915624f1782f578

As far as I understand, it deals with a deadlock caused by manually detaching a device during writeback clean-up. The timeline also makes sense for the Bionic GA kernel: Bionic GA should not include this fix, but HWE should. Could we run the tests again, but focusing on Bionic HWE? AFAIR HWE runs 4.18, which should include this commit.
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
** Tags added: cdo-qa foundations-engine
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
@jhobbs Here is the script that cleans up bcache devices on recommission:
https://pastebin.ubuntu.com/p/6WCGvM4Q32/
Re: [Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
On Wed, May 8, 2019 at 11:55 PM Trent Lloyd wrote:
> I have been running into this (curtin 18.1-17-gae48e86f-0ubuntu1~16.04.1)
>
> I think this commit basically agrees with my thoughts but I just wanted
> to share them explicitly in case they are interesting
>
> (1) If you *unregister* the cache device from the backing device, it
> first has to purge all the dirty data back to the backing device. This
> may obviously take a while.
>
> (2) When doing that, I managed to deadlock bcache at least once on
> xenial-hwe 4.15 where it was trying to reclaim memory from XFS, which I
> assume was trying to write to the bcache. Traceback:
> https://pastebin.canonical.com/117528/ - you can't get out of that
> without a reboot

Thanks for capturing those; I've quite a few of my own as well. The unregister path _should_ work, but doesn't, due to various bugs in bcache. I need to attach those oopses to this bug as well.

> (3) However generally I had good luck simply "stop"ing the cache
> devices (it seems perhaps that is what this bug is designed to do,
> switch to stop, instead of unregister?). Specifically though I was
> stopping the backing devices, and then later the cache device. It seems
> like the current commit is the other way around?

Unregister is just not stable, so stopping is what is being done now. I did attempt stopping the bcache devices first and, only once all bcache devices were stopped, then stopping and removing the cacheset; this proved unreliable under our integration testing of various bcache scenarios.

> ** Tags added: sts
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Xenial GA kernel bcache unregister oops: http://paste.ubuntu.com/p/BzfHFjzZ8y/
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
I have been running into this (curtin 18.1-17-gae48e86f-0ubuntu1~16.04.1)

I think this commit basically agrees with my thoughts, but I just wanted to share them explicitly in case they are interesting:

(1) If you *unregister* the cache device from the backing device, it first has to purge all the dirty data back to the backing device. This may obviously take a while.

(2) When doing that, I managed to deadlock bcache at least once on xenial-hwe 4.15, where it was trying to reclaim memory from XFS, which I assume was trying to write to the bcache. Traceback: https://pastebin.canonical.com/117528/ - you can't get out of that without a reboot.

(3) However, generally I had good luck simply "stop"ing the cache devices (it seems perhaps that is what this bug is designed to do: switch to stop instead of unregister?). Specifically, though, I was stopping the backing devices first, and then later the cache device. It seems like the current commit is the other way around?

** Tags added: sts
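The stop-rather-than-unregister approach discussed in this thread boils down to writing '1' to the bcache sysfs `stop` files. A hedged sketch follows: the sysfs layout is the standard bcache one (`/sys/block/bcache*/bcache/stop` for backing devices, `/sys/fs/bcache/<set-uuid>/stop` for cache sets), the ordering here (backing devices before cache sets) follows Trent's description rather than what curtin finally settled on, and the `sysfs` parameter is added here purely for testability:

```python
import glob
import os


def stop_bcache(sysfs='/sys'):
    """Stop all bcache backing devices, then all cache sets, by writing
    '1' to their sysfs 'stop' files. Illustrative sketch only; on a
    real system this requires root and the standard bcache sysfs layout.
    The `sysfs` parameter (an assumption, not a curtin API) lets the
    sketch be exercised against a fake directory tree."""
    # Backing devices first: /sys/block/bcache*/bcache/stop
    for stop in glob.glob(os.path.join(sysfs, 'block/bcache*/bcache/stop')):
        with open(stop, 'w') as f:
            f.write('1')
    # Then the cache sets: /sys/fs/bcache/<uuid>/stop
    for stop in glob.glob(os.path.join(sysfs, 'fs/bcache/*/stop')):
        with open(stop, 'w') as f:
            f.write('1')
```

Per Ryan's reply above, this exact ordering (devices first, cache set last) proved unreliable in curtin's integration testing, so treat it as one of the options that was tried, not the final answer.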
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
This occurs on a target machine during a MAAS install. Apport data is not collected in this case.

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed
[Kernel-packages] [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures
Adding the linux package as affected.

** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New