[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-12-14 Thread Matthew Ruffell
Hi @hloeung, these patches are available in 4.15.0-128-generic, and
5.4.0-58-generic.

They are both re-spins of 4.15.0-126-generic and 5.4.0-56-generic,
respectively.
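
To confirm a host is already on one of these kernels, the running release can be compared against the fixed version. This is a minimal sketch, assuming GNU `sort -V` is available; the helper name `kernel_at_least` is made up for illustration and is not part of any Ubuntu tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on GNU sort's -V (version sort).
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Fixed kernels named in this thread: 4.15.0-128 (Bionic), 5.4.0-58 (Focal).
running=$(uname -r)          # e.g. 4.15.0-128-generic
running=${running%-generic}  # strip the flavour suffix before comparing
if kernel_at_least "$running" "4.15.0-128"; then
    echo "running kernel $running is at or past 4.15.0-128"
else
    echo "running kernel $running predates 4.15.0-128"
fi
```

On a Focal host, compare against "5.4.0-58" instead. Other kernel flavours carry different suffixes and ABI numbers, so this comparison is only meaningful for the -generic flavour.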

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is
  enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-12-14 Thread Haw Loeung
@mruffell, that's for Bionic, but for Focal is that still
5.4.0-56-generic? Or is there a new respun kernel to include these
patches?

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-12-13 Thread Matthew Ruffell
Hi Benjamin,

The respun kernel has now landed in -updates, and is version
4.15.0-128-generic.

Please re-schedule the maintenance window for the Launchpad git server,
and re-attempt moving to the fixed kernel.

Thanks,
Matthew

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-12-02 Thread Matthew Ruffell
Hi Benjamin,

I have good news. The SRU has completed, and the new kernels have now
been released to -updates. Their versions are:

Bionic:
4.15.0-126-generic

Focal:
5.4.0-56-generic

You can go ahead and schedule that maintenance window now, to install
the latest kernel from -updates. These kernels also have full livepatch
support, which is good news for you.

Let me know how the 4.15.0-126-generic kernel goes on the Launchpad git
server, since it should perform just as well as the test kernel you are
currently running.

Thanks,
Matthew

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-12-01 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 4.15.0-126.129

---
linux (4.15.0-126.129) bionic; urgency=medium

  * bionic/linux: 4.15.0-126.129 -proposed tracker (LP: #1905305)

  * CVE-2020-4788
- SAUCE: powerpc/64s: Define MASKABLE_RELON_EXCEPTION_PSERIES_OOL
- SAUCE: powerpc/64s: move some exception handlers out of line
- powerpc/64s: flush L1D on kernel entry
- SAUCE: powerpc: Add a framework for user access tracking
- powerpc: Implement user_access_begin and friends
- powerpc: Fix __clear_user() with KUAP enabled
- powerpc/uaccess: Evaluate macro arguments once, before user access is
  allowed
- powerpc/64s: flush L1D after user accesses

linux (4.15.0-125.128) bionic; urgency=medium

  * bionic/linux: 4.15.0-125.128 -proposed tracker (LP: #1903137)

  * Update kernel packaging to support forward porting kernels (LP: #1902957)
- [Debian] Update for leader included in BACKPORT_SUFFIX

  * Avoid double newline when running insertchanges (LP: #1903293)
- [Packaging] insertchanges: avoid double newline

  * EFI: Fails when BootCurrent entry does not exist (LP: #183)
- efivarfs: Replace invalid slashes with exclamation marks in dentries.

  * CVE-2020-14351
- perf/core: Fix race in the perf_mmap_close() function

  * raid10: Block discard is very slow, causing severe delays for mkfs and
fstrim operations (LP: #1896578)
- md: add md_submit_discard_bio() for submitting discard bio
- md/raid10: extend r10bio devs to raid disks
- md/raid10: pull codes that wait for blocked dev into one function
- md/raid10: improve raid10 discard request
- md/raid10: improve discard request for far layout

  * Bionic: btrfs: kernel BUG at /build/linux-
eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
- btrfs: use offset_in_page instead of open-coding it
- btrfs: use BUG() instead of BUG_ON(1)
- btrfs: drop unnecessary offset_in_page in extent buffer helpers
- btrfs: extent_io: do extra check for extent buffer read write functions
- btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
- btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
- btrfs: ctree: check key order before merging tree blocks

  * Bionic update: upstream stable patchset 2020-11-04 (LP: #1902943)
- USB: gadget: f_ncm: Fix NDP16 datagram validation
- gpio: tc35894: fix up tc35894 interrupt configuration
- vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock
- vsock/virtio: stop workers during the .remove()
- vsock/virtio: add transport parameter to the
  virtio_transport_reset_no_sock()
- net: virtio_vsock: Enhance connection semantics
- Input: i8042 - add nopnp quirk for Acer Aspire 5 A515
- ftrace: Move RCU is watching check after recursion check
- drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_config
- drivers/net/wan/hdlc_fr: Add needed_headroom for PVC devices
- drm/sun4i: mixer: Extend regmap max_register
- net: dec: de2104x: Increase receive ring size for Tulip
- rndis_host: increase sleep time in the query-response loop
- nvme-core: get/put ctrl and transport module in nvme_dev_open/release()
- drivers/net/wan/lapbether: Make skb->protocol consistent with the header
- drivers/net/wan/hdlc: Set skb->protocol before transmitting
- mac80211: do not allow bigger VHT MPDUs than the hardware supports
- spi: fsl-espi: Only process interrupts for expected events
- nvme-fc: fail new connections to a deleted host or remote port
- pinctrl: mvebu: Fix i2c sda definition for 98DX3236
- nfs: Fix security label length not being reset
- clk: samsung: exynos4: mark 'chipid' clock as CLK_IGNORE_UNUSED
- iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate()
- i2c: cpm: Fix i2c_ram structure
- Input: trackpoint - enable Synaptics trackpoints
- random32: Restore __latent_entropy attribute on net_rand_state
- epoll: do not insert into poll queues until all sanity checks are done
- epoll: replace ->visited/visited_list with generation count
- epoll: EPOLL_CTL_ADD: close the race in decision to take fast path
- ep_create_wakeup_source(): dentry name can change under you...
- netfilter: ctnetlink: add a range check for l3/l4 protonum
- drm/syncobj: Fix drm_syncobj_handle_to_fd refcount leak
- fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.h
- Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts
- Revert "ravb: Fixed to be able to unload modules"
- fbcon: Fix global-out-of-bounds read in fbcon_get_font()
- net: wireless: nl80211: fix out-of-bounds access in nl80211_del_key()
- usermodehelper: reset umask to default before executing user process
- platform/x86: thinkpad_acpi: initialize tp_nvram_state variable
- platform/x86: thinkpad_acpi: re-initialize ACPI buffer size when reuse
- driver 

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-30 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 5.4.0-56.62

---
linux (5.4.0-56.62) focal; urgency=medium

  * focal/linux: 5.4.0-56.62 -proposed tracker (LP: #1905300)

  * CVE-2020-4788
- selftests/powerpc: rfi_flush: disable entry flush if present
- powerpc/64s: flush L1D on kernel entry
- powerpc/64s: flush L1D after user accesses
- selftests/powerpc: entry flush test

linux (5.4.0-55.61) focal; urgency=medium

  * focal/linux: 5.4.0-55.61 -proposed tracker (LP: #1903175)

  * Update kernel packaging to support forward porting kernels (LP: #1902957)
- [Debian] Update for leader included in BACKPORT_SUFFIX

  * Avoid double newline when running insertchanges (LP: #1903293)
- [Packaging] insertchanges: avoid double newline

  * EFI: Fails when BootCurrent entry does not exist (LP: #183)
- efivarfs: Replace invalid slashes with exclamation marks in dentries.

  * CVE-2020-14351
- perf/core: Fix race in the perf_mmap_close() function

  * raid10: Block discard is very slow, causing severe delays for mkfs and
fstrim operations (LP: #1896578)
- md: add md_submit_discard_bio() for submitting discard bio
- md/raid10: extend r10bio devs to raid disks
- md/raid10: pull codes that wait for blocked dev into one function
- md/raid10: improve raid10 discard request
- md/raid10: improve discard request for far layout
- dm raid: fix discard limits for raid1 and raid10
- dm raid: remove unnecessary discard limits for raid10

  * Bionic: btrfs: kernel BUG at /build/linux-
eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
- btrfs: drop unnecessary offset_in_page in extent buffer helpers
- btrfs: extent_io: do extra check for extent buffer read write functions
- btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
- btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
- btrfs: ctree: check key order before merging tree blocks

  * Ethernet no link lights after reboot (Intel i225-v 2.5G) (LP: #1902578)
- igc: Add PHY power management control

  * Undetected Data corruption in MPI workloads that use VSX for reductions on
POWER9 DD2.1 systems (LP: #1902694)
- powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load 
emulation
- selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load
  workaround

  * [20.04 FEAT] Support/enhancement of NVMe IPL (LP: #1902179)
- s390: nvme ipl
- s390: nvme reipl
- s390/ipl: support NVMe IPL kernel parameters

  * uvcvideo: add mapping for HEVC payloads (LP: #1895803)
- media: uvcvideo: Add mapping for HEVC payloads

  * Focal update: v5.4.73 upstream stable release (LP: #1902115)
- ibmveth: Switch order of ibmveth_helper calls.
- ibmveth: Identify ingress large send packets.
- ipv4: Restore flowi4_oif update before call to xfrm_lookup_route
- mlx4: handle non-napi callers to napi_poll
- net: fec: Fix phy_device lookup for phy_reset_after_clk_enable()
- net: fec: Fix PHY init after phy_reset_after_clk_enable()
- net: fix pos incrementment in ipv6_route_seq_next
- net/smc: fix valid DMBE buffer sizes
- net/tls: sendfile fails with ktls offload
- net: usb: qmi_wwan: add Cellient MPL200 card
- tipc: fix the skb_unshare() in tipc_buf_append()
- socket: fix option SO_TIMESTAMPING_NEW
- can: m_can_platform: don't call m_can_class_suspend in runtime suspend
- can: j1935: j1939_tp_tx_dat_new(): fix missing initialization of skbcnt
- net: j1939: j1939_session_fresh_new(): fix missing initialization of 
skbcnt
- net/ipv4: always honour route mtu during forwarding
- net_sched: remove a redundant goto chain check
- r8169: fix data corruption issue on RTL8402
- cxgb4: handle 4-tuple PEDIT to NAT mode translation
- binder: fix UAF when releasing todo list
- ALSA: bebob: potential info leak in hwdep_read()
- ALSA: hda/hdmi: fix incorrect locking in hdmi_pcm_close
- nvme-pci: disable the write zeros command for Intel 600P/P3100
- chelsio/chtls: fix socket lock
- chelsio/chtls: correct netdevice for vlan interface
- chelsio/chtls: correct function return and return type
- ibmvnic: save changed mac address to adapter->mac_addr
- net: ftgmac100: Fix Aspeed ast2600 TX hang issue
- net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
- net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling
  ether_setup
- net: Properly typecast int values to set sk_max_pacing_rate
- net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
- nexthop: Fix performance regression in nexthop deletion
- nfc: Ensure presence of NFC_ATTR_FIRMWARE_NAME attribute in
  nfc_genl_fw_download()
- r8169: fix operation under forced interrupt threading
- selftests: forwarding: Add missing 'rp_filter' configuration
- tcp: fix to update snd_wl1 in bulk receiver fast 

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-30 Thread Benjamin Allot
Ack !

I'll check with Launchpad team then, I think they would probably prefer
to wait for the -updates indeed.

Thanks again for your work and Dan's.

Cheers,

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-27 Thread Matthew Ruffell
Hi Benjamin,

No worries about being busy.

Now, the kernel is scheduled to be released early next week, around the
30th of November. I think at this stage it is best to wait it out and
install the kernel once it reaches -updates.

That way you will have a fixed kernel that is supported by livepatch,
and you don't have to justify a reboot twice.

I did some regression testing in my comments above, and everything looks
okay. These patches also worked great in your test kernel. We have done
the best we can to verify the kernel in the time given, so don't worry
about testing at this stage.

I'll let you know once the kernel has reached -updates, likely Monday or
Tuesday next week.

Thanks,
Matthew

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-27 Thread Benjamin Allot
Hello Matthew, sorry for the lack of response.

I'll check with Launchpad people if we can justify a reboot of the
server soon and will keep you posted !

Regards,

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-25 Thread Matthew Ruffell
Performing verification for Bionic.

Since Benjamin hasn't responded, I will try and verify the best I can.

I made an instance on AWS. I used a c5d.large instance type, and added
8GB of extra EBS storage.

I installed the latest kernel from -updates to get a performance
baseline. The kernel is 4.15.0-124-generic.

I made a bcache disk with the following.

Note, the 8GB disk was used as the cache disk, and the 50GB disk as the
backing disk. The cache is kept small to try to force frequent cache
evictions, and possibly trigger the bug.

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0 46.6G  0 disk 
nvme0n1     259:1    0    8G  0 disk 
nvme2n1     259:2    0    8G  0 disk 
└─nvme2n1p1 259:3    0    8G  0 part /

$ sudo apt install bcache-tools
$ sudo dd if=/dev/zero of=/dev/nvme0n1 bs=512 count=8
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=8
$ sudo wipefs -a /dev/nvme0n1
$ sudo wipefs -a /dev/nvme1n1
$ sudo make-bcache -C /dev/nvme0n1 -B /dev/nvme1n1
UUID:                   3f28ca5d-856b-42e9-bbb7-54cae12b5538
Set UUID:               756747bc-f27c-44ca-a9b9-dbd132722838
version:                0
nbuckets:               16384
block_size:             1
bucket_size:            1024
nr_in_set:              1
nr_this_dev:            0
first_bucket:           1
UUID:                   cc3e36fd-3694-4c50-aeac-0b79d2faab4a
Set UUID:               756747bc-f27c-44ca-a9b9-dbd132722838
version:                1
block_size:             1
data_offset:            16
$ sudo mkfs.ext4 /dev/bcache0
$ sudo mkdir /media/bcache
$ sudo mount /dev/bcache0 /media/bcache
$ echo "/dev/bcache0 /media/bcache ext4 rw 0 0" | sudo tee -a /etc/fstab
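
Before benchmarking, it can help to confirm that the cache device is attached to the backing device. The sketch below reads the `state` and `cache_mode` attributes that bcache exposes in sysfs. The helper name `bcache_summary` is hypothetical, and the default path assumes the device came up as bcache0, as it did here.

```shell
#!/bin/sh
# Hypothetical helper: print the state and cache mode of a bcache device.
# $1 is the device's sysfs directory; the default assumes the device is bcache0.
bcache_summary() {
    dir=${1:-/sys/block/bcache0/bcache}
    # "state" reads e.g. "clean" or "dirty"; "no cache" means no cache set is attached.
    echo "state: $(cat "$dir/state")"
    # "cache_mode" lists all modes with the active one in [brackets].
    echo "cache_mode: $(cat "$dir/cache_mode")"
}

# Only query sysfs when a bcache device is actually present.
if [ -d /sys/block/bcache0/bcache ]; then
    bcache_summary
fi
```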

From there, I installed fio to do some benchmarks, and to try to apply some
IO pressure to the cache.

$ sudo apt install fio

I used the following fio jobfile:

https://paste.ubuntu.com/p/RNBmXdy3zG/

It is based on the ssd test in:
https://github.com/axboe/fio/blob/master/examples/ssd-test.fio
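
The jobfile from the paste is not reproduced here. As a rough idea of what a job in the style of ssd-test.fio looks like when pointed at the bcache mount, here is an illustrative sketch. The block size, queue depth, file size and runtime are assumptions modelled on the upstream example, not the values used in this test, and the helper name `write_jobfile` is made up.

```shell
#!/bin/sh
# Write an illustrative fio jobfile. The values below are assumptions modelled
# on fio's examples/ssd-test.fio, NOT the jobfile from the paste linked above.
write_jobfile() {
    cat > "$1" << 'EOF'
[global]
bs=4k
ioengine=libaio
iodepth=4
size=10g
direct=1
runtime=60
directory=/media/bcache
filename=ssd.test.file

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
EOF
}

write_jobfile ./bcache-test.fio
# Then run it with: sudo fio ./bcache-test.fio
```

Each job ends with `stonewall`, so the four workloads run one after another and their latencies can be compared per workload between kernels.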

Running the fio job gives us the following output:

https://paste.ubuntu.com/p/ghkQcyT2sv/

Now we have the baseline, I enabled -proposed and installed
4.15.0-125-generic and rebooted.

I started the fio job again, and got the following output:

# uname -rv
4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020

https://paste.ubuntu.com/p/DSTnKvXMGZ/

If you compare the two outputs, there really isn't much difference in
latencies / read / write speeds. The bcache patches don't seem to cause
any large impacts.

I managed to set up a bcache disk, and did some IO stress tests. Things
seem to be okay.

Since we had positive test results on the test kernel on the Launchpad
git server, and the above shows we don't appear to have any regressions,
I will mark this bug as verified for Bionic.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-25 Thread Matthew Ruffell
Performing verification for Focal.

Since Benjamin hasn't responded, I will try and verify the best I can.

I made an instance on AWS. I used a c5d.large instance type, and added
8GB of extra EBS storage.

I installed the latest kernel from -updates to get a performance
baseline. The kernel is 5.4.0-54-generic.

I made a bcache disk with the following.

Note, the 8GB disk was used as the cache disk, and the 50GB disk as the
backing disk. The cache is kept small to try to force frequent cache
evictions, and possibly trigger the bug.

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme2n1     259:0    0 46.6G  0 disk 
nvme1n1     259:1    0    8G  0 disk 
nvme0n1     259:2    0    8G  0 disk 
└─nvme0n1p1 259:3    0    8G  0 part /

$ sudo apt install bcache-tools
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=8
$ sudo dd if=/dev/zero of=/dev/nvme2n1 bs=512 count=8
$ sudo wipefs -a /dev/nvme1n1
$ sudo wipefs -a /dev/nvme2n1
$ sudo make-bcache -C /dev/nvme1n1 -B /dev/nvme2n1
UUID:                   3f28ca5d-856b-42e9-bbb7-54cae12b5538
Set UUID:               756747bc-f27c-44ca-a9b9-dbd132722838
version:                0
nbuckets:               16384
block_size:             1
bucket_size:            1024
nr_in_set:              1
nr_this_dev:            0
first_bucket:           1
UUID:                   cc3e36fd-3694-4c50-aeac-0b79d2faab4a
Set UUID:               756747bc-f27c-44ca-a9b9-dbd132722838
version:                1
block_size:             1
data_offset:            16
$ sudo mkfs.ext4 /dev/bcache0
$ sudo mkdir /media/bcache
$ sudo mount /dev/bcache0 /media/bcache
$ echo "/dev/bcache0 /media/bcache ext4 rw 0 0" | sudo tee -a /etc/fstab

From there, I installed fio to do some benchmarks, and to try to apply some
IO pressure to the cache.

$ sudo apt install fio

I used the following fio jobfile:

https://paste.ubuntu.com/p/RNBmXdy3zG/

It is based on the ssd test in:
https://github.com/axboe/fio/blob/master/examples/ssd-test.fio

Running the fio job gives us the following output:

https://paste.ubuntu.com/p/HrWGNDJPfv/

Now we have the baseline, I enabled -proposed and installed
5.4.0-55-generic and rebooted.

I started the fio job again, and got the following output:

# uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

https://paste.ubuntu.com/p/pDVXnspmvs/

If you compare the two outputs, there really isn't much difference in
latencies / read / write speeds. The bcache patches don't seem to cause
any large impacts.

I managed to set up a bcache disk, and did some IO stress tests. Things
seem to be okay.

Since we had positive test results on the test kernel on the Launchpad
git server, and the above shows we don't appear to have any regressions,
I will mark this bug as verified for Focal.

** Tags removed: verification-needed-focal
** Tags added: verification-done-focal

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-22 Thread Matthew Ruffell
Hi Benjamin,

The kernel team have built the next kernel update, and they have placed
it in -proposed for verification.

The versions are 4.15.0-125-generic for Bionic, and 5.4.0-55-generic for
Focal.

Can you please schedule a maintenance window for the Launchpad git
server, to install the new kernel in -proposed, and reboot into it, so
we can verify that it fixes the problem.

Instructions to install (On a Bionic system):
Enable -proposed by running the following command to make a new
sources.list.d entry:
1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main
EOF
2) sudo apt update
3) sudo apt install linux-image-4.15.0-125-generic \
linux-modules-4.15.0-125-generic \
linux-modules-extra-4.15.0-125-generic \
linux-headers-4.15.0-125-generic \
linux-headers-4.15.0-125
4) sudo reboot
5) uname -rv
4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020
6) sudo rm /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
7) sudo apt update

If you get a different uname, you may need to adjust your grub
configuration to boot into the correct kernel. Also, since this is a
production machine, make sure you remove the -proposed software source
once you have installed the kernel.
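
The post-reboot checks in steps 5 and 6 can be scripted. This is a minimal sketch; the helper name `booted_into` is hypothetical, and the file path is the one created in step 1.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the running release ($2, defaulting to
# `uname -r`) is exactly the expected one ($1).
booted_into() {
    expected=$1
    actual=${2:-$(uname -r)}
    [ "$actual" = "$expected" ]
}

if booted_into "4.15.0-125-generic"; then
    echo "booted into the -proposed kernel"
else
    echo "still on $(uname -r); check the grub default entry"
fi

# Confirm the -proposed source added in step 1 has been removed (step 6).
if [ -e /etc/apt/sources.list.d/ubuntu-bionic-proposed.list ]; then
    echo "warning: -proposed is still enabled"
fi
```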

Let me know how this kernel performs, and if everything seems fine after
a week we will mark the Launchpad bug as verified. The timeline for
release to -updates is still set for the 30th of November, give or take
a few days if any CVEs turn up.

I believe this kernel should be live-patchable, although this may not be
the case if the kernel is respun before release. Hopefully you will only
have to schedule the maintenance window just the once.

Thanks,
Matthew

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-17 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag 'verification-needed-
focal' to 'verification-done-focal'. If the problem still exists, change
the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-11-17 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag 'verification-needed-
bionic' to 'verification-done-bionic'. If the problem still exists,
change the tag 'verification-needed-bionic' to 'verification-failed-
bionic'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: verification-needed-bionic

** Tags added: verification-needed-focal

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-10-22 Thread Ian
** Changed in: linux (Ubuntu Bionic)
   Status: In Progress => Fix Committed

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-10-22 Thread Ian
** Changed in: linux (Ubuntu Focal)
   Status: In Progress => Fix Committed

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-10-20 Thread Matthew Ruffell
Patches have been submitted to the Ubuntu kernel mailing list:

Patchset for Bionic:
https://lists.ubuntu.com/archives/kernel-team/2020-October/114166.html
https://lists.ubuntu.com/archives/kernel-team/2020-October/114167.html
https://lists.ubuntu.com/archives/kernel-team/2020-October/114168.html
https://lists.ubuntu.com/archives/kernel-team/2020-October/114169.html

Patchset for Focal:
https://lists.ubuntu.com/archives/kernel-team/2020-October/114170.html
https://lists.ubuntu.com/archives/kernel-team/2020-October/114171.html
https://lists.ubuntu.com/archives/kernel-team/2020-October/114172.html
https://lists.ubuntu.com/archives/kernel-team/2020-October/114173.html

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-10-20 Thread Matthew Ruffell
** Description changed:

  BugLink: https://bugs.launchpad.net/bugs/1898786
  
  [Impact]
  
  Systems that utilise bcache can experience extremely high IO wait times
  when under constant IO pressure. The IO wait times seem to stay at a
  consistent 1 second, and never drop as long as the bcache shrinker is
  enabled.
  
  If you disable the shrinker, then IO wait drops significantly to normal
  levels.
  
  We did some perf analysis, and it seems we spend a huge amount of time
  in bch_mca_scan(), likely waiting for the mutex "c->bucket_lock".
  
  Looking at the recent commits in Bionic, we found the following commit
  merged in upstream 5.5-rc1 and backported to 4.15.0-87-generic through
  upstream stable:
  
  commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
  Author: Coly Li 
  Date: Wed Nov 13 16:03:24 2019 +0800
  Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
  
  It mentions in the description that:
  
  > If sc->nr_to_scan is smaller than c->btree_pages, after the above
  > calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
  > frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
  > nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
  > and releasing mutex c->bucket_lock.
  
  This seems to be what is going on here, but the above commit only
  addresses when nr is 0.
  
  From what I can see, the problems we are experiencing are when nr is 1
  or 2, and again, we just waste time in bch_mca_scan() waiting on
  c->bucket_lock, only to release c->bucket_lock since the shrinker loop
  never executes since there is no work to do.
  
  [Fix]
  
  The following commits fix the problem, and all landed in 5.6-rc1:
  
  commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
  Author: Coly Li 
  Date: Fri Jan 24 01:01:40 2020 +0800
  Subject: bcache: remove member accessed from struct btree
  Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686
  
  commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
  Author: Coly Li 
  Date: Fri Jan 24 01:01:41 2020 +0800
  Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
  
  commit e3de04469a49ee09c89e80bf821508df458ccee6
  Author: Coly Li 
  Date: Fri Jan 24 01:01:42 2020 +0800
  Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6
  
  The first commit is a dependency of the other two. It removes a
  "recently accessed" marker, which indicated that a particular cache had
  been used recently and exempted it from cache eviction. The commit
  mentions that under heavy IO, all caches end up being marked as recently
  accessed, so nothing is ever shrunk.
  
  The second commit reverses an earlier design decision to skip the first
  3 caches when shrinking. It is common for bch_mca_scan() to be called
  with nr being 1 or 2, just as 0 was common in the very first commit I
  mentioned. With nr at 1 or 2, the loop exits before reaching a cache it
  is allowed to free, so nothing happens and we waste time waiting on
  locks, as in that first commit. The fix is to try to shrink caches from
  the tail of the list instead of the beginning.
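
The difference between the two scanning strategies can be sketched with a toy model in Python. This is not the kernel code: the function names and node labels are hypothetical, and the model assumes that every node is reapable and that nr is decremented on every loop iteration, including iterations that skip a node.

```python
def scan_from_head(freeable, nr, skip=3):
    """Old strategy as described: walk from the head and never free the
    first `skip` nodes. Skipped nodes still consume the nr budget."""
    freed = []
    for position, node in enumerate(freeable, start=1):
        if nr <= 0:
            break
        if position > skip:
            freed.append(node)
        nr -= 1
    return freed


def scan_from_tail(freeable, nr):
    """New strategy as described: walk from the tail, so even a budget
    of 1 reaches a node that may be freed."""
    freed = []
    for node in reversed(freeable):
        if nr <= 0:
            break
        freed.append(node)
        nr -= 1
    return freed


if __name__ == "__main__":
    nodes = ["n0", "n1", "n2", "n3", "n4", "n5"]  # hypothetical freeable list
    for budget in (1, 2, 4):
        print(budget, scan_from_head(nodes, budget), scan_from_tail(nodes, budget))
```

With a budget of 1 or 2 the head-first scan frees nothing in this model, while the tail-first scan frees one or two nodes, which matches the behaviour change described for the second commit.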
  
  The third commit fixes a minor issue: the second commit ended up
  re-arranging the linked list c->btree_cache, which we don't want. It
  instead shrinks the cache at the end of the linked list and leaves the
  order unchanged.
  
- All commits are clean cherry picks to Bionic and Focal.
+ One minor backport / context adjustment was required in the first commit
+ for Bionic, and the rest are all clean cherry picks to Bionic and Focal.
  
  [Testcase]
  
  This is kind of hard to test, since the problem shows up in production
  environments that are under constant IO pressure, with many different
  items entering and leaving the cache.
  
  The Launchpad git server is currently suffering from this issue. It has
  been sitting at a constant IO wait of 1 second or slightly over, which
  caused slow response times and, in turn, build failures when git clones
  timed out.
  
  We installed a test kernel, which is available in the following PPA:
  
  https://launchpad.net/~mruffell/+archive/ubuntu/sf294907-test
  
  Once the test kernel had been installed, IO wait times with the shrinker
  enabled dropped to normal levels, and the git server became responsive
  again. We have been monitoring the performance of the git server and
  watching IO wait times in Grafana over the past week. Everything is
  looking good, which indicates that these patches solve the issue.
  
  [Regression Potential]
  
  If a regression were to occur, it would only affect users of bcache 

[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

2020-10-20 Thread Matthew Ruffell
** Summary changed:

- Issue with bcache bch_mca_scan causing huge IO wait
+ bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu)
   Status: Confirmed => Fix Released

** Changed in: linux (Ubuntu Bionic)
   Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
   Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Bionic)
 Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Focal)
 Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Description changed:

- Hello,
+ BugLink: https://bugs.launchpad.net/bugs/1898786
  
- In short, we faced an issue with a huge IO wait on a bionic Ubuntu 4.15.0-118.119-generic kernel.
- This is the full list of process and the kernel function they were stuck in [0].
+ [Impact]
  
- The main issue can probably be summarized by this perf reports
- * first identify that the cpu are stuck in idle because of something[1]
- * second, see what kernel function seems to stuck the process kswapd0 and kswapd1 [2].
- 
- We could see that this seems to be the mutex_lock in the bch_mca_scan
- function [3].
- 
- After running the command:
- 
-  | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-583c48d0c2b8/internal/btree_shrinker_disabled"
- 
- The server started to respond normally and the IO wait dropped
- significantly
- 
- Here is a trace of the bcache event related lock in the kernel obtained
- with some bpfcc-tools [4].
- 
- klockstat-bpfcc -c bch_ -i 5 -s 3
- 
- The trace has been run in parallel with the following command line
- 
- echo "Shrinker disabled: $(date)"; sleep 60; echo "Enabling shrinker:
- $(date)"; echo 0 | sudo tee
- /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled ; sleep
- 60; echo "Disabling shrinker: $(date)"; echo 1 | sudo tee
- /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep
- 60; echo "End of test: $(date)"
- 
- Trying to dig more, we reduced by 20 GB the memory allocated to a VM on the server.
- * The bcache btree size fluctuation seems "normal" [5]
- * I noticed that, when the shrinker was enabled, a lot of time was spent in the locks during "bch_btree_insert_node". [6]
- 
- I decided to check if one of the function called during
- bch_btree_insert_node was taking longer than usual when the shrinker was
+ Systems that utilise bcache can experience extremely high IO wait times
+ when under constant IO pressure. The IO wait times seem to stay at a
+ consistent 1 second, and never drop as long as the bcache shrinker is
  enabled.
  
- I finally found the "funclatency" tool and tried to have the same approach I had with the klockstat [7]. However, that was inconclusive. I could see there that the bch_btree_insert_node was barely called during the whole duration of the test.
- Which made me think its amount of time spent in lock is more due to another process acquiring the lock.
+ If you disable the shrinker, then IO wait drops significantly to normal
+ levels.
  
- I'm going to try to have another go with some perf/klockstat/funclatency
- focused on bch_mca_scan and the function called there.
+ We did some perf analysis, and it seems we spend a huge amount of time
+ in bch_mca_scan(), likely waiting for the mutex c->bucket_lock.
  
- Also, here are some memory related metrics [8].
+ Looking at the recent commits in Bionic, we found the following commit
+ merged in upstream 5.5-rc1 and backported to 4.15.0-87-generic through
+ upstream stable:
  
- Now another perf stacktrace with the command used [9].
- Strangely this one doesn't show any bch_mca_scan at all.
+ commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
+ Author: Coly Li 
+ Date: Wed Nov 13 16:03:24 2019 +0800
+ Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
+ Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
  
- I enabled the shrinker again, hoping to get more traces, but apparently the timeframe was not right. Not enough load to trigger the cliff resulting in a 1sec IOwait plateau.
- Which is interesting because that means that without the maximal workload, the kernel can cope with the shrinker.
+ It mentions in the description that:
  
- [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
- [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
- [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
- [3]: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
- [4]: https://pastebin.ubuntu.com/p/qhyqP35fCw/
- [5]: https://pastebin.ubuntu.com/p/McjxxqTVjn/
- [6]: https://pastebin.ubuntu.com/p/KmrnW4Ng8F/
- [7]: https://pastebin.ubuntu.com/p/fSX4c7tTFV/
- [8]: