[Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Matthew Ruffell Sun, 02 May 2021 22:16:21 -0700

** Description changed:

  BugLink: https://bugs.launchpad.net/bugs/1896578
  
  [Impact]
  
  Block discard is very slow on Raid10, so common operations that invoke
  block discard, such as mkfs and fstrim, take a very long time.
  
  For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe
  devices which support block discard, a mkfs.xfs operation on Raid 10
  takes between 8 and 11 minutes, whereas the same mkfs.xfs operation on
  Raid 0 takes 4 seconds.
  
  The bigger the devices, the longer it takes.
  
  The cause is that Raid10 currently uses a 512k chunk size, and uses this
  for the discard_max_bytes value. If we need to discard 1.9TB, the kernel
  splits the request into millions of 512k bio requests, even if the
  underlying device supports larger requests.
  
  For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at
  once:
  
  $ cat /sys/block/nvme0n1/queue/discard_max_bytes
  2199023255040
  $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
  2199023255040
  
  Where the Raid10 md device only supports 512k:
  
  $ cat /sys/block/md0/queue/discard_max_bytes
  524288
  $ cat /sys/block/md0/queue/discard_max_hw_bytes
  524288
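
To put the two limits side by side, here is a minimal sketch of the arithmetic. It assumes the block layer issues one discard bio per discard_max_bytes-sized piece and ignores alignment and merging, so it is only a rough model. The function name and the decimal 1.9TB device size are illustrative assumptions; the two limits are the sysfs values quoted above.

```python
def discard_bios_needed(discard_bytes: int, discard_max_bytes: int) -> int:
    """Rough count of discard bios when each bio is capped at discard_max_bytes."""
    if discard_bytes < 0 or discard_max_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, so a trailing partial piece counts as one bio.
    return (discard_bytes + discard_max_bytes - 1) // discard_max_bytes


# Assumption: "1.9TB" is taken as a decimal size, for illustration only.
DEVICE_BYTES = 1_900_000_000_000
MD_LIMIT = 524288            # /sys/block/md0/queue/discard_max_bytes
NVME_LIMIT = 2199023255040   # /sys/block/nvme0n1/queue/discard_max_bytes

print("bios at md0 limit: ", discard_bios_needed(DEVICE_BYTES, MD_LIMIT))
print("bios at NVMe limit:", discard_bios_needed(DEVICE_BYTES, NVME_LIMIT))
```

Under these assumptions the 512k limit needs roughly 3.6 million bios for 1.9TB, while the NVMe limit covers the same range in a single request.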
  
  If we perform a mkfs.xfs operation on the /dev/md0 array, it takes over
  11 minutes, and if we examine the stack, the process is stuck in
  blkdev_issue_discard():
  
  $ sudo cat /proc/1626/stack
  [<0>] wait_barrier+0x14c/0x230 [raid10]
  [<0>] regular_request_wait+0x39/0x150 [raid10]
  [<0>] raid10_write_request+0x11e/0x850 [raid10]
  [<0>] raid10_make_request+0xd7/0x150 [raid10]
  [<0>] md_handle_request+0x123/0x1a0
  [<0>] md_submit_bio+0xda/0x120
  [<0>] __submit_bio_noacct+0xde/0x320
  [<0>] submit_bio_noacct+0x4d/0x90
  [<0>] submit_bio+0x4f/0x1b0
  [<0>] __blkdev_issue_discard+0x154/0x290
  [<0>] blkdev_issue_discard+0x5d/0xc0
  [<0>] blk_ioctl_discard+0xc4/0x110
  [<0>] blkdev_common_ioctl+0x56c/0x840
  [<0>] blkdev_ioctl+0xeb/0x270
  [<0>] block_ioctl+0x3d/0x50
  [<0>] __x64_sys_ioctl+0x91/0xc0
  [<0>] do_syscall_64+0x38/0x90
  [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
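
A trace like the one above can also be checked mechanically. The sketch below is a hypothetical helper, not part of any existing tool: it parses lines of the form "[<0>] func+0xoff/0xsize [module]" and reports whether the task is inside the discard path. The regular expression only covers the line shape shown in this trace.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. "[<0>] wait_barrier+0x14c/0x230 [raid10]"; the module is optional.
_FRAME = re.compile(
    r"^\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[([^\]]+)\])?$"
)


def parse_stack(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (function, module) pairs, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = _FRAME.match(line.strip())
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


def in_discard_path(text: str) -> bool:
    """True if blkdev_issue_discard appears anywhere in the trace."""
    return any(func == "blkdev_issue_discard" for func, _ in parse_stack(text))


# A few frames copied from the trace above.
SAMPLE = """
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] blkdev_issue_discard+0x5d/0xc0
"""

print(parse_stack(SAMPLE)[0], in_discard_path(SAMPLE))
```

On a live system the input would be the contents of /proc/<pid>/stack, which normally requires root to read.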
  
  [Fix]
  
  Xiao Ni has developed a patchset which resolves the block discard
  performance problems. These commits have now landed in 5.10-rc1.
  
- commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
- Author: Xiao Ni <x...@redhat.com>
- Date: Tue Aug 25 13:42:59 2020 +0800
+ commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8
+ Author: Xiao Ni <x...@redhat.com>
+ Date:   Thu Feb 4 15:50:43 2021 +0800
  Subject: md: add md_submit_discard_bio() for submitting discard bio
- Link: 
https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
- 
- commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
- Author: Xiao Ni <x...@redhat.com>
- Date: Tue Aug 25 13:43:00 2020 +0800
+ Link: 
https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8
+ 
+ commit c2968285925adb97b9aa4ede94c1f1ab61ce0925
+ Author: Xiao Ni <x...@redhat.com>
+ Date:   Thu Feb 4 15:50:44 2021 +0800
  Subject: md/raid10: extend r10bio devs to raid disks
- Link: 
https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
- 
- commit f046f5d0d79cdb968f219ce249e497fd1accf484
- Author: Xiao Ni <x...@redhat.com>
- Date: Tue Aug 25 13:43:01 2020 +0800
- Subject: md/raid10: pull codes that wait for blocked dev into one function
- Link: 
https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484
- 
- commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
- Author: Xiao Ni <x...@redhat.com>
- Date: Wed Sep 2 20:00:22 2020 +0800
+ Link: 
https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925
+ 
+ commit f2e7e269a7525317752d472bb48a549780e87d22
+ Author: Xiao Ni <x...@redhat.com>
+ Date:   Thu Feb 4 15:50:45 2021 +0800
+ Subject: md/raid10: pull the code that wait for blocked dev into one function
+ Link: 
https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22
+ 
+ commit d30588b2731fb01e1616cf16c3fe79a1443e29aa
+ Author: Xiao Ni <x...@redhat.com>
+ Date:   Thu Feb 4 15:50:46 2021 +0800
  Subject: md/raid10: improve raid10 discard request
- Link: 
https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9
- 
- commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
- Author: Xiao Ni <x...@redhat.com>
- Date: Wed Sep 2 20:00:23 2020 +0800
+ Link: 
https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa
+ 
+ commit 254c271da0712ea8914f187588e0f81f7678ee2f
+ Author: Xiao Ni <x...@redhat.com>
+ Date:   Thu Feb 4 15:50:47 2021 +0800
  Subject: md/raid10: improve discard request for far layout
- Link: 
https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359
+ Link: 
https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f
  
  There is also an additional commit which is required, and which was
  merged after "md/raid10: improve raid10 discard request". The following
  commit enables Raid10 to use large discards, instead of splitting into
  many bios, since the technical hurdles have now been removed.
  
- commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
+ commit ca4a4e9a55beeb138bb06e3867f5e486da896d44
  Author: Mike Snitzer <snit...@redhat.com>
- Date: Thu Sep 24 13:14:52 2020 -0400
- Subject: dm raid: fix discard limits for raid1 and raid10
- Link: 
https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512
- 
- commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
- Author: Mike Snitzer <snit...@redhat.com>
- Date:   Thu Sep 24 16:40:12 2020 -0400
- Subject: dm raid: remove unnecessary discard limits for raid10
- Link: 
https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28
- 
- All the commits mentioned follow a similar strategy which was
- implemented in Raid0 in the below commit, which was merged in 4.12-rc2,
- which fixed block discard performance issues in Raid0:
- 
- commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
- Author: Shaohua Li <s...@fb.com>
- Date: Sun May 7 17:36:24 2017 -0700
- Subject: md/md0: optimize raid0 discard handling
- Link: 
https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0
- 
- The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels,
- with the following minor fixups:
+ Date:   Fri Apr 30 14:38:37 2021 -0400
+ Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
+ Link: 
https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44
+ 
+ The commits more or less cherry pick to the 5.11, 5.8, 5.4 and 4.15
+ kernels, with the following minor backports:
  
  1) submit_bio_noacct() needed to be renamed to generic_make_request()
  since it was recently changed in:
  
  commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
  Author: Christoph Hellwig <h...@lst.de>
  Date:   Wed Jul 1 10:59:44 2020 +0200
  Subject: block: rename generic_make_request to submit_bio_noacct
  Link: 
https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
  
  2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their
  "address of" '&' removed for one of their arguments for the 4.15 kernel,
  due to changes made in:
  
  commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
  Author: Kent Overstreet <kent.overstr...@gmail.com>
  Date:   Sun May 20 18:25:52 2018 -0400
  Subject: md: convert to bioset_init()/mempool_init()
  Link: 
https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
  
  3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1
  and raid10" and "dm raid: remove unnecessary discard limits for raid10"
  due to not having the following commit, which was merged in 5.1-rc1:
  
  commit 61697a6abd24acba941359c6268a94f4afe4a53d
  Author: Mike Snitzer <snit...@redhat.com>
  Date:   Fri Jan 18 14:19:26 2019 -0500
  Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
  Link: 
https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
  
  4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to
  bio_clone_blkcg_association() due to it changing in:
  
  commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
  Author: Dennis Zhou <den...@kernel.org>
  Date:   Wed Dec 5 12:10:35 2018 -0500
  Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
  Link: 
https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
  
  [Testcase]
  
  You will need a machine with at least 4x NVMe drives which support block
  discard. I use an i3.8xlarge instance on AWS, since it has all of these
  things.
  
  $ lsblk
  xvda    202:0    0    8G  0 disk
  └─xvda1 202:1    0    8G  0 part /
  nvme0n1 259:2    0  1.7T  0 disk
  nvme1n1 259:0    0  1.7T  0 disk
  nvme2n1 259:1    0  1.7T  0 disk
  nvme3n1 259:3    0  1.7T  0 disk
  
  Create a Raid10 array:
  
  $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  
  Format the array with XFS:
  
  $ time sudo mkfs.xfs /dev/md0
  real 11m14.734s
  
  $ sudo mkdir /mnt/disk
  $ sudo mount /dev/md0 /mnt/disk
  
  Optionally, run fstrim:
  
  $ time sudo fstrim /mnt/disk
  
  real    11m37.643s
  
  There are test kernels for 5.8, 5.4 and 4.15 available in the following
  PPA:
  
  https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test
  
  If you install a test kernel, we can see that performance dramatically
  improves:
  
  $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  
  $ time sudo mkfs.xfs /dev/md0
  real  0m4.226s
  user  0m0.020s
  sys   0m0.148s
  
  $ sudo mkdir /mnt/disk
  $ sudo mount /dev/md0 /mnt/disk
  $ time sudo fstrim /mnt/disk
  
  real  0m1.991s
  user  0m0.020s
  sys   0m0.000s
  
  The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
  from 11 minutes to 2 seconds.
  
  Performance Matrix (AWS i3.8xlarge):
  
  Kernel    | mkfs.xfs  | fstrim
  ---------------------------------
  4.15      | 7m23.449s | 7m20.678s
  5.4       | 8m23.219s | 8m23.927s
  5.8       | 2m54.990s | 8m22.010s
  4.15-test | 0m4.286s  | 0m1.657s
  5.4-test  | 0m6.075s  | 0m3.150s
  5.8-test  | 0m2.753s  | 0m2.999s
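
The speedups implied by the matrix can be derived with a short sketch like the one below. The parsing helper and the dictionary layout are illustrative assumptions; the timings are copied from the table above, pairing each stock kernel with its -test counterpart.

```python
import re

# Matches the output format of `time`, e.g. "7m23.449s" or "0m4.286s".
_DURATION = re.compile(r"^(?:(\d+)m)?(\d+(?:\.\d+)?)s$")


def parse_duration(text: str) -> float:
    """Convert a 'XmY.YYYs' string into seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def speedup(before: str, after: str) -> float:
    """How many times faster 'after' is than 'before'."""
    return parse_duration(before) / parse_duration(after)


# (stock kernel, test kernel) timings from the performance matrix.
TIMINGS = {
    "4.15": {"mkfs.xfs": ("7m23.449s", "0m4.286s"), "fstrim": ("7m20.678s", "0m1.657s")},
    "5.4": {"mkfs.xfs": ("8m23.219s", "0m6.075s"), "fstrim": ("8m23.927s", "0m3.150s")},
    "5.8": {"mkfs.xfs": ("2m54.990s", "0m2.753s"), "fstrim": ("8m22.010s", "0m2.999s")},
}

for kernel, operations in TIMINGS.items():
    for operation, (before, after) in operations.items():
        print(f"{kernel:5} {operation:9} {speedup(before, after):7.1f}x")
```

These are single runs on one instance type, so the ratios are indicative rather than a benchmark.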
  
  The test kernel also changes the discard_max_bytes to the underlying
  hardware limit:
  
  $ cat /sys/block/md0/queue/discard_max_bytes
  2199023255040
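
The same check can be scripted. The sketch below reads discard_max_bytes for a list of devices and flags any that are still capped at the 512k chunk size. The function names, the sysfs_root parameter (added so the code can be pointed at a test directory), and the default chunk size are assumptions for illustration; the sysfs path layout is the one used in the commands above.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def read_discard_limit(device: str, sysfs_root: str = "/sys/block") -> int:
    """Read <sysfs_root>/<device>/queue/discard_max_bytes as an integer."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    return int(path.read_text().strip())


def read_limits(devices: Iterable[str], sysfs_root: str = "/sys/block") -> Dict[str, Optional[int]]:
    """Map each device to its limit, or None if the device is not present."""
    limits: Dict[str, Optional[int]] = {}
    for device in devices:
        try:
            limits[device] = read_discard_limit(device, sysfs_root)
        except OSError:
            limits[device] = None
    return limits


def is_chunk_limited(limit: int, chunk_bytes: int = 512 * 1024) -> bool:
    """True if the limit is no larger than the Raid10 chunk size."""
    return limit <= chunk_bytes


# Device names as used in this bug report; missing devices are reported as None.
for name, limit in read_limits(["md0", "nvme0n1"]).items():
    status = "absent" if limit is None else ("chunk-limited" if is_chunk_limited(limit) else "large discards")
    print(name, limit, status)
```

With the values quoted in this report, md0 would show as chunk-limited (524288) on an unpatched kernel and as supporting large discards (2199023255040) on a test kernel.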
  
- [Regression Potential]
- 
- If a regression were to occur, then it would affect operations which
- would trigger block discard operations, such as mkfs and fstrim, on
- Raid10 only.
- 
- Other Raid levels would not be affected, although, I should note there
- will be a small risk of regression to Raid0, due to one of its functions
- being re-factored and split out, for use in both Raid0 and Raid10.
- 
- The changes only affect block discard, so only Raid10 arrays backed by
- SSD or NVMe devices which support block discard will be affected.
- Traditional hard disks, or SSD devices which do not support block
- discard would not be affected.
- 
- If a regression were to occur, users could work around the issue by
- running "mkfs.xfs -K <device>" which would skip block discard entirely.
+ [Where problems can occur]
+ 
+ A problem has occurred once before, with the previous revision of this
+ patchset. It is documented in bug 1907262, and in the worst case caused
+ data loss for some users, on the second and subsequent disks of the
+ array. This was due to two faults: the first was an incorrectly
+ calculated start offset for block discard on the second and subsequent
+ disks; the second was an incorrect stripe size for far layouts.
+ 
+ The kernel team was forced to revert the patches in an emergency, the
+ faulty kernel was removed from the archive, and community users were
+ urged to avoid it.
+ 
+ These bugs and a few other minor issues have now been corrected, and we
+ have been testing the new patches since mid February. The patches have
+ been tested against the testcase in bug 1907262 and do not cause the
+ disks to become corrupted.
+ 
+ The regression potential is still the same for this patchset though. If
+ a regression were to occur, it could lead to data loss on Raid10 arrays
+ backed by NVMe or SSD disks that support block discard.
+ 
+ If a regression happens, users need to disable the fstrim systemd
+ service as soon as possible, plan an emergency maintenance window, and
+ downgrade the kernel to a previous release, or upgrade to a corrected
+ kernel.


** Tags removed: verification-done-bionic verification-done-focal
verification-done-groovy

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1896578

Title:
  raid10: Block discard is very slow, causing severe delays for mkfs and
  fstrim operations

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
