[Bug 1900438] Re: Bcache bypasse writeback on caching device with fragmentation

dongdong tao Sun, 14 Mar 2021 19:25:57 -0700

** Description changed:

- Hello,
+ SRU Justification:

- An upstream bug has been open on the matter for quite some time now [0].
- I can reproduce it easily on our production compute nodes, which are
- trusty hosts with xenial hwe kernels (4.15.0-101-generic).
- However, due to heavy backporting and such, doing real tracing there is a
- bit hard.
+ [Impact]
+ This bug in bcache [insert correct area] affects I/O performance on all
+ versions of the kernel [correct versions affected]. The impact is
+ particularly severe for ceph when it is used with bcache.

- I was able to reproduce the behavior on a hwe-bionic kernel as well.
- Since most of our critical deployments use bcache, I think this is a kinda
nasty bug to have.
+ When this issue is hit, write I/O latency suddenly jumps from around
+ 10 ms to around 1 second and can easily stay there for hours or even
+ days. This is especially bad for a ceph on bcache architecture: ceph
+ becomes extremely slow and the entire cloud becomes almost unusable.

- Reproducing the issue is relatively easy with the script provided in the bug
[1].
- The script used to capture the stats is this one [2].
+ The root cause is that the dirty buckets reach the 70 percent
+ threshold, which causes all writes to go directly to the backing HDD
+ device. That might be fine if the cache actually held a lot of dirty
+ data, but with high memory fragmentation this happens when the dirty
+ data has not even reached 10 percent. What makes it worse is that the
+ writeback rate might still be at its minimum value (8) because the
+ writeback percent has not been reached, so it takes ages for bcache to
+ reclaim enough dirty buckets to get itself out of this situation.
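
To make the mismatch concrete, here is a rough Python sketch that compares the share of buckets holding dirty data with the share of cache capacity that is actually dirty. The cache geometry (1,000,000 buckets of 1024 sectors) and the helper name are hypothetical; only the 70 percent cutoff and the roughly 10 percent dirty data figure come from the description above.

```python
def bucket_vs_data_usage(total_buckets, bucket_sectors, dirty_buckets, dirty_sectors):
    """Return (dirty bucket %, dirty data %, average fragmentation factor)."""
    dirty_bucket_pct = 100.0 * dirty_buckets / total_buckets
    dirty_data_pct = 100.0 * dirty_sectors / (total_buckets * bucket_sectors)
    # How many times more space the dirty buckets occupy than the dirty data needs.
    fragmentation = (dirty_buckets * bucket_sectors) / dirty_sectors
    return dirty_bucket_pct, dirty_data_pct, fragmentation


# Hypothetical cache: 1,000,000 buckets of 1024 sectors each.
TOTAL_BUCKETS = 1_000_000
BUCKET_SECTORS = 1024

buckets_pct, data_pct, frag = bucket_vs_data_usage(
    TOTAL_BUCKETS,
    BUCKET_SECTORS,
    dirty_buckets=700_000,                  # 70% of the buckets are dirty
    dirty_sectors=100_000 * BUCKET_SECTORS  # but only 10% of the capacity is dirty
)
print(f"dirty buckets: {buckets_pct:.0f}%  dirty data: {data_pct:.0f}%  "
      f"fragmentation: {frag:.1f}")
```

In this made-up case every dirty bucket is only about one seventh full of dirty data, yet the cache already sits at the bucket cutoff.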

- [0]: https://bugzilla.kernel.org/show_bug.cgi?id=206767
- [1]: https://pastebin.ubuntu.com/p/YnnvvSRhXK/
- [2]: https://pastebin.ubuntu.com/p/XfVpzg32sN/
+ [Fix]
+
+ * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the
+ fragmentation when update the writeback rate”
+
+ The current way of calculating the writeback rate only considers the dirty
+ sectors.
+ This usually works fine when memory fragmentation is not high, but it gives
+ an unreasonably low writeback rate when a few dirty sectors have consumed a
+ lot of dirty buckets. In some cases, the dirty buckets reach
+ CUTOFF_WRITEBACK_SYNC (i.e., stopped writeback) while the dirty data
+ (sectors) has not even reached the writeback_percent threshold (i.e.,
+ started writeback). In that situation, the writeback rate is still the
+ minimum value (8*512 = 4KB/s), so all writes are stuck in a non-writeback
+ mode because of the slow writeback.
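
As a back-of-the-envelope illustration of how slow the minimum rate is, the following Python sketch estimates how long it would take to write back a given amount of dirty data at 8 sectors per second. The 1 GiB figure and the helper name are hypothetical, and the estimate assumes the rate never rises above the minimum.

```python
SECTOR_BYTES = 512
MIN_RATE_SECTORS_PER_SEC = 8  # minimum writeback rate quoted in the text


def drain_seconds(dirty_bytes, rate_sectors_per_sec=MIN_RATE_SECTORS_PER_SEC):
    """Seconds needed to write back dirty_bytes at a constant rate."""
    return dirty_bytes / (rate_sectors_per_sec * SECTOR_BYTES)


ONE_GIB = 1 << 30  # hypothetical amount of dirty data
hours = drain_seconds(ONE_GIB) / 3600
print(f"1 GiB at the minimum rate: about {hours:.1f} hours")
```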
+
+ We accelerate the rate in 3 stages with different aggressiveness:
+ the first stage starts when the dirty buckets percentage rises above
+ BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
+ the second starts above BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
+ and the third starts above BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
+
+ By default the first stage tries to write back the amount of dirty data
+ in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
+ the second stage tries to write back the amount of dirty data in one bucket
+ in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
+ stage tries to write back the amount of dirty data in one bucket in
+ (1 / (dirty_buckets_percent - 64)) milliseconds.
+
+ The initial rate at each stage can be controlled by 3 configurable
+ parameters:
+
+ writeback_rate_fp_term_{low|mid|high}
+
+ They are by default 1, 10, 1000, chosen based on testing and production
+ data, detailed below.
+
+ A. The low stage is still far from the 70% threshold, so we only want
+ to give it a small push by setting the term to 1. This means the
+ initial rate will be 170 if the fragment is 6, calculated as
+ bucket_size/fragment. This rate is very small, but still much more
+ reasonable than the minimum of 8.
+ For a production bcache with a non-heavy workload, if the cache device
+ is bigger than 1 TB, it may take hours to consume 1% of the buckets,
+ so it is quite possible to reclaim enough dirty buckets in this stage
+ and thus avoid entering the next stage.
+
+ B. If the dirty buckets ratio doesn’t turn around during the first stage,
+ we enter the mid stage. The mid stage needs to be more aggressive than
+ the low stage, so the initial rate is chosen to be 10 times that of
+ the low stage, which means 1700 as the initial rate if the fragment
+ is 6. This is a rate we usually see for a normal workload when
+ writeback happens because of writeback_percent.
+
+ C. If the dirty buckets ratio doesn't turn around during the low and mid
+ stages, we enter the third stage, which is the last chance to turn
+ around and avoid the horrible cutoff writeback sync issue. Here we
+ choose a rate 100 times more aggressive than the mid stage, which
+ means 170000 as the initial rate if the fragment is 6. This is also
+ inferred from a production bcache: I've got one week's writeback rate
+ data from a production bcache which has quite heavy workloads. Again,
+ the writeback there is triggered by the writeback percent, and the
+ highest rates are around 100000 to 240000, so I believe this kind of
+ aggressiveness at this stage is reasonable for production.
+ It should also be mostly enough, because the hint is trying to reclaim
+ 1000 buckets per second, while that heavy production env consumes
+ 50 buckets per second on average in one week's data.
+
+ The writeback_consider_fragment option controls whether this feature
+ is on or off; it's on by default.
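
The staged rates described above can be sketched in Python as follows. This is an illustration of the arithmetic in this description, not the kernel code: the function name and arguments are hypothetical, and the 1024-sector bucket size is an assumption inferred from the "170 if the fragment is 6" example. The thresholds and default terms are the ones quoted in the text.

```python
# Thresholds (percent of dirty buckets) and default terms quoted in the text.
THRESHOLD_LOW, THRESHOLD_MID, THRESHOLD_HIGH = 50, 57, 64
FP_TERM_LOW, FP_TERM_MID, FP_TERM_HIGH = 1, 10, 1000


def fragment_rate(dirty_buckets_pct, bucket_sectors, fragment, consider_fragment=True):
    """Rate (sectors/s) suggested by fragmentation; 0 when the feature does not apply."""
    if not consider_fragment or dirty_buckets_pct <= THRESHOLD_LOW:
        return 0
    # Average dirty sectors per dirty bucket, i.e. bucket_size/fragment.
    dirty_per_bucket = bucket_sectors // fragment
    if dirty_buckets_pct <= THRESHOLD_MID:
        term = FP_TERM_LOW * (dirty_buckets_pct - THRESHOLD_LOW)
    elif dirty_buckets_pct <= THRESHOLD_HIGH:
        term = FP_TERM_MID * (dirty_buckets_pct - THRESHOLD_MID)
    else:
        term = FP_TERM_HIGH * (dirty_buckets_pct - THRESHOLD_HIGH)
    return dirty_per_bucket * term


# Initial rate of each stage, assuming 1024-sector buckets and a fragment of 6.
for pct in (51, 58, 65):
    print(pct, fragment_rate(pct, bucket_sectors=1024, fragment=6))
```

Under these assumptions the first percentage point of each stage yields 170, 1700 and 170000, and the rate grows linearly with each further percentage point inside a stage.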
+
+
+ [Test Case]
+
+ I’ve put all my testing results in the Google document below; the testing
+ clearly shows a significant performance improvement.
+
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
+
+ As a further test, we built a testing kernel based on bionic
+ 4.15.0-99.100 + the patch and put this kernel in a production
+ environment: an openstack environment with ceph on bcache as the
+ storage. It has run for more than one month without showing any issue.
+
+ [Regression Potential]
+
+ The patch only updates the writeback rate, so it won’t have any impact
+ on data safety. The only potential regression I can think of is that
+ the backing device might be a bit busier after the dirty buckets reach
+ BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50% by default), since the
+ writeback rate is accelerated in this highly fragmented situation.
+ That is intended: we are trying to keep all writes from hitting the
+ writeback cutoff sync threshold.


** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => dongdong tao (taodd)

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => dongdong tao (taodd)

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => dongdong tao (taodd)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1900438

Title:
  Bcache bypasse writeback on caching device with fragmentation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1900438/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
