** Description changed:

- Hello,
+ SRU Justification:
  
- An upstream bug has been opened on the matter for quite some time now [0].
- I can reproduce easily on our production compute node instance, which are 
trusty host with xenial hwe kernels (4.15.0-101-generic).
- However due to heavy backport and such, doing real tracing is a bit hard 
there.
+ [Impact]
+ This bug in bcache [insert correct area] affects I/O performance on all 
versions of the kernel [correct versions affected]. It is particularly negative 
on ceph if used with bcache.
  
- I was able to reproduce the behavior on a hwe-bionic kernel as well.
- Since most of our critical deployments use bcache, I think this is a kinda 
nasty bug to have.
+ Write I/O latency would suddenly go to around 1 second from around 10 ms
+ when hitting this issue and would easily be stuck there for hours or
+ even days, especially bad for ceph on bcache architecture. This would
+ make ceph extremely slow and make the entire cloud almost unusable.
  
- Reproducing the issue is relatively easy with the script provided in the bug 
[1].
- The script used to capture the stats is this one [2].
+ The root cause is that the dirty buckets had reached the 70 percent
+ threshold, causing all writes to go directly to the backing HDD
+ device. That might be fine if there actually were a lot of dirty data,
+ but this happens when dirty data has not even reached 10 percent, due
+ to high memory fragmentation. What makes it worse is that the
+ writeback rate might still be at the minimum value (8) because
+ writeback_percent has not been reached, so it takes ages for bcache to
+ reclaim enough dirty buckets to get itself out of this situation.
  
- [0]: https://bugzilla.kernel.org/show_bug.cgi?id=206767
- [1]: https://pastebin.ubuntu.com/p/YnnvvSRhXK/
- [2]: https://pastebin.ubuntu.com/p/XfVpzg32sN/
+ [Fix]
+ 
+ * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the
+ fragmentation when update the writeback rate”
+ 
+ The current way of calculating the writeback rate considers only the
+ dirty sectors.
+ This usually works fine when memory fragmentation is not high, but it
+ gives an unreasonably low writeback rate when a few dirty sectors have
+ consumed a lot of dirty buckets. In some cases, the dirty buckets
+ reached CUTOFF_WRITEBACK_SYNC (i.e., stopped writeback) while the dirty
+ data (sectors) had not even reached the writeback_percent threshold
+ (i.e., started writeback). In that situation, the writeback rate stays
+ at the minimum value (8*512 = 4KB/s), so all writes are stuck in
+ non-writeback mode because of the slow writeback.
+ 
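To make the arithmetic above concrete, here is a minimal sketch. The bucket size of 1024 sectors, the fragment value of 8 and the writeback_percent value of 10 are assumptions chosen for illustration; only the minimum rate (8 sectors per second) and the 70 percent cutoff come from the description above. The function names are hypothetical and are not kernel symbols.

```python
SECTOR_BYTES = 512
MIN_RATE_SECTORS = 8         # minimum writeback rate quoted above, in sectors/s
CUTOFF_WRITEBACK_SYNC = 70   # dirty bucket percentage at which writes bypass the cache
WRITEBACK_PERCENT = 10       # assumed writeback_percent setting


def dirty_data_percent(dirty_bucket_percent, fragment):
    """Dirty data as a share of the cache when, on average, only
    1/fragment of each dirty bucket actually holds dirty sectors."""
    return dirty_bucket_percent / fragment


def seconds_per_bucket_at_min_rate(bucket_sectors, fragment):
    """Time needed to write back one average bucket's dirty sectors
    at the minimum writeback rate."""
    dirty_sectors = bucket_sectors // fragment
    return dirty_sectors / MIN_RATE_SECTORS


min_rate_bytes = MIN_RATE_SECTORS * SECTOR_BYTES
data_pct = dirty_data_percent(CUTOFF_WRITEBACK_SYNC, fragment=8)
secs = seconds_per_bucket_at_min_rate(bucket_sectors=1024, fragment=8)

print(f"minimum rate: {min_rate_bytes} bytes/s")
print(f"dirty data at the cutoff: {data_pct}% (writeback_percent = {WRITEBACK_PERCENT}%)")
print(f"time to clean one bucket at the minimum rate: {secs} s")
```

Under these assumed numbers the cutoff is hit while dirty data is still below writeback_percent, and cleaning a single bucket at the minimum rate takes many seconds, which is why the cache can stay stuck for a long time.
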
+ We accelerate the rate in 3 stages with different aggressiveness:
+ the first stage starts when the dirty bucket percentage rises above
+ BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
+ the second at BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
+ and the third at BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
+ 
+ By default the first stage tries to writeback the amount of dirty data
+ in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
+ the second stage tries to writeback the amount of dirty data in one bucket
+ in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, the third
+ stage tries to writeback the amount of dirty data in one bucket in
+ (1 / (dirty_buckets_percent - 64)) milliseconds.
+ 
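The per-stage targets above can be sketched as a small function. This is an illustration of the stated formulas, not the kernel code; the function name is hypothetical, and treating each threshold as exclusive (strictly "above") is an assumption.

```python
# Thresholds named in the description (dirty bucket percentage).
LOW, MID, HIGH = 50, 57, 64


def seconds_per_bucket(dirty_buckets_percent):
    """Target time to write back one average bucket's dirty data,
    or None when no acceleration stage applies."""
    p = dirty_buckets_percent
    if p > HIGH:
        return 1.0 / (p - HIGH) / 1000.0   # (1 / (p - 64)) milliseconds
    if p > MID:
        return 1.0 / (p - MID) * 0.1       # (1 / (p - 57)) * 100 milliseconds
    if p > LOW:
        return 1.0 / (p - LOW)             # (1 / (p - 50)) seconds
    return None


for pct in (50, 52, 58, 65):
    print(pct, seconds_per_bucket(pct))
```

The target time shrinks as the percentage moves further past each threshold, and it drops by roughly an order of magnitude or more on entering the next stage.
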
+ The initial rate at each stage can be controlled by 3 configurable
+ parameters: 
+ 
+ writeback_rate_fp_term_{low|mid|high}
+ 
+ They are by default 1, 10, 1000, chosen based on testing and production
+ data, detailed below.
+ 
+ A. The low stage is still far from the 70% threshold, so we only want
+    to give it a small push by setting the term to 1. This means the
+    initial rate will be 170 if the fragment is 6, calculated as
+    bucket_size/fragment. This rate is very small, but still much more
+    reasonable than the minimum of 8.
+    For a production bcache with a non-heavy workload, if the cache device
+    is bigger than 1 TB, it may take hours to consume 1% of the buckets,
+    so it is quite possible to reclaim enough dirty buckets in this stage
+    and thus avoid entering the next stage.
+ 
+ B. If the dirty bucket ratio did not turn around during the first stage,
+    we enter the mid stage. The mid stage needs to be more aggressive
+    than the low stage, so the initial rate is chosen to be 10 times that
+    of the low stage, which means 1700 as the initial rate if the
+    fragment is 6. This is a rate we usually see for a normal workload
+    when writeback is triggered by writeback_percent.
+ 
+ C. If the dirty bucket ratio did not turn around during the low and mid
+    stages, we enter the third stage, which is the last chance to turn
+    around and avoid the horrible cutoff writeback sync issue. Here we
+    choose to be 100 times more aggressive than the mid stage, which
+    means 170000 as the initial rate if the fragment is 6. This is also
+    inferred from a production bcache: I have one week of writeback rate
+    data from a production bcache with quite heavy workloads where,
+    again, the writeback is triggered by writeback_percent, and
+    the highest rates are around 100000 to 240000, so I believe this
+    kind of aggressiveness at this stage is reasonable for production.
+    It should also be mostly enough, because the hint is trying to reclaim
+    1000 buckets per second, and that heavy production environment
+    consumes 50 buckets per second on average in one week of data.
+ 
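The initial rates quoted in A, B and C can be reproduced with a short sketch. It assumes a bucket size of 1024 sectors (which yields the quoted 170 for a fragment of 6 with integer division) and evaluates each stage one percentage point above its threshold. The function names are hypothetical and this is not the kernel implementation.

```python
THRESHOLDS = {"low": 50, "mid": 57, "high": 64}
FP_TERMS = {"low": 1, "mid": 10, "high": 1000}   # defaults of writeback_rate_fp_term_{low|mid|high}


def buckets_per_second(stage, dirty_buckets_percent):
    """Number of average buckets the stage tries to clean per second."""
    return FP_TERMS[stage] * (dirty_buckets_percent - THRESHOLDS[stage])


def stage_rate(stage, dirty_buckets_percent, bucket_sectors=1024, fragment=6):
    """Writeback rate in sectors/s: dirty sectors per average bucket
    (bucket_size/fragment) times the buckets cleaned per second."""
    per_bucket_dirty = bucket_sectors // fragment   # 1024 // 6 = 170
    return per_bucket_dirty * buckets_per_second(stage, dirty_buckets_percent)


for stage, pct in (("low", 51), ("mid", 58), ("high", 65)):
    print(stage, stage_rate(stage, pct), buckets_per_second(stage, pct))
```

With these assumptions the high stage starts at 1000 buckets per second, well above the 50 buckets per second consumed by the heavy production workload mentioned in C.
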
+ The writeback_consider_fragment option controls whether this feature
+ is on or off; it is on by default.
+ 
+ 
+ [Test Case]
+ 
+ I have put all my testing results in the Google document below; the
+ testing clearly shows a significant performance improvement.
+ 
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
+ 
+ In addition, we built a test kernel based on bionic
+ 4.15.0-99.100 plus the patch and deployed it in a production
+ environment, an OpenStack environment with ceph on bcache as the
+ storage. It has run for more than one month without showing any issue.
+ 
+ [Regression Potential]
+ 
+ The patch only updates the writeback rate, so it won't have any impact
+ on data safety. The only potential regression I can think of is
+ that the backing device might be a bit busier after the dirty buckets
+ reach BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50% by default), since
+ the writeback rate is accelerated in this highly fragmented
+ situation, but that is because we are trying to prevent all writes
+ from hitting the writeback cutoff sync threshold.

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => dongdong tao (taodd)

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => dongdong tao (taodd)

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => dongdong tao (taodd)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1900438

Title:
  Bcache bypasse writeback on caching device with fragmentation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1900438/+subscriptions

