** Description changed:

  [Impact]
  
  * Currently, mounted raid0/md-linear arrays have no indication/warning
  when one or more members are removed or suffer a non-recoverable error
  condition. The mdadm tool shows the "clean" state regardless of
  whether a member was removed.
  
  * The patch proposed in this SRU addresses this issue by introducing a
  new state, "broken", which is analogous to "clean" but indicates that
  the array is not in a good/correct state. The commit, available
  upstream as 43ebc910 ("mdadm: Introduce new array state 'broken' for
  raid0/linear") [0], was extensively discussed and received a good
  amount of review/analysis from both the current mdadm maintainer and a
  former maintainer.
  
  * One important note here is that this patch requires a counterpart in
  the kernel to be fully functional, which was SRUed in LP: #1847773. It
  works fine/transparently without the kernel counterpart, though.
+ 
+ * We had reports of users testing failed raid0 array scenarios, where
+ seeing 'clean' in mdadm caused confusion and did not help them notice
+ that something had gone wrong with the arrays.
+ 
+ * The situation this patch (with its kernel counterpart) addresses is:
+ a user has a mounted raid0/linear array. If one member fails and gets
+ removed (either physically, e.g. due to a power or firmware issue, or
+ in software, e.g. a driver-induced removal after a detected failure),
+ _without_ this patch (and its kernel counterpart) there is nothing to
+ let the user know it failed, except filesystem errors in dmesg. Also,
+ non-direct writes to the filesystem will succeed, due to how the page
+ cache and writeback work; even a 'sync' command will succeed.
+ 
+ * The case described in the above bullet was tested, and the writes to
+ the failed devices succeeded - after a reboot, the files written were
+ present in the array, but corrupted. A user wouldn't notice that
+ unless the writes were direct or some checksum was performed on the
+ files. With this patch (and its kernel counterpart), writes to such a
+ failed raid0/linear array are fast-failed and the filesystem goes
+ read-only quickly.
  
  [Test case]
  
  * To test this patch, create a raid0 or linear md array on Linux using
  mdadm, e.g.: "mdadm --create /dev/md0 --level=0 --raid-devices=2
  /dev/nvme0n1 /dev/nvme1n1";
  
  * Format the array using a FS of your choice (for example ext4) and
  mount the array;
  
  * Remove one member of the array, for example using sysfs interface (for
  nvme: echo 1 > /sys/block/nvme0n1/device/device/remove, for scsi: echo 1
  > /sys/block/sdX/device/delete);
  
  * Without this patch, the array state shown by "mdadm --detail" is
  "clean", regardless of whether a member is missing/failed.
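  
  The steps above can be consolidated into a small script (a sketch:
  the array name, mount point, and device names are assumptions - adapt
  them to your setup; it is destructive to the named disks and needs
  root):

```shell
#!/bin/sh
# Sketch of the test case above. DESTRUCTIVE to the named devices;
# run as root, e.g.: sh repro.sh /dev/nvme0n1 /dev/nvme1n1
repro() {
    d1=$1; d2=$2
    mdadm --create /dev/md0 --level=0 --raid-devices=2 "$d1" "$d2"
    mkfs.ext4 /dev/md0
    mkdir -p /mnt/md0
    mount /dev/md0 /mnt/md0
    # Drop one member via sysfs (nvme path shown; for scsi, use
    # /sys/block/sdX/device/delete instead)
    echo 1 > "/sys/block/$(basename "$d1")/device/device/remove"
    # An unpatched mdadm still reports "clean" here
    mdadm --detail /dev/md0 | grep 'State :'
}
# Do nothing unless two devices are given explicitly (safe dry run)
if [ "$#" -eq 2 ]; then repro "$1" "$2"; else echo "dry run: no devices given"; fi
```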
  
  [Regression potential]
  
- * There's not much potential regression here; we just exhibit arrays'
- state as "broken" if they have one or more missing/failed members; we
- believe the most common "issue" that could be reported from this patch
- is if an userspace tool rely on the array status as being always "clean"
- even for broken devices, then such tool may behave differently with this
- patch.
+ * There are two main potential regressions here: the first is the
+ user-visible change introduced by this mdadm patch; the second is the
+ possibility that the patch itself has some unnoticed bug.
+ 
+ * For the first type of potential regression: this patch changes how
+ the array state is displayed in the "mdadm --detail <array>" output
+ for raid0/linear arrays *only*. Currently, the tool shows just two
+ states, "clean" or "active". With the patch being SRUed here, mdadm
+ instead reads the sysfs array state for raid0/linear arrays. So, for
+ example, it could report the "readonly" state for raid0/linear if the
+ user (or some tool) changed the array to that state. This only affects
+ raid0/linear; the output for other levels doesn't change at all.
+ 
+ * Regarding potential unnoticed issues in the code: the changes are
+ mainly to structs and the "detail" command. The structs gained the new
+ "broken" state, and the detail output changed for raid0/linear as
+ discussed in the previous bullet.
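  
  As a concrete illustration of the new behavior, the sysfs file that
  the patched mdadm consults for raid0/linear arrays can be read
  directly (a sketch; "md0" is an example array name):

```shell
# Read the kernel's view of the array state straight from sysfs; this is
# the value the patched mdadm displays for raid0/linear arrays.
state_file=/sys/block/md0/md/array_state
if [ -r "$state_file" ]; then
    state=$(cat "$state_file")   # e.g. clean, active, readonly, or broken
else
    state="unknown (no md0 array on this machine)"
fi
echo "md0 array_state: $state"
```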
  
  * Note that we *proactively* skipped the Xenial SRU here, in order to
  prevent potential regressions - the Xenial mdadm tool lacks code
  infrastructure used by this patch, so the decision was made for
  safety/stability, SRUing only the Bionic / Disco / Eoan mdadm
  versions.
  
  [0]
  https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=43ebc910
  
  [Other info]
  
- As mdadm for focal hasn't been merged yet, this will need to be added
- there during or after merge.
+ As mdadm for focal (20.04) hasn't been merged yet, this will need to be
+ added there during or after merge.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1847924

Title:
  Introduce broken state parsing to mdadm

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1847924/+subscriptions
