Robie, I've just updated the description with more information on the
user impact (and why we should have this patch) and on the regression
potential.

I wasn't completely right in my last comment, sorry - for RAID0/Linear
arrays, the output of the array state in "mdadm --detail" may change
after the patch _even for healthy arrays_. One example is the "readonly"
state: we can write this state to the sysfs "array_state" file for a
raid0/linear array, and currently "mdadm --detail" will keep showing
the state as "clean" (arguably odd behavior, but that is how the state
output works right now). With this patch, the state will show as
"readonly".
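
To illustrate (a minimal sketch; the device name is an example,
assuming a healthy raid0 at /dev/md0, commands run as root):

    # set the array state via sysfs
    echo readonly > /sys/block/md0/md/array_state

    # current mdadm keeps reporting "clean" for raid0/linear:
    mdadm --detail /dev/md0 | grep 'State :'   # State : clean

    # with the patch, the sysfs state is mirrored instead:
    mdadm --detail /dev/md0 | grep 'State :'   # State : readonly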

I think we have two options here:

(a) We can backport the patch as is, at the risk of breaking
monitoring tools or any kind of scripts that parse "mdadm --detail"
output for raid0/linear arrays.

(b) We can introduce a conditional check for Bionic/Disco/Eoan in the
"mdadm --detail" output to keep "clean" as the array state unless it's
"active" or "broken" (the new state for failed arrays). Focal would
then get a merge from a more recent version of mdadm with the original
patch and diverge from this behavior, staying aligned with
upstream/other distros (see the example below).
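
To make option (b) concrete, a hypothetical example (device name is
illustrative): on Bionic/Disco/Eoan with the conditional check, a
healthy raid0 set to "readonly" via sysfs would keep reporting "clean":

    echo readonly > /sys/block/md0/md/array_state
    mdadm --detail /dev/md0 | grep 'State :'   # State : clean

while Focal, following upstream, would report "readonly" for the same
array - and both would report "broken" for a failed one.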

Basically, the discussion is about *when* to introduce the
user-visible change: during the current releases, or between releases
(Focal being the first to change). There's a hypothetical third
alternative that I personally dislike and would avoid, which is to
introduce such a conditional check for all Ubuntu releases from now on
and diverge from upstream/other distros forever - I don't see a point
in doing that.

Let me know your thoughts, Robie, and thanks for pointing out valid
concerns!
Cheers,
Guilherme

-- 
You received this bug notification because you are a member of STS
Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1847924

Title:
  Introduce broken state parsing to mdadm

Status in mdadm package in Ubuntu:
  In Progress
Status in mdadm source package in Bionic:
  In Progress
Status in mdadm source package in Disco:
  In Progress
Status in mdadm source package in Eoan:
  In Progress
Status in mdadm source package in Focal:
  In Progress
Status in mdadm package in Debian:
  New

Bug description:
  [Impact]

  * Currently, mounted raid0/md-linear arrays have no
  indication/warning when one or more members are removed or suffer
  from some non-recoverable error condition. The mdadm tool shows the
  "clean" state even if a member has been removed.

  * The patch proposed in this SRU addresses this issue by introducing
  a new state, "broken", which is analogous to "clean" but indicates
  that the array is not in a good/correct state. The commit, available
  upstream as 43ebc910 ("mdadm: Introduce new array state 'broken' for
  raid0/linear") [0], was extensively discussed and received a good
  amount of review/analysis from both the current mdadm maintainer and
  a former maintainer.

  * One important note here is that this patch requires a counter-part
  in the kernel to be fully functional, which was SRUed in LP: #1847773.
  It works fine/transparently without the kernel counter-part, though.

  * We had reports from users testing failed raid0 array scenarios:
  seeing 'clean' in mdadm caused confusion and didn't help them notice
  that something had gone wrong with the arrays.

  * The potential situation this patch (with its kernel counter-part)
  addresses is: a user has a mounted raid0/linear array. If one member
  fails and gets removed (either physically, like a power or firmware
  issue, or in software, like a driver-induced removal due to a
  detected failure), _without_ this patch (and its kernel counter-part)
  there's nothing to let the user know it failed, except filesystem
  errors in dmesg. Also, non-direct writes to the filesystem will
  succeed, due to how page-cache/writeback work; even running a 'sync'
  command will succeed.

  * The case described in the above bullet was tested and the writes
  to failed devices succeeded - after a reboot, the files written were
  present in the array, but corrupted. A user wouldn't notice that
  unless the writes were direct (O_DIRECT) or some checksum was
  performed on the files. With this patch (and its kernel
  counter-part), writes to such a failed raid0/linear array are
  fast-failed and the filesystem goes read-only quickly, as the
  example below illustrates.
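
  * A minimal sketch of that silent-failure window (device and mount
  point are examples; one array member has already been removed):

      # buffered writes still "succeed" - they only hit the page cache
      dd if=/dev/zero of=/mnt/md0/file bs=1M count=10

      # even sync returns success
      sync; echo $?   # prints 0

      # only a direct write hits the missing member and fails
      dd if=/dev/zero of=/mnt/md0/file bs=1M count=10 oflag=direct

      # with the patch and its kernel counter-part, even the buffered
      # write above is fast-failed and the filesystem goes read-only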

  [Test case]

  * To test this patch, create a raid0 or linear md array on Linux
  using mdadm, like: "mdadm --create /dev/md0 --level=0
  --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1";

  * Format the array using a filesystem of your choice (for example,
  ext4) and mount the array;

  * Remove one member of the array, for example using sysfs interface
  (for nvme: echo 1 > /sys/block/nvme0n1/device/device/remove, for scsi:
  echo 1 > /sys/block/sdX/device/delete);

  * Without this patch, the array state shown by "mdadm --detail" is
  "clean", even though a member is missing/failed. With the patch (and
  its kernel counter-part), it shows "broken" (see the condensed
  sequence below).

  [Regression potential]

  * There are mainly two potential regressions here: the first is the
  user-visible change introduced by this mdadm patch; the second is the
  possibility that the patch itself has some unnoticed bug.

  * For the first type of potential regression: this patch introduces
  a change in how the array state is displayed in "mdadm --detail
  <array>" output for raid0/linear arrays *only*. Currently, the tool
  shows just two states, "clean" or "active". With the patch being
  SRUed here, for raid0/linear arrays the tool reads the sysfs array
  state instead. So, for example, we could see the "readonly" state
  here for raid0/linear if the user (or some tool) changes the array
  to that state. This only affects raid0/linear; the output for other
  levels doesn't change at all.

  * Regarding potential unnoticed issues in the code, the changes are
  mainly to structs and the "detail" command: structs were extended
  with the new "broken" state, and the detail output was changed for
  raid0/linear as discussed in the previous bullet.

  * Note that we *proactively* skipped the Xenial SRU here, in order
  to prevent potential regressions - the Xenial mdadm tool lacks code
  infrastructure used by this patch, so the decision was for
  safety/stability, by only SRUing the Bionic / Disco / Eoan mdadm
  versions.

  [0]
  https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=43ebc910

  [Other info]

  As mdadm for Focal (20.04) hasn't been merged yet, this will need to
  be added there during or after the merge.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1847924/+subscriptions

-- 
Mailing list: https://launchpad.net/~sts-sponsors
Post to     : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help   : https://help.launchpad.net/ListHelp
