[Bug 1907262] [NEW] raid10: discard leads to corrupted file system

Thimo E Tue, 08 Dec 2020 06:35:45 -0800

Public bug reported:

Seems to be closely related to
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578


After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the
fstrim command triggered by fstrim.timer causes a large number of
mismatches between the two RAID10 component devices.

This bug affects several machines in our company with different HW
configurations (all using ECC RAM). Both NVMe and SATA SSDs are
affected.

How to reproduce:
 - Create a RAID10 array, an LVM volume and a filesystem on two SSDs
    mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
    pvcreate -ff -y /dev/md0
    vgcreate -f -y VolGroup /dev/md0
    lvcreate -n root    -L 100G -ay -y VolGroup
    mkfs.ext4 /dev/VolGroup/root
    mount /dev/VolGroup/root /mnt
 - Write some data, sync and delete it
    dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
    sync
    rm /mnt/data.raw
 - Check the RAID device
    echo check >/sys/block/md0/md/sync_action
 - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
    cat /sys/block/md0/md/mismatch_cnt
 - Trigger the bug
    fstrim /mnt
 - Re-Check the RAID device
    echo check >/sys/block/md0/md/sync_action
 - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the
   range of N*10000):
    cat /sys/block/md0/md/mismatch_cnt
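
The "After finishing" steps above require watching /proc/mdstat by hand. As a
minimal sketch, the wait can be scripted by polling sync_action until it reads
"idle" and then printing mismatch_cnt. The function name md_wait_and_report
and the poll interval are illustrative assumptions and not part of the
original report.

```shell
#!/bin/sh
# Poll an md array's sync_action until it reports "idle", then print
# mismatch_cnt.
#   $1: md sysfs directory, e.g. /sys/block/md0/md
#   $2: poll interval in seconds (assumed default: 5)
md_wait_and_report() {
    md_dir=$1
    interval=${2:-5}
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$interval"
    done
    cat "$md_dir/mismatch_cnt"
}

# Usage (as root), following the steps above:
#   echo check >/sys/block/md0/md/sync_action
#   md_wait_and_report /sys/block/md0/md
```

Running this once before and once after fstrim gives the two values to compare.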

After investigating this issue on several machines it *seems* that the
first drive does the trim correctly while the second one goes wild. At
least the number and severity of errors found by fsck.ext4, run from a
USB stick live session, suggest this.

To perform the single-drive evaluation, the RAID10 was started with only one
drive at a time:
  mdadm --assemble /dev/md127 /dev/nvme0n1p2
  mdadm --run /dev/md127
  fsck.ext4 -n -f /dev/VolGroup/root

  vgchange -a n /dev/VolGroup
  mdadm --stop /dev/md127

  mdadm --assemble /dev/md127 /dev/nvme1n1p2
  mdadm --run /dev/md127
  fsck.ext4 -n -f /dev/VolGroup/root

When running these fscks without -n, the directory structure seems to be
OK on the first device, while on the second device only the lost+found
folder is left.
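
The per-drive evaluation above can be sketched as one command sequence per
component device. The helper below only prints the commands so they can be
reviewed before being run as root. The function name single_drive_check_cmds
and the explicit "vgchange -a y" step are assumptions; the report does not say
whether the volume group was activated manually or automatically.

```shell
#!/bin/sh
# Print the commands for checking the filesystem through one RAID component.
#   $1: component device, e.g. /dev/nvme0n1p2
#   $2: md device (default /dev/md127, as used in the report)
# Nothing is executed here; the output is meant to be reviewed first.
single_drive_check_cmds() {
    part=$1
    md=${2:-/dev/md127}
    echo "mdadm --assemble $md $part"
    echo "mdadm --run $md"
    # Assumption: only needed if the VG is not auto-activated.
    echo "vgchange -a y VolGroup"
    echo "fsck.ext4 -n -f /dev/VolGroup/root"
    echo "vgchange -a n /dev/VolGroup"
    echo "mdadm --stop $md"
}

# Usage:
#   single_drive_check_cmds /dev/nvme0n1p2
#   single_drive_check_cmds /dev/nvme1n1p2
```

Keeping -n on fsck.ext4 leaves both components unmodified, so the two results
stay comparable.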

Side note: Another machine using HWE kernel 5.4.0-56 (previously 5.4.0-53)
seems to have a very similar issue.

Unfortunately the risk/regression assessment in the aforementioned bug
is not complete: the workaround only mitigates the issues during FS
creation. This bug, on the other hand, is triggered by a weekly service
(fstrim) and causes severe file system corruption.
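
Since the trigger is a timer that fires on its own, it may help to check
whether a machine runs one of the kernel builds named in this report. The
sketch below matches only the two builds mentioned here (4.15.0-126 and
5.4.0-56); it is not a complete list of affected kernels, and the function
name is_reported_kernel is hypothetical.

```shell
#!/bin/sh
# Return 0 if the given kernel release matches a build named in this report.
# Assumption: the list is taken only from this report and is NOT a complete
# list of affected kernels.
is_reported_kernel() {
    case $1 in
        4.15.0-126-*|5.4.0-56-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: show when fstrim.timer fires next if the running kernel matches.
#   if is_reported_kernel "$(uname -r)"; then
#       systemctl list-timers fstrim.timer
#   fi
```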

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1907262

