[Touch-packages] [Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

2024-04-25 Thread Krister Johansen
Hi Matthew,
It's been a couple months.  We'd really love to get the fix for Focal, Jammy, 
and Noble.  Any chance this could get sponsored and approved soon?

I also checked up on upstream and it appears that they're preparing a
1.47.1 release of e2fsprogs that should include this fix.  It hasn't
been tagged yet, but they're starting the process:
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=3fcbc9ffbeaa0df3dd06113b61f9b3bed4efb92e

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  Won't Fix
Status in e2fsprogs source package in Xenial:
  Won't Fix
Status in e2fsprogs source package in Bionic:
  Won't Fix
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  Won't Fix
Status in e2fsprogs source package in Mantic:
  In Progress
Status in e2fsprogs source package in Noble:
  In Progress

Bug description:
  [Impact]

  This is a long running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  $ resize2fs /dev/nvme1n1p1
  resize2fs 1.47.0 (5-Feb-2023)
  resize2fs: Superblock checksum does not match superblock while trying to open 
/dev/nvme1n1p1
  Couldn't find valid filesystem superblock.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start an c5.large instance on AWS, and attach a 60gb gp3 volume for
  use as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

     #!/usr/bin/bash
     set -euxo pipefail

     while true
     do
     parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
     sleep .5
     mkfs.ext4 /dev/nvme1n1p1
     mount -t ext4 /dev/nvme1n1p1 /mnt
     stress-ng --temp-path /mnt -D 4 &
     STRESS_PID=$!
     sleep 1
     growpart /dev/nvme1n1 1
     resize2fs /dev/nvme1n1p1
     kill $STRESS_PID
     wait $STRESS_PID
     umount /mnt
     wipefs -a /dev/nvme1n1p1
     wipefs -a /dev/nvme1n1
     done

  Test packages are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline
  or online volumes. As all cloud-images are online resized during their
  initial boot, this could have a large impact to public and private
  clouds should a regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o 
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for
   online resizes
  Link: 
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and need to be published in standard non-
  ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

2024-02-26 Thread Krister Johansen
Thanks for the update, Matthew.  We're looking forward to getting this
fix from Ubuntu.  My team patched our version the e2fsprogs package from
Focal about a year ago, before we submitted the fix upstream.  Since
that patch, we haven't had any re-occurrences of the problem.  It used
to show up about 4-5 times a day for us. Glad that the fix is working in
your tests as well.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  Won't Fix
Status in e2fsprogs source package in Xenial:
  Won't Fix
Status in e2fsprogs source package in Bionic:
  Won't Fix
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  Won't Fix
Status in e2fsprogs source package in Mantic:
  In Progress

Bug description:
  [Impact]

  This is a long running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  $ resize2fs /dev/nvme1n1p1
  resize2fs 1.47.0 (5-Feb-2023)
  resize2fs: Superblock checksum does not match superblock while trying to open 
/dev/nvme1n1p1
  Couldn't find valid filesystem superblock.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start an c5.large instance on AWS, and attach a 60gb gp3 volume for
  use as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

     #!/usr/bin/bash
     set -euxo pipefail

     while true
     do
     parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
     sleep .5
     mkfs.ext4 /dev/nvme1n1p1
     mount -t ext4 /dev/nvme1n1p1 /mnt
     stress-ng --temp-path /mnt -D 4 &
     STRESS_PID=$!
     sleep 1
     growpart /dev/nvme1n1 1
     resize2fs /dev/nvme1n1p1
     kill $STRESS_PID
     wait $STRESS_PID
     umount /mnt
     wipefs -a /dev/nvme1n1p1
     wipefs -a /dev/nvme1n1
     done

  Test packages are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline
  or online volumes. As all cloud-images are online resized during their
  initial boot, this could have a large impact to public and private
  clouds should a regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o 
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for
   online resizes
  Link: 
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and need to be published in standard non-
  ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

2024-02-02 Thread Krister Johansen
Hi Matthew,
Thanks for the update. I'm glad this finally reproduced in your environment.  I 
don't have a great explanation for why it took so much longer there.  I did 
observe that it seemed more likely to occur in us-west-2 during the 9a-5p 
window in the local timezone. The timing may be subtly affected by overall EBS 
utilization.  Just a guess, though.

Thanks again,

-K

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  Won't Fix
Status in e2fsprogs source package in Xenial:
  Won't Fix
Status in e2fsprogs source package in Bionic:
  Won't Fix
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  Won't Fix
Status in e2fsprogs source package in Mantic:
  In Progress

Bug description:
  [Impact]

  This is a long running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  $ resize2fs /dev/nvme1n1p1
  resize2fs 1.47.0 (5-Feb-2023)
  resize2fs: Superblock checksum does not match superblock while trying to open 
/dev/nvme1n1p1
  Couldn't find valid filesystem superblock.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start an c5.large instance on AWS, and attach a 60gb gp3 volume for
  use as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

     #!/usr/bin/bash
     set -euxo pipefail

     while true
     do
     parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
     sleep .5
     mkfs.ext4 /dev/nvme1n1p1
     mount -t ext4 /dev/nvme1n1p1 /mnt
     stress-ng --temp-path /mnt -D 4 &
     STRESS_PID=$!
     sleep 1
     growpart /dev/nvme1n1 1
     resize2fs /dev/nvme1n1p1
     kill $STRESS_PID
     wait $STRESS_PID
     umount /mnt
     wipefs -a /dev/nvme1n1p1
     wipefs -a /dev/nvme1n1
     done

  Test packages are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline
  or online volumes. As all cloud-images are online resized during their
  initial boot, this could have a large impact to public and private
  clouds should a regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o 
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for
   online resizes
  Link: 
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and need to be published in standard non-
  ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

2024-01-11 Thread Krister Johansen
Hi Matthew,
Thanks for the update.  I went ahead and tested your updated packages on a 
Focal, Jammy, and Noble image in EC2 this evening.  With the latest packages 
installed, I was unable to reproduce the problem on any of the three installs.  
I'm uncertain which builds were inconsistent about triggering the problem for 
you, but it might be worth noting that the version of the package after Focal 
got an additional partial fix for the superblock checksum mismatch.  In those 
cases, it'll re-try the read of the block up to 3 times before returning a 
failure.  In my previous testing, this would increase the amount of time before 
one hits the problem, but not eliminate it entirely.

Thanks again for you help with getting these patches in.  It's much
appreciated!

-K

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  Won't Fix
Status in e2fsprogs source package in Xenial:
  Won't Fix
Status in e2fsprogs source package in Bionic:
  Won't Fix
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  In Progress
Status in e2fsprogs source package in Mantic:
  In Progress

Bug description:
  [Impact]

  This is a long running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  $ resize2fs /dev/nvme1n1p1
  resize2fs 1.47.0 (5-Feb-2023)
  resize2fs: Superblock checksum does not match superblock while trying to open 
/dev/nvme1n1p1
  Couldn't find valid filesystem superblock.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start an c5.large instance on AWS, and attach a 60gb gp3 volume for
  use as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

     #!/usr/bin/bash
     set -euxo pipefail

     while true
     do
     parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
     sleep .5
     mkfs.ext4 /dev/nvme1n1p1
     mount -t ext4 /dev/nvme1n1p1 /mnt
     stress-ng --temp-path /mnt -D 4 &
     STRESS_PID=$!
     sleep 1
     growpart /dev/nvme1n1 1
     resize2fs /dev/nvme1n1p1
     kill $STRESS_PID
     wait $STRESS_PID
     umount /mnt
     wipefs -a /dev/nvme1n1p1
     wipefs -a /dev/nvme1n1
     done

  Test packages are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline
  or online volumes. As all cloud-images are online resized during their
  initial boot, this could have a large impact to public and private
  clouds should a regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o 
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for
   online resizes
  Link: 
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and need to be published in standard non-
  ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

2023-12-18 Thread Krister Johansen
@mruffel just wanted to check back to see if the instructions in the
report worked to reproduce the problem for you.  If so, do you have any
estimate when packages with the patch will be made available?  Thanks!

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  Won't Fix
Status in e2fsprogs source package in Xenial:
  Won't Fix
Status in e2fsprogs source package in Bionic:
  Won't Fix
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  In Progress
Status in e2fsprogs source package in Mantic:
  In Progress

Bug description:
  [Impact]

  This is a long running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  $ resize2fs /dev/nvme1n1p1
  resize2fs 1.47.0 (5-Feb-2023)
  resize2fs: Superblock checksum does not match superblock while trying to open 
/dev/nvme1n1p1
  Couldn't find valid filesystem superblock.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start an c5.large instance on AWS, and attach a 60gb gp3 volume for
  use as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

     #!/usr/bin/bash
     set -euxo pipefail

     while true
     do
     parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
     sleep .5
     mkfs.ext4 /dev/nvme1n1p1
     mount -t ext4 /dev/nvme1n1p1 /mnt
     stress-ng --temp-path /mnt -D 4 &
     STRESS_PID=$!
     sleep 1
     growpart /dev/nvme1n1 1
     resize2fs /dev/nvme1n1p1
     kill $STRESS_PID
     wait $STRESS_PID
     umount /mnt
     wipefs -a /dev/nvme1n1p1
     wipefs -a /dev/nvme1n1
     done

  Test packages are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline
  or online volumes. As all cloud-images are online resized during their
  initial boot, this could have a large impact to public and private
  clouds should a regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o 
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for
   online resizes
  Link: 
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and need to be published in standard non-
  ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

2023-11-16 Thread Krister Johansen
Hi,
Just wanted to check back to see if the reproducer and fix worked in your 
testing environments.  I was also curious if it were possible to share any 
plans around when an update that contains this fix might be released.  Thanks 
again.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  Won't Fix
Status in e2fsprogs source package in Xenial:
  Won't Fix
Status in e2fsprogs source package in Bionic:
  Won't Fix
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  In Progress
Status in e2fsprogs source package in Mantic:
  In Progress

Bug description:
  [Impact]

  This is a long running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  $ resize2fs /dev/nvme1n1p1
  resize2fs 1.47.0 (5-Feb-2023)
  resize2fs: Superblock checksum does not match superblock while trying to open 
/dev/nvme1n1p1
  Couldn't find valid filesystem superblock.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start an c5.large instance on AWS, and attach a 60gb gp3 volume for
  use as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

     #!/usr/bin/bash
     set -euxo pipefail

     while true
     do
     parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
     sleep .5
     mkfs.ext4 /dev/nvme1n1p1
     mount -t ext4 /dev/nvme1n1p1 /mnt
     stress-ng --temp-path /mnt -D 4 &
     STRESS_PID=$!
     sleep 1
     growpart /dev/nvme1n1 1
     resize2fs /dev/nvme1n1p1
     kill $STRESS_PID
     wait $STRESS_PID
     umount /mnt
     wipefs -a /dev/nvme1n1p1
     wipefs -a /dev/nvme1n1
     done

  Test packages are available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline
  or online volumes. As all cloud-images are online resized during their
  initial boot, this could have a large impact to public and private
  clouds should a regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o 
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for
   online resizes
  Link: 
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and need to be published in standard non-
  ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] Re: superblock checksum mismatch in resize2fs

2023-10-06 Thread Krister Johansen
Thanks for all the responses.  I'm not sure how quickly I'll be able to
get to this either, so I'm hesitant to commit to fixing myself.  That
said, if I can get time to send patches before your team gets to fixing
it, I'll do my best.

To answer the question about how frequently we see this: it was about
4-5 times a day until I applied the patches to our forked version of
e2fsprogs.

A few other things to note about what's going on here.  In 1.45.7,
e2fsprogs added some additional retries to the checksum validation path
on open:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=6338a8467564c3a0a12e9fcb08bdd748d736ac2f

I picked up this patch as well, and found that it helped a bit, but I
was still able to reproduce the problem with the reproducer that I
shared.

My team is running on the linux-aws-5.15 HWE kernel that's from jammy
but shipped to focal.  There's a kernel fix that may help with this
problem too, and it has been present since 5.10.  That said, I haven't
tested this on systems that are running 5.4.  (We don't have very many
of these anymore.)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=05c2c00f3769abb9e323fcaca70d2de0b48af7ba

The 05c2c00f3769 ("ext4: protect superblock modifications with a buffer
lock") may help to ensure that the superblock contents are always
consistent on disk, prior to the DIO read, since the directio path
writes out any dirty cached sb pages prior to issuing the read.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  superblock checksum mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  Confirmed
Status in e2fsprogs source package in Focal:
  New
Status in e2fsprogs source package in Jammy:
  New
Status in e2fsprogs source package in Mantic:
  Confirmed

Bug description:
  Hi,
  We run ext4 on EBS volumes on EC2.  During provisioning, cloud-init will 
occasionally report that resize2fs has failed due to a superblock checksum 
mismatch.  We debugged this internally, and were able to come up with the 
following reproducer:

 #!/usr/bin/bash
 set -euxo pipefail

 while true
 do
 parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
 sleep .5
 mkfs.ext4 /dev/nvme1n1p1
 mount -t ext4 /dev/nvme1n1p1 /mnt
 stress-ng --temp-path /mnt -D 4 &
 STRESS_PID=$!
 sleep 1
 growpart /dev/nvme1n1 1
 resize2fs /dev/nvme1n1p1
 kill $STRESS_PID
 wait $STRESS_PID
 umount /mnt
 wipefs -a /dev/nvme1n1p1
 wipefs -a /dev/nvme1n1
 done

  (This was on a 60gb gp3 volume attached to a c5.4xlarge)

  We were able to find a fix that works and get the patch accepted
  upstream.  The short explanation is that by switching the superblock
  read to direct io, we no longer see the problem.

  The patch is available here, but hasn't been published in a released
  version of e2fsprogs:

  
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  A longer thread with the maintainer is available here:

  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This bug report is to request that Ubuntu backport this patch to the
  versions of e2fsprogs that are in releases that are available in
  images on AWS, preferably Focal and Jammy.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 2036467] [NEW] superblock checksum mismatch in resize2fs

2023-09-18 Thread Krister Johansen
Public bug reported:

Hi,
We run ext4 on EBS volumes on EC2.  During provisioning, cloud-init will 
occasionally report that resize2fs has failed due to a superblock checksum 
mismatch.  We debugged this internally, and were able to come up with the 
following reproducer:

   #!/usr/bin/bash
   set -euxo pipefail

   while true
   do
   parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
   sleep .5
   mkfs.ext4 /dev/nvme1n1p1
   mount -t ext4 /dev/nvme1n1p1 /mnt
   stress-ng --temp-path /mnt -D 4 &
   STRESS_PID=$!
   sleep 1
   growpart /dev/nvme1n1 1
   resize2fs /dev/nvme1n1p1
   kill $STRESS_PID
   wait $STRESS_PID
   umount /mnt
   wipefs -a /dev/nvme1n1p1
   wipefs -a /dev/nvme1n1
   done

(This was on a 60gb gp3 volume attached to a c5.4xlarge)

We were able to find a fix that works and get the patch accepted
upstream.  The short explanation is that by switching the superblock
read to direct io, we no longer see the problem.

The patch is available here, but hasn't been published in a released
version of e2fsprogs:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

A longer thread with the maintainer is available here:

https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

This bug report is to request that Ubuntu backport this patch to the
versions of e2fsprogs that are in releases that are available in images
on AWS, preferably Focal and Jammy.

** Affects: e2fsprogs (Ubuntu)
 Importance: Undecided
 Status: New


** Tags: patch patch-accepted-upstream

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  superblock checksum mismatch in resize2fs

Status in e2fsprogs package in Ubuntu:
  New

Bug description:
  Hi,
  We run ext4 on EBS volumes on EC2.  During provisioning, cloud-init will 
occasionally report that resize2fs has failed due to a superblock checksum 
mismatch.  We debugged this internally, and were able to come up with the 
following reproducer:

 #!/usr/bin/bash
 set -euxo pipefail

 while true
 do
 parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
 sleep .5
 mkfs.ext4 /dev/nvme1n1p1
 mount -t ext4 /dev/nvme1n1p1 /mnt
 stress-ng --temp-path /mnt -D 4 &
 STRESS_PID=$!
 sleep 1
 growpart /dev/nvme1n1 1
 resize2fs /dev/nvme1n1p1
 kill $STRESS_PID
 wait $STRESS_PID
 umount /mnt
 wipefs -a /dev/nvme1n1p1
 wipefs -a /dev/nvme1n1
 done

  (This was on a 60gb gp3 volume attached to a c5.4xlarge)

  We were able to find a fix that works and get the patch accepted
  upstream.  The short explanation is that by switching the superblock
  read to direct io, we no longer see the problem.

  The patch is available here, but hasn't been published in a released
  version of e2fsprogs:

  
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  A longer thread with the maintainer is available here:

  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This bug report is to request that Ubuntu backport this patch to the
  versions of e2fsprogs that are in releases that are available in
  images on AWS, preferably Focal and Jammy.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/e2fsprogs/+bug/2036467/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp