OK, I've verified the fixes: they address the issue, and with the fix
applied I can no longer reproduce the problem using the provided
reproducer referenced in comment #3.
** Description changed:
+ == SRU Justification ==
+
+ A silent data corruption was introduced in v4.10-rc1 with commit
+ 72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7
+ with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects
+ users of O_DIRECT, in our case a KVM virtual machine with drives
+ which use qemu's "cache=none" option.
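+
+ For illustration, an affected guest drive is one configured along these
+ lines (the image path is a placeholder, not our exact setup):
+
+   qemu-system-x86_64 ... \
+     -drive file=/path/to/guest.img,format=raw,cache=none,aio=threads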
+
+ == Fix ==
+
+ Upstream commits:
+
+ 0aa69fd32a5f766e997ca8ab4723c5a1146efa8b
+ block: add a lower-level bio_add_page interface
+
+ b403ea2404889e1227812fa9657667a1deb9c694
+ block: bio_iov_iter_get_pages: fix size of last iovec
+
+ 9362dd1109f87a9d0a798fbc890cb339c171ed35
+ blkdev: __blkdev_direct_IO_simple: fix leak in error case
+
+ 17d51b10d7773e4618bcac64648f30f12d4078fb
+ block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
+
+ The first 3 patches are required for a clean application of the final
+ patch, which actually fixes this known issue.
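+
+ A minimal sketch of how the series can be applied for testing, assuming
+ a git checkout of the Bionic kernel tree with the upstream Linux
+ repository available as a remote (commit ids as listed above):
+
+   git cherry-pick 0aa69fd32a5f766e997ca8ab4723c5a1146efa8b \
+       b403ea2404889e1227812fa9657667a1deb9c694 \
+       9362dd1109f87a9d0a798fbc890cb339c171ed35 \
+       17d51b10d7773e4618bcac64648f30f12d4078fb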
+
+ == Regression Potential ==
+
+ This touches the block layer, so there is a potential risk of data
+ corruption. The fixes have been in the upstream kernel for several
+ weeks and, so far, I see no subsequent fixes required.
+
+ == Test Case ==
+
+ Build the program listed below [1] (kudos to Jan Kara), and run with:
+
+ dd if=/dev/zero of=loop.img bs=1M count=2048
+ sudo losetup /dev/loop0 loop.img
+
+ ./blkdev-dio-test /dev/loop0 0 &
+ ./blkdev-dio-test /dev/loop0 2048 &
+
+ Without the fix, one sees lost writes fairly soon. With the fix, this
+ runs without any lost-write messages.
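+
+ On an unpatched kernel the reproducer prints a "Lost write ..." line
+ and exits. On a patched kernel both instances keep running; stop them
+ (assuming they were started in the same shell as above) and detach the
+ loop device afterwards with:
+
+   kill %1 %2
+   sudo losetup -d /dev/loop0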
+
+ blkdev-dio-test.c:
+
+ #define _GNU_SOURCE
+ #include <stdio.h>
+ #include <unistd.h>
+ #include <fcntl.h>
+ #include <string.h>
+ #include <stdlib.h>
+ #include <sys/uio.h>
+
+ #define PAGE_SIZE 4096
+ #define SECT_SIZE 512
+ #define BUF_OFF (2*SECT_SIZE)
+
+ int main(int argc, char **argv)
+ {
+         int fd = open(argv[1], O_RDWR | O_DIRECT);
+         int ret;
+         char *buf;
+         loff_t off;
+         struct iovec iov[2];
+         unsigned int seq;
+
+         if (fd < 0) {
+                 perror("open");
+                 return 1;
+         }
+
+         /* Byte offset on the device, taken from the command line. */
+         off = strtol(argv[2], NULL, 10);
+
+         /* O_DIRECT requires a suitably aligned buffer. */
+         buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
+
+         /* Two 512-byte segments from non-adjacent parts of the buffer,
+          * submitted as one multi-segment pwritev(). */
+         iov[0].iov_base = buf;
+         iov[0].iov_len = SECT_SIZE;
+         iov[1].iov_base = buf + BUF_OFF;
+         iov[1].iov_len = SECT_SIZE;
+
+         seq = 0;
+         memset(buf, 0, PAGE_SIZE);
+         while (1) {
+                 /* Stamp both segments with the current sequence number. */
+                 *(unsigned int *)buf = seq;
+                 *(unsigned int *)(buf + BUF_OFF) = seq;
+                 ret = pwritev(fd, iov, 2, off);
+                 if (ret < 0) {
+                         perror("pwritev");
+                         return 1;
+                 }
+                 if (ret != 2*SECT_SIZE) {
+                         fprintf(stderr, "Short pwritev: %d\n", ret);
+                         return 1;
+                 }
+                 /* Read back the page; on disk the two segments occupy the
+                  * first two sectors at off, so the second one is found at
+                  * buf + SECT_SIZE. */
+                 ret = pread(fd, buf, PAGE_SIZE, off);
+                 if (ret < 0) {
+                         perror("pread");
+                         return 1;
+                 }
+                 if (ret != PAGE_SIZE) {
+                         fprintf(stderr, "Short read: %d\n", ret);
+                         return 1;
+                 }
+                 if (*(unsigned int *)buf != seq ||
+                     *(unsigned int *)(buf + SECT_SIZE) != seq) {
+                         printf("Lost write %u: %u %u\n", seq,
+                                *(unsigned int *)buf,
+                                *(unsigned int *)(buf + SECT_SIZE));
+                         return 1;
+                 }
+                 seq++;
+         }
+
+         return 0;
+ }
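+
+ For reference, the program can be built with any C compiler, e.g.:
+
+   gcc -O2 -Wall -o blkdev-dio-test blkdev-dio-test.c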
+
+ References:
+ [1] https://www.spinics.net/lists/linux-block/msg28507.html
+
+ ====================
+
TLDR: commit 72ecad22d9f198aafee64218512e02ffa7818671 (in v4.10)
introduced silent data corruption for O_DIRECT users; it is fixed by
17d51b10d7773e4618bcac64648f30f12d4078fb (in v4.18).
A silent data corruption was introduced in v4.10-rc1 with commit
72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7 with
commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects users of
O_DIRECT, in our case a KVM virtual machine with drives which use qemu's
"cache=none" option.
This is the commit which fixes the issue:
---------------------
commit 17d51b10d7773e4618bcac64648f30f12d4078fb
Author: Martin Wilck <[email protected]>
Date: Wed Jul 25 23:15:09 2018 +0200
    block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

    bio_iov_iter_get_pages() currently only adds pages for the next non-zero
    segment from the iov_iter to the bio. That's suboptimal for callers,
    which typically try to pin as many pages as fit into the bio. This patch
    converts the current bio_iov_iter_get_pages() into a static helper, and
    introduces a new helper that allocates as many pages as

     1) fit into the bio,
     2) are present in the iov_iter,
     3) and can be pinned by MM.

    Error is returned only if zero pages could be pinned. Because of 3), a
    zero return value doesn't necessarily mean all pages have been pinned.
    Callers that have to pin every page in the iov_iter must still call this
    function in a loop (this is currently the case).

    This change matters most for __blkdev_direct_IO_simple(), which calls
    bio_iov_iter_get_pages() only once. If it obtains less pages than
    requested, it returns a "short write" or "short read", and
    __generic_file_write_iter() falls back to buffered writes, which may
    lead to data corruption.

    Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
    Reviewed-by: Christoph Hellwig <[email protected]>
    Signed-off-by: Martin Wilck <[email protected]>
    Signed-off-by: Jens Axboe <[email protected]>
-------------------------
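
To make the mechanism concrete, here is a rough sketch (not the actual
kernel code) of the looping behaviour the final patch introduces: keep
pinning pages until the bio is full or the iov_iter is drained, and
report success as long as at least one page was pinned. The
single-segment helper below is hypothetical, standing in for the old
behaviour that was turned into a static helper:

    static int get_pages_multi_segment(struct bio *bio, struct iov_iter *iter)
    {
            int ret = 0;

            do {
                    /* hypothetical helper: the old single-segment code path */
                    ret = get_pages_one_segment(bio, iter);
            } while (!ret && iov_iter_count(iter) &&
                     bio->bi_vcnt < bio->bi_max_vecs);

            /* success if at least some pages were pinned */
            return bio->bi_vcnt ? 0 : ret;
    }
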
Since a lot of components were involved in the initial report to us (xfs,
guest kernel, guest virtio drivers, qemu, host kernel, storage system), we
had to isolate the problem. The commit above is the one which fixes the
data corruption bug. We created a reliable reproduction and tested both
with and without the patch. We also created a version of the kernel which
prints when the data-corrupting path in the kernel is triggered.
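
As an illustration only (this is not the exact patch we used, and
variable names differ between kernel versions), that instrumentation
amounts to a printk in __generic_file_write_iter() at the point where a
short O_DIRECT write falls through to the buffered path:

    /* direct write came up short: the remainder goes through the
     * page cache, which is the data-corrupting path on 4.15 */
    if (iov_iter_count(from))
            pr_warn("short O_DIRECT write: %zu bytes fall back to buffered I/O\n",
                    iov_iter_count(from));
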
> 1) The release of Ubuntu you are using, via 'lsb_release -rd' or
System -> About Ubuntu
# lsb_release -rd
Description: Ubuntu 18.04.1 LTS
Release: 18.04
> 2) The version of the package you are using, via 'apt-cache policy
pkgname' or by checking in Software Center
# apt-cache policy linux-image-4.15.0-36-generic
linux-image-4.15.0-36-generic:
  Installed: 4.15.0-36.39
  Candidate: 4.15.0-36.39
  Version table:
 *** 4.15.0-36.39 500
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
> 3) What you expected to happen
We ran a fio random-write workload over 8x 512MB files on XFS in the guest
OS, on qemu/kvm, on kernel 4.15.0-36.39-generic.
qemu-system was configured with cache=none, which means direct I/O. This is
a very common configuration.
qemu-system ran with aio=threads -- the default.
We were expecting no data corruption.
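
For reference, a fio job of the kind we ran in the guest looks roughly
like this (path and exact parameters are illustrative, not our precise
job file):

    [global]
    ; XFS mount point inside the guest (placeholder path)
    directory=/mnt/xfs
    rw=randwrite
    bs=4k
    ; 8 jobs, one 512MB file each
    size=512m
    numjobs=8

    [randwrite]
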
> 4) What happened instead
The guest filesystem was corrupted.
** Description changed:
- == SRU Justification ==
+ == SRU Justification [BIONIC] ==
** Summary changed:
- Silent corruption in Linux kernel 4.15
+ Silent data corruption in Linux kernel 4.15
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1796542
Title:
Silent data corruption in Linux kernel 4.15
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796542/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs