[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-08-13 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 5.0.0-25.26

---
linux (5.0.0-25.26) disco; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

 -- Kleber Sacilotto de Souza  Thu, 01 Aug 2019 12:04:35 +0200

** Changed in: linux (Ubuntu Disco)
   Status: Fix Committed => Fix Released

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2019-1125

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1836806

Title:
  Two crashes on raid0 error path (during a member device removal)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Released
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [Impact]

  * During raid0 error path testing, in which one member of the array
  is removed, we noticed that on kernels since 4.18 a crash can be
  triggered, depending on whether there is I/O in flight when the
  member is removed. While debugging the issue, a second problem was
  found that can cause a different crash.

  * For the first and more relevant problem, commit cd4a4ae4683d
  ("block: don't use blocking queue entered for recursive bio submits")
  introduced the flag BIO_QUEUE_ENTERED so that BIOs that were split
  bypass the blocking queue entering routine and use the live
  non-blocking version. With md/raid0, however, BIOs have their
  underlying device changed to the physical disk (array member). If
  this physical disk is removed (or fails), we can end up with a BIO
  that has the BIO_QUEUE_ENTERED flag set and whose device was changed
  to the removed array member (before its removal); this BIO then skips
  a lot of checks in generic_make_request_checks(), triggering the
  following crash:

  BUG: unable to handle kernel NULL pointer dereference at 0155
  PGD 0 P4D 0
  Oops:  [#1] SMP PTI
  RIP: 0010:blk_throtl_bio+0x45/0x970
  [...]
  Call Trace:
   generic_make_request_checks+0x1bf/0x690
   generic_make_request+0x64/0x3f0
   raid0_make_request+0x184/0x620 [raid0]
   ? raid0_make_request+0x184/0x620 [raid0]
   md_handle_request+0x126/0x1a0
   md_make_request+0x7b/0x180
   generic_make_request+0x19e/0x3f0
   submit_bio+0x73/0x140
  [...]

  * While debugging the above issue, we rebuilt the kernel with
  CONFIG_BLK_CGROUP=n and noticed a different crash. Commit
  37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with
  device removal triggers a crash") introduced a NULL pointer
  dereference in generic_make_request() that manifests as:

  BUG: unable to handle kernel NULL pointer dereference at 0078
  PGD 0 P4D 0
  Oops:  [#1] SMP PTI
  RIP: 0010:generic_make_request+0x32b/0x400
  Call Trace:
   submit_bio+0x73/0x140
   ext4_io_submit+0x4d/0x60
   ext4_writepages+0x626/0xe90
   do_writepages+0x4b/0xe0
  [...]
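
The two crashes above can be told apart by the symbol on the RIP line. The following is a minimal sketch, not part of the original report: the function name classify_oops and its output labels are made up for illustration, and it only recognizes the two signatures quoted above.

```shell
# Hypothetical helper: reads kernel log text on stdin and reports which of
# the two crash signatures quoted in this bug the first RIP line matches.
classify_oops() {
    # Extract the symbol name from the first line such as
    # "RIP: 0010:blk_throtl_bio+0x45/0x970".
    rip=$(sed -n 's/.*RIP: [0-9a-f]*:\([A-Za-z0-9_.]*\)+.*/\1/p' | head -n 1)
    case $rip in
        blk_throtl_bio)       echo "issue1" ;;   # raid0 member removal crash
        generic_make_request) echo "issue2" ;;   # NULL dereference with CONFIG_BLK_CGROUP=n
        *)                    echo "unknown" ;;
    esac
}
```

It could be fed from the kernel log, for example with "dmesg | classify_oops". A match is only a hint; the call trace should still be compared against the ones above.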

  * For both issues, we have simple patches that are present in
  linux-stable but not in Linus' tree.

  ## For issue 1 (md removal crash):
  869eec894663 ("md/raid0: Do not bypass blocking queue entered for raid0 bios")
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=869eec894663

  ## For issue 2 (generic_make_request() NULL dereference):
  c9d8d3e9d7a0 ("block: Fix a NULL pointer dereference in generic_make_request()")
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c9d8d3e9d7a0

  The reasoning for both patches not being present in Linus' tree is
  explained in the commit messages. In summary, Ming Lei submitted a
  major clean-up series at the same time I submitted both patches, so
  it would not make sense to accept my patches only to remove the same
  code paths soon after with his clean-up. But Ming's series relies on
  the legacy I/O path removal, so it is very hard to backport. Hence the
  maintainers suggested that I submit my small fixes to the stable tree
  only.

  [Test case]

  For both cases the test is the same; the only difference is a kernel
  config option. To reproduce issue 1 (md removal crash), a regular
  Ubuntu kernel config is enough. For issue 2, a kernel rebuilt with
  CONFIG_BLK_CGROUP=n is necessary.
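
Before running the test, it may help to confirm which of the two configurations the kernel under test was built with. A minimal sketch follows; the helper name kconfig_state is hypothetical, and reading /boot/config-$(uname -r) assumes the usual Ubuntu location of the kernel config file.

```shell
# Hypothetical helper: prints "y" or "m" if the option is set in the given
# kernel config file, and "n" if it is unset or absent.
kconfig_state() {
    option=$1
    config_file=$2
    # Match only the exact option name, not longer names sharing its prefix.
    value=$(sed -n "s/^${option}=\([ym]\)\$/\1/p" "$config_file" | head -n 1)
    echo "${value:-n}"
}
```

For example, "kconfig_state CONFIG_BLK_CGROUP /boot/config-$(uname -r)" printing "y" would suggest the issue 1 scenario applies, while "n" would suggest the issue 2 scenario.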

  Steps to reproduce:

  a) Create a raid0 md array with 2 NVMe devices as members, create an
  ext4 filesystem on it, and mount it.

  b) Run the following one-liner (assuming the raid0 array is mounted at /mnt):
  (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;\
  echo 1 > 

[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-29 Thread Guilherme G. Piccoli
Verified in disco kernel 5.0.22-generic (which is available in the
-proposed pocket), using the test cases described in the patches.
Thanks,


Guilherme

** Tags removed: verification-needed-disco
** Tags added: verification-done-disco

Title:
  Two crashes on raid0 error path (during a member device removal)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Released


[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-25 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag 'verification-needed-
disco' to 'verification-done-disco'. If the problem still exists, change
the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: verification-needed-disco


[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-24 Thread Brad Figg
** Tags added: cscc


Bug description:
  [Impact]

  * During raid0 error path testing, in which one member of the array
  is removed, we noticed that on kernels since 4.18 a crash can be
  triggered, depending on whether there is I/O in flight when the
  member is removed. While debugging the issue, a second problem was
  found that can cause a different crash.

  * For the first and more relevant problem, commit cd4a4ae4683d
  ("block: don't use blocking queue entered for recursive bio submits")
  introduced the flag BIO_QUEUE_ENTERED so that BIOs that were split
  bypass the blocking queue entering routine and use the live
  non-blocking version. With md/raid0, however, BIOs have their
  underlying device changed to the physical disk (array member). If
  this physical disk is removed (or fails), we can end up with a BIO
  that has the BIO_QUEUE_ENTERED flag set and whose device was changed
  to the removed array member (before its removal); this BIO then skips
  a lot of checks in generic_make_request_checks(), triggering the
  following crash:

  BUG: unable to handle kernel NULL pointer dereference at 0155
  PGD 0 P4D 0
  Oops:  [#1] SMP PTI
  RIP: 0010:blk_throtl_bio+0x45/0x970
  [...]
  Call Trace:
   generic_make_request_checks+0x1bf/0x690
   generic_make_request+0x64/0x3f0
   raid0_make_request+0x184/0x620 [raid0]
   ? raid0_make_request+0x184/0x620 [raid0]
   md_handle_request+0x126/0x1a0
   md_make_request+0x7b/0x180
   generic_make_request+0x19e/0x3f0
   submit_bio+0x73/0x140
  [...]

  * While debugging the above issue, we rebuilt the kernel with
  CONFIG_BLK_CGROUP=n and noticed a different crash. Commit
  37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with
  device removal triggers a crash") introduced a NULL pointer
  dereference in generic_make_request() that manifests as:

  BUG: unable to handle kernel NULL pointer dereference at 0078
  PGD 0 P4D 0
  Oops:  [#1] SMP PTI
  RIP: 0010:generic_make_request+0x32b/0x400
  Call Trace:
   submit_bio+0x73/0x140
   ext4_io_submit+0x4d/0x60
   ext4_writepages+0x626/0xe90
   do_writepages+0x4b/0xe0
  [...]

  * For both issues, we have simple patches that are present in
  linux-stable but not in Linus' tree.

  ## For issue 1 (md removal crash):
  869eec894663 ("md/raid0: Do not bypass blocking queue entered for raid0 bios")
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=869eec894663

  ## For issue 2 (generic_make_request() NULL dereference):
  c9d8d3e9d7a0 ("block: Fix a NULL pointer dereference in generic_make_request()")
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c9d8d3e9d7a0

  The reasoning for both patches not being present in Linus' tree is
  explained in the commit messages. In summary, Ming Lei submitted a
  major clean-up series at the same time I submitted both patches, so
  it would not make sense to accept my patches only to remove the same
  code paths soon after with his clean-up. But Ming's series relies on
  the legacy I/O path removal, so it is very hard to backport. Hence the
  maintainers suggested that I submit my small fixes to the stable tree
  only.

  [Test case]

  For both cases the test is the same; the only difference is a kernel
  config option. To reproduce issue 1 (md removal crash), a regular
  Ubuntu kernel config is enough. For issue 2, a kernel rebuilt with
  CONFIG_BLK_CGROUP=n is necessary.

  Steps to reproduce:

  a) Create a raid0 md array with 2 NVMe devices as members, create an
  ext4 filesystem on it, and mount it.

  b) Run the following one-liner (assuming the raid0 array is mounted at /mnt):
  (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;\
  echo 1 > /sys/block/nvme1n1/device/device/remove
  (where nvme1n1 is the 2nd array member)
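
Steps (a) and (b) could be scripted roughly as below. This is a sketch under assumptions, not the exact procedure used for the report: the helper names, the md device /dev/md0, and the first member /dev/nvme0n1 are illustrative choices, and only nvme1n1 and /mnt come from the steps above. With DRY_RUN=1 (the default) the functions only print the commands; running them for real destroys data on the member devices and is expected to crash an unfixed kernel.

```shell
# Print the command when DRY_RUN=1 (default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step (a). $1, $2: member devices; $3: md device; $4: mount point.
setup_raid0() {
    run mdadm --create "$3" --level=0 --raid-devices=2 "$1" "$2"
    run mkfs.ext4 "$3"
    run mount "$3" "$4"
}

# Step (b). $1: member block device name (e.g. nvme1n1); $2: mount point.
trigger_removal() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "dd of=$2/tmp if=/dev/zero bs=1M count=999 &"
        echo "sleep 0.3"
        echo "echo 1 > /sys/block/$1/device/device/remove"
        return 0
    fi
    # Start the write in the background so I/O is in flight during removal.
    dd of="$2/tmp" if=/dev/zero bs=1M count=999 &
    sleep 0.3
    echo 1 > "/sys/block/$1/device/device/remove"
}
```

A dry run such as "setup_raid0 /dev/nvme0n1 /dev/nvme1n1 /dev/md0 /mnt" followed by "trigger_removal nvme1n1 /mnt" shows what would be executed; setting DRY_RUN=0 as root on a disposable test machine performs the actual test.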

  [Regression potential]

  The fixes are small and self-contained, and both were validated by a
  large number of subsystem maintainers (including block, raid and
  stable). Commit c9d8d3e9d7a0 was also validated by the author of the
  offending patch it fixes, and introduces no functional change. Commit
  869eec894663 is scoped to the raid0 driver only, and makes raid0 fall
  back to its behavior from before the introduction of the
  BIO_QUEUE_ENTERED flag (which in fact increases the number of checks
  performed on BIOs), so the regression potential is low and restricted
  to raid0.


[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-22 Thread Khaled El Mously
** Changed in: linux (Ubuntu Disco)
   Status: In Progress => Fix Committed


[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-16 Thread Guilherme G. Piccoli
This issue affects only kernels after 4.17 and before 5.2, hence it's
fixed on Bionic and Eoan, and won't be fixed in Cosmic (4.18) since it
is EOL.
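
As a rough illustration of the range stated above, the following sketch checks whether a kernel release string falls after 4.17 and before 5.2. The function name is hypothetical, and the result is only a first hint: distribution kernels carry backported fixes, so a version inside the range may already be patched.

```shell
# Hypothetical helper: returns 0 if the release (e.g. "5.0.0-25-generic")
# is after 4.17 and before 5.2, 1 if it is outside that range, and 2 if the
# string cannot be parsed as major.minor.
kernel_in_affected_range() {
    release=$1
    case $release in
        *.*) ;;
        *) return 2 ;;
    esac
    major=${release%%.*}
    rest=${release#*.}
    minor=${rest%%[!0-9]*}
    case $major in ''|*[!0-9]*) return 2 ;; esac
    case $minor in '') return 2 ;; esac
    ver=$((major * 1000 + minor))
    [ "$ver" -gt 4017 ] && [ "$ver" -lt 5002 ]
}
```

For example, "kernel_in_affected_range "$(uname -r)" && echo maybe-affected" could be used as a quick check before attempting the reproducer.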

Patches submitted to kernel-team ML:
https://lists.ubuntu.com/archives/kernel-team/2019-July/102287.html

Cheers,


Guilherme

Title:
  Two crashes on raid0 error path (during a member device removal)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released


[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-16 Thread Guilherme G. Piccoli
** Description changed:

- TBD
+ [Impact]
+ 
+ * During raid0 error path testing, by removing one member of the array,
+ we've noticed after kernel 4.18 we can trigger a crash depending if
+ there's I/O in-flight during the array removal. When debugging the
+ issue, a second problem was found, that could cause a different crash.
+ 
+ * For the first and more relevant problem, commit cd4a4ae4683d
+ ("block: don't use blocking queue entered for recursive bio submits")
+ introduced the flag BIO_QUEUE_ENTERED in order BIOs that were split do bypass
+ the blocking queue entering routine and use the live non-blocking version. What
+ happens with md/raid0 though is that their BIOs have their underlying device
+ changed to the physical disk (array member). If we remove this physical disk
+ (or if it fails), we could have one BIO that had the flag changed to
+ BIO_QUEUE_ENTERED and had the device changed to the removed array member
+ (before its removal); this bio then skips a lot of checks in
+ generic_make_request_checks(), triggering the following crash:
+ 
+ BUG: unable to handle kernel NULL pointer dereference at 0155
+ PGD 0 P4D 0
+ Oops:  [#1] SMP PTI
+ RIP: 0010:blk_throtl_bio+0x45/0x970
+ [...]
+ Call Trace:
+  generic_make_request_checks+0x1bf/0x690
+  generic_make_request+0x64/0x3f0
+  raid0_make_request+0x184/0x620 [raid0]
+  ? raid0_make_request+0x184/0x620 [raid0]
+  md_handle_request+0x126/0x1a0
+  md_make_request+0x7b/0x180
+  generic_make_request+0x19e/0x3f0
+  submit_bio+0x73/0x140
+ [...]
+ 
+ * While debugging the above issue, we rebuilt the kernel with
+ CONFIG_BLK_CGROUP=n and noticed a different crash. Commit 37f9579f4c31
+ ("blk-mq: Avoid that submitting a bio concurrently with device removal
+ triggers a crash") introduced a NULL pointer dereference in
+ generic_make_request() that manifests as:
+ 
+ BUG: unable to handle kernel NULL pointer dereference at 0078
+ PGD 0 P4D 0
+ Oops:  [#1] SMP PTI
+ RIP: 0010:generic_make_request+0x32b/0x400
+ Call Trace:
+  submit_bio+0x73/0x140
+  ext4_io_submit+0x4d/0x60
+  ext4_writepages+0x626/0xe90
+  do_writepages+0x4b/0xe0
+ [...]
+ 
+ * For both issues, we have simple patches that are present in
+ linux-stable but not in Linus' tree.
+ 
+ ## For issue 1 (md removal crash):
+ 869eec894663 ("md/raid0: Do not bypass blocking queue entered for raid0 bios")
+ 
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=869eec894663
+ 
+ ## For issue 2 (generic_make_request() NULL dereference):
+ c9d8d3e9d7a0 ("block: Fix a NULL pointer dereference in generic_make_request()")
+ 
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c9d8d3e9d7a0
+ 
+ The reasoning for both patches not being present in Linus' tree is
+ explained in the commit messages. In summary, Ming Lei submitted a
+ major clean-up series at the same time I submitted both patches, so it
+ wouldn't make sense to accept my patches only to remove the same code
+ paths soon after with his clean-up. But Ming's series relies on legacy
+ I/O path removal, and so it's very hard to backport. Hence maintainers
+ suggested that I submit my small fixes to the stable tree only.
+ 
+ [Test case]
+ 
+ For both cases, the test is the same, the only change being a kernel
+ config option. To reproduce issue 1 (md removal crash), a regular Ubuntu
+ kernel config is enough. For issue 2, a kernel rebuild with
+ CONFIG_BLK_CGROUP=n is necessary.
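
Before running the test, it may help to confirm which of the two configurations the kernel under test was built with. The following is a minimal sketch, assuming the kernel config is installed at /boot/config-<release> as Ubuntu kernels usually ship it; the function name blk_cgroup_state is made up for this illustration.

```shell
#!/bin/sh
# blk_cgroup_state is a hypothetical helper name, not part of any tool.
# It reads a kernel config file and reports how CONFIG_BLK_CGROUP is set.
blk_cgroup_state() {
    cfg="$1"
    if [ ! -r "$cfg" ]; then
        echo "unknown"     # config file missing or unreadable
    elif grep -q '^CONFIG_BLK_CGROUP=y$' "$cfg"; then
        echo "enabled"     # regular config: the setup used for issue 1
    else
        echo "disabled"    # CONFIG_BLK_CGROUP=n: the setup used for issue 2
    fi
}

# Assumption: the running kernel's config lives at /boot/config-<release>.
blk_cgroup_state "/boot/config-$(uname -r)"
```

If this prints "enabled", the test case below is expected to exercise issue 1; "disabled" corresponds to the rebuilt kernel needed for issue 2.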
+ 
+ Steps to reproduce:
+ 
+ a) Create a raid0 md array with 2 NVMe devices as members, and mount it
+ with an ext4 filesystem.
+ 
+ b) Run the following one-liner (assuming the raid0 is mounted at /mnt):
+ (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;\
+ echo 1 > /sys/block/nvme1n1/device/device/remove
+ (where nvme1n1 is the 2nd array member)
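
Steps (a) and (b) can be put together roughly as follows. This is only a sketch: the array name /dev/md0 and the first member /dev/nvme0n1 are assumptions not stated in the description, and repro_cmds is a hypothetical helper. The script only prints the commands so they can be reviewed first; it executes nothing itself.

```shell
#!/bin/sh
# Dry-run sketch of steps (a) and (b): prints the commands, runs nothing.
# Assumptions (override via environment): array name and first member.
MEMBER1="${MEMBER1:-/dev/nvme0n1}"   # assumed 1st array member
MEMBER2="${MEMBER2:-nvme1n1}"        # 2nd array member, the one removed
MD="${MD:-/dev/md0}"                 # assumed md array name
MNT="${MNT:-/mnt}"                   # mount point used in the description

# repro_cmds is a hypothetical helper name used only in this sketch.
repro_cmds() {
    # (a) build the raid0 array, create ext4 on it and mount it
    echo "mdadm --create $MD --level=0 --raid-devices=2 $MEMBER1 /dev/$MEMBER2"
    echo "mkfs.ext4 $MD"
    echo "mount $MD $MNT"
    # (b) start a background write, then remove the 2nd member mid-I/O
    echo "(dd of=$MNT/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/$MEMBER2/device/device/remove"
}

repro_cmds
```

The printed commands are destructive (they overwrite both member devices) and are expected to crash an unfixed kernel, so they should only be copied into a root shell on a disposable test machine.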
+ 
+ [Regression potential]
+ 
+ The fixes are self-contained and small, both validated by a great number
+ of subsystem maintainers (including block, raid and stable). Commit
+ c9d8d3e9d7a0 was also validated by the author of the offending patch it
+ fixes, and has no functional change. Commit 869eec894663 is scoped to
+ the raid0 driver only, and makes raid0 fall back to its behavior from
+ before the introduction of the BIO_QUEUE_ENTERED flag (which indeed
+ increases the number of checks performed on BIOs), so the regression
+ potential is low and restricted to raid0.


Title:
  Two crashes on raid0 error path (during a member device removal)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released


[Kernel-packages] [Bug 1836806] Re: Two crashes on raid0 error path (during a member device removal)

2019-07-16 Thread Guilherme G. Piccoli
** Also affects: linux (Ubuntu Eoan)
   Importance: Medium
   Assignee: Guilherme G. Piccoli (gpiccoli)
   Status: Confirmed

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Bionic)
   Status: New => Fix Released

** Changed in: linux (Ubuntu Eoan)
   Status: Confirmed => Fix Released

** Changed in: linux (Ubuntu Disco)
   Status: New => In Progress

** Changed in: linux (Ubuntu Disco)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Disco)
 Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)

** Changed in: linux (Ubuntu Bionic)
 Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)

** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Cosmic)
   Status: New => Won't Fix

** Changed in: linux (Ubuntu Cosmic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Cosmic)
 Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)


Title:
  Two crashes on raid0 error path (during a member device removal)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released

Bug description:
  TBD

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836806/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp