[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2020-07-14 Thread Guilherme G. Piccoli
** Changed in: linux (Ubuntu)
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1807393

Title:
  nvme - Polling on timeout

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]
  * NVMe controllers may fail to send an interrupt, especially due to bugs
  in virtual devices (which are common these days: qemu has its own NVMe
  virtual device, and so does AWS). This is a difficult situation to debug,
  because the NVMe driver reports only the request timeout, not its cause.

  * The upstream patch proposed for SRU here, 7776db1ccc12
  ("NVMe/pci: Poll CQ on timeout"), was designed to provide more information
  in these cases by proactively polling the CQEs on request timeouts. This
  checks whether the specific request was completed and some issue (probably
  a missed interrupt) prevented the driver from noticing, or whether the
  request really wasn't completed, which indicates more severe issues.

  * Besides being quite useful for debugging, this patch could help mitigate
  issues in cloud environments like AWS, where there may be jitter in
  request completion while the I/O timeout is set to a low value, or where
  the virtual NVMe controller has atypical bugs. With this patch, if polling
  succeeds the NVMe driver continues working instead of attempting a
  controller reset, which may lead to failures in the rootfs; refer to
  https://launchpad.net/bugs/1788035.
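
The distinction drawn in this section (request completed but not noticed,
versus request really not completed) is visible in the kernel log. Below is
a minimal sketch for telling the two apart, assuming the log is fed on stdin
and that timeouts resolved by polling carry the "timeout, completion polled"
suffix shown in the verification logs of this thread. Any other
"I/O <n> QID <n> timeout" line is simply counted as "other"; the function
name classify_timeouts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split NVMe timeout messages read from stdin into
# those resolved by polling the CQ and all other timeout messages.
classify_timeouts() {
    awk '
        /I\/O [0-9]+ QID [0-9]+ timeout/ {
            if ($0 ~ /timeout, completion polled/) polled++
            else other++
        }
        END { printf "polled=%d other=%d\n", polled, other }
    '
}
```

It could be used as "dmesg | classify_timeouts"; a non-zero "other" count
would point at the more severe case where polling did not find a completion.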

  
  [Test Case]

  * It's a bit tricky to artificially create a missed-interrupt situation;
  one idea was to implement a small hack in the NVMe qemu virtual device
  that, given a trigger from the guest kernel, induces the virtual device to
  skip an interrupt. The hack patch is present in comment #1 below.

  * To trigger the hack from the guest kernel, all that is needed is to issue
  a raw admin command with nvme-cli: "nvme admin-passthru -o 0xff /dev/nvme0".
  After that, perform some I/O to see one of the requests aborting; one could
  use dd for this, e.g. "dd if=/dev/zero of=/dev/nvme0n1 count=5".
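
The two steps of this test case can be strung together as sketched below.
This is an illustration under stated assumptions, not the script used for
the SRU: the function name run_missed_irq_test is hypothetical, the device
nodes are passed in by the caller, opcode 0xff only has an effect on a qemu
build carrying the hack patch, and the final grep assumes the patched kernel
logs the "timeout, completion polled" message seen elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper around the test case. DESTRUCTIVE: dd overwrites the
# first blocks of the namespace, so only point it at a scratch device.
run_missed_irq_test() {
    ctrl="$1"   # controller node, e.g. /dev/nvme0
    ns="$2"     # namespace node, e.g. /dev/nvme0n1

    # Opcode 0xff is the trigger of the qemu hack, not a standard command.
    nvme admin-passthru -o 0xff "$ctrl" || return 1

    # A few writes; one of them should hit the skipped interrupt.
    dd if=/dev/zero of="$ns" count=5 || return 1

    # With the patch applied, the timeout is expected to be logged as polled.
    dmesg | grep 'timeout, completion polled'
}
```

Example call, as root inside the guest:
"run_missed_irq_test /dev/nvme0 /dev/nvme0n1". The function returns non-zero
when no polled completion was logged.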

  
  [Regression Potential]

  * There are no clear risks in adding such a polling mechanism to the NVMe
  driver. One bad thing that was never reported but could happen with this
  patch: the device could be in a bad state IRQ-wise that a reset would fix,
  but the patch could cause all requests to be completed via polling, which
  prevents the adapter reset. This is however a very hypothetical situation,
  which would also happen in the mainline kernel (since it has the patch).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1807393/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2019-07-24 Thread Brad Figg
** Tags added: cscc

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2019-02-04 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 4.4.0-142.168

---
linux (4.4.0-142.168) xenial; urgency=medium

  * linux: 4.4.0-142.168 -proposed tracker (LP: #1811846)

  * Packaging resync (LP: #1786013)
- [Packaging] update helper scripts

  * iptables connlimit allows more connections than the limit when using
multiple CPUs (LP: #1811094)
- netfilter: xt_connlimit: don't store address in the conn nodes
- SAUCE: netfilter: xt_connlimit: remove the 'addr' parameter in add_hlist()
- netfilter: nf_conncount: expose connection list interface
- netfilter: nf_conncount: Fix garbage collection with zones
- netfilter: nf_conncount: fix garbage collection confirm race
- netfilter: nf_conncount: don't skip eviction when age is negative

  * CVE-2017-5715
- SAUCE: x86/speculation: Cleanup IBPB runtime control handling
- SAUCE: x86/speculation: Cleanup IBRS runtime control handling
- SAUCE: x86/speculation: Use x86_spec_ctrl_base in entry/exit code
- SAUCE: x86/speculation: Move RSB_CTXSW hunk

  * Xenial update: 4.4.167 upstream stable release (LP: #1811077)
- media: em28xx: Fix use-after-free when disconnecting
- Revert "wlcore: Add missing PM call for
  wlcore_cmd_wait_for_event_or_timeout()"
- rapidio/rionet: do not free skb before reading its length
- s390/qeth: fix length check in SNMP processing
- usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2
- kvm: mmu: Fix race in emulated page table writes
- xtensa: enable coprocessors that are being flushed
- xtensa: fix coprocessor context offset definitions
- Btrfs: ensure path name is null terminated at btrfs_control_ioctl
- ALSA: wss: Fix invalid snd_free_pages() at error path
- ALSA: ac97: Fix incorrect bit shift at AC97-SPSA control write
- ALSA: control: Fix race between adding and removing a user element
- ALSA: sparc: Fix invalid snd_free_pages() at error path
- ext2: fix potential use after free
- dmaengine: at_hdmac: fix memory leak in at_dma_xlate()
- dmaengine: at_hdmac: fix module unloading
- btrfs: release metadata before running delayed refs
- USB: usb-storage: Add new IDs to ums-realtek
- usb: core: quirks: add RESET_RESUME quirk for Cherry G230 Stream series
- misc: mic/scif: fix copy-paste error in scif_create_remote_lookup
- Kbuild: suppress packed-not-aligned warning for default setting only
- exec: avoid gcc-8 warning for get_task_comm
- disable stringop truncation warnings for now
- kobject: Replace strncpy with memcpy
- unifdef: use memcpy instead of strncpy
- kernfs: Replace strncpy with memcpy
- ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
- drm: gma500: fix logic error
- scsi: bfa: convert to strlcpy/strlcat
- staging: rts5208: fix gcc-8 logic error warning
- kdb: use memmove instead of overlapping memcpy
- iser: set sector for ambiguous mr status errors
- uprobes: Fix handle_swbp() vs. unregister() + register() race once more
- MIPS: ralink: Fix mt7620 nd_sd pinmux
- mips: fix mips_get_syscall_arg o32 check
- drm/ast: Fix incorrect free on ioregs
- scsi: scsi_devinfo: cleanly zero-pad devinfo strings
- ALSA: trident: Suppress gcc string warning
- scsi: csiostor: Avoid content leaks and casts
- kgdboc: Fix restrict error
- kgdboc: Fix warning with module build
- leds: call led_pwm_set() in leds-pwm to enforce default LED_OFF
- leds: turn off the LED and wait for completion on unregistering LED class
  device
- leds: leds-gpio: Fix return value check in create_gpio_led()
- Input: xpad - quirk all PDP Xbox One gamepads
- Input: matrix_keypad - check for errors from of_get_named_gpio()
- Input: elan_i2c - add ELAN0620 to the ACPI table
- Input: elan_i2c - add ACPI ID for Lenovo IdeaPad 330-15ARR
- Input: elan_i2c - add support for ELAN0621 touchpad
- btrfs: Always try all copies when reading extent buffers
- Btrfs: fix use-after-free when dumping free space
- ARC: change defconfig defaults to ARCv2
- arc: [devboards] Add support of NFSv3 ACL
- mm: cleancache: fix corruption on missed inode invalidation
- usb: gadget: dummy: fix nonsensical comparisons
- iommu/vt-d: Fix NULL pointer dereference in prq_event_thread()
- iommu/ipmmu-vmsa: Fix crash on early domain free
- can: rcar_can: Fix erroneous registration
- batman-adv: Expand merged fragment buffer for full packet
- bnx2x: Assign unique DMAE channel number for FW DMAE transactions.
- qed: Fix PTT leak in qed_drain()
- qed: Fix reading wrong value in loop condition
- net/mlx4_core: Zero out lkey field in SW2HW_MPT fw command
- net/mlx4_core: Fix uninitialized variable compilation warning
- net/mlx4: Fix UBSAN warning of signed integer overflow
- net: faraday: ftmac100: remove netif_running(netdev) check before disabling
  interrupts

[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2019-01-23 Thread Guilherme G. Piccoli
Also verified on an AWS instance running the aws kernel from
-proposed:

$ uname -rv
4.4.0-1075-aws #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019


$ dmesg #cleared that before testing

[ 5429.330967] nvme :00:04.0: I/O 10 QID 1 timeout, completion polled
[39630.417191] nvme :00:04.0: I/O 0 QID 2 timeout, completion polled
[52869.430017] nvme :00:04.0: I/O 17 QID 1 timeout, completion polled
[69245.492236] nvme :00:04.0: I/O 3 QID 2 timeout, completion polled
[111366.577241] nvme :00:04.0: I/O 20 QID 2 timeout, completion polled
[120978.583000] nvme :00:04.0: I/O 11 QID 2 timeout, completion polled
[136204.614939] nvme :00:04.0: I/O 7 QID 1 timeout, completion polled
[142200.642476] nvme :00:04.0: I/O 3 QID 1 timeout, completion polled
[145549.661249] nvme :00:04.0: I/O 29 QID 2 timeout, completion polled


$ uptime
 10:57:48 up 1 day, 20:22,  2 users,  load average: 0.00, 0.03, 0.02
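
To see how the polled timeouts are spread across queues, the log can be
tallied per QID. A minimal sketch, assuming the message format shown in the
dmesg output above; the function name summarize_polled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count "completion polled" timeouts per queue ID,
# reading kernel log lines from stdin.
summarize_polled() {
    awk '
        /timeout, completion polled/ {
            # The queue ID is the field right after the literal "QID".
            for (i = 1; i < NF; i++)
                if ($i == "QID") count[$(i + 1)]++
        }
        END { for (q in count) printf "QID %s: %d\n", q, count[q] }
    ' | sort
}
```

Run as "dmesg | summarize_polled". For the nine lines in the dmesg output
above it prints "QID 1: 4" and "QID 2: 5".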

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2019-01-17 Thread Guilherme G. Piccoli
The fix was verified with Xenial kernel 4.4.0-142.168, available in
-proposed.

I'm running on an AWS 2-CPU instance, which exhibits the issue when we run
a small reproducer script (a loop that changes the IRQ affinity of the
NVMe MSIs/legacy interrupt among the CPUs and performs a 4K write to the
device + sync):

$ uptime
 17:42:11 up  4:00,  2 users,  load average: 0.19, 0.14, 0.08

$ uname -rmv
4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64

$ dmesg
[ 2218.252634] nvme :00:04.0: I/O 6 QID 2 timeout, completion polled
[ 2451.245962] nvme :00:04.0: I/O 22 QID 2 timeout, completion polled
[ 6672.249406] nvme :00:04.0: I/O 3 QID 1 timeout, completion polled
[ 8425.253863] nvme :00:04.0: I/O 28 QID 2 timeout, completion polled
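
The reproducer script itself is not part of this message. A rough sketch of
such a loop is given below, assuming the NVMe IRQs can be found by matching
"nvme" in /proc/interrupts and re-pinned through
/proc/irq/<n>/smp_affinity_list (root is required, and some kernels refuse
affinity changes for managed interrupts). All function names are
hypothetical, and this is not the script that was actually used.

```shell
#!/bin/sh
# List IRQ numbers whose /proc/interrupts line mentions "nvme".
nvme_irqs() {
    awk '/nvme/ { sub(/:$/, "", $1); print $1 }' "${1:-/proc/interrupts}"
}

# Pin every NVMe IRQ to a single CPU.
# Args: cpu [interrupts_file] [irq_dir]
set_nvme_affinity() {
    cpu="$1"
    interrupts="${2:-/proc/interrupts}"
    irq_dir="${3:-/proc/irq}"
    for irq in $(nvme_irqs "$interrupts"); do
        echo "$cpu" > "$irq_dir/$irq/smp_affinity_list"
    done
}

# DESTRUCTIVE: overwrites the first 4 KiB of "$dev" on every iteration.
# Args: dev ncpus iterations [interrupts_file] [irq_dir]
reproduce() {
    dev="$1"; ncpus="$2"; iterations="$3"
    irq_src="${4:-/proc/interrupts}"; irq_root="${5:-/proc/irq}"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        # Move the interrupts to the next CPU, round-robin.
        set_nvme_affinity $((i % ncpus)) "$irq_src" "$irq_root"
        # One 4K write followed by a sync, as described above.
        dd if=/dev/zero of="$dev" bs=4k count=1 2>/dev/null
        sync
        i=$((i + 1))
    done
}
```

On a 2-CPU instance this could be started with
"reproduce /dev/nvme0n1 2 100000" against a scratch namespace, watching
dmesg for "completion polled" messages in the meantime.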



Cheers,


Guilherme

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2019-01-17 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-xenial' to 'verification-done-xenial'. If the problem
still exists, change the tag 'verification-needed-xenial' to
'verification-failed-xenial'.

If verification is not done within 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Thank you!


** Tags added: verification-needed-xenial

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2019-01-09 Thread Khaled El Mously
** Changed in: linux (Ubuntu Xenial)
   Status: In Progress => Fix Committed

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2018-12-07 Thread Guilherme G. Piccoli
Patch sent to Ubuntu kernel mailing-list:
https://lists.ubuntu.com/archives/kernel-team/2018-December/097208.html

** Changed in: linux (Ubuntu)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Xenial)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  In Progress


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2018-12-07 Thread Guilherme G. Piccoli
** Description changed:

- Description to be updated
+ [Impact]
+ * NVMe controllers potentially could miss to send an interrupt, specially
+ due to bugs in virtual devices(which are common those days - qemu has its
+ own NVMe virtual device, so does AWS). This would be a difficult to
+ debug situation, because NVMe driver only reports the request timeout,
+ not the reason.
  
- [Impact]
+ * The upstream patch proposed to SRU here here, 7776db1ccc12
+ ("NVMe/pci: Poll CQ on timeout") was designed to provide more information
+ in these cases, by pro-actively polling the CQEs on request timeouts, to
+ check if the specific request was completed and some issue (probably a
+ missed interrupt) prevented the driver to notice, or if the request really
+ wasn't completed, which indicates more severe issues.
  
-  * 1
+ * Although quite useful for debugging, this patch could help to mitigate
+ issues in cloud environments like AWS, in case we may have jitter in
+ request completion and the i/o timeout was set to low values, or even
+ in case of atypical bugs in the virtual NVMe controller. With this patch,
+ if polling succeeds the NVMe driver will continue working instead of
+ trying a reset controller procedure, which may lead to fails in the 
+ rootfs - refer to https://launchpad.net/bugs/1788035.
+ 
  
  [Test Case]
  
-  * 2
+ * It's a bit tricky to artificially create a situation of missed interrupt;
+ one idea was to implement a small hack in the NVMe qemu virtual device
+ that given a trigger in guest kernel, will induce the virtual device to
+ skip an interrupt. The hack patch is present in comment #1 below.
+ 
+ * To trigger such hack from guest kernel, all is needed is to issue a
+ raw admin command from nvme-cli: "nvme admin-passthru -o 0xff /dev/nvme0"
+ After that, just perform some I/Os to see one of them aborting - one could
+ use dd for this goal, like "dd if=/dev/zero of=/dev/nvme0n1 count=5".
+ 
  
  [Regression Potential]
  
-  * 3
+ * There are no clear risks in adding such polling mechanism to the NVMe
+ driver; one bad thing that was never reported but could happen with this patch
+ is the device could be in a bad state IRQ-wise that a reset would fix, but
+ the patch could cause all requests to be completed via polling, which
+ prevents the adapter reset. This is however a very hypothetical situation,
+ which would also happen in the mainline kernel (since it has the patch).

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Confirmed

Bug description:
  [Impact]
  * NVMe controllers potentially could miss to send an interrupt, specially
  due to bugs in virtual devices(which are common those days - qemu has its
  own NVMe virtual device, so does AWS). This would be a difficult to
  debug situation, because NVMe driver only reports the request timeout,
  not the reason.

  * The upstream patch proposed to SRU here here, 7776db1ccc12
  ("NVMe/pci: Poll CQ on timeout") was designed to provide more information
  in these cases, by pro-actively polling the CQEs on request timeouts, to
  check if the specific request was completed and some issue (probably a
  missed interrupt) prevented the driver to notice, or if the request really
  wasn't completed, which indicates more severe issues.

  * Although quite useful for debugging, this patch could help to mitigate
  issues in cloud environments like AWS, in case we may have jitter in
  request completion and the i/o timeout was set to low values, or even
  in case of atypical bugs in the virtual NVMe controller. With this patch,
  if polling succeeds the NVMe driver will continue working instead of
  trying a reset controller procedure, which may lead to fails in the 
  rootfs - refer to https://launchpad.net/bugs/1788035.

  
  [Test Case]

  * It's a bit tricky to artificially create a situation of missed interrupt;
  one idea was to implement a small hack in the NVMe qemu virtual device
  that given a trigger in guest kernel, will induce the virtual device to
  skip an interrupt. The hack patch is present in comment #1 below.

  * To trigger such hack from guest kernel, all is needed is to issue a
  raw admin command from nvme-cli: "nvme admin-passthru -o 0xff /dev/nvme0"
  After that, just perform some I/Os to see one of them aborting - one could
  use dd for this goal, like "dd if=/dev/zero of=/dev/nvme0n1 count=5".

  
  [Regression Potential]

  * There are no clear risks in adding such polling mechanism to the NVMe 
driver; one bad thing that was neverreported but could happen with this patch 
is the device could be in a bad state IRQ-wise that a reset would fix, but
  the patch could cause all requests to be completed via polling, which
  prevents the adapter reset. This is however a very hypothetical 

[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2018-12-07 Thread Guilherme G. Piccoli
** Changed in: linux (Ubuntu Xenial)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Xenial)
 Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1807393

Title:
  nvme - Polling on timeout

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Confirmed

Bug description:
  [Impact]
  * NVMe controllers could potentially fail to send an interrupt, especially
  due to bugs in virtual devices (which are common these days - qemu has its
  own NVMe virtual device, and so does AWS). This would be a difficult
  situation to debug, because the NVMe driver only reports the request timeout,
  not the reason.

  * The upstream patch proposed for SRU here, 7776db1ccc12
  ("NVMe/pci: Poll CQ on timeout"), was designed to provide more information
  in these cases, by proactively polling the CQEs on request timeouts, to
  check if the specific request was completed and some issue (probably a
  missed interrupt) prevented the driver from noticing, or if the request really
  wasn't completed, which indicates more severe issues.

  * Besides being quite useful for debugging, this patch could help to mitigate
  issues in cloud environments like AWS, in case we have jitter in
  request completion and the I/O timeout was set to low values, or even
  in case of atypical bugs in the virtual NVMe controller. With this patch,
  if polling succeeds the NVMe driver will continue working instead of
  trying a controller reset procedure, which may lead to failures in the
  rootfs - refer to https://launchpad.net/bugs/1788035.

  
  [Test Case]

  * It's a bit tricky to artificially create a missed-interrupt situation;
  one idea was to implement a small hack in the NVMe qemu virtual device
  that, given a trigger from the guest kernel, induces the virtual device to
  skip an interrupt. The hack patch is present in comment #1 below.

  * To trigger this hack from the guest kernel, all that is needed is to issue a
  raw admin command from nvme-cli: "nvme admin-passthru -o 0xff /dev/nvme0".
  After that, just perform some I/Os to see one of them aborting - one could
  use dd for this purpose, like "dd if=/dev/zero of=/dev/nvme0n1 count=5".
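
  The two steps above could be wrapped in a small script. This is only a
  sketch: it assumes the guest runs on a qemu built with the hack patch from
  comment #1, and that the controller and namespace are /dev/nvme0 and
  /dev/nvme0n1. The CTRL, NS and DRY_RUN variables and the function names
  are made up for illustration. By default it only prints the commands,
  because the dd step overwrites the beginning of the namespace.

```shell
#!/bin/sh
# Sketch of the [Test Case] steps. Assumes a guest running on a qemu
# that carries the hack patch from comment #1.
# CTRL, NS and DRY_RUN are hypothetical knobs, not part of any tool.

CTRL="${CTRL:-/dev/nvme0}"     # NVMe controller character device
NS="${NS:-/dev/nvme0n1}"       # namespace block device (gets overwritten!)
DRY_RUN="${DRY_RUN:-1}"        # 1 = print the commands, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: the raw admin command with opcode 0xff is the trigger that makes
# the hacked virtual device skip an interrupt.
trigger_missed_irq() {
    run nvme admin-passthru -o 0xff "$CTRL"
}

# Step 2: issue a few writes; one of them should hit the request timeout.
issue_io() {
    run dd if=/dev/zero of="$NS" count=5
}

trigger_missed_irq
issue_io
```

  Running it with DRY_RUN=0 as root executes the commands for real; the
  kernel log can then be checked to see how the timed-out request was
  handled.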

  
  [Regression Potential]

  * There are no clear risks in adding such a polling mechanism to the NVMe
  driver; one bad thing that was never reported but could happen with this patch
  is that the device could be in a bad state IRQ-wise that a reset would fix, but
  the patch could cause all requests to be completed via polling, which
  prevents the adapter reset. This is however a very hypothetical situation,
  which would also happen in the mainline kernel (since it has the patch).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1807393/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2018-12-07 Thread Eric Desrochers
** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
   Status: New



[Kernel-packages] [Bug 1807393] Re: nvme - Polling on timeout

2018-12-07 Thread Guilherme G. Piccoli
** Patch added: "TEST patch for qemu nvme virtual device"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1807393/+attachment/5220068/+files/0001-hw-block-nvme-NVMe-hack-to-forcibly-miss-an-interrup.patch

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1807393

Title:
  nvme - Polling on timeout

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Description to be updated

  [Impact]

   * 1

  [Test Case]

   * 2

  [Regression Potential]

   * 3
