Public bug reported:

We have an Ubuntu server with eight Samsung 980 Pro PCIe 4.0 NVMe SSDs
(model MZ-V8P1T0BW). The NVMe drives fail sporadically, leaving messages
like this in dmesg:

[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 998 QID 1 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 999 QID 1 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 627 QID 7 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 628 QID 7 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 134 QID 22 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 298 QID 42 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 299 QID 42 timeout, aborting
[Mon Oct  3 01:54:49 2022] nvme nvme7: I/O 838 QID 44 timeout, aborting
[Mon Oct  3 01:55:20 2022] nvme nvme7: I/O 998 QID 1 timeout, reset controller
[Mon Oct  3 01:55:51 2022] nvme nvme7: I/O 16 QID 0 timeout, reset controller
[Mon Oct  3 01:56:42 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct  3 01:57:03 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1
[Mon Oct  3 01:57:03 2022] nvme nvme7: Removing after probe failure status: -19
[Mon Oct  3 01:57:23 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1
[Mon Oct  3 01:57:23 2022] nvme7n1: detected capacity change from 1953525168 to 0
[Mon Oct  3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934235440 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 0
[Mon Oct  3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934230640 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct  3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934230752 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 0
[Mon Oct  3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934230472 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct  3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934235552 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
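
As a reading aid for the numeric values in the log above, here is a
minimal Python sketch that unpacks them. It assumes the NVMe
specification's bit layouts (status code in bits 7:0 and status code
type in bits 10:8 of the status value once the phase bit has been
shifted out, which is how the Linux driver appears to print it; RDY,
CFS and SHST in the low bits of CSTS) and that the capacity-change
message counts 512-byte sectors. The function names are illustrative
and are not part of any tool.

```python
import errno


def decode_nvme_status(status: int) -> dict:
    """Split a status value (phase bit already removed) into its fields."""
    return {
        "sc": status & 0xFF,            # status code, bits 7:0
        "sct": (status >> 8) & 0x7,     # status code type, bits 10:8
        "more": bool(status & 0x2000),  # "more information" bit
        "dnr": bool(status & 0x4000),   # "do not retry" bit
    }


def decode_csts(csts: int) -> dict:
    """Decode the low bits of the controller status (CSTS) register."""
    return {
        "rdy": bool(csts & 0x1),     # controller ready
        "cfs": bool(csts & 0x2),     # controller fatal status
        "shst": (csts >> 2) & 0x3,   # shutdown status
    }


def sectors_to_bytes(sectors: int, sector_size: int = 512) -> int:
    """Convert a sector count from dmesg into bytes (512-byte sectors assumed)."""
    return sectors * sector_size


def errno_name(code: int) -> str:
    """Map a negative kernel return code such as -19 to its errno name."""
    return errno.errorcode.get(abs(code), "unknown")


if __name__ == "__main__":
    print(decode_nvme_status(0x371))
    print(decode_csts(0x1))
    print(sectors_to_bytes(1953525168))
    print(errno_name(-19))
```

Read this way, 0x371 is status code 0x71 with status code type 3
(path related), which the kernel headers name
NVME_SC_HOST_ABORTED_CMD, so it seems to record the host's own abort
rather than an error reported by the drive. CSTS=0x1 means the
controller still reports ready with the fatal status bit clear, -19 is
ENODEV, and 1953525168 sectors is 1000204886016 bytes, the full 1 TB
disk.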

The server is currently running this OS and kernel:

* Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-48-generic x86_64).

We first encountered this issue over a year ago, not long after the
machine was first set up. At the time it was running this OS and kernel:

* Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-88-generic x86_64)

We have encountered this issue repeatedly over the past year. The time
between failures can be hours, days or weeks. 1-3 weeks is typical.

A zip file with 13 dmesg outputs sampled over the last year is attached.

Powering down the machine, then powering it up again, always brings the
disk back into working condition.

The failures are not confined to one disk: which of the eight NVMe
disks fails appears to be random.
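
One way to check that impression against saved dmesg outputs is a
simple tally of which controller was removed in each. The following is
a minimal sketch, assuming each output is plain text containing the
"Removing after probe failure" line shown in the log above; the
function name is illustrative. Controller names such as nvme7 are
assigned at probe time, so the same name in two boots is not
guaranteed to be the same physical disk.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the removal message from the log excerpt, capturing the controller name.
REMOVED = re.compile(r"nvme (nvme\d+): Removing after probe failure")


def count_removed_controllers(lines: Iterable[str]) -> Counter:
    """Count how often each NVMe controller was removed after a failed reset."""
    counts: Counter = Counter()
    for line in lines:
        match = REMOVED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[Mon Oct  3 01:57:03 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1",
        "[Mon Oct  3 01:57:03 2022] nvme nvme7: Removing after probe failure status: -19",
    ]
    print(count_removed_controllers(sample))
```

Feeding the lines of every saved dmesg file through this function and
summing the counters would give a per-controller failure count.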

Here are some possibly related bug reports:

* #1910866 (I first mentioned this issue there as a comment)
* #1991291

Here is some context for the server's use and configuration, in case
it's useful:

* The motherboard is a Supermicro H12SSL-NT with PCIe bifurcation support.
* The M.2 NVMe disks are connected through a pair of ASUS Hyper M.2 X16 PCIe
  4.0 X4 Expansion Cards, with 4 disks attached to each card. These cards rely
  on the x4x4x4x4 PCIe bifurcation feature supplied by the motherboard.
* The disks are paired into ZFS vdev mirrors, with a stripe across 4 mirrors
  forming a ZFS pool.
* The machine is used as a virtualization host (VDI server), running Windows
  guests on Linux KVM.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-5.15.0-48-generic 5.15.0-48.54
ProcVersionSignature: Ubuntu 5.15.0-48.54-generic 5.15.53
Uname: Linux 5.15.0-48-generic x86_64
NonfreeKernelModules: nvidia zfs zunicode zavl icp zcommon znvpair
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.15.0-48-generic.
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D12p', '/dev/snd/pcmC0D11p', '/dev/snd/pcmC0D10p', '/dev/snd/pcmC0D9p', '/dev/snd/pcmC0D8p', '/dev/snd/pcmC0D7p', '/dev/snd/pcmC0D3p', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Card0.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer'
Card0.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer'
CasperMD5CheckResult: pass
Date: Thu Oct  6 15:00:04 2022
InstallationDate: Installed on 2021-06-04 (489 days ago)
InstallationMedia: Ubuntu-Server 20.04.2 LTS "Focal Fossa" - Release amd64 (20210201.2)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Supermicro Super Server
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-48-generic root=UUID=3ecf491d-34aa-454f-9a15-0744851f71a8 ro crashkernel=512M-:192M
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-48-generic N/A
 linux-backports-modules-5.15.0-48-generic  N/A
 linux-firmware                             20220329.git681281e4-0ubuntu3.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to jammy on 2022-07-09 (89 days ago)
WifiSyslog:
 
dmi.bios.date: 04/14/2022
dmi.bios.release: 5.14
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.4
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: H12SSL-NT
dmi.board.vendor: Supermicro
dmi.board.version: 1.01
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 17
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.4:bd04/14/2022:br5.14:svnSupermicro:pnSuperServer:pvr0123456789:rvnSupermicro:rnH12SSL-NT:rvr1.01:cvnSupermicro:ct17:cvr0123456789:skuTobefilledbyO.E.M.:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: Super Server
dmi.product.sku: To be filled by O.E.M.
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug jammy

** Attachment added: "Collection of dmesg outputs demonstrating this issue between 2021-09-15 and 2022-10-06"
   https://bugs.launchpad.net/bugs/1992106/+attachment/5621852/+files/scj1342-ssd-failures.zip

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1992106

Title:
  nvme disks fail intermittently after I/O QID timeout aborts with
  status 0x371, resume function after power cycle

Status in linux package in Ubuntu:
  New
