[Bug 1910866] Re: nvme drive fails after some time

2021-10-12 Thread Joshua Sjoding
We have an Ubuntu server running a set of eight Samsung 980 Pro PCIe 4.0
NVMe SSDs (model MZ-V8P1T0BW) on Ubuntu 20.04.3 LTS (GNU/Linux
5.4.0-88-generic x86_64). We've seen this happen at least 5 times over
the past month, and not always on the same SSD. We first saw it happen
on 5.4.0-81. Some samples from dmesg are below.

This is a production system that runs a set of virtual desktop
instances. Thankfully we use these in a ZFS pool of four mirrored (RAID 1)
vdevs, so the only outage we've had so far was when the failure hit both
members of a mirrored pair. After a reboot the SSDs come back up.

[Mon Sep  6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
[Mon Sep  6 12:58:37 2021] nvme nvme5: I/O 133 QID 46 timeout, aborting
[Mon Sep  6 12:58:39 2021] nvme nvme5: I/O 134 QID 46 timeout, aborting
[Mon Sep  6 12:58:40 2021] nvme nvme5: I/O 135 QID 46 timeout, aborting
[Mon Sep  6 12:58:40 2021] nvme nvme5: I/O 784 QID 48 timeout, aborting
[Mon Sep  6 12:58:41 2021] nvme nvme5: I/O 136 QID 46 timeout, aborting
[Mon Sep  6 12:58:41 2021] nvme nvme5: I/O 137 QID 46 timeout, aborting
[Mon Sep  6 12:58:42 2021] nvme nvme5: I/O 492 QID 28 timeout, aborting
[Mon Sep  6 12:59:07 2021] nvme nvme5: I/O 132 QID 46 timeout, reset controller
[Mon Sep  6 12:59:38 2021] nvme nvme5: I/O 24 QID 0 timeout, reset controller
[Mon Sep  6 13:00:29 2021] nvme nvme5: Device not ready; aborting reset
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:33 2021] INFO: task txg_quiesce:2172 blocked for more than 120 seconds.
[Mon Sep  6 13:00:33 2021]   Tainted: P   OE 5.4.0-81-generic #91-Ubuntu

[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting
[Tue Sep 21 21:18:37 2021] nvme nvme2: I/O 240 QID 26 timeout, aborting
[Tue Sep 21 21:18:47 2021] nvme nvme2: I/O 718 QID 23 timeout, aborting
[Tue Sep 21 21:18:56 2021] nvme nvme2: I/O 719 QID 23 timeout, aborting
[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller
[Tue Sep 21 21:19:37 2021] nvme nvme2: I/O 17 QID 0 timeout, reset controller
[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:47 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19
[Tue Sep 21 21:21:08 2021] nvme nvme2: Device not ready; aborting reset

[Tue Oct  5 16:54:59 2021] nvme nvme6: I/O 1013 QID 38 timeout, aborting
[Tue Oct  5 16:54:59 2021] nvme nvme6: I/O 727 QID 39 timeout, aborting
[Tue Oct  5 16:55:03 2021] nvme nvme6: I/O 1014 QID 38 timeout, aborting
[Tue Oct  5 16:55:05 2021] nvme nvme6: I/O 1015 QID 38 timeout, aborting
[Tue Oct  5 16:55:25 2021] nvme nvme6: I/O 15 QID 21 timeout, aborting
[Tue Oct  5 16:55:25 2021] nvme nvme6: I/O 408 QID 37 timeout, aborting
[Tue Oct  5 16:55:29 2021] nvme nvme6: I/O 1013 QID 38 timeout, reset controller
[Tue Oct  5 16:55:59 2021] nvme nvme6: I/O 11 QID 0 timeout, reset controller
[Tue Oct  5 16:56:51 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:57:11 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct  5 16:57:11 2021] nvme nvme6: Removing after probe failure status: -19
[Tue Oct  5 16:57:32 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct  5 16:57:32 2021] blk_update_request: I/O error, dev nvme6n1, sector 842198232 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0

[Mon Oct 11 12:14:38 2021] nvme nvme2: I/O 306 QID 48 timeout, aborting
[Mon Oct 11 12:14:39 2021] nvme nvme2: I/O 827 QID 14 timeout, aborting
[Mon Oct 11 12:15:01 2021] nvme nvme2: I/O 828 QID 14 timeout, aborting
[Mon Oct 11 12:15:05 2021] nvme nvme2: I/O 829 QID 14 timeout, aborting
[Mon Oct 11 12:15:07 2021] nvme nvme2: I/O 830 QID 14 timeout, aborting
[Mon Oct 11 12:15:08 2021] nvme nvme2: I/O 306 QID 48 timeout, reset controller
[Mon Oct 11 12:15:38 2021] nvme nvme2: I/O 20 QID 0 timeout, reset controller
[Mon Oct 11 12:16:29 2021] nvme nvme2: Device not ready; 
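
To tally events like the ones above per controller, a small shell helper may be handy. This is only a sketch: the function name count_nvme_timeouts is made up for illustration, and it assumes the kernel messages keep the "nvme nvmeN: I/O ... QID ... timeout" shape seen in these samples.

```shell
# Count I/O timeout messages (both "aborting" and "reset controller")
# per NVMe controller, reading dmesg-style text from stdin.
# Output: one "controller count" pair per line, sorted by controller name.
count_nvme_timeouts() {
    grep -E 'nvme nvme[0-9]+: I/O [0-9]+ QID [0-9]+ timeout' \
        | sed -E 's/.*nvme (nvme[0-9]+):.*/\1/' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
# Usage: dmesg -T | count_nvme_timeouts
```

Counting per controller makes it easier to see whether the failures follow one drive or move around, as they appear to do here.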

[Bug 1910866] Re: nvme drive fails after some time

2021-10-02 Thread Andre Ruiz
I'm seeing this in focal kernel 5.4.0-88. Is this expected? Do I have to
switch to the HWE kernel mentioned above to fix this?

The laptop has been stable for a long time and then suddenly started
having this exact symptom a few days ago. I'm wondering if this was
introduced in the latest GA kernels for focal or if it was always there.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1910866

Title:
  nvme drive fails after some time

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1910866/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1910866] Re: nvme drive fails after some time

2021-04-07 Thread stevecam
** Also affects: debian
   Importance: Undecided
   Status: New

[Bug 1910866] Re: nvme drive fails after some time

2021-01-26 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 5.8.0-41.46

---
linux (5.8.0-41.46) groovy; urgency=medium

  * groovy/linux: 5.8.0-41.46 -proposed tracker (LP: #1912219)

  * Groovy update: upstream stable patchset 2020-12-17 (LP: #1908555) // nvme
    drive fails after some time (LP: #1910866)
    - Revert "nvme-pci: remove last_sq_tail"

  * initramfs unpacking failed (LP: #1835660)
    - SAUCE: lib/decompress_unlz4.c: correctly handle zero-padding around
      initrds.

  * overlay: permission regression in 5.4.0-51.56 due to patches related to
    CVE-2020-16120 (LP: #1900141)
    - ovl: do not fail because of O_NOATIME

 -- Kleber Sacilotto de Souza   Mon, 18 Jan 2021 17:01:08 +0100

** Changed in: linux (Ubuntu Groovy)
   Status: Fix Committed => Fix Released

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2020-16120

[Bug 1910866] Re: nvme drive fails after some time

2021-01-25 Thread Kelsey Skunberg
@Andrew, thank you for testing! I'm switching verification status to
'verification-done-groovy'.

** Tags removed: verification-needed-groovy
** Tags added: verification-done-groovy

[Bug 1910866] Re: nvme drive fails after some time

2021-01-21 Thread Andrew Hayzen
@Kleber I have installed the focal hwe kernel from proposed (as seen
below). So far when A/B testing this kernel it is working correctly :-)
I will continue running this kernel and report any issues I have.

Also note that I have been continuously running the test kernel (from
comment 22) since last week and it has worked perfectly so far :-)

I look forward to this migrating from -proposed into focal.

$ uname -a
Linux xps-13-9360 5.8.0-41-generic #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ apt policy linux-generic-hwe-20.04
linux-generic-hwe-20.04:
  Installed: 5.8.0.41.46~20.04.27
  Candidate: 5.8.0.41.46~20.04.27
  Version table:
 *** 5.8.0.41.46~20.04.27 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     5.8.0.40.45~20.04.25 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     5.8.0.38.43~20.04.23 500
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
     5.4.0.26.32 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

[Bug 1910866] Re: nvme drive fails after some time

2021-01-21 Thread Kleber Sacilotto de Souza
Hello Alan or anyone else affected,

The fix for this bug is also available in the HWE kernel for Focal
currently in -proposed (version 5.8.0-41.46~20.04.1). Feedback on whether
this kernel fixes the nvme issue would be appreciated.

Thank you.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-20 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-groovy' to 'verification-done-groovy'. If the
problem still exists, change the tag 'verification-needed-groovy' to
'verification-failed-groovy'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: verification-needed-groovy

[Bug 1910866] Re: nvme drive fails after some time

2021-01-19 Thread Kleber Sacilotto de Souza
Thank you Andrew for your feedback!

We have applied the fix for groovy/linux (and focal/linux-hwe-5.8) and
the new kernels will be available in -proposed soon. These packages are
planned to be promoted to -updates early next week.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-18 Thread Kleber Sacilotto de Souza
** Changed in: linux (Ubuntu Groovy)
   Status: In Progress => Fix Committed

[Bug 1910866] Re: nvme drive fails after some time

2021-01-18 Thread Kleber Sacilotto de Souza
** Also affects: linux (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Groovy)
   Status: New => In Progress

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Andrew Hayzen
@Marcelo So far it looks good :-) It passes the "fio" command test when
A/B testing between a known bad kernel and this new kernel. I will
continue running it on this machine over the weekend to ensure longer
usage doesn't have any remaining issues - but looks like it resolves the
issue so far :-D Thanks!

$ uname -a
Linux xps-13-9360 5.8.0-38-generic #43+lp1910866 SMP Fri Jan 15 20:29:27 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Andrew Hayzen
Thanks! I'll take a look :-)

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Marcelo Cerri
Hi, Andrew.

I created a test kernel with the fix and it is available at:

https://kernel.ubuntu.com/~mhcerri/lp1910866_linux-5.8.0-38-generic_5.8.0-38.43+lp1910866_amd64.tar.gz

You can install it by extracting the tarball and installing the Debian
packages:

$ tar xf lp1910866_linux-5.8.0-38-generic_5.8.0-38.43+lp1910866_amd64.tar.gz
$ sudo apt install ./*.deb

Please let us know if the test kernel solves the problem.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Terry Rudd
Andrew, we plan to address this in the Focal 5.8 hwe kernel and we're
going to be building a test kernel.  We would really appreciate you
testing it since you have a reliable reproducer.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Andrew Hayzen
@kaihengfeng Thanks for the quick response!  bug 1908555 linked there
only lists groovy as a target series, I hope that this will also be
applied to the focal HWE kernel :-)

Also I am happy to test any kernel in a -proposed channel or PPA to
confirm it fixes the issue if that helps :-)

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Kai-Heng Feng
OK, the fix will be in the next 5.8 update:
commit f62ddacc4cb141b86ed647f9dd9eeb7653b0cc43
Author: Keith Busch
Date:   Fri Oct 30 10:28:54 2020 -0700

    Revert "nvme-pci: remove last_sq_tail"

    BugLink: https://bugs.launchpad.net/bugs/1908555

    [ Upstream commit 38210800bf66d7302da1bb5b624ad68638da1562 ]

    Multiple CPUs may be mapped to the same hctx, allowing multiple
    submission contexts to attempt commit_rqs(). We need to verify we're
    not writing the same doorbell value multiple times since that's a spec
    violation.

    Revert commit 54b2fcee1db041a83b52b51752dade6090cf952f.

    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1878596
    Reported-by: "B.L. Jones"
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Kamal Mostafa
    Signed-off-by: Ian May


** Bug watch added: Red Hat Bugzilla #1878596
   https://bugzilla.redhat.com/show_bug.cgi?id=1878596

[Bug 1910866] Re: nvme drive fails after some time

2021-01-15 Thread Andrew Hayzen
@kaihengfeng

So v5.7 was fine, and after many reboots I have found that the commit
below introduced the issue.

Do I also need to find when the issue was resolved (between v5.8-rc1
and v5.9.10), or is this information enough?


54b2fcee1db041a83b52b51752dade6090cf952f is the first bad commit
commit 54b2fcee1db041a83b52b51752dade6090cf952f
Author: Keith Busch
Date:   Mon Apr 27 11:54:46 2020 -0700

    nvme-pci: remove last_sq_tail

    The nvme driver does not have enough tags to wrap the queue, and blk-mq
    will no longer call commit_rqs() when there are no new submissions to
    notify.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

 drivers/nvme/host/pci.c | 23 ++++-------------------
 1 file changed, 4 insertions(+), 19 deletions(-)


And my $ git bisect log is the following FWIW.
git bisect start
# good: [3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162] Linux 5.7
git bisect good 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
# bad: [b3a9e3b9622ae10064826dccb4f7a52bd88c7407] Linux 5.8-rc1
git bisect bad b3a9e3b9622ae10064826dccb4f7a52bd88c7407
# bad: [ee01c4d72adffb7d424535adf630f2955748fa8b] Merge branch 'akpm' (patches from Andrew)
git bisect bad ee01c4d72adffb7d424535adf630f2955748fa8b
# bad: [16d91548d1057691979de4686693f0ff92f46000] Merge tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad 16d91548d1057691979de4686693f0ff92f46000
# good: [cfa3b8068b09f25037146bfd5eed041b78878bee] Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
git bisect good cfa3b8068b09f25037146bfd5eed041b78878bee
# good: [3fd911b69b3117e03181262fc19ae6c3ef6962ce] Merge tag 'drm-misc-next-2020-05-07' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 3fd911b69b3117e03181262fc19ae6c3ef6962ce
# good: [1966391fa576e1fb2701be8bcca197d8f72737b7] mm/migrate.c: attach_page_private already does the get_page
git bisect good 1966391fa576e1fb2701be8bcca197d8f72737b7
# bad: [0c8d3fceade2ab1bbac68bca013e62bfdb851d19] bcache: configure the asynchronous registertion to be experimental
git bisect bad 0c8d3fceade2ab1bbac68bca013e62bfdb851d19
# bad: [84b8d0d7aa159652dc191d58c4d353b6c9173c54] nvmet: use type-name map for ana states
git bisect bad 84b8d0d7aa159652dc191d58c4d353b6c9173c54
# good: [72e6329f86c714785ac195d293cb19dd24507880] nvme-fc and nvmet-fc: revise LLDD api for LS reception and LS request
git bisect good 72e6329f86c714785ac195d293cb19dd24507880
# good: [e4fcc72c1a420bdbe425530dd19724214ceb44ec] nvmet-fc: slight cleanup for kbuild test warnings
git bisect good e4fcc72c1a420bdbe425530dd19724214ceb44ec
# good: [31fdad7be18992606078caed6ff71741fa76310a] nvme: consolodate io settings
git bisect good 31fdad7be18992606078caed6ff71741fa76310a
# bad: [2a5bcfdd41d68559567cec3c124a75e093506cc1] nvme-pci: align io queue count with allocted nvme_queue in nvme_probe
git bisect bad 2a5bcfdd41d68559567cec3c124a75e093506cc1
# good: [6623c5b3dfa5513190d729a8516db7a5163ec7de] nvme: clean up error handling in nvme_init_ns_head
git bisect good 6623c5b3dfa5513190d729a8516db7a5163ec7de
# good: [74943d45eef4db64b1e5c9f7ad1d018576e113c5] nvme-pci: remove volatile cqes
git bisect good 74943d45eef4db64b1e5c9f7ad1d018576e113c5
# bad: [54b2fcee1db041a83b52b51752dade6090cf952f] nvme-pci: remove last_sq_tail
git bisect bad 54b2fcee1db041a83b52b51752dade6090cf952f
# first bad commit: [54b2fcee1db041a83b52b51752dade6090cf952f] nvme-pci: remove last_sq_tail
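
A log like the one above can be post-processed with standard tools. The following is a minimal sketch, not part of the original report: the helper name first_bad_commit and the file name bisect.log are assumptions made for illustration.

```shell
# Print the hash that a saved "git bisect log" reports as the first bad
# commit. Reads the log text from stdin; prints nothing if the bisect
# never reached a "# first bad commit" line.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]\{7,40\}\)\].*/\1/p'
}
# Usage (file name is an example):
#   first_bad_commit < bisect.log
# A saved log can also be re-applied inside a kernel checkout with:
#   git bisect replay bisect.log
```

Keeping the log in a file means someone else can replay the same good/bad decisions without repeating the builds.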

[Bug 1910866] Re: nvme drive fails after some time

2021-01-12 Thread Kai-Heng Feng
Thanks a lot!
Can you please test v5.7? Stable (point) releases aren't linear with the
mainline kernel.

Once you are sure v5.7 is good, we can start a bisect:
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect good v5.7
$ git bisect bad v5.8-rc1
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If it still has the same issue,
$ git bisect bad
Otherwise,
$ git bisect good
Repeat from "make -j`nproc` deb-pkg" until you find the offending commit.
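
Since every round of the procedure above costs a full kernel build and a reboot, it can help to estimate the number of rounds up front. The sketch below assumes plain repeated halving, which is only an approximation of how git bisect walks merge-heavy history; the function name bisect_steps is made up for illustration.

```shell
# Estimate how many build/test rounds a bisect needs for N candidate
# commits, assuming the range is halved each round (ceil(log2(N))).
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))      # round up so an odd range is not undercounted
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}
# Usage inside a kernel checkout:
#   bisect_steps "$(git rev-list --count v5.7..v5.8-rc1)"
```

git bisect prints its own "roughly N steps" estimate after each round, so this is mainly useful before starting.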

[Bug 1910866] Re: nvme drive fails after some time

2021-01-12 Thread Andrew Hayzen
And the bisect between 5.4.78 (good) and 5.8.18 (bad).

The following are the results with the mainline kernels:
v5.8.18/    FAIL
v5.8.4/     FAIL
v5.8-rc5/   FAIL
v5.8-rc1/   FAIL
v5.7.19/    PASS
v5.7.18/    PASS
v5.7.16/    PASS
v5.6.14/    PASS
v5.4.78/    PASS

From these and the previous comment's results it appears that the issue
was introduced with 5.8-rc1 and then was fixed with 5.9.9 or 5.9.10
(it is unfortunate that 5.9.9 is missing, so I cannot try it).

@kaihengfeng let me know if there is any other information I can
provide.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-12 Thread Andrew Hayzen
So bisecting between 5.8.18 (bad) and 5.11-rc3 (good).

The following are the results with the mainline kernels:
v5.11-rc3/  PASS
v5.9.12/    PASS
v5.9.10/    PASS
v5.9.9/     MISSING
v5.9.8/     FAIL (could not boot long enough for full test)
v5.9.7/     FAIL (could not boot long enough for full test)
v5.9.2/     FAIL (could not boot long enough for full test)
v5.8.18/    FAIL

Note that 5.9.2, 5.9.7, and 5.9.8 all crashed during either boot or logging
in (but after performing REISUB they all entered the Dell BIOS/recovery
stating that the hard disk could not be found, so I assume this is the
same failure).

From these results it appears that it was fixed between 5.9.8 and
5.9.10.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-12 Thread Andrew Hayzen
OK, so https://people.canonical.com/~kernel/info/kernel-version-map.html
states that Ubuntu kernel 5.8.0-36.40~20.04.1 matches mainline version
5.8.18. I have installed 5.8.18 and it fails! So it is not the Ubuntu
patches.

Ubuntu Kernels:
linux-image-5.4.0-59-generic: PASS
linux-image-5.8.0-36-generic: FAIL

Mainline Kernels:
linux-image-unsigned-5.8.18-050818-generic: FAIL
linux-image-unsigned-5.11.0-051100rc3-generic: PASS

I'll see if I can find where it changes from FAIL to PASS between 5.8.18
and 5.11-rc3 in the mainline kernels. Please advise if I should
also/instead compare between 5.4 and 5.8.18 :-)

[Bug 1910866] Re: nvme drive fails after some time

2021-01-12 Thread Andrew Hayzen
@kaihengfeng

I have found that running the command "fio --name=basic
--directory=/path/to/empty/directory --size=1G --rw=randrw --numjobs=4
--loops=5" runs fine on linux-image-5.4.0-59-generic but when trying
with linux-image-5.8.0-36-generic it would freeze the system in the
"Laying out IO file" stage. I checked on two subsequent boots that 5.8
does fail like this on an empty directory, and I will now use this as my
"test" of whether a kernel works or not.
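
For repeated A/B runs, the reproducer can be wrapped so that each boot yields a single PASS or FAIL line. This is a rough sketch under stated assumptions: the function name ab_test and the 600-second limit are illustrative, and on a kernel that freezes hard the wrapper itself may hang instead of printing FAIL.

```shell
# Run a reproducer command with a time limit and print PASS or FAIL.
# $1 is the limit in seconds; the remaining arguments are the command.
ab_test() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo PASS
    else
        echo FAIL     # non-zero exit status, or the time limit was hit
    fi
}
# Usage with the fio command from this comment:
#   ab_test 600 fio --name=basic --directory=/path/to/empty/directory \
#       --size=1G --rw=randrw --numjobs=4 --loops=5
```

Logging the result next to the output of uname -r gives a simple record of which kernels passed.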

I have installed the 5.11-rc3 mainline kernel you linked; note that I had
to disable Secure Boot to be able to use it. This kernel worked
successfully on two boots with the fio test above.

So in summary so far on my system with the fio test:
linux-image-5.4.0-59-generic: PASS
linux-image-5.8.0-36-generic: FAIL
linux-image-unsigned-5.11.0-051100rc3-generic: PASS

Please advise how to proceed here: should I start manually picking (by
bisecting) kernels between 5.8 and 5.11, or between 5.4 and 5.8?

Also, I guess I should try 5.8 mainline as well, to ensure that the Ubuntu
patches aren't causing the issue?

[Bug 1910866] Re: nvme drive fails after some time

2021-01-11 Thread Kai-Heng Feng
Andrew, since you can reliably reproduce the issue, can you please test the
latest mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11-rc3/amd64/

And we'll do a bisect or reverse-bisect based on the result.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-11 Thread Andrew Hayzen
FYI I have captured the `sudo lspci -vv` output on kernel 5.8 *before*
the issue occurs, here: https://pastebin.ubuntu.com/p/GtZyTWzKTd/
It is subtly different from the output on the 5.4 kernel (which has not
had the issue), in case that matters.

I was also able to reproduce the issue again by causing high disk I/O,
specifically I needed to have writes occurring for it to happen (I was
recursive grep'ing the whole filesystem while installing apt/pip
packages inside a docker container).

This then froze the system for 120 seconds until write timeouts
occurred, then the disk was remounted as read-only. After this point
commands on the system would fail with I/O errors (even basic ones such
as "top", although some such as "mount" still work).

However our plan was to try to retrieve more information by copying the
lspci binary and libs into a tmpfs system in RAM, so it'd still be
accessible when the disk stopped. This almost worked, but it appears a
few more configuration files would need to be placed in RAM (I could run
"lspci --help" but not "lspci" or "lspci -vv"). Instead popey has
suggested maybe using a USB key with debootstrap/chroot. (Any
suggestions of how we can retrieve more information at this point are
welcome and any commands that would be useful to run).
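
One way to stage a binary and its shared libraries in RAM ahead of time is to parse the output of ldd. The sketch below is an assumption-laden illustration: the helper name lib_paths and the /mnt/rescue mount point are hypothetical, and it does not cover data or configuration files the tool reads at run time, which seem to have been the missing piece here.

```shell
# Print the absolute library paths found in "ldd" output read from stdin.
# Handles both "name => /path (addr)" and bare "/path (addr)" lines and
# skips entries without a path, such as linux-vdso.
lib_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3; next }
         $1 ~ /^\//               { print $1 }'
}
# Usage sketch (assumes /mnt/rescue is an already mounted tmpfs and GNU cp):
#   cp --parents "$(command -v lspci)" /mnt/rescue/
#   ldd "$(command -v lspci)" | lib_paths | xargs -I{} cp --parents {} /mnt/rescue/
```

Running the staged copy under strace on a healthy boot would show which additional files it opens.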

Also as a note, if I use REISUB (
https://en.m.wikipedia.org/wiki/Magic_SysRq_key#Uses ) to reboot the
machine it enters a Dell BIOS/recovery thing that states that "No Hard
Disk is found". Then after a full power off the machine works again.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-11 Thread Alan Pope
I've tried doing various IO intensive things to trigger it but no luck
yet.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-11 Thread Andrew Hayzen
Note that for me it happens quite rapidly, sometimes after 5-10 minutes
of high disk load. E.g. the first times it happened were when apt was
running update-grub and then when pip3 install was running. Then, to
capture the logs above, I started a `find /` and a `find ~` at the same
time and this was enough to break it.
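
A rough way to generate a comparable mix of reads and writes is sketched
below. This is a stand-in, not the workload described above: the function
name io_load, the scratch file and the size are all assumptions, and
whether it triggers the failure on a given machine is unknown. Only run
it on a system where a disk dropping out is acceptable.

```shell
#!/bin/sh
# Hypothetical load generator: read every file under READ_ROOT in the
# background while writing SIZE_MB megabytes to SCRATCH_FILE.
# usage: io_load READ_ROOT SCRATCH_FILE SIZE_MB
io_load() {
    read_root=$1
    scratch=$2
    size_mb=$3

    # Reader: walk the tree (staying on one filesystem) and read each
    # regular file, discarding the contents and any permission errors.
    find "$read_root" -xdev -type f -exec cat {} + >/dev/null 2>&1 &
    reader_pid=$!

    # Writer: write zeros to the scratch file and fsync before dd exits.
    dd if=/dev/zero of="$scratch" bs=1M count="$size_mb" conv=fsync 2>/dev/null
    writer_status=$?

    # Wait for the reader to finish; its exit status is ignored.
    wait "$reader_pid"
    return "$writer_status"
}

# Example (not run here):
#   io_load / "$HOME/io_load.scratch" 4096
```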

[Bug 1910866] Re: nvme drive fails after some time

2021-01-11 Thread Alan Pope
I can try, but I can't trigger it to happen. I had 60 days of uptime on
my system before it happened last time, and 12 days the time before
that, which gives you some idea of the interval between occurrences.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-11 Thread Andrew Hayzen
@kairhengfeng Yes, this is a regression after the upgrade from 5.4 to
5.8. After the upgrade I had it multiple times, and now that I have
switched back to 5.4 my machine is stable again.

I do not think I can run `lspci -vv` *after* the issue happens, as my
NVMe drive goes read-only, so all commands fail.

This is the output of `sudo lspci -vv` on kernel 5.4, *before* it
happens: https://pastebin.ubuntu.com/p/tCshwbhpqs/ . Let me know whether
also running this on 5.8 *before* it happens would be useful.

@popey are you able to run this command before and after it happens on
your dual disk system?

[Bug 1910866] Re: nvme drive fails after some time

2021-01-10 Thread Kai-Heng Feng
Is this a regression? Did it start to happen after the upgrade from 5.4
to 5.8?

And is it possible to attach `lspci -vv` after the issue happens?

[Bug 1910866] Re: nvme drive fails after some time

2021-01-09 Thread Alan Pope
It's the TOSHIBA-RD400 on /home for me that's failing.

[Bug 1910866] Re: nvme drive fails after some time

2021-01-09 Thread Andrew Hayzen
I'm on Ubuntu 20.04, and after updating to the HWE 5.8 kernel recently I
have also been suffering my nvme drive becoming read only after a period
of time. I have now switched back to the 5.4 kernel and not suffered the
issue again.

I am on a single disk system so had to run dmesg --follow remotely on
another machine to retrieve log information.
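
For anyone wanting to do the same, a minimal sketch of that kind of
remote capture is below. The helper name capture_log, the host name and
the log file name are assumptions for illustration; the point is only
that the log is written on the second machine, so it survives when the
disk on the affected machine goes read-only.

```shell
#!/bin/sh
# Hypothetical helper: run a command and append everything it prints to a
# local log file while also showing it on the terminal.
# usage: capture_log LOG_FILE COMMAND [ARGS...]
capture_log() {
    log_file=$1
    shift
    "$@" 2>&1 | tee -a "$log_file"
}

# Example, run on the *other* machine (user and host are placeholders):
#   capture_log nvme-dmesg.log ssh user@affected-host sudo dmesg --follow
```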

Here is a pastebin from around the time my system locks up:
https://pastebin.ubuntu.com/p/FKsJV8VwRw/ (note it has similar errors: a
timeout with an abort, then a reset, then a call trace, etc).

Here is a pastebin of the smartctl output:
https://pastebin.ubuntu.com/p/W9w2nHYhd2/ . The drive itself appears to
be fine and not failing (it does seem to increment "Error Information Log
Entries" when this lockup happens, but when viewing the error it is just
full of 0x).


System info when the lockup happened:

Machine: Dell XPS 13 9360
Drive: THNSN5512GPUK NVMe TOSHIBA 512GB
Kernel at the time: $ apt policy linux-image-generic-hwe-20.04
linux-image-generic-hwe-20.04:
  Installed: 5.8.0.36.40~20.04.21
  Candidate: 5.8.0.36.40~20.04.21
  Version table:
 *** 5.8.0.36.40~20.04.21 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
        100 /var/lib/dpkg/status
     5.4.0.26.32 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

Let me know if I can provide any more info :-)

[Bug 1910866] Re: nvme drive fails after some time

2021-01-09 Thread Kai-Heng Feng
Which one is the failing one? Samsung or OCZ?
