Public bug reported:

On AWS, it is possible to get an EBS volume stuck in the detaching state,
where it is not present in lsblk or lspci, but the cloud console says
the volume is detaching, and PCI / NVMe errors are output to dmesg every
180 seconds.

To reproduce reliably:

1) Start an AWS instance. It can be any type, nitro or regular. I have
successfully reproduced this on m5.large and t3.small. Add an extra volume
of any size during creation.

2) Connect to the instance. lsblk and lspci show the nvme device there.
Detach the volume from the web console. It will detach successfully.
You can attach and then detach the volume again, and it will also work
successfully.

3) With the volume detached, reboot the instance. I do "sudo reboot".

4) When the instance comes back up and you have logged in, attach the
volume.

dmesg will have (normal):

kernel: [   67] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
kernel: [   67] pci 0000:00:1f.0: reg 0x10: [mem 0x00000000-0x00003fff]
kernel: [   67] pci 0000:00:1f.0: BAR 0: assigned [mem 0x80000000-0x80003fff]
kernel: [   67] nvme nvme1: pci function 0000:00:1f.0
kernel: [   67] nvme 0000:00:1f.0: enabling device (0000 -> 0002)
kernel: [   67] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10

5) Detach the volume from the web console. If you keep refreshing the volume
view, you will see that the volume remains in the detaching state and is still
shown as in use.

The device will be missing from lsblk and lspci.

dmesg will print these messages every 180 seconds:

Kernels 4.4 through 4.15:

kernel: [  603] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
kernel: [  603] pci 0000:00:1f.0: reg 0x10: [mem 0x80000000-0x80003fff]
kernel: [  603] pci 0000:00:1f.0: BAR 0: assigned [mem 0x80000000-0x80003fff]
kernel: [  603] nvme nvme1: pci function 0000:00:1f.0
kernel: [  603] nvme nvme1: failed to mark controller live
kernel: [  603] nvme nvme1: Removing after probe failure status: 0

Latest mainline kernel:

kernel: [  243] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
kernel: [  243] pci 0000:00:1f.0: reg 0x10: [mem 0xc0000000-0xc0003fff]
kernel: [  243] pci 0000:00:1f.0: BAR 0: assigned [mem 0xc0000000-0xc0003fff]
kernel: [  243] nvme nvme1: pci function 0000:00:1f.0
kernel: [  244] nvme nvme1: failed to mark controller CONNECTING
kernel: [  244] nvme nvme1: Removing after probe failure status: 0
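To spot this signature in a captured log, the two failure messages above can be matched with a small script. This is only a rough sketch: the function name find_probe_failures and the regular expression are my own assumptions for illustration, and the pattern only covers the message forms quoted in this report (with either truncated or full timestamps).

```python
import re

# Matches lines such as:
#   kernel: [  603] nvme nvme1: failed to mark controller live
#   kernel: [  244] nvme nvme1: Removing after probe failure status: 0
# The bracketed timestamp may be truncated ([  603]) or full ([  603.998310]).
PROBE_FAILURE = re.compile(
    r"\[\s*(?P<ts>\d+(?:\.\d+)?)\]\s+nvme (?P<ctrl>nvme\d+): "
    r"(?P<msg>failed to mark controller \w+"
    r"|Removing after probe failure status: -?\d+)"
)


def find_probe_failures(log_text):
    """Return (timestamp_seconds, controller, message) for each failure line."""
    hits = []
    for line in log_text.splitlines():
        m = PROBE_FAILURE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("ctrl"), m.group("msg")))
    return hits


# Sample lines copied from the dmesg output in this report.
SAMPLE = """\
kernel: [  603] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
kernel: [  603] nvme nvme1: pci function 0000:00:1f.0
kernel: [  603] nvme nvme1: failed to mark controller live
kernel: [  603] nvme nvme1: Removing after probe failure status: 0
kernel: [  244] nvme nvme1: failed to mark controller CONNECTING
"""

if __name__ == "__main__":
    for ts, ctrl, msg in find_probe_failures(SAMPLE):
        print(f"{ts:>8.0f}s {ctrl}: {msg}")
```

Feeding it the text of dmesg or the kernel log and comparing the timestamps of successive hits for the same controller should show whether the failed probe is recurring.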

The volume is now stuck detaching and the above will continue until you
forcefully detach the volume.
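Instead of refreshing the console by hand, the stuck state can be detected by polling the volume's attachment state. The following is a minimal sketch, not a tested reproducer: wait_for_detach is a hypothetical helper, the timeout and poll values are arbitrary, and the ec2 argument is assumed to behave like a boto3 EC2 client, whose describe_volumes response lists each attachment with a State such as "detaching" or "detached".

```python
import time


def wait_for_detach(ec2, volume_id, timeout=300.0, poll=5.0, sleep=time.sleep):
    """Poll until the volume has no live attachment.

    Returns True if the volume detached within `timeout` seconds,
    False if it is still attached or detaching when the timeout expires.
    `ec2` is assumed to behave like boto3.client("ec2").
    """
    deadline = time.monotonic() + timeout
    while True:
        volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
        states = [a["State"] for a in volume.get("Attachments", [])]
        # No attachments, or only "detached" entries, means the detach finished.
        if not states or all(s == "detached" for s in states):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll)
```

With boto3 installed and credentials configured, this could be called as wait_for_detach(boto3.client("ec2"), volume_id) after requesting the detach. A False result would correspond to the stuck state described here; the forceful detach can then be requested with the client's detach_volume call and Force=True.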

This seems to affect all distributions and all kernels.
I have tested xenial, bionic, bionic + hwe, bionic + eoan,
bionic + mainline 5.2.2, rhel8 and Amazon Linux 2. All are affected with the
same symptoms.

I tried to trace NVMe on 5.2, but without much success in finding any problem:
https://paste.ubuntu.com/p/c6rmDpvHJk/

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Incomplete


** Tags: sts

https://bugs.launchpad.net/bugs/1837833

Title:
  EBS Volumes get stuck detaching from AWS instances
