[Bug 1925211] Re: Hot-unplug of disks leaves broken block devices around in Hirsute on s390x

Christian Ehrhardt  Wed, 21 Apr 2021 00:16:15 -0700

I wondered whether I could trigger the same issue on an LPAR, since that
would raise the severity IMHO. I make no claim that these tests cover
everything that could happen; I tried what I considered the low-hanging
fruit for this cross-check.



Pre-conditions for each test
- a DASD attached to the system
- not in use, e.g. not as a FS
- no aliases enabled
=> this (more or less) matches our former KVM-based test case


$ lscss | grep 1523; lsdasd 0.0.1523; ll /dev/dasdc
0.0.1523 0.0.0183  3390/0c 3990/e9 yes  f0  f0  ff   10111213 00000000
Bus-ID    Status    Name      Device  Type         BlkSz  Size      Blocks
================================================================================
0.0.1523  active    dasdc     94:8    ECKD         4096   7043MB    1803060
brw-rw---- 1 root disk 94, 8 Apr 21 06:21 /dev/dasdc


After each removal action I checked the same state again, and I ran
udevadm monitor to see if an unbind happened.
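
The before/after comparison of the lsdasd state can be scripted instead
of being read by eye. This is a minimal sketch, not part of the original
test run: the function name dasd_status is hypothetical, and it assumes
the lsdasd table layout shown above (Bus-ID in the first column, Status
in the second).

```shell
#!/bin/sh
# dasd_status BUSID (hypothetical helper)
# Reads lsdasd-style output on stdin and prints the Status column of the
# row whose first column equals BUSID. Returns non-zero if no such row
# exists, i.e. the device is no longer listed.
dasd_status() {
    awk -v id="$1" '
        $1 == id { print $2; found = 1 }
        END      { exit !found }
    '
}
```

Possible use on the test system: `lsdasd 0.0.1523 | dasd_status 0.0.1523`
before and after each removal action, then compare the two results.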


---

#1 cio purge
$ sudo cio_ignore -a 0.0.1523; sudo cio_ignore --purge

=> cio_ignore can't take away devices that are already online, and I'm
not interested in blocking a device up front ..

---

#2 chzdev
$ sudo chzdev --disable 0.0.1523

=> properly removed

---

#3 remove the dasds on the storage server
"LSS 08 SRV_SS0_0823" is mapped to s1lp5 0.0.1523 - removing that on the 
storage server

By default that fails:

Error - delete of volume SRV_SS0_0823 failed.
8:28 AM
Error: CMUN02948E IBM.2107-75DXP71/0823 The Delete logical volume task cannot 
be initiated because the Allow Host Pre-check Control Switch is set to true and 
the volume that you have specified is online to a host.

In the old UI the force option is available as a checkbox, so I tried it
that way. Done.


The system does not realize that the disk is gone, and I/O on it (e.g.
dasdfmt) goes into a deadlock.
After a while in that hang the system realizes it is in trouble:

dmesg:
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd): I/O status report for device 0.0.1523:
                              dasd(eckd): in req: 00000000e903a5ac CC:00 FC:00 AC:00 SC:00 DS:00 CS:00 RC:-11
                              dasd(eckd): device 0.0.1523: Failing CCW: 0000000000000000
                              dasd(eckd): SORRY - NO VALID SENSE AVAILABLE
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd): Related CP in req: 00000000e903a5ac
                              dasd(eckd): CCW 00000000c3e100c4: 2760000C 014C5FF0 DAT:  18000000 08231c00  00000000
                              dasd(eckd): CCW 00000000335dd238: 3E20401A 00A40000 DAT:  00000000 00000000  00000000 00000000  00000000 00000000  00000000 00000000
Apr 21 06:42:32 s1lp5 kernel: dasd(eckd):......
Apr 21 06:42:32 s1lp5 kernel: dasd-eckd.adb621: 0.0.1523: ERP failed for the DASD

udevadm:
KERNEL[1313.022835] remove   /devices/css0/0.0.0183/0.0.1523/block/dasdc/dasdc1 (block)
UDEV  [1313.024648] remove   /devices/css0/0.0.0183/0.0.1523/block/dasdc/dasdc1 (block)

Even after the above - the disk is still "present":
$ lscss | grep 1523; lsdasd 0.0.1523; ll /dev/dasdc
0.0.1523 0.0.0183  3390/0c 3990/e9 yes  f0  f0  0f   10111213 00000000
Bus-ID    Status    Name      Device  Type         BlkSz  Size      Blocks
================================================================================
0.0.1523  active    dasdc     94:8    ECKD         4096   7043MB    1803060
brw-rw---- 1 root disk 94, 8 Apr 21 06:26 /dev/dasdc
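
In the udevadm output shown above only the partition dasdc1 gets a remove
event, while the disk dasdc stays listed. A check for that could look
like the sketch below. The helper names removed_devpaths and disk_removed
are hypothetical. The sketch assumes unwrapped udevadm monitor lines
(tag, action, devpath, subsystem) and assumes that a whole-disk event
would carry a devpath ending in /block/<name>.

```shell
#!/bin/sh
# removed_devpaths (hypothetical helper)
# Reads `udevadm monitor` output on stdin and prints the devpath of every
# kernel-level "remove" event in the block subsystem.
removed_devpaths() {
    awk '$1 ~ /^KERNEL\[/ && $2 == "remove" && $NF == "(block)" { print $3 }'
}

# disk_removed NAME (hypothetical helper)
# Succeeds only if stdin contains a kernel remove event for the whole
# disk NAME. Partition events such as .../block/NAME/NAME1 do not count.
disk_removed() {
    removed_devpaths | grep -q "/block/$1\$"
}
```

Fed with a saved udevadm monitor log, `disk_removed dasdc < monitor.log`
would fail for the events captured in this test, since only dasdc1 was
removed (monitor.log is a placeholder file name).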


Only when I detach it from the system via chzdev do the hanging processes
get unstuck and the device get removed.


So maybe case #3 is a good one ... ?
I tried the same with a Focal kernel, which did not show the issue we've
seen with the KVM disk detach.
=> 5.4.0-72-generic

With that kernel the behavior in case #3 is much the same as with the new
5.11 kernel.

So, while these tests are not complete, we can still assume that this
issue might really only affect the detach of KVM disks. Is that good? No.
But is it so bad that we need to interrupt the kernel cycle? IMHO it is
not.

So IMHO this can go into the next normal kernel SRU cycle, which also
gives the IBM developers a chance to chime in on which of the proposed
solutions they want.

** Changed in: udev (Ubuntu Hirsute)
       Status: New => Invalid

** Changed in: systemd (Ubuntu Hirsute)
       Status: New => Invalid

** Changed in: linux (Ubuntu Hirsute)
       Status: Confirmed => Triaged

** Changed in: linux (Ubuntu Hirsute)
   Importance: Undecided => High

** Changed in: udev (Ubuntu Hirsute)
   Importance: Critical => Undecided

** Changed in: ubuntu-z-systems
       Status: New => Triaged

** Changed in: ubuntu-z-systems
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1925211

