> From: Cornelia Huck <coh...@redhat.com>
> Sent: 28 August 2025 05:52 PM
>
> On Thu, Aug 28 2025, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
> > On Thu, Aug 28, 2025 at 02:16:28PM +0200, Cornelia Huck wrote:
> >> On Thu, Aug 28 2025, Parav Pandit <pa...@nvidia.com> wrote:
> >>
> >> >> From: Cornelia Huck <coh...@redhat.com>
> >> >> Sent: 27 August 2025 05:04 PM
> >> >>
> >> >> On Wed, Aug 27 2025, "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >> >>
> >> >> > On Tue, Aug 26, 2025 at 06:52:03PM +0000, Parav Pandit wrote:
> >> >> >> > What I do not understand, is what good does the revert do. Sorry.
> >> >> >> >
> >> >> >> Let me explain.
> >> >> >> It prevents the issue of vblk requests being stuck due to broken VQ.
> >> >> >> It prevents the vnet driver start_xmit() to be not stuck on skb completions.
> >> >> >
> >> >> > This is the part I don't get. In what scenario, before
> >> >> > 43bb40c5b9265 start_xmit is not stuck, but after 43bb40c5b9265 it is stuck?
> >> >> >
> >> >> > Once the device is gone, it is not using any buffers at all.
> >> >>
> >> >> What I also don't understand: virtio-ccw does exactly the same
> >> >> thing (virtio_break_device(), added in 2014), and it supports
> >> >> surprise removal _only_, yet I don't remember seeing bug reports?
> >> >
> >> > I suspect that stress testing may not have happened for ccw with active vblk Ios and outstanding transmit pkt and cvq commands.
> >> > Hard to say as we don't have ccw hw or systems.
> >>
> >> cc:ing linux-s390 list. I'd be surprised if nobody ever tested
> >> surprise removal on a loaded system in the last 11 years.
> >
> >
> > As it became very clear from follow up discussion, the issue is
> > nothing to do with virtio, it is with a broken hypervisor that allows
> > device to DMA into guest memory while also telling the guest that the
> > device has been removed.
> >
> > I guess s390 is just not broken like this.
>
> Ah good, I missed that -- that indeed sounds broken, and needs to be fixed
> there.

No. This is not the issue. You missed this because the focus was on fixing the device.
The fact is that the driver expects IOs, CVQ commands, and DMA to succeed even after the device is removed. The driver also expects the device reset to succeed. Stefan already pointed this out in the vblk driver patches. This is why you see call traces in del_gendisk() and in CVQ commands.

Again, it is the drivers that are broken, not the device. The device can stop DMA and stop responding to requests, and a 6.X kernel will still hang as long as it contains the cited commit.