On Wed, Apr 07, 2021 at 12:09:23PM +0000, Max Gurtovoy wrote:
> According to the spec after writing 0 to device_status, the driver MUST
> wait for a read of device_status to return 0 before reinitializing the
> device. In case we have a device that won't return 0, the reset
> operation will loop forever and cause the host/vm to stuck. Set timeout
> for 3 minutes before giving up on the device.
> 
> Signed-off-by: Max Gurtovoy <[email protected]>
> ---
>  drivers/virtio/virtio_pci_modern.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_pci_modern.c 
> b/drivers/virtio/virtio_pci_modern.c
> index cc3412a96a17..dcee616e8d21 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -162,6 +162,7 @@ static int vp_reset(struct virtio_device *vdev)
>  {
>       struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>       struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
> +     unsigned long timeout = jiffies + msecs_to_jiffies(180000);
>  
>       /* 0 status means a reset. */
>       vp_modern_set_status(mdev, 0);
> @@ -169,9 +170,16 @@ static int vp_reset(struct virtio_device *vdev)
>        * device_status to return 0 before reinitializing the device.
>        * This will flush out the status write, and flush in device writes,
>        * including MSI-X interrupts, if any.
> +      * Set a timeout before giving up on the device.
>        */
> -     while (vp_modern_get_status(mdev))
> +     while (vp_modern_get_status(mdev)) {
> +             if (time_after(jiffies, timeout)) {
> +                     dev_err(&vdev->dev, "virtio: device not ready. "
> +                             "Aborting. Try again later\n");
> +                     return -EAGAIN;
> +             }
>               msleep(1);
> +     }
>       /* Flush pending VQ/configuration callbacks. */
>       vp_synchronize_vectors(vdev);
>       return 0;

Problem is everyone just ignores the return code from reset.
Timing out like that has a chance to cause a lot of trouble
if the device remains active - we need to make reset robust.

What exactly is going on with the device that
get status never returns 0? E.g. maybe it's in a state
where it's returning all 1's because it's wedged permanently -
using that would be better...



> -- 
> 2.25.4

_______________________________________________
Virtualization mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Reply via email to