Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-04 Thread Michael S. Tsirkin
On Wed, Oct 04, 2023 at 12:15:48PM +0200, Laszlo Ersek wrote:
> On 10/3/23 17:55, Stefan Hajnoczi wrote:
> > On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin  wrote:
> >>
> >> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
> >>> (1) The virtio-1.0 specification
> >>>  writes:
> >>>
>  3 General Initialization And Device Operation
>  3.1   Device Initialization
>  3.1.1 Driver Requirements: Device Initialization
> 
>  [...]
> 
>  7. Perform device-specific setup, including discovery of virtqueues for
> the device, optional per-bus setup, reading and possibly writing the
> device’s virtio configuration space, and population of virtqueues.
> 
>  8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >>>
> >>> and
> >>>
>  4 Virtio Transport Options
>  4.1   Virtio Over PCI Bus
>  4.1.4 Virtio Structure PCI Capabilities
>  4.1.4.3   Common configuration structure layout
>  4.1.4.3.2 Driver Requirements: Common configuration structure layout
> 
>  [...]
> 
>  The driver MUST configure the other virtqueue fields before enabling the
>  virtqueue with queue_enable.
> 
>  [...]
> >>>
> >>> These together mean that the following sub-sequence of steps is valid for
> >>> a virtio-1.0 guest driver:
> >>>
> >>> (1.1) set "queue_enable" for the needed queues as the final part of device
> >>> initialization step (7),
> >>>
> >>> (1.2) set DRIVER_OK in step (8),
> >>>
> >>> (1.3) immediately start sending virtio requests to the device.
> >>>
> >>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> >>> special virtio feature is negotiated, then virtio rings start in disabled
> >>> state, according to
> >>> .
> >>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> >>> enabling vrings.
> >>>
> >>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> >>> operation, which travels from the guest through QEMU to the vhost-user
> >>> backend, using a unix domain socket.
> >>>
> >>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> >>> evades QEMU -- it travels from guest to the vhost-user backend via
> >>> eventfd.
> >>>
> >>> This means that steps (1.1) and (1.3) travel through different channels,
> >>> and their relative order can be reversed, as perceived by the vhost-user
> >>> backend.
> >>>
> >>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> >>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> >>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> >>> crate.)
> >>>
> >>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> >>> device initialization steps (i.e., control plane operations), and
> >>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> >>> operation). In the Rust-language virtiofsd, this creates a race between
> >>> two components that run *concurrently*, i.e., in different threads or
> >>> processes:
> >>>
> >>> - Control plane, handling vhost-user protocol messages:
> >>>
> >>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >>>   [crates/vhost-user-backend/src/handler.rs] handles
> >>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> >>>   flag according to the message processed.
> >>>
> >>> - Data plane, handling virtio / FUSE requests:
> >>>
> >>>   The "VringEpollHandler::handle_event" method
> >>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >>>   virtio / FUSE request, consuming the virtio kick at the same time. If
> >>>   the vring's "enabled" flag is set, the virtio / FUSE request is
> >>>   processed genuinely. If the vring's "enabled" flag is clear, then the
> >>>   virtio / FUSE request is discarded.
> >>>
> >>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> >>> However, if the data plane processor in virtiofsd wins the race, then it
> >>> sees the FUSE_INIT *before* the control plane processor took notice of
> >>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> >>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> >>> back to waiting for further virtio / FUSE requests with epoll_wait.
> >>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> >>>
> >>> The deadlock is not deterministic. OVMF hangs infrequently during first
> >>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> >>> shell.
> >>>
> >>> The race can be "reliably masked" by inserting a very small delay -- a
> >>> single debug message -- at the top of "VringEpollHandler::handle_event",
> >>> i.e., just before the data plane processor checks the "enabled" 

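To make the ordering hazard concrete, here is a minimal, self-contained sketch of the race described in the message above. The names are hypothetical and this is not the actual virtiofsd or QEMU code: the control-plane thread flips a per-vring "enabled" flag when it processes VHOST_USER_SET_VRING_ENABLE, while the data-plane thread consumes the kick and discards the request if the flag is still clear. Whenever the data-plane thread wins, FUSE_INIT is dropped and the guest waits forever for a response.

/* Hypothetical sketch of the race; not the actual virtiofsd/QEMU code.
 * Build: gcc -O2 -pthread race-sketch.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_bool vring_enabled;       /* rings start disabled when
                                           VHOST_USER_F_PROTOCOL_FEATURES
                                           is negotiated */

/* Control plane: acts on VHOST_USER_SET_VRING_ENABLE from the unix socket. */
static void *control_plane(void *arg)
{
    (void)arg;
    atomic_store(&vring_enabled, true); /* green-light the queue */
    return NULL;
}

/* Data plane: woken by the kick eventfd, pops the request (FUSE_INIT). */
static void *data_plane(void *arg)
{
    (void)arg;
    if (!atomic_load(&vring_enabled)) {
        printf("vring still disabled -> FUSE_INIT dropped, guest deadlocks\n");
    } else {
        printf("vring enabled -> FUSE_INIT processed\n");
    }
    return NULL;
}

int main(void)
{
    pthread_t ctrl, data;

    /* The guest writes queue_enable (1.1) and immediately kicks (1.3);
     * the two events arrive on independent channels, so either thread
     * may run first. */
    pthread_create(&ctrl, NULL, control_plane, NULL);
    pthread_create(&data, NULL, data_plane, NULL);
    pthread_join(ctrl, NULL);
    pthread_join(data, NULL);
    return 0;
}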
Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-04 Thread Laszlo Ersek
On 10/3/23 17:55, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin  wrote:
>>
>> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>>  writes:
>>>
 3 General Initialization And Device Operation
 3.1   Device Initialization
 3.1.1 Driver Requirements: Device Initialization

 [...]

 7. Perform device-specific setup, including discovery of virtqueues for
the device, optional per-bus setup, reading and possibly writing the
device’s virtio configuration space, and population of virtqueues.

 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
 4 Virtio Transport Options
 4.1   Virtio Over PCI Bus
 4.1.4 Virtio Structure PCI Capabilities
 4.1.4.3   Common configuration structure layout
 4.1.4.3.2 Driver Requirements: Common configuration structure layout

 [...]

 The driver MUST configure the other virtqueue fields before enabling the
 virtqueue with queue_enable.

 [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> .
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>>   [crates/vhost-user-backend/src/handler.rs] handles
>>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>>   flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>>   The "VringEpollHandler::handle_event" method
>>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>>   virtio / FUSE request, consuming the virtio kick at the same time. If
>>>   the vring's "enabled" flag is set, the virtio / FUSE request is
>>>   processed genuinely. If the vring's "enabled" flag is clear, then the
>>>   virtio / FUSE request is discarded.
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>>
>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>> shell.
>>>
>>> The race can be "reliably masked" by inserting a very small delay -- a
>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>> i.e., just before the data plane processor checks the "enabled" field of
>>> the vring. That delay suffices for the control plane processor to act upon
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>> step (1.1) -- i.e., in the write to the "queue_enable" register 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Stefan Hajnoczi
On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin  wrote:
>
> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
> > (1) The virtio-1.0 specification
> >  writes:
> >
> > > 3 General Initialization And Device Operation
> > > 3.1   Device Initialization
> > > 3.1.1 Driver Requirements: Device Initialization
> > >
> > > [...]
> > >
> > > 7. Perform device-specific setup, including discovery of virtqueues for
> > >the device, optional per-bus setup, reading and possibly writing the
> > >device’s virtio configuration space, and population of virtqueues.
> > >
> > > 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >
> > and
> >
> > > 4 Virtio Transport Options
> > > 4.1   Virtio Over PCI Bus
> > > 4.1.4 Virtio Structure PCI Capabilities
> > > 4.1.4.3   Common configuration structure layout
> > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > >
> > > [...]
> > >
> > > The driver MUST configure the other virtqueue fields before enabling the
> > > virtqueue with queue_enable.
> > >
> > > [...]
> >
> > These together mean that the following sub-sequence of steps is valid for
> > a virtio-1.0 guest driver:
> >
> > (1.1) set "queue_enable" for the needed queues as the final part of device
> > initialization step (7),
> >
> > (1.2) set DRIVER_OK in step (8),
> >
> > (1.3) immediately start sending virtio requests to the device.
> >
> > (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > special virtio feature is negotiated, then virtio rings start in disabled
> > state, according to
> > .
> > In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > enabling vrings.
> >
> > Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > operation, which travels from the guest through QEMU to the vhost-user
> > backend, using a unix domain socket.
> >
> > Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > evades QEMU -- it travels from guest to the vhost-user backend via
> > eventfd.
> >
> > This means that steps (1.1) and (1.3) travel through different channels,
> > and their relative order can be reversed, as perceived by the vhost-user
> > backend.
> >
> > That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > crate.)
> >
> > Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > device initialization steps (i.e., control plane operations), and
> > immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > operation). In the Rust-language virtiofsd, this creates a race between
> > two components that run *concurrently*, i.e., in different threads or
> > processes:
> >
> > - Control plane, handling vhost-user protocol messages:
> >
> >   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >   [crates/vhost-user-backend/src/handler.rs] handles
> >   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> >   flag according to the message processed.
> >
> > - Data plane, handling virtio / FUSE requests:
> >
> >   The "VringEpollHandler::handle_event" method
> >   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >   virtio / FUSE request, consuming the virtio kick at the same time. If
> >   the vring's "enabled" flag is set, the virtio / FUSE request is
> >   processed genuinely. If the vring's "enabled" flag is clear, then the
> >   virtio / FUSE request is discarded.
> >
> > Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > However, if the data plane processor in virtiofsd wins the race, then it
> > sees the FUSE_INIT *before* the control plane processor took notice of
> > VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > back to waiting for further virtio / FUSE requests with epoll_wait.
> > Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> >
> > The deadlock is not deterministic. OVMF hangs infrequently during first
> > boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > shell.
> >
> > The race can be "reliably masked" by inserting a very small delay -- a
> > single debug message -- at the top of "VringEpollHandler::handle_event",
> > i.e., just before the data plane processor checks the "enabled" field of
> > the vring. That delay suffices for the control plane processor to act upon
> > VHOST_USER_SET_VRING_ENABLE.
> >
> > We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Stefan Hajnoczi
On Tue, 3 Oct 2023 at 10:40, Michael S. Tsirkin  wrote:
>
> On Tue, Oct 03, 2023 at 09:08:15AM -0400, Stefan Hajnoczi wrote:
> > On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> > > > One more question:
> > > >
> > > > Why is the disabled state not needed by regular (non-vhost) virtio-net 
> > > > devices?
> > >
> > > Tap does the same - it purges queued packets:
> > >
> > > int tap_disable(NetClientState *nc)
> > > {
> > > TAPState *s = DO_UPCAST(TAPState, nc, nc);
> > > int ret;
> > >
> > > if (s->enabled == 0) {
> > > return 0;
> > > } else {
> > > ret = tap_fd_disable(s->fd);
> > > if (ret == 0) {
> > > qemu_purge_queued_packets(nc);
> > > s->enabled = false;
> > > tap_update_fd_handler(s);
> > > }
> > > return ret;
> > > }
> > > }
> >
> > tap_disable() is not equivalent to the vhost-user "started but
> > disabled" ring state. tap_disable() is a synchronous one-time action,
> > while "started but disabled" is a continuous state.
>
> well, yes. but practically guests do not queue too many buffers
> after disabling a queue. I don't know if they reliably don't
> or it's racy and we didn't notice it yet - I think it
> was mostly dpdk that had this and that's usually
> used with vhost-user.
>
> > The "started but disabled" ring state isn't needed to achieve this.
> > The back-end can just drop tx buffers upon receiving
> > VHOST_USER_SET_VRING_ENABLE .num=0.
>
> yes, maybe that would have been a better way to do this.
>
>
> > The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> > introduced before the "started but disabled" state was defined,
> > and it explicitly mentions tap attach/detach:
> >
> > commit 7263a0ad784b719ebed736a1119cc2e08110
> > Author: Changchun Ouyang 
> > Date:   Wed Sep 23 12:20:01 2015 +0800
> >
> > vhost-user: add a new message to disable/enable a specific virt queue.
> >
> > Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> > a specific virt queue, which is similar to attach/detach queue for
> > tap device.
> >
> > and then later:
> >
> > commit c61f09ed855b5009f816242ce281fd01586d4646
> > Author: Michael S. Tsirkin 
> > Date:   Mon Nov 23 12:48:52 2015 +0200
> >
> > vhost-user: clarify start and enable
> >
> > >
> > > what about non tap backends? I suspect they just aren't
> > > used widely with multiqueue so no one noticed.
> >
> > I still don't understand why "started but disabled" is needed instead
> > of just two ring states: enabled and disabled.
>
> With dropping packets when ring is disabled? Maybe that would
> have been enough. I also failed to realize it's specific to
> net, seemed generic to me :(
>
> > It seems like the cleanest path going forward is to keep the "ignore
> > rx, discard tx" semantics for virtio-net devices but to clarify in the
> > spec that other device types do not process the ring:
> >
> > "
> > * started but disabled: the back-end must not process the ring. For legacy
> >   reasons there is an exception for the networking device, where the
> >   back-end must process and discard any TX packets and not process
> >   other rings.
> > "
> >
> > What do you think?
> >
> > Stefan
>
> Okay... I hope we are not missing any devices which need virtio net
> semantics. Care checking them all?

Sure, I will check them. I'm very curious myself whether virtio-vsock
is affected (it has rx and tx queues).

I will report back.

Stefan

>
> --
> MST
>



Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Michael S. Tsirkin
On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
> (1) The virtio-1.0 specification
>  writes:
> 
> > 3 General Initialization And Device Operation
> > 3.1   Device Initialization
> > 3.1.1 Driver Requirements: Device Initialization
> >
> > [...]
> >
> > 7. Perform device-specific setup, including discovery of virtqueues for
> >the device, optional per-bus setup, reading and possibly writing the
> >device’s virtio configuration space, and population of virtqueues.
> >
> > 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> 
> and
> 
> > 4 Virtio Transport Options
> > 4.1   Virtio Over PCI Bus
> > 4.1.4 Virtio Structure PCI Capabilities
> > 4.1.4.3   Common configuration structure layout
> > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >
> > [...]
> >
> > The driver MUST configure the other virtqueue fields before enabling the
> > virtqueue with queue_enable.
> >
> > [...]
> 
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
> 
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
> 
> (1.2) set DRIVER_OK in step (8),
> 
> (1.3) immediately start sending virtio requests to the device.
> 
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> .
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
> 
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
> 
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
> 
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
> 
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
> 
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
> 
> - Control plane, handling vhost-user protocol messages:
> 
>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>   [crates/vhost-user-backend/src/handler.rs] handles
>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>   flag according to the message processed.
> 
> - Data plane, handling virtio / FUSE requests:
> 
>   The "VringEpollHandler::handle_event" method
>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>   virtio / FUSE request, consuming the virtio kick at the same time. If
>   the vring's "enabled" flag is set, the virtio / FUSE request is
>   processed genuinely. If the vring's "enabled" flag is clear, then the
>   virtio / FUSE request is discarded.
> 
> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> However, if the data plane processor in virtiofsd wins the race, then it
> sees the FUSE_INIT *before* the control plane processor took notice of
> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> back to waiting for further virtio / FUSE requests with epoll_wait.
> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> 
> The deadlock is not deterministic. OVMF hangs infrequently during first
> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> shell.
> 
> The race can be "reliably masked" by inserting a very small delay -- a
> single debug message -- at the top of "VringEpollHandler::handle_event",
> i.e., just before the data plane processor checks the "enabled" field of
> the vring. That delay suffices for the control plane processor to act upon
> VHOST_USER_SET_VRING_ENABLE.
> 
> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> cannot advance to the FUSE_INIT submission before virtiofsd's control
> plane processor takes notice of the queue being enabled.
> 
> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> 
> - 

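The fix described in the message above boils down to making this one control-plane message synchronous. A rough sketch of the idea, assuming the message layout and REPLY_ACK mechanism from the vhost-user specification (the constants, struct layout and helper below are illustrative, not QEMU's actual implementation): the front-end sets the need-reply flag on VHOST_USER_SET_VRING_ENABLE and blocks on the unix socket until the back-end acknowledges it, so the guest's "queue_enable" write cannot retire before the back-end has enabled the ring.

/* Illustrative only; not the actual QEMU vhost-user front-end code. */
#include <stdint.h>
#include <unistd.h>

/* Values as given in the vhost-user spec; treat them as illustrative here. */
#define VHOST_USER_SET_VRING_ENABLE  18u
#define VHOST_USER_VERSION           0x1u
#define VHOST_USER_NEED_REPLY_MASK   (1u << 3)
#define VHOST_USER_HDR_SIZE          12u   /* request + flags + size */

struct vhost_vring_state {
    uint32_t index;                  /* queue index */
    uint32_t num;                    /* 1 = enable, 0 = disable */
};

struct vhost_user_msg {
    uint32_t request;
    uint32_t flags;
    uint32_t size;                   /* payload size in bytes */
    union {
        struct vhost_vring_state state;
        uint64_t u64;                /* REPLY_ACK result, 0 on success */
    } payload;
} __attribute__((packed));

/* Send SET_VRING_ENABLE and block until the back-end's reply arrives.
 * Requires VHOST_USER_PROTOCOL_F_REPLY_ACK to have been negotiated. */
static int set_vring_enable_sync(int sock, uint32_t index, int enable)
{
    struct vhost_user_msg msg = {
        .request = VHOST_USER_SET_VRING_ENABLE,
        .flags   = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
        .size    = sizeof(msg.payload.state),
        .payload.state = { .index = index, .num = enable ? 1u : 0u },
    };
    struct vhost_user_msg ack;

    if (write(sock, &msg, VHOST_USER_HDR_SIZE + msg.size) < 0) {
        return -1;
    }
    /* Blocking here is what closes the race: the vCPU's "queue_enable"
     * write only completes after the back-end processed the enable. */
    if (read(sock, &ack, sizeof(ack)) <= 0) {
        return -1;
    }
    return ack.payload.u64 == 0 ? 0 : -1;
}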
Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Michael S. Tsirkin
On Tue, Oct 03, 2023 at 09:08:15AM -0400, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin  wrote:
> >
> > On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> > > One more question:
> > >
> > > Why is the disabled state not needed by regular (non-vhost) virtio-net 
> > > devices?
> >
> > Tap does the same - it purges queued packets:
> >
> > int tap_disable(NetClientState *nc)
> > {
> > TAPState *s = DO_UPCAST(TAPState, nc, nc);
> > int ret;
> >
> > if (s->enabled == 0) {
> > return 0;
> > } else {
> > ret = tap_fd_disable(s->fd);
> > if (ret == 0) {
> > qemu_purge_queued_packets(nc);
> > s->enabled = false;
> > tap_update_fd_handler(s);
> > }
> > return ret;
> > }
> > }
> 
> tap_disable() is not equivalent to the vhost-user "started but
> disabled" ring state. tap_disable() is a synchronous one-time action,
> while "started but disabled" is a continuous state.

well, yes. but practically guests do not queue too many buffers
after disabling a queue. I don't know if they reliably don't
or it's racy and we didn't notice it yet - I think it
was mostly dpdk that had this and that's usually
used with vhost-user.

> The "started but disabled" ring state isn't needed to achieve this.
> The back-end can just drop tx buffers upon receiving
> VHOST_USER_SET_VRING_ENABLE .num=0.

yes, maybe that would have been a better way to do this.


> The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> introduced before the "started but disabled" state was defined,
> and it explicitly mentions tap attach/detach:
> 
> commit 7263a0ad784b719ebed736a1119cc2e08110
> Author: Changchun Ouyang 
> Date:   Wed Sep 23 12:20:01 2015 +0800
> 
> vhost-user: add a new message to disable/enable a specific virt queue.
> 
> Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> a specific virt queue, which is similar to attach/detach queue for
> tap device.
> 
> and then later:
> 
> commit c61f09ed855b5009f816242ce281fd01586d4646
> Author: Michael S. Tsirkin 
> Date:   Mon Nov 23 12:48:52 2015 +0200
> 
> vhost-user: clarify start and enable
> 
> >
> > what about non tap backends? I suspect they just aren't
> > used widely with multiqueue so no one noticed.
> 
> I still don't understand why "started but disabled" is needed instead
> of just two ring states: enabled and disabled.

With dropping packets when ring is disabled? Maybe that would
have been enough. I also failed to realize it's specific to
net, seemed generic to me :(

> It seems like the cleanest path going forward is to keep the "ignore
> rx, discard tx" semantics for virtio-net devices but to clarify in the
> spec that other device types do not process the ring:
> 
> "
> * started but disabled: the back-end must not process the ring. For legacy
>   reasons there is an exception for the networking device, where the
>   back-end must process and discard any TX packets and not process
>   other rings.
> "
> 
> What do you think?
> 
> Stefan

Okay... I hope we are not missing any devices which need virtio net
semantics. Care checking them all?

-- 
MST




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Michael S. Tsirkin
On Mon, Oct 02, 2023 at 08:17:07PM -0400, Stefan Hajnoczi wrote:
> On Mon, 2 Oct 2023 at 18:36, Michael S. Tsirkin  wrote:
> >
> > On Mon, Oct 02, 2023 at 05:12:27PM -0400, Stefan Hajnoczi wrote:
> > > On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin  wrote:
> > > >
> > > > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> > > > > >
> > > > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  
> > > > > > > wrote:
> > > > > > >>
> > > > > > >> (1) The virtio-1.0 specification
> > > > > > >>  
> > > > > > >> writes:
> > > > > > >>
> > > > > > >>> 3 General Initialization And Device Operation
> > > > > > >>> 3.1   Device Initialization
> > > > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>>
> > > > > > >>> 7. Perform device-specific setup, including discovery of 
> > > > > > >>> virtqueues for
> > > > > > >>>the device, optional per-bus setup, reading and possibly 
> > > > > > >>> writing the
> > > > > > >>>device’s virtio configuration space, and population of 
> > > > > > >>> virtqueues.
> > > > > > >>>
> > > > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is 
> > > > > > >>> “live”.
> > > > > > >>
> > > > > > >> and
> > > > > > >>
> > > > > > >>> 4 Virtio Transport Options
> > > > > > >>> 4.1   Virtio Over PCI Bus
> > > > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > > > >>> 4.1.4.3   Common configuration structure layout
> > > > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure 
> > > > > > >>> layout
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>>
> > > > > > >>> The driver MUST configure the other virtqueue fields before 
> > > > > > >>> enabling the
> > > > > > >>> virtqueue with queue_enable.
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>
> > > > > > >> These together mean that the following sub-sequence of steps is 
> > > > > > >> valid for
> > > > > > >> a virtio-1.0 guest driver:
> > > > > > >>
> > > > > > >> (1.1) set "queue_enable" for the needed queues as the final part 
> > > > > > >> of device
> > > > > > >> initialization step (7),
> > > > > > >>
> > > > > > >> (1.2) set DRIVER_OK in step (8),
> > > > > > >>
> > > > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > > > >>
> > > > > > >> (2) When vhost-user is enabled, and the 
> > > > > > >> VHOST_USER_F_PROTOCOL_FEATURES
> > > > > > >> special virtio feature is negotiated, then virtio rings start in 
> > > > > > >> disabled
> > > > > > >> state, according to
> > > > > > >> .
> > > > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are 
> > > > > > >> needed for
> > > > > > >> enabling vrings.
> > > > > > >>
> > > > > > >> Therefore setting "queue_enable" from the guest (1.1) is a 
> > > > > > >> *control plane*
> > > > > > >> operation, which travels from the guest through QEMU to the 
> > > > > > >> vhost-user
> > > > > > >> backend, using a unix domain socket.
> > > > > > >>
> > > > > > >> Whereas sending a virtio request (1.3) is a *data plane* 
> > > > > > >> operation, which
> > > > > > >> evades QEMU -- it travels from guest to the vhost-user backend 
> > > > > > >> via
> > > > > > >> eventfd.
> > > > > > >>
> > > > > > >> This means that steps (1.1) and (1.3) travel through different 
> > > > > > >> channels,
> > > > > > >> and their relative order can be reversed, as perceived by the 
> > > > > > >> vhost-user
> > > > > > >> backend.
> > > > > > >>
> > > > > > >> That's exactly what happens when OVMF's virtiofs driver 
> > > > > > >> (VirtioFsDxe) runs
> > > > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses 
> > > > > > >> version
> > > > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the 
> > > > > > >> vhost
> > > > > > >> crate.)
> > > > > > >>
> > > > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes 
> > > > > > >> through the
> > > > > > >> device initialization steps (i.e., control plane operations), and
> > > > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data 
> > > > > > >> plane
> > > > > > >> operation). In the Rust-language virtiofsd, this creates a race 
> > > > > > >> between
> > > > > > >> two components that run *concurrently*, i.e., in different 
> > > > > > >> threads or
> > > > > > >> processes:
> > > > > > >>
> > > > > > >> - Control plane, handling vhost-user protocol messages:
> > > > > > >>
> > > > > > >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > > > >>   [crates/vhost-user-backend/src/handler.rs] handles
> > > > > > >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > > > > > >> "enabled"
> > > > > > >>   flag according to the 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Laszlo Ersek
On 10/3/23 16:25, Michael S. Tsirkin wrote:
> On Tue, Oct 03, 2023 at 03:23:24PM +0200, Laszlo Ersek wrote:
>> On 10/3/23 15:08, Stefan Hajnoczi wrote:
>>> On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin  wrote:

 On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> One more question:
>
> Why is the disabled state not needed by regular (non-vhost) virtio-net 
> devices?

 Tap does the same - it purges queued packets:

 int tap_disable(NetClientState *nc)
 {
 TAPState *s = DO_UPCAST(TAPState, nc, nc);
 int ret;

 if (s->enabled == 0) {
 return 0;
 } else {
 ret = tap_fd_disable(s->fd);
 if (ret == 0) {
 qemu_purge_queued_packets(nc);
 s->enabled = false;
 tap_update_fd_handler(s);
 }
 return ret;
 }
 }
>>>
>>> tap_disable() is not equivalent to the vhost-user "started but
>>> disabled" ring state. tap_disable() is a synchronous one-time action,
>>> while "started but disabled" is a continuous state.
>>>
>>> The "started but disabled" ring state isn't needed to achieve this.
>>> The back-end can just drop tx buffers upon receiving
>>> VHOST_USER_SET_VRING_ENABLE .num=0.
>>>
>>> The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
>>> introduced before the "started but disabled" state was defined,
>>> and it explicitly mentions tap attach/detach:
>>>
>>> commit 7263a0ad784b719ebed736a1119cc2e08110
>>> Author: Changchun Ouyang 
>>> Date:   Wed Sep 23 12:20:01 2015 +0800
>>>
>>> vhost-user: add a new message to disable/enable a specific virt queue.
>>>
>>> Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
>>> a specific virt queue, which is similar to attach/detach queue for
>>> tap device.
>>>
>>> and then later:
>>>
>>> commit c61f09ed855b5009f816242ce281fd01586d4646
>>> Author: Michael S. Tsirkin 
>>> Date:   Mon Nov 23 12:48:52 2015 +0200
>>>
>>> vhost-user: clarify start and enable
>>>

 what about non tap backends? I suspect they just aren't
 used widely with multiqueue so no one noticed.
>>>
>>> I still don't understand why "started but disabled" is needed instead
>>> of just two ring states: enabled and disabled.
>>>
>>> It seems like the cleanest path going forward is to keep the "ignore
>>> rx, discard tx" semantics for virtio-net devices but to clarify in the
>>> spec that other device types do not process the ring:
>>>
>>> "
>>> * started but disabled: the back-end must not process the ring. For legacy
>>>   reasons there is an exception for the networking device, where the
>>>   back-end must process and discard any TX packets and not process
>>>   other rings.
>>> "
>>>
>>> What do you think?
>>
>> ... from a vhost-user backend perspective, won't this create a need for
>> all "ring processor" (~ virtio event loop) implementations to support
>> both methods? IIUC, the "virtio pop" is usually independent of the
>> particular device to which the requests are ultimately delivered. So the
>> event loop would have to grow a new parameter regarding "what to do in
>> the started-but-disabled state", the network device would have to pass
>> in one value (-> pop & drop), and all other devices would have to pass
>> in the other value (stop popping).
>>
>> ... I figure in rust-vmm/vhost it would affect the "handle_event"
>> function in "crates/vhost-user-backend/src/event_loop.rs".
>>
>> Do I understand right? (Not disagreeing, just pondering the impact on
>> backends.)
>>
>> Laszlo
> 
> Already the case I guess - RX ring is not processed, TX is. Right?
> 

Ah I see your point, this distinction must already exist in event loops.

But... as far as I can tell, it's not there in rust-vmm/vhost.

Laszlo




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Michael S. Tsirkin
On Tue, Oct 03, 2023 at 03:23:24PM +0200, Laszlo Ersek wrote:
> On 10/3/23 15:08, Stefan Hajnoczi wrote:
> > On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin  wrote:
> >>
> >> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> >>> One more question:
> >>>
> >>> Why is the disabled state not needed by regular (non-vhost) virtio-net 
> >>> devices?
> >>
> >> Tap does the same - it purges queued packets:
> >>
> >> int tap_disable(NetClientState *nc)
> >> {
> >> TAPState *s = DO_UPCAST(TAPState, nc, nc);
> >> int ret;
> >>
> >> if (s->enabled == 0) {
> >> return 0;
> >> } else {
> >> ret = tap_fd_disable(s->fd);
> >> if (ret == 0) {
> >> qemu_purge_queued_packets(nc);
> >> s->enabled = false;
> >> tap_update_fd_handler(s);
> >> }
> >> return ret;
> >> }
> >> }
> > 
> > tap_disable() is not equivalent to the vhost-user "started but
> > disabled" ring state. tap_disable() is a synchronous one-time action,
> > while "started but disabled" is a continuous state.
> > 
> > The "started but disabled" ring state isn't needed to achieve this.
> > The back-end can just drop tx buffers upon receiving
> > VHOST_USER_SET_VRING_ENABLE .num=0.
> > 
> > The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> > introduced before the "started but disabled" state was defined,
> > and it explicitly mentions tap attach/detach:
> > 
> > commit 7263a0ad784b719ebed736a1119cc2e08110
> > Author: Changchun Ouyang 
> > Date:   Wed Sep 23 12:20:01 2015 +0800
> > 
> > vhost-user: add a new message to disable/enable a specific virt queue.
> > 
> > Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> > a specific virt queue, which is similar to attach/detach queue for
> > tap device.
> > 
> > and then later:
> > 
> > commit c61f09ed855b5009f816242ce281fd01586d4646
> > Author: Michael S. Tsirkin 
> > Date:   Mon Nov 23 12:48:52 2015 +0200
> > 
> > vhost-user: clarify start and enable
> > 
> >>
> >> what about non tap backends? I suspect they just aren't
> >> used widely with multiqueue so no one noticed.
> > 
> > I still don't understand why "started but disabled" is needed instead
> > of just two ring states: enabled and disabled.
> > 
> > It seems like the cleanest path going forward is to keep the "ignore
> > rx, discard tx" semantics for virtio-net devices but to clarify in the
> > spec that other device types do not process the ring:
> > 
> > "
> > * started but disabled: the back-end must not process the ring. For legacy
> >   reasons there is an exception for the networking device, where the
> >   back-end must process and discard any TX packets and not process
> >   other rings.
> > "
> > 
> > What do you think?
> 
> ... from a vhost-user backend perspective, won't this create a need for
> all "ring processor" (~ virtio event loop) implementations to support
> both methods? IIUC, the "virtio pop" is usually independent of the
> particular device to which the requests are ultimately delivered. So the
> event loop would have to grow a new parameter regarding "what to do in
> the started-but-disabled state", the network device would have to pass
> in one value (-> pop & drop), and all other devices would have to pass
> in the other value (stop popping).
> 
> ... I figure in rust-vmm/vhost it would affect the "handle_event"
> function in "crates/vhost-user-backend/src/event_loop.rs".
> 
> Do I understand right? (Not disagreeing, just pondering the impact on
> backends.)
> 
> Laszlo

Already the case I guess - RX ring is not processed, TX is. Right?

-- 
MST




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Laszlo Ersek
On 10/3/23 15:08, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin  wrote:
>>
>> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
>>> One more question:
>>>
>>> Why is the disabled state not needed by regular (non-vhost) virtio-net 
>>> devices?
>>
>> Tap does the same - it purges queued packets:
>>
>> int tap_disable(NetClientState *nc)
>> {
>> TAPState *s = DO_UPCAST(TAPState, nc, nc);
>> int ret;
>>
>> if (s->enabled == 0) {
>> return 0;
>> } else {
>> ret = tap_fd_disable(s->fd);
>> if (ret == 0) {
>> qemu_purge_queued_packets(nc);
>> s->enabled = false;
>> tap_update_fd_handler(s);
>> }
>> return ret;
>> }
>> }
> 
> tap_disable() is not equivalent to the vhost-user "started but
> disabled" ring state. tap_disable() is a synchronous one-time action,
> while "started but disabled" is a continuous state.
> 
> The "started but disabled" ring state isn't needed to achieve this.
> The back-end can just drop tx buffers upon receiving
> VHOST_USER_SET_VRING_ENABLE .num=0.
> 
> The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> introduced before the "started but disabled" state was defined,
> and it explicitly mentions tap attach/detach:
> 
> commit 7263a0ad784b719ebed736a1119cc2e08110
> Author: Changchun Ouyang 
> Date:   Wed Sep 23 12:20:01 2015 +0800
> 
> vhost-user: add a new message to disable/enable a specific virt queue.
> 
> Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> a specific virt queue, which is similar to attach/detach queue for
> tap device.
> 
> and then later:
> 
> commit c61f09ed855b5009f816242ce281fd01586d4646
> Author: Michael S. Tsirkin 
> Date:   Mon Nov 23 12:48:52 2015 +0200
> 
> vhost-user: clarify start and enable
> 
>>
>> what about non tap backends? I suspect they just aren't
>> used widely with multiqueue so no one noticed.
> 
> I still don't understand why "started but disabled" is needed instead
> of just two ring states: enabled and disabled.
> 
> It seems like the cleanest path going forward is to keep the "ignore
> rx, discard tx" semantics for virtio-net devices but to clarify in the
> spec that other device types do not process the ring:
> 
> "
> * started but disabled: the back-end must not process the ring. For legacy
>   reasons there is an exception for the networking device, where the
>   back-end must process and discard any TX packets and not process
>   other rings.
> "
> 
> What do you think?

... from a vhost-user backend perspective, won't this create a need for
all "ring processor" (~ virtio event loop) implementations to support
both methods? IIUC, the "virtio pop" is usually independent of the
particular device to which the requests are ultimately delivered. So the
event loop would have to grow a new parameter regarding "what to do in
the started-but-disabled state", the network device would have to pass
in one value (-> pop & drop), and all other devices would have to pass
in the other value (stop popping).

... I figure in rust-vmm/vhost it would affect the "handle_event"
function in "crates/vhost-user-backend/src/event_loop.rs".

Do I understand right? (Not disagreeing, just pondering the impact on
backends.)

Laszlo
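As a thought experiment, the per-device knob Laszlo describes could look roughly like this in a generic ring handler (hypothetical C types and names; not rust-vmm/vhost's handle_event and not QEMU code):

/* Hypothetical sketch of a per-device "started but disabled" policy. */
#include <stdbool.h>

enum disabled_policy {
    DISABLED_STOP_POPPING,    /* generic devices: leave buffers on the ring */
    DISABLED_POP_AND_DROP,    /* legacy net TX behaviour: consume and discard */
};

struct vring_ctx {
    bool started;             /* ring set up and kicked off by the front-end */
    bool enabled;             /* last VHOST_USER_SET_VRING_ENABLE state */
    enum disabled_policy policy;
};

/* Called by the event loop when the kick eventfd fires. */
void handle_kick(struct vring_ctx *vq)
{
    if (!vq->started) {
        return;
    }
    if (!vq->enabled) {
        if (vq->policy == DISABLED_POP_AND_DROP) {
            /* pop available buffers and return them used, unprocessed:
               pop_and_discard_all(vq);  (placeholder) */
        }
        return;               /* DISABLED_STOP_POPPING: do not touch the ring */
    }
    /* started and enabled: process requests normally:
       process_available_buffers(vq);  (placeholder) */
}

Under the spec wording Stefan proposes, the networking device would register DISABLED_POP_AND_DROP for its TX rings only; every other device type (and net RX) would register DISABLED_STOP_POPPING.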




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Stefan Hajnoczi
On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin  wrote:
>
> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> > One more question:
> >
> > Why is the disabled state not needed by regular (non-vhost) virtio-net 
> > devices?
>
> Tap does the same - it purges queued packets:
>
> int tap_disable(NetClientState *nc)
> {
> TAPState *s = DO_UPCAST(TAPState, nc, nc);
> int ret;
>
> if (s->enabled == 0) {
> return 0;
> } else {
> ret = tap_fd_disable(s->fd);
> if (ret == 0) {
> qemu_purge_queued_packets(nc);
> s->enabled = false;
> tap_update_fd_handler(s);
> }
> return ret;
> }
> }

tap_disable() is not equivalent to the vhost-user "started but
disabled" ring state. tap_disable() is a synchronous one-time action,
while "started but disabled" is a continuous state.

The "started but disabled" ring state isn't needed to achieve this.
The back-end can just drop tx buffers upon receiving
VHOST_USER_SET_VRING_ENABLE .num=0.

The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
introduced before the "started but disabled" state was defined,
and it explicitly mentions tap attach/detach:

commit 7263a0ad784b719ebed736a1119cc2e08110
Author: Changchun Ouyang 
Date:   Wed Sep 23 12:20:01 2015 +0800

vhost-user: add a new message to disable/enable a specific virt queue.

Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
a specific virt queue, which is similar to attach/detach queue for
tap device.

and then later:

commit c61f09ed855b5009f816242ce281fd01586d4646
Author: Michael S. Tsirkin 
Date:   Mon Nov 23 12:48:52 2015 +0200

vhost-user: clarify start and enable

>
> what about non tap backends? I suspect they just aren't
> used widely with multiqueue so no one noticed.

I still don't understand why "started but disabled" is needed instead
of just two ring states: enabled and disabled.

It seems like the cleanest path going forward is to keep the "ignore
rx, discard tx" semantics for virtio-net devices but to clarify in the
spec that other device types do not process the ring:

"
* started but disabled: the back-end must not process the ring. For legacy
  reasons there is an exception for the networking device, where the
  back-end must process and discard any TX packets and not process
  other rings.
"

What do you think?

Stefan
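For comparison, the alternative mentioned above -- dropping whatever TX buffers are pending at the moment the ring is disabled, analogous to tap_disable()'s qemu_purge_queued_packets() -- would be a one-shot action in the back-end's SET_VRING_ENABLE handler rather than a persistent ring state. A rough sketch with hypothetical names (not an existing implementation):

/* Hypothetical back-end handler sketch for the "purge on disable" idea. */
#include <stdbool.h>
#include <stdint.h>

struct tx_vring {
    bool enabled;
};

/* Placeholder: return all currently available TX buffers as used without
 * transmitting them (the vhost-user analogue of purging queued packets). */
static void drop_pending_tx(struct tx_vring *vq)
{
    (void)vq;
}

int handle_set_vring_enable(struct tx_vring *vq, uint32_t num)
{
    if (num == 0 && vq->enabled) {
        /* One-time purge at the disable edge, instead of continuously
         * popping and discarding in a "started but disabled" state. */
        drop_pending_tx(vq);
    }
    vq->enabled = (num != 0);
    return 0;
}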



Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-03 Thread Michael S. Tsirkin
On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> One more question:
> 
> Why is the disabled state not needed by regular (non-vhost) virtio-net 
> devices?

Tap does the same - it purges queued packets:

int tap_disable(NetClientState *nc)
{
    TAPState *s = DO_UPCAST(TAPState, nc, nc);
    int ret;

    if (s->enabled == 0) {
        return 0;
    } else {
        ret = tap_fd_disable(s->fd);
        if (ret == 0) {
            qemu_purge_queued_packets(nc);
            s->enabled = false;
            tap_update_fd_handler(s);
        }
        return ret;
    }
}

what about non tap backends? I suspect they just aren't
used widely with multiqueue so no one noticed.

-- 
MST




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Stefan Hajnoczi
On Mon, 2 Oct 2023 at 18:36, Michael S. Tsirkin  wrote:
>
> On Mon, Oct 02, 2023 at 05:12:27PM -0400, Stefan Hajnoczi wrote:
> > On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> > > > >
> > > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  
> > > > > > wrote:
> > > > > >>
> > > > > >> (1) The virtio-1.0 specification
> > > > > >>  
> > > > > >> writes:
> > > > > >>
> > > > > >>> 3 General Initialization And Device Operation
> > > > > >>> 3.1   Device Initialization
> > > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > > >>>
> > > > > >>> [...]
> > > > > >>>
> > > > > >>> 7. Perform device-specific setup, including discovery of 
> > > > > >>> virtqueues for
> > > > > >>>the device, optional per-bus setup, reading and possibly 
> > > > > >>> writing the
> > > > > >>>device’s virtio configuration space, and population of 
> > > > > >>> virtqueues.
> > > > > >>>
> > > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is 
> > > > > >>> “live”.
> > > > > >>
> > > > > >> and
> > > > > >>
> > > > > >>> 4 Virtio Transport Options
> > > > > >>> 4.1   Virtio Over PCI Bus
> > > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > > >>> 4.1.4.3   Common configuration structure layout
> > > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure 
> > > > > >>> layout
> > > > > >>>
> > > > > >>> [...]
> > > > > >>>
> > > > > >>> The driver MUST configure the other virtqueue fields before 
> > > > > >>> enabling the
> > > > > >>> virtqueue with queue_enable.
> > > > > >>>
> > > > > >>> [...]
> > > > > >>
> > > > > >> These together mean that the following sub-sequence of steps is 
> > > > > >> valid for
> > > > > >> a virtio-1.0 guest driver:
> > > > > >>
> > > > > >> (1.1) set "queue_enable" for the needed queues as the final part 
> > > > > >> of device
> > > > > >> initialization step (7),
> > > > > >>
> > > > > >> (1.2) set DRIVER_OK in step (8),
> > > > > >>
> > > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > > >>
> > > > > >> (2) When vhost-user is enabled, and the 
> > > > > >> VHOST_USER_F_PROTOCOL_FEATURES
> > > > > >> special virtio feature is negotiated, then virtio rings start in 
> > > > > >> disabled
> > > > > >> state, according to
> > > > > >> .
> > > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are 
> > > > > >> needed for
> > > > > >> enabling vrings.
> > > > > >>
> > > > > >> Therefore setting "queue_enable" from the guest (1.1) is a 
> > > > > >> *control plane*
> > > > > >> operation, which travels from the guest through QEMU to the 
> > > > > >> vhost-user
> > > > > >> backend, using a unix domain socket.
> > > > > >>
> > > > > >> Whereas sending a virtio request (1.3) is a *data plane* 
> > > > > >> operation, which
> > > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > > >> eventfd.
> > > > > >>
> > > > > >> This means that steps (1.1) and (1.3) travel through different 
> > > > > >> channels,
> > > > > >> and their relative order can be reversed, as perceived by the 
> > > > > >> vhost-user
> > > > > >> backend.
> > > > > >>
> > > > > >> That's exactly what happens when OVMF's virtiofs driver 
> > > > > >> (VirtioFsDxe) runs
> > > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses 
> > > > > >> version
> > > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the 
> > > > > >> vhost
> > > > > >> crate.)
> > > > > >>
> > > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through 
> > > > > >> the
> > > > > >> device initialization steps (i.e., control plane operations), and
> > > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data 
> > > > > >> plane
> > > > > >> operation). In the Rust-language virtiofsd, this creates a race 
> > > > > >> between
> > > > > >> two components that run *concurrently*, i.e., in different threads 
> > > > > >> or
> > > > > >> processes:
> > > > > >>
> > > > > >> - Control plane, handling vhost-user protocol messages:
> > > > > >>
> > > > > >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > > >>   [crates/vhost-user-backend/src/handler.rs] handles
> > > > > >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > > > > >> "enabled"
> > > > > >>   flag according to the message processed.
> > > > > >>
> > > > > >> - Data plane, handling virtio / FUSE requests:
> > > > > >>
> > > > > >>   The "VringEpollHandler::handle_event" method
> > > > > >>   [crates/vhost-user-backend/src/event_loop.rs] handles the 
> > > > > >> incoming
> > > > > >>   virtio / FUSE request, consuming 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Michael S. Tsirkin
On Mon, Oct 02, 2023 at 05:12:27PM -0400, Stefan Hajnoczi wrote:
> On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin  wrote:
> >
> > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> > > >
> > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> > > > >>
> > > > >> (1) The virtio-1.0 specification
> > > > >>  
> > > > >> writes:
> > > > >>
> > > > >>> 3 General Initialization And Device Operation
> > > > >>> 3.1   Device Initialization
> > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> 7. Perform device-specific setup, including discovery of virtqueues 
> > > > >>> for
> > > > >>>the device, optional per-bus setup, reading and possibly writing 
> > > > >>> the
> > > > >>>device’s virtio configuration space, and population of 
> > > > >>> virtqueues.
> > > > >>>
> > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > > >>
> > > > >> and
> > > > >>
> > > > >>> 4 Virtio Transport Options
> > > > >>> 4.1   Virtio Over PCI Bus
> > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > >>> 4.1.4.3   Common configuration structure layout
> > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> The driver MUST configure the other virtqueue fields before 
> > > > >>> enabling the
> > > > >>> virtqueue with queue_enable.
> > > > >>>
> > > > >>> [...]
> > > > >>
> > > > >> These together mean that the following sub-sequence of steps is 
> > > > >> valid for
> > > > >> a virtio-1.0 guest driver:
> > > > >>
> > > > >> (1.1) set "queue_enable" for the needed queues as the final part of 
> > > > >> device
> > > > >> initialization step (7),
> > > > >>
> > > > >> (1.2) set DRIVER_OK in step (8),
> > > > >>
> > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > >>
> > > > >> (2) When vhost-user is enabled, and the 
> > > > >> VHOST_USER_F_PROTOCOL_FEATURES
> > > > >> special virtio feature is negotiated, then virtio rings start in 
> > > > >> disabled
> > > > >> state, according to
> > > > >> .
> > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are 
> > > > >> needed for
> > > > >> enabling vrings.
> > > > >>
> > > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control 
> > > > >> plane*
> > > > >> operation, which travels from the guest through QEMU to the 
> > > > >> vhost-user
> > > > >> backend, using a unix domain socket.
> > > > >>
> > > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, 
> > > > >> which
> > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > >> eventfd.
> > > > >>
> > > > >> This means that steps (1.1) and (1.3) travel through different 
> > > > >> channels,
> > > > >> and their relative order can be reversed, as perceived by the 
> > > > >> vhost-user
> > > > >> backend.
> > > > >>
> > > > >> That's exactly what happens when OVMF's virtiofs driver 
> > > > >> (VirtioFsDxe) runs
> > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses 
> > > > >> version
> > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the 
> > > > >> vhost
> > > > >> crate.)
> > > > >>
> > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > > >> device initialization steps (i.e., control plane operations), and
> > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data 
> > > > >> plane
> > > > >> operation). In the Rust-language virtiofsd, this creates a race 
> > > > >> between
> > > > >> two components that run *concurrently*, i.e., in different threads or
> > > > >> processes:
> > > > >>
> > > > >> - Control plane, handling vhost-user protocol messages:
> > > > >>
> > > > >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > >>   [crates/vhost-user-backend/src/handler.rs] handles
> > > > >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > > > >> "enabled"
> > > > >>   flag according to the message processed.
> > > > >>
> > > > >> - Data plane, handling virtio / FUSE requests:
> > > > >>
> > > > >>   The "VringEpollHandler::handle_event" method
> > > > >>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > > >>   virtio / FUSE request, consuming the virtio kick at the same time. 
> > > > >> If
> > > > >>   the vring's "enabled" flag is set, the virtio / FUSE request is
> > > > >>   processed genuinely. If the vring's "enabled" flag is clear, then 
> > > > >> the
> > > > >>   virtio / FUSE request is discarded.
> > > > >
> > > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > > 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Stefan Hajnoczi
One more question:

Why is the disabled state not needed by regular (non-vhost) virtio-net devices?

On Mon, 2 Oct 2023 at 17:12, Stefan Hajnoczi  wrote:
>
> On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin  wrote:
> >
> > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> > > >
> > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> > > > >>
> > > > >> (1) The virtio-1.0 specification
> > > > >>  
> > > > >> writes:
> > > > >>
> > > > >>> 3 General Initialization And Device Operation
> > > > >>> 3.1   Device Initialization
> > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> 7. Perform device-specific setup, including discovery of virtqueues 
> > > > >>> for
> > > > >>>the device, optional per-bus setup, reading and possibly writing 
> > > > >>> the
> > > > >>>device’s virtio configuration space, and population of 
> > > > >>> virtqueues.
> > > > >>>
> > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > > >>
> > > > >> and
> > > > >>
> > > > >>> 4 Virtio Transport Options
> > > > >>> 4.1   Virtio Over PCI Bus
> > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > >>> 4.1.4.3   Common configuration structure layout
> > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> The driver MUST configure the other virtqueue fields before 
> > > > >>> enabling the
> > > > >>> virtqueue with queue_enable.
> > > > >>>
> > > > >>> [...]
> > > > >>
> > > > >> These together mean that the following sub-sequence of steps is 
> > > > >> valid for
> > > > >> a virtio-1.0 guest driver:
> > > > >>
> > > > >> (1.1) set "queue_enable" for the needed queues as the final part of 
> > > > >> device
> > > > >> initialization step (7),
> > > > >>
> > > > >> (1.2) set DRIVER_OK in step (8),
> > > > >>
> > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > >>
> > > > >> (2) When vhost-user is enabled, and the 
> > > > >> VHOST_USER_F_PROTOCOL_FEATURES
> > > > >> special virtio feature is negotiated, then virtio rings start in 
> > > > >> disabled
> > > > >> state, according to
> > > > >> .
> > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are 
> > > > >> needed for
> > > > >> enabling vrings.
> > > > >>
> > > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control 
> > > > >> plane*
> > > > >> operation, which travels from the guest through QEMU to the 
> > > > >> vhost-user
> > > > >> backend, using a unix domain socket.
> > > > >>
> > > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, 
> > > > >> which
> > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > >> eventfd.
> > > > >>
> > > > >> This means that steps (1.1) and (1.3) travel through different 
> > > > >> channels,
> > > > >> and their relative order can be reversed, as perceived by the 
> > > > >> vhost-user
> > > > >> backend.
> > > > >>
> > > > >> That's exactly what happens when OVMF's virtiofs driver 
> > > > >> (VirtioFsDxe) runs
> > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses 
> > > > >> version
> > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the 
> > > > >> vhost
> > > > >> crate.)
> > > > >>
> > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > > >> device initialization steps (i.e., control plane operations), and
> > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data 
> > > > >> plane
> > > > >> operation). In the Rust-language virtiofsd, this creates a race 
> > > > >> between
> > > > >> two components that run *concurrently*, i.e., in different threads or
> > > > >> processes:
> > > > >>
> > > > >> - Control plane, handling vhost-user protocol messages:
> > > > >>
> > > > >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > >>   [crates/vhost-user-backend/src/handler.rs] handles
> > > > >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > > > >> "enabled"
> > > > >>   flag according to the message processed.
> > > > >>
> > > > >> - Data plane, handling virtio / FUSE requests:
> > > > >>
> > > > >>   The "VringEpollHandler::handle_event" method
> > > > >>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > > >>   virtio / FUSE request, consuming the virtio kick at the same time. 
> > > > >> If
> > > > >>   the vring's "enabled" flag is set, the virtio / FUSE request is
> > > > >>   processed genuinely. If the vring's "enabled" flag is clear, then 
> > > > >> the
> > > > >>   virtio / FUSE request is discarded.
> > > > >

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Stefan Hajnoczi
On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin  wrote:
>
> On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> > >
> > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> > > >>
> > > >> (1) The virtio-1.0 specification
> > > >>  
> > > >> writes:
> > > >>
> > > >>> 3 General Initialization And Device Operation
> > > >>> 3.1   Device Initialization
> > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > >>>
> > > >>> [...]
> > > >>>
> > > >>> 7. Perform device-specific setup, including discovery of virtqueues 
> > > >>> for
> > > >>>the device, optional per-bus setup, reading and possibly writing 
> > > >>> the
> > > >>>device’s virtio configuration space, and population of virtqueues.
> > > >>>
> > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > >>
> > > >> and
> > > >>
> > > >>> 4 Virtio Transport Options
> > > >>> 4.1   Virtio Over PCI Bus
> > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > >>> 4.1.4.3   Common configuration structure layout
> > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > >>>
> > > >>> [...]
> > > >>>
> > > >>> The driver MUST configure the other virtqueue fields before enabling 
> > > >>> the
> > > >>> virtqueue with queue_enable.
> > > >>>
> > > >>> [...]
> > > >>
> > > >> These together mean that the following sub-sequence of steps is valid 
> > > >> for
> > > >> a virtio-1.0 guest driver:
> > > >>
> > > >> (1.1) set "queue_enable" for the needed queues as the final part of 
> > > >> device
> > > >> initialization step (7),
> > > >>
> > > >> (1.2) set DRIVER_OK in step (8),
> > > >>
> > > >> (1.3) immediately start sending virtio requests to the device.
> > > >>
> > > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > >> special virtio feature is negotiated, then virtio rings start in 
> > > >> disabled
> > > >> state, according to
> > > >> .
> > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed 
> > > >> for
> > > >> enabling vrings.
> > > >>
> > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control 
> > > >> plane*
> > > >> operation, which travels from the guest through QEMU to the vhost-user
> > > >> backend, using a unix domain socket.
> > > >>
> > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, 
> > > >> which
> > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > >> eventfd.
> > > >>
> > > >> This means that steps (1.1) and (1.3) travel through different 
> > > >> channels,
> > > >> and their relative order can be reversed, as perceived by the 
> > > >> vhost-user
> > > >> backend.
> > > >>
> > > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) 
> > > >> runs
> > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > >> crate.)
> > > >>
> > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > >> device initialization steps (i.e., control plane operations), and
> > > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > >> operation). In the Rust-language virtiofsd, this creates a race between
> > > >> two components that run *concurrently*, i.e., in different threads or
> > > >> processes:
> > > >>
> > > >> - Control plane, handling vhost-user protocol messages:
> > > >>
> > > >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > >>   [crates/vhost-user-backend/src/handler.rs] handles
> > > >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > > >> "enabled"
> > > >>   flag according to the message processed.
> > > >>
> > > >> - Data plane, handling virtio / FUSE requests:
> > > >>
> > > >>   The "VringEpollHandler::handle_event" method
> > > >>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > >>   virtio / FUSE request, consuming the virtio kick at the same time. If
> > > >>   the vring's "enabled" flag is set, the virtio / FUSE request is
> > > >>   processed genuinely. If the vring's "enabled" flag is clear, then the
> > > >>   virtio / FUSE request is discarded.
> > > >
> > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > while it's disabled?
> > >
> > > That's what the vhost-user spec requires:
> > >
> > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > >
> > > """
> > > started but disabled: the back-end must process the ring without causing
> > > any side effects. For example, for a networking device, in the disabled
> > > state the back-end must not supply any new RX packets, but 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Laszlo Ersek
On 10/2/23 08:57, Michael S. Tsirkin wrote:
> On Mon, Oct 02, 2023 at 03:56:03AM +0200, Laszlo Ersek wrote:
>> On 10/1/23 21:25, Michael S. Tsirkin wrote:
>>> Not this actually - v2 of this.
>>
>> Thank you, but:
>>
>> - Stefan's question should be answered still IMO (although if you pick
>> up this series, then that could be interpreted as "QEMU bug, not spec bug")
>>
>> - I was supposed to update the commit message on 7/7 in v3; I didn't
>> want to do it before Stefan's question was answered
>>
>> Thanks!
>> Laszlo
> 
> OK I just answered. I am fine with the patch but I think
> virtiofsd should be fixed too.

Thanks. I'll prepare a v3 with an updated commit message on 7/7 (plus
picking up any new feedback tags).

Cheers
Laszlo




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Michael S. Tsirkin
On Mon, Oct 02, 2023 at 03:56:03AM +0200, Laszlo Ersek wrote:
> On 10/1/23 21:25, Michael S. Tsirkin wrote:
> > Not this actually - v2 of this.
> 
> Thank you, but:
> 
> - Stefan's question should be answered still IMO (although if you pick
> up this series, then that could be interpreted as "QEMU bug, not spec bug")
> 
> - I was supposed to update the commit message on 7/7 in v3; I didn't
> want to do it before Stefan's question was answered
> 
> Thanks!
> Laszlo

OK I just answered. I am fine with the patch but I think
virtiofsd should be fixed too.


-- 
MST




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-02 Thread Michael S. Tsirkin
On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> >
> > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> > >>
> > >> (1) The virtio-1.0 specification
> > >>  writes:
> > >>
> > >>> 3 General Initialization And Device Operation
> > >>> 3.1   Device Initialization
> > >>> 3.1.1 Driver Requirements: Device Initialization
> > >>>
> > >>> [...]
> > >>>
> > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > >>>the device, optional per-bus setup, reading and possibly writing the
> > >>>device’s virtio configuration space, and population of virtqueues.
> > >>>
> > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > >>
> > >> and
> > >>
> > >>> 4 Virtio Transport Options
> > >>> 4.1   Virtio Over PCI Bus
> > >>> 4.1.4 Virtio Structure PCI Capabilities
> > >>> 4.1.4.3   Common configuration structure layout
> > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > >>>
> > >>> [...]
> > >>>
> > >>> The driver MUST configure the other virtqueue fields before enabling the
> > >>> virtqueue with queue_enable.
> > >>>
> > >>> [...]
> > >>
> > >> These together mean that the following sub-sequence of steps is valid for
> > >> a virtio-1.0 guest driver:
> > >>
> > >> (1.1) set "queue_enable" for the needed queues as the final part of 
> > >> device
> > >> initialization step (7),
> > >>
> > >> (1.2) set DRIVER_OK in step (8),
> > >>
> > >> (1.3) immediately start sending virtio requests to the device.
> > >>
> > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > >> special virtio feature is negotiated, then virtio rings start in disabled
> > >> state, according to
> > >> .
> > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed 
> > >> for
> > >> enabling vrings.
> > >>
> > >> Therefore setting "queue_enable" from the guest (1.1) is a *control 
> > >> plane*
> > >> operation, which travels from the guest through QEMU to the vhost-user
> > >> backend, using a unix domain socket.
> > >>
> > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > >> eventfd.
> > >>
> > >> This means that steps (1.1) and (1.3) travel through different channels,
> > >> and their relative order can be reversed, as perceived by the vhost-user
> > >> backend.
> > >>
> > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) 
> > >> runs
> > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > >> crate.)
> > >>
> > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > >> device initialization steps (i.e., control plane operations), and
> > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > >> operation). In the Rust-language virtiofsd, this creates a race between
> > >> two components that run *concurrently*, i.e., in different threads or
> > >> processes:
> > >>
> > >> - Control plane, handling vhost-user protocol messages:
> > >>
> > >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > >>   [crates/vhost-user-backend/src/handler.rs] handles
> > >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > >> "enabled"
> > >>   flag according to the message processed.
> > >>
> > >> - Data plane, handling virtio / FUSE requests:
> > >>
> > >>   The "VringEpollHandler::handle_event" method
> > >>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > >>   virtio / FUSE request, consuming the virtio kick at the same time. If
> > >>   the vring's "enabled" flag is set, the virtio / FUSE request is
> > >>   processed genuinely. If the vring's "enabled" flag is clear, then the
> > >>   virtio / FUSE request is discarded.
> > >
> > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > while it's disabled?
> >
> > That's what the vhost-user spec requires:
> >
> > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> >
> > """
> > started but disabled: the back-end must process the ring without causing
> > any side effects. For example, for a networking device, in the disabled
> > state the back-end must not supply any new RX packets, but must process
> > and discard any TX packets.
> > """
> >
> > This state is different from "stopped", where "the back-end must not
> > process the ring at all".
> >
> > The spec also says,
> >
> > """
> > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > initialized in a disabled state and is enabled by
> > VHOST_USER_SET_VRING_ENABLE with parameter 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-01 Thread Laszlo Ersek
On 10/1/23 21:25, Michael S. Tsirkin wrote:
> Not this actually - v2 of this.

Thank you, but:

- Stefan's question should be answered still IMO (although if you pick
up this series, then that could be interpreted as "QEMU bug, not spec bug")

- I was supposed to update the commit message on 7/7 in v3; I didn't
want to do it before Stefan's question was answered

Thanks!
Laszlo




Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-01 Thread Michael S. Tsirkin
Not this actually - v2 of this.

On Sun, Oct 01, 2023 at 03:24:59PM -0400, Michael S. Tsirkin wrote:
> yes sorry - I am working on a pull request with this
> included.
> 
> On Mon, Sep 25, 2023 at 05:31:17PM +0200, Laszlo Ersek wrote:
> > Ping -- Michael, any comments please? This set (now at v2) has been
> > waiting on your answer since Aug 30th.
> > 
> > Laszlo
> > 
> > On 9/5/23 08:30, Laszlo Ersek wrote:
> > > Michael,
> > > 
> > > On 8/30/23 17:37, Stefan Hajnoczi wrote:
> > >> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> > >>>
> > >>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
> >  On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> > >
> > > (1) The virtio-1.0 specification
> > >  
> > > writes:
> > >
> > >> 3 General Initialization And Device Operation
> > >> 3.1   Device Initialization
> > >> 3.1.1 Driver Requirements: Device Initialization
> > >>
> > >> [...]
> > >>
> > >> 7. Perform device-specific setup, including discovery of virtqueues 
> > >> for
> > >>the device, optional per-bus setup, reading and possibly writing 
> > >> the
> > >>device’s virtio configuration space, and population of virtqueues.
> > >>
> > >> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > >
> > > and
> > >
> > >> 4 Virtio Transport Options
> > >> 4.1   Virtio Over PCI Bus
> > >> 4.1.4 Virtio Structure PCI Capabilities
> > >> 4.1.4.3   Common configuration structure layout
> > >> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > >>
> > >> [...]
> > >>
> > >> The driver MUST configure the other virtqueue fields before enabling 
> > >> the
> > >> virtqueue with queue_enable.
> > >>
> > >> [...]
> > >
> > > These together mean that the following sub-sequence of steps is valid 
> > > for
> > > a virtio-1.0 guest driver:
> > >
> > > (1.1) set "queue_enable" for the needed queues as the final part of 
> > > device
> > > initialization step (7),
> > >
> > > (1.2) set DRIVER_OK in step (8),
> > >
> > > (1.3) immediately start sending virtio requests to the device.
> > >
> > > (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > special virtio feature is negotiated, then virtio rings start in 
> > > disabled
> > > state, according to
> > > .
> > > In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are 
> > > needed for
> > > enabling vrings.
> > >
> > > Therefore setting "queue_enable" from the guest (1.1) is a *control 
> > > plane*
> > > operation, which travels from the guest through QEMU to the vhost-user
> > > backend, using a unix domain socket.
> > >
> > > Whereas sending a virtio request (1.3) is a *data plane* operation, 
> > > which
> > > evades QEMU -- it travels from guest to the vhost-user backend via
> > > eventfd.
> > >
> > > This means that steps (1.1) and (1.3) travel through different 
> > > channels,
> > > and their relative order can be reversed, as perceived by the 
> > > vhost-user
> > > backend.
> > >
> > > That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) 
> > > runs
> > > against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > crate.)
> > >
> > > Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > device initialization steps (i.e., control plane operations), and
> > > immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > operation). In the Rust-language virtiofsd, this creates a race 
> > > between
> > > two components that run *concurrently*, i.e., in different threads or
> > > processes:
> > >
> > > - Control plane, handling vhost-user protocol messages:
> > >
> > >   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > >   [crates/vhost-user-backend/src/handler.rs] handles
> > >   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > > "enabled"
> > >   flag according to the message processed.
> > >
> > > - Data plane, handling virtio / FUSE requests:
> > >
> > >   The "VringEpollHandler::handle_event" method
> > >   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > >   virtio / FUSE request, consuming the virtio kick at the same time. 
> > > If
> > >   the vring's "enabled" flag is set, the virtio / FUSE request is
> > >   processed genuinely. If the vring's "enabled" flag is clear, then 
> > > the
> > >   virtio / FUSE request is 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-10-01 Thread Michael S. Tsirkin
yes sorry - I am working on a pull request with this
included.

On Mon, Sep 25, 2023 at 05:31:17PM +0200, Laszlo Ersek wrote:
> Ping -- Michael, any comments please? This set (now at v2) has been
> waiting on your answer since Aug 30th.
> 
> Laszlo
> 
> On 9/5/23 08:30, Laszlo Ersek wrote:
> > Michael,
> > 
> > On 8/30/23 17:37, Stefan Hajnoczi wrote:
> >> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
> >>>
> >>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
>  On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> >
> > (1) The virtio-1.0 specification
> >  writes:
> >
> >> 3 General Initialization And Device Operation
> >> 3.1   Device Initialization
> >> 3.1.1 Driver Requirements: Device Initialization
> >>
> >> [...]
> >>
> >> 7. Perform device-specific setup, including discovery of virtqueues for
> >>the device, optional per-bus setup, reading and possibly writing the
> >>device’s virtio configuration space, and population of virtqueues.
> >>
> >> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >
> > and
> >
> >> 4 Virtio Transport Options
> >> 4.1   Virtio Over PCI Bus
> >> 4.1.4 Virtio Structure PCI Capabilities
> >> 4.1.4.3   Common configuration structure layout
> >> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >>
> >> [...]
> >>
> >> The driver MUST configure the other virtqueue fields before enabling 
> >> the
> >> virtqueue with queue_enable.
> >>
> >> [...]
> >
> > These together mean that the following sub-sequence of steps is valid 
> > for
> > a virtio-1.0 guest driver:
> >
> > (1.1) set "queue_enable" for the needed queues as the final part of 
> > device
> > initialization step (7),
> >
> > (1.2) set DRIVER_OK in step (8),
> >
> > (1.3) immediately start sending virtio requests to the device.
> >
> > (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > special virtio feature is negotiated, then virtio rings start in 
> > disabled
> > state, according to
> > .
> > In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed 
> > for
> > enabling vrings.
> >
> > Therefore setting "queue_enable" from the guest (1.1) is a *control 
> > plane*
> > operation, which travels from the guest through QEMU to the vhost-user
> > backend, using a unix domain socket.
> >
> > Whereas sending a virtio request (1.3) is a *data plane* operation, 
> > which
> > evades QEMU -- it travels from guest to the vhost-user backend via
> > eventfd.
> >
> > This means that steps (1.1) and (1.3) travel through different channels,
> > and their relative order can be reversed, as perceived by the vhost-user
> > backend.
> >
> > That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) 
> > runs
> > against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > crate.)
> >
> > Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > device initialization steps (i.e., control plane operations), and
> > immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > operation). In the Rust-language virtiofsd, this creates a race between
> > two components that run *concurrently*, i.e., in different threads or
> > processes:
> >
> > - Control plane, handling vhost-user protocol messages:
> >
> >   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >   [crates/vhost-user-backend/src/handler.rs] handles
> >   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's 
> > "enabled"
> >   flag according to the message processed.
> >
> > - Data plane, handling virtio / FUSE requests:
> >
> >   The "VringEpollHandler::handle_event" method
> >   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >   virtio / FUSE request, consuming the virtio kick at the same time. If
> >   the vring's "enabled" flag is set, the virtio / FUSE request is
> >   processed genuinely. If the vring's "enabled" flag is clear, then the
> >   virtio / FUSE request is discarded.
> 
>  Why is virtiofsd monitoring the virtqueue and discarding requests
>  while it's disabled?
> >>>
> >>> That's what the vhost-user spec requires:
> >>>
> >>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> >>>
> >>> """
> >>> started but disabled: the back-end must process the ring without causing
> >>> any side effects. For example, for a networking device, in the 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-09-25 Thread Laszlo Ersek
Ping -- Michael, any comments please? This set (now at v2) has been
waiting on your answer since Aug 30th.

Laszlo

On 9/5/23 08:30, Laszlo Ersek wrote:
> Michael,
> 
> On 8/30/23 17:37, Stefan Hajnoczi wrote:
>> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
>>>
>>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
 On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
>
> (1) The virtio-1.0 specification
>  writes:
>
>> 3 General Initialization And Device Operation
>> 3.1   Device Initialization
>> 3.1.1 Driver Requirements: Device Initialization
>>
>> [...]
>>
>> 7. Perform device-specific setup, including discovery of virtqueues for
>>the device, optional per-bus setup, reading and possibly writing the
>>device’s virtio configuration space, and population of virtqueues.
>>
>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>
> and
>
>> 4 Virtio Transport Options
>> 4.1   Virtio Over PCI Bus
>> 4.1.4 Virtio Structure PCI Capabilities
>> 4.1.4.3   Common configuration structure layout
>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>
>> [...]
>>
>> The driver MUST configure the other virtqueue fields before enabling the
>> virtqueue with queue_enable.
>>
>> [...]
>
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
>
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
>
> (1.2) set DRIVER_OK in step (8),
>
> (1.3) immediately start sending virtio requests to the device.
>
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> .
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
>
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
>
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
>
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
>
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
>
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
>
> - Control plane, handling vhost-user protocol messages:
>
>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>   [crates/vhost-user-backend/src/handler.rs] handles
>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>   flag according to the message processed.
>
> - Data plane, handling virtio / FUSE requests:
>
>   The "VringEpollHandler::handle_event" method
>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>   virtio / FUSE request, consuming the virtio kick at the same time. If
>   the vring's "enabled" flag is set, the virtio / FUSE request is
>   processed genuinely. If the vring's "enabled" flag is clear, then the
>   virtio / FUSE request is discarded.

 Why is virtiofsd monitoring the virtqueue and discarding requests
 while it's disabled?
>>>
>>> That's what the vhost-user spec requires:
>>>
>>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
>>>
>>> """
>>> started but disabled: the back-end must process the ring without causing
>>> any side effects. For example, for a networking device, in the disabled
>>> state the back-end must not supply any new RX packets, but must process
>>> and discard any TX packets.
>>> """
>>>
>>> This state is different from "stopped", where "the back-end must not
>>> process the ring at all".
>>>
>>> The spec also says,
>>>
>>> """
>>> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
>>> initialized in a disabled state and is enabled by
>>> VHOST_USER_SET_VRING_ENABLE 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-09-05 Thread Laszlo Ersek
Michael,

On 8/30/23 17:37, Stefan Hajnoczi wrote:
> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
>>
>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
>>> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:

 (1) The virtio-1.0 specification
  writes:

> 3 General Initialization And Device Operation
> 3.1   Device Initialization
> 3.1.1 Driver Requirements: Device Initialization
>
> [...]
>
> 7. Perform device-specific setup, including discovery of virtqueues for
>the device, optional per-bus setup, reading and possibly writing the
>device’s virtio configuration space, and population of virtqueues.
>
> 8. Set the DRIVER_OK status bit. At this point the device is “live”.

 and

> 4 Virtio Transport Options
> 4.1   Virtio Over PCI Bus
> 4.1.4 Virtio Structure PCI Capabilities
> 4.1.4.3   Common configuration structure layout
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>
> [...]
>
> The driver MUST configure the other virtqueue fields before enabling the
> virtqueue with queue_enable.
>
> [...]

 These together mean that the following sub-sequence of steps is valid for
 a virtio-1.0 guest driver:

 (1.1) set "queue_enable" for the needed queues as the final part of device
 initialization step (7),

 (1.2) set DRIVER_OK in step (8),

 (1.3) immediately start sending virtio requests to the device.

 (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
 special virtio feature is negotiated, then virtio rings start in disabled
 state, according to
 .
 In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
 enabling vrings.

 Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
 operation, which travels from the guest through QEMU to the vhost-user
 backend, using a unix domain socket.

 Whereas sending a virtio request (1.3) is a *data plane* operation, which
 evades QEMU -- it travels from guest to the vhost-user backend via
 eventfd.

 This means that steps (1.1) and (1.3) travel through different channels,
 and their relative order can be reversed, as perceived by the vhost-user
 backend.

 That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
 against the Rust-language virtiofsd version 1.7.2. (Which uses version
 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
 crate.)

 Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
 device initialization steps (i.e., control plane operations), and
 immediately sends a FUSE_INIT request too (i.e., performs a data plane
 operation). In the Rust-language virtiofsd, this creates a race between
 two components that run *concurrently*, i.e., in different threads or
 processes:

 - Control plane, handling vhost-user protocol messages:

   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
   [crates/vhost-user-backend/src/handler.rs] handles
   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
   flag according to the message processed.

 - Data plane, handling virtio / FUSE requests:

   The "VringEpollHandler::handle_event" method
   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
   virtio / FUSE request, consuming the virtio kick at the same time. If
   the vring's "enabled" flag is set, the virtio / FUSE request is
   processed genuinely. If the vring's "enabled" flag is clear, then the
   virtio / FUSE request is discarded.
>>>
>>> Why is virtiofsd monitoring the virtqueue and discarding requests
>>> while it's disabled?
>>
>> That's what the vhost-user spec requires:
>>
>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
>>
>> """
>> started but disabled: the back-end must process the ring without causing
>> any side effects. For example, for a networking device, in the disabled
>> state the back-end must not supply any new RX packets, but must process
>> and discard any TX packets.
>> """
>>
>> This state is different from "stopped", where "the back-end must not
>> process the ring at all".
>>
>> The spec also says,
>>
>> """
>> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
>> initialized in a disabled state and is enabled by
>> VHOST_USER_SET_VRING_ENABLE with parameter 1.
>> """
>>
>> AFAICT virtiofsd follows this requirement.
> 
> Hi Michael,
> You documented the disabled ring state in QEMU commit commit
> c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> and enable") where virtio-net devices discard 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Stefan Hajnoczi
On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek  wrote:
>
> On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
> >>
> >> (1) The virtio-1.0 specification
> >>  writes:
> >>
> >>> 3 General Initialization And Device Operation
> >>> 3.1   Device Initialization
> >>> 3.1.1 Driver Requirements: Device Initialization
> >>>
> >>> [...]
> >>>
> >>> 7. Perform device-specific setup, including discovery of virtqueues for
> >>>the device, optional per-bus setup, reading and possibly writing the
> >>>device’s virtio configuration space, and population of virtqueues.
> >>>
> >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >>
> >> and
> >>
> >>> 4 Virtio Transport Options
> >>> 4.1   Virtio Over PCI Bus
> >>> 4.1.4 Virtio Structure PCI Capabilities
> >>> 4.1.4.3   Common configuration structure layout
> >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >>>
> >>> [...]
> >>>
> >>> The driver MUST configure the other virtqueue fields before enabling the
> >>> virtqueue with queue_enable.
> >>>
> >>> [...]
> >>
> >> These together mean that the following sub-sequence of steps is valid for
> >> a virtio-1.0 guest driver:
> >>
> >> (1.1) set "queue_enable" for the needed queues as the final part of device
> >> initialization step (7),
> >>
> >> (1.2) set DRIVER_OK in step (8),
> >>
> >> (1.3) immediately start sending virtio requests to the device.
> >>
> >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> >> special virtio feature is negotiated, then virtio rings start in disabled
> >> state, according to
> >> .
> >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> >> enabling vrings.
> >>
> >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> >> operation, which travels from the guest through QEMU to the vhost-user
> >> backend, using a unix domain socket.
> >>
> >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> >> evades QEMU -- it travels from guest to the vhost-user backend via
> >> eventfd.
> >>
> >> This means that steps (1.1) and (1.3) travel through different channels,
> >> and their relative order can be reversed, as perceived by the vhost-user
> >> backend.
> >>
> >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> >> crate.)
> >>
> >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> >> device initialization steps (i.e., control plane operations), and
> >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> >> operation). In the Rust-language virtiofsd, this creates a race between
> >> two components that run *concurrently*, i.e., in different threads or
> >> processes:
> >>
> >> - Control plane, handling vhost-user protocol messages:
> >>
> >>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >>   [crates/vhost-user-backend/src/handler.rs] handles
> >>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> >>   flag according to the message processed.
> >>
> >> - Data plane, handling virtio / FUSE requests:
> >>
> >>   The "VringEpollHandler::handle_event" method
> >>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >>   virtio / FUSE request, consuming the virtio kick at the same time. If
> >>   the vring's "enabled" flag is set, the virtio / FUSE request is
> >>   processed genuinely. If the vring's "enabled" flag is clear, then the
> >>   virtio / FUSE request is discarded.
> >
> > Why is virtiofsd monitoring the virtqueue and discarding requests
> > while it's disabled?
>
> That's what the vhost-user spec requires:
>
> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
>
> """
> started but disabled: the back-end must process the ring without causing
> any side effects. For example, for a networking device, in the disabled
> state the back-end must not supply any new RX packets, but must process
> and discard any TX packets.
> """
>
> This state is different from "stopped", where "the back-end must not
> process the ring at all".
>
> The spec also says,
>
> """
> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> initialized in a disabled state and is enabled by
> VHOST_USER_SET_VRING_ENABLE with parameter 1.
> """
>
> AFAICT virtiofsd follows this requirement.

Hi Michael,
You documented the disabled ring state in QEMU commit commit
c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
and enable") where virtio-net devices discard tx buffers. The disabled
state seems to be specific to vhost-user and not covered in the 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Stefano Garzarella

On Wed, Aug 30, 2023 at 11:26:41AM +0200, Laszlo Ersek wrote:

On 8/30/23 10:39, Stefano Garzarella wrote:

On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:

(1) The virtio-1.0 specification
 writes:


What about referring the latest spec available now (1.2)?


I didn't want to do that because the OVMF guest driver was written
against 1.0 (and the spec and the device are backwards compatible).

But, I don't feel strongly about this; I'm OK updating the reference /
quote to 1.2.






3 General Initialization And Device Operation
3.1   Device Initialization
3.1.1 Driver Requirements: Device Initialization

[...]

7. Perform device-specific setup, including discovery of virtqueues for
   the device, optional per-bus setup, reading and possibly writing the
   device’s virtio configuration space, and population of virtqueues.

8. Set the DRIVER_OK status bit. At this point the device is “live”.


and


4 Virtio Transport Options
4.1   Virtio Over PCI Bus
4.1.4 Virtio Structure PCI Capabilities
4.1.4.3   Common configuration structure layout
4.1.4.3.2 Driver Requirements: Common configuration structure layout

[...]

The driver MUST configure the other virtqueue fields before enabling the
virtqueue with queue_enable.

[...]


These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:

(1.1) set "queue_enable" for the needed queues as the final part of
device
initialization step (7),

(1.2) set DRIVER_OK in step (8),

(1.3) immediately start sending virtio requests to the device.

(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed
for
enabling vrings.

Therefore setting "queue_enable" from the guest (1.1) is a *control
plane*
operation, which travels from the guest through QEMU to the vhost-user
backend, using a unix domain socket.

Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.

This means that steps (1.1) and (1.3) travel through different channels,
and their relative order can be reversed, as perceived by the vhost-user
backend.

That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe)
runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)

Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:

- Control plane, handling vhost-user protocol messages:

 The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
 [crates/vhost-user-backend/src/handler.rs] handles
 VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
 flag according to the message processed.

- Data plane, handling virtio / FUSE requests:

 The "VringEpollHandler::handle_event" method
 [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
 virtio / FUSE request, consuming the virtio kick at the same time. If
 the vring's "enabled" flag is set, the virtio / FUSE request is
 processed genuinely. If the vring's "enabled" flag is clear, then the
 virtio / FUSE request is discarded.
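
(As an illustration of the race between the two components above, here is a
minimal, self-contained Rust sketch; the names are hypothetical and this is
not the vhost-user-backend crate's code. A control-plane thread flips a shared
"enabled" flag while a data-plane thread reacts to a kick; depending on
scheduling, the request is either processed or dropped, just like FUSE_INIT.)

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    // The vring starts disabled (VHOST_USER_F_PROTOCOL_FEATURES negotiated).
    let enabled = Arc::new(AtomicBool::new(false));

    // Data plane: woken up by the (simulated) virtio kick carrying FUSE_INIT.
    let data_plane = {
        let enabled = Arc::clone(&enabled);
        thread::spawn(move || {
            if enabled.load(Ordering::Acquire) {
                println!("data plane: processing FUSE_INIT");
            } else {
                println!("data plane: vring disabled, discarding FUSE_INIT");
            }
        })
    };

    // Control plane: handles VHOST_USER_SET_VRING_ENABLE concurrently.
    let control_plane = {
        let enabled = Arc::clone(&enabled);
        thread::spawn(move || {
            enabled.store(true, Ordering::Release);
            println!("control plane: vring enabled");
        })
    };

    data_plane.join().unwrap();
    control_plane.join().unwrap();
}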

Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a
deadlock.

The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.

The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act
upon
VHOST_USER_SET_VRING_ENABLE.

We can deterministically prevent the race in QEMU, by blocking OVMF
inside
step (1.1) -- i.e., in the write to the "queue_enable" register -- until
VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
cannot advance to the FUSE_INIT submission before 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Laszlo Ersek
On 8/30/23 14:10, Stefan Hajnoczi wrote:
> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
>>
>> (1) The virtio-1.0 specification
>>  writes:
>>
>>> 3 General Initialization And Device Operation
>>> 3.1   Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>the device, optional per-bus setup, reading and possibly writing the
>>>device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1   Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3   Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> .
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>   [crates/vhost-user-backend/src/handler.rs] handles
>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>   flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>>   The "VringEpollHandler::handle_event" method
>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>   virtio / FUSE request, consuming the virtio kick at the same time. If
>>   the vring's "enabled" flag is set, the virtio / FUSE request is
>>   processed genuinely. If the vring's "enabled" flag is clear, then the
>>   virtio / FUSE request is discarded.
> 
> Why is virtiofsd monitoring the virtqueue and discarding requests
> while it's disabled?

That's what the vhost-user spec requires:

https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states

"""
started but disabled: the back-end must process the ring without causing
any side effects. For example, for a networking device, in the disabled
state the back-end must not supply any new RX packets, but must process
and discard any TX packets.
"""

This state is different from "stopped", where "the back-end must not
process the ring at all".

The spec also says,

"""
If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
initialized in a disabled state and is enabled by
VHOST_USER_SET_VRING_ENABLE with parameter 1.
"""

AFAICT virtiofsd follows this requirement.
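
(For illustration only, a small Rust sketch of the three ring states quoted
above, "stopped", "started but disabled", "started and enabled", for a TX-like
queue; the types here are made up, this is not virtiofsd or QEMU code:)

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum RingState {
    Stopped,         // back-end must not process the ring at all
    StartedDisabled, // process the ring, but without side effects
    StartedEnabled,  // normal operation
}

fn handle_tx_buffer(state: RingState, buf: &[u8]) {
    match state {
        // In the stopped state this function would not even be reached,
        // because the ring is not processed at all.
        RingState::Stopped => {}
        // Started but disabled: consume (pop) the buffer and discard it.
        RingState::StartedDisabled => {
            println!("discarding {} bytes without side effects", buf.len());
        }
        // Started and enabled: process the buffer normally.
        RingState::StartedEnabled => {
            println!("transmitting {} bytes", buf.len());
        }
    }
}

fn main() {
    // With VHOST_USER_F_PROTOCOL_FEATURES negotiated, the ring starts out
    // "started but disabled" until VHOST_USER_SET_VRING_ENABLE(1) arrives.
    handle_tx_buffer(RingState::StartedDisabled, b"dropped");
    handle_tx_buffer(RingState::StartedEnabled, b"sent");
}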

> This seems like a bug in the vhost-user backend to me.

I didn't want to exclude that possibility; that's why I included Eugenio,
German, Liu Jiang, and Sergio in the CC list.

> 
> When the virtqueue is disabled, don't monitor the kickfd.
> 
> When the virtqueue transitions from disabled to enabled, the control
> plane should self-trigger the kickfd so that any available buffers
> will be processed.
> 
> QEMU uses this scheme to switch between vhost/IOThreads and built-in
> virtqueue kick processing.
> 
> This approach is more robust than relying 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Stefan Hajnoczi
On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek  wrote:
>
> (1) The virtio-1.0 specification
>  writes:
>
> > 3 General Initialization And Device Operation
> > 3.1   Device Initialization
> > 3.1.1 Driver Requirements: Device Initialization
> >
> > [...]
> >
> > 7. Perform device-specific setup, including discovery of virtqueues for
> >the device, optional per-bus setup, reading and possibly writing the
> >device’s virtio configuration space, and population of virtqueues.
> >
> > 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>
> and
>
> > 4 Virtio Transport Options
> > 4.1   Virtio Over PCI Bus
> > 4.1.4 Virtio Structure PCI Capabilities
> > 4.1.4.3   Common configuration structure layout
> > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >
> > [...]
> >
> > The driver MUST configure the other virtqueue fields before enabling the
> > virtqueue with queue_enable.
> >
> > [...]
>
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
>
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
>
> (1.2) set DRIVER_OK in step (8),
>
> (1.3) immediately start sending virtio requests to the device.
>
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> .
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
>
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
>
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
>
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
>
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
>
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
>
> - Control plane, handling vhost-user protocol messages:
>
>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>   [crates/vhost-user-backend/src/handler.rs] handles
>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>   flag according to the message processed.
>
> - Data plane, handling virtio / FUSE requests:
>
>   The "VringEpollHandler::handle_event" method
>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>   virtio / FUSE request, consuming the virtio kick at the same time. If
>   the vring's "enabled" flag is set, the virtio / FUSE request is
>   processed genuinely. If the vring's "enabled" flag is clear, then the
>   virtio / FUSE request is discarded.

Why is virtiofsd monitoring the virtqueue and discarding requests
while it's disabled? This seems like a bug in the vhost-user backend
to me.

When the virtqueue is disabled, don't monitor the kickfd.

When the virtqueue transitions from disabled to enabled, the control
plane should self-trigger the kickfd so that any available buffers
will be processed.
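
(A rough sketch of that disabled-to-enabled self-kick, in Rust using the libc
crate for the raw eventfd; the type and method names are made up for
illustration and this is not QEMU's or virtiofsd's actual code:)

use std::io;

struct Vring {
    kickfd: libc::c_int, // eventfd used to kick this queue
    enabled: bool,
}

impl Vring {
    fn set_enabled(&mut self, enable: bool) -> io::Result<()> {
        let was_enabled = self.enabled;
        self.enabled = enable;
        if enable && !was_enabled {
            // Self-trigger the kickfd on the disabled -> enabled transition:
            // writing 1 to the eventfd wakes the data-plane poll loop as if
            // the guest had kicked, so buffers queued while the ring was
            // disabled (and therefore not polled) are picked up now.
            let one: u64 = 1;
            let ret = unsafe {
                libc::write(
                    self.kickfd,
                    &one as *const u64 as *const libc::c_void,
                    std::mem::size_of::<u64>(),
                )
            };
            if ret < 0 {
                return Err(io::Error::last_os_error());
            }
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // A plain eventfd stands in for the vring's kick fd.
    let kickfd = unsafe { libc::eventfd(0, 0) };
    assert!(kickfd >= 0);
    let mut vring = Vring { kickfd, enabled: false };
    vring.set_enabled(true)
}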

QEMU uses this scheme to switch between vhost/IOThreads and built-in
virtqueue kick processing.

This approach is more robust than relying on buffers being enqueued after
the virtqueue is enabled.
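
(For illustration only, a minimal Rust sketch of the self-trigger idea --
not the actual vhost-user-backend code; the kick_fd parameter and the
surrounding enable path are assumptions, and the raw eventfd write uses
the libc crate:)

    use std::io;
    use std::os::unix::io::RawFd;

    /// Wake the epoll loop of a vring whose kickfd was not monitored
    /// while the ring was disabled. Writing a non-zero 8-byte value to
    /// an eventfd increments its counter and makes it readable, so the
    /// data plane re-scans the ring for buffers that became available
    /// while it was disabled.
    fn self_trigger_kick(kick_fd: RawFd) -> io::Result<()> {
        let one: u64 = 1;
        let ret = unsafe {
            libc::write(
                kick_fd,
                &one as *const u64 as *const libc::c_void,
                std::mem::size_of::<u64>(),
            )
        };
        if ret < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(())
        }
    }

The control plane would call something like this right after flipping the
vring's "enabled" flag and re-registering the kickfd with epoll.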

Stefan

>
> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> However, if the data plane processor in virtiofsd wins the race, then it
> sees the FUSE_INIT *before* the control plane processor took notice of
> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> back to waiting for further virtio / FUSE requests with epoll_wait.
> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
> The deadlock is not deterministic. OVMF hangs infrequently during first
> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> shell.
>
> The race can be "reliably masked" by inserting a very small delay -- a
> single debug message -- at the top of "VringEpollHandler::handle_event",
> i.e., just before the data plane processor checks the "enabled" field of
> the vring. That delay suffices for the control plane processor to act upon
> VHOST_USER_SET_VRING_ENABLE.

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Laszlo Ersek
On 8/30/23 10:39, Stefano Garzarella wrote:
> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>> (1) The virtio-1.0 specification
>>  writes:
> 
> What about referring to the latest spec available now (1.2)?

I didn't want to do that because the OVMF guest driver was written
against 1.0 (and the spec and the device are backwards compatible).

But, I don't feel strongly about this; I'm OK updating the reference /
quote to 1.2.

> 
>>
>>> 3 General Initialization And Device Operation
>>> 3.1   Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>    the device, optional per-bus setup, reading and possibly writing the
>>>    device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1   Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3   Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of
>> device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> .
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed
>> for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control
>> plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe)
>> runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>>  The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>  [crates/vhost-user-backend/src/handler.rs] handles
>>  VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>  flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>>  The "VringEpollHandler::handle_event" method
>>  [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>  virtio / FUSE request, consuming the virtio kick at the same time. If
>>  the vring's "enabled" flag is set, the virtio / FUSE request is
>>  processed genuinely. If the vring's "enabled" flag is clear, then the
>>  virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>> However, if the data plane processor in virtiofsd wins the race, then it
>> sees the FUSE_INIT *before* the control plane processor took notice of
>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>> back to waiting for further virtio / FUSE requests with epoll_wait.
>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a
>> deadlock.
>>
>> The deadlock is not deterministic. OVMF hangs infrequently during first
>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>> shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a
>> single debug message -- at the top of "VringEpollHandler::handle_event",
>> i.e., just before the data plane processor checks the "enabled" field of
>> the vring. That delay suffices for the control plane processor to act
>> upon
>> VHOST_USER_SET_VRING_ENABLE.
>>
>> We can 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Laszlo Ersek
On 8/30/23 10:59, Laszlo Ersek wrote:
> On 8/30/23 10:41, Laszlo Ersek wrote:
>> I'm adding Stefan to the CC list, and an additional piece of explanation
>> below:
>>
>> On 8/27/23 20:29, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>>  writes:
>>>
 3 General Initialization And Device Operation
 3.1   Device Initialization
 3.1.1 Driver Requirements: Device Initialization

 [...]

 7. Perform device-specific setup, including discovery of virtqueues for
the device, optional per-bus setup, reading and possibly writing the
device’s virtio configuration space, and population of virtqueues.

 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
 4 Virtio Transport Options
 4.1   Virtio Over PCI Bus
 4.1.4 Virtio Structure PCI Capabilities
 4.1.4.3   Common configuration structure layout
 4.1.4.3.2 Driver Requirements: Common configuration structure layout

 [...]

 The driver MUST configure the other virtqueue fields before enabling the
 virtqueue with queue_enable.

 [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> .
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>>   [crates/vhost-user-backend/src/handler.rs] handles
>>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>>   flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>>   The "VringEpollHandler::handle_event" method
>>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>>   virtio / FUSE request, consuming the virtio kick at the same time. If
>>>   the vring's "enabled" flag is set, the virtio / FUSE request is
>>>   processed genuinely. If the vring's "enabled" flag is clear, then the
>>>   virtio / FUSE request is discarded.
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>
>> I can explain why this issue has not been triggered by / witnessed with
>> the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
>>
>> That driver registers *two* driver (callback) structures, a virtio
>> driver, and a filesystem driver.
>>
>> (1) The virtio driver half initializes the virtio device, and takes a
>> note of the particular virtio filesystem, remembering its "tag". See
>> virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
>> -> virtio_fs_add_instance().
>>
>> Importantly, at this time, no FUSE_INIT request is sent.
>>
>> (2) The filesystem driver half has a totally independent entry point.

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Laszlo Ersek
On 8/30/23 10:41, Laszlo Ersek wrote:
> I'm adding Stefan to the CC list, and an additional piece of explanation
> below:
> 
> On 8/27/23 20:29, Laszlo Ersek wrote:
>> (1) The virtio-1.0 specification
>>  writes:
>>
>>> 3 General Initialization And Device Operation
>>> 3.1   Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>the device, optional per-bus setup, reading and possibly writing the
>>>device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1   Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3   Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> .
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>   [crates/vhost-user-backend/src/handler.rs] handles
>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>   flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>>   The "VringEpollHandler::handle_event" method
>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>   virtio / FUSE request, consuming the virtio kick at the same time. If
>>   the vring's "enabled" flag is set, the virtio / FUSE request is
>>   processed genuinely. If the vring's "enabled" flag is clear, then the
>>   virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>> However, if the data plane processor in virtiofsd wins the race, then it
>> sees the FUSE_INIT *before* the control plane processor took notice of
>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>> back to waiting for further virtio / FUSE requests with epoll_wait.
>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> 
> I can explain why this issue has not been triggered by / witnessed with
> the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
> 
> That driver registers *two* driver (callback) structures, a virtio
> driver, and a filesystem driver.
> 
> (1) The virtio driver half initializes the virtio device, and takes a
> note of the particular virtio filesystem, remembering its "tag". See
> virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
> -> virtio_fs_add_instance().
> 
> Importantly, at this time, no FUSE_INIT request is sent.
> 
> (2) The filesystem driver half has a totally independent entry point.
> The relevant parts (after the driver registration) are:
> 
> (a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
> 
> 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Laszlo Ersek
I'm adding Stefan to the CC list, and an additional piece of explanation
below:

On 8/27/23 20:29, Laszlo Ersek wrote:
> (1) The virtio-1.0 specification
>  writes:
> 
>> 3 General Initialization And Device Operation
>> 3.1   Device Initialization
>> 3.1.1 Driver Requirements: Device Initialization
>>
>> [...]
>>
>> 7. Perform device-specific setup, including discovery of virtqueues for
>>the device, optional per-bus setup, reading and possibly writing the
>>device’s virtio configuration space, and population of virtqueues.
>>
>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> 
> and
> 
>> 4 Virtio Transport Options
>> 4.1   Virtio Over PCI Bus
>> 4.1.4 Virtio Structure PCI Capabilities
>> 4.1.4.3   Common configuration structure layout
>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>
>> [...]
>>
>> The driver MUST configure the other virtqueue fields before enabling the
>> virtqueue with queue_enable.
>>
>> [...]
> 
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
> 
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
> 
> (1.2) set DRIVER_OK in step (8),
> 
> (1.3) immediately start sending virtio requests to the device.
> 
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> .
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
> 
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
> 
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
> 
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
> 
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
> 
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
> 
> - Control plane, handling vhost-user protocol messages:
> 
>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>   [crates/vhost-user-backend/src/handler.rs] handles
>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>   flag according to the message processed.
> 
> - Data plane, handling virtio / FUSE requests:
> 
>   The "VringEpollHandler::handle_event" method
>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>   virtio / FUSE request, consuming the virtio kick at the same time. If
>   the vring's "enabled" flag is set, the virtio / FUSE request is
>   processed genuinely. If the vring's "enabled" flag is clear, then the
>   virtio / FUSE request is discarded.
> 
> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> However, if the data plane processor in virtiofsd wins the race, then it
> sees the FUSE_INIT *before* the control plane processor took notice of
> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> back to waiting for further virtio / FUSE requests with epoll_wait.
> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.

I can explain why this issue has not been triggered by / witnessed with
the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").

That driver registers *two* driver (callback) structures, a virtio
driver, and a filesystem driver.

(1) The virtio driver half initializes the virtio device, and takes a
note of the particular virtio filesystem, remembering its "tag". See
virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
-> virtio_fs_add_instance().

Importantly, at this time, no FUSE_INIT request is sent.

(2) The filesystem driver half has a totally independent entry point.
The relevant parts (after the driver registration) are:

(a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and

(b) if the "tag" was found, virtio_fs_get_tree() ->
virtio_fs_fill_super() -> fuse_send_init().

Importantly, this occurs when guest userspace (i.e., an 

Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-30 Thread Stefano Garzarella

On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:

(1) The virtio-1.0 specification
 writes:


What about referring to the latest spec available now (1.2)?




3 General Initialization And Device Operation
3.1   Device Initialization
3.1.1 Driver Requirements: Device Initialization

[...]

7. Perform device-specific setup, including discovery of virtqueues for
   the device, optional per-bus setup, reading and possibly writing the
   device’s virtio configuration space, and population of virtqueues.

8. Set the DRIVER_OK status bit. At this point the device is “live”.


and


4 Virtio Transport Options
4.1   Virtio Over PCI Bus
4.1.4 Virtio Structure PCI Capabilities
4.1.4.3   Common configuration structure layout
4.1.4.3.2 Driver Requirements: Common configuration structure layout

[...]

The driver MUST configure the other virtqueue fields before enabling the
virtqueue with queue_enable.

[...]


These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:

(1.1) set "queue_enable" for the needed queues as the final part of device
initialization step (7),

(1.2) set DRIVER_OK in step (8),

(1.3) immediately start sending virtio requests to the device.

(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
enabling vrings.

Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
operation, which travels from the guest through QEMU to the vhost-user
backend, using a unix domain socket.

Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.

This means that steps (1.1) and (1.3) travel through different channels,
and their relative order can be reversed, as perceived by the vhost-user
backend.

That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)

Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:

- Control plane, handling vhost-user protocol messages:

 The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
 [crates/vhost-user-backend/src/handler.rs] handles
 VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
 flag according to the message processed.

- Data plane, handling virtio / FUSE requests:

 The "VringEpollHandler::handle_event" method
 [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
 virtio / FUSE request, consuming the virtio kick at the same time. If
 the vring's "enabled" flag is set, the virtio / FUSE request is
 processed genuinely. If the vring's "enabled" flag is clear, then the
 virtio / FUSE request is discarded.

Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.

The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.

The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act upon
VHOST_USER_SET_VRING_ENABLE.

We can deterministically prevent the race in QEMU, by blocking OVMF inside
step (1.1) -- i.e., in the write to the "queue_enable" register -- until
VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
cannot advance to the FUSE_INIT submission before virtiofsd's control
plane processor takes notice of the queue being enabled.

Wait for VHOST_USER_SET_VRING_ENABLE completion by:

- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
 for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
 has been negotiated, or

- performing a separate 

[PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously

2023-08-27 Thread Laszlo Ersek
(1) The virtio-1.0 specification
 writes:

> 3 General Initialization And Device Operation
> 3.1   Device Initialization
> 3.1.1 Driver Requirements: Device Initialization
>
> [...]
>
> 7. Perform device-specific setup, including discovery of virtqueues for
>the device, optional per-bus setup, reading and possibly writing the
>device’s virtio configuration space, and population of virtqueues.
>
> 8. Set the DRIVER_OK status bit. At this point the device is “live”.

and

> 4 Virtio Transport Options
> 4.1   Virtio Over PCI Bus
> 4.1.4 Virtio Structure PCI Capabilities
> 4.1.4.3   Common configuration structure layout
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>
> [...]
>
> The driver MUST configure the other virtqueue fields before enabling the
> virtqueue with queue_enable.
>
> [...]

These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:

(1.1) set "queue_enable" for the needed queues as the final part of device
initialization step (7),

(1.2) set DRIVER_OK in step (8),

(1.3) immediately start sending virtio requests to the device.

(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
enabling vrings.

Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
operation, which travels from the guest through QEMU to the vhost-user
backend, using a unix domain socket.

Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.

This means that steps (1.1) and (1.3) travel through different channels,
and their relative order can be reversed, as perceived by the vhost-user
backend.

That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)

Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:

- Control plane, handling vhost-user protocol messages:

  The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
  [crates/vhost-user-backend/src/handler.rs] handles
  VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
  flag according to the message processed.

- Data plane, handling virtio / FUSE requests:

  The "VringEpollHandler::handle_event" method
  [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
  virtio / FUSE request, consuming the virtio kick at the same time. If
  the vring's "enabled" flag is set, the virtio / FUSE request is
  processed genuinely. If the vring's "enabled" flag is clear, then the
  virtio / FUSE request is discarded.
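
(To make the interleaving concrete, here is a heavily simplified Rust
sketch of the two paths -- not the actual crate code; the type and
function names, and the use of an AtomicBool, are illustrative
assumptions:)

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    struct Vring {
        enabled: AtomicBool,
        // descriptor table, kickfd, etc. omitted
    }

    // Control plane: runs when VHOST_USER_SET_VRING_ENABLE arrives on
    // the unix domain socket.
    fn set_vring_enable(vring: &Arc<Vring>, enable: bool) {
        vring.enabled.store(enable, Ordering::Release);
    }

    // Data plane: runs when the kick eventfd fires. If it observes the
    // flag before the control plane has stored "true", the request that
    // woke it up (e.g. FUSE_INIT) is consumed and silently dropped --
    // the race described above.
    fn handle_event(vring: &Arc<Vring>) {
        if vring.enabled.load(Ordering::Acquire) {
            // process the available descriptor chains genuinely
        } else {
            // consume the kick, discard the request
        }
    }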

Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.

The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.

The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act upon
VHOST_USER_SET_VRING_ENABLE.

We can deterministically prevent the race in QEMU, by blocking OVMF inside
step (1.1) -- i.e., in the write to the "queue_enable" register -- until
VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
cannot advance to the FUSE_INIT submission before virtiofsd's control
plane processor takes notice of the queue being enabled.
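
(As a toy illustration of what such a synchronous enable looks like at the
protocol level -- not QEMU's implementation -- here is a Rust sketch of a
front-end sending VHOST_USER_SET_VRING_ENABLE with the NEED_REPLY flag set
and blocking on the acknowledgement; it assumes the
VHOST_USER_PROTOCOL_F_REPLY_ACK feature has been negotiated, and the
numeric constants should be double-checked against the vhost-user spec:)

    use std::io::{self, Read, Write};
    use std::os::unix::net::UnixStream;

    const VHOST_USER_SET_VRING_ENABLE: u32 = 18;
    const VHOST_USER_VERSION: u32 = 0x1;       // flags bits 0..1
    const VHOST_USER_NEED_REPLY: u32 = 1 << 3; // flags bit 3

    // Enable one vring and block until the backend acknowledges it.
    fn set_vring_enable_sync(sock: &mut UnixStream, index: u32) -> io::Result<()> {
        // 12-byte header (request, flags, payload size) followed by the
        // vring state payload (index, 1 = enable); little-endian shown,
        // which matches native order on typical hosts.
        let mut msg = Vec::with_capacity(20);
        msg.extend_from_slice(&VHOST_USER_SET_VRING_ENABLE.to_le_bytes());
        msg.extend_from_slice(&(VHOST_USER_VERSION | VHOST_USER_NEED_REPLY).to_le_bytes());
        msg.extend_from_slice(&8u32.to_le_bytes());
        msg.extend_from_slice(&index.to_le_bytes());
        msg.extend_from_slice(&1u32.to_le_bytes());
        sock.write_all(&msg)?;

        // The backend answers with another 12-byte header plus a u64
        // status (0 = success). Blocking here is what keeps the guest's
        // "queue_enable" write from completing before the backend has
        // seen the enable.
        let mut reply = [0u8; 20];
        sock.read_exact(&mut reply)?;
        let mut status = [0u8; 8];
        status.copy_from_slice(&reply[12..20]);
        if u64::from_le_bytes(status) == 0 {
            Ok(())
        } else {
            Err(io::Error::new(io::ErrorKind::Other, "backend reported failure"))
        }
    }

This corresponds to the first of the two wait mechanisms listed below; the
second (a separate VHOST_USER_GET_FEATURES exchange) forces a backend
response without depending on REPLY_ACK.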

Wait for VHOST_USER_SET_VRING_ENABLE completion by:

- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
  for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
  has been negotiated, or

- performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
  a backend response regardless of