Re: Deadlock with SATA CD I/O and eject

2023-09-19 Thread John Levon
On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote:

> > In the meantime, we start processing the blk_drain() code, so by the time this
> > blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> > blk_wait_while_drained().
> > 
> > I don't know the qemu block stack well enough to propose an actual fix.
> > 
> > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
> > that's just a band-aid instead of fixing the deadlock.
> 
> Related discussion: 
> https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
> 
> Actually, it seems we never fixed that problem either?
> 
> Back then I suggested that blk_drain*() should disable request queuing
> because its callers don't want to quiesce the BlockBackend, but rather
> get their own requests completed. I think this approach would solve this
> one as well.

In this case though, it's not their own requests, right? We have incoming I/O
from the guest, and the eject is a separate operation.

So why would it be OK to disable queuing and have ongoing I/Os (i.e.
blk->in_flight > 0) racing against the block removal?

> Your experiment with moving the queuing later is basically what Paolo

I think it's much more flaky than that, isn't it? It's just clearing out the
*current* pending queue, but nothing would stop new I/Os from being started
before we dropped into the poll for blk->in_flight to be zero again. In my case,
this just happens to work because the prior tray open notification has stopped
new I/O being filed, right?

thanks
john



Deadlock with SATA CD I/O and eject

2023-09-18 Thread John Levon


Observed with base of qemu 6.2.0, but from code inspection it looks to me like
it's still current in upstream master. Apologies if I have missed a fix in this
area.

Symptom: run a UEFI-booted SATA CD Windows installer. When it hits "Loading
files.." screen, run an eject e.g.

virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae 
'{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'

qemu will get stuck like so:

(gdb) bt
#0  0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6
#1  0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
#3  0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
#4  0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
#5  0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
#6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
#7  0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
#8  0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
#9  0x56181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
#10 0x56181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45

We are stuck forever here:

 351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
 352                                   bool poll)
...
 380     if (poll) {
 381         BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
 382     }

Because the blk root's ->in_flight is 1, as tested by the condition in
blk_root_drained_poll().


Our blk->in_flight user is stuck here:

1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
...
1310     blk_dec_in_flight(blk);
1311     qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
1312     blk_inc_in_flight(blk);

Note that before entering this stanza, blk->in_flight was 2. This turns out to
be due to the ide atapi code. In particular, the guest VM is generating lots of
read I/O. The "first IO" arrives into blk via:

cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()

This initial IO completes:

blk_aio_read_entry()->blk_aio_complete()

1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
1561 {
1562     if (acb->has_returned) {
1563         acb->common.cb(acb->common.opaque, acb->rwco.ret);
1564         blk_dec_in_flight(acb->rwco.blk);
1565         qemu_aio_unref(acb);
1566     }
1567 }

Line 1564 is what we need to move blk->in_flight down to zero, but that is never
reached! This is because of what happens at line 1563:

acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()

That is, the IO callback in the atapi code itself triggers more - synchronous -
IO.

In the meantime, we start processing the blk_drain() code, so by the time this
blk_pread() actually gets handled, quiesce is set, and we get stuck in the
blk_wait_while_drained().

I don't know the qemu block stack well enough to propose an actual fix.

Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
that's just a band-aid instead of fixing the deadlock.
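
(For concreteness, the experiment was roughly the following - just a sketch,
where the exact placement within blk_remove_bs() and the qatomic accessor are
my assumptions, not an actual proposed diff:)

```c
/* sketch: in blk_remove_bs(), before the existing quiesce/drain, let the
 * in-flight request - and the synchronous blk_pread() issued from its
 * completion callback - finish first */
AIO_WAIT_WHILE(blk_get_aio_context(blk),
               qatomic_read(&blk->in_flight) > 0);
```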

Any suggestions/clues/thoughts?

thanks
john



Re: [RFC PATCH v2 1/1] vfio-user: add live migration to vfio-user protocol specification

2023-08-23 Thread John Levon
On Wed, Aug 23, 2023 at 10:04:00AM +, William Henderson wrote:

> +* *argsz* is the size of the above structure.

As we discussed offline, this isn't right for any of these. They have the same
->argsz semantics as discussed here:

https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#message-sizes

So this value should in fact be the (max) size of the reply message.

regards
john



Re: [PATCH v1 06/24] vfio-user: Define type vfio_user_pci_dev_info

2022-12-12 Thread John Levon
On Mon, Dec 12, 2022 at 10:01:33AM +0100, Cédric Le Goater wrote:

> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 80b03a2..dc19869 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -19,6 +19,7 @@
> >*/
> >   #include "qemu/osdep.h"
> > +#include CONFIG_DEVICES
> >   #include 
> >   #include 
> > @@ -3421,3 +3422,91 @@ static void register_vfio_pci_dev_type(void)
> >   }
> >   type_init(register_vfio_pci_dev_type)
> > +
> > +
> > +#ifdef CONFIG_VFIO_USER_PCI
> 
> Why not introduce a new file hw/vfio/user.c file ? It would be
> cleaner.

user.c is in this series, and holds the vfio-user implementation - it's not a
PCI-specific thing. So it would have to be hw/vfio/user_pci.c or something.

regards
john



Re: [PATCH v1 23/24] vfio-user: add coalesced posted writes

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:45PM -0800, John Johnson wrote:

> Add new message to send multiple writes to server

LGTM - but maybe commit message could have some context as to why this is
useful?

Reviewed-by: John Levon 

regards
john



Re: [PATCH v1 22/24] vfio-user: add 'x-msg-timeout' option that specifies msg wait times

2022-12-09 Thread John Levon


LGTM

Reviewed-by: John Levon 

regards
john



Re: [PATCH v1 21/24] vfio-user: pci reset

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:43PM -0800, John Johnson wrote:

> Message to tell the server to reset the device.

LGTM

Reviewed-by: John Levon 

regards
john



Re: [PATCH v1 20/24] vfio-user: dma read/write operations

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:42PM -0800, John Johnson wrote:

>  static void vfio_user_pci_process_req(void *opaque, VFIOUserMsg *msg)
>  {
> +VFIOPCIDevice *vdev = opaque;
> +VFIOUserHdr *hdr = msg->hdr;
> +
> +/* no incoming PCI requests pass FDs */
> +if (msg->fds != NULL) {
> +vfio_user_send_error(vdev->vbasedev.proxy, hdr, EINVAL);
> +vfio_user_putfds(msg);
> +return;
> +}
>  
> +switch (hdr->command) {
> +case VFIO_USER_DMA_READ:
> +vfio_user_dma_read(vdev, (VFIOUserDMARW *)hdr);
> +break;
> +case VFIO_USER_DMA_WRITE:
> +vfio_user_dma_write(vdev, (VFIOUserDMARW *)hdr);
> +break;
> +default:
> +error_printf("vfio_user_process_req unknown cmd %d\n", hdr->command);

__func__ or vfio_user_pci_process_req ?

regards
john



Re: [PATCH v1 19/24] vfio-user: secure DMA support

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:41PM -0800, John Johnson wrote:

> Secure DMA forces the remote process to use DMA r/w messages
> instead of directly mapping guest memeory.

I don't really get why this is called "secure" - shouldn't it have an option
name more closely resembling what it actually does?

regards
john



Re: [PATCH v1 18/24] vfio-user: add dma_unmap_all

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:40PM -0800, John Johnson wrote:

> Signed-off-by: John G Johnson 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: Jagannathan Raman 
> ---

> +static int vfio_user_io_dirty_bitmap(VFIOContainer *container,
> +struct vfio_iommu_type1_dirty_bitmap *bitmap,
> +struct vfio_iommu_type1_dirty_bitmap_get *range)
> +{
> +
> +/* vfio-user doesn't support migration */
> +return -EINVAL;
> +}

... and dirty_bitmap callback?

regards
john



Re: [PATCH v1 16/24] vfio-user: proxy container connect/disconnect

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:38PM -0800, John Johnson wrote:

> +/*
> + * The proxy uses a SW IOMMU in lieu of the HW one
> + * used in the ioctl() version.  Mascarade as TYPE1
> + * for maximum capatibility
> + */

capability

> @@ -3692,12 +3698,18 @@ static void vfio_user_instance_finalize(Object *obj)
>  {
>  VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
>  VFIODevice *vbasedev = &vdev->vbasedev;
> +VFIOGroup *group = vbasedev->group;
> +
> +vfio_bars_finalize(vdev);
> +g_free(vdev->emulated_config_bits);
> +g_free(vdev->rom);

These changes seem unrelated to this particular patch?

> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 793ca94..312ef9c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -94,6 +94,7 @@ typedef struct VFIOContainer {
>  uint64_t max_dirty_bitmap_size;
>  unsigned long pgsizes;
>  unsigned int dma_max_mappings;
> +VFIOProxy *proxy;
>  QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>  QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>  QLIST_HEAD(, VFIOGroup) group_list;
> @@ -282,6 +283,11 @@ void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>  VFIODevice *vbasedev, Error **errp);
>  
> +int vfio_user_get_device(VFIOGroup *group, VFIODevice *vbasedev, Error 
> **errp);
> +VFIOGroup *vfio_user_get_group(VFIOProxy *proxy, AddressSpace *as,
> +   Error **errp);
> +void vfio_user_put_group(VFIOGroup *group);
> +

Why aren't these in user.h?

regards
john



Re: [PATCH v1 15/24] vfio-user: forward msix BAR accesses to server

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:37PM -0800, John Johnson wrote:

> Server holds device current device pending state
> Use irq masking commands in socket case

As far as I'm able to tell, this looks good to me

regards
john



Re: [PATCH v1 14/24] vfio-user: get and set IRQs

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:36PM -0800, John Johnson wrote:

> +static int vfio_user_io_get_irq_info(VFIODevice *vbasedev,
> + struct vfio_irq_info *irq)
> +{
> +int ret;
> +
> +ret = vfio_user_get_irq_info(vbasedev->proxy, irq);
> +if (ret) {
> +return ret;
> +}
> +
> +if (irq->index > vbasedev->num_irqs) {
> +return -EINVAL;
> +}

Why are we validating ->index *after* requesting the info? Seems a bit weird?

regards
john



Re: [PATCH v1 13/24] vfio-user: pci_user_realize PCI setup

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:35PM -0800, John Johnson wrote:

> PCI BARs read from remote device
> PCI config reads/writes sent to remote server

Reviewed-by: John Levon 

regards
john



Re: [PATCH v1 12/24] vfio-user: region read/write

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:34PM -0800, John Johnson wrote:

> Add support for posted writes on remote devices

LGTM

Reviewed-by: John Levon 

regards
john




Re: [PATCH v1 11/24] vfio-user: get region info

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:33PM -0800, John Johnson wrote:

> Add per-region FD to support mmap() of remote device regions
> 
> +/*
> + * VFIO_USER_DEVICE_GET_REGION_INFO
> + * imported from struct_vfio_region_info

s/struct_vfio_region_info/struct vfio_region_info/

regards
john



Re: [PATCH v1 10/24] vfio-user: get device info

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:32PM -0800, John Johnson wrote:

> +/*
> + * VFIO_USER_DEVICE_GET_INFO
> + * imported from struct_device_info
> + */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint32_t argsz;
> +uint32_t flags;
> +uint32_t num_regions;
> +uint32_t num_irqs;
> +uint32_t cap_offset;
> +} VFIOUserDeviceInfo;

The server doesn't have or populate cap_offset - and this isn't in the protocol
either.

Looks good otherwise

regards
john



Re: [PATCH v1 09/24] vfio-user: define socket send functions

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:31PM -0800, John Johnson wrote:

> Also negotiate protocol version with remote server

LGTM

Reviewed-by: John Levon 

regards
john



Re: [PATCH v1 08/24] vfio-user: define socket receive functions

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:30PM -0800, John Johnson wrote:

> Add infrastructure needed to receive incoming messages
> 
> +static void vfio_user_process(VFIOProxy *proxy, VFIOUserMsg *msg, bool 
> isreply)
> +{
> +
> +/*
> + * Replies signal a waiter, if none just check for errors
> + * and free the message buffer.
> + *
> + * Requests get queued for the BH.
> + */
> +if (isreply) {
> +msg->complete = true;
> +if (msg->type == VFIO_MSG_WAIT) {
> +qemu_cond_signal(&msg->cv);
> +} else {
> +if (msg->hdr->flags & VFIO_USER_ERROR) {
> +error_printf("vfio_user_rcv:  error reply on async request 
> ");

nit, s/vfio_user_rcv/vfio_user_process/ , or even __func__ ?

> +error_prepend(&local_err, "vfio_user_recv: ");

nit, same here

regards
john



Re: [PATCH v1 07/24] vfio-user: connect vfio proxy to remote server

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:29PM -0800, John Johnson wrote:

> add user.c & user.h files for vfio-user code
> add proxy struct to handle comms with remote server

LGTM

Reviewed-by: John Levon 

regards
john




Re: [PATCH v1 06/24] vfio-user: Define type vfio_user_pci_dev_info

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:28PM -0800, John Johnson wrote:

> New class for vfio-user with its class and instance
> constructors and destructors, and its pci ops.

LGTM

Reviewed-by: John Levon 

regards
john




Re: [PATCH v1 05/24] vfio-user: add device IO ops vector

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:27PM -0800, John Johnson wrote:

> Used for communication with VFIO driver
> (prep work for vfio-user, which will communicate over a socket)
> 
> @@ -1166,12 +1178,13 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, 
> uint32_t addr, int len)
>  if (~emu_bits & (0xU >> (32 - len * 8))) {
>  ssize_t ret;
>  
> -ret = pread(vdev->vbasedev.fd, &phys_val, len,
> -vdev->config_offset + addr);
> +ret = VDEV_CONFIG_READ(vbasedev, addr, len, &phys_val);
>  if (ret != len) {
> -error_report("%s(%s, 0x%x, 0x%x) failed: %m",
> - __func__, vdev->vbasedev.name, addr, len);
> -return -errno;
> +const char *err = ret < 0 ? strerror(-ret) : "short read";
> +
> +error_report("%s(%s, 0x%x, 0x%x) failed: %s",
> + __func__, vbasedev->name, addr, len, err);
> +return -1;

Mention this drive-by return value fix in the commit message?

> @@ -1283,10 +1300,13 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int 
> pos, Error **errp)
>  int ret, entries;
>  Error *err = NULL;
>  
> -if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
> -  vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
> -error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
> -return -errno;
> +ret = VDEV_CONFIG_READ(&vdev->vbasedev, pos + PCI_CAP_FLAGS,
> +   sizeof(ctrl), &ctrl);
> +if (ret != sizeof(ctrl)) {
> +const char *err = ret < 0 ? strerror(-ret) : "short read";
> +
> +error_setg(errp, "failed reading MSI PCI_CAP_FLAGS %s", err);
> +return ret;

I suppose this was already broken, but if we get a short read we won't bubble up
an error properly here since ret is not in -errno form. Probably true of some
other short read paths?

regards
john



Re: [PATCH v1 04/24] vfio-user: add region cache

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:26PM -0800, John Johnson wrote:

> cache VFIO_DEVICE_GET_REGION_INFO results to reduce
> memory alloc/free cycles and as prep work for vfio-user

LGTM

Reviewed-by: John Levon 

regards
john



Re: [PATCH v1 03/24] vfio-user: add container IO ops vector

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:25PM -0800, John Johnson wrote:

> Used for communication with VFIO driver
> (prep work for vfio-user, which will communicate over a socket)

> index e573f5a..6fd40f1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> +
> +extern VFIOContIO vfio_cont_io_ioctl;

Nit, there's no need for this to be non-static, it's only used in
hw/vfio/common.c

regards
john



Re: [PATCH v1 02/24] vfio-user: add VFIO base abstract class

2022-12-09 Thread John Levon
On Tue, Nov 08, 2022 at 03:13:24PM -0800, John Johnson wrote:

> Add an abstract base class both the kernel driver
> and user socket implementations can use to share code.

LGTM

Reviewed-by: John Levon 

regards
john



Re: [PATCH v3 4/4] hw/nvme: add polling support

2022-11-08 Thread John Levon
On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:

> On Nov  3 21:19, Jinhao Fan wrote:
> > On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> > > I agree that the spec is a little unclear on this point. In any case, in
> > > Linux, when the driver has decided that the sq tail must be updated,
> > > it will use this check:
> > > 
> > >(new_idx - event_idx - 1) < (new_idx - old)
> > 
> > When eventidx is already behind, it's like:
> > 
> >  0
> >  1 <- event_idx
> >  2 <- old
> >  3 <- new_idx
> >  4
> >  .
> >  .
> >  .
> > 
> > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
> > 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
> 
> That becomes 1 >= 1, i.e. "true". So this will result in the driver
> doing an mmio doorbell write.

The code is:

static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
{
        return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
}
 

which per the above is "return 1 < 1;", or false. So the above case does *not*
do an mmio write. No?
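
To make the wraparound arithmetic concrete, here is a tiny standalone version
(with uint16_t standing in for the kernel's u16), plugged with the values from
the example above:

```c
#include <stdint.h>
#include <stdio.h>

/* same check as nvme_dbbuf_need_event(), with uint16_t in place of u16 */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

int main(void)
{
    /* event_idx = 1, old = 2, new_idx = 3, as in the example above */
    printf("%d\n", need_event(1, 3, 2));    /* prints 0: no doorbell write */
    return 0;
}
```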

regards
john



Re: [PATCH 0/1] Update vfio-user module to the latest

2022-08-07 Thread John Levon
On Fri, Aug 05, 2022 at 09:24:56AM +0100, Daniel P. Berrangé wrote:

> > > For the RFC QEMU user space eBPF support,
> > > https://lore.kernel.org/all/20220617073630.535914-6-chen.zh...@intel.com/T/
> > > Maybe introduce the libubpf.so as a subproject like libvfio-user.so is 
> > > more appropriate?
> > 
> > Fair comment. I never noticed them before, but why do we have those
> > submodules in the subprojects/ folder (libvduse, libvfio-user and
> > libvhost-user)? ... I don't think it's the job of QEMU to ship libraries
> > that a user might want to use for a certain feature, so could we please
> > remove those submodules again? If someone wants to use this, they can
> > compile the libraries on their own or help their favorite distribution to
> > ship them as packages.
> 
> FWIW, I don't really agree with shipping libvfio-user.so as a submodule
> either, but the consensus was that we have to do it because there's no
> stable ABI committed to by libvfio-user maintainers yet.  My counterpoint
> is that as long as QEMU ships libvfio-user as a submodule, there's no
> incentive to create a stable ABI, leaving a chicken & egg scenario.

qemu is not the only consumer of the library, so I'm not sure removing the
submodule from qemu moves the needle in either direction.

We are still discovering our abstractions are not quite right in places, so
we're not yet confident enough to mark the API/ABI as stable (nor do we have any
testing of this in place). It would be all downside at this point.

> IOW personally I'd rather libvfio-user.so was put into the distros right
> now, and have the pain ABI incompatible releases act as motivation for
> the creation of a stable ABI.

We can't control what the distributions choose to do, but speaking for
libvfio-user, we would not support this choice or anything like it. It would
only cause pain for users.

> A second factor is that as long as it is a submodule, there is little
> pressure for the distros to actually package the library, which leaves
> us in a place where someone will always object to removing the submodule
> from QEMU because it doesn't exist in distro X.

No distribution has even *asked* us about this.  Do you have some evidence that
by making this more difficult, somehow we'll start hearing from all the distros?
What's the mechanism by which this will work?

It seems to me that all that will happen by removing it, is make it extremely
annoying for anyone wanting to use it with qemu, as every user will have to
figure out which commit is needed for the qemu commit they're trying to compile.

> If we do add something as a submodule for some reason, I'd like us to
> say upfront that this is for a fixed time period (ie maximum of 3
> releases aka 1 year) only after which we'll remove it no matter what.

I'm not clear on the costs of having the submodule: could you please explain why
it's an issue exactly? I can only think of the flaky test issue (for which I'm
sorry).

regards
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread John Levon
On Wed, Jun 15, 2022 at 07:22:22PM +0800, Jinhao Fan wrote:

> >>> Isn't this racy against the driver? Compare
> >>> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> >> 
> >> QEMU has full memory barriers on dma read/write, so I believe this is
> >> safe?
> > 
> > But don't you need to re-read the tail still, for example:
> 
> I think we also have a check for concurrent update on the tail. After writing 
> eventidx, we read the tail again. It is here:
> 
> @@ -5854,6 +5943,11 @@ static void nvme_process_sq(void *opaque)
>  req->status = status;
>  nvme_enqueue_req_completion(cq, req);
>  }
> +
> +if (n->dbbuf_enabled) {
> +nvme_update_sq_eventidx(sq);
> +nvme_update_sq_tail(sq);
> +}

Ah, and we go around the loop another time in this case.

> > driver  device
> > 
> > eventidx is 3
> > 
> > write 4 to tail
> > read tail of 4
> > write 5 to tail
> > read eventidx of 3
> > nvme_dbbuf_need_event (1)
> > 
> > set eventidx to 4
> 
> Therefore, at this point, we read the tail of 5.

The driver could still update the tail after the nvme_update_sq_tail() above.
However, the driver ordering (read tail, then eventidx) does mean that it would
then do an mmio write, so yes, this looks safe, thank you.

regards
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread John Levon
On Wed, Jun 15, 2022 at 11:33:02AM +0200, Klaus Jensen wrote:

> > BTW I'm surprised that this patch has just this:
> > 
> > +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
> > +{
> > +pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
> > +  sizeof(sq->tail));
> > +}
> > 
> > Isn't this racy against the driver? Compare
> > https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> > 
> > thanks
> > john
> 
> QEMU has full memory barriers on dma read/write, so I believe this is
> safe?

But don't you need to re-read the tail still, for example:


driver  device

eventidx is 3

write 4 to tail
read tail of 4
write 5 to tail
read eventidx of 3
nvme_dbbuf_need_event (1)

set eventidx to 4
go to sleep

at (1), our tail update of 4->5 doesn't straddle the eventidx, so we don't send
any MMIO, and the device won't wake up. This is why the above code checks the
tail twice for any concurrent update.
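
Numerically, using a standalone re-implementation of the driver's
nvme_dbbuf_need_event() check (uint16_t standing in for u16):

```c
#include <stdint.h>
#include <stdio.h>

/* the driver's nvme_dbbuf_need_event() check, re-implemented standalone */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

int main(void)
{
    /* the 4->5 tail update above is checked against the stale eventidx of 3 */
    printf("%d\n", need_event(3, 5, 4));    /* prints 0: no MMIO is sent */
    /* ...so the only way the device ever sees tail 5 is to re-read the
     * shadow tail after publishing the new eventidx, as in the code above */
    return 0;
}
```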

regards
john





Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-15 Thread John Levon
On Wed, Jun 15, 2022 at 10:48:26AM +0200, Klaus Jensen wrote:

> > By the way, I noticed that the patch never updates the cq's ei_addr value. 
> > Is
> > that on purpose?
> 
> Yeah, I also mentioned this previously[1] and I still think we need to
> update the event index. Otherwise (and my testing confirms this), we end
> up in a situation where the driver skips the mmio, leaving a completion
> queue entry "in use" on the device until some other completion comes
> along.

Hmm, can you expand on this a little bit? We don't touch the cq eventidx in
SPDK either, on the basis that mmio exits are expensive, and we only ever need
to look at cq_head when we're checking for room when posting a completion - and
in that case, we can just look directly at the shadow cq_head value.
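
For illustration, the kind of check I mean is something like this - a sketch
with made-up names, not the actual SPDK or hw/nvme code:

```c
#include <stdbool.h>
#include <stdint.h>

/* sketch only: the device checks for completion-queue room by reading the
 * driver-written shadow cq head directly, so no MMIO doorbell write from
 * the driver is needed; "shadow_cq_head" is an illustrative name */
static bool cq_has_room(uint32_t tail, uint32_t size,
                        const volatile uint32_t *shadow_cq_head)
{
    return ((tail + 1) % size) != *shadow_cq_head;
}
```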

Can you clarify the exact circumstance that needs an mmio write when the driver
updates cq_head?

BTW I'm surprised that this patch has just this:

+static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
+{
+pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
+  sizeof(sq->tail));
+}

Isn't this racy against the driver? Compare
https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317

thanks
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-09 Thread John Levon
On Thu, Jun 09, 2022 at 07:27:57PM +0200, Klaus Jensen wrote:

> > It's not just unnecessary, but enabling shadow doorbells on admin queues 
> > will
> > actively break device implementations (such as SPDK), which have had to 
> > presume
> > that driver implementations don't use shadow doorbells for the admin queue.
> 
> I'm ok with following the concensus here, but we all agree that this is
> a blatant spec violation that ended up manifesting itself down the
> stack, right?

Yup. I've been meaning to try following up to get the spec updated to reflect
reality, but haven't made progress there yet.

> So... if QEMU wants to be compliant here, I guess we could ask the
> kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> be a first! Not sure if I am being serious or not here ;)

:)

regards
john



Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support

2022-06-09 Thread John Levon
On Thu, Jun 09, 2022 at 08:29:30AM -0600, Keith Busch wrote:

> On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> > 
> > Keith, is this a bug in the kernel? If the code here would expect the
> > doorbell buffer to be updated for the admin queue as well, would we
> > break?
> 
> The admin queue has to be created before db buffer can be set up, so we can't
> rely on it. And this is a performance feature. Admin commands have never been
> considered performant, so we decided not to consider it though the spec allows
> it.

It's not just unnecessary, but enabling shadow doorbells on admin queues will
actively break device implementations (such as SPDK), which have had to presume
that driver implementations don't use shadow doorbells for the admin queue.

regards
john



Re: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification

2022-03-10 Thread John Levon
On Wed, Mar 09, 2022 at 03:34:53PM -0700, Alex Williamson wrote:

> > +The only device-specific region type and subtype supported by vfio-user is
> > +``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` 
> > (1).
> 
> These should be considered deprecated from the kernel interface.  I
> hope there are plans for vfio-user to adopt the new interface that's
> currently available in linux-next and intended for v5.18.

https://github.com/nutanix/libvfio-user/issues/654

regards
john



Re: [PATCH v4 07/14] vfio-user: run vfio-user context

2022-01-10 Thread John Levon
On Thu, Jan 06, 2022 at 01:35:32PM +, Stefan Hajnoczi wrote:

> > > >> +static void vfu_object_attach_ctx(void *opaque)
> > > >> +{
> > > >> +VfuObject *o = opaque;
> > > >> +GPollFD pfds[1];
> > > >> +int ret;
> > > >> +
> > > >> +qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > > >> +
> > > >> +pfds[0].fd = o->vfu_poll_fd;
> > > >> +pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > > >> +
> > > >> +retry_attach:
> > > >> +ret = vfu_attach_ctx(o->vfu_ctx);
> > > >> +if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> > > >> +qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
> > > >> +goto retry_attach;
> > > >
> > > > This can block the thread indefinitely. Other events like monitor
> > > > commands are not handled in this loop. Please make this asynchronous
> > > > (set an fd handler and return from this function so we can try again
> > > > later).
> > > >
> > > > The vfu_attach_ctx() implementation synchronously negotiates the
> > > > vfio-user connection :(. That's a shame because even if accept(2) is
> > > > handled asynchronously, the negotiation can still block. It would be
> > > > cleanest to have a fully async libvfio-user's vfu_attach_ctx() API to
> > > > avoid blocking. Is that possible?
> > > 
> > > Thanos / John,
> > > 
> > > Any thoughts on this?
> > 
> > I'm discussing this with John and FYI there are other places where 
> > libvfio-user can block, e.g. sending a response or receiving a command. Is 
> > it just the negotiation you want it to be asynchronous or _all_ 
> > libvfio-user operations? Making libvfio-user fully asynchronous might 
> > require a substantial API rewrite.
> 
> I see at least two reasons for a fully async API:
> 
> 1. The program wants to handle other events (e.g. a management REST API)
>from the same event loop thread that invokes libvfio-user. If
>libvfio-user blocks then the other events cannot be handled within a
>reasonable time frame.
> 
>The workaround for this is to use multi-threading and ignore the
>event-driven architecture implied by vfu_get_poll_fd().
> 
> 2. The program handles multiple clients that do not trust each other.
>This could be a software-defined network switch or storage appliance.
>A malicious client can cause a denial-of-service by making a
>libvfio-user call block.
> 
>Again, the program needs separate threads instead of an event loop to
>work around this.

Hi Stefan, we're certainly aware of the rationale. Ultimately we just haven't
yet found resources to fix this: in practice, we don't really hit problems from
the parts that are still synchronous. Of course, that's not a good argument when
it comes to your second issue, and indeed the library is not currently suitable
for multiple untrusted clients.

For Jag, though: patches are welcome. But it's not just negotiate(): all sorts of
things are potentially synchronous - for example, it's expected that the region
read/write callbacks are done synchronously. Any other client reply, or
server->client message, is also synchronous.

It's partly why we haven't yet stabilised the API.

regards
john


Re: [PATCH v4 08/14] vfio-user: handle PCI config space accesses

2021-12-16 Thread John Levon
On Thu, Dec 16, 2021 at 11:30:20AM +, Stefan Hajnoczi wrote:

> > +ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
> > +   pci_config_size(o->pci_dev), &vfu_object_cfg_access,
> > +   VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
> > +   NULL, 0, -1, 0);
> 
> Thanos or John: QEMU's pci_host_config_read/write_common() works with
> host-endian values. I don't know which endianness libvfio-user's region
> callbacks expect. Does this patch look endian-safe to you (e.g. will it
> work on a big-endian host)?

https://github.com/nutanix/libvfio-user/issues/615

regards
john



Re: [PATCH v3 06/12] vfio-user: run vfio-user context

2021-10-28 Thread John Levon
On Wed, Oct 27, 2021 at 05:21:30PM +0100, Stefan Hajnoczi wrote:

> > +static void vfu_object_ctx_run(void *opaque)
> > +{
> > +VfuObject *o = opaque;
> > +int ret = -1;
> > +
> > +while (ret != 0) {
> > +ret = vfu_run_ctx(o->vfu_ctx);
> > +if (ret < 0) {
> > +if (errno == EINTR) {
> > +continue;
> > +} else if (errno == ENOTCONN) {
> > +qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > +o->vfu_poll_fd = -1;
> > +object_unparent(OBJECT(o));
> > +break;
> > +} else {
> > +error_setg(&error_abort, "vfu: Failed to run device %s - %s",
> > +   o->device, strerror(errno));
> > + break;
> > +}
> 
> The libvfio-user.h doc comments say this function can return -1 (EAGAIN)
> when LIBVFIO_USER_FLAG_ATTACH_NB was used. Is the doc comment wrong
> since this patch seems to rely on vfu_run_ctx() returning 0 when there
> are no more commands to process?

 * @returns the number of requests processed (0 or more); or -1 on error,
 *          with errno set as follows:

So in fact it does return 0. The EAGAIN line needs removing; I'll fix it
shortly.

> > +static void vfu_object_attach_ctx(void *opaque)
> > +{
> > +VfuObject *o = opaque;
> > +int ret;
> > +
> > +qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> > +o->vfu_poll_fd = -1;
> > +
> > +retry_attach:
> > +ret = vfu_attach_ctx(o->vfu_ctx);
> > +if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> > +goto retry_attach;
> 
> Can we wait for the poll fd to become readable instead of spinning? I
> don't know the libvfio-user APIs so I'm not sure, but it would be nice
> to avoid a busy loop.

At this point, ->vfu_poll_fd is a listening socket, so there's no reason why
qemu couldn't epoll on it.
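
Something like this, say - a rough, untested sketch reusing names from the
patch above, and assuming the handler has been registered on the listening fd
with qemu_set_fd_handler() beforehand:

```c
/* sketch: retry the attach from an fd handler instead of spinning */
static void vfu_object_attach_ctx(void *opaque)
{
    VfuObject *o = opaque;

    if (vfu_attach_ctx(o->vfu_ctx) < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* stay registered; we're called again when the fd is readable */
            return;
        }
        error_setg(&error_abort, "vfu: Failed to attach device %s - %s",
                   o->device, strerror(errno));
        return;
    }

    /* attached: the poll fd now refers to the connection, switch handlers */
    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run, NULL, o);
}
```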

regards
john


Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion

2021-10-27 Thread John Levon
On Wed, Oct 27, 2021 at 11:15:35AM +0100, Stefan Hajnoczi wrote:

> > > I like this approach as well.
> > > As you have mentioned, the device emulation code with first approach
> > > does have to how to handle the region accesses. The second approach will
> > > make things more transparent. Let me see how can I modify what there is
> > > there now and may ask further questions.
> > 
> > Sorry I'm a bit late to this discussion, I'm not clear on the above WRT
> > vfio-user. If an ioregionfd has to cover a whole BAR0 (?), how would this
> > interact with partly-mmap()able regions like we do with SPDK/vfio-user/NVMe?
> 
> The ioregionfd doesn't need to cover an entire BAR. QEMU's MemoryRegions
> form a hierarchy, so it's possible to sub-divide the BAR into several
> MemoryRegions.

I think you're saying that when the vfio-user client in qemu calls
VFIO_USER_DEVICE_GET_REGION_IO_FDS, it would create a sub-MR corresponding to
each sub-region, before asking KVM to configure them?

thanks
john



Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion

2021-10-26 Thread John Levon
On Mon, Oct 25, 2021 at 08:21:22AM -0700, Elena wrote:

> > I'm curious what approach you want to propose for QEMU integration. A
> > while back I thought about the QEMU API. It's possible to implement it
> > along the lines of the memory_region_add_eventfd() API where each
> > ioregionfd is explicitly added by device emulation code. An advantage of
> > this approach is that a MemoryRegion can have multiple ioregionfds, but
> > I'm not sure if that is a useful feature.
> >
> 
> This is the approach that is currently in the works. Agree, I dont see
> much of the application here at this point to have multiple ioregions
> per MemoryRegion.
> I added Memory API/eventfd approach to the vfio-user as well to try
> things out.
> 
> > An alternative is to cover the entire MemoryRegion with one ioregionfd.
> > That way the device emulation code can use ioregionfd without much fuss
> > since there is a 1:1 mapping between MemoryRegions, which are already
> > there in existing devices. There is no need to think deeply about which
> > ioregionfds to create for a device.
> >
> > A new API called memory_region_set_aio_context(MemoryRegion *mr,
> > AioContext *ctx) would cause ioregionfd (or a userspace fallback for
> > non-KVM cases) to execute the MemoryRegion->read/write() accessors from
> > the given AioContext. The details of ioregionfd are hidden behind the
> > memory_region_set_aio_context() API, so the device emulation code
> > doesn't need to know the capabilities of ioregionfd.
> 
> > 
> > The second approach seems promising if we want more devices to use
> > ioregionfd inside QEMU because it requires less ioregionfd-specific
> > code.
> > 
> I like this approach as well.
> As you have mentioned, the device emulation code with first approach
> does have to how to handle the region accesses. The second approach will
> make things more transparent. Let me see how can I modify what there is
> there now and may ask further questions.

Sorry I'm a bit late to this discussion, I'm not clear on the above WRT
vfio-user. If an ioregionfd has to cover a whole BAR0 (?), how would this
interact with partly-mmap()able regions like we do with SPDK/vfio-user/NVMe?

thanks
john



Re: [PATCH RFC server v2 01/11] vfio-user: build library

2021-09-11 Thread John Levon
On Fri, Sep 10, 2021 at 05:20:09PM +0200, Philippe Mathieu-Daudé wrote:

> On 8/27/21 7:53 PM, Jagannathan Raman wrote:
> > add the libvfio-user library as a submodule. build it as a cmake
> > subproject.
> > 
> > Signed-off-by: Elena Ufimtseva 
> > Signed-off-by: John G Johnson 
> > Signed-off-by: Jagannathan Raman 
> > ---
> >  configure| 11 +++
> >  meson.build  | 28 
> >  .gitmodules  |  3 +++
> >  MAINTAINERS  |  7 +++
> >  hw/remote/meson.build|  2 ++
> >  subprojects/libvfio-user |  1 +
> >  6 files changed, 52 insertions(+)
> >  create mode 16 subprojects/libvfio-user
> 
> > diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
> > new file mode 16
> > index 000..647c934
> > --- /dev/null
> > +++ b/subprojects/libvfio-user
> > @@ -0,0 +1 @@
> > +Subproject commit 647c9341d2e06266a710ddd075f69c95dd3b8446
> 
> Could we point to a sha1 of a released tag instead?

We don't have releases (yet) partly because we haven't yet stabilized the API.

regards
john


Re: [PATCH RFC v2 09/16] vfio-user: region read/write

2021-09-10 Thread John Levon
On Fri, Sep 10, 2021 at 06:07:56AM +, John Johnson wrote:

> >>> On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
> >>> 
>  +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
>  +   uint64_t offset, uint32_t count, void *data)
>  +{
>  +g_autofree VFIOUserRegionRW *msgp = NULL;
>  +int size = sizeof(*msgp) + count;
>  +
>  +msgp = g_malloc0(size);
>  +vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
>  +  VFIO_USER_NO_REPLY);
> >>> 
> >>> Mirroring 
> >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_oracle_qemu_issues_10=DwIGaQ=s883GpUCOChKOHiocYtGcg=v7SNLJqx7b9Vfc7ZO82Wg4nnZ8O5XkACFQ30bVKxotI=PJ390CfKPdTFUffSi02whMSqey2en8OJv7dm9VAQKI0=Mfp1xRKET3LEucEeZwUVUvSJ3V0zzGEktOzFwMsTfEE=
> >>>   here for visibility:
> >>> 
> >>> Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
> >>> meaning essentially all writes are posted. But that shouldn't be the 
> >>> case, for
> >>> example for PCI config space, where it's expected that writes will wait 
> >>> for an
> >>> ack before the VCPU continues.
> >> 
> >>I’m not sure following the PCI spec (mem writes posted, config & IO
> >> are not) completely solves the issue if the device uses sparse mmap.  A 
> >> store
> >> to went over the socket can be passed by a load that goes directly to 
> >> memory,
> >> which could break a driver that assumes a load completion means older 
> >> stores
> >> to the same device have also completed.
> > 
> > Sure, but sparse mmaps are under the device's control - so wouldn't that be
> > something of a "don't do that" scenario?
> 
>   The sparse mmaps are under the emulation program’s control, but it
> doesn’t know what registers the guest device driver is using to force stores
> to complete.  The Linux device drivers doc on kernel.org just says the driver
> must read from the same device.

Sure, but any device where that is important wouldn't use the sparse mmaps, no?

There's no other alternative.

regards
john

Re: [PATCH RFC v2 09/16] vfio-user: region read/write

2021-09-09 Thread John Levon
On Thu, Sep 09, 2021 at 06:00:36AM +, John Johnson wrote:

> > On Sep 7, 2021, at 10:24 AM, John Levon  wrote:
> > 
> > On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:
> > 
> >> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> >> +   uint64_t offset, uint32_t count, void *data)
> >> +{
> >> +g_autofree VFIOUserRegionRW *msgp = NULL;
> >> +int size = sizeof(*msgp) + count;
> >> +
> >> +msgp = g_malloc0(size);
> >> +vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
> >> +  VFIO_USER_NO_REPLY);
> > 
> > Mirroring 
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_oracle_qemu_issues_10=DwIGaQ=s883GpUCOChKOHiocYtGcg=v7SNLJqx7b9Vfc7ZO82Wg4nnZ8O5XkACFQ30bVKxotI=PJ390CfKPdTFUffSi02whMSqey2en8OJv7dm9VAQKI0=Mfp1xRKET3LEucEeZwUVUvSJ3V0zzGEktOzFwMsTfEE=
> >   here for visibility:
> > 
> > Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
> > meaning essentially all writes are posted. But that shouldn't be the case, 
> > for
> > example for PCI config space, where it's expected that writes will wait for 
> > an
> > ack before the VCPU continues.
> 
>   I’m not sure following the PCI spec (mem writes posted, config & IO
> are not) completely solves the issue if the device uses sparse mmap.  A store
> to went over the socket can be passed by a load that goes directly to memory,
> which could break a driver that assumes a load completion means older stores
> to the same device have also completed.

Sure, but sparse mmaps are under the device's control - so wouldn't that be
something of a "don't do that" scenario?

regards
john

Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context

2021-09-08 Thread John Levon
On Wed, Sep 08, 2021 at 04:02:22PM +0100, Stefan Hajnoczi wrote:

> > We'd have to have a whole separate API to do that, so a separate thread 
> > seems a
> > better approach?
> 
> Whether to support non-blocking properly in libvfio-user is a decision
> for you. If libvfio-user doesn't support non-blocking, then QEMU should
> run a dedicated thread instead of the partially non-blocking approach in
> this patch.

Right, sure. At this point we don't have any plans to implement a separate async
API due to the amount of work involved. 

> A non-blocking approach is nice when there are many devices hosted in a
> single process or a lot of async replies (which requires extra thread
> synchronization with the blocking approach).

I suppose this would be more of a problem with devices where the I/O path has to
be handled via the socket.

regards
john


Re: [PATCH RFC server v2 05/11] vfio-user: run vfio-user context

2021-09-08 Thread John Levon
On Wed, Sep 08, 2021 at 01:58:46PM +0100, Stefan Hajnoczi wrote:

> > +static void *vfu_object_attach_ctx(void *opaque)
> > +{
> > +VfuObject *o = opaque;
> > +int ret;
> > +
> > +retry_attach:
> > +ret = vfu_attach_ctx(o->vfu_ctx);
> > +if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
> 
> Does this loop consume 100% CPU since this is non-blocking?

Looks like it. Instead after vfu_create_ctx, there should be a vfu_get_poll_fd()
to get the listen socket, then a qemu_set_fd_handler(vfu_object_attach_ctx)
to handle the attach when the listen socket is ready, modulo the below part.

> libvfio-user has non-blocking listen_fd but conn_fd is always blocking.

It is, but in vfu_run_ctx(), we poll on it:

```
790     if (vfu_ctx->flags & LIBVFIO_USER_FLAG_ATTACH_NB) {
791         sock_flags = MSG_DONTWAIT | MSG_WAITALL;
792     }
793     return get_msg(hdr, sizeof(*hdr), fds, nr_fds, ts->conn_fd, sock_flags);
```

> This means ATTACH_NB is not useful because vfu_attach_ctx() is actually
> blocking.

You're correct that vfu_attach_ctx is in fact partially blocking: after
accepting the connection, we call negotiate(), which can indeed block waiting if
the client hasn't sent anything.

> I think this means vfu_run_ctx() is also blocking in some places

Correct. There's a presumption that if a message is ready, we can read it all
without blocking, and equally that we can write to the socket without blocking.

The library docs are not at all clear on this point.

> and QEMU's event loop might hang :(
> 
> Can you make libvfio-user non-blocking in order to solve these issues?

I presume you're concerned about the security aspect: a malicious client could
withhold a write, and hence hang the device server.

Problem is the libvfio-user API is synchronous: there's no way to return
half-way through a vfu_attach_ctx() (or a vfu_run_ctx() after we read the
header) then resume.

We'd have to have a whole separate API to do that, so a separate thread seems a
better approach?

regards
john


Re: [PATCH RFC v2 14/16] vfio-user: dma read/write operations

2021-09-08 Thread John Levon
On Wed, Sep 08, 2021 at 10:51:11AM +0100, Stefan Hajnoczi wrote:

> > +
> > +buf = g_malloc0(size);
> > +memcpy(buf, msg, sizeof(*msg));
> > +
> > +pci_dma_read(pdev, msg->offset, buf + sizeof(*msg), msg->count);
> 
> The vfio-user spec doesn't go into errors but pci_dma_read() can return
> errors. Hmm...

It's certainly under-specified in the spec, but in terms of the library, we do
return EINVAL if we decide something invalid happened...

regards
john


Re: [PATCH RFC v2 09/16] vfio-user: region read/write

2021-09-07 Thread John Levon
On Mon, Aug 16, 2021 at 09:42:42AM -0700, Elena Ufimtseva wrote:

> +int vfio_user_region_write(VFIODevice *vbasedev, uint32_t index,
> +   uint64_t offset, uint32_t count, void *data)
> +{
> +g_autofree VFIOUserRegionRW *msgp = NULL;
> +int size = sizeof(*msgp) + count;
> +
> +msgp = g_malloc0(size);
> +vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size,
> +  VFIO_USER_NO_REPLY);

Mirroring https://github.com/oracle/qemu/issues/10 here for visibility:

Currently, vfio_user_region_write uses VFIO_USER_NO_REPLY unconditionally,
meaning essentially all writes are posted. But that shouldn't be the case, for
example for PCI config space, where it's expected that writes will wait for an
ack before the VCPU continues.
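
i.e. something shaped like the following - just a sketch, where the "post"
flag and the vfio_user_wait_reply() helper are illustrative names, not the
actual series:

```c
/* sketch: choose posted vs. non-posted per caller, rather than always
 * passing VFIO_USER_NO_REPLY */
int flags = post ? VFIO_USER_NO_REPLY : 0;

vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size, flags);
/* ... build and send the message as before ... */
if (!post) {
    /* wait for the server's ack before letting the VCPU continue;
     * "vfio_user_wait_reply" is an illustrative name */
    vfio_user_wait_reply(vbasedev->proxy, msgp);
}
```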

regards
john


Re: [PATCH RFC server 05/11] vfio-user: run vfio-user context

2021-08-16 Thread John Levon
On Fri, Aug 13, 2021 at 02:51:53PM +, Jag Raman wrote:

> Thanks for the information about the Blocking and Non-Blocking mode.
> 
> I’d like to explain why we are using a separate thread presently and
> check with you if it’s possible to poll on multiple vfu contexts at the
> same time (similar to select/poll for fds).
> 
> Concerning my understanding on how devices are executed in QEMU,
> QEMU initializes the device instance - where the device registers
> callbacks for BAR and config space accesses. The device is then
> subsequently driven by these callbacks - whenever the vcpu thread tries
> to access the BAR addresses or places a config space access to the PCI
> bus, the vcpu exits to QEMU which handles these accesses. As such, the
> device is driven by the vcpu thread. Since there are no vcpu threads in the
> remote process, we created a separate thread as a replacement. As you
> can see already, this thread blocks on vfu_run_ctx() which I believe polls
> on the socket for messages from client.
> 
> If there is a way to run multiple vfu contexts at the same time, that would
> help with conserving threads on the host CPU. For example, if there’s a
> way to add vfu contexts to a list of contexts that expect messages from
> client, that could be a good idea. Alternatively, this QEMU server could
> also implement a similar mechanism to group all non-blocking vfu
> contexts to just a single thread, instead of having separate threads for
> each context.

You can use vfu_get_poll_fd() to retrieve the underlying socket fd (simplest
would be to do this after vfu_attach_ctx(), but that might depend), then poll on
the fd set, doing vfu_run_ctx() when the fd is ready. An async hangup on the
socket would show up as ENOTCONN, in which case you'd remove the fd from the
set.
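
For example, a minimal sketch of driving several contexts from one thread
(each assumed to be already attached; most error handling elided):

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include "libvfio-user.h"

/* minimal sketch: serve several attached vfu contexts from one thread */
static void serve(vfu_ctx_t **ctxs, size_t nctx)
{
    struct pollfd fds[nctx];

    for (size_t i = 0; i < nctx; i++) {
        fds[i].fd = vfu_get_poll_fd(ctxs[i]);
        fds[i].events = POLLIN;
    }

    for (;;) {
        if (poll(fds, nctx, -1) < 0) {
            continue;
        }
        for (size_t i = 0; i < nctx; i++) {
            if (fds[i].fd < 0 || !(fds[i].revents & (POLLIN | POLLHUP))) {
                continue;
            }
            if (vfu_run_ctx(ctxs[i]) < 0 && errno == ENOTCONN) {
                fds[i].fd = -1;   /* hangup: drop this context from the set */
            }
        }
    }
}
```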

Note that we're not completely async yet (e.g. the actual socket read/writes are
synchronous). In practice that's not typically an issue but it could be if you
wanted to support multiple VMs from a single server, etc.


regards
john

Re: [PATCH RFC server 06/11] vfio-user: handle PCI config space accesses

2021-07-26 Thread John Levon
On Mon, Jul 19, 2021 at 04:00:08PM -0400, Jagannathan Raman wrote:

> Define and register handlers for PCI config space accesses
> 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/remote/vfio-user-obj.c | 41 +
>  hw/remote/trace-events|  2 ++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 6a2d0f5..60d9fa8 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -36,6 +36,7 @@
>  #include "sysemu/runstate.h"
>  #include "qemu/notify.h"
>  #include "qemu/thread.h"
> +#include "qemu/main-loop.h"
>  #include "qapi/error.h"
>  #include "sysemu/sysemu.h"
>  #include "hw/qdev-core.h"
> @@ -131,6 +132,35 @@ static void *vfu_object_ctx_run(void *opaque)
>  return NULL;
>  }
>  
> +static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
> + size_t count, loff_t offset,
> + const bool is_write)
> +{
> +VfuObject *o = vfu_get_private(vfu_ctx);
> +uint32_t val = 0;
> +int i;
> +
> +qemu_mutex_lock_iothread();
> +
> +for (i = 0; i < count; i++) {
> +if (is_write) {
> +val = *((uint8_t *)(buf + i));
> +trace_vfu_cfg_write((offset + i), val);
> +pci_default_write_config(PCI_DEVICE(o->pci_dev),
> + (offset + i), val, 1);
> +} else {
> +val = pci_default_read_config(PCI_DEVICE(o->pci_dev),
> +  (offset + i), 1);
> +*((uint8_t *)(buf + i)) = (uint8_t)val;
> +trace_vfu_cfg_read((offset + i), val);
> +}
> +}

Is it always OK to split up the access into single bytes like this?

regards
john



Re: [PATCH RFC server 04/11] vfio-user: find and init PCI device

2021-07-26 Thread John Levon
On Mon, Jul 19, 2021 at 04:00:06PM -0400, Jagannathan Raman wrote:

> +vfu_pci_set_id(o->vfu_ctx,
> +   pci_get_word(o->pci_dev->config + PCI_VENDOR_ID),
> +   pci_get_word(o->pci_dev->config + PCI_DEVICE_ID),
> +   pci_get_word(o->pci_dev->config + 
> PCI_SUBSYSTEM_VENDOR_ID),
> +   pci_get_word(o->pci_dev->config + PCI_SUBSYSTEM_ID));

Since you handle all config space accesses yourselves, is there even any need
for this?

regards
john



Re: [PATCH RFC server 01/11] vfio-user: build library

2021-07-20 Thread John Levon
On Tue, Jul 20, 2021 at 04:20:13PM +0400, Marc-André Lureau wrote:

> > >> +  libvfiouser = static_library('vfiouser',
> > >> +   build_by_default: false,
> > >> +   sources: vfiouser_files,
> > >> +   dependencies: json_c,
> > >> +   include_directories: vfiouser_inc)
> >
> > This way appears to be present convention with QEMU - I’m also not
> > very clear
> > on the reason for it.
> >
> > I’m guessing it’s because QEMU doesn’t build all parts of a submodule. For
> > example, QEMU only builds libfdt in the doc submodule. Similarly,
> > libvfio-user only builds the core library without building the tests and
> > samples.
> >
> You can give subproject options to build limited parts.
> 
> Fwiw, since libvfio-user uses cmake, we may be able to use meson
> cmake.subproject() (https://mesonbuild.com/CMake-module.html).

That'd be great. We also briefly discussed moving away from cmake anyway - since
both SPDK and qemu are meson-based, it seems like it would make sense. I'd
prefer it to be easy to regularly update libvfio-user within these projects.

Ideally, running qemu tests would actually run libvfio-user tests too, for some
level of assurance on the library's internal expectations.

regards
john



Re: [PATCH RFC server 01/11] vfio-user: build library

2021-07-19 Thread John Levon
On Mon, Jul 19, 2021 at 04:00:03PM -0400, Jagannathan Raman wrote:

> add the libvfio-user library as a submodule. build it as part of QEMU
> 
> diff --git a/meson.build b/meson.build
> index 6e4d2d8..f2f9f86 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1894,6 +1894,41 @@ if get_option('cfi') and slirp_opt == 'system'
>   + ' Please configure with --enable-slirp=git')
>  endif
>  
> +vfiouser = not_found
> +if have_system and multiprocess_allowed
> +  have_internal = fs.exists(meson.current_source_dir() / 
> 'libvfio-user/Makefile')
> +
> +  if not have_internal
> +error('libvfio-user source not found - please pull git submodule')
> +  endif
> +
> +  vfiouser_files = [
> +'libvfio-user/lib/dma.c',
> +'libvfio-user/lib/irq.c',
> +'libvfio-user/lib/libvfio-user.c',
> +'libvfio-user/lib/migration.c',
> +'libvfio-user/lib/pci.c',
> +'libvfio-user/lib/pci_caps.c',
> +'libvfio-user/lib/tran_sock.c',
> +  ]
> +
> +  vfiouser_inc = include_directories('libvfio-user/include', 
> 'libvfio-user/lib')
> +
> +  json_c = dependency('json-c', required: false)
> +  if not json_c.found()
> +json_c = dependency('libjson-c')
> +  endif
> +
> +  libvfiouser = static_library('vfiouser',
> +   build_by_default: false,
> +   sources: vfiouser_files,
> +   dependencies: json_c,
> +   include_directories: vfiouser_inc)
> +
> +  vfiouser = declare_dependency(link_with: libvfiouser,
> +include_directories: vfiouser_inc)
> +endif

Why this way, rather than recursing into the submodule? Seems a bit fragile to
encode details of the library here.

regards
john


Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-19 Thread John Levon
On Wed, May 19, 2021 at 03:08:17PM -0600, Alex Williamson wrote:

> > +VFIO_USER_DMA_MAP
> > +-
> > +
> > +Message Format
> > +^^
> > +
> > ++--++
> > +| Name | Value  |
> > ++==++
> > +| Message ID   ||
> > ++--++
> > +| Command  | 2  |
> > ++--++
> > +| Message size | 16 + table size|
> > ++--++
> > +| Flags| Reply bit set in reply |
> > ++--++
> > +| Error| 0/errno|
> > ++--++
> > +| Table| array of table entries |
> > ++--++
> > +
> > +This command message is sent by the client to the server to inform it of 
> > the
> > +memory regions the server can access. It must be sent before the server can
> > +perform any DMA to the client. It is normally sent directly after the 
> > version
> > +handshake is completed, but may also occur when memory is added to the 
> > client,
> > +or if the client uses a vIOMMU. If the client does not expect the server to
> > +perform DMA then it does not need to send to the server VFIO_USER_DMA_MAP
> > +commands. If the server does not need to perform DMA then it can ignore 
> > such
> > +commands but it must still reply to them. The table is an array of the
> > +following structure:
> > +
> > +Table entry format
> > +^^
> > +
> > ++-++-+
> > +| Name| Offset | Size|
> > ++=++=+
> > +| Address | 0  | 8   |
> > ++-++-+
> > +| Size| 8  | 8   |
> > ++-++-+
> > +| Offset  | 16 | 8   |
> > ++-++-+
> > +| Protections | 24 | 4   |
> > ++-++-+
> > +| Flags   | 28 | 4   |
> > ++-++-+
> > +| | +-++ |
> > +| | | Bit | Definition | |
> > +| | +=++ |
> > +| | | 0   | Mappable   | |
> > +| | +-++ |
> > ++-++-+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> 
> It might help to explicitly state this value is ignored by the server
> for non-mappable DMA, otherwise a server might assume a specific value
> is required (ex. zero) for such cases.

Generally we say that unused inputs must be zero, but yes, this should be
clarified, thanks.

> > +* *Protections* are the region's protection attributes as encoded in
> > +  <sys/mman.h>.
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *Mappable* indicates that the region can be mapped via the mmap() 
> > system
> > +call using the file descriptor provided in the message meta-data.
> > +
> > +This structure is 32 bytes in size, so the message size is:
> > +16 + (# of table entries * 32).
> > +
> > +If a DMA region being added can be directly mapped by the server, an array 
> > of
> > +file descriptors must be sent as part of the message meta-data. Each 
> > mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX 
> > sockets, the
> 
> Is this saying that if the client provides table entries where indexes
> 1, 3, and 5 are indicated as mappable, then there must be an ancillary
> file descriptor array with 3 entries, where fd[0] maps to entry[1],
> fd[1]:entry[3], and fd[2]:entry[5], even if fd[0-2] are all the
> same file descriptor?

Yes. Though we are planning to change these calls to only support single regions
which would make this moot.

> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. 
> > Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be 
> > accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, 
> > explained
> > +in `Read and Write Operations`_. A command to map over an existing region 
> > must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> > +
> > +Adding multiple DMA regions can partially fail. The response does not 
> > indicate
> > +which regions were added and which were not, therefore it is a client
> > +implementation detail how to recover from the failure.
> > +
> > +.. Note::
> > +   The server can optionally remove successfully added DMA regions making 
> > this
> > +   operation atomic.
> > +   The client can recover by attempting to unmap one by one all the DMA 
> > regions
> > +   in the VFIO_USER_DMA_MAP 
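
(As an aside, the 32-byte table entry described above maps naturally onto a
packed C structure; the names below are illustrative only, not the libvfio-user
definitions.)

```
#include <stdint.h>

/* Illustrative layout of one VFIO_USER_DMA_MAP table entry (32 bytes). */
struct dma_map_entry {
    uint64_t address;       /* base DMA address of the region          */
    uint64_t size;          /* size of the region                      */
    uint64_t offset;        /* file offset, for mappable regions only  */
    uint32_t protections;   /* PROT_READ/PROT_WRITE-style attributes   */
    uint32_t flags;         /* bit 0: region is mappable               */
};

/* message size = 16-byte header + (number of entries) * 32 */
```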

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-11 Thread John Levon
On Tue, May 11, 2021 at 11:09:53AM +0100, Stefan Hajnoczi wrote:

> > > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > > +
> > > > +The reply message will additionally include at least one file 
> > > > descriptor in the
> > > > +ancillary data. Note that more than one sub-region may share the same 
> > > > file
> > > > +descriptor.
> > > 
> > > How does this interact with the maximum number of file descriptors,
> > > max_fds? It is possible that there are more sub-regions than max_fds
> > > allows...
> > 
> > I think this would just be a matter of the client advertising a reasonably 
> > large
> > enough size for max_msg_fds. Do we need to worry about this?
> 
> vhost-user historically only supported passing 8 fds and it became a
> problem there.
> 
> I can imagine devices having 10s to 100s of sub-regions (e.g. 64 queue
> doorbells). Probably not 1000s.
> 
> If I was implementing a server I would check the negotiated max_fds and
> refuse to start the vfio-user connection if the device has been
> configured to require more sub-regions. Failing early and printing an
> error would allow users to troubleshoot the issue and re-configure the
> client/server.
> 
> This seems okay but the spec doesn't mention it explicitly so I wanted
> to check what you had in mind.

Not for the spec, but I filed https://github.com/nutanix/libvfio-user/issues/489
to track this on the library side. Thanks.
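
For illustration, the early-failure check Stefan suggests above might look like
the following server-side sketch (the names are hypothetical, not libvfio-user
APIs):

```
#include <stdio.h>

/* Refuse the connection if the client's negotiated max_msg_fds cannot
 * carry the file descriptors a GET_REGION_IO_FDS reply would need. */
static int check_fd_budget(unsigned int needed_fds, unsigned int max_msg_fds)
{
    if (needed_fds > max_msg_fds) {
        fprintf(stderr, "need %u fds per reply, client allows %u; refusing\n",
                needed_fds, max_msg_fds);
        return -1;
    }
    return 0;
}
```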

> Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
> be necessary but for now this can be dropped.

I could be wrong, and probably am, but I believe we're basically fine for IRQs
right now, until we want to support servers on separate hosts where we'll
obviously have to re-introduce something like the VM_INTERRUPT message.

> > > > +VFIO_USER_DEVICE_RESET
> > > > +--
> > > 
> > > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > > In some cases a reset involves the server communicating with other
> > > systems or components and this can take an unbounded amount of time.
> > > Therefore this message could hang. For example, if a vfio-user NVMe
> > > device was accessing data on a hung NFS export and there were I/O
> > > requests in flight that need to be aborted.
> > 
> > I'm not sure this is something we could put in the generic spec. Perhaps a
> > caveat?
> 
> It's up to you whether you want to discuss this in the spec or let
> client implementors figure it out themselves. Any vfio-user message can
> take an unbounded amount of time and we could assume that readers will
> think of this.

I'm going to start an "implementation notes" section.

regards
john



Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-10 Thread John Levon
On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> 
> Elena A: I CCed you in case you want to review the

Sorry, we should have included Elena already.

> > +VFIO sparse mmap
> > +
> > +
> > ++--+--+
> > +| Name | Value|
> > ++==+==+
> > +| id   | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
> > ++--+--+
> > +| version  | 0x1  |
> > ++--+--+
> > +| next ||
> > ++--+--+
> > +| sparse mmap info | VFIO region info sparse mmap |
> > ++--+--+
> > +
> > +This capability is defined when only a subrange of the region supports
> > +direct access by the client via mmap(). The VFIO sparse mmap area is 
> > defined in
> > +``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
> 
> It's a little early to reference struct vfio_region_sparse_mmap_area
> here because the parent struct vfio_region_info_cap_sparse_mmap is only
> referenced below. I suggest combining the two:
> 
>   The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
>   vfio_region_info_cap_sparse_mmap`` and ``struct
>   vfio_region_sparse_mmap_area``).

Good idea.
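
For reference, the structures being referred to are defined in <linux/vfio.h>
roughly as follows (reproduced from the kernel UAPI header for convenience; the
header itself is authoritative):

```
#include <linux/types.h>

struct vfio_info_cap_header {
    __u16 id;          /* identifies the capability                     */
    __u16 version;     /* version specific to the capability ID         */
    __u32 next;        /* offset of the next capability, or 0           */
};

struct vfio_region_sparse_mmap_area {
    __u64 offset;      /* offset of the mmap'able area within the region */
    __u64 size;        /* size of the mmap'able area                     */
};

struct vfio_region_info_cap_sparse_mmap {
    struct vfio_info_cap_header header;
    __u32 nr_areas;
    __u32 reserved;
    struct vfio_region_sparse_mmap_area areas[];
};
```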

> > +Region IO FD info format
> > +
> > +
> > ++-++--+
> > +| Name| Offset | Size |
> > ++=++==+
> > +| argsz   | 16 | 4|
> > ++-++--+
> > +| flags   | 20 | 4|
> > ++-++--+
> > +| index   | 24 | 4|
> > ++-++--+
> > +| count   | 28 | 4|
> > ++-++--+
> > +| sub-regions | 32 | ...  |
> > ++-++--+
> > +
> > +* *argsz* is the size of the region IO FD info structure plus the
> > +  total size of the sub-region array. Thus, each array entry "i" is at 
> > offset
> > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for 
> > both IO
> > +  FD types, but this is not to be relied on.
> > +* *flags* must be zero
> > +* *index* is the index of memory region being queried
> > +* *count* is the number of sub-regions in the array
> > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > +
> > +The client must set ``flags`` to zero and specify the region being queried 
> > in
> > +the ``index``.
> > +
> > +The client sets the ``argsz`` field to indicate the maximum size of the 
> > response
> > +that the server can send, which must be at least the size of the response 
> > header
> > +plus space for the sub-region array. If the full response size exceeds 
> > ``argsz``,
> > +then the server must respond only with the response header and the Region 
> > IO FD
> > +info structure, setting in ``argsz`` the buffer size required to store the 
> > full
> > +response. In this case, no file descriptors are passed back.  The client 
> > then
> > +retries the operation with a larger receive buffer.
> > +
> > +The reply message will additionally include at least one file descriptor 
> > in the
> > +ancillary data. Note that more than one sub-region may share the same file
> > +descriptor.
> 
> How does this interact with the maximum number of file descriptors,
> max_fds? It is possible that there are more sub-regions than max_fds
> allows...

I think this would just be a matter of the client advertising a reasonably large
enough size for max_msg_fds. Do we need to worry about this?
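
To make the argsz convention quoted earlier in this message concrete, here is a
minimal sketch; struct io_fds_info is a hypothetical mirror of the Region IO FD
info table (the 16-byte message header that precedes it is omitted), not a real
libvfio-user type:

```
#include <stddef.h>
#include <stdint.h>

struct io_fds_info {
    uint32_t argsz;          /* in a reply: size needed for the full data */
    uint32_t flags;
    uint32_t index;
    uint32_t count;          /* number of sub-region entries */
    uint8_t  sub_regions[];  /* "count" variable-size entries follow */
};

/* Entry "i" sits at offset i * ((argsz - 32) / count); the entry size is
 * derived from the reply rather than hard-coded, as the text requires.
 * If the server replied with an argsz larger than what was offered, the
 * caller retries with a receive buffer of at least that size. */
static void *subregion_entry(struct io_fds_info *info, uint32_t i)
{
    size_t entry_size = (info->argsz - 32) / info->count;

    return info->sub_regions + i * entry_size;
}
```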

> > +Each sub-region given in the response has one of two possible structures,
> > +depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or
> > +`VFIO_USER_IO_FD_TYPE_IOREGIONFD`:
> > +
> > +Sub-Region IO FD info format (ioeventfd)
> > +
> > +
> > ++---++--+
> > +| Name  | Offset | Size |
> > ++===++==+
> > +| offset| 0  | 8|
> > ++---++--+
> > +| size  | 8  | 8|
> > ++---++--+
> > +| fd_index  | 16 | 4|
> > ++---++--+
> > +| type  | 20 | 4|
> > ++---++--+
> > +| flags | 24 | 4|
> > ++---++--+
> > +| padding   | 28 | 4|
> > ++---++--+
> > +| datamatch | 32 | 8|
> > ++---++--+
> > +
> > +* *offset* is the offset of the start of the sub-region within the region
> > +  requested ("physical address offset" for the region)
> > +* *size* is the length of the sub-region. This may be zero if the access 
> > size is
> > +  not relevant, which may allow for optimizations
> > +* 
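
For context, a client that receives one of these ioeventfd sub-regions might
wire it into KVM roughly as below. struct ioeventfd_subregion mirrors the table
above, and bar_gpa/fds are assumptions (the guest-physical base of the
containing region and the fd array from the reply's ancillary data); the
KVM_IOEVENTFD ioctl itself is the standard KVM API:

```
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical parsed form of one ioeventfd sub-region entry (40 bytes). */
struct ioeventfd_subregion {
    uint64_t offset;
    uint64_t size;
    uint32_t fd_index;
    uint32_t type;
    uint32_t flags;
    uint32_t padding;
    uint64_t datamatch;
};

static int wire_ioeventfd(int vm_fd, uint64_t bar_gpa,
                          const struct ioeventfd_subregion *sub, const int *fds)
{
    struct kvm_ioeventfd kick = {
        .addr      = bar_gpa + sub->offset,  /* sub-region start in guest space */
        .len       = sub->size,              /* KVM accepts 0, 1, 2, 4 or 8     */
        .fd        = fds[sub->fd_index],     /* fd passed back by the server    */
        .datamatch = sub->datamatch,
        .flags     = sub->datamatch ? KVM_IOEVENTFD_FLAG_DATAMATCH : 0,
    };

    return ioctl(vm_fd, KVM_IOEVENTFD, &kick);
}
```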

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-05 Thread John Levon
On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> > +While passing of file descriptors is desirable for performance reasons, it 
> > is
> > +not necessary neither for the client nor for the server to support it in 
> > order
> 
> Double negative. "not" can be removed.

Done. I also took a `:set spell` pass on the whole spec, which should catch your
other typos.

> > +Device Read and Write
> > +^
> > +When the guest executes load or store operations to device memory, the 
> > client
> 
>  calls it "device regions", not "device memory".
> s/device memory/unmapped device regions/?

Spec was sloppy here, yes. Fixed up all the instances I noticed.

> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Updated:

```
Therefore in order for the protocol to be forward compatible, the server should
respond to a client disconnection as follows:

 - all client memory regions are unmapped and cleaned up (including closing any
   passed file descriptors)
 - all IRQ file descriptors passed from the old client are closed
 - the device state should otherwise be retained

The expectation is that when a client reconnects, it will re-establish IRQ and
client memory mappings.

If anything happens to the client (such as qemu really did exit), the control
stack will know about it and can clean up resources accordingly.
```

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

Filed https://github.com/nutanix/libvfio-user/issues/466

> > +* *Offset* is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> 
> Please be more specific. Does it only include PROT_READ and PROT_WRITE?
> What about PROT_EXEC?

Updated to just have PROT_READ/WRITE.
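
(A server-side sanity check consistent with that encoding could be as simple as
the sketch below; the helper is illustrative only.)

```
#include <stdint.h>
#include <sys/mman.h>

/* Only PROT_READ and PROT_WRITE are part of the encoding; anything else
 * (e.g. PROT_EXEC) should be rejected. */
static int check_protections(uint32_t prot)
{
    return (prot & ~(uint32_t)(PROT_READ | PROT_WRITE)) ? -1 : 0;
}
```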

> > +If a DMA region being added can be directly mapped by the server, an array 
> > of
> > +file descriptors must be sent as part of the message meta-data. Each 
> > mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX 
> > sockets, the
> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. 
> > Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be 
> > accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, 
> > explained
> > +in `Read and Write Operations`_. A command to map over an existing region 
> > must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> 
> Does this mean a vIOMMU update, like a protections bits change requires
> an unmap command followed by a map command? That is not an atomic
> operation but hopefully guests don't try to update a vIOMMU mapping
> while accessing it.

Filed https://github.com/nutanix/libvfio-user/issues/467

> > +This command message is sent by the client to the server to inform it that 
> > a
> > +DMA region, previously made available via a VFIO_USER_DMA_MAP command 
> > message,
> > +is no longer available for DMA. It typically occurs when memory is 
> > subtracted
> > +from the client or if the client uses a vIOMMU. If the client does not 
> > expect
> > +the server to perform DMA then it does not need to send to the server
> > +VFIO_USER_DMA_UNMAP commands. If the server does not need to perform DMA 
> > then
> > +it can ignore such commands but it must still reply to them. The table is 
> > an
> 
> I'm a little confused by the last two sentences about not sending or
> ignoring VFIO_USER_DMA_UNMAP. Does it mean that VFIO_USER_DMA_MAP does
> not need to be sent either when the device is known never to need DMA?

If a device is never going to access client memory (either directly mapped or
DMA_READ/WRITE), there's no need to inform the server of VM memory.  I removed
the sentences as they were just confusing, sort of obvious, and not really
relevant to real systems.

> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-04 Thread John Levon
On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be

Thanks for your review so far Stefan!

We'll make sure to respond to all your comments as needed, so this is just a
partial response.

> > +Socket Disconnection Behavior
> > +-
> > +The server and the client can disconnect from each other, either 
> > intentionally
> > +or unexpectedly. Both the client and the server need to know how to handle 
> > such
> > +events.
> > +
> > +Server Disconnection
> > +
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because 
> > of a
> > +   device update) or unintentionally (e.g. because of a crash).
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client 
> > should
> > +reset the VFIO device when it detects the socket has been disconnected.
> > +Error recovery will be driven by the guest's device error handling
> > +behavior.
> > +
> > +Client Disconnection
> > +
> > +The client disconnecting from the server primarily means that the client
> > +has exited. Currently, this means that the guest is shut down so the 
> > device is
> > +no longer needed therefore the server can automatically exit. However, 
> > there
> > +can be cases where a client disconnection should not result in a server 
> > exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which is not yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the server 
> > should
> > +take no action when the client disconnects. If anything happens to the 
> > client
> > +the control stack will know about it and can clean up resources
> > +accordingly.
> 
> Also, hot unplug?
> 
> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Yes. We're still iterating on the implications of these scenarios and trying to
figure out the right approaches, so a lot of this is still very much
under-specified.

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

It's certainly the case that implementation-wise right now these are issues, at
least on the library side. I think I'm probably OK with requiring responses to
be provided prior to async messages like VM_INTERRUPT.

> > +The version data is an optional JSON byte array with the following format:
> 
> RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding
> says:
> 
>   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
> 
> Please indicate the character encoding. I guess it is always UTF-8?

Yes.

> > +| ``"max_fds"``  | number   | Maximum number of file 
> > descriptors  |
> > +||  | the can be received by the 
> > sender.  |
> > +||  | Optional. If not specified then 
> > the |
> > +||  | receiver must assume 
> >|
> > +||  | ``"max_fds"=1``. 
> >|
> 
> Maximum per message? Please clarify and consider renaming it to
> max_msg_fds (it's also more consistent with max_msg_size).

Yes, that's a much better name.

> > +| Name | Type   | Description   |
> > ++==++===+
> > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest |
> > +|  || between the client and the server is used.|
> > ++--++---+
> 
> "in bytes"?

We'll add to the prelude that all memory sizes are in byte units unless
specified otherwise.

> > +Versioning and Feature 

Re: [DRAFT RFC] ioeventfd/ioregionfd support in vfio-user

2021-01-28 Thread John Levon
On Tue, Jan 26, 2021 at 11:01:04AM +, Stefan Hajnoczi wrote:

> > 4) (ioregionfd issue) region_id is 4 bytes, which seems a little awkward 
> > from
> > the server side: this has to encode both the region ID as well as the 
> > offset of
> > the sub-region within that region. Can this be 64 bits wide?
> 
> The user_data field in Elena's ioregionfd patches is 64 bits. Does this
> solve the problem?

I had missed this had been changed from the proposal at 
https://www.spinics.net/lists/kvm/msg208139.html

Thanks for pointing this out.

regards
john


[DRAFT RFC] ioeventfd/ioregionfd support in vfio-user

2021-01-21 Thread John Levon


Hi, please take a look. For context, this addition is against the current
https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst
specification.

kvm@ readers: Stefan suggested this may be of interest from a VFIO point of
view, in case there is any potential cross-over in defining how to wire up these
fds.

Note that this is a new message instead of adding a new region capability to
VFIO_USER_DEVICE_GET_REGION_INFO: with a new capability, there's no way for the
server to know if ioeventfd/ioregionfd is actually used/requested by the client
(the client would just have to close those fds if it didn't want to use FDs). An
explicit new call is much clearer for this.

The ioregionfd feature itself is at proposal stage, so there's a good chance of
further changes depending on that.

I also have these pending issues so far:

1) I'm not familiar with CCW bus, so don't know if
KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY flag makes sense or is supportable in
vfio-user context

2) if ioregionfd subsumes all ioeventfd use cases, we can remove the distinction
from this API, and the client caller can translate to ioregionfd or ioeventfd as
needed

3) is it necessary for the client to indicate support (e.g. lacking ioregionfd
support) ?

4) (ioregionfd issue) region_id is 4 bytes, which seems a little awkward from
the server side: this has to encode both the region ID as well as the offset of
the sub-region within that region. Can this be 64 bits wide?

thanks
john

(NB: I edited the diff so the new sections are more readable.)

diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst
index e3adc7a..e7274a2 100644
--- a/docs/vfio-user.rst
+++ b/docs/vfio-user.rst
@@ -161,6 +161,17 @@ in the region info reply of a device-specific region. Such regions are reflected
 in ``struct vfio_device_info.num_regions``. Thus, for PCI devices this value can
 be equal to, or higher than, VFIO_PCI_NUM_REGIONS.
 
+Region I/O via file descriptors
+---
+
+For unmapped regions, region I/O from the client is done via
+VFIO_USER_REGION_READ/WRITE.  As an optimization, ioeventfds or ioregionfds may
+be configured for sub-regions of some regions. A client may request information
+on these sub-regions via VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring the
+returned file descriptors as ioeventfds or ioregionfds, the server can be
+directly notified of I/O (for example, by KVM) without taking a trip through the
+client.
+
 Interrupts
 ^^
 The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for
@@ -293,37 +304,39 @@ Commands
 The following table lists the VFIO message command IDs, and whether the
 message command is sent from the client or the server.
 
-+--+-+---+
-| Name | Command | Request Direction |
-+==+=+===+
-| VFIO_USER_VERSION| 1   | client -> server  |
-+--+-+---+
-| VFIO_USER_DMA_MAP| 2   | client -> server  |
-+--+-+---+
-| VFIO_USER_DMA_UNMAP  | 3   | client -> server  |
-+--+-+---+
-| VFIO_USER_DEVICE_GET_INFO| 4   | client -> server  |
-+--+-+---+
-| VFIO_USER_DEVICE_GET_REGION_INFO | 5   | client -> server  |
-+--+-+---+
-| VFIO_USER_DEVICE_GET_IRQ_INFO| 6   | client -> server  |
-+--+-+---+
-| VFIO_USER_DEVICE_SET_IRQS| 7   | client -> server  |
-+--+-+---+
-| VFIO_USER_REGION_READ| 8   | client -> server  |
-+--+-+---+
-| VFIO_USER_REGION_WRITE   | 9   | client -> server  |
-+--+-+---+
-| VFIO_USER_DMA_READ   | 10  | server -> client  |
-+--+-+---+
-| VFIO_USER_DMA_WRITE  | 11  | server -> client  |
-+--+-+---+
-| VFIO_USER_VM_INTERRUPT   | 12  | server -> client  |
-+--+-+---+
-| VFIO_USER_DEVICE_RESET   | 13  | client -> server  |
-+--+-+---+
-| VFIO_USER_DIRTY_PAGES| 14  | client -> server  |
-+--+-+---+
+++-+---+
+| Name   | Command | Request Direction |

Re: [PATCH] virtio: reset device on bad guest index in virtio_load()

2020-11-30 Thread John Levon
On Fri, Nov 20, 2020 at 06:51:07PM +, John Levon wrote:

> If we find a queue with an inconsistent guest index value, explicitly mark the
> device as needing a reset - and broken - via virtio_error().
> 
> There's at least one driver implementation - the virtio-win NetKVM driver - 
> that
> is able to handle a VIRTIO_CONFIG_S_NEEDS_RESET notification and successfully
> restore the device to a working state. Other implementations do not correctly
> handle this, but as the VQ is not in a functional state anyway, this is still
> worth doing.

Ping, anyone have issues with doing this?

cheers
john

> Signed-off-by: John Levon 
> ---
>  hw/virtio/virtio.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index ceb58fda6c..eff35fab7c 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -3161,12 +3161,15 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
>              nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
>              /* Check it isn't doing strange things with descriptor numbers. */
>              if (nheads > vdev->vq[i].vring.num) {
> -                qemu_log_mask(LOG_GUEST_ERROR,
> -                              "VQ %d size 0x%x Guest index 0x%x "
> -                              "inconsistent with Host index 0x%x: delta 0x%x",
> -                              i, vdev->vq[i].vring.num,
> -                              vring_avail_idx(&vdev->vq[i]),
> -                              vdev->vq[i].last_avail_idx, nheads);
> +                virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
> +                             "inconsistent with Host index 0x%x: delta 0x%x",
> +                             i, vdev->vq[i].vring.num,
> +                             vring_avail_idx(&vdev->vq[i]),
> +                             vdev->vq[i].last_avail_idx, nheads);
> +                vdev->vq[i].used_idx = 0;
> +                vdev->vq[i].shadow_avail_idx = 0;
> +                vdev->vq[i].inuse = 0;
> +                continue;
>              }
>              vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
>              vdev->vq[i].shadow_avail_idx = vring_avail_idx(&vdev->vq[i]);
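
For context, virtio_error() in hw/virtio/virtio.c roughly does the following
(paraphrased, not an exact copy of the QEMU source): it reports the error,
marks the device broken, and for VERSION_1 devices sets NEEDS_RESET and raises
a configuration change notification.

```
/* Paraphrase of hw/virtio/virtio.c:virtio_error() */
void virtio_error(VirtIODevice *vdev, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    error_vreport(fmt, ap);
    va_end(ap);

    if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
        vdev->status |= VIRTIO_CONFIG_S_NEEDS_RESET;
        virtio_notify_config(vdev);
    }

    vdev->broken = true;
}
```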



[PATCH] virtio: reset device on bad guest index in virtio_load()

2020-11-20 Thread John Levon


If we find a queue with an inconsistent guest index value, explicitly mark the
device as needing a reset - and broken - via virtio_error().

There's at least one driver implementation - the virtio-win NetKVM driver - that
is able to handle a VIRTIO_CONFIG_S_NEEDS_RESET notification and successfully
restore the device to a working state. Other implementations do not correctly
handle this, but as the VQ is not in a functional state anyway, this is still
worth doing.

Signed-off-by: John Levon 
---
 hw/virtio/virtio.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index ceb58fda6c..eff35fab7c 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3161,12 +3161,15 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
             nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
             /* Check it isn't doing strange things with descriptor numbers. */
             if (nheads > vdev->vq[i].vring.num) {
-                qemu_log_mask(LOG_GUEST_ERROR,
-                              "VQ %d size 0x%x Guest index 0x%x "
-                              "inconsistent with Host index 0x%x: delta 0x%x",
-                              i, vdev->vq[i].vring.num,
-                              vring_avail_idx(&vdev->vq[i]),
-                              vdev->vq[i].last_avail_idx, nheads);
+                virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
+                             "inconsistent with Host index 0x%x: delta 0x%x",
+                             i, vdev->vq[i].vring.num,
+                             vring_avail_idx(&vdev->vq[i]),
+                             vdev->vq[i].last_avail_idx, nheads);
+                vdev->vq[i].used_idx = 0;
+                vdev->vq[i].shadow_avail_idx = 0;
+                vdev->vq[i].inuse = 0;
+                continue;
             }
             vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
             vdev->vq[i].shadow_avail_idx = vring_avail_idx(&vdev->vq[i]);
-- 
2.22.3



Re: [PATCH v5] introduce vfio-user protocol specification

2020-11-07 Thread John Levon
On Thu, Nov 05, 2020 at 05:50:27PM -0800, John G Johnson wrote:

>   The idea behind the version IDs is to identify incompatible protocol
> changes as major versions, and compatible changes as minor versions.  What
> would be the purpose of the third version type?

Well, like any patch version, it'd be for identifying versions on the other side
for reporting and debugging purposes. It wouldn't imply anything about the
protocol version. But it's not a big deal.

>   The thing that makes parsing the JSON easier is knowing the version
> beforehand so the parser knows what keys to expect, so I’d like to promote
> major and minor to separate fields in the packet from being embedded at an
> arbitrary points in the JSON string.

I agree that'd be a sensible change (and then I wonder if the little bit of JSON
is actually useful any more).

> >> So can we switch it now so the initial setup is a send/recv too?
> > 
> > I'm fine with that but would first like to hear back from John in case he 
> > objects.
> 
> 
>   I think I wrote that section, and just switched client and server.  The 
> code
> is written as client proposes, server responds; this is the better model.

Hah, great, thanks.

regards
john



Re: [PATCH v5] introduce vfio-user protocol specification

2020-11-02 Thread John Levon
On Mon, Nov 02, 2020 at 11:29:23AM +, Thanos Makatos wrote:

> > +==++=
> > ==+
> > > | version  | object | ``{"major": , "minor": }``  
> > >   |
> > > |  || 
> > >   |
> > > |  || Version supported by the sender, e.g. "0.1".
> > >   |
> > 
> > It seems quite unlikely but this should specify it's strings not floating 
> > point
> > values maybe?
> > 
> > Definitely applies to max_fds too.
> 
> major and minor are JSON numbers and specifically integers.

It is debatable as to whether there is such a thing as a JSON integer :)

> The rationale behind this is to simplify parsing. Is specifying that
> major/minor/max_fds should be an integer sufficient to clear any vagueness
> here?

I suppose that's OK as long as we never want a 0.1.1 or whatever. I'm not sure
it simplifies parsing, but maybe it does.

> > > Versioning and Feature Support
> > > ^^
> > > Upon accepting a connection, the server must send a VFIO_USER_VERSION
> > message
> > > proposing a protocol version and a set of capabilities. The client 
> > > compares
> > > these with the versions and capabilities it supports and sends a
> > > VFIO_USER_VERSION reply according to the following rules.
> > 
> > I'm curious if there was a specific reason it's this way around, when it 
> > seems
> > more natural for the client to propose first, and the server to reply?
> 
> I'm not aware of any specific reason.

So can we switch it now so the initial setup is a send/recv too?

thanks
john



Re: [PATCH v5] introduce vfio-user protocol specification

2020-10-30 Thread John Levon
On Wed, Oct 28, 2020 at 04:41:31PM +, Thanos Makatos wrote:

> FYI here's v5 of the vfio-user protocol, my --cc in git send-email got messed 
> up somehow

Hi Thanos, this looks great, I just had some minor questions below.

> Command Concurrency
> ---
> A client may pipeline multiple commands without waiting for previous command
> replies.  The server will process commands in the order they are received.
> A consequence of this is if a client issues a command with the *No_reply* bit,
> then subsequently issues a command without *No_reply*, the older command will
> have been processed before the reply to the younger command is sent by the
> server.  The client must be aware of the device's capability to process 
> concurrent
> commands if pipelining is used.  For example, pipelining allows multiple 
> client
> threads to concurrently access device memory; the client must ensure these 
> accesses
> obey device semantics.
> 
> An example is a frame buffer device, where the device may allow concurrent 
> access
> to different areas of video memory, but may have indeterminate behavior if 
> concurrent
> accesses are performed to command or status registers.

Is it valid for an unrelated server->client message to appear in between a
client->server request/reply, or not? And vice versa? Either way, seems useful
for the spec to say.

> || +-++ |
> || | Bit | Definition | |
> || +=++ |
> || | 0-3 | Type   | |
> || +-++ |
> || | 4   | No_reply   | |
> || +-++ |
> || | 5   | Error  | |
> || +-++ |
> +++-+
> | Error  | 12 | 4   |
> +++-+
> 
> * *Message ID* identifies the message, and is echoed in the command's reply 
> message.

Is it valid to re-use an ID? When/when not?

>   * *Error* in a reply message indicates the command being acknowledged had
> an error. In this case, the *Error* field will be valid.
> 
> * *Error* in a reply message is a UNIX errno value. It is reserved in a 
> command message.

I'm not quite following why we need a bit flag and an error field. Do you
anticipate a failure, but with errno==0?
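
For readers following the thread, the header being discussed is 16 bytes; a
rough C rendering is below (only the flag bits and the Error field appear in
the quoted table, so the remaining field widths are assumptions):

```
#include <stdint.h>

struct vfio_user_header_sketch {
    uint16_t msg_id;     /* echoed in the corresponding reply              */
    uint16_t command;
    uint32_t msg_size;   /* header plus payload, in bytes                  */
    uint32_t flags;      /* bits 0-3: type, bit 4: no_reply, bit 5: error  */
    uint32_t error;      /* errno value; only meaningful in a reply with   */
                         /* the error flag set                             */
};
```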

> VFIO_USER_VERSION
> -
> 
> +--++
> | Message size | 16 + version length|

Terminating NUL included?

> +--++---+
> | Name | Type   | Description   |
> +==++===+
> | version  | object | ``{"major": , "minor": }``|
> |  ||   |
> |  || Version supported by the sender, e.g. "0.1".  |

It seems quite unlikely but this should specify it's strings not floating point
values maybe?

Definitely applies to max_fds too.

> Common capabilities:
> 
> +---++
> | Name  | Description|
> +===++
> | ``max_fds``   | Maximum number of file descriptors that can be received by |
> |   | the sender. Optional.  |

Could specify the meaning when absent?

By array I presume you mean associative array i.e. an Object. Does the whole
thing look like this:

 {
   "major": ..
   "minor": ..
   "capabilities": {
  "max_fds": ..,
  "migration
   }
 }

or something else?

> Versioning and Feature Support
> ^^
> Upon accepting a connection, the server must send a VFIO_USER_VERSION message
> proposing a protocol version and a set of capabilities. The client compares
> these with the versions and capabilities it supports and sends a
> VFIO_USER_VERSION reply according to the following rules.

I'm curious if there was a specific reason it's this way around, when it seems
more natural for the client to propose first, and the server to reply?

> VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
> -

Huge nit, but why are these DMA_*MAP when vfio uses *MAP_DMA ?

> VFIO bitmap format
> * *size* the size for the bitmap, in bytes.

Should this clarify it does *not* include the bitmap header in its size, unlike
other size fields?

> VFIO_USER_DMA_MAP
> "
> If a DMA region being added can be directly mapped by the server, an array of
> file descriptors must be sent as part of the message meta-data. Each region
> entry must have a corresponding file descriptor.

"Each mappable region entry" ?

> descriptors must be passed as SCM_RIGHTS type ancillary
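
For completeness, passing file descriptors as SCM_RIGHTS ancillary data over an
AF_UNIX socket looks roughly like the sketch below (the payload iovec would
carry whatever vfio-user message is being sent; the cap of 16 fds is arbitrary
for the example):

```
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_with_fds(int sock, struct iovec *iov, int iovcnt,
                         const int *fds, size_t nfds)
{
    char cbuf[CMSG_SPACE(16 * sizeof(int))] = { 0 };
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    if (nfds == 0 || nfds > 16) {
        return -1;                       /* stay within the fd budget */
    }

    msg.msg_iov = iov;
    msg.msg_iovlen = iovcnt;
    msg.msg_control = cbuf;
    msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
    memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```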