RE: [PATCH v8] introduce vfio-user protocol specification

2021-06-14 Thread Thanos Makatos



> -----Original Message-----
> From: Stefan Hajnoczi
> Sent: 04 May 2021 14:52
> To: Thanos Makatos
> Subject: Re: [PATCH v8] introduce vfio-user protocol specification
> 
> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be
> > emulated outside QEMU, in a separate process. vfio-user reuses the
> > existing VFIO defines, structs and concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson 
> > Signed-off-by: Thanos Makatos 
> > Signed-off-by: John Levon 
> >
> > ---
> >
> > Changed since v1:
> >   * fix coding style issues
> >   * update MAINTAINERS for VFIO-over-socket
> >   * add vfio-over-socket to ToC
> >
> > Changed since v2:
> >   * fix whitespace
> >
> > Changed since v3:
> >   * rename protocol to vfio-user
> >   * add table of contents
> >   * fix Unicode problems
> >   * fix typos and various reStructuredText issues
> >   * various stylistic improvements
> >   * add backend program conventions
> >   * rewrite part of intro, drop QEMU-specific stuff
> >   * drop QEMU-specific paragraph about implementation
> >   * explain that passing of FDs isn't necessary
> >   * minor improvements in the VFIO section
> >   * various text substitutions for the sake of consistency
> >   * drop paragraph about client and server, already explained in
> >   * intro
> >   * drop device ID
> >   * drop type from version
> >   * elaborate on request concurrency
> >   * convert some inessential paragraphs into notes
> >   * explain why some existing VFIO defines cannot be reused
> >   * explain how to make changes to the protocol
> >   * improve text of DMA map
> >   * reword comment about existing VFIO commands
> >   * add reference to Version section
> >   * reset device on disconnection
> >   * reword live migration section
> >   * replace sys/vfio.h with linux/vfio.h
> >   * drop reference to iovec
> >   * use argz the same way it is used in VFIO
> >   * add type field in header for clarity
> >
> > Changed since v4:
> >   * introduce support for live migration as defined in
> >   * include/uapi/linux/vfio.h
> >   * introduce 'max_fds' and 'migration' capabilities:
> >   * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
> >   * fix minor typos and reworded some text for clarity
> >
> > Changed since v5:
> >   * fix minor typos
> >   * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
> >   * clarify meaning of VFIO bitmap size field
> >   * move version major/minor outside JSON
> >   * client proposes version first
> >   * make Errno optional in message header
> >   * clarification about message ID uniqueness
> >   * clarify that server->client request can appear in between
> > client->server request/reply
> >
> > Changed since v6:
> >   * put JSON strings in double quotes
> >   * clarify reply behavior on error
> >   * introduce max message size capability
> >   * clarify semantics when failing to map multiple DMA regions in a
> > single command
> >
> > Changed since v7:
> >   * client proposes version instead of server
> >   * support ioeventfd and ioregionfd for unmapped regions
> >   * reword struct vfio_bitmap for clarity
> >   * clarify use of argsz in VFIO device info
> >   * allow individual IRQs to be disabled
> > ---
> >  MAINTAINERS  |7 +
> >  docs/devel/index.rst |1 +
> >  docs/devel/vfio-user.rst | 1854 ++
> >  3 files changed, 1862 insertions(+)
> >  create mode 100644 docs/devel/vfio-user.rst

RE: [PATCH v8] introduce vfio-user protocol specification

2021-06-14 Thread Thanos Makatos
> > Are there rules for avoiding deadlock between client->server and
> > server->client messages? For example, the client sends
> > VFIO_USER_REGION_WRITE and the server sends
> VFIO_USER_VM_INTERRUPT
> > before replying to the write message.
> >
> > Multi-threaded clients and servers could end up deadlocking if
> > messages are processed while polling threads handle other device activity
> (e.g.
> > I/O requests that cause DMA messages).
> >
> > Pipelining has the nice effect that the oldest message must complete
> > before the next pipelined message starts. It imposes a maximum issue
> > depth of 1. Still, it seems like it would be relatively easy to hit
> > re-entrancy or deadlock issues since both the client and the server
> > can initiate messages and may need to wait for a response.
> 
> It's certainly the case that implementation-wise right now these are issues, at
> least on the library side. I think I'm probably OK with requiring responses to
> be provided prior to async messages like VM_INTERRUPT.

I think re-entrancy/deadlock issues are not spec-related but implementation-specific.
In your example, the client can't assume that, simply because it sent a REGION_WRITE
message, the only message the server will send is the reply: it might as well be
a VM_INTERRUPT or DMA_READ message.
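
For illustration, a minimal sketch of how a client might demultiplex incoming
traffic while blocked on a reply, so that an unsolicited VM_INTERRUPT or DMA_READ
arriving first is serviced rather than stalling the connection. The header layout,
flag value and helper functions here are assumptions for the sketch, not the
spec's wire format:

```c
#include <stdint.h>

/* Assumed, simplified header and helpers -- not the spec's actual layout. */
#define VFIO_USER_F_REPLY (1u << 4)   /* placeholder flag value */

struct vfio_user_hdr {
    uint16_t msg_id;
    uint16_t command;
    uint32_t msg_size;
    uint32_t flags;
    uint32_t error;
};

int recv_msg(int sock, struct vfio_user_hdr *hdr);               /* provided elsewhere */
void handle_server_command(int sock, struct vfio_user_hdr *hdr); /* provided elsewhere */

/* Block until the reply to `waited_id` arrives, servicing any server-initiated
 * commands (VM_INTERRUPT, DMA_READ, ...) that are seen first. */
static int wait_for_reply(int sock, uint16_t waited_id, struct vfio_user_hdr *reply)
{
    for (;;) {
        struct vfio_user_hdr hdr;

        if (recv_msg(sock, &hdr) < 0) {
            return -1;
        }
        if ((hdr.flags & VFIO_USER_F_REPLY) && hdr.msg_id == waited_id) {
            *reply = hdr;
            return 0;
        }
        /* Not our reply: a server->client command; handle it (and reply to it)
         * before going back to waiting, instead of blocking on it. */
        handle_server_command(sock, &hdr);
    }
}
```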



RE: [PATCH v8] introduce vfio-user protocol specification

2021-06-14 Thread Thanos Makatos
> > > +VFIO_USER_DMA_UNMAP
> > > +---
> > > +
> > > +This command message is sent by the client to the server to inform
> > > +it that a DMA region, previously made available via a
> > > +VFIO_USER_DMA_MAP command message, is no longer available for
> DMA.
> > > +It typically occurs when memory is subtracted from the client or if
> > > +the client uses a vIOMMU. If the client does not expect the server
> > > +to perform DMA then it does not need to send to the server
> > > +VFIO_USER_DMA_UNMAP commands. If the server does not need to
> > > +perform DMA then it can ignore such commands but it must still
> > > +reply to them. The table is an
> >
> > I'm confused why expectation of DMA plays a factor here.  For example,
> > if QEMU unplugs a DIMM and the server has an mmap of the file
> > descriptor related to that DIMM, does it get to retain the mmap if it
> > doesn't currently have any DMA queued targeting that address range?
> > Can QEMU skip sending an unmap if the PCI bus master bit is disabled
> > on the device preventing further DMA?  How can the associated file
> > descriptor get released?  This doesn't feel strongly specified.
> 
> I thought we'd removed those sentences actually, as they're just confusing.
> In reality, everything is going to both send and handle map/unmap
> messages.
> 
> > Are there any assumptions about address and size of the unmap command
> > relative to the original map command or is the client freely allowed
> > to bisect, overlap, or overextend previous mappings?
> 
> Good question. Filed https://github.com/nutanix/libvfio-user/issues/504 to
> track this.
> 
> I actually don't know what clients would like to be able to do in this 
> respect.

It's probably not worth supporting such behavior at this point, especially since
there's no valid use case yet. Let's drop it. Should we ever want to introduce
such behavior in the future, we should be able to do so via a capability. Also,
the protocol format won't have to change since additional DMA regions can
simply be appended to the message payload.



Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-19 Thread John Levon
On Wed, May 19, 2021 at 03:08:17PM -0600, Alex Williamson wrote:

> > +VFIO_USER_DMA_MAP
> > +-
> > +
> > +Message Format
> > +^^
> > +
> > ++--++
> > +| Name | Value  |
> > ++==++
> > +| Message ID   ||
> > ++--++
> > +| Command  | 2  |
> > ++--++
> > +| Message size | 16 + table size|
> > ++--++
> > +| Flags| Reply bit set in reply |
> > ++--++
> > +| Error| 0/errno|
> > ++--++
> > +| Table| array of table entries |
> > ++--++
> > +
> > +This command message is sent by the client to the server to inform it of 
> > the
> > +memory regions the server can access. It must be sent before the server can
> > +perform any DMA to the client. It is normally sent directly after the 
> > version
> > +handshake is completed, but may also occur when memory is added to the 
> > client,
> > +or if the client uses a vIOMMU. If the client does not expect the server to
> > +perform DMA then it does not need to send to the server VFIO_USER_DMA_MAP
> > +commands. If the server does not need to perform DMA then it can ignore 
> > such
> > +commands but it must still reply to them. The table is an array of the
> > +following structure:
> > +
> > +Table entry format
> > +^^
> > +
> > ++-++-+
> > +| Name| Offset | Size|
> > ++=++=+
> > +| Address | 0  | 8   |
> > ++-++-+
> > +| Size| 8  | 8   |
> > ++-++-+
> > +| Offset  | 16 | 8   |
> > ++-++-+
> > +| Protections | 24 | 4   |
> > ++-++-+
> > +| Flags   | 28 | 4   |
> > ++-++-+
> > +| | +-++ |
> > +| | | Bit | Definition | |
> > +| | +=++ |
> > +| | | 0   | Mappable   | |
> > +| | +-++ |
> > ++-++-+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> 
> It might help to explicitly state this value is ignored by the server
> for non-mappable DMA, otherwise a server might assume a specific value
> is required (ex. zero) for such cases.

Generally we say that unused inputs must be zero, but yes, this should be
clarified, thanks.

> > +* *Protections* are the region's protection attributes as encoded in
> > +  <sys/mman.h>.
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *Mappable* indicates that the region can be mapped via the mmap() 
> > system
> > +call using the file descriptor provided in the message meta-data.
> > +
> > +This structure is 32 bytes in size, so the message size is:
> > +16 + (# of table entries * 32).
> > +
> > +If a DMA region being added can be directly mapped by the server, an array 
> > of
> > +file descriptors must be sent as part of the message meta-data. Each 
> > mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX 
> > sockets, the
> 
> Is this saying that if the client provides table entries where indexes
> 1, 3, and 5 are indicated as mappable, then there must be an ancillary
> file descriptor array with 3 entries, where fd[0] maps to entry[1],
> fd[1]:entry[3], and fd[2]:entry[5], even if fd[0-2] are all the
> same file descriptor?

Yes. Though we are planning to change these calls to only support single regions
which would make this moot.
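
For illustration, a generic POSIX sketch of attaching one file descriptor per
mappable entry as SCM_RIGHTS ancillary data on the AF_UNIX socket, with fds
ordered to match the mappable table entries as discussed above. This is not code
from the spec or from libvfio-user; the helper name is made up:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_FDS 16   /* arbitrary limit for this sketch */

/* Send one vfio-user message with fds[0..nfds) attached as SCM_RIGHTS data.
 * fds[] holds one fd per *mappable* table entry, in table order, so the first
 * mappable entry corresponds to fds[0], the second to fds[1], and so on. */
static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                             const int *fds, size_t nfds)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    union {
        char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
    };
    struct cmsghdr *cmsg;

    if (nfds > MAX_FDS) {
        return -1;
    }
    memset(&ctrl, 0, sizeof(ctrl));
    if (nfds > 0) {
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
        memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
    }
    return sendmsg(sock, &msg, 0);
}
```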

> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. 
> > Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be 
> > accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, 
> > explained
> > +in `Read and Write Operations`_. A command to map over an existing region 
> > must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> > +
> > +Adding multiple DMA regions can partially fail. The response does not 
> > indicate
> > +which regions were added and which were not, therefore it is a client
> > +implementation detail how to recover from the failure.
> > +
> > +.. Note::
> > +   The server can optionally remove succesfully added DMA regions making 
> > this
> > +   operation atomic.
> > +   The client can recover by attempting to unmap one by one all the DMA 
> > regions
> > +   in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do not exist.

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-19 Thread Alex Williamson
On Wed, 14 Apr 2021 04:41:22 -0700
Thanos Makatos  wrote:
> +
> +VFIO_USER_DMA_MAP
> +-
> +
> +Message Format
> +^^
> +
> ++--++
> +| Name | Value  |
> ++==++
> +| Message ID   ||
> ++--++
> +| Command  | 2  |
> ++--++
> +| Message size | 16 + table size|
> ++--++
> +| Flags| Reply bit set in reply |
> ++--++
> +| Error| 0/errno|
> ++--++
> +| Table| array of table entries |
> ++--++
> +
> +This command message is sent by the client to the server to inform it of the
> +memory regions the server can access. It must be sent before the server can
> +perform any DMA to the client. It is normally sent directly after the version
> +handshake is completed, but may also occur when memory is added to the 
> client,
> +or if the client uses a vIOMMU. If the client does not expect the server to
> +perform DMA then it does not need to send to the server VFIO_USER_DMA_MAP
> +commands. If the server does not need to perform DMA then it can ignore such
> +commands but it must still reply to them. The table is an array of the
> +following structure:
> +
> +Table entry format
> +^^
> +
> ++-++-+
> +| Name| Offset | Size|
> ++=++=+
> +| Address | 0  | 8   |
> ++-++-+
> +| Size| 8  | 8   |
> ++-++-+
> +| Offset  | 16 | 8   |
> ++-++-+
> +| Protections | 24 | 4   |
> ++-++-+
> +| Flags   | 28 | 4   |
> ++-++-+
> +| | +-++ |
> +| | | Bit | Definition | |
> +| | +=++ |
> +| | | 0   | Mappable   | |
> +| | +-++ |
> ++-++-+
> +
> +* *Address* is the base DMA address of the region.
> +* *Size* is the size of the region.
> +* *Offset* is the file offset of the region with respect to the associated 
> file
> +  descriptor.

It might help to explicitly state this value is ignored by the server
for non-mappable DMA, otherwise a server might assume a specific value
is required (ex. zero) for such cases.

> +* *Protections* are the region's protection attributes as encoded in
> +  <sys/mman.h>.
> +* *Flags* contains the following region attributes:
> +
> +  * *Mappable* indicates that the region can be mapped via the mmap() system
> +call using the file descriptor provided in the message meta-data.
> +
> +This structure is 32 bytes in size, so the message size is:
> +16 + (# of table entries * 32).
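
For illustration, the 32-byte table entry described above maps onto the following
C struct; the field names come from the table, but the struct itself is only a
sketch, not something defined by the spec or by <linux/vfio.h>:

```c
#include <stdint.h>

/* One VFIO_USER_DMA_MAP table entry (32 bytes), per the offsets above. */
struct vfio_user_dma_map_entry {
    uint64_t address;       /* offset  0: base DMA address of the region        */
    uint64_t size;          /* offset  8: size of the region                    */
    uint64_t offset;        /* offset 16: file offset (mappable regions only)   */
    uint32_t protections;   /* offset 24: PROT_READ/PROT_WRITE-style attributes */
    uint32_t flags;         /* offset 28: bit 0 = Mappable                      */
};
/* Message size = 16 (header) + nr_entries * sizeof(struct vfio_user_dma_map_entry). */
```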
> +
> +If a DMA region being added can be directly mapped by the server, an array of
> +file descriptors must be sent as part of the message meta-data. Each mappable
> +region entry must have a corresponding file descriptor. On AF_UNIX sockets, 
> the

Is this saying that if the client provides table entries where indexes
1, 3, and 5 are indicated as mappable, then there must be an ancillary
file descriptor array with 3 entries, where fd[0] maps to entry[1],
fd[1]:entry[3], and fd[2]:entry[5], even if fd[0-2] are all the
same file descriptor?

> +file descriptors must be passed as SCM_RIGHTS type ancillary data. Otherwise,
> +if a DMA region cannot be directly mapped by the server, it can be accessed 
> by
> +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, 
> explained
> +in `Read and Write Operations`_. A command to map over an existing region 
> must
> +be failed by the server with ``EEXIST`` set in error field in the reply.
> +
> +Adding multiple DMA regions can partially fail. The response does not 
> indicate
> +which regions were added and which were not, therefore it is a client
> +implementation detail how to recover from the failure.
> +
> +.. Note::
> +   The server can optionally remove succesfully added DMA regions making this
> +   operation atomic.
> +   The client can recover by attempting to unmap one by one all the DMA 
> regions
> +   in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do 
> not
> +   exist.

What's the benefit of specifying this server behavior as optional?  I'm
afraid this unspecified error recovery behavior might actually deter
clients from performing batch mappings.  Servers also have little
incentive to do their own cleanup if the client has no way to detect
that behavior.

> +
> +
> +VFIO_USER_DMA_UNMAP
> +---
> +
> +Message Format
> +^^
> +
> 

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-11 Thread John Johnson


> On May 10, 2021, at 3:25 PM, John Levon  wrote:
> 
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> 
> 
>>> +VFIO_USER_VM_INTERRUPT
>>> +--
>>> +
>>> +Message format
>>> +^^
>>> +
>>> ++++
>>> +| Name   | Value  |
>>> ++++
>>> +| Message ID ||
>>> ++++
>>> +| Command| 13 |
>>> ++++
>>> +| Message size   | 20 |
>>> ++++
>>> +| Flags  | Reply bit set in reply |
>>> ++++
>>> +| Error  | 0/errno|
>>> ++++
>>> +| Interrupt info | |
>>> ++++
>>> +
>>> +This command message is sent from the server to the client to signal the 
>>> device
>>> +has raised an interrupt.
>> 
>> Except if the client set up irq eventfds?
> 
> Clarified.
> 
>>> +Interrupt info format
>>> +^
>>> +
>>> ++---++--+
>>> +| Name  | Offset | Size |
>>> ++===++==+
>>> +| Sub-index | 16 | 4|
>>> ++---++--+
>>> +
>>> +* *Sub-index* is relative to the IRQ index, e.g., the vector number used 
>>> in PCI
>>> +  MSI/X type interrupts.
>> 
>> Hmm...this is weird. The server tells the client to raise an MSI-X
>> interrupt but does not include the MSI message that resides in the MSI-X
>> table BAR device region? Or should MSI-X interrupts be delivered to the
>> client via VFIO_USER_DMA_WRITE instead?
>> 
>> (Basically it's not clear to me how MSI-X interrupts would work with
>> vfio-user. Reading how they work in kernel VFIO might let me infer it,
>> but it's probably worth explaining this clearly in the spec.)
> 
> It doesn't. We don't have an implementation, and the qemu patches don't get 
> this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
> 
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?
> 

VFIO will set up 2 eventfds for each enabled MSI/X vector.  One is
terminated in KVM for direct injection into the guest.  The other terminates
back in QEMU, and triggers MSI/X SW emulation.  When informing the kernel of
which FDs to use, VFIO prefers the KVM FD; the QEMU one is only used if the
KVM one can’t be created (or is disabled by a command line option).
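
A rough sketch of that per-vector arrangement (names invented for illustration;
the actual KVM irqfd registration via KVM_IRQFD is elided):

```c
#include <sys/eventfd.h>

/* Two eventfds per enabled MSI-X vector: one meant to be registered with KVM
 * (KVM_IRQFD) for direct injection, one terminating in userspace for
 * SW-emulated MSI-X delivery when direct injection isn't available. */
struct msix_vector_fds {
    int kvm_fd;
    int sw_fd;
};

static int setup_vector_fds(struct msix_vector_fds *v)
{
    v->kvm_fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    v->sw_fd  = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    if (v->kvm_fd < 0 || v->sw_fd < 0) {
        return -1;
    }
    /* The fd actually handed to the server for this vector would be kvm_fd
     * when KVM injection is usable, otherwise sw_fd -- mirroring the
     * preference described above. */
    return 0;
}
```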

VFIO_USER_INTERRUPT would need a vector number in the request.  I
noticed this when I did the client, but delayed it because of what JohnL said
in another email: VFIO_USER_INTERRUPT is only needed if the client and server
are in different VMs and can’t use eventfds.

I’m fine with removing it for 1.0, since we don’t support cross-VM
emulation yet.

JJ



Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-11 Thread Stefan Hajnoczi
On Tue, May 11, 2021 at 10:43:57AM +, John Levon wrote:
> On Tue, May 11, 2021 at 11:09:53AM +0100, Stefan Hajnoczi wrote:
> > Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
> > be necessary but for now this can be dropped.
> 
> I could be wrong, and probably am, but I believe we're basically fine for IRQs
> right now, until we want to support servers on separate hosts where we'll
> obviously have to re-introduce something like the VM_INTERRUPT message.

Great!

Stefan




Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-11 Thread John Levon
On Tue, May 11, 2021 at 11:09:53AM +0100, Stefan Hajnoczi wrote:

> > > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > > +
> > > > +The reply message will additionally include at least one file 
> > > > descriptor in the
> > > > +ancillary data. Note that more than one sub-region may share the same 
> > > > file
> > > > +descriptor.
> > > 
> > > How does this interact with the maximum number of file descriptors,
> > > max_fds? It is possible that there are more sub-regions than max_fds
> > > allows...
> > 
> > I think this would just be a matter of the client advertising a reasonably 
> > large
> > enough size for max_msg_fds. Do we need to worry about this?
> 
> vhost-user historically only supported passing 8 fds and it became a
> problem there.
> 
> I can imagine devices having 10s to 100s of sub-regions (e.g. 64 queue
> doorbells). Probably not 1000s.
> 
> If I was implementing a server I would check the negotiated max_fds and
> refuse to start the vfio-user connection if the device has been
> configured to require more sub-regions. Failing early and printing an
> error would allow users to troubleshoot the issue and re-configure the
> client/server.
> 
> This seems okay but the spec doesn't mention it explicitly so I wanted
> to check what you had in mind.

Not for the spec, but I filed https://github.com/nutanix/libvfio-user/issues/489
to track this on the library side. Thanks.

> Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
> be necessary but for now this can be dropped.

I could be wrong, and probably am, but I believe we're basically fine for IRQs
right now, until we want to support servers on separate hosts where we'll
obviously have to re-introduce something like the VM_INTERRUPT message.

> > > > +VFIO_USER_DEVICE_RESET
> > > > +--
> > > 
> > > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > > In some cases a reset involves the server communicating with other
> > > systems or components and this can take an unbounded amount of time.
> > > Therefore this message could hang. For example, if a vfio-user NVMe
> > > device was accessing data on a hung NFS export and there were I/O
> > > requests in flight that need to be aborted.
> > 
> > I'm not sure this is something we could put in the generic spec. Perhaps a
> > caveat?
> 
> It's up to you whether you want to discuss this in the spec or let
> client implementors figure it out themselves. Any vfio-user message can
> take an unbounded amount of time and we could assume that readers will
> think of this.

I'm going to start an "implementation notes" section.

regards
john



Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-11 Thread Stefan Hajnoczi
On Mon, May 10, 2021 at 10:25:41PM +, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +
> > > +
> > > ++-++--+
> > > +| Name| Offset | Size |
> > > ++=++==+
> > > +| argsz   | 16 | 4|
> > > ++-++--+
> > > +| flags   | 20 | 4|
> > > ++-++--+
> > > +| index   | 24 | 4|
> > > ++-++--+
> > > +| count   | 28 | 4|
> > > ++-++--+
> > > +| sub-regions | 32 | ...  |
> > > ++-++--+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > +  total size of the sub-region array. Thus, each array entry "i" is at 
> > > offset
> > > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for 
> > > both IO
> > > +  FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being 
> > > queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the 
> > > response
> > > +that the server can send, which must be at least the size of the 
> > > response header
> > > +plus space for the sub-region array. If the full response size exceeds 
> > > ``argsz``,
> > > +then the server must respond only with the response header and the 
> > > Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store 
> > > the full
> > > +response. In this case, no file descriptors are passed back.  The client 
> > > then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor 
> > > in the
> > > +ancillary data. Note that more than one sub-region may share the same 
> > > file
> > > +descriptor.
> > 
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
> 
> I think this would just be a matter of the client advertising a reasonably 
> large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically only supported passing 8 fds and it became a
problem there.

I can imagine devices having 10s to 100s of sub-regions (e.g. 64 queue
doorbells). Probably not 1000s.

If I was implementing a server I would check the negotiated max_fds and
refuse to start the vfio-user connection if the device has been
configured to require more sub-regions. Failing early and printing an
error would allow users to troubleshoot the issue and re-configure the
client/server.

This seems okay but the spec doesn't mention it explicitly so I wanted
to check what you had in mind.

> > > +Interrupt info format
> > > +^
> > > +
> > > ++---++--+
> > > +| Name  | Offset | Size |
> > > ++===++==+
> > > +| Sub-index | 16 | 4|
> > > ++---++--+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used 
> > > in PCI
> > > +  MSI/X type interrupts.
> > 
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> > 
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
> 
> It doesn't. We don't have an implementation, and the qemu patches don't get 
> this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
> 
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO irqs but it has an interface
where VFIO notifies KVM of configured irqs so that KVM can set up Posted
Interrupts. I think vfio-user would use KVM irqfd eventfds for efficient
interrupt injection instead since we're not trying to map a host
interrupt to a guest interrupt.

Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
be necessary but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +--
> > > +
> > > +Message format
> > > +^^
> > > +
> > > ++--++
> > > +| Name | Value  |
> > > 

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-10 Thread John Levon
On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> 
> Elena A: I CCed you in case you want to review the

Sorry, we should have included Elena already.

> > +VFIO sparse mmap
> > +
> > +
> > ++--+--+
> > +| Name | Value|
> > ++==+==+
> > +| id   | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
> > ++--+--+
> > +| version  | 0x1  |
> > ++--+--+
> > +| next ||
> > ++--+--+
> > +| sparse mmap info | VFIO region info sparse mmap |
> > ++--+--+
> > +
> > +This capability is defined when only a subrange of the region supports
> > +direct access by the client via mmap(). The VFIO sparse mmap area is defined in
> > +<linux/vfio.h> (``struct vfio_region_sparse_mmap_area``).
> 
> It's a little early to reference struct vfio_region_sparse_mmap_area
> here because the parent struct vfio_region_info_cap_sparse_mmap is only
> referenced below. I suggest combining the two:
> 
>   The VFIO sparse mmap area is defined in <linux/vfio.h> (``struct
>   vfio_region_info_cap_sparse_mmap`` and ``struct
>   vfio_region_sparse_mmap_area``).

Good idea.
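
For readers without the header handy, the two structures being combined look
roughly like this in <linux/vfio.h> (reproduced from memory; check the header for
the authoritative definitions):

```c
/* The sparse-mmap capability carries an array of areas within the region that
 * are safe to mmap(); the rest of the region must be accessed with read/write
 * messages. */
struct vfio_region_sparse_mmap_area {
    __u64 offset;   /* offset of the mmap'able area within the region */
    __u64 size;     /* size of the mmap'able area */
};

struct vfio_region_info_cap_sparse_mmap {
    struct vfio_info_cap_header header;
    __u32 nr_areas;
    __u32 reserved;
    struct vfio_region_sparse_mmap_area areas[];
};
```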

> > +Region IO FD info format
> > +
> > +
> > ++-++--+
> > +| Name| Offset | Size |
> > ++=++==+
> > +| argsz   | 16 | 4|
> > ++-++--+
> > +| flags   | 20 | 4|
> > ++-++--+
> > +| index   | 24 | 4|
> > ++-++--+
> > +| count   | 28 | 4|
> > ++-++--+
> > +| sub-regions | 32 | ...  |
> > ++-++--+
> > +
> > +* *argsz* is the size of the region IO FD info structure plus the
> > +  total size of the sub-region array. Thus, each array entry "i" is at 
> > offset
> > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for 
> > both IO
> > +  FD types, but this is not to be relied on.
> > +* *flags* must be zero
> > +* *index* is the index of memory region being queried
> > +* *count* is the number of sub-regions in the array
> > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > +
> > +The client must set ``flags`` to zero and specify the region being queried 
> > in
> > +the ``index``.
> > +
> > +The client sets the ``argsz`` field to indicate the maximum size of the 
> > response
> > +that the server can send, which must be at least the size of the response 
> > header
> > +plus space for the sub-region array. If the full response size exceeds 
> > ``argsz``,
> > +then the server must respond only with the response header and the Region 
> > IO FD
> > +info structure, setting in ``argsz`` the buffer size required to store the 
> > full
> > +response. In this case, no file descriptors are passed back.  The client 
> > then
> > +retries the operation with a larger receive buffer.
> > +
> > +The reply message will additionally include at least one file descriptor 
> > in the
> > +ancillary data. Note that more than one sub-region may share the same file
> > +descriptor.
> 
> How does this interact with the maximum number of file descriptors,
> max_fds? It is possible that there are more sub-regions than max_fds
> allows...

I think this would just be a matter of the client advertising a reasonably large
enough size for max_msg_fds. Do we need to worry about this?
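
As an aside, the argsz handshake described in the quoted text above boils down to
a simple retry loop on the client side; a minimal sketch follows (the payload
struct and the transport helpers are illustrative, not real libvfio-user API):

```c
#include <stdint.h>
#include <stdlib.h>

/* Request/reply payload fields of VFIO_USER_DEVICE_GET_REGION_IO_FDS (the
 * 16-byte message header is omitted here); the reply is followed by `count`
 * sub-region entries. */
struct region_io_fd_info {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t count;
};

/* Illustrative: sends the request and fills `info` with the reply payload. */
int send_get_region_io_fds(int sock, struct region_io_fd_info *info, uint32_t argsz);
/* Illustrative: consumes the sub-region array and received fds. */
void process_sub_regions(const struct region_io_fd_info *info);

static int get_region_io_fds(int sock, uint32_t index)
{
    uint32_t argsz = sizeof(struct region_io_fd_info);   /* header-only first try */

    for (;;) {
        struct region_io_fd_info *info = calloc(1, argsz);

        if (!info) {
            return -1;
        }
        info->argsz = argsz;
        info->index = index;                  /* flags stay zero */

        if (send_get_region_io_fds(sock, info, argsz) < 0) {
            free(info);
            return -1;
        }
        if (info->argsz <= argsz) {
            process_sub_regions(info);        /* full reply (and fds) received */
            free(info);
            return 0;
        }
        argsz = info->argsz;                  /* truncated: retry with the size
                                                 the server says it needs */
        free(info);
    }
}
```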

> > +Each sub-region given in the response has one of two possible structures,
> > +depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or
> > +`VFIO_USER_IO_FD_TYPE_IOREGIONFD`:
> > +
> > +Sub-Region IO FD info format (ioeventfd)
> > +
> > +
> > ++---++--+
> > +| Name  | Offset | Size |
> > ++===++==+
> > +| offset| 0  | 8|
> > ++---++--+
> > +| size  | 8  | 8|
> > ++---++--+
> > +| fd_index  | 16 | 4|
> > ++---++--+
> > +| type  | 20 | 4|
> > ++---++--+
> > +| flags | 24 | 4|
> > ++---++--+
> > +| padding   | 28 | 4|
> > ++---++--+
> > +| datamatch | 32 | 8|
> > ++---++--+
> > +
> > +* *offset* is the offset of the start of the sub-region within the region
> > +  requested ("physical address offset" for the region)
> > +* *size* is the length of the sub-region. This may be zero if the access 
> > size is
> > +  not relevant, which may allow for optimizations
> > +* 

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-10 Thread Stefan Hajnoczi
On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:

Elena A: I CCed you in case you want to review the
VFIO_USER_DEVICE_GET_REGION_IO_FDS section that describes how ioregionfd
can be used by the vfio-user protocol. If you're busy, don't worry but
it's just nice to know that your Outreachy work fits in to stuff that's
currently being developed :).

> This patch introduces the vfio-user protocol specification (formerly
> known as VFIO-over-socket), which is designed to allow devices to be
> emulated outside QEMU, in a separate process. vfio-user reuses the
> existing VFIO defines, structs and concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson 
> Signed-off-by: Thanos Makatos 
> Signed-off-by: John Levon 
> 
> ---
> 
> Changed since v1:
>   * fix coding style issues
>   * update MAINTAINERS for VFIO-over-socket
>   * add vfio-over-socket to ToC
> 
> Changed since v2:
>   * fix whitespace
> 
> Changed since v3:
>   * rename protocol to vfio-user
>   * add table of contents
>   * fix Unicode problems
>   * fix typos and various reStructuredText issues
>   * various stylistic improvements
>   * add backend program conventions
>   * rewrite part of intro, drop QEMU-specific stuff
>   * drop QEMU-specific paragraph about implementation
>   * explain that passing of FDs isn't necessary
>   * minor improvements in the VFIO section
>   * various text substitutions for the sake of consistency
>   * drop paragraph about client and server, already explained in
>   * intro
>   * drop device ID
>   * drop type from version
>   * elaborate on request concurrency
>   * convert some inessential paragraphs into notes
>   * explain why some existing VFIO defines cannot be reused
>   * explain how to make changes to the protocol
>   * improve text of DMA map
>   * reword comment about existing VFIO commands
>   * add reference to Version section
>   * reset device on disconnection
>   * reword live migration section
>   * replace sys/vfio.h with linux/vfio.h
>   * drop reference to iovec
>   * use argz the same way it is used in VFIO
>   * add type field in header for clarity
> 
> Changed since v4:
>   * introduce support for live migration as defined in
>   * include/uapi/linux/vfio.h
>   * introduce 'max_fds' and 'migration' capabilities:
>   * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
>   * fix minor typos and reworded some text for clarity
> 
> Changed since v5:
>   * fix minor typos
>   * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
>   * clarify meaning of VFIO bitmap size field
>   * move version major/minor outside JSON
>   * client proposes version first
>   * make Errno optional in message header
>   * clarification about message ID uniqueness
>   * clarify that server->client request can appear in between
> client->server request/reply
> 
> Changed since v6:
>   * put JSON strings in double quotes
>   * clarify reply behavior on error
>   * introduce max message size capability
>   * clarify semantics when failing to map multiple DMA regions in a
> single command
> 
> Changed since v7:
>   * client proposes version instead of server
>   * support ioeventfd and ioregionfd for unmapped regions
>   * reword struct vfio_bitmap for clarity
>   * clarify use of argsz in VFIO device info
>   * allow individual IRQs to be disabled
> ---
>  MAINTAINERS  |7 +
>  docs/devel/index.rst |1 +
>  docs/devel/vfio-user.rst | 1854 ++
>  3 files changed, 1862 insertions(+)
>  create mode 100644 docs/devel/vfio-user.rst
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 36055f14c5..bd1194002b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1849,6 +1849,13 @@ F: hw/vfio/ap.c
>  F: docs/system/s390x/vfio-ap.rst
>  L: qemu-s3...@nongnu.org
>  
> +vfio-user
> +M: John G Johnson 
> +M: Thanos Makatos 
> +M: John Levon 
> +S: Supported
> +F: docs/devel/vfio-user.rst
> +
>  vhost
>  M: Michael S. Tsirkin 
>  S: Supported
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index 6cf7e2d233..7d1ea63e02 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -42,3 +42,4 @@ Contents:
> qom
> block-coroutine-wrapper
> multi-process
> +   vfio-user
> diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
> new file mode 100644
> index 00..b3498eec02
> --- /dev/null
> +++ b/docs/devel/vfio-user.rst
> @@ -0,0 +1,1854 @@
> +.. include:: 
> +
> +
> +vfio-user Protocol Specification
> +
> +
> +
> +Version_ 0.1
> +
> +
> +.. contents:: Table of Contents
> +
> +Introduction
> +
> +vfio-user is a protocol that allows a device to be emulated in a separate
> +process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
> +of a generic VFIO device type, living inside 

RE: [PATCH v8] introduce vfio-user protocol specification

2021-05-07 Thread Thanos Makatos
> -----Original Message-----
> From: John Levon
> Sent: 05 May 2021 17:20
> To: Stefan Hajnoczi
> Subject: Re: [PATCH v8] introduce vfio-user protocol specification
> 
> On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:
> 
> > > +While passing of file descriptors is desirable for performance
> > > +reasons, it is not necessary neither for the client nor for the
> > > +server to support it in order
> >
> > Double negative. "not" can be removed.
> 
> Done. I also took a `:set spell` pass on the whole spec, which should catch
> your other typos.
> 
> > > +Device Read and Write
> > > +^
> > > +When the guest executes load or store operations to device memory,
> > > +the client
> >
> >  calls it "device regions", not "device memory".
> > s/device memory/unmapped device regions/?
> 
> Spec was sloppy here, yes. Fixed up all the instances I noticed.
> 
> > Does anything need to be said about mmaps and file descriptors on
> > disconnected? I guess they need to be cleaned up and are not retained
> > for future reconnection?
> 
> Updated:
> 
> ```
> Therefore in order for the protocol to be forward compatible, the server
> should respond to a client disconnection as follows:
> 
>  - all client memory regions are unmapped and cleaned up (including closing
> any
>passed file descriptors)
>  - all IRQ file descriptors passed from the old client are closed
>  - the device state should otherwise be retained
> 
> The expectation is that when a client reconnects, it will re-establish IRQ and
> client memory mappings.
> 
> If anything happens to the client (such as qemu really did exit), the control
> stack will know about it and can clean up resources accordingly.
> ```
> 
> > Are there rules for avoiding deadlock between client->server and
> > server->client messages? For example, the client sends
> > VFIO_USER_REGION_WRITE and the server sends
> VFIO_USER_VM_INTERRUPT
> > before replying to the write message.
> >
> > Multi-threaded clients and servers could end up deadlocking if
> > messages are processed while polling threads handle other device activity
> (e.g.
> > I/O requests that cause DMA messages).
> >
> > Pipelining has the nice effect that the oldest message must complete
> > before the next pipelined message starts. It imposes a maximum issue
> > depth of 1. Still, it seems like it would be relatively easy to hit
> > re-entrancy or deadlock issues since both the client and the server
> > can initiate messages and may need to wait for a response.
> 
> Filed https://github.com/nutanix/libvfio-user/issues/466
> 
> > > +* *Offset* is the file offset of the region with respect to the
> > > +associated file
> > > +  descriptor.
> > > +* *Protections* are the region's protection attributes as encoded in
> > > +  <sys/mman.h>.
> >
> > Please be more specific. Does it only include PROT_READ and
> PROT_WRITE?
> > What about PROT_EXEC?
> 
> Updated to just have PROT_READ/WRITE.
> 
> > > +If a DMA region being added can be directly mapped by the server,
> > > +an array of file descriptors must be sent as part of the message
> > > +meta-data. Each mappable region entry must have a corresponding
> > > +file descriptor. On AF_UNIX sockets, the file descriptors must be
> > > +passed as SCM_RIGHTS type ancillary data. Otherwise, if a DMA
> > > +region cannot be directly mapped by the server, it can be accessed
> > > +by the server using VFIO_USER_DMA_READ and
> VFIO_USER_DMA_WRITE
> > > +messages, explained in `Read and Write Operations`_. A command to
> map over an existing region must be failed by the server with ``EEXIST`` set 
> in
> error field in the reply.
> >
> > Does this mean a vIOMMU update, like a protections bits change
> > require

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-06 Thread Stefan Hajnoczi
Thanks, I will review the rest of the spec early next week.

Stefan




Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-05 Thread John Levon
On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> > +While passing of file descriptors is desirable for performance reasons, it 
> > is
> > +not necessary neither for the client nor for the server to support it in 
> > order
> 
> Double negative. "not" can be removed.

Done. I also took a `:set spell` pass on the whole spec, which should catch your
other typos.

> > +Device Read and Write
> > +^
> > +When the guest executes load or store operations to device memory, the 
> > client
> 
>  calls it "device regions", not "device memory".
> s/device memory/unmapped device regions/?

Spec was sloppy here, yes. Fixed up all the instances I noticed.

> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Updated:

```
Therefore in order for the protocol to be forward compatible, the server should
respond to a client disconnection as follows:

 - all client memory regions are unmapped and cleaned up (including closing any
   passed file descriptors)
 - all IRQ file descriptors passed from the old client are closed
 - the device state should otherwise be retained

The expectation is that when a client reconnects, it will re-establish IRQ and
client memory mappings.

If anything happens to the client (such as qemu really did exit), the control
stack will know about it and can clean up resources accordingly.
```
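
A sketch of what that might look like server-side (the structures and
bookkeeping are invented for illustration):

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct dma_region { void *addr; size_t size; int fd; int mapped; };

struct server_state {
    struct dma_region *dma_regions;
    size_t nr_dma_regions;
    int *irq_fds;
    size_t nr_irq_fds;
    /* ... device state, which is deliberately left untouched ... */
};

static void handle_client_disconnect(struct server_state *s)
{
    /* Unmap and forget all client memory regions, closing any passed fds. */
    for (size_t i = 0; i < s->nr_dma_regions; i++) {
        if (s->dma_regions[i].mapped) {
            munmap(s->dma_regions[i].addr, s->dma_regions[i].size);
        }
        if (s->dma_regions[i].fd >= 0) {
            close(s->dma_regions[i].fd);
        }
    }
    s->nr_dma_regions = 0;

    /* Close all IRQ fds received from the old client. */
    for (size_t i = 0; i < s->nr_irq_fds; i++) {
        close(s->irq_fds[i]);
    }
    s->nr_irq_fds = 0;

    /* Device state is retained so a reconnecting client can simply
     * re-establish IRQ and memory mappings. */
}
```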

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

Filed https://github.com/nutanix/libvfio-user/issues/466

> > +* *Offset* is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  <sys/mman.h>.
> 
> Please be more specific. Does it only include PROT_READ and PROT_WRITE?
> What about PROT_EXEC?

Updated to just have PROT_READ/WRITE.

> > +If a DMA region being added can be directly mapped by the server, an array 
> > of
> > +file descriptors must be sent as part of the message meta-data. Each 
> > mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX 
> > sockets, the
> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. 
> > Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be 
> > accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, 
> > explained
> > +in `Read and Write Operations`_. A command to map over an existing region 
> > must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> 
> Does this mean a vIOMMU update, like a protections bits change requires
> an unmap command followed by a map command? That is not an atomic
> operation but hopefully guests don't try to update a vIOMMU mapping
> while accessing it.

Filed https://github.com/nutanix/libvfio-user/issues/467

> > +This command message is sent by the client to the server to inform it that 
> > a
> > +DMA region, previously made available via a VFIO_USER_DMA_MAP command 
> > message,
> > +is no longer available for DMA. It typically occurs when memory is 
> > subtracted
> > +from the client or if the client uses a vIOMMU. If the client does not 
> > expect
> > +the server to perform DMA then it does not need to send to the server
> > +VFIO_USER_DMA_UNMAP commands. If the server does not need to perform DMA 
> > then
> > +it can ignore such commands but it must still reply to them. The table is 
> > an
> 
> I'm a little confused by the last two sentences about not sending or
> ignoring VFIO_USER_DMA_UNMAP. Does it mean that VFIO_USER_DMA_MAP does
> not need to be sent either when the device is known never to need DMA?

If a device is never going to access client memory (either directly mapped or
DMA_READ/WRITE), there's no need to inform the server of VM memory.  I removed
the sentences as they were just confusing, sort of obvious, and not really
relevant to real systems.

> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-05 Thread Stefan Hajnoczi
On Tue, May 04, 2021 at 02:31:45PM +, John Levon wrote:
> On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:
> 
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > By the way, this DMA mapping design is the eager mapping approach.  Another
> > approach is the lazy mapping approach where the server requests translations
> > as necessary. The advantage is that the client does not have to send each
> > mapping to the server. In the case of VFIO_USER_DMA_READ/WRITE no mappings
> > need to be sent at all. Only mmaps need mapping messages.
> 
> Are you arguing that we should implement this? It would non-trivially 
> complicate
> the implementations on the server-side, where the library "owns" the mappings
> logic, but an API user is responsible for doing actual read/writes.

It's up to you whether the lazy DMA mapping approach is worth
investigating. It might perform better than the eager approach.
The vhost/vDPA lazy DMA mapping message is struct vhost_iotlb_msg in
Linux if you want to take a look.
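
For reference, that message looks roughly like this in the kernel's headers
(linux/vhost_types.h; reproduced from memory, so verify against the header): the
backend reports an IOTLB miss for an iova and the frontend answers with an update
mapping that iova to a user address, size and permissions.

```c
#include <linux/types.h>

/* Lazy ("IOTLB") mapping message used by vhost/vDPA, for comparison. */
struct vhost_iotlb_msg {
    __u64 iova;
    __u64 size;
    __u64 uaddr;
    __u8  perm;   /* VHOST_ACCESS_RO / VHOST_ACCESS_WO / VHOST_ACCESS_RW */
    __u8  type;   /* VHOST_IOTLB_MISS / UPDATE / INVALIDATE / ACCESS_FAIL */
};
```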

> > How do potentially large messages work around max_msg_size? It is hard
> > for the client/server to anticipate the maximum message size that will
> > be required ahead of time, so they can't really know if they will hit a
> > situation where max_msg_size is too low.
> 
> Are there specific messages you're worried about? would it help to add a
> specification stipulation as to minimum size that clients and servers must
> support?
> 
> Ultimately the max msg size exists solely to ease implementation: with a
> reasonable fixed size, we can always consume the entire data in one go, rather
> than doing partial reads. Obviously that needs a limit to avoid unbounded
> allocations.

It came to mind when reading about the dirty bitmap messages. Memory
dirty bitmaps can become large. An 8 GB memory region has a 1 MB dirty
bitmap.
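
(The bitmap in question is one bit per page of the negotiated ``pgsize``, so it
scales as region_size / pgsize / 8 bytes; a trivial sketch:)

```c
#include <stdint.h>

/* Dirty bitmap size in bytes: one bit per page of `pgsize`, rounded up. */
static inline uint64_t dirty_bitmap_bytes(uint64_t region_size, uint64_t pgsize)
{
    uint64_t pages = (region_size + pgsize - 1) / pgsize;

    return (pages + 7) / 8;
}
```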

> > > +VFIO_USER_DEVICE_GET_INFO
> > > +-
> > > +
> > > +Message format
> > > +^^
> > > +
> > > ++--++
> > > +| Name | Value  |
> > > ++==++
> > > +| Message ID   ||
> > > ++--++
> > > +| Command  | 4  |
> > > ++--++
> > > +| Message size | 32 |
> > > ++--++
> > > +| Flags| Reply bit set in reply |
> > > ++--++
> > > +| Error| 0/errno|
> > > ++--++
> > > +| Device info  | VFIO device info   |
> > > ++--++
> > > +
> > > +This command message is sent by the client to the server to query for basic
> > > +information about the device. The VFIO device info structure is defined in
> > > +<linux/vfio.h> (``struct vfio_device_info``).
> > 
> > Wait, "VFIO device info format" below is missing the cap_offset field,
> > so it's exactly not the same as <linux/vfio.h>?
> 
> We had to move away from directly consuming struct vfio_device_info when
> cap_offset was added. Generally trying to use vfio.h at all seems like a bad
> idea. That's an implementation thing, but this was a dangling reference we 
> need
> to clean up.

Okay. Dropping "<linux/vfio.h>" from the spec would solve this.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-04 Thread John Levon
On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be

Thanks for your review so far Stefan!

We'll make sure to respond to all your comments as needed, so this is just a
partial response.

> > +Socket Disconnection Behavior
> > +-
> > +The server and the client can disconnect from each other, either 
> > intentionally
> > +or unexpectedly. Both the client and the server need to know how to handle 
> > such
> > +events.
> > +
> > +Server Disconnection
> > +
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because 
> > of a
> > +   device update) or unintentionally (e.g. because of a crash).
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client 
> > should
> > +reset the VFIO device when it detects the socket has been disconnected.
> > +Error recovery will be driven by the guest's device error handling
> > +behavior.
> > +
> > +Client Disconnection
> > +
> > +The client disconnecting from the server primarily means that the client
> > +has exited. Currently, this means that the guest is shut down so the 
> > device is
> > +no longer needed therefore the server can automatically exit. However, 
> > there
> > +can be cases where a client disconnection should not result in a server 
> > exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which is not yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the server 
> > should
> > +take no action when the client disconnects. If anything happens to the 
> > client
> > +the control stack will know about it and can clean up resources
> > +accordingly.
> 
> Also, hot unplug?
> 
> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Yes. We're still iterating on the implications of these scenarios and trying to
figure out the right approaches, so a lot of this is still very much
under-specified.

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

It's certainly the case that implementation-wise right now these are issues, at
least on the library side. I think I'm probably OK with requiring responses to
be provided prior to async messages like VM_INTERRUPT.

> > +The version data is an optional JSON byte array with the following format:
> 
> RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding
> says:
> 
>   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
> 
> Please indicate the character encoding. I guess it is always UTF-8?

Yes.

> > +| ``"max_fds"``  | number   | Maximum number of file 
> > descriptors  |
> > +||  | the can be received by the 
> > sender.  |
> > +||  | Optional. If not specified then 
> > the |
> > +||  | receiver must assume 
> >|
> > +||  | ``"max_fds"=1``. 
> >|
> 
> Maximum per message? Please clarify and consider renaming it to
> max_msg_fds (it's also more consistent with max_msg_size).

Yes, that's a much better name.

> > +| Name | Type   | Description   |
> > ++==++===+
> > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest |
> > +|  || between the client and the server is used.|
> > ++--++---+
> 
> "in bytes"?

We'll add to the prelude that all memory sizes are in byte units unless
specified otherwise.

> > +Versioning and Feature 

Re: [PATCH v8] introduce vfio-user protocol specification

2021-05-04 Thread Stefan Hajnoczi
On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> This patch introduces the vfio-user protocol specification (formerly
> known as VFIO-over-socket), which is designed to allow devices to be
> emulated outside QEMU, in a separate process. vfio-user reuses the
> existing VFIO defines, structs and concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson 
> Signed-off-by: Thanos Makatos 
> Signed-off-by: John Levon 
> 
> ---
> 
> Changed since v1:
>   * fix coding style issues
>   * update MAINTAINERS for VFIO-over-socket
>   * add vfio-over-socket to ToC
> 
> Changed since v2:
>   * fix whitespace
> 
> Changed since v3:
>   * rename protocol to vfio-user
>   * add table of contents
>   * fix Unicode problems
>   * fix typos and various reStructuredText issues
>   * various stylistic improvements
>   * add backend program conventions
>   * rewrite part of intro, drop QEMU-specific stuff
>   * drop QEMU-specific paragraph about implementation
>   * explain that passing of FDs isn't necessary
>   * minor improvements in the VFIO section
>   * various text substitutions for the sake of consistency
>   * drop paragraph about client and server, already explained in
>   * intro
>   * drop device ID
>   * drop type from version
>   * elaborate on request concurrency
>   * convert some inessential paragraphs into notes
>   * explain why some existing VFIO defines cannot be reused
>   * explain how to make changes to the protocol
>   * improve text of DMA map
>   * reword comment about existing VFIO commands
>   * add reference to Version section
>   * reset device on disconnection
>   * reword live migration section
>   * replace sys/vfio.h with linux/vfio.h
>   * drop reference to iovec
>   * use argz the same way it is used in VFIO
>   * add type field in header for clarity
> 
> Changed since v4:
>   * introduce support for live migration as defined in
>   * include/uapi/linux/vfio.h
>   * introduce 'max_fds' and 'migration' capabilities:
>   * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
>   * fix minor typos and reworded some text for clarity
> 
> Changed since v5:
>   * fix minor typos
>   * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
>   * clarify meaning of VFIO bitmap size field
>   * move version major/minor outside JSON
>   * client proposes version first
>   * make Errno optional in message header
>   * clarification about message ID uniqueness
>   * clarify that server->client request can appear in between
> client->server request/reply
> 
> Changed since v6:
>   * put JSON strings in double quotes
>   * clarify reply behavior on error
>   * introduce max message size capability
>   * clarify semantics when failing to map multiple DMA regions in a
> single command
> 
> Changed since v7:
>   * client proposes version instead of server
>   * support ioeventfd and ioregionfd for unmapped regions
>   * reword struct vfio_bitmap for clarity
>   * clarify use of argsz in VFIO device info
>   * allow individual IRQs to be disabled
> ---
>  MAINTAINERS  |7 +
>  docs/devel/index.rst |1 +
>  docs/devel/vfio-user.rst | 1854 ++
>  3 files changed, 1862 insertions(+)
>  create mode 100644 docs/devel/vfio-user.rst
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 36055f14c5..bd1194002b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1849,6 +1849,13 @@ F: hw/vfio/ap.c
>  F: docs/system/s390x/vfio-ap.rst
>  L: qemu-s3...@nongnu.org
>  
> +vfio-user
> +M: John G Johnson 
> +M: Thanos Makatos 
> +M: John Levon 
> +S: Supported
> +F: docs/devel/vfio-user.rst
> +
>  vhost
>  M: Michael S. Tsirkin 
>  S: Supported
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index 6cf7e2d233..7d1ea63e02 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -42,3 +42,4 @@ Contents:
> qom
> block-coroutine-wrapper
> multi-process
> +   vfio-user
> diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
> new file mode 100644
> index 00..b3498eec02
> --- /dev/null
> +++ b/docs/devel/vfio-user.rst
> @@ -0,0 +1,1854 @@
> +.. include:: 
> +
> +
> +vfio-user Protocol Specification
> +
> +
> +
> +Version_ 0.1
> +
> +
> +.. contents:: Table of Contents
> +
> +Introduction
> +
> +vfio-user is a protocol that allows a device to be emulated in a separate
> +process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
> +of a generic VFIO device type, living inside the VMM, which we call the 
> client,
> +and the core device implementation, living outside the VMM, which we call the
> +server.
> +
> +The `Linux VFIO ioctl interface 
> `_
> +has been chosen as the base for this protocol for the following 

Re: [PATCH v8] introduce vfio-user protocol specification

2021-04-27 Thread Stefan Hajnoczi
On Tue, Apr 27, 2021 at 12:02:44PM +, Thanos Makatos wrote:
> 
> 
> > -----Original Message-----
> > From: Stefan Hajnoczi
> > Sent: 26 April 2021 16:48
> > To: Thanos Makatos; Peter Maydell; Michael S. Tsirkin
> > Subject: Re: [PATCH v8] introduce vfio-user protocol specification
> > 
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > This patch introduces the vfio-user protocol specification (formerly
> > > known as VFIO-over-socket), which is designed to allow devices to be
> > > emulated outside QEMU, in a separate process. vfio-user reuses the
> > > existing VFIO defines, structs and concepts.
> > >
> > > It has been earlier discussed as an RFC in:
> > > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> > >
> > > Signed-off-by: John G Johnson 
> > > Signed-off-by: Thanos Makatos 
> > > Signed-off-by: John Levon 
> > 
> > No review yet but I wanted to agree on the next steps once the spec has
> > been reviewed.
> > 
> > One or more of you would be added to ./MAINTAINERS and handle future
> > patch review and pull requests for the spec.
> > 
> > The spec will be unstable/experimental at least until QEMU vfio-user
> > implementation has landed. Otherwise it's hard to know whether the
> > protocol really works.
> > 
> > Does this sound good?
> 
> Yes, of course.

Great, hope to hear from Michael and Peter too.

I will review the spec on Monday.

Stefan




RE: [PATCH v8] introduce vfio-user protocol specification

2021-04-27 Thread Thanos Makatos



> -----Original Message-----
> From: Stefan Hajnoczi
> Sent: 26 April 2021 16:48
> To: Thanos Makatos; Peter Maydell; Michael S. Tsirkin
> Subject: Re: [PATCH v8] introduce vfio-user protocol specification
> 
> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be
> > emulated outside QEMU, in a separate process. vfio-user reuses the
> > existing VFIO defines, structs and concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson 
> > Signed-off-by: Thanos Makatos 
> > Signed-off-by: John Levon 
> 
> No review yet but I wanted to agree on the next steps once the spec has
> been reviewed.
> 
> One or more of you would be added to ./MAINTAINERS and handle future
> patch review and pull requests for the spec.
> 
> The spec will be unstable/experimental at least until QEMU vfio-user
> implementation has landed. Otherwise it's hard to know whether the
> protocol really works.
> 
> Does this sound good?

Yes, of course.

> 
> Stefan



Re: [PATCH v8] introduce vfio-user protocol specification

2021-04-26 Thread Stefan Hajnoczi
On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> This patch introduces the vfio-user protocol specification (formerly
> known as VFIO-over-socket), which is designed to allow devices to be
> emulated outside QEMU, in a separate process. vfio-user reuses the
> existing VFIO defines, structs and concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson 
> Signed-off-by: Thanos Makatos 
> Signed-off-by: John Levon 

No review yet but I wanted to agree on the next steps once the spec has
been reviewed.

One or more of you would be added to ./MAINTAINERS and handle future
patch review and pull requests for the spec.

The spec will be unstable/experimental at least until QEMU vfio-user
implementation has landed. Otherwise it's hard to know whether the
protocol really works.

Does this sound good?

Stefan

