Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-07-15 Thread Stefan Hajnoczi
On Wed, Jul 01, 2020 at 11:23:25PM -0700, John G Johnson wrote:
> 
>   We’ve made the review changes to the doc, and moved to RST format,
> so the doc can go into the QEMU sources.
> 
>   Thanos & JJ
>  
> 
> https://github.com/tmakatos/qemu/blob/master/docs/devel/vfio-over-socket.rst

Great! Feel free to send a patch to qemu-devel so the proposal can be
discussed in detail.

Stefan




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-07-02 Thread John G Johnson


We’ve made the review changes to the doc, and moved to RST format,
so the doc can go into the QEMU sources.

Thanos & JJ
 

https://github.com/tmakatos/qemu/blob/master/docs/devel/vfio-over-socket.rst





Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-26 Thread Stefan Hajnoczi
On Thu, Jun 25, 2020 at 08:54:25PM -0700, John G Johnson wrote:
> 
> 
> > On Jun 23, 2020, at 5:27 AM, Stefan Hajnoczi  wrote:
> > 
> > On Thu, Jun 18, 2020 at 02:38:04PM -0700, John G Johnson wrote:
> >>> On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi  wrote:
> >>> An issue with file descriptor passing is that it's hard to revoke access
> >>> once the file descriptor has been passed. memfd supports sealing with
> >>> fcntl(F_ADD_SEALS) but it doesn't revoke mmap(MAP_WRITE) on other processes.
> >>> 
> >>> Memory Protection Keys don't seem to be useful here either and their
> >>> availability is limited (see pkeys(7)).
> >>> 
> >>> One crazy idea is to use KVM as a sandbox for running the device and let
> >>> the vIOMMU control the page tables instead of the device (guest). That
> >>> way the hardware MMU provides memory translation, but I think this is
> >>> impractical because the guest environment is too different from the
> >>> Linux userspace environment.
> >>> 
> >>> As a starting point adding DMA_READ/DMA_WRITE messages would provide the
> >>> functionality and security. Unfortunately it makes DMA expensive and
> >>> performance will suffer.
> >>> 
> >> 
> >>Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
> >> not passing FDs at all?  The performance penalty would be large for the
> >> cases where the client and server are equally trusted.  Or are you
> >> advocating for an option where the slower methods are used for cases
> >> where the server is less trusted?
> > 
> > I think the enforcing IOMMU should be optional (due to the performance
> > overhead) but part of the spec from the start.
> > 
> 
> 
>   With this in mind, we will collapse the current memory region
> messages (VFIO_USER_ADD_MEMORY_REGION and VFIO_USER_SUB_MEMORY_REGION)
> and the IOMMU messages (VFIO_USER_IOMMU_MAP and VFIO_USER_IOMMU_UNMAP)
> into new messages (VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP).  Their
> contents will be the same as the memory region messages.
> 
>   On a system without an IOMMU, the new messages will be used to
> export the system physical address space as DMA addresses.  On a system
> with an IOMMU they will be used to export the valid device DMA ranges
> programmed into the IOMMU by the guest.  This behavior matches how the
> existing QEMU VFIO object programs the host IOMMU.  The server will not
> be aware of whether the client is using an IOMMU.
>
>   In the QEMU VFIO implementation, we will add a ‘secure-dma’
> option that suppresses exporting mmap()able FDs to the server.  All
> DMA will use the slow path to be validated by the client before accessing
> guest memory.
> 
>   Is this acceptable to you (and Alex, of course)?

Sounds good to me.

Stefan




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-25 Thread John G Johnson



> On Jun 23, 2020, at 5:27 AM, Stefan Hajnoczi  wrote:
> 
> On Thu, Jun 18, 2020 at 02:38:04PM -0700, John G Johnson wrote:
>>> On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi  wrote:
>>> An issue with file descriptor passing is that it's hard to revoke access
>>> once the file descriptor has been passed. memfd supports sealing with
>>> fcntl(F_ADD_SEALS) but it doesn't revoke mmap(MAP_WRITE) on other processes.
>>> 
>>> Memory Protection Keys don't seem to be useful here either and their
>>> availability is limited (see pkeys(7)).
>>> 
>>> One crazy idea is to use KVM as a sandbox for running the device and let
>>> the vIOMMU control the page tables instead of the device (guest). That
>>> way the hardware MMU provides memory translation, but I think this is
>>> impractical because the guest environment is too different from the
>>> Linux userspace environment.
>>> 
>>> As a starting point adding DMA_READ/DMA_WRITE messages would provide the
>>> functionality and security. Unfortunately it makes DMA expensive and
>>> performance will suffer.
>>> 
>> 
>>  Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
>> not passing FDs at all?  The performance penalty would be large for the
>> cases where the client and server are equally trusted.  Or are you
>> advocating for an option where the slower methods are used for cases
>> where the server is less trusted?
> 
> I think the enforcing IOMMU should be optional (due to the performance
> overhead) but part of the spec from the start.
> 


With this in mind, we will collapse the current memory region
messages (VFIO_USER_ADD_MEMORY_REGION and VFIO_USER_SUB_MEMORY_REGION)
and the IOMMU messages (VFIO_USER_IOMMU_MAP and VFIO_USER_IOMMU_UNMAP)
into new messages (VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP).  Their
contents will be the same as the memory region messages.

On a system without an IOMMU, the new messages will be used to
export the system physical address space as DMA addresses.  On a system
with an IOMMU they will be used to export the valid device DMA ranges
programmed into the IOMMU by the guest.  This behavior matches how the
existing QEMU VFIO object programs the host IOMMU.  The server will not
be aware of whether the client is using an IOMMU.
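
For readers skimming the thread, here is a minimal sketch of what the proposed
VFIO_USER_DMA_MAP payload could carry, assuming it keeps the same fields as the
memory-region messages it replaces (a DMA address, a size, and an offset into a
file descriptor passed as socket ancillary data). The field names and widths
below are illustrative assumptions, not the spec draft's layout:

#include <stdint.h>

/* Hypothetical payload for the proposed VFIO_USER_DMA_MAP message. */
struct vfio_user_dma_map {
    uint64_t addr;    /* GPA, or IOVA when the guest has programmed a vIOMMU */
    uint64_t size;    /* length of the region in bytes */
    uint64_t offset;  /* offset into the fd passed via SCM_RIGHTS, if any */
    uint32_t prot;    /* read/write permissions for the mapping */
    uint32_t flags;   /* e.g. whether an fd (and therefore mmap) is offered */
};

VFIO_USER_DMA_UNMAP would then name the same (addr, size) range to tear the
mapping down.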

In the QEMU VFIO implementation, we will add a ‘secure-dma’
option that suppresses exporting mmap()able FDs to the server.  All
DMA will use the slow path to be validated by the client before accessing
guest memory.

Is this acceptable to you (and Alex, of course)?

JJ




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-23 Thread Stefan Hajnoczi
On Thu, Jun 18, 2020 at 02:38:04PM -0700, John G Johnson wrote:
> > On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi  wrote:
> > An issue with file descriptor passing is that it's hard to revoke access
> > once the file descriptor has been passed. memfd supports sealing with
> > fcntl(F_ADD_SEALS) but it doesn't revoke mmap(MAP_WRITE) on other processes.
> > 
> > Memory Protection Keys don't seem to be useful here either and their
> > availability is limited (see pkeys(7)).
> > 
> > One crazy idea is to use KVM as a sandbox for running the device and let
> > the vIOMMU control the page tables instead of the device (guest). That
> > way the hardware MMU provides memory translation, but I think this is
> > impractical because the guest environment is too different from the
> > Linux userspace environment.
> > 
> > As a starting point adding DMA_READ/DMA_WRITE messages would provide the
> > functionality and security. Unfortunately it makes DMA expensive and
> > performance will suffer.
> > 
> 
>   Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
> not passing FDs at all?  The performance penalty would be large for the
> cases where the client and server are equally trusted.  Or are you
> advocating for an option where the slower methods are used for cases
> where the server is less trusted?

I think the enforcing IOMMU should be optional (due to the performance
overhead) but part of the spec from the start.

Stefan




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-18 Thread John G Johnson



> On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi  wrote:
> 
> 
> It's challenging to implement a fast and secure IOMMU. The simplest
> approach is secure but not fast: add protocol messages for
> DMA_READ(iova, length) and DMA_WRITE(iova, buffer, length).
> 

We do have protocol messages for the case where no FD is
associated with the DMA region:  VFIO_USER_DMA_READ/WRITE.


> An issue with file descriptor passing is that it's hard to revoke access
> once the file descriptor has been passed. memfd supports sealing with
> fcntl(F_ADD_SEALS) but it doesn't revoke mmap(MAP_WRITE) on other processes.
> 
> Memory Protection Keys don't seem to be useful here either and their
> availability is limited (see pkeys(7)).
> 
> One crazy idea is to use KVM as a sandbox for running the device and let
> the vIOMMU control the page tables instead of the device (guest). That
> way the hardware MMU provides memory translation, but I think this is
> impractical because the guest environment is too different from the
> Linux userspace environment.
> 
> As a starting point adding DMA_READ/DMA_WRITE messages would provide the
> functionality and security. Unfortunately it makes DMA expensive and
> performance will suffer.
> 

Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
not passing FDs at all?  The performance penalty would be large for the
cases where the client and server are equally trusted.  Or are you
advocating for an option where the slower methods are used for cases
where the server is less trusted?

JJ





Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-15 Thread Stefan Hajnoczi
On Tue, Jun 09, 2020 at 11:25:41PM -0700, John G Johnson wrote:
> > On Jun 2, 2020, at 8:06 AM, Alex Williamson  
> > wrote:
> > 
> > On Wed, 20 May 2020 17:45:13 -0700
> > John G Johnson  wrote:
> > 
> >>> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> >>> The former seems intended to provide the server with access to the
> >>> entire GPA space, while the latter indicates an IOVA to GPA mapping of
> >>> those regions.  Doesn't this break the basic isolation of a vIOMMU?
> >>> This essentially says to me "here's all the guest memory, but please
> >>> only access these regions for which we're providing DMA mappings".
> >>> That invites abuse.
> >>> 
> >> 
> >>The purpose behind separating QEMU into multiple processes is
> >> to provide an additional layer of protection for the infrastructure against
> >> a malign guest, not for the guest against itself, so preventing a server
> >> that has been compromised by a guest from accessing all of guest memory
> >> adds no additional benefit.  We don’t even have an IOMMU in our current
> >> guest model for this reason.
> > 
> > One of the use cases we see a lot with vfio is nested assignment, ie.
> > we assign a device to a VM where the VM includes a vIOMMU, such that
> > the guest OS can then assign the device to userspace within the guest.
> > This is safe to do AND provides isolation within the guest exactly
> > because the device only has access to memory mapped to the device, not
> > the entire guest address space.  I don't think it's just the hypervisor
> > you're trying to protect, we can't assume there are always trusted
> > drivers managing the device.
> > 
> 
>   We intend to support an IOMMU.  The question seems to be whether
> it’s implemented in the server or client.  The current proposal has it
> in the server, a la vhost-user, but we are fine with moving it.

It's challenging to implement a fast and secure IOMMU. The simplest
approach is secure but not fast: add protocol messages for
DMA_READ(iova, length) and DMA_WRITE(iova, buffer, length).
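
To make the shape of that slow path concrete, a sketch of what such requests
could look like on the wire; the names and layout here are assumptions for
illustration, not taken from the draft:

#include <stdint.h>

/* Sent by the server (device) to the client (VMM) that owns guest memory. */
struct dma_read_request {
    uint64_t iova;     /* DMA address to read from */
    uint64_t length;   /* number of bytes requested */
};                     /* the reply carries 'length' bytes of data */

struct dma_write_request {
    uint64_t iova;     /* DMA address to write to */
    uint64_t length;   /* number of bytes that follow */
    uint8_t  data[];   /* payload to copy into guest memory */
};

Every device access to guest memory becomes a request/reply round trip on the
socket, which is why this path is secure but slow.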

An issue with file descriptor passing is that it's hard to revoke access
once the file descriptor has been passed. memfd supports sealing with
fcntl(F_ADD_SEALS) but it doesn't revoke mmap(MAP_WRITE) on other processes.
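
To illustrate the limitation (a standalone sketch using standard Linux APIs,
not code from either implementation): write seals cannot be added while a peer
already holds a shared writable mapping, and adding a seal never tears down
mappings that already exist.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = memfd_create("guest-ram", MFD_ALLOW_SEALING);
    if (fd < 0 || ftruncate(fd, 4096) < 0) {
        perror("memfd setup");
        return 1;
    }

    /* Stand-in for the peer process: a shared writable mapping exists. */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Sealing writes now fails with EBUSY; the existing mapping stays valid. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
        perror("F_ADD_SEALS(F_SEAL_WRITE)");

    munmap(p, 4096);
    close(fd);
    return 0;
}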

Memory Protection Keys don't seem to be useful here either and their
availability is limited (see pkeys(7)).

One crazy idea is to use KVM as a sandbox for running the device and let
the vIOMMU control the page tables instead of the device (guest). That
way the hardware MMU provides memory translation, but I think this is
impractical because the guest environment is too different from the
Linux userspace environment.

As a starting point adding DMA_READ/DMA_WRITE messages would provide the
functionality and security. Unfortunately it makes DMA expensive and
performance will suffer.

Stefan




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-10 Thread John G Johnson



> On Jun 2, 2020, at 8:06 AM, Alex Williamson  
> wrote:
> 
> On Wed, 20 May 2020 17:45:13 -0700
> John G Johnson  wrote:
> 
>>> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
>>> The former seems intended to provide the server with access to the
>>> entire GPA space, while the latter indicates an IOVA to GPA mapping of
>>> those regions.  Doesn't this break the basic isolation of a vIOMMU?
>>> This essentially says to me "here's all the guest memory, but please
>>> only access these regions for which we're providing DMA mappings".
>>> That invites abuse.
>>> 
>> 
>>  The purpose behind separating QEMU into multiple processes is
>> to provide an additional layer of protection for the infrastructure against
>> a malign guest, not for the guest against itself, so preventing a server
>> that has been compromised by a guest from accessing all of guest memory
>> adds no additional benefit.  We don’t even have an IOMMU in our current
>> guest model for this reason.
> 
> One of the use cases we see a lot with vfio is nested assignment, ie.
> we assign a device to a VM where the VM includes a vIOMMU, such that
> the guest OS can then assign the device to userspace within the guest.
> This is safe to do AND provides isolation within the guest exactly
> because the device only has access to memory mapped to the device, not
> the entire guest address space.  I don't think it's just the hypervisor
> you're trying to protect, we can't assume there are always trusted
> drivers managing the device.
> 

We intend to support an IOMMU.  The question seems to be whether
it’s implemented in the server or client.  The current proposal has it
in the server, a la vhost-user, but we are fine with moving it.


>> 
>>  The implementation was stolen from vhost-user, with the exception
>> that we push IOTLB translations from client to server like VFIO does, as
>> opposed to pulling them from server to client like vhost-user does.
> 
> It seems that vhost has numerous hacks forcing it to know whether a
> vIOMMU is present as a result of this, vfio has none.
> 

I imagine this decision was driven by performance considerations.
If the IOMMU is implemented on the client side, the server must execute mmap()
or munmap() for every IOMMU MAP/UNMAP message.  If the IOMMU is implemented
on the server side, the server doesn’t need these system calls; it just adds a
SW translation entry to its own table.
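
As an illustration of the "SW translation entry" point (a sketch only, not code
from MUSER or qemu-mp): with the IOMMU on the server side, resolving a device
DMA address is just a table walk over the ranges the client has announced,
with no mmap()/munmap() per MAP/UNMAP message.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-server table: one entry per DMA range the client mapped. */
struct dma_entry {
    uint64_t iova;    /* start of the device-visible DMA range */
    uint64_t size;    /* length of the range */
    void    *vaddr;   /* where the already-mmap()ed guest memory lives */
};

struct dma_table {
    struct dma_entry *entries;
    size_t nr;
};

/* Resolve a device DMA address to a local pointer, or NULL if unmapped. */
static void *dma_translate(const struct dma_table *t, uint64_t iova,
                           uint64_t len)
{
    for (size_t i = 0; i < t->nr; i++) {
        const struct dma_entry *e = &t->entries[i];
        if (iova >= e->iova && len <= e->size &&
            iova - e->iova <= e->size - len)
            return (uint8_t *)e->vaddr + (iova - e->iova);
    }
    return NULL;   /* fall back to message-based access, or fault */
}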


>>  That said, neither the qemu-mp nor MUSER implementation uses an
>> IOMMU, so if you prefer another IOMMU model, we can consider it.  We
>> could only send the guest memory file descriptors with IOMMU_MAP_DMA
>> requests, although that would cost performance since each request would
>> require the server to execute an mmap() system call.
> 
> It would seem shortsighted to not fully enable a vIOMMU compatible
> implementation at this time.
> 
>>> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
>>> "an array of file descriptors will be sent as part of the message
>>> meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
>>> the table specified as 0?  How does a client learn their Device ID?
>>> 
>> 
>>  SCM_RIGHTS message controls allow sendmsg() to send an array of
>> file descriptors over a UNIX domain socket.
>> 
>>  We’re only supporting one device per socket in this protocol
>> version, so the device ID will always be 0.  This may change in a future
>> revision, so we included the field in the header to avoid a major version
>> change if device multiplexing is added later.
>> 
>> 
>>> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
>>> capability chain), the cap_offset and next pointers within the chain
>>> need to specify what their offset is relative to (ie. the start of the
>>> packet, the start of the vfio compatible data structure, etc).  I
>>> assume the latter for client compatibility.
>>> 
>> 
>>  Yes.  We will attempt to make the language clearer.
>> 
>> 
>>> Also on REGION_INFO, offset is specified as "the base offset to be
>>> given to the mmap() call for regions with the MMAP attribute".  Base
>>> offset from what?  Is the mmap performed on the socket fd?  Do we not
>>> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
>>> Why do we specify "MMIO" in those operations versus simply "REGION"?
>>> Are we arbitrarily excluding support for I/O port regions or device
>>> specific regions?  If these commands replace direct read and write to
>>> an fd offset, how is PCI config space handled?
>>> 
>> 
>>  The base offset refers to the sparse areas, where the sparse area
>> offset is added to the base region offset.  We will try to make the text
>> clearer here as well.
>> 
>>  MMIO was added to distinguish these operations from DMA operations.
>> I can see how this can cause confusion when the region refers to a port 
>> range,
>> so we can change the name to REGION_READ/WRITE. 
>> 
>> 
>>> 

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-06-02 Thread Alex Williamson
On Wed, 20 May 2020 17:45:13 -0700
John G Johnson  wrote:

> > I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> > The former seems intended to provide the server with access to the
> > entire GPA space, while the latter indicates an IOVA to GPA mapping of
> > those regions.  Doesn't this break the basic isolation of a vIOMMU?
> > This essentially says to me "here's all the guest memory, but please
> > only access these regions for which we're providing DMA mappings".
> > That invites abuse.
> >   
> 
>   The purpose behind separating QEMU into multiple processes is
> to provide an additional layer of protection for the infrastructure against
> a malign guest, not for the guest against itself, so preventing a server
> that has been compromised by a guest from accessing all of guest memory
> adds no additional benefit.  We don’t even have an IOMMU in our current
> guest model for this reason.

One of the use cases we see a lot with vfio is nested assignment, ie.
we assign a device to a VM where the VM includes a vIOMMU, such that
the guest OS can then assign the device to userspace within the guest.
This is safe to do AND provides isolation within the guest exactly
because the device only has access to memory mapped to the device, not
the entire guest address space.  I don't think it's just the hypervisor
you're trying to protect, we can't assume there are always trusted
drivers managing the device.

> 
>   The implementation was stolen from vhost-user, with the exception
> that we push IOTLB translations from client to server like VFIO does, as
> opposed to pulling them from server to client like vhost-user does.

It seems that vhost has numerous hacks forcing it to know whether a
vIOMMU is present as a result of this, vfio has none.
 
>   That said, neither the qemu-mp nor MUSER implementation uses an
> IOMMU, so if you prefer another IOMMU model, we can consider it.  We
> could only send the guest memory file descriptors with IOMMU_MAP_DMA
> requests, although that would cost performance since each request would
> require the server to execute an mmap() system call.

It would seem shortsighted to not fully enable a vIOMMU compatible
implementation at this time.

> > Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> > "an array of file descriptors will be sent as part of the message
> > meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> > the table specified as 0?  How does a client learn their Device ID?
> >   
> 
>   SCM_RIGHTS message controls allow sendmsg() to send an array of
> file descriptors over a UNIX domain socket.
> 
>   We’re only supporting one device per socket in this protocol
> version, so the device ID will always be 0.  This may change in a future
> revision, so we included the field in the header to avoid a major version
> change if device multiplexing is added later.
> 
> 
> > VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> > capability chain), the cap_offset and next pointers within the chain
> > need to specify what their offset is relative to (ie. the start of the
> > packet, the start of the vfio compatible data structure, etc).  I
> > assume the latter for client compatibility.
> >   
> 
>   Yes.  We will attempt to make the language clearer.
> 
> 
> > Also on REGION_INFO, offset is specified as "the base offset to be
> > given to the mmap() call for regions with the MMAP attribute".  Base
> > offset from what?  Is the mmap performed on the socket fd?  Do we not
> > allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> > Why do we specify "MMIO" in those operations versus simply "REGION"?
> > Are we arbitrarily excluding support for I/O port regions or device
> > specific regions?  If these commands replace direct read and write to
> > an fd offset, how is PCI config space handled?
> >   
> 
>   The base offset refers to the sparse areas, where the sparse area
> offset is added to the base region offset.  We will try to make the text
> clearer here as well.
> 
>   MMIO was added to distinguish these operations from DMA operations.
> I can see how this can cause confusion when the region refers to a port range,
> so we can change the name to REGION_READ/WRITE. 
> 
> 
> > VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> > will include the count specifying the amount of data read.  How does
> > the client specify how much data to read?  Via message size?
> >   
> 
>   This is a bug in the doc.  As you said, the count field should
> be the amount of data to be read.
>   
> 
> > VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> > should only ever have access via IOVA, which implies a DMA mapping
> > exists for the device.  Can you provide an example of why we need these
> > commands since there seems little point to this interface if a device
> > cannot directly interact with VM memory.
> >   
> 
> 

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-20 Thread John G Johnson



> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> The former seems intended to provide the server with access to the
> entire GPA space, while the latter indicates an IOVA to GPA mapping of
> those regions.  Doesn't this break the basic isolation of a vIOMMU?
> This essentially says to me "here's all the guest memory, but please
> only access these regions for which we're providing DMA mappings".
> That invites abuse.
> 

The purpose behind separating QEMU into multiple processes is
to provide an additional layer of protection for the infrastructure against
a malign guest, not for the guest against itself, so preventing a server
that has been compromised by a guest from accessing all of guest memory
adds no additional benefit.  We don’t even have an IOMMU in our current
guest model for this reason.

The implementation was stolen from vhost-user, with the exception
that we push IOTLB translations from client to server like VFIO does, as
opposed to pulling them from server to client like vhost-user does.

That said, neither the qemu-mp nor MUSER implementation uses an
IOMMU, so if you prefer another IOMMU model, we can consider it.  We
could only send the guest memory file descriptors with IOMMU_MAP_DMA
requests, although that would cost performance since each request would
require the server to execute an mmap() system call.


> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> "an array of file descriptors will be sent as part of the message
> meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> the table specified as 0?  How does a client learn their Device ID?
> 

SCM_RIGHTS message controls allow sendmsg() to send an array of
file descriptors over a UNIX domain socket.
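
For readers who have not used this mechanism, a minimal example of sending a
message plus an fd array over a connected UNIX domain socket with sendmsg(2)
and SCM_RIGHTS; this is the standard POSIX/Linux API, not vfio-user-specific
code.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send 'buf' plus up to 8 file descriptors as ancillary data. */
static int send_with_fds(int sock, const void *buf, size_t len,
                         const int *fds, int nfds)
{
    char cbuf[CMSG_SPACE(8 * sizeof(int))];
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = cbuf,
        .msg_controllen = CMSG_SPACE(nfds * sizeof(int)),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;                 /* fds ride as meta-data */
    cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
    memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The receiver recovers the descriptors from the control buffer with recvmsg(2)
and CMSG_DATA() in the same way.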

We’re only supporting one device per socket in this protocol
version, so the device ID will always be 0.  This may change in a future
revision, so we included the field in the header to avoid a major version
change if device multiplexing is added later.
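
Purely to illustrate the versioning point: carrying a device identifier in
every message header, even though it is 0 in this protocol version, lets
multiplexing be added later without breaking the wire format. The layout below
is a hypothetical example, not the draft's actual header:

#include <stdint.h>

struct vfio_user_header {       /* illustrative field names and sizes */
    uint16_t device_id;         /* always 0 for now; reserved for muxing */
    uint16_t command;           /* VFIO_USER_* command number */
    uint32_t size;              /* total message size, header included */
    uint32_t flags;             /* request/reply, error indication, ... */
};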


> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> capability chain), the cap_offset and next pointers within the chain
> need to specify what their offset is relative to (ie. the start of the
> packet, the start of the vfio compatible data structure, etc).  I
> assume the latter for client compatibility.
> 

Yes.  We will attempt to make the language clearer.


> Also on REGION_INFO, offset is specified as "the base offset to be
> given to the mmap() call for regions with the MMAP attribute".  Base
> offset from what?  Is the mmap performed on the socket fd?  Do we not
> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> Why do we specify "MMIO" in those operations versus simply "REGION"?
> Are we arbitrarily excluding support for I/O port regions or device
> specific regions?  If these commands replace direct read and write to
> an fd offset, how is PCI config space handled?
> 

The base offset refers to the sparse areas, where the sparse area
offset is added to the base region offset.  We will try to make the text
clearer here as well.
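
As a concrete reading of that answer (a sketch only, reusing the sparse-mmap
structures from <linux/vfio.h>; in vfio-user the region file descriptor would
arrive over the socket rather than from an ioctl):

#include <stdint.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/*
 * Map one mmap-able sparse area of a region: the offset reported for the
 * region as a whole and the per-area offset from the sparse mmap capability
 * are added together to form the offset handed to mmap().
 */
static void *map_sparse_area(int region_fd, uint64_t region_offset,
                             const struct vfio_region_sparse_mmap_area *area)
{
    return mmap(NULL, area->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                region_fd, (off_t)(region_offset + area->offset));
}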

MMIO was added to distinguish these operations from DMA operations.
I can see how this can cause confusion when the region refers to a port range,
so we can change the name to REGION_READ/WRITE. 


> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> will include the count specifying the amount of data read.  How does
> the client specify how much data to read?  Via message size?
> 

This is a bug in the doc.  As you said, the count field should
be the amount of data to be read.


> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> should only ever have access via IOVA, which implies a DMA mapping
> exists for the device.  Can you provide an example of why we need these
> commands since there seems little point to this interface if a device
> cannot directly interact with VM memory.
> 

It is a GPA.  The device emulation code would only handle the DMA
addresses the guest programmed it with; the server infrastructure knows
whether an IOMMU exists, and whether the DMA address needs translation to
GPA or not.
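
To make the fast-path/slow-path split concrete (an illustration only, not code
from either implementation, and the helper names are assumptions): the device
emulation asks the server infrastructure to read at the DMA address the guest
programmed, and the infrastructure either copies from memory it has already
mmap()ed or falls back to a VFIO_USER_DMA_READ round trip.

#include <stdint.h>
#include <string.h>

/* Assumed helpers, in the spirit of the translation table sketched earlier. */
void *dma_translate(uint64_t addr, uint64_t len);            /* NULL: no mmap */
int   send_dma_read(uint64_t addr, void *dst, uint64_t len); /* slow path     */

/* Server-side DMA read on behalf of the emulated device. */
static int dev_dma_read(uint64_t addr, void *dst, uint64_t len)
{
    void *src = dma_translate(addr, len);
    if (src != NULL) {
        memcpy(dst, src, len);        /* fast path: shared guest memory */
        return 0;
    }
    /* Slow path: ask the client to perform and validate the access. */
    return send_dma_read(addr, dst, len);
}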


> The IOMMU commands should be unnecessary, a vIOMMU should be
> transparent to the server by virtue that the device only knows about
> IOVA mappings accessible to the device.  Requiring the client to expose
> all memory to the server implies that the server must always be trusted.
> 

The client and server are equally trusted; the guest is the untrusted
entity.


> Interrupt info format, s/type/index/, s/vector/subindex/
> 

ok


> In addition to the unused ioctls, the entire concept of groups and
> containers are not found in this specification.  To some degree that
> makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
> to group relationship.  However, the container is very much 

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-14 Thread Alex Williamson
On Thu, 14 May 2020 09:32:15 -0700
John G Johnson  wrote:

>   Thanos and I have made some changes to the doc in response to the
> feedback we’ve received.  The biggest difference is that it is less reliant
> on the reader being familiar with the current VFIO implementation.  We’d
> appreciate any additional feedback you could give on the changes.  Thanks
> in advance.
> 
>   Thanos and JJ
> 
> 
> The link remains the same:
> 
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing

Hi,

I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
The former seems intended to provide the server with access to the
entire GPA space, while the latter indicates an IOVA to GPA mapping of
those regions.  Doesn't this break the basic isolation of a vIOMMU?
This essentially says to me "here's all the guest memory, but please
only access these regions for which we're providing DMA mappings".
That invites abuse.

Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
"an array of file descriptors will be sent as part of the message
meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
the table specified as 0?  How does a client learn their Device ID?

VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
capability chain), the cap_offset and next pointers within the chain
need to specify what their offset is relative to (ie. the start of the
packet, the start of the vfio compatible data structure, etc).  I
assume the latter for client compatibility.

Also on REGION_INFO, offset is specified as "the base offset to be
given to the mmap() call for regions with the MMAP attribute".  Base
offset from what?  Is the mmap performed on the socket fd?  Do we not
allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
Why do we specify "MMIO" in those operations versus simply "REGION"?
Are we arbitrarily excluding support for I/O port regions or device
specific regions?  If these commands replace direct read and write to
an fd offset, how is PCI config space handled?

VFIO_USER_MMIO_READ specifies the count field is zero and the reply
will include the count specifying the amount of data read.  How does
the client specify how much data to read?  Via message size?

VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
should only ever have access via IOVA, which implies a DMA mapping
exists for the device.  Can you provide an example of why we need these
commands since there seems little point to this interface if a device
cannot directly interact with VM memory.

The IOMMU commands should be unnecessary, a vIOMMU should be
transparent to the server by virtue that the device only knows about
IOVA mappings accessible to the device.  Requiring the client to expose
all memory to the server implies that the server must always be trusted.

Interrupt info format, s/type/index/, s/vector/subindex/

In addition to the unused ioctls, the entire concept of groups and
containers are not found in this specification.  To some degree that
makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
to group relationship.  However, the container is very much involved in
the development of migration support, where it's the container that
provides dirty bitmaps.  Since we're doing map and unmap without that
container concept here, perhaps we'd equally apply those APIs to this
same socket.  Thanks,

Alex




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-14 Thread John G Johnson


Thanos and I have made some changes to the doc in response to the
feedback we’ve received.  The biggest difference is that it is less reliant
on the reader being familiar with the current VFIO implementation.  We’d
appreciate any additional feedback you could give on the changes.  Thanks
in advance.

Thanos and JJ


The link remains the same:

https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing



> On Apr 20, 2020, at 4:05 AM, Thanos Makatos  
> wrote:
> 
> Hi,
> 
> I've just shared with you the Google doc we've been working on with John where 
> we've
> been drafting the protocol specification, we think it's time for some first
> comments. Please feel free to comment/edit and suggest more people to be on 
> the
> reviewers list.
> 
> You can also find the Google doc here:
> 
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> 
> If a Google doc doesn't work for you we're open to suggestions.
> 
> Thanks
> 




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-11 Thread Stefan Hajnoczi
On Mon, May 04, 2020 at 10:49:11AM -0700, John G Johnson wrote:
> 
> 
> > On May 4, 2020, at 2:45 AM, Stefan Hajnoczi  wrote:
> > 
> > On Fri, May 01, 2020 at 04:28:25PM +0100, Daniel P. Berrangé wrote:
> >> On Fri, May 01, 2020 at 03:01:01PM +, Felipe Franciosi wrote:
> >>> Hi,
> >>> 
>  On Apr 30, 2020, at 4:20 PM, Thanos Makatos  
>  wrote:
>  
> >>> More importantly, considering:
> >>> a) Marc-André's comments about data alignment etc., and
> >>> b) the possibility to run the server on another guest or host,
> >>> we won't be able to use native VFIO types. If we do want to support 
> >>> that
> >>> then
> >>> we'll have to redefine all data formats, similar to
> >>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> >>> 
> >>> So the protocol will be more like an enhanced version of the 
> >>> Vhost-user
> >>> protocol
> >>> than VFIO. I'm fine with either direction (VFIO vs. enhanced 
> >>> Vhost-user),
> >>> so we need to decide before proceeding as the request format is
> >>> substantially
> >>> different.
> >> 
> >> Regarding the ability to use the protocol on non-AF_UNIX sockets, we 
> >> can
> >> support this future use case without unnecessarily complicating the
> > protocol by
> >> defining the C structs and stating that data alignment and endianness 
> >> for
> > the
> >> non AF_UNIX case must be the one used by GCC on an x86_64 machine,
> > or can
> >> be overridden as required.
> > 
> > Defining it to be x86_64 semantics is effectively saying "we're not 
> > going
> > to do anything and it is up to other arch maintainers to fix the 
> > inevitable
> > portability problems that arise".
>  
>  Pretty much.
>  
> > Since this is a new protocol should we take the opportunity to model it
> > explicitly in some common standard RPC protocol language. This would 
> > have
> > the benefit of allowing implementors to use off the shelf APIs for their
> > wire protocol marshalling, and eliminate questions about endianness and
> > alignment across architectures.
>  
>  The problem is that we haven't defined the scope very well. My initial 
>  impression 
>  was that we should use the existing VFIO structs and constants, however 
>  that's 
>  impossible if we're to support non AF_UNIX. We need consensus on this, 
>  we're 
>  open to ideas how to do this.
> >>> 
> >>> Thanos has a point.
> >>> 
> >>> From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
> >>> was written by Stefan, I read:
> >>> 
>  Inventing a new device emulation protocol from scratch has many
>  disadvantages. VFIO could be used as the protocol to avoid reinventing
>  the wheel ...
> >>> 
> >>> At the same time, this appears to be incompatible with the (new?)
> >>> requirement of supporting device emulation which may run in non-VFIO
> >>> compliant OSs or even across OSs (ie. via TCP or similar).
> >> 
> >> To be clear, I don't have any opinion on whether we need to support
> >> cross-OS/TCP or not.
> >> 
> >> I'm merely saying that if we do decide to support cross-OS/TCP, then
> >> I think we need a more explicitly modelled protocol, instead of relying
> >> on serialization of C structs.
> >> 
> >> There could be benefits to an explicitly modelled protocol, even for
> >> local only usage, if we want to more easily support non-C languages
> >> doing serialization, but again I don't have a strong opinion on whether
> >> that's necessary to worry about or not.
> >> 
> >> So I guess largely the question boils down to setting the scope of
> >> what we want to be able to achieve in terms of RPC endpoints.
> > 
> > The protocol relies on both file descriptor and memory mapping. These
> > are hard to achieve with networking.
> > 
> > I think the closest would be using RDMA to accelerate memory access and
> > switching to a network notification mechanism instead of eventfd.
> > 
> > Sooner or later someone will probably try this. I don't think it makes
> > sense to define this transport in detail now if there are no users, but
> > we should try to make it possible to add it in the future, if necessary.
> > 
> > Another use case that is interesting and not yet directly addressed is:
> > how can another VM play the role of the device? This is important in
> > compute cloud environments where everything is a VM and running a
> > process on the host is not possible.
> > 
> 
>   Cross-VM is not a lot different from networking.  You can’t
> use AF_UNIX; and AF_VSOCK and AF_INET do not support FD 

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-04 Thread John G Johnson



> On May 4, 2020, at 2:45 AM, Stefan Hajnoczi  wrote:
> 
> On Fri, May 01, 2020 at 04:28:25PM +0100, Daniel P. Berrangé wrote:
>> On Fri, May 01, 2020 at 03:01:01PM +, Felipe Franciosi wrote:
>>> Hi,
>>> 
 On Apr 30, 2020, at 4:20 PM, Thanos Makatos  
 wrote:
 
>>> More importantly, considering:
>>> a) Marc-André's comments about data alignment etc., and
>>> b) the possibility to run the server on another guest or host,
>>> we won't be able to use native VFIO types. If we do want to support that
>>> then
>>> we'll have to redefine all data formats, similar to
>>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
>>> 
>>> So the protocol will be more like an enhanced version of the Vhost-user
>>> protocol
>>> than VFIO. I'm fine with either direction (VFIO vs. enhanced 
>>> Vhost-user),
>>> so we need to decide before proceeding as the request format is
>>> substantially
>>> different.
>> 
>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
>> support this future use case without unnecessarily complicating the
> protocol by
>> defining the C structs and stating that data alignment and endianness for
> the
>> non AF_UNIX case must be the one used by GCC on an x86_64 machine,
> or can
>> be overridden as required.
> 
> Defining it to be x86_64 semantics is effectively saying "we're not going
> to do anything and it is up to other arch maintainers to fix the 
> inevitable
> portability problems that arise".
 
 Pretty much.
 
> Since this is a new protocol should we take the opportunity to model it
> explicitly in some common standard RPC protocol language. This would have
> the benefit of allowing implementors to use off the shelf APIs for their
> wire protocol marshalling, and eliminate questions about endianness and
> alignment across architectures.
 
 The problem is that we haven't defined the scope very well. My initial 
 impression 
 was that we should use the existing VFIO structs and constants, however 
 that's 
 impossible if we're to support non AF_UNIX. We need consensus on this, 
 we're 
 open to ideas how to do this.
>>> 
>>> Thanos has a point.
>>> 
>>> From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
>>> was written by Stefan, I read:
>>> 
 Inventing a new device emulation protocol from scratch has many
 disadvantages. VFIO could be used as the protocol to avoid reinventing
 the wheel ...
>>> 
>>> At the same time, this appears to be incompatible with the (new?)
>>> requirement of supporting device emulation which may run in non-VFIO
>>> compliant OSs or even across OSs (ie. via TCP or similar).
>> 
>> To be clear, I don't have any opinion on whether we need to support
>> cross-OS/TCP or not.
>> 
>> I'm merely saying that if we do decide to support cross-OS/TCP, then
>> I think we need a more explicitly modelled protocol, instead of relying
>> on serialization of C structs.
>> 
>> There could be benefits to an explicitly modelled protocol, even for
>> local only usage, if we want to more easily support non-C languages
>> doing serialization, but again I don't have a strong opinion on whether
>> that's necessary to worry about or not.
>> 
>> So I guess largely the question boils down to setting the scope of
>> what we want to be able to achieve in terms of RPC endpoints.
> 
> The protocol relies on both file descriptor and memory mapping. These
> are hard to achieve with networking.
> 
> I think the closest would be using RDMA to accelerate memory access and
> switching to a network notification mechanism instead of eventfd.
> 
> Sooner or later someone will probably try this. I don't think it makes
> sense to define this transport in detail now if there are no users, but
> we should try to make it possible to add it in the future, if necessary.
> 
> Another use case that is interesting and not yet directly addressed is:
> how can another VM play the role of the device? This is important in
> compute cloud environments where everything is a VM and running a
> process on the host is not possible.
> 

Cross-VM is not a lot different from networking.  You can’t
use AF_UNIX; and AF_VSOCK and AF_INET do not support FD passing.
You’d either have to add FD passing to AF_VSOCK, which will have
some security issues, or fall back to message passing that will
degrade performance.  You can skip the byte ordering issues, however,
when it’s the same host.

JJ



> The virtio-vhost-user prototype showed that 

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-04 Thread Stefan Hajnoczi
On Fri, May 01, 2020 at 04:28:25PM +0100, Daniel P. Berrangé wrote:
> On Fri, May 01, 2020 at 03:01:01PM +, Felipe Franciosi wrote:
> > Hi,
> > 
> > > On Apr 30, 2020, at 4:20 PM, Thanos Makatos  
> > > wrote:
> > > 
> >  More importantly, considering:
> >  a) Marc-André's comments about data alignment etc., and
> >  b) the possibility to run the server on another guest or host,
> >  we won't be able to use native VFIO types. If we do want to support 
> >  that
> >  then
> >  we'll have to redefine all data formats, similar to
> >  https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> >  
> >  So the protocol will be more like an enhanced version of the Vhost-user
> >  protocol
> >  than VFIO. I'm fine with either direction (VFIO vs. enhanced 
> >  Vhost-user),
> >  so we need to decide before proceeding as the request format is
> >  substantially
> >  different.
> > >>> 
> > >>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> > >>> support this future use case without unnecessarily complicating the
> > >> protocol by
> > >>> defining the C structs and stating that data alignment and endianness 
> > >>> for
> > >> the
> > >>> non AF_UNIX case must be the one used by GCC on an x86_64 machine,
> > >> or can
> > >>> be overridden as required.
> > >> 
> > >> Defining it to be x86_64 semantics is effectively saying "we're not going
> > >> to do anything and it is up to other arch maintainers to fix the 
> > >> inevitable
> > >> portability problems that arise".
> > > 
> > > Pretty much.
> > > 
> > >> Since this is a new protocol should we take the opportunity to model it
> > >> explicitly in some common standard RPC protocol language. This would have
> > >> the benefit of allowing implementors to use off the shelf APIs for their
> > >> wire protocol marshalling, and eliminate questions about endianness and
> > >> alignment across architectures.
> > > 
> > > The problem is that we haven't defined the scope very well. My initial 
> > > impression 
> > > was that we should use the existing VFIO structs and constants, however 
> > > that's 
> > > impossible if we're to support non AF_UNIX. We need consensus on this, 
> > > we're 
> > > open to ideas how to do this.
> > 
> > Thanos has a point.
> > 
> > From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
> > was written by Stefan, I read:
> > 
> > > Inventing a new device emulation protocol from scratch has many
> > > disadvantages. VFIO could be used as the protocol to avoid reinventing
> > > the wheel ...
> > 
> > At the same time, this appears to be incompatible with the (new?)
> > requirement of supporting device emulation which may run in non-VFIO
> > compliant OSs or even across OSs (ie. via TCP or similar).
> 
> To be clear, I don't have any opinion on whether we need to support
> cross-OS/TCP or not.
> 
> I'm merely saying that if we do decide to support cross-OS/TCP, then
> I think we need a more explicitly modelled protocol, instead of relying
> on serialization of C structs.
> 
> There could be benefits to an explicitly modelled protocol, even for
> local only usage, if we want to more easily support non-C languages
> doing serialization, but again I don't have a strong opinion on whether
> > that's necessary to worry about or not.
> 
> So I guess largely the question boils down to setting the scope of
> what we want to be able to achieve in terms of RPC endpoints.

The protocol relies on both file descriptor and memory mapping. These
are hard to achieve with networking.

I think the closest would be using RDMA to accelerate memory access and
switching to a network notification mechanism instead of eventfd.

Sooner or later someone will probably try this. I don't think it makes
sense to define this transport in detail now if there are no users, but
we should try to make it possible to add it in the future, if necessary.

Another use case that is interesting and not yet directly addressed is:
how can another VM play the role of the device? This is important in
compute cloud environments where everything is a VM and running a
process on the host is not possible.

The virtio-vhost-user prototype showed that it's possible to add this on
top of an existing vhost-user style protocol by terminating the
connection in the device VMM and then communicating with the device
using a new VIRTIO device. Maybe that's the way to do it here too and we
don't need to worry about explicitly designing that into the vfio-user
protocol, but if anyone has other approaches in mind then let's discuss
them now.

Finally, I think the goal of integrating this new protocol into the

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-01 Thread Daniel P . Berrangé
On Fri, May 01, 2020 at 03:01:01PM +, Felipe Franciosi wrote:
> Hi,
> 
> > On Apr 30, 2020, at 4:20 PM, Thanos Makatos  
> > wrote:
> > 
>  More importantly, considering:
>  a) Marc-André's comments about data alignment etc., and
>  b) the possibility to run the server on another guest or host,
>  we won't be able to use native VFIO types. If we do want to support that
>  then
>  we'll have to redefine all data formats, similar to
>  https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
>  
>  So the protocol will be more like an enhanced version of the Vhost-user
>  protocol
>  than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
>  so we need to decide before proceeding as the request format is
>  substantially
>  different.
> >>> 
> >>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> >>> support this future use case without unnecessarily complicating the
> >> protocol by
> >>> defining the C structs and stating that data alignment and endianness for
> >> the
> >>> non AF_UNIX case must be the one used by GCC on an x86_64 machine,
> >> or can
> >>> be overridden as required.
> >> 
> >> Defining it to be x86_64 semantics is effectively saying "we're not going
> >> to do anything and it is up to other arch maintainers to fix the inevitable
> >> portability problems that arise".
> > 
> > Pretty much.
> > 
> >> Since this is a new protocol should we take the opportunity to model it
> >> explicitly in some common standard RPC protocol language. This would have
> >> the benefit of allowing implementors to use off the shelf APIs for their
> >> wire protocol marshalling, and eliminate questions about endianness and
> >> alignment across architectures.
> > 
> > The problem is that we haven't defined the scope very well. My initial 
> > impression 
> > was that we should use the existing VFIO structs and constants, however 
> > that's 
> > impossible if we're to support non AF_UNIX. We need consensus on this, 
> > we're 
> > open to ideas how to do this.
> 
> Thanos has a point.
> 
> From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
> was written by Stefan, I read:
> 
> > Inventing a new device emulation protocol from scratch has many
> > disadvantages. VFIO could be used as the protocol to avoid reinventing
> > the wheel ...
> 
> At the same time, this appears to be incompatible with the (new?)
> requirement of supporting device emulation which may run in non-VFIO
> compliant OSs or even across OSs (ie. via TCP or similar).

To be clear, I don't have any opinion on whether we need to support
cross-OS/TCP or not.

I'm merely saying that if we do decide to support cross-OS/TCP, then
I think we need a more explicitly modelled protocol, instead of relying
on serialization of C structs.

There could be benefits to an explicitly modelled protocol, even for
local only usage, if we want to more easily support non-C languages
doing serialization, but again I don't have a strong opinion on whether
that's necessary to worry about or not.

So I guess largely the question boils down to setting the scope of
what we want to be able to achieve in terms of RPC endpoints.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-01 Thread Felipe Franciosi
Hi,

> On Apr 30, 2020, at 4:20 PM, Thanos Makatos  
> wrote:
> 
 More importantly, considering:
 a) Marc-André's comments about data alignment etc., and
 b) the possibility to run the server on another guest or host,
 we won't be able to use native VFIO types. If we do want to support that
 then
 we'll have to redefine all data formats, similar to
 https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
 
 So the protocol will be more like an enhanced version of the Vhost-user
 protocol
 than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
 so we need to decide before proceeding as the request format is
 substantially
 different.
>>> 
>>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
>>> support this future use case without unnecessarily complicating the
>> protocol by
>>> defining the C structs and stating that data alignment and endianness for
>> the
>>> non AF_UNIX case must be the one used by GCC on an x86_64 machine,
>> or can
>>> be overridden as required.
>> 
>> Defining it to be x86_64 semantics is effectively saying "we're not going
>> to do anything and it is up to other arch maintainers to fix the inevitable
>> portability problems that arise".
> 
> Pretty much.
> 
>> Since this is a new protocol should we take the opportunity to model it
>> explicitly in some common standard RPC protocol language. This would have
>> the benefit of allowing implementors to use off the shelf APIs for their
>> wire protocol marshalling, and eliminate questions about endianness and
>> alignment across architectures.
> 
> The problem is that we haven't defined the scope very well. My initial 
> impression 
> was that we should use the existing VFIO structs and constants, however 
> that's 
> impossible if we're to support non AF_UNIX. We need consensus on this, we're 
> open to ideas how to do this.

Thanos has a point.

From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
was written by Stefan, I read:

> Inventing a new device emulation protocol from scratch has many
> disadvantages. VFIO could be used as the protocol to avoid reinventing
> the wheel ...

At the same time, this appears to be incompatible with the (new?)
requirement of supporting device emulation which may run in non-VFIO
compliant OSs or even across OSs (ie. via TCP or similar).

We are happy to support what the community agrees on, but it seems
like there isn't an agreement. Is it worth all of us jumping into
another call to realign?

Cheers,
F.



RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-30 Thread Thanos Makatos
> > > More importantly, considering:
> > > a) Marc-André's comments about data alignment etc., and
> > > b) the possibility to run the server on another guest or host,
> > > we won't be able to use native VFIO types. If we do want to support that
> > > then
> > > we'll have to redefine all data formats, similar to
> > > https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> > >
> > > So the protocol will be more like an enhanced version of the Vhost-user
> > > protocol
> > > than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> > > so we need to decide before proceeding as the request format is
> > > substantially
> > > different.
> >
> > Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> > support this future use case without unnecessarily complicating the
> protocol by
> > defining the C structs and stating that data alignment and endianness for
> the
> > non AF_UNIX case must be the one used by GCC on an x86_64 machine,
> or can
> > be overridden as required.
> 
> Defining it to be x86_64 semantics is effectively saying "we're not going
> to do anything and it is up to other arch maintainers to fix the inevitable
> portability problems that arise".

Pretty much.
 
> Since this is a new protocol should we take the opportunity to model it
> explicitly in some common standard RPC protocol language. This would have
> the benefit of allowing implementors to use off the shelf APIs for their
> wire protocol marshalling, and eliminate questions about endianness and
> alignment across architectures.

The problem is that we haven't defined the scope very well. My initial 
impression 
was that we should use the existing VFIO structs and constants, however that's 
impossible if we're to support non AF_UNIX. We need consensus on this, we're 
open to ideas how to do this.
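
For what it is worth, a small sketch of what "defining the C structs and
stating the alignment and endianness" can look like in practice, using fixed
width types and explicit little-endian accessors so the wire format does not
depend on the host; this is only an illustration of the trade-off being
discussed, not a proposal:

#include <endian.h>    /* htole32()/le32toh(); glibc-specific header */
#include <stdint.h>
#include <string.h>

/* Store a 32-bit field into a wire buffer in little-endian order. */
static void put_le32(uint8_t *buf, uint32_t v)
{
    uint32_t le = htole32(v);
    memcpy(buf, &le, sizeof(le));    /* memcpy avoids alignment assumptions */
}

/* Load a 32-bit little-endian field from a wire buffer. */
static uint32_t get_le32(const uint8_t *buf)
{
    uint32_t le;
    memcpy(&le, buf, sizeof(le));
    return le32toh(le);
}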




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-30 Thread Daniel P . Berrangé
On Thu, Apr 30, 2020 at 11:23:34AM +, Thanos Makatos wrote:
> > > > I've just shared with you the Google doc we've been working on with John
> > > where we've
> > > > been drafting the protocol specification, we think it's time for some 
> > > > first
> > > > comments. Please feel free to comment/edit and suggest more people
> > to
> > > be on the
> > > > reviewers list.
> > > >
> > > > You can also find the Google doc here:
> > > >
> > > >
> > > https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> > > >
> > > > If a Google doc doesn't work for you we're open to suggestions.
> > >
> > > I can't add comments to the document so I've inlined them here:
> > >
> > > The spec assumes the reader is already familiar with VFIO and does not
> > > explain concepts like the device lifecycle, regions, interrupts, etc.
> > > We don't need to duplicate detailed VFIO information, but I think the
> > > device model should be explained so that anyone can start from the
> > > VFIO-user spec and begin working on an implementation.  Right now I
> > > think they would have to do some serious investigation of VFIO first in
> > > order to be able to write code.
> > 
> > I've added a high-level overview of how VFIO is used in this context.
> > 
> > > "only the source header files are used"
> > > I notice the current  header is licensed "GPL-2.0 WITH
> > > Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> > > some restrictions on using this header file.  The 
> > > header files were explicitly licensed under the BSD license to make it
> > > easy to use the non __KERNEL__ parts.
> > 
> > My impression is that this note actually relaxes the licensing 
> > requirements, so
> > that proprietary software can use the system call headers and run on Linux
> > without being considered derived work. In any case I'll double check with 
> > our
> > legal team.
> > 
> > > VFIO-user Command Types: please indicate for each request type whether
> > > it is client->server, server->client, or both.  Also is it a "command"
> > > or "request"?
> > 
> > Will do. It's a command.
> > 
> > 
> > > vfio_user_req_type <-- is this an extension on top of ?
> > > Please make it clear what is part of the base  protocol
> > > and what is specific to vfio-user.
> > 
> > Correct, it's an extension over . I've clarified the two 
> > symbol
> > namespaces.
> > 
> > 
> > > VFIO_USER_READ/WRITE serve completely different purposes depending
> > on
> > > whether they are sent client->server or server->client.  I suggest
> > > defining separate request type constants instead of overloading them.
> > 
> > Fixed.
> > 
> > > What is the difference between VFIO_USER_MAP_DMA and
> > > VFIO_USER_REG_MEM?
> > > They both seem to be client->server messages for setting up memory but
> > > I'm not sure why two request types are needed.
> > 
> > John will provide more information on this.
> > 
> > > struct vfio_user_req->data.  Is this really a union so that every
> > > message has the same size, regardless of how many parameters are
> > passed
> > > in the data field?
> > 
> > Correct, it's a union so that every message has the same length.
> > 
> > > "a framebuffer where the guest does multiple stores to the virtual
> > > device."  Do you mean in SMP guests?  Or even in a single CPU guest?
> > 
> > @John
> > 
> > > Also, is there any concurrency requirement on the client and server
> > > side?  Can I implement a client/server that processes requests
> > > sequentially and completes them before moving on to the next request or
> > > would that deadlock for certain message types?
> > 
> > I believe that this might also depend on the device semantics, will need to
> > think about it in greater detail.
> 
> I've looked at this but can't provide a definitive answer yet. I believe the
> safest thing to do is for the server to process requests in order.
> 
> > More importantly, considering:
> > a) Marc-André's comments about data alignment etc., and
> > b) the possibility to run the server on another guest or host,
> > we won't be able to use native VFIO types. If we do want to support that
> > then
> > we'll have to redefine all data formats, similar to
> > https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.
> > 
> > So the protocol will be more like an enhanced version of the Vhost-user
> > protocol
> > than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> > so we need to decide before proceeding as the request format 

RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-30 Thread Thanos Makatos
> > > I've just shared with you the Google doc we've working on with John
> > where we've
> > > been drafting the protocol specification, we think it's time for some 
> > > first
> > > comments. Please feel free to comment/edit and suggest more people
> to
> > be on the
> > > reviewers list.
> > >
> > > You can also find the Google doc here:
> > >
> > > https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> > >
> > > If a Google doc doesn't work for you we're open to suggestions.
> >
> > I can't add comments to the document so I've inlined them here:
> >
> > The spec assumes the reader is already familiar with VFIO and does not
> > explain concepts like the device lifecycle, regions, interrupts, etc.
> > We don't need to duplicate detailed VFIO information, but I think the
> > device model should be explained so that anyone can start from the
> > VFIO-user spec and begin working on an implementation.  Right now I
> > think they would have to do some serious investigation of VFIO first in
> > order to be able to write code.
> 
> I've added a high-level overview of how VFIO is used in this context.
> 
> > "only the source header files are used"
> > I notice the current  header is licensed "GPL-2.0 WITH
> > Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> > some restrictions on using this header file.  The 
> > header files were explicitly licensed under the BSD license to make it
> > easy to use the non __KERNEL__ parts.
> 
> My impression is that this note actually relaxes the licensing requirements, 
> so
> that proprietary software can use the system call headers and run on Linux
> without being considered derived work. In any case I'll double check with our
> legal team.
> 
> > VFIO-user Command Types: please indicate for each request type whether
> > it is client->server, server->client, or both.  Also is it a "command"
> > or "request"?
> 
> Will do. It's a command.
> 
> 
> > vfio_user_req_type <-- is this an extension on top of ?
> > Please make it clear what is part of the base  protocol
> > and what is specific to vfio-user.
> 
> Correct, it's an extension over . I've clarified the two symbol
> namespaces.
> 
> 
> > VFIO_USER_READ/WRITE serve completely different purposes depending
> on
> > whether they are sent client->server or server->client.  I suggest
> > defining separate request type constants instead of overloading them.
> 
> Fixed.
> 
> > What is the difference between VFIO_USER_MAP_DMA and
> > VFIO_USER_REG_MEM?
> > They both seem to be client->server messages for setting up memory but
> > I'm not sure why two request types are needed.
> 
> John will provide more information on this.
> 
> > struct vfio_user_req->data.  Is this really a union so that every
> > message has the same size, regardless of how many parameters are
> passed
> > in the data field?
> 
> Correct, it's a union so that every message has the same length.
> 
> > "a framebuffer where the guest does multiple stores to the virtual
> > device."  Do you mean in SMP guests?  Or even in a single CPU guest?
> 
> @John
> 
> > Also, is there any concurrency requirement on the client and server
> > side?  Can I implement a client/server that processes requests
> > sequentially and completes them before moving on to the next request or
> > would that deadlock for certain message types?
> 
> I believe that this might also depend on the device semantics, will need to
> think about it in greater detail.

I've looked at this but can't provide a definitive answer yet. I believe the
safest thing to do is for the server to process requests in order.

> More importantly, considering:
> a) Marc-André's comments about data alignment etc., and
> b) the possibility to run the server on another guest or host,
> we won't be able to use native VFIO types. If we do want to support that
> then
> we'll have to redefine all data formats, similar to
> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.
> 
> So the protocol will be more like an enhanced version of the Vhost-user
> protocol
> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> so we need to decide before proceeding as the request format is
> substantially
> different.

Regarding the ability to use the protocol on non-AF_UNIX sockets, we can 
support this future use case without unnecessarily complicating the protocol by
defining the C structs and stating that data alignment and endianness for the 
non AF_UNIX case must be the one 

RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-27 Thread Thanos Makatos
> > I've just shared with you the Google doc we've working on with John
> where we've
> > been drafting the protocol specification, we think it's time for some first
> > comments. Please feel free to comment/edit and suggest more people to
> be on the
> > reviewers list.
> >
> > You can also find the Google doc here:
> >
> > https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> >
> > If a Google doc doesn't work for you we're open to suggestions.
> 
> I can't add comments to the document so I've inlined them here:
> 
> The spec assumes the reader is already familiar with VFIO and does not
> explain concepts like the device lifecycle, regions, interrupts, etc.
> We don't need to duplicate detailed VFIO information, but I think the
> device model should be explained so that anyone can start from the
> VFIO-user spec and begin working on an implementation.  Right now I
> think they would have to do some serious investigation of VFIO first in
> order to be able to write code.

I've added a high-level overview of how VFIO is used in this context.

> "only the source header files are used"
> I notice the current  header is licensed "GPL-2.0 WITH
> Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> some restrictions on using this header file.  The 
> header files were explicitly licensed under the BSD license to make it
> easy to use the non __KERNEL__ parts.

My impression is that this note actually relaxes the licensing requirements, so
that proprietary software can use the system call headers and run on Linux
without being considered derived work. In any case I'll double check with our
legal team.
 
> VFIO-user Command Types: please indicate for each request type whether
> it is client->server, server->client, or both.  Also is it a "command"
> or "request"?

Will do. It's a command.

 
> vfio_user_req_type <-- is this an extension on top of ?
> Please make it clear what is part of the base  protocol
> and what is specific to vfio-user.

Correct, it's an extension over <linux/vfio.h>. I've clarified the two symbol
namespaces.

 
> VFIO_USER_READ/WRITE serve completely different purposes depending on
> whether they are sent client->server or server->client.  I suggest
> defining separate request type constants instead of overloading them.

Fixed.
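
For illustration, the split might look something like this (the constant names
and values are placeholders, not the draft's actual definitions):

enum vfio_user_command {
    VFIO_USER_REGION_READ  = 1,   /* client -> server: read a device region  */
    VFIO_USER_REGION_WRITE = 2,   /* client -> server: write a device region */
    VFIO_USER_DMA_READ     = 3,   /* server -> client: read guest memory     */
    VFIO_USER_DMA_WRITE    = 4,   /* server -> client: write guest memory    */
};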

> What is the difference between VFIO_USER_MAP_DMA and
> VFIO_USER_REG_MEM?
> They both seem to be client->server messages for setting up memory but
> I'm not sure why two request types are needed.

John will provide more information on this.

> struct vfio_user_req->data.  Is this really a union so that every
> message has the same size, regardless of how many parameters are passed
> in the data field?

Correct, it's a union so that every message has the same length.
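
As a rough illustration only (member names and sizes below are invented for
this sketch, not taken from the draft):

#include <stdint.h>

/* Illustrative layout only: the union keeps sizeof() constant across
 * commands, no matter how many parameters a given command carries. */
struct vfio_user_req {
    uint32_t msg_id;
    uint32_t command;          /* which VFIO_USER_* command this is */
    union {
        struct {
            uint64_t offset;   /* region read/write parameters */
            uint32_t count;
        } region_io;
        struct {
            uint64_t addr;     /* DMA map/unmap parameters */
            uint64_t size;
        } dma_map;
        uint8_t pad[64];       /* fixes the message size for all commands */
    } data;
};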

> "a framebuffer where the guest does multiple stores to the virtual
> device."  Do you mean in SMP guests?  Or even in a single CPU guest?

@John

> Also, is there any concurrency requirement on the client and server
> side?  Can I implement a client/server that processes requests
> sequentially and completes them before moving on to the next request or
> would that deadlock for certain message types?

I believe that this might also depend on the device semantics; I will need to
think about it in greater detail.

More importantly, considering:
a) Marc-André's comments about data alignment etc., and
b) the possibility to run the server on another guest or host,
we won't be able to use native VFIO types. If we do want to support that then
we'll have to redefine all data formats, similar to
https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.

So the protocol will be more like an enhanced version of the Vhost-user protocol
than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
so we need to decide before proceeding as the request format is substantially
different.



Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-22 Thread Stefan Hajnoczi
On Mon, Apr 20, 2020 at 11:05:25AM +, Thanos Makatos wrote:
> > In order to interoperate we'll need to maintain a protocol
> > specification.  Mayb You and JJ could put that together and CC the vfio,
> > rust-vmm, and QEMU communities for discussion?
> > 
> > It should cover the UNIX domain socket connection semantics (does a
> > listen socket only accept 1 connection at a time?  What happens when the
> > client disconnects?  What happens when the server disconnects?), how
> > VFIO structs are exchanged, any vfio-over-socket specific protocol
> > messages, etc.  Basically everything needed to write an implementation
> > (although it's not necessary to copy the VFIO struct definitions from
> > the kernel headers into the spec or even document their semantics if
> > they are identical to kernel VFIO).
> > 
> > The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> > client implementation in QEMU.  There is a prototype here:
> > https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c
> > 
> > If there are any volunteers for working on that then this would be a
> > good time to discuss it.
> 
> Hi,
> 
> I've just shared with you the Google doc we've working on with John where 
> we've
> been drafting the protocol specification, we think it's time for some first
> comments. Please feel free to comment/edit and suggest more people to be on 
> the
> reviewers list.
> 
> You can also find the Google doc here:
> 
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> 
> If a Google doc doesn't work for you we're open to suggestions.

I can't add comments to the document so I've inlined them here:

The spec assumes the reader is already familiar with VFIO and does not
explain concepts like the device lifecycle, regions, interrupts, etc.
We don't need to duplicate detailed VFIO information, but I think the
device model should be explained so that anyone can start from the
VFIO-user spec and begin working on an implementation.  Right now I
think they would have to do some serious investigation of VFIO first in
order to be able to write code.

"only the source header files are used"
I notice the current <linux/vfio.h> header is licensed "GPL-2.0 WITH
Linux-syscall-note".  I'm not a lawyer but I guess this means there are
some restrictions on using this header file.  The 
header files were explicitly licensed under the BSD license to make it
easy to use the non __KERNEL__ parts.

VFIO-user Command Types: please indicate for each request type whether
it is client->server, server->client, or both.  Also is it a "command"
or "request"?

vfio_user_req_type <-- is this an extension on top of <linux/vfio.h>?
Please make it clear what is part of the base <linux/vfio.h> protocol
and what is specific to vfio-user.

VFIO_USER_READ/WRITE serve completely different purposes depending on
whether they are sent client->server or server->client.  I suggest
defining separate request type constants instead of overloading them.

What is the difference between VFIO_USER_MAP_DMA and VFIO_USER_REG_MEM?
They both seem to be client->server messages for setting up memory but
I'm not sure why two request types are needed.

struct vfio_user_req->data.  Is this really a union so that every
message has the same size, regardless of how many parameters are passed
in the data field?

"a framebuffer where the guest does multiple stores to the virtual
device."  Do you mean in SMP guests?  Or even in a single CPU guest?

Also, is there any concurrency requirement on the client and server
side?  Can I implement a client/server that processes requests
sequentially and completes them before moving on to the next request or
would that deadlock for certain message types?




RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-20 Thread Thanos Makatos
> In order to interoperate we'll need to maintain a protocol
> specification.  Mayb You and JJ could put that together and CC the vfio,
> rust-vmm, and QEMU communities for discussion?
> 
> It should cover the UNIX domain socket connection semantics (does a
> listen socket only accept 1 connection at a time?  What happens when the
> client disconnects?  What happens when the server disconnects?), how
> VFIO structs are exchanged, any vfio-over-socket specific protocol
> messages, etc.  Basically everything needed to write an implementation
> (although it's not necessary to copy the VFIO struct definitions from
> the kernel headers into the spec or even document their semantics if
> they are identical to kernel VFIO).
> 
> The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> client implementation in QEMU.  There is a prototype here:
> https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c
> 
> If there are any volunteers for working on that then this would be a
> good time to discuss it.

Hi,

I've just shared with you the Google doc we've been working on with John, where
we've been drafting the protocol specification; we think it's time for some first
comments. Please feel free to comment/edit and suggest more people to be on the
reviewers list.

You can also find the Google doc here:

https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing

If a Google doc doesn't work for you we're open to suggestions.

Thanks



Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-03 Thread Stefan Hajnoczi
On Thu, Apr 02, 2020 at 11:46:45AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 02, 2020 at 11:19:42AM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 01, 2020 at 06:58:20PM +0200, Marc-André Lureau wrote:
> > > On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
> > >  wrote:
> > > > > > Bear in mind that since this is just a PoC lots of things can 
> > > > > > break, e.g. some
> > > > > > system call not intercepted etc.
> > > > >
> > > > > Cool, I had a quick look at libvfio and how the transport integrates
> > > > > into libmuser.  The integration on the libmuser side is nice and 
> > > > > small.
> > > > >
> > > > > It seems likely that there will be several different implementations 
> > > > > of
> > > > > the vfio-over-socket device side (server):
> > > > > 1. libmuser
> > > > > 2. A Rust equivalent to libmuser
> > > > > 3. Maybe a native QEMU implementation for multi-process QEMU (I think 
> > > > > JJ
> > > > >has been investigating this?)
> > > > >
> > > > > In order to interoperate we'll need to maintain a protocol
> > > > > specification.  Mayb You and JJ could put that together and CC the 
> > > > > vfio,
> > > > > rust-vmm, and QEMU communities for discussion?
> > > >
> > > > Sure, I can start by drafting a design doc and share it.
> > > 
> > > ok! I am quite amazed you went this far with a ldpreload hack. This
> > > demonstrates some limits of gpl projects, if it was necessary.
> > > 
> > > I think with this work, and the muser experience, you have a pretty
> > > good idea of what the protocol could look like. My approach, as I
> > > remember, was a quite straightforward VFIO over socket translation,
> > > while trying to see if it could share some aspects with vhost-user,
> > > such as memory handling etc.
> > > 
> > > To contrast with the work done on qemu-mp series, I'd also prefer we
> > > focus our work on a vfio-like protocol, before trying to see how qemu
> > > code and interface could be changed over multiple binaries etc. We
> > > will start with some limitations, similar to the one that apply to
> > > VFIO: migration, introspection, managements etc are mostly left out
> > > for now. (iow, qemu-mp is trying to do too many things simultaneously)
> > 
> > qemu-mp has been cut down significantly in order to make it
> > non-invasive.  The model is now much cleaner:
> > 1. No monitor command or command-line option forwarding.  The device
> >emulation program has its own command-line and monitor that QEMU
> >doesn't know about.
> > 2. No per-device proxy objects.  A single RemotePCIDevice is added to
> >QEMU.  In the current patch series it only supports the LSI SCSI
> >controller but once the socket protocol is changed to
> >vfio-over-socket it will be possible to use any PCI device.
> > 
> > We recently agreed on dropping live migration to further reduce the
> > patch series.  If you have specific suggestions, please post reviews on
> > the latest patch series.
> 
> To clarify - the decision was to *temporarily* drop live migration, to
> make the initial patch series smaller and thus easier to merge. It does
> ultimately need live migration, so there would be followup patch series
> to provide migration support, after the initial merge.

Yes.  Live migration should come from the VFIO protocol and/or vmstate
DBus.  There is no need to implement it in a qemu-mp-specific way.

Stefan




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-02 Thread Daniel P . Berrangé
On Thu, Apr 02, 2020 at 11:19:42AM +0100, Stefan Hajnoczi wrote:
> On Wed, Apr 01, 2020 at 06:58:20PM +0200, Marc-André Lureau wrote:
> > On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
> >  wrote:
> > > > > Bear in mind that since this is just a PoC lots of things can break, 
> > > > > e.g. some
> > > > > system call not intercepted etc.
> > > >
> > > > Cool, I had a quick look at libvfio and how the transport integrates
> > > > into libmuser.  The integration on the libmuser side is nice and small.
> > > >
> > > > It seems likely that there will be several different implementations of
> > > > the vfio-over-socket device side (server):
> > > > 1. libmuser
> > > > 2. A Rust equivalent to libmuser
> > > > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> > > >has been investigating this?)
> > > >
> > > > In order to interoperate we'll need to maintain a protocol
> > > > specification.  Mayb You and JJ could put that together and CC the vfio,
> > > > rust-vmm, and QEMU communities for discussion?
> > >
> > > Sure, I can start by drafting a design doc and share it.
> > 
> > ok! I am quite amazed you went this far with a ldpreload hack. This
> > demonstrates some limits of gpl projects, if it was necessary.
> > 
> > I think with this work, and the muser experience, you have a pretty
> > good idea of what the protocol could look like. My approach, as I
> > remember, was a quite straightforward VFIO over socket translation,
> > while trying to see if it could share some aspects with vhost-user,
> > such as memory handling etc.
> > 
> > To contrast with the work done on qemu-mp series, I'd also prefer we
> > focus our work on a vfio-like protocol, before trying to see how qemu
> > code and interface could be changed over multiple binaries etc. We
> > will start with some limitations, similar to the one that apply to
> > VFIO: migration, introspection, managements etc are mostly left out
> > for now. (iow, qemu-mp is trying to do too many things simultaneously)
> 
> qemu-mp has been cut down significantly in order to make it
> non-invasive.  The model is now much cleaner:
> 1. No monitor command or command-line option forwarding.  The device
>emulation program has its own command-line and monitor that QEMU
>doesn't know about.
> 2. No per-device proxy objects.  A single RemotePCIDevice is added to
>QEMU.  In the current patch series it only supports the LSI SCSI
>controller but once the socket protocol is changed to
>vfio-over-socket it will be possible to use any PCI device.
> 
> We recently agreed on dropping live migration to further reduce the
> patch series.  If you have specific suggestions, please post reviews on
> the latest patch series.

To clarify - the decision was to *temporarily* drop live migration, to
make the initial patch series smaller and thus easier to merge. It does
ultimately need live migration, so there would be followup patch series
to provide migration support, after the initial merge.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-02 Thread Stefan Hajnoczi
On Wed, Apr 01, 2020 at 06:58:20PM +0200, Marc-André Lureau wrote:
> On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
>  wrote:
> > > On Thu, Mar 26, 2020 at 09:47:38AM +, Thanos Makatos wrote:
> > > > Build MUSER with vfio-over-socket:
> > > >
> > > > git clone --single-branch --branch vfio-over-socket
> > > git@github.com:tmakatos/muser.git
> > > > cd muser/
> > > > git submodule update --init
> > > > make
> > > >
> > > > Run device emulation, e.g.
> > > >
> > > > ./build/dbg/samples/gpio-pci-idio-16 -s 
> > > >
> > > > Where  is an available IOMMU group, essentially the device ID, which
> > > must not
> > > > previously exist in /dev/vfio/.
> > > >
> > > > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> > > >
> > > > LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64
> > > \
> > > > ... \
> > > > -device vfio-pci,sysfsdev=/dev/vfio/ \
> > > > -object 
> > > > memory-backend-file,id=ram-node0,prealloc=yes,mem-
> > > path=mem,share=yes,size=1073741824 \
> > > > -numa node,nodeid=0,cpus=0,memdev=ram-node0
> > > >
> 
> fyi, with 5.0 you no longer need -numa!:
> 
> -object memory-backend-memfd,id=mem,size=2G -M memory-backend=mem
> 
> (hopefully, we will get something even simpler one day)
> 
> > > > Bear in mind that since this is just a PoC lots of things can break, 
> > > > e.g. some
> > > > system call not intercepted etc.
> > >
> > > Cool, I had a quick look at libvfio and how the transport integrates
> > > into libmuser.  The integration on the libmuser side is nice and small.
> > >
> > > It seems likely that there will be several different implementations of
> > > the vfio-over-socket device side (server):
> > > 1. libmuser
> > > 2. A Rust equivalent to libmuser
> > > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> > >has been investigating this?)
> > >
> > > In order to interoperate we'll need to maintain a protocol
> > > specification.  Mayb You and JJ could put that together and CC the vfio,
> > > rust-vmm, and QEMU communities for discussion?
> >
> > Sure, I can start by drafting a design doc and share it.
> 
> ok! I am quite amazed you went this far with a ldpreload hack. This
> demonstrates some limits of gpl projects, if it was necessary.
> 
> I think with this work, and the muser experience, you have a pretty
> good idea of what the protocol could look like. My approach, as I
> remember, was a quite straightforward VFIO over socket translation,
> while trying to see if it could share some aspects with vhost-user,
> such as memory handling etc.
> 
> To contrast with the work done on qemu-mp series, I'd also prefer we
> focus our work on a vfio-like protocol, before trying to see how qemu
> code and interface could be changed over multiple binaries etc. We
> will start with some limitations, similar to the one that apply to
> VFIO: migration, introspection, managements etc are mostly left out
> for now. (iow, qemu-mp is trying to do too many things simultaneously)

qemu-mp has been cut down significantly in order to make it
non-invasive.  The model is now much cleaner:
1. No monitor command or command-line option forwarding.  The device
   emulation program has its own command-line and monitor that QEMU
   doesn't know about.
2. No per-device proxy objects.  A single RemotePCIDevice is added to
   QEMU.  In the current patch series it only supports the LSI SCSI
   controller but once the socket protocol is changed to
   vfio-over-socket it will be possible to use any PCI device.

We recently agreed on dropping live migration to further reduce the
patch series.  If you have specific suggestions, please post reviews on
the latest patch series.

The RemotePCIDevice and device emulation program infrastructure it puts
in place are intended to be used by vfio-over-socket in the future.  I
see it as complementary to vfio-over-socket rather than as a
replacement.  Elena, Jag, and JJ have been working on it for a long time
and I think we should build on top of it (replacing parts as needed)
rather than propose a new plan that sidelines their work.

> That's the rough ideas/plan I have in mind:
> - draft/define a "vfio over unix" protocol
> - similar to vhost-user, also define some backend conventions
> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst#backend-program-conventions
> - modify qemu vfio code to allow using a socket backend. Ie something
> like "-chardev socket=foo -device vfio-pci,chardev=foo"

I think JJ has been working on this already.  Not sure what the status
is.

> - implement some test devices (and outside qemu, in whatever
> language/framework - the more the merrier!)
> - investigate how existing qemu binary could expose some devices over
> "vfio-unix", for ex: "qemu -machine none -chardev socket=foo,server
> -device pci-serial,vfio=foo". This would avoid a lot of proxy and code
> 

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-01 Thread Marc-André Lureau
Hi

On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
 wrote:
>
> > On Thu, Mar 26, 2020 at 09:47:38AM +, Thanos Makatos wrote:
> > > Build MUSER with vfio-over-socket:
> > >
> > > git clone --single-branch --branch vfio-over-socket
> > git@github.com:tmakatos/muser.git
> > > cd muser/
> > > git submodule update --init
> > > make
> > >
> > > Run device emulation, e.g.
> > >
> > > ./build/dbg/samples/gpio-pci-idio-16 -s 
> > >
> > > Where  is an available IOMMU group, essentially the device ID, which
> > must not
> > > previously exist in /dev/vfio/.
> > >
> > > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> > >
> > > LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64
> > \
> > > ... \
> > > -device vfio-pci,sysfsdev=/dev/vfio/ \
> > > -object memory-backend-file,id=ram-node0,prealloc=yes,mem-
> > path=mem,share=yes,size=1073741824 \
> > > -numa node,nodeid=0,cpus=0,memdev=ram-node0
> > >

fyi, with 5.0 you no longer need -numa!:

-object memory-backend-memfd,id=mem,size=2G -M memory-backend=mem

(hopefully, we will get something even simpler one day)

> > > Bear in mind that since this is just a PoC lots of things can break, e.g. 
> > > some
> > > system call not intercepted etc.
> >
> > Cool, I had a quick look at libvfio and how the transport integrates
> > into libmuser.  The integration on the libmuser side is nice and small.
> >
> > It seems likely that there will be several different implementations of
> > the vfio-over-socket device side (server):
> > 1. libmuser
> > 2. A Rust equivalent to libmuser
> > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> >has been investigating this?)
> >
> > In order to interoperate we'll need to maintain a protocol
> > specification.  Mayb You and JJ could put that together and CC the vfio,
> > rust-vmm, and QEMU communities for discussion?
>
> Sure, I can start by drafting a design doc and share it.

ok! I am quite amazed you went this far with an LD_PRELOAD hack. This
demonstrates some limits of GPL projects, if that was even necessary.

I think with this work, and the muser experience, you have a pretty
good idea of what the protocol could look like. My approach, as I
remember, was a quite straightforward VFIO over socket translation,
while trying to see if it could share some aspects with vhost-user,
such as memory handling etc.

To contrast with the work done on qemu-mp series, I'd also prefer we
focus our work on a vfio-like protocol, before trying to see how qemu
code and interface could be changed over multiple binaries etc. We
will start with some limitations, similar to the ones that apply to
VFIO: migration, introspection, management etc. are mostly left out
for now. (iow, qemu-mp is trying to do too many things simultaneously)

That's the rough ideas/plan I have in mind:
- draft/define a "vfio over unix" protocol
- similar to vhost-user, also define some backend conventions
https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst#backend-program-conventions
- modify qemu vfio code to allow using a socket backend. Ie something
like "-chardev socket=foo -device vfio-pci,chardev=foo"
- implement some test devices (and outside qemu, in whatever
language/framework - the more the merrier!)
- investigate how existing qemu binary could expose some devices over
"vfio-unix", for ex: "qemu -machine none -chardev socket=foo,server
-device pci-serial,vfio=foo". This would avoid a lot of proxy and code
churn proposed in qemu-mp.
- think about evolution of QMP, so that commands are dispatched to the
right process. In my book, this is called a bus, and I would go for
DBus (not through qemu) in the long term. But for now, we probably
want to split QMP code to make it more modular (in qemu-mp series,
this isn't stellar either). Later on, perhaps look at bridging QMP
over DBus.
- code refactoring in qemu, to allow smaller binaries, that implement
the minimum for vfio-user devices. (imho, this will be a bit easier
after the meson move, as the build system is simpler)

That should allow some work sharing.
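
Roughly, an invocation for the "-chardev socket" idea above could look like the
following. This is purely illustrative: the chardev= property on vfio-pci does
not exist in QEMU today, and the id and socket path are made up:

qemu-system-x86_64 \
    ... \
    -chardev socket,id=vfio-user0,path=/var/run/vfio-user/gpio.sock \
    -device vfio-pci,chardev=vfio-user0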

I can't wait for your design draft, and see how I could help.

>
> > It should cover the UNIX domain socket connection semantics (does a
> > listen socket only accept 1 connection at a time?  What happens when the
> > client disconnects?  What happens when the server disconnects?), how
> > VFIO structs are exchanged, any vfio-over-socket specific protocol
> > messages, etc.  Basically everything needed to write an implementation
> > (although it's not necessary to copy the VFIO struct definitions from
> > the kernel headers into the spec or even document their semantics if
> > they are identical to kernel VFIO).
> >
> > The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> > client implementation in QEMU.  There is a prototype here:
> > 

RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-01 Thread Thanos Makatos
> On Thu, Mar 26, 2020 at 09:47:38AM +, Thanos Makatos wrote:
> > Build MUSER with vfio-over-socket:
> >
> > git clone --single-branch --branch vfio-over-socket
> git@github.com:tmakatos/muser.git
> > cd muser/
> > git submodule update --init
> > make
> >
> > Run device emulation, e.g.
> >
> > ./build/dbg/samples/gpio-pci-idio-16 -s 
> >
> > Where  is an available IOMMU group, essentially the device ID, which
> must not
> > previously exist in /dev/vfio/.
> >
> > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> >
> > LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64
> \
> > ... \
> > -device vfio-pci,sysfsdev=/dev/vfio/ \
> > -object memory-backend-file,id=ram-node0,prealloc=yes,mem-
> path=mem,share=yes,size=1073741824 \
> > -numa node,nodeid=0,cpus=0,memdev=ram-node0
> >
> > Bear in mind that since this is just a PoC lots of things can break, e.g. 
> > some
> > system call not intercepted etc.
> 
> Cool, I had a quick look at libvfio and how the transport integrates
> into libmuser.  The integration on the libmuser side is nice and small.
> 
> It seems likely that there will be several different implementations of
> the vfio-over-socket device side (server):
> 1. libmuser
> 2. A Rust equivalent to libmuser
> 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
>has been investigating this?)
> 
> In order to interoperate we'll need to maintain a protocol
> specification.  Mayb You and JJ could put that together and CC the vfio,
> rust-vmm, and QEMU communities for discussion?

Sure, I can start by drafting a design doc and share it.

> It should cover the UNIX domain socket connection semantics (does a
> listen socket only accept 1 connection at a time?  What happens when the
> client disconnects?  What happens when the server disconnects?), how
> VFIO structs are exchanged, any vfio-over-socket specific protocol
> messages, etc.  Basically everything needed to write an implementation
> (although it's not necessary to copy the VFIO struct definitions from
> the kernel headers into the spec or even document their semantics if
> they are identical to kernel VFIO).
> 
> The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> client implementation in QEMU.  There is a prototype here:
> https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c
> 
> If there are any volunteers for working on that then this would be a
> good time to discuss it.
> 
> Finally, has anyone looked at CrosVM's out-of-process device model?  I
> wonder if it has any features we should consider...
> 
> Looks like a great start to vfio-over-socket!



Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-01 Thread Stefan Hajnoczi
On Thu, Mar 26, 2020 at 09:47:38AM +, Thanos Makatos wrote:
> Build MUSER with vfio-over-socket:
> 
> git clone --single-branch --branch vfio-over-socket 
> git@github.com:tmakatos/muser.git
> cd muser/
> git submodule update --init
> make
> 
> Run device emulation, e.g.
> 
> ./build/dbg/samples/gpio-pci-idio-16 -s 
> 
> Where  is an available IOMMU group, essentially the device ID, which must 
> not
> previously exist in /dev/vfio/.
> 
> Run QEMU using the vfio wrapper library and specifying the MUSER device:
> 
> LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
> ... \
> -device vfio-pci,sysfsdev=/dev/vfio/ \
> -object 
> memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824
>  \
> -numa node,nodeid=0,cpus=0,memdev=ram-node0
> 
> Bear in mind that since this is just a PoC lots of things can break, e.g. some
> system call not intercepted etc.

Cool, I had a quick look at libvfio and how the transport integrates
into libmuser.  The integration on the libmuser side is nice and small.

It seems likely that there will be several different implementations of
the vfio-over-socket device side (server):
1. libmuser
2. A Rust equivalent to libmuser
3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
   has been investigating this?)

In order to interoperate we'll need to maintain a protocol
specification.  Maybe you and JJ could put that together and CC the vfio,
rust-vmm, and QEMU communities for discussion?

It should cover the UNIX domain socket connection semantics (does a
listen socket only accept 1 connection at a time?  What happens when the
client disconnects?  What happens when the server disconnects?), how
VFIO structs are exchanged, any vfio-over-socket specific protocol
messages, etc.  Basically everything needed to write an implementation
(although it's not necessary to copy the VFIO struct definitions from
the kernel headers into the spec or even document their semantics if
they are identical to kernel VFIO).
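
For concreteness, the single-connection model in question is nothing more than
a plain AF_UNIX listener that serves one client until it disconnects. A minimal
sketch (the path handling and error handling are arbitrary, this is not from
any existing implementation):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Minimal single-client AF_UNIX server: accept one connection, serve it
 * until the peer disconnects, then go back to accept(). */
static int serve(const char *path)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int lsock = socket(AF_UNIX, SOCK_STREAM, 0);

    if (lsock < 0)
        return -1;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);
    if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 1) < 0) {
        perror("bind/listen");
        return -1;
    }
    for (;;) {
        char buf[4096];
        ssize_t n;
        int conn = accept(lsock, NULL, NULL);

        if (conn < 0)
            continue;
        while ((n = read(conn, buf, sizeof(buf))) > 0)
            ;               /* parse and handle vfio-user messages here */
        close(conn);        /* client went away; wait for the next one */
    }
}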

The next step beyond the LD_PRELOAD library is a native vfio-over-socket
client implementation in QEMU.  There is a prototype here:
https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c

If there are any volunteers for working on that then this would be a
good time to discuss it.

Finally, has anyone looked at CrosVM's out-of-process device model?  I
wonder if it has any features we should consider...

Looks like a great start to vfio-over-socket!

Stefan




RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-03-27 Thread Thanos Makatos
>  
> Next I explain how to test the PoC.
> 
> Build MUSER with vfio-over-socket:
> 
> git clone --single-branch --branch vfio-over-socket
> git@github.com:tmakatos/muser.git
> cd muser/
> git submodule update --init
> make

Yesterday's version had a bug where it didn't build if you didn't have an
existing libmuser installation; I pushed a patch to fix that.



RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-03-26 Thread Thanos Makatos
I want to continue the discussion regarding using MUSER
(https://github.com/nutanix/muser) as a device offloading mechanism. The main
drawback of MUSER is that it requires a kernel module, so I've experimented
with a proof of concept of what MUSER would look like if we somehow didn't need
a kernel module. I did this by implementing a wrapper library
(https://github.com/tmakatos/libpathtrap) that intercepts accesses to
VFIO-related paths and forwards them to the MUSER process providing device
emulation over a UNIX domain socket. This does not require any changes to QEMU
(4.1.0). Obviously this is a massive hack and is done only for the needs of
this PoC.
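
For anyone wondering what the interception looks like mechanically, the idea is
simply to override libc entry points such as open() and divert VFIO paths to
the socket. A heavily simplified sketch (connect_to_emulation() is a stand-in
for the real forwarding logic, and the real library intercepts many more
calls):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>

typedef int (*open_fn)(const char *, int, ...);

/* Stand-in for forwarding the access over the UNIX domain socket. */
int connect_to_emulation(const char *path);

/* Toy LD_PRELOAD interposer: open() on /dev/vfio/* is diverted to the
 * device emulation process instead of the kernel. */
int open(const char *path, int flags, ...)
{
    static open_fn real_open;
    mode_t mode = 0;

    if (!real_open)
        real_open = (open_fn)dlsym(RTLD_NEXT, "open");
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    if (strncmp(path, "/dev/vfio/", 10) == 0)
        return connect_to_emulation(path);
    return real_open(path, flags, mode);
}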

The result is a fully working PCI device in QEMU (the gpio sample explained in
https://github.com/nutanix/muser/blob/master/README.md#running-gpio-pci-idio-16),
which is as simple as possible. I've also tested with a much more complicated
device emulation, https://github.com/tmakatos/spdk, which provides NVMe device
emulation and requires accessing guest memory for DMA, allowing BAR0 to be
memory mapped into the guest, using MSI-X interrupts, etc.
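
To give a flavour of what "accessing guest memory for DMA" means on the server
side: once the client has shared a guest RAM region as a file descriptor, the
emulation process maps it and DMA becomes ordinary loads/stores. A rough sketch
(the names are mine, not SPDK's or libmuser's):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Illustrative only: a guest RAM region the client shared with us. */
struct dma_region {
    uint64_t gpa;     /* guest-physical base address */
    size_t   size;
    void    *vaddr;   /* our local mapping of the shared fd */
};

static int dma_region_map(struct dma_region *r, int fd, off_t offset)
{
    r->vaddr = mmap(NULL, r->size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, offset);
    return r->vaddr == MAP_FAILED ? -1 : 0;
}

/* "DMA" is then just pointer arithmetic into the mapping. */
static inline void *dma_addr(const struct dma_region *r, uint64_t gpa)
{
    return (uint8_t *)r->vaddr + (gpa - r->gpa);
}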

The changes required in MUSER are fairly small; all that is needed is to
introduce a new concept of "transport" to receive requests from a UNIX domain
socket instead of the kernel (from a character device) and to send/receive file
descriptors for sharing memory and firing interrupts.
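
The descriptor passing itself is plain SCM_RIGHTS ancillary data; roughly as
follows (the function name and shape are mine, not libmuser's API):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one message and, if present, one file descriptor passed as
 * SCM_RIGHTS ancillary data. *fd is set to -1 when no descriptor
 * accompanied the message. Illustrative, not the libmuser API. */
static ssize_t recv_msg_and_fd(int sock, void *buf, size_t len, int *fd)
{
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    ssize_t n = recvmsg(sock, &msg, 0);

    *fd = -1;
    if (n > 0) {
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);

        if (c && c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
            memcpy(fd, CMSG_DATA(c), sizeof(int));
    }
    return n;
}

The same mechanism would carry the memory fd that gets mmap()ed and the
eventfds used to fire interrupts.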

My experience is that VFIO is so intuitive to use for offloading device
emulation from one process to another that it makes this feature quite
straightforward. There's virtually nothing specific to the kernel in the VFIO
API. Therefore I strongly agree with Stefan's suggestion to use it for device
offloading when interacting with QEMU. Using 'muser.ko' is still interesting
when QEMU is not the client, but if everyone is happy to proceed with the
vfio-over-socket alternative the kernel module can become a second-class
citizen. (QEMU is, after all, our first and most relevant client.)

Next I explain how to test the PoC.

Build MUSER with vfio-over-socket:

git clone --single-branch --branch vfio-over-socket git@github.com:tmakatos/muser.git
cd muser/
git submodule update --init
make

Run device emulation, e.g.

./build/dbg/samples/gpio-pci-idio-16 -s <N>

Where <N> is an available IOMMU group, essentially the device ID, which must not
previously exist in /dev/vfio/.

Run QEMU using the vfio wrapper library and specifying the MUSER device:

LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
... \
-device vfio-pci,sysfsdev=/dev/vfio/<N> \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
-numa node,nodeid=0,cpus=0,memdev=ram-node0

Bear in mind that since this is just a PoC lots of things can break, e.g. some
system call not intercepted etc.