Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-08 Thread Michael S. Tsirkin
On Mon, Jun 07, 2010 at 12:01:04PM -0700, Tom Lyon wrote:
 On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
  On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
   OK, in the interest of making progress, I am about to embark on the 
   following:
   
   1. Create a user-iommu-domain driver - opening it will give a new empty 
   domain.
   Ultimately this can also populate sysfs with the state of its world, 
   which would
   also be a good addition to the base iommu stuff.
   If someone closes the fd while in use, the domain stays valid anyway 
   until users
   drop off.
   
   2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require 
   that
  a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
  
  Require domain to be set before you allow any access to the device:
  mmap, write, read.  IMO this is the only safe way to make sure userspace
  does not corrupt memory, and this removes the need to special-case
  MSI memory, play with bus master enable and hope it can be cleared without
  reset, etc.
 
 Michael - the light bulb finally lit for me and I now understand what you've 
 been
 saying the past few weeks.  Of course you're right - we need iommu set before 
 any
 register access.  I had thought that was done by default but now I see that 
 the 
 dma_map_sg routine only attaches to the iommu on demand.
 
 So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
 with the vfio fd so that the user can still do everything with one fd. I'm 
 thinking the
 fd opens and iommu bindings could be done in a program before spinning out the
 program with the user driver.

This would kind of break Avi's idea that mappings are programmed
at the domain and shared by multiple devices, wouldn't it?
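
For what it's worth, the setup-program idea quoted above could look roughly
like the sketch below: a small launcher binds the iommu domain, then execs the
unprivileged user driver, which inherits the already-bound fd. Everything here
(device paths, the VFIO_DOMAIN_SET name and number) is an assumption for
illustration, not the ABI proposed in the patch.

    /* Hedged sketch of the launcher idea; paths and ioctl definitions are
     * made up for illustration, not taken from the patch. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define VFIO_DOMAIN_SET _IOW(';', 100, int)     /* hypothetical ioctl */

    int main(void)
    {
        int dom = open("/dev/uiommu", O_RDWR);  /* user-iommu-domain driver */
        int dev = open("/dev/vfio0", O_RDWR);   /* vfio device */
        char arg[16];

        if (dom < 0 || dev < 0 || ioctl(dev, VFIO_DOMAIN_SET, dom) < 0) {
            perror("setup");
            exit(1);
        }
        /* fds stay open across exec, so the unprivileged user driver starts
         * with a device that already has an iommu domain bound to it. */
        snprintf(arg, sizeof(arg), "%d", dev);
        execl("./userdriver", "userdriver", arg, (char *)NULL);
        perror("execl");
        return 1;
    }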

  
   (this is the one
  that KVM wants).
  
  Not sure I understand. I think that MAP should be done on the domain,
  not the device, this handles pinning pages correctly and
  this way you don't need any special checks.
  
  However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
  which uses the dma_sg interface which has no expicit control of 
   domains. I
  intend to keep it the way it is, but expect only non-hypervisor 
   programs would
  want to use it.
  
  If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
  non-hypervisors just pick an address?
  
   3. Clean up the docs and other nits that folks have found.
   
   Comments? 
  
 


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-07 Thread Tom Lyon
On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
 On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
  OK, in the interest of making progress, I am about to embark on the 
  following:
  
  1. Create a user-iommu-domain driver - opening it will give a new empty 
  domain.
  Ultimately this can also populate sysfs with the state of its world, 
  which would
  also be a good addition to the base iommu stuff.
  If someone closes the fd while in use, the domain stays valid anyway 
  until users
  drop off.
  
  2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
 a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
 
 Require domain to be set before you allow any access to the device:
 mmap, write, read.  IMO this is the only safe way to make sure userspace
 does not corrupt memory, and this removes the need to special-case
 MSI memory, play with bus master enable and hope it can be cleared without
 reset, etc.

Michael - the light bulb finally lit for me and I now understand what you've
been saying for the past few weeks.  Of course you're right - we need the iommu
set up before any register access.  I had thought that was done by default, but
now I see that the dma_map_sg routine only attaches to the iommu on demand.

So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
with the vfio fd so that the user can still do everything with one fd. I'm
thinking the fd opens and iommu bindings could be done in a setup program
before spinning out the program with the user driver.
 
  (this is the one
 that KVM wants).
 
 Not sure I understand. I think that MAP should be done on the domain,
 not the device, this handles pinning pages correctly and
 this way you don't need any special checks.
 
 However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
 which uses the dma_sg interface which has no expicit control of domains. 
  I
 intend to keep it the way it is, but expect only non-hypervisor programs 
  would
 want to use it.
 
 If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
 non-hypervisors just pick an address?
 
  3. Clean up the docs and other nits that folks have found.
  
  Comments? 
 




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-06 Thread Michael S. Tsirkin
On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
 OK, in the interest of making progress, I am about to embark on the following:
 
 1. Create a user-iommu-domain driver - opening it will give a new empty 
 domain.
 Ultimately this can also populate sysfs with the state of its world, 
 which would
 also be a good addition to the base iommu stuff.
 If someone closes the fd while in use, the domain stays valid anyway 
 until users
 drop off.
 
 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
a domain be set before using the VFIO_DMA_MAP_IOVA ioctl

Require domain to be set before you allow any access to the device:
mmap, write, read.  IMO this is the only safe way to make sure userspace
does not corrupt memory, and this removes the need to special-case
MSI memory, play with bus master enable and hope it can be cleared without
reset, etc.
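
In driver terms, the check being asked for here is just that every file
operation bails out until a domain has been bound. A minimal sketch (the
structure and the names are invented to illustrate the rule, they are not the
patch's code):

    #include <linux/fs.h>
    #include <linux/iommu.h>
    #include <linux/mm.h>

    struct vfio_dev_state {
        struct iommu_domain *domain;        /* NULL until DOMAIN_SET runs */
    };

    /* Sketch: refuse mmap (and likewise read/write) before DOMAIN_SET. */
    static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
    {
        struct vfio_dev_state *vdev = filep->private_data;

        if (!vdev->domain)                  /* no iommu domain bound yet */
            return -EINVAL;                 /* refuse all register access */

        /* ... set up the BAR mapping as the driver does today ... */
        return 0;
    }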

 (this is the one
that KVM wants).

Not sure I understand. I think that MAP should be done on the domain,
not the device, this handles pinning pages correctly and
this way you don't need any special checks.

However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
which uses the dma_sg interface which has no expicit control of domains. I
intend to keep it the way it is, but expect only non-hypervisor programs 
 would
want to use it.

If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
non-hypervisors just pick an address?

 3. Clean up the docs and other nits that folks have found.
 
 Comments? 


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-06 Thread Avi Kivity

On 06/02/2010 07:53 PM, Chris Wright wrote:

* Avi Kivity (a...@redhat.com) wrote:
   

The interface would only work for clients which support it: kvm,
vhost, and iommu/devices with restartable dma.
 

BTW, there is no such thing as restartable dma.  There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them.  So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.
   


Thanks for the correction.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-03 Thread Tom Lyon
OK, in the interest of making progress, I am about to embark on the following:

1. Create a user-iommu-domain driver - opening it will give a new empty domain.
Ultimately this can also populate sysfs with the state of its world, which would
also be a good addition to the base iommu stuff.
If someone closes the fd while in use, the domain stays valid anyway until users
drop off.

2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
   a domain be set before using the VFIO_DMA_MAP_IOVA ioctl (this is the one
   that KVM wants).  However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
   which uses the dma_sg interface, which has no explicit control of domains. I
   intend to keep it the way it is, but expect only non-hypervisor programs
   would want to use it.

3. Clean up the docs and other nits that folks have found.

Comments? 
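
The practical difference between the two mapping ioctls in point 2 is who picks
the IO virtual address. A hedged sketch of the intended usage (the structure
layout and the helper at the end are illustrative assumptions, not the patch's
ABI; buf and dev are presumed to exist):

    /* Illustrative only -- not the patch's actual ABI. */
    struct dma_map_args {
        unsigned long long vaddr;       /* process virtual address of buffer */
        unsigned long long iova;        /* IO virtual address for the device */
        unsigned long long size;
    };

    struct dma_map_args map = { .vaddr = (unsigned long long)buf, .size = 65536 };

    /* MAP_IOVA: the caller (e.g. kvm mapping guest memory) chooses the IO
     * virtual address itself, so a domain must have been set first. */
    map.iova = 0x100000;
    ioctl(dev, VFIO_DMA_MAP_IOVA, &map);

    /* MAP_ANYWHERE: the kernel picks the address via the dma_sg interface
     * and hands it back for the program to write into the device. */
    ioctl(dev, VFIO_DMA_MAP_ANYWHERE, &map);
    program_device_with_iova(map.iova);     /* hypothetical helper */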


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:

 There seems to be some misunderstanding.  The userspace interface
 proposed forces a separate domain per device and forces userspace to
 repeat iommu programming for each device.  We are better off sharing a
 domain between devices and programming the iommu once.
 
 The natural way to do this is to have an iommu driver for programming
 iommu.

IMO a separate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address space of another. That's much simpler.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
 On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:

 Main difference is that vhost works fine with unlocked
 memory, paging it in on demand. iommu needs to unmap
 memory when it is swapped out or relocated.

 So you'd just take the memory map and not pin anything.  This way you  
 can reuse the memory map.

 But no, it doesn't handle the dirty bitmap, so no go.

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 12:45 PM, Joerg Roedel wrote:

On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
   

On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
 
   

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

   

So you'd just take the memory map and not pin anything.  This way you
can reuse the memory map.

But no, it doesn't handle the dirty bitmap, so no go.
 

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.
   


vhost doesn't pin memory.

What I proposed is to describe the memory map using an object (fd), and 
pass it around to clients that use it: kvm, vhost, vfio.  That way you 
maintain the memory map in a central location and broadcast changes to 
clients.  Only a vfio client would result in memory being pinned.


It can still work, but the interface needs to be extended to include 
dirty bitmap logging.




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 12:42 PM, Joerg Roedel wrote:

On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:

   

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.

The natural way to do this is to have an iommu driver for programming
iommu.
 

IMO a seperate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address-space on another. Thats much simpler.
   


This is non trivial with hotplug.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Tue, Jun 01, 2010 at 09:59:40PM -0700, Tom Lyon wrote:
 This is just what I was thinking.  But rather than a get/set, just use two 
 fds.
 
   ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
 
This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure.  In addition, this is strictly upwards compatible with
what is there now, so maybe we can add it later.

How can this fail with multiple IOMMUs? This should be handled
transparently by the IOMMU driver.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
 On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:
 
  There seems to be some misunderstanding.  The userspace interface
  proposed forces a separate domain per device and forces userspace to
  repeat iommu programming for each device.  We are better off sharing a
  domain between devices and programming the iommu once.
  
  The natural way to do this is to have an iommu driver for programming
  iommu.
 
 IMO a seperate iommu-userspace driver is a nightmare for a userspace
 interface. It is just too complicated to use.

One advantage would be that we can reuse the uio framework
for the devices themselves. So an existing app can just program
an iommu for DMA and keep using uio for interrupts and access.

 We can solve the problem
 of multiple devices-per-domain with an ioctl which allows binding one
 uio-device to the address-space on another.

This would imply switching an iommu domain for a device while
it could potentially be doing DMA. No idea whether this can be done
in a safe manner.
Forcing iommu assignment to be done as a first step seems much saner.


 Thats much simpler.
 
   Joerg


So instead of
dev = open();
ioctl(dev, ASSIGN, iommu)
mmap

(and if we forget the ioctl, mmap will fail) we have

dev = open();
if (ndevices > 0)
        ioctl(devices[0], ASSIGN, dev)
mmap

And if we forget the ioctl we get errors from the device.
Seems more complicated to me.


There will also always be confusion about which device's address
space we are modifying. With a separate driver for the iommu,
we can safely check that the binding is done correctly.

-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
 On 06/02/2010 12:45 PM, Joerg Roedel wrote:
 IOMMU mapped memory can not be swapped out because we can't do demand
 paging on io-page-faults with current devices. We have to pin _all_
 userspace memory that is mapped into an IOMMU domain.

 vhost doesn't pin memory.

 What I proposed is to describe the memory map using an object (fd), and  
 pass it around to clients that use it: kvm, vhost, vfio.  That way you  
 maintain the memory map in a central location and broadcast changes to  
 clients.  Only a vfio client would result in memory being pinned.

Ah ok, so it's only about the database which keeps the mapping
information.

 It can still work, but the interface needs to be extended to include  
 dirty bitmap logging.

That's hard to do. I am not sure about VT-d, but the AMD IOMMU has no
dirty bits in the page table. And without demand paging we can't really
tell which pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:04:04PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
  On 06/02/2010 12:45 PM, Joerg Roedel wrote:
  IOMMU mapped memory can not be swapped out because we can't do demand
  paging on io-page-faults with current devices. We have to pin _all_
  userspace memory that is mapped into an IOMMU domain.
 
  vhost doesn't pin memory.
 
  What I proposed is to describe the memory map using an object (fd), and  
  pass it around to clients that use it: kvm, vhost, vfio.  That way you  
  maintain the memory map in a central location and broadcast changes to  
  clients.  Only a vfio client would result in memory being pinned.
 
 Ah ok, so its only about the database which keeps the mapping
 information.
 
  It can still work, but the interface needs to be extended to include  
  dirty bitmap logging.
 
 Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
 dirty-bits in the page-table. And without demand-paging we can't really
 tell what pages a device has written to. The only choice is to mark all
 IOMMU-mapped pages dirty as long as they are mapped.
 
   Joerg

Or mark them dirty when they are unmapped.

-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:

  IMO a seperate iommu-userspace driver is a nightmare for a userspace
  interface. It is just too complicated to use.
 
 One advantage would be that we can reuse the uio framework
 for the devices themselves. So an existing app can just program
 an iommu for DMA and keep using uio for interrupts and access.

The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
IOMMU mappings belongs there.

  We can solve the problem
  of multiple devices-per-domain with an ioctl which allows binding one
  uio-device to the address-space on another.
 
 This would imply switching an iommu domain for a device while
 it could potentially be doing DMA. No idea whether this can be done
 in a safe manner.

It can. The worst thing that can happen is an io-page-fault.

 Forcing iommu assignment to be done as a first step seems much saner.

If we force it, there is no reason not to do it implicitly.

We can do something like this then:

dev1 = open();
ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
it*/

dev2 = open();
ioctl(dev2, IOMMU_MAP, ...);

/* Now dev1 and dev2 are in separate domains */

ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
   assigns it to the same domain as
   dev1. Domain has a refcount of two
   now */

close(dev1); /* domain refcount goes down to one */
close(dev2); /* domain refcount is zero and domain gets destroyed */


Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 11:45:27AM +0200, Joerg Roedel wrote:
 On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
  On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
 
  Main difference is that vhost works fine with unlocked
  memory, paging it in on demand. iommu needs to unmap
  memory when it is swapped out or relocated.
 
  So you'd just take the memory map and not pin anything.  This way you  
  can reuse the memory map.
 
  But no, it doesn't handle the dirty bitmap, so no go.
 
 IOMMU mapped memory can not be swapped out because we can't do demand
 paging on io-page-faults with current devices. We have to pin _all_
 userspace memory that is mapped into an IOMMU domain.
 
   Joerg


One of the issues I see with the current patch is that
it uses the mlock rlimit to do this pinning. So this wastes the rlimit
for an app that did mlockall already, and also consumes
this resource transparently, so an app might call mlock
on a small buffer and be surprised that it fails.

Using mmu notifiers might help?
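
To make the accounting concern concrete: if pinned DMA mappings are charged
against the same RLIMIT_MEMLOCK that mlock() uses, an application can run into
the surprise below (sketch only; the vfio call that actually pins the pages is
elided).

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        static char small_buf[4096];

        getrlimit(RLIMIT_MEMLOCK, &rl);
        printf("RLIMIT_MEMLOCK is %llu bytes\n",
               (unsigned long long)rl.rlim_cur);

        /* Suppose the driver has already pinned nearly rl.rlim_cur bytes of
         * this process's memory for IOMMU mappings (or mlockall() was used).
         * A later, unrelated mlock() of a tiny buffer then fails with ENOMEM,
         * which is exactly the surprise described above. */
        if (mlock(small_buf, sizeof(small_buf)) != 0)
            perror("mlock of a 4 KiB buffer");

        return 0;
    }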


-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
 
   IMO a seperate iommu-userspace driver is a nightmare for a userspace
   interface. It is just too complicated to use.
  
  One advantage would be that we can reuse the uio framework
  for the devices themselves. So an existing app can just program
  an iommu for DMA and keep using uio for interrupts and access.
 
 The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
 IOMMU mappings belongs there.
 
   We can solve the problem
   of multiple devices-per-domain with an ioctl which allows binding one
   uio-device to the address-space on another.
  
  This would imply switching an iommu domain for a device while
  it could potentially be doing DMA. No idea whether this can be done
  in a safe manner.
 
 It can. The worst thing that can happen is an io-page-fault.

devices might not be able to recover from this.

  Forcing iommu assignment to be done as a first step seems much saner.
 
 If we force it, there is no reason why not doing it implicitly.

What you describe below does 3 ioctls for what can be done with 1.

 We can do something like this then:
 
 dev1 = open();
 ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
   it*/
 
 dev2 = open();
 ioctl(dev2, IOMMU_MAP, ...);
 
 /* Now dev1 and dev2 are in seperate domains */
 
 ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
  assigns it to the same domain as
  dev1. Domain has a refcount of two
  now */

Or maybe it destroys mapping for dev1?
How do you remember?

 close(dev1); /* domain refcount goes down to one */
 close(dev2); /* domain refcount is zero and domain gets destroyed */
 
 
   Joerg

Also, no way to unshare? That seems limiting.

-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 01:15:34PM +0300, Michael S. Tsirkin wrote:
 One of the issues I see with the current patch is that
 it uses the mlock rlimit to do this pinning. So this wastes the rlimit
 for an app that did mlockall already, and also consumes
 this resource transparently, so an app might call mlock
 on a small buffer and be surprised that it fails.
 
 Using mmu notifiers might help?

MMU notifiers are problematic because they are designed for situations
where we can do demand paging. The invalidate_range_start and
invalidate_range_end functions are not only called on munmap, they also
run when mprotect is called (in which case we don't want to tear down
iommu mappings). So what may happen with mmu notifiers is that we
accidentally tear down iommu mappings. With demand-paging this is no
problem because the io-ptes could be re-faulted. But that does not work
here.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:

  It can. The worst thing that can happen is an io-page-fault.
 
 devices might not be able to recover from this.

With the userspace interface a process can create io-page-faults
anyway if it wants. We can't protect against this. And the process is
also responsible for not tearing down iommu mappings that are currently in
use.

 What you describe below does 3 ioctls for what can be done with 1.

The second IOMMU_MAP ioctl is just to show that existing mappings would
be destroyed if the device is assigned to another address space. Not
strictly necessary. So we have two ioctls but save one call to create
the iommu-domain.

  ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
 assigns it to the same domain as
 dev1. Domain has a refcount of two
 now */
 
 Or maybe it destroys mapping for dev1?
 How do you remember?

Because we express here that dev2 shares the iommu mappings of dev1.
That's easy to remember.

 Also, no way to unshare? That seems limiting.

Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
because that would require a second parameter) ioctl is certainly also
required.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
 
   It can. The worst thing that can happen is an io-page-fault.
  
  devices might not be able to recover from this.
 
 With the userspace interface a process can create io-page-faults
 anyway if it wants. We can't protect us from this.

We could fail all operations until an iommu is bound.
This will help catch bugs with access before setup. We can not do this
if a domain is bound by default.

 And the process is
 also responsible to not tear down iommu-mappings that are currently in
 use.
 
  What you describe below does 3 ioctls for what can be done with 1.
 
 The second IOMMU_MAP ioctl is just to show that existing mappings would
 be destroyed if the device is assigned to another address space. Not
 strictly necessary. So we have two ioctls but save one call to create
 the iommu-domain.

With 10 devices you have 10 extra ioctls.

   ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
assigns it to the same domain as
dev1. Domain has a refcount of two
now */
  
  Or maybe it destroys mapping for dev1?
  How do you remember?
 
 Because we express here that dev2 shares the iommu mappings of dev1.
 Thats easy to remember.

They both share the mappings. Which one gets the iommu mappings
destroyed (breaking that device if it is now doing DMA)?

  Also, no way to unshare? That seems limiting.
 
 Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
 because that would require a second parameter) ioctl is certainly also
 required.
 
   Joerg



-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
 
   IMO a seperate iommu-userspace driver is a nightmare for a userspace
   interface. It is just too complicated to use.
  
  One advantage would be that we can reuse the uio framework
  for the devices themselves. So an existing app can just program
  an iommu for DMA and keep using uio for interrupts and access.
 
 The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
 IOMMU mappings belongs there.

Maybe it could be put there, but the patch posted did not use uio.
And one of the reasons is that the uio framework provides for
device access and interrupts but not for programming memory mappings.

Solutions (besides giving up on uio completely)
could include extending the framework in some way
(which was tried, but the result was not pretty) or adding
a separate driver for iommu and binding to that.

-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:

  With the userspace interface a process can create io-page-faults
  anyway if it wants. We can't protect us from this.
 
 We could fail all operations until an iommu is bound.  This will help
 catch bugs with access before setup. We can not do this if a domain is
 bound by default.

Even if it is bound to a domain the userspace driver could program the
device to do dma to unmapped regions causing io-page-faults. The kernel
can't do anything about it.

  The second IOMMU_MAP ioctl is just to show that existing mappings would
  be destroyed if the device is assigned to another address space. Not
  strictly necessary. So we have two ioctls but save one call to create
  the iommu-domain.
 
 With 10 devices you have 10 extra ioctls.

And this works implicitly with your proposal? Remember that we still
need to be able to provide separate mappings for each device to support
IOMMU emulation for the guest. I think my proposal does not have any
extra costs.

  Because we express here that dev2 shares the iommu mappings of dev1.
  Thats easy to remember.
 
 they both share the mappings. which one gets the iommu
 destroyed (breaking the device if it is now doing DMA)?

As I wrote the domain has a reference count and is destroyed only when
it goes down to zero. This does not happen as long as a device is bound
to it.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 01:04 PM, Joerg Roedel wrote:

On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
   

On 06/02/2010 12:45 PM, Joerg Roedel wrote:
 

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.
   

vhost doesn't pin memory.

What I proposed is to describe the memory map using an object (fd), and
pass it around to clients that use it: kvm, vhost, vfio.  That way you
maintain the memory map in a central location and broadcast changes to
clients.  Only a vfio client would result in memory being pinned.
 

Ah ok, so its only about the database which keeps the mapping
information.
   


Yes.

   

It can still work, but the interface needs to be extended to include
dirty bitmap logging.
 

Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
dirty-bits in the page-table. And without demand-paging we can't really
tell what pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.

   


The interface would only work for clients which support it: kvm, vhost, 
and iommu/devices with restartable dma.


Note dirty logging is not very interesting for vfio anyway, since you 
can't live migrate with assigned devices.




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
 
   With the userspace interface a process can create io-page-faults
   anyway if it wants. We can't protect us from this.
  
  We could fail all operations until an iommu is bound.  This will help
  catch bugs with access before setup. We can not do this if a domain is
  bound by default.
 
 Even if it is bound to a domain the userspace driver could program the
 device to do dma to unmapped regions causing io-page-faults. The kernel
 can't do anything about it.

It can always corrupt its own memory directly as well :)
But that is not a reason to skip detecting errors where we can,
or to stop making APIs hard to misuse.

   The second IOMMU_MAP ioctl is just to show that existing mappings would
   be destroyed if the device is assigned to another address space. Not
   strictly necessary. So we have two ioctls but save one call to create
   the iommu-domain.
  
  With 10 devices you have 10 extra ioctls.
 
 And this works implicitly with your proposal?

Yes.  So you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)

No need to add a SHARE ioctl.


 Remember that we still
 need to be able to provide seperate mappings for each device to support
 IOMMU emulation for the guest.

Generally not true. E.g. the guest can enable iommu passthrough
or have a domain per group of devices.

 I think my proposal does not have any
 extra costs.

With my proposal we have 1 ioctl per device + 1 per domain.
With yours we have 2 ioctls per device if the iommu is shared
and 1 if it is not shared.

As current apps share the iommu it seems to make sense
to optimize for that.

   Because we express here that dev2 shares the iommu mappings of dev1.
   Thats easy to remember.
  
  they both share the mappings. which one gets the iommu
  destroyed (breaking the device if it is now doing DMA)?
 
 As I wrote the domain has a reference count and is destroyed only when
 it goes down to zero. This does not happen as long as a device is bound
 to it.
 
   Joerg

We were talking about an UNSHARE ioctl:
ioctl(dev1, UNSHARE, dev2)
Does it change the domain for dev1 or dev2?
If you make a mistake you get a hard-to-debug bug.

-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:

  Even if it is bound to a domain the userspace driver could program the
  device to do dma to unmapped regions causing io-page-faults. The kernel
  can't do anything about it.
 
 It can always corrupt its own memory directly as well :)
 But that is not a reason not to detect errors if we can,
 and not to make APIs hard to misuse.

Changing the domain of a device while dma can happen is the same type of
bug as unmapping potential dma target addresses. We can't catch this
kind of misuse.

   With 10 devices you have 10 extra ioctls.
  
  And this works implicitly with your proposal?
 
 Yes.  so you do:
 iommu = open
 ioctl(dev1, BIND, iommu)
 ioctl(dev2, BIND, iommu)
 ioctl(dev3, BIND, iommu)
 ioctl(dev4, BIND, iommu)
 
 No need to add a SHARE ioctl.

In my proposal this looks like:


dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);

So we actually save an ioctl.

  Remember that we still need to be able to provide seperate mappings
  for each device to support IOMMU emulation for the guest.
 
 Generally not true. E.g. guest can enable iommu passthrough
 or have domain per a group of devices.

What I meant was that there may be multiple io-address spaces
necessary for one process. I didn't want to say that every device
_needs_ to have its own address space.

  As I wrote the domain has a reference count and is destroyed only when
  it goes down to zero. This does not happen as long as a device is bound
  to it.
  
  Joerg
 
 We were talking about UNSHARE ioctl:
 ioctl(dev1, UNSHARE, dev2)
 Does it change the domain for dev1 or dev2?
 If you make a mistake you get a hard to debug bug.

As I already wrote we would have an UNBIND ioctl which just removes a
device from its current domain. UNBIND is better than UNSHARE for
exactly the reason you pointed out above. I thought I stated that
already.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 03:19 PM, Joerg Roedel wrote:



Yes.  so you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)

No need to add a SHARE ioctl.
 

In my proposal this looks like:


dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);

So we actually save an ioctl.
   


The problem with this is that it is asymmetric: dev1 is treated
differently from dev[234].  It's an unintuitive API.




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
 
   Even if it is bound to a domain the userspace driver could program the
   device to do dma to unmapped regions causing io-page-faults. The kernel
   can't do anything about it.
  
  It can always corrupt its own memory directly as well :)
  But that is not a reason not to detect errors if we can,
  and not to make APIs hard to misuse.
 
 Changing the domain of a device while dma can happen is the same type of
 bug as unmapping potential dma target addresses. We can't catch this
 kind of misuse.

You normally need the device mapped to start DMA.
SHARE makes this bug more likely as you allow
switching domains: mmap could be done before switching.

With 10 devices you have 10 extra ioctls.
   
   And this works implicitly with your proposal?
  
  Yes.  so you do:
  iommu = open
  ioctl(dev1, BIND, iommu)
  ioctl(dev2, BIND, iommu)
  ioctl(dev3, BIND, iommu)
  ioctl(dev4, BIND, iommu)
  
  No need to add a SHARE ioctl.
 
 In my proposal this looks like:
 
 
 dev1 = open();
 ioctl(dev2, SHARE, dev1);
 ioctl(dev3, SHARE, dev1);
 ioctl(dev4, SHARE, dev1);
 
 So we actually save an ioctl.

I thought we had a BIND ioctl?

   Remember that we still need to be able to provide seperate mappings
   for each device to support IOMMU emulation for the guest.
  
  Generally not true. E.g. guest can enable iommu passthrough
  or have domain per a group of devices.
 
 What I meant was that there may me multiple io-addresses spaces
 necessary for one process. I didn't want to say that every device
 _needs_ to have its own address space.
 
   As I wrote the domain has a reference count and is destroyed only when
   it goes down to zero. This does not happen as long as a device is bound
   to it.
   
 Joerg
  
  We were talking about UNSHARE ioctl:
  ioctl(dev1, UNSHARE, dev2)
  Does it change the domain for dev1 or dev2?
  If you make a mistake you get a hard to debug bug.
 
 As I already wrote we would have an UNBIND ioctl which just removes a
 device from its current domain. UNBIND is better than UNSHARE for
 exactly the reason you pointed out above. I thought I stated that
 already.
 
   Joerg

You undo SHARE with UNBIND?



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
 On 06/02/2010 03:19 PM, Joerg Roedel wrote:

 Yes.  so you do:
 iommu = open
 ioctl(dev1, BIND, iommu)
 ioctl(dev2, BIND, iommu)
 ioctl(dev3, BIND, iommu)
 ioctl(dev4, BIND, iommu)

 No need to add a SHARE ioctl.
  
 In my proposal this looks like:


 dev1 = open();
 ioctl(dev2, SHARE, dev1);
 ioctl(dev3, SHARE, dev1);
 ioctl(dev4, SHARE, dev1);

 So we actually save an ioctl.


 The problem with this is that it is assymetric, dev1 is treated  
 differently from dev[234].  It's an unintuitive API.

It's by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it. If it's
required anyway, the binding can happen implicitly. We could allow a
no-op 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver, all they need to do is:

dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);

Which is much easier than the need to create a domain explicitly.

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 03:34:17PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:

 you normally need device mapped to start DMA.
 SHARE makes this bug more likely as you allow
 switching domains: mmap could be done before switching.

We need to support domain switching anyway for iommu emulation in a
guest. So if you consider this to be a problem (I don't) it will not go
away with your proposal.

  dev1 = open();
  ioctl(dev2, SHARE, dev1);
  ioctl(dev3, SHARE, dev1);
  ioctl(dev4, SHARE, dev1);
  
  So we actually save an ioctl.
 
 I thought we had a BIND ioctl?

I can't remember a BIND ioctl in my proposal. I remember an UNBIND, but
that's bad naming, as you pointed out below. See my statement on this
below too.

 You undo SHARE with UNBIND?

That's bad naming, agreed. Let's keep UNSHARE. The point is, we only need one
parameter to do this, which removes any ambiguity:

ioctl(dev1, UNSHARE);

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 03:50 PM, Joerg Roedel wrote:



The problem with this is that it is assymetric, dev1 is treated
differently from dev[234].  It's an unintuitive API.
 

Its by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it.


I don't really care about the iommu domain.  It's a side effect.  The 
kernel takes care of it.  I'm only worried about the API.


We have a memory map that is (often) the same for a set of devices.  If 
you were coding a non-kernel interface, how would you code it?


  struct memory_map;
  void memory_map_init(struct memory_map *mm, ...);
  struct device;
  void device_set_memory_map(struct device *device, struct memory_map *mm);

or

  struct device;
  void device_init_memory_map(struct device *dev, ...);
  void device_clone_memory_map(struct device *dev, struct device *other);

I wouldn't even think of the second one personally.


If its
required anyway the binding can happen implicitly. We could allow to do
a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
   


It's still special.  You define the memory map only for the first 
device.  You have to make sure dev1 doesn't go away while sharing it.



Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver all it needs to do is:

dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);

Which is much easier than the need to create a domain explicitly.
   


mm = open()
ioctl(mm, MAP, ...)
dev = open();
ioctl(dev, BIND, mm);
...
close(mm);
close(dev);

So yes, it is more work, but once you have multiple devices which come and go
dynamically, things become simpler.  The map object has global lifetime
(you can even construct it without assigning any devices), the devices
attach to it, and memory hotplug updates the memory map but doesn't touch
devices.




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
  On 06/02/2010 03:19 PM, Joerg Roedel wrote:
 
  Yes.  so you do:
  iommu = open
  ioctl(dev1, BIND, iommu)
  ioctl(dev2, BIND, iommu)
  ioctl(dev3, BIND, iommu)
  ioctl(dev4, BIND, iommu)
 
  No need to add a SHARE ioctl.
   
  In my proposal this looks like:
 
 
  dev1 = open();
  ioctl(dev2, SHARE, dev1);
  ioctl(dev3, SHARE, dev1);
  ioctl(dev4, SHARE, dev1);
 
  So we actually save an ioctl.
 
 
  The problem with this is that it is assymetric, dev1 is treated  
  differently from dev[234].  It's an unintuitive API.
 
 Its by far more unintuitive that a process needs to explicitly bind a
 device to an iommu domain before it can do anything with it.

The reason it is more intuitive is that it is harder to get
wrong. If you swap the iommu and the device in the call, you get EBADF,
so you know you made a mistake. We can even make it work
both ways if we wanted to. With ioctl(dev1, BIND, dev2)
it breaks silently.
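
The EBADF property falls out of the fd-type check a BIND implementation would
naturally do. Roughly, as a sketch with invented names (fget()/fput() are the
stock helpers; uiommu_fops and the state structure are assumptions):

    static int vfio_bind_iommu(struct vfio_dev_state *vdev, int iommu_fd)
    {
        struct file *f = fget(iommu_fd);

        if (!f)
            return -EBADF;                  /* not an open fd at all */
        if (f->f_op != &uiommu_fops) {      /* open, but not an iommu fd */
            fput(f);
            return -EBADF;
        }
        vdev->domain = f->private_data;     /* the shared domain object */
        return 0;                           /* keep the reference while bound */
    }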


 If its
 required anyway the binding can happen implicitly. We could allow to do
 a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

And then when we assign meaning to it we find that half the apps
are broken because they did not call this ioctl.

 Note that this way of handling userspace iommu mappings is also a lot
 simpler for most use-cases outside of KVM. If a developer wants to write
 a userspace driver all it needs to do is:
 
 dev = open();
 ioctl(dev, MAP, ...);
 /* use device with mappings */
 close(dev);
 
 Which is much easier than the need to create a domain explicitly.
 
   Joerg

This simple scenario ignores all the real-life corner cases.
For example, with an explicit iommu open and bind, the application
can naturally detect that:
- we have run out of iommu domains
- iommu is unsupported
- iommu is in use by another, incompatible device
- device is in bad state
because each is a separate operation, so it is easy to produce meaningful
errors.
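
Spelled out, each of those conditions surfaces at its own call site when the
iommu is a separate fd (a fragment; node names and the ioctl name are
assumptions, err() is from <err.h>):

    int iommu = open("/dev/uiommu", O_RDWR);
    if (iommu < 0)
        err(1, "uiommu");        /* iommu unsupported, or out of domains */

    int dev = open("/dev/vfio0", O_RDWR);
    if (dev < 0)
        err(1, "vfio0");         /* device missing or in a bad state */

    if (ioctl(dev, VFIO_DOMAIN_SET, iommu) < 0)
        err(1, "DOMAIN_SET");    /* device incompatible with this iommu domain */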

Another interesting thing that a separate iommu device supports is when
application A controls the iommu and application B
controls the device. This might be good to e.g. improve security
(B is run by root, A is unprivileged and passes commands to/from B
 over a pipe).

This is not possible when the same fd is used for both the iommu and the device.

-- 
MST


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 04:06:21PM +0300, Avi Kivity wrote:
 On 06/02/2010 03:50 PM, Joerg Roedel wrote:

 Its by far more unintuitive that a process needs to explicitly bind a
 device to an iommu domain before it can do anything with it.

 I don't really care about the iommu domain.  It's a side effect.  The  
 kernel takes care of it.  I'm only worried about the API.

The proposed memory-map object is nothing other than a userspace
abstraction of an iommu domain.

 We have a memory map that is (often) the same for a set of devices.  If  
 you were coding a non-kernel interface, how would you code it?

   struct memory_map;
   void memory_map_init(struct memory_map *mm, ...);
   struct device;
   void device_set_memory_map(struct device *device, struct memory_map *mm);

 or

   struct device;
   void device_init_memory_map(struct device *dev, ...);
   void device_clone_memory_map(struct device *dev, struct device *other);

 I wouldn't even think of the second one personally.

Right, a kernel interface would be designed the first way. The IOMMU-API
is actually designed in this manner. But I still think we should keep it
simpler for userspace.

 If its required anyway the binding can happen implicitly. We could
 allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

 It's still special.  You define the memory map only for the first  
 device.  You have to make sure dev1 doesn't go away while sharing it.

Must be a misunderstanding. In my proposal the domain is not owned by
one device. It is owned by all devices that share it and will only
vanish if all devices that use it are unbound (which happens when the file
descriptor is closed, for example).

 so yes, more work, but once you have multiple devices which come and go  
 dynamically things become simpler.  The map object has global lifetime  
 (you can even construct it if you don't assign any devices), the devices  
 attach to it, memory hotplug updates the memory map but doesn't touch  
 devices.

I still think a userspace interface should be as simple as possible. But
since both ways will work I am not really opposed to Michael's proposal.
I just think it's overkill for the common non-kvm use case (a userspace
device driver).

Joerg



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
 The interface would only work for clients which support it: kvm,
 vhost, and iommu/devices with restartable dma.

BTW, there is no such thing as restartable dma.  There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them.  So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.

thanks,
-chris


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Chris Wright
* Joerg Roedel (j...@8bytes.org) wrote:
 On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
 
   Even if it is bound to a domain the userspace driver could program the
   device to do dma to unmapped regions causing io-page-faults. The kernel
   can't do anything about it.
  
  It can always corrupt its own memory directly as well :)
  But that is not a reason not to detect errors if we can,
  and not to make APIs hard to misuse.
 
 Changing the domain of a device while dma can happen is the same type of
 bug as unmapping potential dma target addresses. We can't catch this
 kind of misuse.
 
With 10 devices you have 10 extra ioctls.
   
   And this works implicitly with your proposal?
  
  Yes.  so you do:
  iommu = open
  ioctl(dev1, BIND, iommu)
  ioctl(dev2, BIND, iommu)
  ioctl(dev3, BIND, iommu)
  ioctl(dev4, BIND, iommu)
  
  No need to add a SHARE ioctl.
 
 In my proposal this looks like:
 
 
 dev1 = open();
 ioctl(dev2, SHARE, dev1);
 ioctl(dev3, SHARE, dev1);
 ioctl(dev4, SHARE, dev1);
 
 So we actually save an ioctl.

This is not any hot path, so saving an ioctl shouldn't be a consideration.
The only important consideration is a good API.  I may have lost context here,
but the SHARE API is limited to the vfio fd.  The BIND API expects a new
iommu object.  Are there other uses for this object?  Tom's current vfio
driver exposes a dma mapping interface; would the iommu object expose one
as well?  The current interface is a device-specific DMA interface for host
device drivers (typically mapping in-flight dma buffers), and an IOMMU-specific
interface for assigned devices (typically mapping the entire virtual address
space).

thanks,
-chris


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Tom Lyon
On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:
 * Joerg Roedel (j...@8bytes.org) wrote:
  On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
   On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
  
Even if it is bound to a domain the userspace driver could program the
device to do dma to unmapped regions causing io-page-faults. The kernel
can't do anything about it.
   
   It can always corrupt its own memory directly as well :)
   But that is not a reason not to detect errors if we can,
   and not to make APIs hard to misuse.
  
  Changing the domain of a device while dma can happen is the same type of
  bug as unmapping potential dma target addresses. We can't catch this
  kind of misuse.
  
 With 10 devices you have 10 extra ioctls.

And this works implicitly with your proposal?
   
   Yes.  so you do:
   iommu = open
   ioctl(dev1, BIND, iommu)
   ioctl(dev2, BIND, iommu)
   ioctl(dev3, BIND, iommu)
   ioctl(dev4, BIND, iommu)
   
   No need to add a SHARE ioctl.
  
  In my proposal this looks like:
  
  
  dev1 = open();
  ioctl(dev2, SHARE, dev1);
  ioctl(dev3, SHARE, dev1);
  ioctl(dev4, SHARE, dev1);
  
  So we actually save an ioctl.
 
 This is not any hot path, so saving an ioctl shouldn't be a consideration.
 Only important consideration is a good API.  I may have lost context here,
 but the SHARE API is limited to the vfio fd.  The BIND API expects a new
 iommu object.  Are there other uses for this object?  Tom's current vfio
 driver exposes a dma mapping interface, would the iommu object expose
 one as well?  Current interface is device specific DMA interface for
 host device drivers typically mapping in-flight dma buffers, and IOMMU
 specific interface for assigned devices typically mapping entire virtual
 address space.

Actually, it's a domain object - which may be usable among iommus (Joerg?).
However, you can't really do the dma mapping with just the domain because
every device supports a different size address space as a master, i.e.,
the dma_mask.

And I don't know how kvm would deal with devices with varying dma mask support,
or why they'd be in the same domain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 11:09:17AM -0700, Tom Lyon wrote:
 On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:

  This is not any hot path, so saving an ioctl shouldn't be a consideration.
  Only important consideration is a good API.  I may have lost context here,
  but the SHARE API is limited to the vfio fd.  The BIND API expects a new
  iommu object.  Are there other uses for this object?  Tom's current vfio
  driver exposes a dma mapping interface, would the iommu object expose
  one as well?  Current interface is device specific DMA interface for
  host device drivers typically mapping in-flight dma buffers, and IOMMU
  specific interface for assigned devices typically mapping entire virtual
  address space.
 
 Actually, it's a domain object - which may be usable among iommus (Joerg?).

Yes, this 'iommu' thing would be a domain object. But sharing among
iommus is not necessary because the fact that there are multiple iommus
in the system is hidden by the iommu drivers. This fact is not even
exposed by the iommu-api. This makes protection domains system global.

 However, you can't really do the dma mapping with just the domain because
 every device supports a different size address space as a master, i.e.,
 the dma_mask.

The dma_mask has to be handled by the device driver. With the
iommu-mapping interface the driver can specify the target io-address and
has to consider the dma_mask for that too.

 And I don't know how kvm would deal with devices with varying dma mask 
 support,
 or why they'd be in the same domain.

KVM does not care about these masks. This is the business of the guest
device drivers.

Joerg
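
To make the dma_mask point concrete: with the patch's VFIO_DMA_MAP_IOVA the
caller picks the iova, so it is the caller who must keep that iova under the
device's mask.  A sketch (struct vfio_dma_map and the ioctl come from Tom's
patch; the helper itself is invented):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* from the posted patch, not a mainline header */

static int map_for_device(int vfio_fd, void *buf, size_t len,
			   uint64_t iova, uint64_t dma_mask)
{
	struct vfio_dma_map dm = {
		.vaddr   = (uintptr_t)buf,
		.dmaaddr = iova,
		.size    = len,
		.rdwr    = 1,
	};

	/* The domain could hold this iova, but the device cannot master it. */
	if (iova + len - 1 > dma_mask)
		return -1;

	return ioctl(vfio_fd, VFIO_DMA_MAP_IOVA, &dm);
}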



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:

On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
   

On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
 

So what I suggested is failing any kind of access until iommu
is assigned.

   

So, the kernel driver must be aware of the iommu.  In which case it may
as well program it.
 

It's a kernel driver anyway. Point is that
the *device* driver is better off not programming iommu,
this way we do not need to reprogram it for each device.
   


The device driver is in userspace.  It can't program the iommu.  What 
the patch proposes is that userspace tells vfio about the needed 
mappings, and vfio programs the iommu.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Michael S. Tsirkin
On Tue, Jun 01, 2010 at 11:10:45AM +0300, Avi Kivity wrote:
 On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:
 On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:

 On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
  
 So what I suggested is failing any kind of access until iommu
 is assigned.


 So, the kernel driver must be aware of the iommu.  In which case it may
 as well program it.
  
 It's a kernel driver anyway. Point is that
 the *device* driver is better off not programming iommu,
 this way we do not need to reprogram it for each device.


 The device driver is in userspace.

I mean the kernel driver that grants userspace the access.

  It can't program the iommu.
 What  
 the patch proposes is that userspace tells vfio about the needed  
 mappings, and vfio programs the iommu.

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.

The natural way to do this is to have an iommu driver for programming
iommu.

This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace the access to device is called,
but this is only for security, there's no need to support programming
iommu there.

And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.

 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:



  It can't program the iommu.
What
the patch proposes is that userspace tells vfio about the needed
mappings, and vfio programs the iommu.
 

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.
   


  iommufd = open(/dev/iommu);
  ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
  ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

?

If so, I agree.


The natural way to do this is to have an iommu driver for programming
iommu.

This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace the access to device is called,
but this is only for security, there's no need to support programming
iommu there.

And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.
   


Since vfio would be the only driver, there would be no duplication.  But 
a separate object for the iommu mapping is a good thing.  Perhaps we can 
even share it with vhost (without actually using the mmu, since vhost is 
software only).
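
Spelled out, the flow sketched earlier in this message would look something
like this from userspace (a sketch only: /dev/iommu, IOMMUFD_ASSIGN_RANGE and
VFIO_SET_IOMMU are names proposed in this thread, and the request numbers and
argument layout are invented):

#include <fcntl.h>
#include <sys/ioctl.h>

#define IOMMUFD_ASSIGN_RANGE  0x4901	/* invented request numbers */
#define VFIO_SET_IOMMU        0x4902

struct iommu_range { unsigned long long vaddr, iova, size; };	/* invented layout */

int main(void)
{
	int iommufd = open("/dev/iommu", O_RDWR);	/* one object for the memory map */
	int vfio1 = open("/dev/vfio0", O_RDWR);
	int vfio2 = open("/dev/vfio1", O_RDWR);
	struct iommu_range r = { .vaddr = 0x600000000000ULL,
				 .iova = 0, .size = 1 << 20 };

	ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, &r);	/* program the mapping once... */
	ioctl(vfio1, VFIO_SET_IOMMU, iommufd);		/* ...then share it between devices */
	ioctl(vfio2, VFIO_SET_IOMMU, iommufd);
	return 0;
}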


--
error compiling committee.c: too many arguments to function



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Michael S. Tsirkin
On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
 On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:

   It can't program the iommu.
 What
 the patch proposes is that userspace tells vfio about the needed
 mappings, and vfio programs the iommu.
  
 There seems to be some misunderstanding.  The userspace interface
 proposed forces a separate domain per device and forces userspace to
 repeat iommu programming for each device.  We are better off sharing a
 domain between devices and programming the iommu once.


   iommufd = open(/dev/iommu);
   ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
   ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

 ?

Yes.

 If so, I agree.

Good.

 The natural way to do this is to have an iommu driver for programming
 iommu.

 This likely means we will have to pass the domain to 'vfio' or uio or
 whatever the driver that gives userspace the access to device is called,
 but this is only for security, there's no need to support programming
 iommu there.

 And using this design means the uio framework changes
 required would be minor, so we won't have to duplicate code.


 Since vfio would be the only driver, there would be no duplication.  But  
 a separate object for the iommu mapping is a good thing.  Perhaps we can  
 even share it with vhost (without actually using the mmu, since vhost is  
 software only).

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:



Since vfio would be the only driver, there would be no duplication.  But
a separate object for the iommu mapping is a good thing.  Perhaps we can
even share it with vhost (without actually using the mmu, since vhost is
software only).
 

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

   


So you'd just take the memory map and not pin anything.  This way you 
can reuse the memory map.


But no, it doesn't handle the dirty bitmap, so no go.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Tuesday 01 June 2010 03:46:51 am Michael S. Tsirkin wrote:
 On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
  On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
 
It can't program the iommu.
  What
  the patch proposes is that userspace tells vfio about the needed
  mappings, and vfio programs the iommu.
   
  There seems to be some misunderstanding.  The userspace interface
  proposed forces a separate domain per device and forces userspace to
  repeat iommu programming for each device.  We are better off sharing a
  domain between devices and programming the iommu once.
 
 
iommufd = open(/dev/iommu);
ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
 
  ?
 
 Yes.
 
  If so, I agree.
 
 Good.

I'm not really opposed to multiple devices per domain, but let me point out how I
ended up here.  First, the driver has two ways of mapping pages, one based on the
iommu api and one based on the dma_map_sg api.  With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers.  If there are multiple drivers
in the user level, do we not want the same isolation to apply to them?
Also, domains are not a very scarce resource - my little core i5 has 256,
and the Intel architecture goes to 64K.
And then there's the fact that it is possible to have multiple disjoint iommus
on a system, so it may not even be possible to bring 2 devices under one domain.

Given all that, I am inclined to leave it alone until someone has a real problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings.
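
For reference, the two kernel paths contrasted above look roughly like this
(a sketch against the 2.6.34-era interfaces; the iommu-api mapping call has
been renamed in later kernels, and error handling is reduced to the minimum):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/iommu.h>
#include <linux/errno.h>

/* Path 1: dma_map_sg() - addresses come out of the device's own per-device
 * domain; the caller gets no say in which domain is used. */
static int map_with_dma_api(struct device *dev, struct scatterlist *sg, int nents)
{
	return dma_map_sg(dev, sg, nents, DMA_BIDIRECTIONAL) ? 0 : -EIO;
}

/* Path 2: the iommu api - the caller allocates a domain explicitly, attaches
 * one or more devices to it, and chooses the iova itself. */
static int map_with_iommu_api(struct device *dev, unsigned long iova,
			      phys_addr_t paddr, size_t size)
{
	struct iommu_domain *dom = iommu_domain_alloc();
	int ret;

	if (!dom)
		return -ENOMEM;
	ret = iommu_attach_device(dom, dev);
	if (ret) {
		iommu_domain_free(dom);
		return ret;
	}
	return iommu_map_range(dom, iova, paddr, size, IOMMU_READ | IOMMU_WRITE);
}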


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Monday 31 May 2010 10:17:35 am Alan Cox wrote:
 
 Does look like it needs a locking audit, some memory and error checks
 reviewing and some further review of the ioctl security and
 overflows/trusted values.
Yes. Thanks for the detailed look.
 
 Rather a nice way of attacking the user space PCI problem.
And thanks for that!



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/02/2010 12:26 AM, Tom Lyon wrote:


I'm not really opposed to multiple devices per domain, but let me point out how 
I
ended up here.  First, the driver has two ways of mapping pages, one based on 
the
iommu api and one based on the dma_map_sg api.  With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers.  If there are multiple 
drivers
in the user level, do we not want the same isolation to apply to them?
   


In the case of kvm, we don't want isolation between devices, because 
that doesn't happen on real hardware.  So if the guest programs devices 
to dma to each other, we want that to succeed.



Also, domains are not a very scarce resource - my little core i5 has 256,
and the intel architecture goes to 64K.
   


But there is a 0.2% of mapped memory per domain cost for the page
tables (with 4 KiB pages and 8-byte entries, the leaf tables alone are
8/4096 ≈ 0.2% of mapped memory).  For the kvm use case, that could be
significant since a guest may have large amounts of memory and large
numbers of assigned devices.



And then there's the fact that it is possible to have multiple disjoint iommus 
on a system,
so it may not even be possible to bring 2 devices under one domain.
   


That's indeed a deficiency.


Given all that, I am inclined to leave it alone until someone has a real 
problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings
   


I think we do have a real problem (though a mild one).

The only issue I see with deferring the solution is that the API becomes 
gnarly; both the kernel and userspace will have to support both APIs 
forever.  Perhaps we can implement the new API but defer the actual 
sharing until later, don't know how much work this saves.  Or Alex/Chris 
can pitch in and help.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Alex Williamson
On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
 On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
 
It can't program the iommu.
  What
  the patch proposes is that userspace tells vfio about the needed
  mappings, and vfio programs the iommu.
   
  There seems to be some misunderstanding.  The userspace interface
  proposed forces a separate domain per device and forces userspace to
  repeat iommu programming for each device.  We are better off sharing a
  domain between devices and programming the iommu once.
 
 
iommufd = open(/dev/iommu);
ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

It seems part of the annoyance of the current KVM device assignment is
that we have multiple files open, we mmap here, read there, write over
there, maybe, if it's not emulated.  I quite like Tom's approach that we
have one-stop shopping with /dev/vfioN, including config space
emulation so each driver doesn't have to try to write their own.  So
continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
ioctl to vfio so that after we setup one device we can bind the next to
the same domain?

Alex



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Tuesday 01 June 2010 09:29:47 pm Alex Williamson wrote:
 On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
  On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
  
 It can't program the iommu.
   What
   the patch proposes is that userspace tells vfio about the needed
   mappings, and vfio programs the iommu.

   There seems to be some misunderstanding.  The userspace interface
   proposed forces a separate domain per device and forces userspace to
   repeat iommu programming for each device.  We are better off sharing a
   domain between devices and programming the iommu once.
  
  
 iommufd = open(/dev/iommu);
 ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
 ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
 
 It seems part of the annoyance of the current KVM device assignment is
 that we have multiple files open, we mmap here, read there, write over
 there, maybe, if it's not emulated.  I quite like Tom's approach that we
 have one-stop shopping with /dev/vfioN, including config space
 emulation so each driver doesn't have to try to write their own.  So
 continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
 ioctl to vfio so that after we setup one device we can bind the next to
 the same domain?

This is just what I was thinking.  But rather than a get/set, just use two fds.

ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure.  In addition, this is strictly upwards compatible with
what is there now, so maybe we can add it later.
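
In userspace that would look roughly like the following (VFIO_SET_DOMAIN is
the ioctl proposed in this message and does not exist; the request number is
invented, and map_buffers stands in for whatever VFIO_DMA_MAP_IOVA calls the
application needs):

#include <sys/ioctl.h>

#define VFIO_SET_DOMAIN  0x3b10		/* invented request number */

static void setup_two_devices(int vfio_fd1, int vfio_fd2,
			      void (*map_buffers)(int fd))
{
	if (ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2) == 0) {
		map_buffers(vfio_fd1);		/* shared domain: map once */
	} else {
		map_buffers(vfio_fd1);		/* e.g. devices behind different IOMMUs: */
		map_buffers(vfio_fd2);		/* fall back to duplicate mappings */
	}
}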




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/02/2010 07:59 AM, Tom Lyon wrote:


This is just what I was thinking.  But rather than a get/set, just use two fds.

ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
 prepared for failure.  In addition, this is strictly upwards compatible with
what is there now, so maybe we can add it later.

   


What happens if one of the fds is later closed?

I don't like this conceptually.  There is a 1:n relationship between the 
memory map and the devices.  Ignoring it will cause the API to have 
warts.  It's more straightforward to have an object to represent the 
memory mapping (and talk to the iommus), and have devices bind to this 
object.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
 On 06/02/2010 12:26 AM, Tom Lyon wrote:
 
 I'm not really opposed to multiple devices per domain, but let me point out 
 how I
 ended up here.  First, the driver has two ways of mapping pages, one based 
 on the
 iommu api and one based on the dma_map_sg api.  With the latter, the system
 already allocates a domain per device and there's no way to control it. This 
 was
 presumably done to help isolation between drivers.  If there are multiple 
 drivers
  in the user level, do we not want the same isolation to apply to them?
 
 In the case of kvm, we don't want isolation between devices, because
 that doesn't happen on real hardware.

Sure it does.  That's exactly what happens when there's an iommu
involved with bare metal.

 So if the guest programs
 devices to dma to each other, we want that to succeed.

And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).

That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain.  However, it can be useful to put each
device belonging to a guest in a unique domain.  Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.

 Also, domains are not a very scarce resource - my little core i5 has 256,
 and the intel architecture goes to 64K.
 
 But there is a 0.2% of mapped memory per domain cost for the page
 tables.  For the kvm use case, that could be significant since a
 guest may have large amounts of memory and large numbers of assigned
 devices.
 
 And then there's the fact that it is possible to have multiple disjoint 
 iommus on a system,
 so it may not even be possible to bring 2 devices under one domain.
 
 That's indeed a deficiency.

Not sure it's a deficiency.  Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping.  Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).

 Given all that, I am inclined to leave it alone until someone has a real 
 problem.
 Note that not sharing iommu domains doesn't mean you can't share device 
 memory,
 just that you have to do multiple mappings
 
 I think we do have a real problem (though a mild one).
 
 The only issue I see with deferring the solution is that the API
 becomes gnarly; both the kernel and userspace will have to support
 both APIs forever.  Perhaps we can implement the new API but defer
 the actual sharing until later, don't know how much work this saves.
 Or Alex/Chris can pitch in and help.

It really shouldn't be that complicated to create the API to allow for
flexible device <-> domain mappings, so I agree, makes sense to do it
right up front.

thanks,
-chris


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/02/2010 08:29 AM, Chris Wright wrote:

* Avi Kivity (a...@redhat.com) wrote:
   

On 06/02/2010 12:26 AM, Tom Lyon wrote:
 

I'm not really opposed to multiple devices per domain, but let me point out how 
I
ended up here.  First, the driver has two ways of mapping pages, one based on 
the
iommu api and one based on the dma_map_sg api.  With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers.  If there are multiple 
drivers
 in the user level, do we not want the same isolation to apply to them?
   

In the case of kvm, we don't want isolation between devices, because
that doesn't happen on real hardware.
 

Sure it does.  That's exactly what happens when there's an iommu
involved with bare metal.
   


But we are emulating a machine without an iommu.

When we emulate a machine with an iommu, then yes, we'll want to use as 
many domains as the guest does.



So if the guest programs
devices to dma to each other, we want that to succeed.
 

And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).

That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain.  However, it can be useful to put each
device belonging to a guest in a unique domain.  Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.
   


Right, we need to keep flexibility.


And then there's the fact that it is possible to have multiple disjoint iommus 
on a system,
so it may not even be possible to bring 2 devices under one domain.
   

That's indeed a deficiency.
 

Not sure it's a deficiency.  Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping.  Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).
   


Sharing the page tables is just an optimization; I was worried about
devices in separate domains not talking to each other.  If ATS fixes
that, great.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-31 Thread Avi Kivity

On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:


So what I suggested is failing any kind of access until iommu
is assigned.
   


So, the kernel driver must be aware of the iommu.  In which case it may 
as well program it.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-31 Thread Alan Cox

 +/*
 + * Map usr buffer at specific IO virtual address
 + */
 +static int vfio_dma_map_iova(

 + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);

Not good at that point. I think you need to allocate it first, error if
it can't be allocated and then do the work and free it on error ?
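
A sketch of the ordering being asked for (a fragment, not a complete
function; the variable names are intended to mirror the quoted driver code):

	/* Allocate the tracking structure before doing any real work... */
	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
	if (!mlp)
		return -ENOMEM;

	/* ...then pin the pages, and unwind the allocation on failure. */
	ret = get_user_pages_fast(vaddr, npage, rdwr, pages);
	if (ret != npage) {
		for (i = 0; i < ret; i++)
			put_page(pages[i]);
		kfree(mlp);
		return -EFAULT;
	}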


 + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
 + mlp->pages = pages;

Ditto


 +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
 +{
 + struct pci_dev *pdev = vdev->pdev;
 + struct eventfd_ctx *ctx;
 + int ret = 0;
 + int i;
 + int fd;
 +
 + vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
 + GFP_KERNEL);
 + vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
 + GFP_KERNEL);

These don't seem to get freed on the error path - or indeed protected
against being allocated twice (eg two parallel ioctls ?)
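
One way to address both points (a fragment; vdev->igate is an invented
per-device mutex, and the follow-on MSI-X setup is elided):

	mutex_lock(&vdev->igate);		/* serialize racing ioctls */
	vdev->msix = kzalloc(nvec * sizeof(struct msix_entry), GFP_KERNEL);
	vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *), GFP_KERNEL);
	if (!vdev->msix || !vdev->ev_msix) {
		kfree(vdev->msix);
		kfree(vdev->ev_msix);
		vdev->msix = NULL;
		vdev->ev_msix = NULL;
		mutex_unlock(&vdev->igate);
		return -ENOMEM;
	}
	/* ... fill the vectors, pci_enable_msix(), hook up the eventfds ... */
	mutex_unlock(&vdev->igate);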



 + case VFIO_DMA_MAP_ANYWHERE:
 + case VFIO_DMA_MAP_IOVA:
 + if (copy_from_user(&dm, uarg, sizeof dm))
 + return -EFAULT;
 + ret = vfio_dma_map_common(listener, cmd, &dm);
 + if (!ret && copy_to_user(uarg, &dm, sizeof dm))

So the vfio_dma_map is untrusted. That seems to be checked ok later but
the dma_map_common code then plays in current->mm-> without apparently
holding any locks to stop the values getting corrupted by a parallel
mlock ?

Actually no I take that back

 dmp->size is 64bit

So npage can end up with values like 0x and cause 32bit
boxes to go kerblam
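
A bounds check before the page count is derived would close that hole, e.g.
(a fragment; VFIO_MAX_MAP_SIZE is an invented, driver-chosen cap):

	/* dmp->size is a __u64 straight from userspace: validate it before
	 * turning it into a page count that must fit in an unsigned long. */
	if (!dmp->size || (dmp->size & ~PAGE_MASK) ||
	    dmp->size > VFIO_MAX_MAP_SIZE)
		return -EINVAL;
	npage = dmp->size >> PAGE_SHIFT;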

 +
 + case VFIO_EVENTFD_IRQ:
 + if (copy_from_user(&fd, uarg, sizeof fd))
 + return -EFAULT;
 + if (vdev->ev_irq)
 + eventfd_ctx_put(vdev->ev_irq);

These paths need locking - suppose two EVENTFD irq ioctls occur at once
(in general these paths seem not to be covered)
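
A sketch of the same path with locking (a fragment; vdev->igate is an
invented per-device mutex and ctx is a local struct eventfd_ctx pointer):

	case VFIO_EVENTFD_IRQ:
		if (copy_from_user(&fd, uarg, sizeof fd))
			return -EFAULT;
		ctx = eventfd_ctx_fdget(fd);
		if (IS_ERR(ctx))
			return PTR_ERR(ctx);
		mutex_lock(&vdev->igate);	/* no racing ioctl can drop ctx twice */
		if (vdev->ev_irq)
			eventfd_ctx_put(vdev->ev_irq);
		vdev->ev_irq = ctx;
		mutex_unlock(&vdev->igate);
		break;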


 + case VFIO_BAR_LEN:
 + if (copy_from_user(&bar, uarg, sizeof bar))
 + return -EFAULT;
 + if (bar < 0 || bar > PCI_ROM_RESOURCE)
 + return -EINVAL;
 + bar = pci_resource_len(pdev, bar);
 + if (copy_to_user(uarg, &bar, sizeof bar))
 + return -EFAULT;

How does this all work out if the device is a bridge ?

 + pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &line);
 + if (line == 0)
 + goto out;

That may produce some interestingly wrong answers. Firstly the platform
has interrupt abstraction so dev->irq may not match PCI_INTERRUPT_LINE,
secondly you have devices that report their IRQ via other paths as per
spec (notably IDE class devices in non-native mode)

So that would also want extra checks.
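
A safer starting point might be (a fragment; it checks the standard
PCI_INTERRUPT_PIN register and then trusts the irq the platform actually
assigned, rather than PCI_INTERRUPT_LINE):

	u8 pin;

	pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
	if (!pin || !pdev->irq)
		goto out;		/* device has no INTx, or none was routed */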


 + pci_read_config_word(pdev, PCI_COMMAND, &orig);
 + ret = orig & PCI_COMMAND_MASTER;
 + if (!ret) {
 + new = orig | PCI_COMMAND_MASTER;
 + pci_write_config_word(pdev, PCI_COMMAND, new);
 + pci_read_config_word(pdev, PCI_COMMAND, &new);
 + ret = new & PCI_COMMAND_MASTER;
 + pci_write_config_word(pdev, PCI_COMMAND, orig);

The master bit on some devices can be turned on but not off. Not sure it
matters here.

 + vdev->pdev = pdev;

Probably best to take/drop a reference. Not needed if you can prove your
last use is before the end of the remove path though.


Does look like it needs a locking audit, some memory and error checks
reviewing and some further review of the ioctl security and
overflows/trusted values.

Rather a nice way of attacking the user space PCI problem.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-31 Thread Michael S. Tsirkin
On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
 On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:

 So what I suggested is failing any kind of access until iommu
 is assigned.


 So, the kernel driver must be aware of the iommu.  In which case it may  
 as well program it.

It's a kernel driver anyway. Point is that
the *device* driver is better off not programming iommu,
this way we do not need to reprogram it for each device.

 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Avi Kivity

On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:

On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
   

The VFIO driver is used to allow privileged AND non-privileged processes to
implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
devices.
Signed-off-by: Tom Lyon <p...@cisco.com>
---
This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.
 

IMO this was because this driver does two things: programming iommu and
handling interrupts. uio does interrupt handling.
We could have moved iommu / DMA programming to
a separate driver, and have uio work with it.
This would solve the limitation of the current driver
that it needs an iommu domain per device.
   


How do we enforce security then?  We need to ensure that unprivileged 
users can only use the device with an iommu.



[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table-
driven code to determine what's safe to write and what is not.
 

I don't really see why this is helpful: a driver written correctly
will not access these addresses, and we need an iommu anyway to protect
us against drivers.
   


Haven't reviewed the code (yet) but things like the BARs, MSI, and 
interrupt disable need to be protected from the guest regardless of the 
iommu.



--
error compiling committee.c: too many arguments to function



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Michael S. Tsirkin
On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:
 On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
 On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:

 The VFIO driver is used to allow privileged AND non-privileged processes 
 to
 implement user-level device drivers for any well-behaved PCI, PCI-X, and 
 PCIe
 devices.
 Signed-off-by: Tom Lyon <p...@cisco.com>
 ---
 This patch is the evolution of code which was first proposed as a patch to
 uio/uio_pci_generic, then as a more generic uio patch. Now it is taken 
 entirely
 out of the uio framework, and things seem much cleaner. Of course, there is
 a lot of functional overlap with uio, but the previous version just seemed
 like a giant mode switch in the uio code that did not lead to clarity for
 either the new or old code.
  
 IMO this was because this driver does two things: programming iommu and
 handling interrupts. uio does interrupt handling.
 We could have moved iommu / DMA programming to
 a separate driver, and have uio work with it.
 This would solve the limitation of the current driver
 that it needs an iommu domain per device.


 How do we enforce security then?  We need to ensure that unprivileged  
 users can only use the device with an iommu.

Force assigning to iommu before we allow any other operation?

 [a pony for avi...]
 The major new functionality in this version is the ability to deal with
 PCI config space accesses (through read & write calls) - but includes table-
 driven code to determine what's safe to write and what is not.
  
 I don't really see why this is helpful: a driver written correctly
 will not access these addresses, and we need an iommu anyway to protect
 us against drivers.


 Haven't reviewed the code (yet) but things like the BARs, MSI, and  
 interrupt disable need to be protected from the guest regardless of the  
 iommu.

Yes but userspace can do this. As long as userspace can not
crash the kernel, no reason to put this policy into kernel.


 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Avi Kivity

On 05/29/2010 02:07 AM, Tom Lyon wrote:

The VFIO driver is used to allow privileged AND non-privileged processes to
implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
devices.
   



+
+Why is this interesting?  Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non tcp/ip based) and
+in compute accelerators - i.e., array processors, FPGA processors, etc.
+Previous to the VFIO drivers these apps would need either a kernel-level
+driver (with corrsponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is well-behaved enough for this
+to be safe.
   




+
+Any SR-IOV virtual function meets the VFIO definition of well-behaved, but
+there are many other non-IOV PCI devices which also meet the definition.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+  a multiple of the system page size.
   


You can relax this.
 - smaller than page size can be mapped if the rest of the page is unused
and if the platform tolerates writes to unused areas

 - if the rest of the page is used, we can relocate the BAR
 - otherwise, we can prevent mmap() but still allow mediated access via 
a syscall



+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+  write the MSI-X vector area.
   


We can allow mediated access (that's what qemu-kvm does).  I guess the 
ioctls for setting up msi interrupts are equivalent to this mediated access.


(later I see you do provide mediated access via pwrite - please confirm)


+- The device must not use the PCI configuration space in any non-standard way,
+  i.e., the user level driver will be permitted only to read and write standard
+  fields of the PCI config space, and only if those fields cannot cause harm to
+  the system. In addition, some fields are virtualized, so that the user
+  driver can read/write them like a kernel driver, but they do not affect the
+  real device.
   


What's wrong with nonstandard fields?


+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device.  When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are bound to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+   cd /sys/bus/pci/drivers/<drivername>
+   echo 0000:06:02.00 > unbind
+(The 0000:06:02.00 is a fully qualified PCI device name - different for each
+device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+   echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfioN entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - chmod 666 /dev/vfio0.
   


What if I have several such devices?  Isn't it better to bind by topology
(device address)?



+
+Reads  Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only 1 file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset.


My preference would be one fd per BAR, but that's a matter of personal 
taste.



For instance, the following implements a
+write to the PCI config space:
+
+   #include <linux/vfio.h>
+   void pci_write_config_word(int pci_fd, u16 off, u16 wd)
+   {
+   off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+   if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+   perror("pwrite config_dword");
+   }
+
   


Nice, has the benefit of avoiding endianness issues in the interface.
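
A matching read helper would follow the same conventions (not part of the
posted patch, just the obvious counterpart; the name is chosen to avoid
clashing with the kernel helper):

	u16 pci_read_config_word_user(int pci_fd, u16 off)
	{
		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
		u16 wd = 0;

		if (pread(pci_fd, &wd, 2, cfg_off) != 2)
			perror("pread config_word");
		return wd;
	}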


+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert bar numbers to file offsets and vice-versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must create one or more event descriptors and pass them to the vfio driver
+via ioctls to arrange for the interrupt mapping:
+1.
+   efd = eventfd(0, 0);
+   ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+   This provides an eventfd for traditional IRQ interrupts.
+   IRQs will be 

Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Michael S. Tsirkin
On Sun, May 30, 2010 at 04:01:53PM +0300, Avi Kivity wrote:
 On 05/30/2010 03:49 PM, Michael S. Tsirkin wrote:
 On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:

 On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
  
 On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:


 The VFIO driver is used to allow privileged AND non-privileged 
 processes to
 implement user-level device drivers for any well-behaved PCI, PCI-X, and 
 PCIe
 devices.
   Signed-off-by: Tom Lyon <p...@cisco.com>
 ---
 This patch is the evolution of code which was first proposed as a patch to
 uio/uio_pci_generic, then as a more generic uio patch. Now it is taken 
 entirely
 out of the uio framework, and things seem much cleaner. Of course, there 
 is
 a lot of functional overlap with uio, but the previous version just seemed
 like a giant mode switch in the uio code that did not lead to clarity for
 either the new or old code.

  
 IMO this was because this driver does two things: programming iommu and
 handling interrupts. uio does interrupt handling.
 We could have moved iommu / DMA programming to
 a separate driver, and have uio work with it.
 This would solve the limitation of the current driver
 that it needs an iommu domain per device.


 How do we enforce security then?  We need to ensure that unprivileged
 users can only use the device with an iommu.
  
 Force assigning to iommu before we allow any other operation?


 That means the driver must be aware of the iommu.

The userspace driver? Yes. And it is a good thing to be explicit
there anyway, since this lets userspace map a non-contiguous
virtual address list into a contiguous bus address range.
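
For example, three scattered user buffers can be presented to the device as
one contiguous bus range by issuing the patch's VFIO_DMA_MAP_IOVA once per
buffer (a sketch; the base iova is arbitrary and error handling is omitted):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* from the posted patch */

static void map_contiguous(int vfio_fd, void *bufs[3], size_t len)
{
	unsigned long long iova = 0x100000;	/* base chosen by the caller */
	struct vfio_dma_map dm;
	int i;

	for (i = 0; i < 3; i++) {
		dm.vaddr   = (uintptr_t)bufs[i];
		dm.dmaaddr = iova + (unsigned long long)i * len;	/* contiguous iovas */
		dm.size    = len;
		dm.rdwr    = 1;
		ioctl(vfio_fd, VFIO_DMA_MAP_IOVA, &dm);
	}
}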

 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-29 Thread Arnd Bergmann
On Saturday 29 May 2010, Tom Lyon wrote:
 +/*
 + * Structure for DMA mapping of user buffers
 + * vaddr, dmaaddr, and size must all be page aligned
 + * buffer may only be larger than 1 page if (a) there is
 + * an iommu in the system, or (b) buffer is part of a huge page
 + */
 +struct vfio_dma_map {
 +   __u64   vaddr;  /* process virtual addr */
 +   __u64   dmaaddr;/* desired and/or returned dma address */
 +   __u64   size;   /* size in bytes */
 +   int rdwr;   /* bool: 0 for r/o; 1 for r/w */
 +};

Please add a 32 bit padding word at the end of this, otherwise the
size of the data structure is incompatible between 32 bit x86 applications
and 64 bit kernels.

Arnd


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-29 Thread Avi Kivity

On 05/29/2010 02:55 PM, Arnd Bergmann wrote:

On Saturday 29 May 2010, Tom Lyon wrote:
   

+/*
+ * Structure for DMA mapping of user buffers
+ * vaddr, dmaaddr, and size must all be page aligned
+ * buffer may only be larger than 1 page if (a) there is
+ * an iommu in the system, or (b) buffer is part of a huge page
+ */
+struct vfio_dma_map {
+   __u64   vaddr;  /* process virtual addr */
+   __u64   dmaaddr;/* desired and/or returned dma address */
+   __u64   size;   /* size in bytes */
+   int rdwr;   /* bool: 0 for r/o; 1 for r/w */
+};
 

Please add a 32 bit padding word at the end of this, otherwise the
size of the data structure is incompatible between 32 bit x86 applications
and 64 bit kernels.
   


Might as well call it 'flags' and reserve a bit more space (keeping 
64-bit aligned size) for future expansion.


rdwr can be folded into it.
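
Putting Arnd's and Avi's suggestions together, the structure might end up as
(a sketch; the flag name is invented):

struct vfio_dma_map {
	__u64	vaddr;		/* process virtual addr */
	__u64	dmaaddr;	/* desired and/or returned dma address */
	__u64	size;		/* size in bytes */
	__u64	flags;		/* replaces 'rdwr'; keeps the layout identical
				 * for 32- and 64-bit userspace */
};
#define VFIO_DMA_MAP_FLAG_WRITE	(1ULL << 0)	/* invented name: request a writable mapping */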

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-28 Thread Randy Dunlap
Hi,


On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

Missing diffstat -p1 -w 70:

 Documentation/vfio.txt |  176 
 MAINTAINERS|7 
 drivers/Kconfig|2 
 drivers/Makefile   |1 
 drivers/vfio/Kconfig   |9 
 drivers/vfio/Makefile  |5 
 drivers/vfio/vfio_dma.c|  372 ++
 drivers/vfio/vfio_intrs.c  |  189 +
 drivers/vfio/vfio_main.c   |  627 +++
 drivers/vfio/vfio_pci_config.c |  554 +++
 drivers/vfio/vfio_rdwr.c   |  147 +++
 drivers/vfio/vfio_sysfs.c  |  153 +++
 include/linux/vfio.h   |  193 +
 13 files changed, 2435 insertions(+)


which shows that the patch is missing an update to
Documentation/ioctl/ioctl-number.txt for ioctl code ';'.  Please add that.


 diff -uprN linux-2.6.34/drivers/vfio/Kconfig 
 vfio-linux-2.6.34/drivers/vfio/Kconfig
 --- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.0 -0800
 +++ vfio-linux-2.6.34/drivers/vfio/Kconfig2010-05-27 17:07:25.0 
 -0700
 @@ -0,0 +1,9 @@
 +menuconfig VFIO
 + tristate Non-Priv User Space PCI drivers

  Non-privileged

 + depends on PCI
 + help
 +   Driver to allow advanced user space drivers for PCI, PCI-X,
 +   and PCIe devices.  Requires IOMMU to allow non-privilged

 non-privileged

 +   processes to directly control the PCI devices.
 +
 +   If you don't know what to do here, say N.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-28 Thread Randy Dunlap
On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

 diff -uprN linux-2.6.34/Documentation/vfio.txt 
 vfio-linux-2.6.34/Documentation/vfio.txt
 --- linux-2.6.34/Documentation/vfio.txt   1969-12-31 16:00:00.0 
 -0800
 +++ vfio-linux-2.6.34/Documentation/vfio.txt  2010-05-28 14:03:05.0 
 -0700
 @@ -0,0 +1,176 @@
 +---
 +The VFIO driver is used to allow privileged AND non-privileged processes to
 +implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
 +devices.
 +
 +Why is this interesting?  Some applications, especially in the high 
 performance
 +computing field, need access to hardware functions with as little overhead as
 +possible. Examples are in network adapters (typically non tcp/ip based) and

 non-TCP/IP-based)

 +in compute accelerators - i.e., array processors, FPGA processors, etc.
 +Previous to the VFIO drivers these apps would need either a kernel-level
 +driver (with corrsponding overheads), or else root permissions to directly

corresponding

 +access the hardware. The VFIO driver allows generic access to the hardware
 +from non-privileged apps IF the hardware is well-behaved enough for this
 +to be safe.
 +
 +While there have long been ways to implement user-level drivers using 
 specific
 +corresponding drivers in the kernel, it was not until the introduction of the
 +UIO driver framework, and the uio_pci_generic driver that one could have a
 +generic kernel component supporting many types of user level drivers. 
 However,
 +even with the uio_pci_generic driver, processes implementing the user level
 +drivers had to be trusted - they could do dangerous manipulation of DMA
 +addresses and were required to be root to write PCI configuration space
 +registers.
 +
 +Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
 +new hardware capabilities which the VFIO solution exploits to allow non-root
 +user level drivers. The main role of the IOMMU is to ensure that DMA accesses
 +from devices go only to the appropriate memory locations, this allows VFIO to

  locations;

 +ensure that user level drivers do not corrupt inappropriate memory.  PCI I/O
 +virtualization (SR-IOV) was defined to allow pass-through of virtual 
 devices
 +to guest virtual machines. VFIO in essence implements pass-through of devices
 +to user processes, not virtual machines.  SR-IOV devices implement a
 +traditional PCI device (the physical function) and a dynamic number of 
 special
 +PCI devices (virtual functions) whose feature set is somewhat restricted - in
 +order to allow the operating system or virtual machine monitor to ensure the
 +safe operation of the system.
 +
 +Any SR-IOV virtual function meets the VFIO definition of well-behaved, but
 +there are many other non-IOV PCI devices which also meet the definition.
 +Elements of this definition are:
 +- The size of any memory BARs to be mmap'ed into the user process space must 
 be
 +  a multiple of the system page size.
 +- If MSI-X interrupts are used, the device driver must not attempt to mmap or
 +  write the MSI-X vector area.
 +- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
 +  revision 2.3 to allow its interrupts to be masked in a generic way.
 +- The device must not use the PCI configuration space in any non-standard 
 way,
 +  i.e., the user level driver will be permitted only to read and write 
 standard
 +  fields of the PCI config space, and only if those fields cannot cause harm 
 to
 +  the system. In addition, some fields are virtualized, so that the user
 +  driver can read/write them like a kernel driver, but they do not affect the
 +  real device.
 +- For now, there is no support for user access to the PCIe and PCI-X extended
 +  capabilities configuration space.
 +
 +Even with these restrictions, there are bound to be devices which are unsafe
 +for user level use - it is still up to the system admin to decide whether to
 +grant access to the device.  When the vfio module is loaded, it will have
 +access to no devices until the desired PCI devices are bound to the driver.
 +First, make sure the devices are not bound to another kernel driver. You can
 +unload that driver if you wish to unbind all its devices, or else enter the
 +driver's sysfs directory, and unbind a specific device:
 + cd /sys/bus/pci/drivers/<drivername>
 + echo 0000:06:02.00 > unbind
 +(The 0000:06:02.00 is a fully qualified PCI device name - different for each
 +device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
 +write the PCI device type of the target device to the new_id file:
 + echo 8086 10ca > new_id
 +(8086 10ca are the vendor and device type for the Intel 82576 virtual 
 function
 +devices). A /dev/vfioN entry will be created for each device bound.