Re: kvm PCI assignment VFIO ramblings

2011-08-30 Thread Joerg Roedel
On Fri, Aug 26, 2011 at 12:04:22PM -0600, Alex Williamson wrote:
 On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote:

  If we really expect segment numbers that need the full 16 bit then this
  would be the way to go. Otherwise I would prefer returning the group-id
  directly and partition the group-id space for the error values (s32 with
  negative numbers being errors).
 
 It's unlikely to have segments using the top bit, but it would be broken
 for an iommu driver to define its group numbers using pci s:b:d.f if we
 don't have that bit available.  Ben/David, do PEs have an identifier of
 a convenient size?  I'd guess any hardware based identifier is going to
 use a full unsigned bit width.

Okay, if we want to go the secure way I am fine with the int *group
parameter. Another option is to just return u64 and use the extended
number space for errors. But that is even worse as an interface, I
think.
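
A minimal sketch of the interface shape being discussed here, with the group
number returned through the pointer and the return value reserved for errors.
The device_group callback it assumes is illustrative only, not part of the
posted patch:

int iommu_device_group(struct device *dev, unsigned int *groupid)
{
        /* hypothetical callback; drivers without grouping leave it unset */
        if (iommu_ops && iommu_ops->device_group)
                return iommu_ops->device_group(dev, groupid);

        return -ENODEV;         /* device is not behind an IOMMU */
}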

Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-30 Thread Joerg Roedel
On Sun, Aug 28, 2011 at 05:04:32PM +0300, Avi Kivity wrote:
 On 08/28/2011 04:56 PM, Joerg Roedel wrote:

 This can't be secured by a lock, because it introduces a potential
 A->B/B->A (ABBA) lock problem when two processes try to take each other's mm.
 It could probably be solved by a task->real_mm pointer; haven't thought
 about this yet...


 Or a workqueue -  you get a kernel thread context with a bit of boilerplate.

Right, a workqueue might do the trick. We'll evaluate that. Thanks for
the idea :)

Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-29 Thread David Gibson
On Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote:
[snip]
 Yes.  In essence, I'd rather not have to run any other admin processes.
 Doing things programmatically, on the fly, from each process, is the
 cleanest model right now.

The persistent group model doesn't necessarily prevent that.
There's no reason your program can't use the administrative interface
as well as the use interface, and I don't see that making the admin
interface separate and persistent makes this any harder.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: kvm PCI assignment VFIO ramblings

2011-08-28 Thread Avi Kivity

On 08/26/2011 12:24 PM, Roedel, Joerg wrote:


  As I see it there are two options: (a) make subsequent accesses from
  userspace or the guest result in either a SIGBUS that userspace must
  either deal with or die, or (b) replace the mapping with a dummy RO
  mapping containing 0xff, with any trapped writes emulated as nops.

The biggest problem with this approach is that it has to happen in the
context of the given process. Linux can't really modify an mm which
belongs to another context in a safe way.



Is use_mm() insufficient?

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-28 Thread Joerg Roedel
On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
 On 08/26/2011 12:24 PM, Roedel, Joerg wrote:

 The biggest problem with this approach is that it has to happen in the
 context of the given process. Linux can't really modify an mm which
 belongs to another context in a safe way.


 Is use_mm() insufficient?

Yes, it introduces a set of race conditions when a process that already
has an mm wants to take over another process's mm temporarily (and when
use_mm is modified to actually provide this functionality). It is only
safe when used from kernel-thread context.

One example:

Process A           Process B                     Process C
    .                   .                             .
    .               <-- takes A->mm                   .
    .                   and assigns it as B->mm       .
    .                   .                         <-- wants to take
    .                   .                             B->mm, but gets
    .                   .                             A->mm now

This can't be secured by a lock, because it introduces a potential
A->B/B->A (ABBA) lock problem when two processes try to take each other's mm.
It could probably be solved by a task->real_mm pointer; haven't thought
about this yet...

Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-28 Thread Avi Kivity

On 08/28/2011 04:56 PM, Joerg Roedel wrote:

On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
  On 08/26/2011 12:24 PM, Roedel, Joerg wrote:

  The biggest problem with this approach is that it has to happen in the
  context of the given process. Linux can't really modify an mm which
  belongs to another context in a safe way.


  Is use_mm() insufficient?

Yes, it introduces a set of race conditions when a process that already
has an mm wants to take over another process's mm temporarily (and when
use_mm is modified to actually provide this functionality). It is only
safe when used from kernel-thread context.

One example:

Process A           Process B                     Process C
    .                   .                             .
    .               <-- takes A->mm                   .
    .                   and assigns it as B->mm       .
    .                   .                         <-- wants to take
    .                   .                             B->mm, but gets
    .                   .                             A->mm now


Good catch.



This can't be secured by a lock, because it introduces a potential
A->B/B->A (ABBA) lock problem when two processes try to take each other's mm.
It could probably be solved by a task->real_mm pointer; haven't thought
about this yet...



Or a workqueue -  you get a kernel thread context with a bit of boilerplate.
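
A minimal sketch of that pattern, assuming the mm-modifying work is pushed to
a work item so that use_mm()/unuse_mm() run from a kworker (a kernel thread).
The struct mm_work and the copy_from_user() placeholder are illustrative; the
real work would be tearing down or replacing the MMIO mapping:

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <linux/mmu_context.h>
#include <linux/uaccess.h>

struct mm_work {
        struct work_struct work;
        struct mm_struct *mm;           /* target mm, pinned via get_task_mm() */
        void __user *uaddr;             /* some address in the target process */
};

static void mm_work_fn(struct work_struct *work)
{
        struct mm_work *mw = container_of(work, struct mm_work, work);
        unsigned long val;

        use_mm(mw->mm);                 /* safe here: kworkers have no mm of their own */
        if (copy_from_user(&val, mw->uaddr, sizeof(val)))
                pr_warn("mm_work: target address not accessible\n");
        unuse_mm(mw->mm);

        mmput(mw->mm);                  /* drop the reference taken when queued */
        kfree(mw);
}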

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Roedel, Joerg
On Fri, Aug 26, 2011 at 12:24:23AM -0400, David Gibson wrote:
 On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
  On 25.08.2011, at 07:31, Roedel, Joerg wrote:

   For mmio we could stop the guest and replace the mmio region with a
   region that is filled with 0xff, no?
  
  Sure, but that happens in user space. The question is how does
  kernel space enforce an MMIO region to not be mapped after the
  hotplug event occurred? Keep in mind that user space is pretty much
  untrusted here - it doesn't have to be QEMU. It could just as well
  be a generic user space driver. And that can just ignore hotplug
  events.
 
 We're saying you hard yank the mapping from the userspace process.
 That is, you invalidate all its PTEs mapping the MMIO space, and don't
 let it fault them back in.
 
 As I see it there are two options: (a) make subsequent accesses from
 userspace or the guest result in either a SIGBUS that userspace must
 either deal with or die, or (b) replace the mapping with a dummy RO
 mapping containing 0xff, with any trapped writes emulated as nops.

The biggest problem with this approach is that it has to happen in the
context of the given process. Linux can't really modify an mm which
belongs to another context in a safe way.

The more I think about this, the more I come to the conclusion that it would be
best to just kill the process accessing the device if it is manually
de-assigned from vfio. It should be a non-standard path anyway so it
doesn't make a lot of sense to implement complicated handling semantics
for it, no?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Roedel, Joerg
On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
 On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
  On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
   On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
  
I don't see a reason to make this meta-grouping static. It would harm
flexibility on x86. I think it makes things easier on power but there
are options on that platform to get the dynamic solution too.
   
   I think several people are misreading what Ben means by static.  I
   would prefer to say 'persistent', in that the meta-groups' lifetime is
   not tied to an fd, but they can be freely created, altered and removed
   during runtime.
  
  Even if it can be altered at runtime, from a usability perspective it is
  certainly the best to handle these groups directly in qemu. Or are there
  strong reasons to do it somewhere else?
 
 Funny, Ben and I think usability demands it be the other way around.

The reason is that you mean the usability for the programmer and I mean
it for the actual user of qemu :)

 If the meta-groups are transient - that is lifetime tied to an fd -
 then any program that wants to use meta-groups *must* know the
 interfaces for creating one, whatever they are.
 
 But if they're persistent, the admin can use other tools to create the
 meta-group then just hand it to a program to use, since the interfaces
 for _using_ a meta-group are identical to those for an atomic group.
 
 This doesn't preclude a program from being meta-group aware, and
 creating its own if it wants to, of course.  My guess is that qemu
 would not want to build its own meta-groups, but libvirt probably
 would.

Doing it in libvirt makes it really hard for a plain user of qemu to
assign more than one device to a guest. What I want is that a user just
types

qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...

and it just works. Qemu creates the meta-groups and they are
automatically destroyed when qemu exits. That the programs are not aware
of meta-groups is not a big problem because all software using vfio
still needs to be written :)

Btw, with this concept the programmer can still decide to not use
meta-groups and just multiplex the mappings to all open device-fds it
uses.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Alexander Graf

On 26.08.2011, at 04:33, Roedel, Joerg wrote:

 On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
 On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
 On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
 On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
 
 I don't see a reason to make this meta-grouping static. It would harm
 flexibility on x86. I think it makes things easier on power but there
 are options on that platform to get the dynamic solution too.
 
 I think several people are misreading what Ben means by static.  I
 would prefer to say 'persistent', in that the meta-groups' lifetime is
 not tied to an fd, but they can be freely created, altered and removed
 during runtime.
 
 Even if it can be altered at runtime, from a usability perspective it is
 certainly the best to handle these groups directly in qemu. Or are there
 strong reasons to do it somewhere else?
 
 Funny, Ben and I think usability demands it be the other way around.
 
 The reason is that you mean the usability for the programmer and I mean
 it for the actual user of qemu :)

No, we mean the actual user of qemu. The reason being that making a device 
available for any user space application is an administrative task.

Forget the KVM case for a moment and think of a user space device driver. I as 
a user am not root. But I as a user when having access to /dev/vfioX want to be 
able to access the device and manage it - and only it. The admin of that box 
needs to set it up properly for me to be able to access it.

So having two steps is really the correct way to go:

  * create VFIO group
  * use VFIO group

because the two are done by completely different users. It's similar to how 
tun/tap works in Linux too. Of course nothing keeps you from also creating a 
group on the fly, but it shouldn't be the only interface available. The 
persistent setup is definitely more useful.

 
 If the meta-groups are transient - that is lifetime tied to an fd -
 then any program that wants to use meta-groups *must* know the
 interfaces for creating one, whatever they are.
 
 But if they're persistent, the admin can use other tools to create the
 meta-group then just hand it to a program to use, since the interfaces
 for _using_ a meta-group are identical to those for an atomic group.
 
 This doesn't preclude a program from being meta-group aware, and
 creating its own if it wants to, of course.  My guess is that qemu
 would not want to build its own meta-groups, but libvirt probably
 would.
 
 Doing it in libvirt makes it really hard for a plain user of qemu to
 assign more than one device to a guest. What I want is that a user just
 types
 
   qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...
 
 and it just works. Qemu creates the meta-groups and they are
 automatically destroyed when qemu exits. That the programs are not aware
 of meta-groups is not a big problem because all software using vfio
 still needs to be written :)
 
 Btw, with this concept the programmer can still decide to not use
 meta-groups and just multiplex the mappings to all open device-fds it
 uses.

What I want to see is:

  # vfio-create 00:01.0
/dev/vfio0
  # vfio-create -a /dev/vfio0 00:02.0
/dev/vfio0

  $ qemu -vfio dev=/dev/vfio0,id=vfio0 -device vfio,vfio=vfio0.0 -device 
vfio,vfio=vfio0.1


Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Joerg Roedel
On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote:
 On 26.08.2011, at 04:33, Roedel, Joerg wrote:
  
  The reason is that you mean the usability for the programmer and I mean
  it for the actual user of qemu :)
 
 No, we mean the actual user of qemu. The reason being that making a
 device available for any user space application is an administrative
 task.

 Forget the KVM case for a moment and think of a user space device
 driver. I as a user am not root. But I as a user when having access to
 /dev/vfioX want to be able to access the device and manage it - and
 only it. The admin of that box needs to set it up properly for me to
 be able to access it.

Right, and that task is being performed by attaching the device(s) in
question to the vfio driver. The rights-management happens on the
/dev/vfio/$group file.

 So having two steps is really the correct way to go:
 
   * create VFIO group
   * use VFIO group
 
 because the two are done by completely different users. It's similar
 to how tun/tap works in Linux too. Of course nothing keeps you from
 also creating a group on the fly, but it shouldn't be the only
 interface available. The persistent setup is definitely more useful.

I see the use-case. But to make it as easy as possible for the end-user
we can do both.

So the user (of qemu, again) does this:

# vfio-ctl attach 00:01.0
vfio-ctl: attached to group 8
# vfio-ctl attach 00:02.0
vfio-ctl: attached to group 16
$ qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0 ...

which should cover the usecase you prefer. Qemu still creates the
meta-group that allows the devices to share the same page-table. But what
should also be possible is:

# qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0

In that case qemu detects that the devices are not yet bound to vfio and
will do so and also unbinds them afterwards (essentially the developer
use-case).

Your interface which requires pre-binding of devices into one group by
the administrator only makes sense if you want to force userspace to
use certain devices (which do not belong to the same hw-group) only
together. But I don't see a usecase for defining such constraints (yet).

Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Alexander Graf

On 26.08.2011, at 10:24, Joerg Roedel wrote:

 On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote:
 On 26.08.2011, at 04:33, Roedel, Joerg wrote:
 
 The reason is that you mean the usability for the programmer and I mean
 it for the actual user of qemu :)
 
 No, we mean the actual user of qemu. The reason being that making a
 device available for any user space application is an administrative
 task.
 
 Forget the KVM case for a moment and think of a user space device
 driver. I as a user am not root. But I as a user when having access to
 /dev/vfioX want to be able to access the device and manage it - and
 only it. The admin of that box needs to set it up properly for me to
 be able to access it.
 
 Right, and that task is being performed by attaching the device(s) in
 question to the vfio driver. The rights-management happens on the
 /dev/vfio/$group file.

Yup :)

 
 So having two steps is really the correct way to go:
 
  * create VFIO group
  * use VFIO group
 
 because the two are done by completely different users. It's similar
 to how tun/tap works in Linux too. Of course nothing keeps you from
 also creating a group on the fly, but it shouldn't be the only
 interface available. The persistent setup is definitely more useful.
 
 I see the use-case. But to make it as easy as possible for the end-user
 we can do both.
 
  So the user (of qemu, again) does this:
 
 # vfio-ctl attach 00:01.0
 vfio-ctl: attached to group 8
 # vfio-ctl attach 00:02.0
 vfio-ctl: attached to group 16
  $ qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0 ...
 
 which should cover the usecase you prefer. Qemu still creates the
  meta-group that allows the devices to share the same page-table. But what
 should also be possible is:
 
 # qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0
 
 In that case qemu detects that the devices are not yet bound to vfio and
 will do so and also unbinds them afterwards (essentially the developer
 use-case).

I agree. That's the same way it works with tun today. You can either have qemu spawn a tun
device dynamically or have a preallocated one you use. If you run qemu as a 
user (which I always do), I preallocate a tun device and attach qemu to it.

 Your interface which requires pre-binding of devices into one group by
 the administrator only makes sense if you want to force userspace to
 use certain devices (which do not belong to the same hw-group) only
 together. But I don't see a usecase for defining such constraints (yet).

Agreed. As long as the kernel backend can always figure out the hw-groups, 
we're good :)


Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Aaron Fabbri



On 8/26/11 7:07 AM, Alexander Graf ag...@suse.de wrote:

 
snip
 
 Forget the KVM case for a moment and think of a user space device driver. I as
 a user am not root. But I as a user when having access to /dev/vfioX want to
 be able to access the device and manage it - and only it. The admin of that
 box needs to set it up properly for me to be able to access it.
 
 So having two steps is really the correct way to go:
 
   * create VFIO group
   * use VFIO group
 
 because the two are done by completely different users.

This is not the case for my userspace drivers using VFIO today.

Each process will open vfio devices on the fly, and they need to be able to
share IOMMU resources.

So I need the ability to dynamically bring up devices and assign them to a
group.  The number of actual devices and how they map to iommu domains is
not known ahead of time.  We have a single piece of silicon that can expose
hundreds of pci devices.

In my case, the only administrative task would be to give my processes/users
access to the vfio groups (which are initially singletons), and the
application actually opens them and needs the ability to merge groups
together to conserve IOMMU resources (assuming we're not going to expose
uiommu).

-Aaron



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Alex Williamson
On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote:
 On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote:
  On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
 
   We need to solve this differently. ARM is starting to use the iommu-api
   too and this definitly does not work there. One possible solution might
   be to make the iommu-ops per-bus.
  
  That sounds good.  Is anyone working on it?  It seems like it doesn't
  hurt to use this in the interim, we may just be watching the wrong bus
  and never add any sysfs group info.
 
 I'll cook something up for RFC over the weekend.
 
   Also the return type should not be long but something that fits into
   32bit on all platforms. Since you use -ENODEV, probably s32 is a good
   choice.
  
  The convenience of using seg|bus|dev|fn was too much to resist, too bad
  it requires a full 32bits.  Maybe I'll change it to:
  int iommu_device_group(struct device *dev, unsigned int *group)
 
 If we really expect segment numbers that need the full 16 bit then this
 would be the way to go. Otherwise I would prefer returning the group-id
 directly and partition the group-id space for the error values (s32 with
 negative numbers being errors).

It's unlikely to have segments using the top bit, but it would be broken
for an iommu driver to define its group numbers using pci s:b:d.f if we
don't have that bit available.  Ben/David, do PEs have an identifier of
a convenient size?  I'd guess any hardware based identifier is going to
use a full unsigned bit width.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Chris Wright
* Aaron Fabbri (aafab...@cisco.com) wrote:
 On 8/26/11 7:07 AM, Alexander Graf ag...@suse.de wrote:
  Forget the KVM case for a moment and think of a user space device driver. I 
  as
  a user am not root. But I as a user when having access to /dev/vfioX want to
  be able to access the device and manage it - and only it. The admin of that
  box needs to set it up properly for me to be able to access it.
  
  So having two steps is really the correct way to go:
  
* create VFIO group
* use VFIO group
  
  because the two are done by completely different users.
 
 This is not the case for my userspace drivers using VFIO today.
 
 Each process will open vfio devices on the fly, and they need to be able to
 share IOMMU resources.

How do you share IOMMU resources w/ multiple processes, are the processes
sharing memory?

 So I need the ability to dynamically bring up devices and assign them to a
 group.  The number of actual devices and how they map to iommu domains is
 not known ahead of time.  We have a single piece of silicon that can expose
 hundreds of pci devices.

This does not seem fundamentally different from the KVM use case.

We have 2 kinds of groupings.

1) low-level system or topology grouping

   Some may have multiple devices in a single group

   * the PCIe-PCI bridge example
   * the POWER partitionable endpoint

   Many will not

   * singleton group, e.g. typical x86 PCIe function (majority of
 assigned devices)

   Not sure it makes sense to have these administratively defined as
   opposed to system defined.

2) logical grouping

   * multiple low-level groups (singleton or otherwise) attached to same
 process, allowing things like single set of io page tables where
 applicable.

   These are nominally administratively defined.  In the KVM case, there
   is likely a privileged task (i.e. libvirtd) involved w/ making the
   device available to the guest and can do things like group merging.
   In your userspace case, perhaps it should be directly exposed.

 In my case, the only administrative task would be to give my processes/users
 access to the vfio groups (which are initially singletons), and the
 application actually opens them and needs the ability to merge groups
 together to conserve IOMMU resources (assuming we're not going to expose
 uiommu).

I agree, we definitely need to expose _some_ way to do this.

thanks,
-chris


Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Aaron Fabbri



On 8/26/11 12:35 PM, Chris Wright chr...@sous-sol.org wrote:

 * Aaron Fabbri (aafab...@cisco.com) wrote:
 On 8/26/11 7:07 AM, Alexander Graf ag...@suse.de wrote:
 Forget the KVM case for a moment and think of a user space device driver. I
 as
 a user am not root. But I as a user when having access to /dev/vfioX want to
 be able to access the device and manage it - and only it. The admin of that
 box needs to set it up properly for me to be able to access it.
 
 So having two steps is really the correct way to go:
 
   * create VFIO group
   * use VFIO group
 
 because the two are done by completely different users.
 
 This is not the case for my userspace drivers using VFIO today.
 
 Each process will open vfio devices on the fly, and they need to be able to
 share IOMMU resources.
 
 How do you share IOMMU resources w/ multiple processes, are the processes
 sharing memory?

Sorry, bad wording.  I share IOMMU domains *within* each process.

E.g. If one process has 3 devices and another has 10, I can get by with two
iommu domains (and can share buffers among devices within each process).

If I ever need to share devices across processes, the shared memory case
might be interesting.

 
 So I need the ability to dynamically bring up devices and assign them to a
 group.  The number of actual devices and how they map to iommu domains is
 not known ahead of time.  We have a single piece of silicon that can expose
 hundreds of pci devices.
 
 This does not seem fundamentally different from the KVM use case.
 
 We have 2 kinds of groupings.
 
 1) low-level system or topology grouping
 
Some may have multiple devices in a single group
 
* the PCIe-PCI bridge example
* the POWER partitionable endpoint
 
Many will not
 
* singleton group, e.g. typical x86 PCIe function (majority of
  assigned devices)
 
Not sure it makes sense to have these administratively defined as
opposed to system defined.
 
 2) logical grouping
 
* multiple low-level groups (singleton or otherwise) attached to same
  process, allowing things like single set of io page tables where
  applicable.
 
These are nominally administratively defined.  In the KVM case, there
is likely a privileged task (i.e. libvirtd) involved w/ making the
device available to the guest and can do things like group merging.
In your userspace case, perhaps it should be directly exposed.

Yes.  In essence, I'd rather not have to run any other admin processes.
Doing things programmatically, on the fly, from each process, is the
cleanest model right now.

 
 In my case, the only administrative task would be to give my processes/users
 access to the vfio groups (which are initially singletons), and the
 application actually opens them and needs the ability to merge groups
 together to conserve IOMMU resources (assuming we're not going to expose
 uiommu).
 
 I agree, we definitely need to expose _some_ way to do this.
 
 thanks,
 -chris



Re: kvm PCI assignment VFIO ramblings

2011-08-26 Thread Chris Wright
* Aaron Fabbri (aafab...@cisco.com) wrote:
 On 8/26/11 12:35 PM, Chris Wright chr...@sous-sol.org wrote:
  * Aaron Fabbri (aafab...@cisco.com) wrote:
  Each process will open vfio devices on the fly, and they need to be able to
  share IOMMU resources.
  
  How do you share IOMMU resources w/ multiple processes, are the processes
  sharing memory?
 
 Sorry, bad wording.  I share IOMMU domains *within* each process.

Ah, got it.  Thanks.

 E.g. If one process has 3 devices and another has 10, I can get by with two
 iommu domains (and can share buffers among devices within each process).
 
 If I ever need to share devices across processes, the shared memory case
 might be interesting.
 
  
  So I need the ability to dynamically bring up devices and assign them to a
  group.  The number of actual devices and how they map to iommu domains is
  not known ahead of time.  We have a single piece of silicon that can expose
  hundreds of pci devices.
  
  This does not seem fundamentally different from the KVM use case.
  
  We have 2 kinds of groupings.
  
  1) low-level system or topology grouping
  
 Some may have multiple devices in a single group
  
 * the PCIe-PCI bridge example
 * the POWER partitionable endpoint
  
 Many will not
  
 * singleton group, e.g. typical x86 PCIe function (majority of
   assigned devices)
  
 Not sure it makes sense to have these administratively defined as
 opposed to system defined.
  
  2) logical grouping
  
 * multiple low-level groups (singleton or otherwise) attached to same
   process, allowing things like single set of io page tables where
   applicable.
  
 These are nominally administratively defined.  In the KVM case, there
 is likely a privileged task (i.e. libvirtd) involved w/ making the
 device available to the guest and can do things like group merging.
 In your userspace case, perhaps it should be directly exposed.
 
 Yes.  In essence, I'd rather not have to run any other admin processes.
 Doing things programmatically, on the fly, from each process, is the
 cleanest model right now.

I don't see an issue w/ this.  As long as it cannot add devices to the
system defined groups, it's not a privileged operation.  So we still
need the iommu domain concept exposed in some form to logically put
groups into a single iommu domain (if desired).  In fact, I believe Alex
covered this in his most recent recap:

  ...The group fd will provide interfaces for enumerating the devices
  in the group, returning a file descriptor for each device in the group
  (the device fd), binding groups together, and returning a file
  descriptor for iommu operations (the iommu fd).

thanks,
-chris


Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Roedel, Joerg
Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
 Is this roughly what you're thinking of for the iommu_group component?
 Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
 support in the iommu base.  Would AMD-Vi do something similar (or
 exactly the same) for group #s?  Thanks,

The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data-structure where the information can be gathered from, so no need for
PCI bus scanning there.

 diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
 index 6e6b6a1..6b54c1a 100644
 --- a/drivers/base/iommu.c
 +++ b/drivers/base/iommu.c
 @@ -17,20 +17,56 @@
   */
  
  #include <linux/bug.h>
 +#include <linux/device.h>
  #include <linux/types.h>
  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/errno.h>
  #include <linux/iommu.h>
 +#include <linux/pci.h>
  
  static struct iommu_ops *iommu_ops;
  
 +static ssize_t show_iommu_group(struct device *dev,
 + struct device_attribute *attr, char *buf)
 +{
 + return sprintf(buf, "%lx", iommu_dev_to_group(dev));

Probably add a 0x prefix so userspace knows the format?

 +}
 +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
 +
 +static int add_iommu_group(struct device *dev, void *unused)
 +{
 + if (iommu_dev_to_group(dev) >= 0)
 + return device_create_file(dev, &dev_attr_iommu_group);
 +
 + return 0;
 +}
 +
 +static int device_notifier(struct notifier_block *nb,
 +unsigned long action, void *data)
 +{
 + struct device *dev = data;
 +
 + if (action == BUS_NOTIFY_ADD_DEVICE)
 + return add_iommu_group(dev, NULL);
 +
 + return 0;
 +}
 +
 +static struct notifier_block device_nb = {
 + .notifier_call = device_notifier,
 +};
 +
  void register_iommu(struct iommu_ops *ops)
  {
   if (iommu_ops)
   BUG();
  
   iommu_ops = ops;
 +
 + /* FIXME - non-PCI, really want for_each_bus() */
 + bus_register_notifier(&pci_bus_type, &device_nb);
 + bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
  }

We need to solve this differently. ARM is starting to use the iommu-api
too and this definitely does not work there. One possible solution might
be to make the iommu-ops per-bus.

  bool iommu_found(void)
 @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
  }
  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
  
 +long iommu_dev_to_group(struct device *dev)
 +{
 + if (iommu_ops->dev_to_group)
 + return iommu_ops->dev_to_group(dev);
 + return -ENODEV;
 +}
 +EXPORT_SYMBOL_GPL(iommu_dev_to_group);

Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also the return type should not be long but something that fits into
32bit on all platforms. Since you use -ENODEV, probably s32 is a good
choice.

 +
  int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, int gfp_order, int prot)
  {
 diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
 index f02c34d..477259c 100644
 --- a/drivers/pci/intel-iommu.c
 +++ b/drivers/pci/intel-iommu.c
 @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
  static int dmar_forcedac;
  static int intel_iommu_strict;
  static int intel_iommu_superpage = 1;
 +static int intel_iommu_no_mf_groups;
  
  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
  static DEFINE_SPINLOCK(device_domain_lock);
 @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
   printk(KERN_INFO
   "Intel-IOMMU: disable supported super page\n");
   intel_iommu_superpage = 0;
 + } else if (!strncmp(str, "no_mf_groups", 12)) {
 + printk(KERN_INFO
 + "Intel-IOMMU: disable separate groups for multifunction devices\n");
 + intel_iommu_no_mf_groups = 1;

This should really be a global iommu option and not be VT-d specific.

  
   str += strcspn(str, ",");
 @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct 
 iommu_domain *domain,
   return 0;
  }
  
 +/* Group numbers are arbitrary.  Devices with the same group number
 + * indicate the iommu cannot differentiate between them.  To avoid
 + * tracking used groups we just use the seg|bus|devfn of the lowest
 + * level we're able to differentiate devices */
 +static long intel_iommu_dev_to_group(struct device *dev)
 +{
 + struct pci_dev *pdev = to_pci_dev(dev);
 + struct pci_dev *bridge;
 + union {
 + struct {
 + u8 devfn;
 + u8 bus;
 + u16 segment;
 + } pci;
 + u32 group;
 + } id;
 +
 + if (iommu_no_mapping(dev))
 + return -ENODEV;
 +
 +

Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Roedel, Joerg
On Wed, Aug 24, 2011 at 10:56:13AM -0400, Alex Williamson wrote:
 On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
  A side-note: Might it be better to expose assigned devices in a guest on
  a separate bus? This will make it easier to emulate an IOMMU for the
  guest inside qemu.
 
 I think we want that option, sure.  A lot of guests aren't going to
 support hotplugging buses though, so I think our default "map the entire
 guest" model should still be using bus 0.  The ACPI gets a lot more
 complicated for that model too; dynamic SSDTs?  Thanks,

Ok, if only AMD-Vi should be emulated then it is not strictly
necessary. For this IOMMU we can specify that devices on the same bus
belong to different IOMMUs. So we can implement an IOMMU that handles
internal qemu-devices and one that handles pass-through devices.
Not sure if this is possible with VT-d too. Okay, VT-d emulation would
also require that the devices sit behind an emulated PCIe bridge, no?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Roedel, Joerg
On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
 On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
  On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
   On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
  
Handling it through fds is a good idea. This makes sure that everything
belongs to one process. I am not really sure yet if we go the way to
just bind plain groups together or if we create meta-groups. The
meta-groups thing seems somewhat cleaner, though.
   
   I'm leaning towards binding because we need to make it dynamic, but I
   don't really have a good picture of the lifecycle of a meta-group.
  
  In my view the life-cycle of the meta-group is a subrange of the
  qemu-instance's life-cycle.
 
 I guess I mean the lifecycle of a super-group that's actually exposed as
 a new group in sysfs.  Who creates it?  How?  How are groups dynamically
 added and removed from the super-group?  The group merging makes sense
 to me because it's largely just an optimization that qemu will try to
 merge groups.  If it works, great.  If not, it manages them separately.
 When all the devices from a group are unplugged, unmerge the group if
 necessary.

Right. The super-group thing is an optimization.

 We need to try the polite method of attempting to hot unplug the device
 from qemu first, which the current vfio code already implements.  We can
 then escalate if it doesn't respond.  The current code calls abort in
 qemu if the guest doesn't respond, but I agree we should also be
 enforcing this at the kernel interface.  I think the problem with the
 hard-unplug is that we don't have a good revoke mechanism for the mmio
 mmaps.

For mmio we could stop the guest and replace the mmio region with a
region that is filled with 0xff, no?
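
A userspace-side sketch of that idea as qemu might do it, remapping an
already-mapped BAR with anonymous memory full of 0xff; bar_va and bar_len are
placeholders and assumed to be page-aligned:

#include <sys/mman.h>
#include <string.h>

static int replace_mmio_with_ff(void *bar_va, size_t bar_len)
{
        /* MAP_FIXED atomically replaces the existing MMIO PTEs in this range */
        void *p = mmap(bar_va, bar_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
                return -1;

        memset(p, 0xff, bar_len);               /* reads now return all-ones */
        return mprotect(p, bar_len, PROT_READ); /* writes fault and can be trapped */
}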

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Don Dutile

On 08/25/2011 06:54 AM, Roedel, Joerg wrote:

Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:

Is this roughly what you're thinking of for the iommu_group component?
Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
support in the iommu base.  Would AMD-Vi do something similar (or
exactly the same) for group #s?  Thanks,


The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data-structure where the information can be gathered from, so no need for
PCI bus scanning there.


diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
index 6e6b6a1..6b54c1a 100644
--- a/drivers/base/iommu.c
+++ b/drivers/base/iommu.c
@@ -17,20 +17,56 @@
   */

  #include <linux/bug.h>
+#include <linux/device.h>
  #include <linux/types.h>
  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/errno.h>
  #include <linux/iommu.h>
+#include <linux/pci.h>

  static struct iommu_ops *iommu_ops;

+static ssize_t show_iommu_group(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%lx", iommu_dev_to_group(dev));


Probably add a 0x prefix so userspace knows the format?


+}
+static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
+
+static int add_iommu_group(struct device *dev, void *unused)
+{
+   if (iommu_dev_to_group(dev) >= 0)
+   return device_create_file(dev, &dev_attr_iommu_group);
+
+   return 0;
+}
+
+static int device_notifier(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct device *dev = data;
+
+   if (action == BUS_NOTIFY_ADD_DEVICE)
+   return add_iommu_group(dev, NULL);
+
+   return 0;
+}
+
+static struct notifier_block device_nb = {
+   .notifier_call = device_notifier,
+};
+
  void register_iommu(struct iommu_ops *ops)
  {
if (iommu_ops)
BUG();

iommu_ops = ops;
+
+   /* FIXME - non-PCI, really want for_each_bus() */
+   bus_register_notifier(&pci_bus_type, &device_nb);
+   bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
  }


We need to solve this differently. ARM is starting to use the iommu-api
too and this definitely does not work there. One possible solution might
be to make the iommu-ops per-bus.


When you think of a system where there isn't just one bus-type
with iommu support, it makes more sense.
Additionally, it also allows the long-term architecture to use different types
of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
for direct-attach disk hba's.



  bool iommu_found(void)
@@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
  }
  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);

+long iommu_dev_to_group(struct device *dev)
+{
+   if (iommu_ops->dev_to_group)
+   return iommu_ops->dev_to_group(dev);
+   return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_to_group);


Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also the return type should not be long but something that fits into
32bit on all platforms. Since you use -ENODEV, probably s32 is a good
choice.


+
  int iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, int gfp_order, int prot)
  {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f02c34d..477259c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
  static int dmar_forcedac;
  static int intel_iommu_strict;
  static int intel_iommu_superpage = 1;
+static int intel_iommu_no_mf_groups;

  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
  static DEFINE_SPINLOCK(device_domain_lock);
@@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
printk(KERN_INFO
"Intel-IOMMU: disable supported super page\n");
intel_iommu_superpage = 0;
+   } else if (!strncmp(str, "no_mf_groups", 12)) {
+   printk(KERN_INFO
+   "Intel-IOMMU: disable separate groups for multifunction devices\n");
+   intel_iommu_no_mf_groups = 1;


This should really be a global iommu option and not be VT-d specific.



str += strcspn(str, ",");
@@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct 
iommu_domain *domain,
return 0;
  }

+/* Group numbers are arbitrary.  Devices with the same group number
+ * indicate the iommu cannot differentiate between them.  To avoid
+ * tracking used groups we just use the seg|bus|devfn of the lowest
+ * level we're able to differentiate devices */
+static long intel_iommu_dev_to_group(struct 

Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Roedel, Joerg
On Thu, Aug 25, 2011 at 11:38:09AM -0400, Don Dutile wrote:

 On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
  We need to solve this differently. ARM is starting to use the iommu-api
  too and this definitely does not work there. One possible solution might
  be to make the iommu-ops per-bus.
 
 When you think of a system where there isn't just one bus-type
 with iommu support, it makes more sense.
 Additionally, it also allows the long-term architecture to use different types
 of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
 esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
 for direct-attach disk hba's.

Not sure how likely it is to have different types of IOMMUs within a
given bus-type. But if they become reality we can multiplex in the
iommu-api without much hassle :)
For now, something like bus_set_iommu() or bus_register_iommu() would
provide a nice way to do bus-specific setups for a given iommu
implementation.
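
A minimal sketch of what such a bus_set_iommu() could look like; the iommu_ops
member on struct bus_type is an assumption for illustration, not existing code:

#include <linux/device.h>
#include <linux/errno.h>

struct iommu_ops;

int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
{
        if (bus->iommu_ops)             /* hypothetical per-bus field */
                return -EBUSY;          /* only one IOMMU driver per bus type */

        bus->iommu_ops = ops;
        /* a full version would also register a bus notifier and walk the
         * devices already on the bus, like add_iommu_group() above */
        return 0;
}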

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Alex Williamson
On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
 Hi Alex,
 
 On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
  Is this roughly what you're thinking of for the iommu_group component?
  Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
  support in the iommu base.  Would AMD-Vi do something similar (or
  exactly the same) for group #s?  Thanks,
 
 The concept looks good, I have some comments, though. On AMD-Vi the
 implementation would look a bit different because there is a
 data-structure where the information can be gathered from, so no need for
 PCI bus scanning there.
 
  diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
  index 6e6b6a1..6b54c1a 100644
  --- a/drivers/base/iommu.c
  +++ b/drivers/base/iommu.c
  @@ -17,20 +17,56 @@
*/
   
   #include <linux/bug.h>
  +#include <linux/device.h>
   #include <linux/types.h>
   #include <linux/module.h>
   #include <linux/slab.h>
   #include <linux/errno.h>
   #include <linux/iommu.h>
  +#include <linux/pci.h>
   
   static struct iommu_ops *iommu_ops;
   
  +static ssize_t show_iommu_group(struct device *dev,
  +   struct device_attribute *attr, char *buf)
  +{
  +   return sprintf(buf, "%lx", iommu_dev_to_group(dev));
 
 Probably add a 0x prefix so userspace knows the format?

I think I'll probably change it to %u.  Seems common to have decimal in
sysfs and doesn't get confusing if we cat it with a string.  As a bonus,
it abstracts that vt-d is just stuffing a PCI device address in there,
which nobody should ever rely on.

  +}
  +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
  +
  +static int add_iommu_group(struct device *dev, void *unused)
  +{
  +   if (iommu_dev_to_group(dev) >= 0)
  +   return device_create_file(dev, &dev_attr_iommu_group);
  +
  +   return 0;
  +}
  +
  +static int device_notifier(struct notifier_block *nb,
  +  unsigned long action, void *data)
  +{
  +   struct device *dev = data;
  +
  +   if (action == BUS_NOTIFY_ADD_DEVICE)
  +   return add_iommu_group(dev, NULL);
  +
  +   return 0;
  +}
  +
  +static struct notifier_block device_nb = {
  +   .notifier_call = device_notifier,
  +};
  +
   void register_iommu(struct iommu_ops *ops)
   {
  if (iommu_ops)
  BUG();
   
  iommu_ops = ops;
  +
  +   /* FIXME - non-PCI, really want for_each_bus() */
  +   bus_register_notifier(&pci_bus_type, &device_nb);
  +   bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
   }
 
 We need to solve this differently. ARM is starting to use the iommu-api
 too and this definitely does not work there. One possible solution might
 be to make the iommu-ops per-bus.

That sounds good.  Is anyone working on it?  It seems like it doesn't
hurt to use this in the interim, we may just be watching the wrong bus
and never add any sysfs group info.

   bool iommu_found(void)
  @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
   }
   EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
   
  +long iommu_dev_to_group(struct device *dev)
  +{
  +   if (iommu_ops->dev_to_group)
  +   return iommu_ops->dev_to_group(dev);
  +   return -ENODEV;
  +}
  +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
 
 Please rename this to iommu_device_group(). The dev_to_group name
 suggests a conversion but it is actually just a property of the device.

Ok.

 Also the return type should not be long but something that fits into
 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
 choice.

The convenience of using seg|bus|dev|fn was too much to resist, too bad
it requires a full 32bits.  Maybe I'll change it to:
int iommu_device_group(struct device *dev, unsigned int *group)

  +
   int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, int gfp_order, int prot)
   {
  diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
  index f02c34d..477259c 100644
  --- a/drivers/pci/intel-iommu.c
  +++ b/drivers/pci/intel-iommu.c
  @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
   static int dmar_forcedac;
   static int intel_iommu_strict;
   static int intel_iommu_superpage = 1;
  +static int intel_iommu_no_mf_groups;
   
   #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
   static DEFINE_SPINLOCK(device_domain_lock);
  @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
  printk(KERN_INFO
  "Intel-IOMMU: disable supported super page\n");
  intel_iommu_superpage = 0;
  +   } else if (!strncmp(str, "no_mf_groups", 12)) {
  +   printk(KERN_INFO
  +   "Intel-IOMMU: disable separate groups for multifunction devices\n");
  +   intel_iommu_no_mf_groups = 1;
 
 This should really be a global iommu option and not be VT-d specific.

You think?  It's meaningless on benh's power systems.

   
  str 

Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread David Gibson
On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
 On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
  On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
 
   I don't see a reason to make this meta-grouping static. It would harm
   flexibility on x86. I think it makes things easier on power but there
   are options on that platform to get the dynamic solution too.
  
  I think several people are misreading what Ben means by static.  I
  would prefer to say 'persistent', in that the meta-groups' lifetime is
  not tied to an fd, but they can be freely created, altered and removed
  during runtime.
 
 Even if it can be altered at runtime, from a usability perspective it is
 certainly the best to handle these groups directly in qemu. Or are there
 strong reasons to do it somewhere else?

Funny, Ben and I think usability demands it be the other way around.

If the meta-groups are transient - that is lifetime tied to an fd -
then any program that wants to use meta-groups *must* know the
interfaces for creating one, whatever they are.

But if they're persistent, the admin can use other tools to create the
meta-group then just hand it to a program to use, since the interfaces
for _using_ a meta-group are identical to those for an atomic group.

This doesn't preclude a program from being meta-group aware, and
creating its own if it wants to, of course.  My guess is that qemu
would not want to build its own meta-groups, but libvirt probably
would.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread David Gibson
On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
 
 On 25.08.2011, at 07:31, Roedel, Joerg wrote:
 
  On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
  On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
  
 
 [...]
 
  We need to try the polite method of attempting to hot unplug the device
  from qemu first, which the current vfio code already implements.  We can
  then escalate if it doesn't respond.  The current code calls abort in
  qemu if the guest doesn't respond, but I agree we should also be
  enforcing this at the kernel interface.  I think the problem with the
  hard-unplug is that we don't have a good revoke mechanism for the mmio
  mmaps.
  
  For mmio we could stop the guest and replace the mmio region with a
  region that is filled with 0xff, no?
 
 Sure, but that happens in user space. The question is how does
 kernel space enforce an MMIO region to not be mapped after the
 hotplug event occurred? Keep in mind that user space is pretty much
 untrusted here - it doesn't have to be QEMU. It could just as well
 be a generic user space driver. And that can just ignore hotplug
 events.

We're saying you hard yank the mapping from the userspace process.
That is, you invalidate all its PTEs mapping the MMIO space, and don't
let it fault them back in.

As I see it there are two options: (a) make subsequent accesses from
userspace or the guest result in either a SIGBUS that userspace must
either deal with or die, or (b) replace the mapping with a dummy RO
mapping containing 0xff, with any trapped writes emulated as nops.
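
A userspace-side sketch of option (a), catching SIGBUS so the process can
treat a yanked mapping as an all-ones read instead of dying; the device setup
is elided and the helper names are illustrative:

#include <signal.h>
#include <setjmp.h>
#include <stdint.h>

static sigjmp_buf mmio_jmp;

static void sigbus_handler(int sig)
{
        (void)sig;
        siglongjmp(mmio_jmp, 1);        /* unwind out of the faulting access */
}

static uint32_t mmio_read32(volatile uint32_t *reg)
{
        if (sigsetjmp(mmio_jmp, 1))
                return ~0u;             /* device vanished: behave like 0xff */
        return *reg;
}

int main(void)
{
        struct sigaction sa = { .sa_handler = sigbus_handler };

        sigaction(SIGBUS, &sa, NULL);
        /* ... mmap the BAR through the vfio device fd, then use mmio_read32()
         * for accesses that may race with a hot-unplug ... */
        return 0;
}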

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Joerg Roedel
On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
 On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:

  Could be tho, in what form? Returning sysfs paths?
 
 I'm at a loss there, please suggest.  I think we need an ioctl that
 returns some kind of array of devices within the group and another that
 maybe takes an index from that array and returns an fd for that device.
 A sysfs path string might be a reasonable array element, but it sounds
 like a pain to work with.

Limiting to PCI we can just pass the BDF as the argument to obtain the
device-fd. For a more generic solution we need a unique identifier in
some way which is unique across all 'struct device' instances in the
system. As far as I know we don't have that yet (besides the sysfs-path)
so we either add that or stick with bus-specific solutions.

  1:1 process has the advantage of linking to an ->mm which makes the whole
  mmu notifier business doable. How do you want to track down mappings and
  do the second level translation in the case of explicit map/unmap (like
  on power) if you are not tied to an mm_struct ?
 
 Right, I threw away the mmu notifier code that was originally part of
 vfio because we can't do anything useful with it yet on x86.  I
 definitely don't want to prevent it where it makes sense though.  Maybe
 we just record current->mm on open and restrict subsequent opens to the
 same.

Hmm, I think we need io-page-fault support in the iommu-api then.

  Another aspect I don't see discussed is how we represent these things to
  the guest.
  
  On Power for example, I have a requirement that a given iommu domain is
  represented by a single dma window property in the device-tree. What
  that means is that that property needs to be either in the node of the
  device itself if there's only one device in the group or in a parent
  node (ie a bridge or host bridge) if there are multiple devices.
  
  Now I do -not- want to go down the path of simulating P2P bridges,
  besides we'll quickly run out of bus numbers if we go there.
  
  For us the most simple and logical approach (which is also what pHyp
  uses and what Linux handles well) is really to expose a given PCI host
  bridge per group to the guest. Believe it or not, it makes things
  easier :-)
 
 I'm all for easier.  Why does exposing the bridge use less bus numbers
 than emulating a bridge?
 
 On x86, I want to maintain that our default assignment is at the device
 level.  A user should be able to pick single or multiple devices from
 across several groups and have them all show up as individual,
 hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
 also seen cases where users try to attach a bridge to the guest,
 assuming they'll get all the devices below the bridge, so I'd be in
 favor of making this just work if possible too, though we may have to
 prevent hotplug of those.

A side-note: Might it be better to expose assigned devices in a guest on
a separate bus? This will make it easier to emulate an IOMMU for the
guest inside qemu.


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
 On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:

  Handling it through fds is a good idea. This makes sure that everything
  belongs to one process. I am not really sure yet if we go the way to
  just bind plain groups together or if we create meta-groups. The
  meta-groups thing seems somewhat cleaner, though.
 
 I'm leaning towards binding because we need to make it dynamic, but I
 don't really have a good picture of the lifecycle of a meta-group.

In my view the life-cycle of the meta-group is a subrange of the
qemu-instance's life-cycle.

  Putting the process to sleep (which would be uninterruptible) seems bad.
  The process would sleep until the guest releases the device-group, which
  can take days or months.
  The best thing (and the most intrusive :-) ) is to change PCI core to
  allow unbindings to fail, I think. But this probably further complicates
  the way to upstream VFIO...
 
 Yes, it's not ideal but I think it's sufficient for now and if we later
 get support for returning an error from release, we can set a timeout
 after notifying the user to make use of that.  Thanks,

Ben had the idea of just forcing to hard-unplug this device from the
guest. Thats probably the best way to deal with that, I think. VFIO
sends a notification to qemu that the device is gone and qemu informs
the guest in some way about it.
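
Just to illustrate the shape of that from the user space side, assuming
(purely for the sake of the example) that the vfio device fd signals POLLHUP
once the kernel has yanked the device:

#include <poll.h>
#include <stdio.h>

/* Hypothetical: wait for the "device is gone" notification on the vfio fd. */
void wait_for_unplug(int device_fd)
{
    struct pollfd pfd = { .fd = device_fd, .events = POLLHUP | POLLERR };

    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & (POLLHUP | POLLERR)))
        printf("device revoked by the kernel\n");
    /* qemu would now remove the device from the guest's view */
}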

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Tue, Aug 23, 2011 at 07:35:37PM -0400, Benjamin Herrenschmidt wrote:
 On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:

  Hmm, good idea. But as far as I know the hotplug-event needs to be in
  the guest _before_ the device is actually unplugged (so that the guest
  can unbind its driver first). That somehow brings back the sleep-idea
  and the timeout in the .release function.
 
 That's for normal assisted hotplug, but don't we support hard hotplug ?
 I mean, things like cardbus, thunderbolt (if we ever support that)
 etc... will need it and some platforms do support hard hotplug of PCIe
 devices.
 
 (That's why drivers should never spin on MMIO waiting for a 1 bit to
 clear without a timeout :-)

Right, that's probably the best semantic for this issue then. The worst
thing that happens is that the admin crashes the guest.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Joerg Roedel
On Tue, Aug 23, 2011 at 01:33:14PM -0400, Aaron Fabbri wrote:
 On 8/23/11 10:01 AM, Alex Williamson alex.william...@redhat.com wrote:
  The iommu domain would probably be allocated when the first device is
  bound to vfio.  As each device is bound, it gets attached to the group.
  DMAs are done via an ioctl on the group.
  
  I think group + uiommu leads to effectively reliving most of the
  problems with the current code.  The only benefit is the group
  assignment to enforce hardware restrictions.  We still have the problem
  that uiommu open() = iommu_domain_alloc(), whose properties are
  meaningless without attached devices (groups).  Which I think leads to
  the same awkward model of attaching groups to define the domain, then we
  end up doing mappings via the group to enforce ordering.
 
 Is there a better way to allow groups to share an IOMMU domain?
 
 Maybe, instead of having an ioctl to allow a group A to inherit the same
 iommu domain as group B, we could have an ioctl to fully merge two groups
 (could be what Ben was thinking):
 
 A.ioctl(MERGE_TO_GROUP, B)
 
 The group A now goes away and its devices join group B.  If A ever had an
 iommu domain assigned (and buffers mapped?) we fail.
 
 Groups cannot get smaller (they are defined as minimum granularity of an
 IOMMU, initially).  They can get bigger if you want to share IOMMU
 resources, though.
 
 Any downsides to this approach?

As long as this is a 2-way road it's fine. There must be a way to split
the groups again after the guest exits. But then we are again at the
super-groups (aka meta-groups, aka uiommu) point.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
 On 8/23/11 4:04 AM, Joerg Roedel joerg.roe...@amd.com wrote:
  That is makes uiommu basically the same as the meta-groups, right?
 
 Yes, functionality seems the same, thus my suggestion to keep uiommu
 explicit.  Is there some need for group-groups besides defining sets of
 groups which share IOMMU resources?
 
 I do all this stuff (bringing up sets of devices which may share IOMMU
 domain) dynamically from C applications.  I don't really want some static
 (boot-time or sysfs fiddling) supergroup config unless there is a good
 reason KVM/power needs it.
 
 As you say in your next email, doing it all from ioctls is very easy,
 programmatically.

I don't see a reason to make this meta-grouping static. It would harm
flexibility on x86. I think it makes things easier on power but there
are options on that platform to get the dynamic solution too.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread David Gibson
On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
 On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
  On 8/23/11 4:04 AM, Joerg Roedel joerg.roe...@amd.com wrote:
   That is makes uiommu basically the same as the meta-groups, right?
  
  Yes, functionality seems the same, thus my suggestion to keep uiommu
  explicit.  Is there some need for group-groups besides defining sets of
  groups which share IOMMU resources?
  
  I do all this stuff (bringing up sets of devices which may share IOMMU
  domain) dynamically from C applications.  I don't really want some static
  (boot-time or sysfs fiddling) supergroup config unless there is a good
  reason KVM/power needs it.
  
  As you say in your next email, doing it all from ioctls is very easy,
  programmatically.
 
 I don't see a reason to make this meta-grouping static. It would harm
 flexibility on x86. I think it makes things easier on power but there
 are options on that platform to get the dynamic solution too.

I think several people are misreading what Ben means by static.  I
would prefer to say 'persistent', in that the meta-groups lifetime is
not tied to an fd, but they can be freely created, altered and removed
during runtime.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
 On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:

  I don't see a reason to make this meta-grouping static. It would harm
  flexibility on x86. I think it makes things easier on power but there
  are options on that platform to get the dynamic solution too.
 
 I think several people are misreading what Ben means by static.  I
 would prefer to say 'persistent', in that the meta-groups lifetime is
 not tied to an fd, but they can be freely created, altered and removed
 during runtime.

Even if it can be altered at runtime, from a usability perspective it is
certainly best to handle these groups directly in qemu. Or are there
strong reasons to do it somewhere else?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 09:51 +1000, Benjamin Herrenschmidt wrote:
   For us the most simple and logical approach (which is also what pHyp
   uses and what Linux handles well) is really to expose a given PCI host
   bridge per group to the guest. Believe it or not, it makes things
   easier :-)
  
  I'm all for easier.  Why does exposing the bridge use less bus numbers
  than emulating a bridge?
 
 Because a host bridge doesn't look like a PCI to PCI bridge at all for
 us. It's an entirely separate domain with its own bus number space
 (unlike most x86 setups).

Ok, I missed the host bridge.

 In fact we have some problems afaik in qemu today with the concept of
 PCI domains, for example, I think qemu has assumptions about a single
 shared IO space domain which isn't true for us (each PCI host bridge
 provides a distinct IO space domain starting at 0). We'll have to fix
 that, but it's not a huge deal.

Yep, I've seen similar on ia64 systems.

 So for each group we'd expose in the guest an entire separate PCI
 domain space with its own IO, MMIO etc... spaces, handed off from a
 single device-tree host bridge which doesn't itself appear in the
 config space, doesn't need any emulation of any config space etc...
 
  On x86, I want to maintain that our default assignment is at the device
  level.  A user should be able to pick single or multiple devices from
  across several groups and have them all show up as individual,
  hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
  also seen cases where users try to attach a bridge to the guest,
  assuming they'll get all the devices below the bridge, so I'd be in
  favor of making this just work if possible too, though we may have to
  prevent hotplug of those.
 
  Given the device requirement on x86 and since everything is a PCI device
  on x86, I'd like to keep a qemu command line something like -device
  vfio,host=00:19.0.  I assume that some of the iommu properties, such as
  dma window size/address, will be query-able through an architecture
  specific (or general if possible) ioctl on the vfio group fd.  I hope
  that will help the specification, but I don't fully understand what all
  remains.  Thanks,
 
 Well, for iommu there's a couple of different issues here but yes,
 basically on one side we'll have some kind of ioctl to know what segment
 of the device(s) DMA address space is assigned to the group and we'll
 need to represent that to the guest via a device-tree property in some
 kind of parent node of all the devices in that group.
 
 We -might- be able to implement some kind of hotplug of individual
 devices of a group under such a PHB (PCI Host Bridge), I don't know for
 sure yet, some of that PAPR stuff is pretty arcane, but basically, for
 all intents and purposes, we really want a group to be represented as a
 PHB in the guest.
 
 We cannot arbitrary have individual devices of separate groups be
 represented in the guest as siblings on a single simulated PCI bus.

I think the vfio kernel layer we're describing easily supports both.
This is just a matter of adding qemu-vfio code to expose different
topologies based on group iommu capabilities and mapping mode.  Thanks,

Alex




Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
 On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
  On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:
 
   Could be tho in what form ? returning sysfs pathes ?
  
  I'm at a loss there, please suggest.  I think we need an ioctl that
  returns some kind of array of devices within the group and another that
  maybe takes an index from that array and returns an fd for that device.
  A sysfs path string might be a reasonable array element, but it sounds
  like a pain to work with.
 
 Limiting to PCI we can just pass the BDF as the argument to obtain the
 device-fd. For a more generic solution we need a unique identifier in
 some way which is unique across all 'struct device' instances in the
 system. As far as I know we don't have that yet (besides the sysfs-path)
 so we either add that or stick with bus-specific solutions.
 
   1:1 process has the advantage of linking to an -mm which makes the whole
   mmu notifier business doable. How do you want to track down mappings and
   do the second level translation in the case of explicit map/unmap (like
   on power) if you are not tied to an mm_struct ?
  
  Right, I threw away the mmu notifier code that was originally part of
  vfio because we can't do anything useful with it yet on x86.  I
  definitely don't want to prevent it where it makes sense though.  Maybe
  we just record current->mm on open and restrict subsequent opens to the
  same.
 
 Hmm, I think we need io-page-fault support in the iommu-api then.

Yeah, when we can handle iommu page faults, this gets more interesting.

   Another aspect I don't see discussed is how we represent these things to
   the guest.
   
   On Power for example, I have a requirement that a given iommu domain is
   represented by a single dma window property in the device-tree. What
   that means is that that property needs to be either in the node of the
   device itself if there's only one device in the group or in a parent
   node (ie a bridge or host bridge) if there are multiple devices.
   
   Now I do -not- want to go down the path of simulating P2P bridges,
   besides we'll quickly run out of bus numbers if we go there.
   
   For us the most simple and logical approach (which is also what pHyp
   uses and what Linux handles well) is really to expose a given PCI host
   bridge per group to the guest. Believe it or not, it makes things
   easier :-)
  
  I'm all for easier.  Why does exposing the bridge use less bus numbers
  than emulating a bridge?
  
  On x86, I want to maintain that our default assignment is at the device
  level.  A user should be able to pick single or multiple devices from
  across several groups and have them all show up as individual,
  hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
  also seen cases where users try to attach a bridge to the guest,
  assuming they'll get all the devices below the bridge, so I'd be in
  favor of making this just work if possible too, though we may have to
  prevent hotplug of those.
 
 A side-note: Might it be better to expose assigned devices in a guest on
 a separate bus? This will make it easier to emulate an IOMMU for the
 guest inside qemu.

I think we want that option, sure.  A lot of guests aren't going to
support hotplugging buses though, so I think our default, map the entire
guest model should still be using bus 0.  The ACPI gets a lot more
complicated for that model too; dynamic SSDTs?  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
 On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
  On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
 
   Handling it through fds is a good idea. This makes sure that everything
   belongs to one process. I am not really sure yet if we go the way to
   just bind plain groups together or if we create meta-groups. The
   meta-groups thing seems somewhat cleaner, though.
  
  I'm leaning towards binding because we need to make it dynamic, but I
  don't really have a good picture of the lifecycle of a meta-group.
 
 In my view the life-cycle of the meta-group is a subrange of the
 qemu-instance's life-cycle.

I guess I mean the lifecycle of a super-group that's actually exposed as
a new group in sysfs.  Who creates it?  How?  How are groups dynamically
added and removed from the super-group?  The group merging makes sense
to me because it's largely just an optimization that qemu will try to
merge groups.  If it works, great.  If not, it manages them separately.
When all the devices from a group are unplugged, unmerge the group if
necessary.
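
I.e., something as simple as this on the qemu side (the ioctl name and its
semantics are invented here just to illustrate the fallback policy):

#include <sys/ioctl.h>

#define VFIO_GROUP_MERGE  _IO(';', 103)   /* hypothetical request */

/* Try to share one iommu domain with the primary group; if the platform
 * refuses, keep the new group as its own domain and manage it separately. */
static int vfio_add_group(int primary_group_fd, int new_group_fd)
{
    if (ioctl(primary_group_fd, VFIO_GROUP_MERGE, new_group_fd) == 0)
        return primary_group_fd;

    return new_group_fd;
}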

   Putting the process to sleep (which would be uninterruptible) seems bad.
   The process would sleep until the guest releases the device-group, which
   can take days or months.
   The best thing (and the most intrusive :-) ) is to change PCI core to
   allow unbindings to fail, I think. But this probably further complicates
   the way to upstream VFIO...
  
  Yes, it's not ideal but I think it's sufficient for now and if we later
  get support for returning an error from release, we can set a timeout
  after notifying the user to make use of that.  Thanks,
 
 Ben had the idea of just forcing to hard-unplug this device from the
 guest. Thats probably the best way to deal with that, I think. VFIO
 sends a notification to qemu that the device is gone and qemu informs
 the guest in some way about it.

We need to try the polite method of attempting to hot unplug the device
from qemu first, which the current vfio code already implements.  We can
then escalate if it doesn't respond.  The current code calls abort in
qemu if the guest doesn't respond, but I agree we should also be
enforcing this at the kernel interface.  I think the problem with the
hard-unplug is that we don't have a good revoke mechanism for the mmio
mmaps.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-24 Thread Alex Williamson
Joerg,

Is this roughly what you're thinking of for the iommu_group component?
Adding a dev_to_group iommu ops callback let's us consolidate the sysfs
support in the iommu base.  Would AMD-Vi do something similar (or
exactly the same) for group #s?  Thanks,

Alex

Signed-off-by: Alex Williamson alex.william...@redhat.com

diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
index 6e6b6a1..6b54c1a 100644
--- a/drivers/base/iommu.c
+++ b/drivers/base/iommu.c
@@ -17,20 +17,56 @@
  */
 
 #include <linux/bug.h>
+#include <linux/device.h>
 #include <linux/types.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
+#include <linux/pci.h>
 
 static struct iommu_ops *iommu_ops;
 
+static ssize_t show_iommu_group(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%lx", iommu_dev_to_group(dev));
+}
+static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
+
+static int add_iommu_group(struct device *dev, void *unused)
+{
+   if (iommu_dev_to_group(dev) >= 0)
+   return device_create_file(dev, dev_attr_iommu_group);
+
+   return 0;
+}
+
+static int device_notifier(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct device *dev = data;
+
+   if (action == BUS_NOTIFY_ADD_DEVICE)
+   return add_iommu_group(dev, NULL);
+
+   return 0;
+}
+
+static struct notifier_block device_nb = {
+   .notifier_call = device_notifier,
+};
+
 void register_iommu(struct iommu_ops *ops)
 {
if (iommu_ops)
BUG();
 
iommu_ops = ops;
+
+   /* FIXME - non-PCI, really want for_each_bus() */
+   bus_register_notifier(pci_bus_type, device_nb);
+   bus_for_each_dev(pci_bus_type, NULL, NULL, add_iommu_group);
 }
 
 bool iommu_found(void)
@@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
 
+long iommu_dev_to_group(struct device *dev)
+{
+   if (iommu_ops->dev_to_group)
+   return iommu_ops->dev_to_group(dev);
+   return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_to_group);
+
 int iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, int gfp_order, int prot)
 {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f02c34d..477259c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
 static int dmar_forcedac;
 static int intel_iommu_strict;
 static int intel_iommu_superpage = 1;
+static int intel_iommu_no_mf_groups;
 
 #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
@@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
printk(KERN_INFO
"Intel-IOMMU: disable supported super page\n");
intel_iommu_superpage = 0;
+   } else if (!strncmp(str, "no_mf_groups", 12)) {
+   printk(KERN_INFO
+   "Intel-IOMMU: disable separate groups for multifunction devices\n");
+   intel_iommu_no_mf_groups = 1;
}
 
str += strcspn(str, ",");
@@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
 }
 
+/* Group numbers are arbitrary.  Devices with the same group number
+ * indicate the iommu cannot differentiate between them.  To avoid
+ * tracking used groups we just use the seg|bus|devfn of the lowest
+ * level we're able to differentiate devices */
+static long intel_iommu_dev_to_group(struct device *dev)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *bridge;
+   union {
+   struct {
+   u8 devfn;
+   u8 bus;
+   u16 segment;
+   } pci;
+   u32 group;
+   } id;
+
+   if (iommu_no_mapping(dev))
+   return -ENODEV;
+
+   id.pci.segment = pci_domain_nr(pdev->bus);
+   id.pci.bus = pdev->bus->number;
+   id.pci.devfn = pdev->devfn;
+
+   if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
+   return -ENODEV;
+
+   bridge = pci_find_upstream_pcie_bridge(pdev);
+   if (bridge) {
+   if (pci_is_pcie(bridge)) {
+   id.pci.bus = bridge->subordinate->number;
+   id.pci.devfn = 0;
+   } else {
+   id.pci.bus = bridge->bus->number;
+   id.pci.devfn = bridge->devfn;
+   }
+   }
+
+   /* Virtual functions always get their own group */
+   if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
+   id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
+
+   /* FIXME - seg # >= 0x8000 on 32b */
+ 

Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:

 I'm not following you.
 
 You have to enforce group/iommu domain assignment whether you have the
 existing uiommu API, or if you change it to your proposed
 ioctl(inherit_iommu) API.
 
 The only change needed to VFIO here should be to make uiommu fd assignment
 happen on the groups instead of on device fds.  That operation fails or
 succeeds according to the group semantics (all-or-none assignment/same
 uiommu).

Ok, so I missed that part where you change uiommu to operate on group
fd's rather than device fd's, my apologies if you actually wrote that
down :-) It might be obvious ... bare with me I just flew back from the
US and I am badly jet lagged ...

So I see what you mean, however...

 I think the question is: do we force 1:1 iommu/group mapping, or do we allow
 arbitrary mapping (satisfying group constraints) as we do today.
 
 I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
 ability and definitely think the uiommu approach is cleaner than the
 ioctl(inherit_iommu) approach.  We considered that approach before but it
 seemed less clean so we went with the explicit uiommu context.

Possibly, the question that interest me the most is what interface will
KVM end up using. I'm also not terribly fan with the (perceived)
discrepancy between using uiommu to create groups but using the group fd
to actually do the mappings, at least if that is still the plan.

If the separate uiommu interface is kept, then anything that wants to be
able to benefit from the ability to put multiple devices (or existing
groups) into such a meta group would need to be explicitly modified to
deal with the uiommu APIs.

I tend to prefer such meta groups as being something you create
statically using a configuration interface, either via sysfs, netlink or
ioctl's to a control vfio device driven by a simple command line tool
(which can have the configuration stored in /etc and re-apply it at
boot).

That way, any program capable of exploiting VFIO groups will
automatically be able to exploit those meta groups (or groups of
groups) as well as long as they are supported on the system.

If we ever have system specific constraints as to how such groups can be
created, then it can all be handled at the level of that configuration
tool without impact on whatever programs know how to exploit them via
the VFIO interfaces.

   .../...
  
  If we in singleton-group land were building our own groups which were 
  sets
  of devices sharing the IOMMU domains we wanted, I suppose we could do away
  with uiommu fds, but it sounds like the current proposal would create 20
  singleton groups (x86 iommu w/o PCI bridges = all devices are 
  partitionable
  endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
  worse than the current explicit uiommu API.
  
  I'd rather have an API to create super-groups (groups of groups)
  statically and then you can use such groups as normal groups using the
  same interface. That create/management process could be done via a
  simple command line utility or via sysfs banging, whatever...

Cheers,
Ben.



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Joerg Roedel
On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
 You have to enforce group/iommu domain assignment whether you have the
 existing uiommu API, or if you change it to your proposed
 ioctl(inherit_iommu) API.
 
 The only change needed to VFIO here should be to make uiommu fd assignment
 happen on the groups instead of on device fds.  That operation fails or
 succeeds according to the group semantics (all-or-none assignment/same
 uiommu).

That is makes uiommu basically the same as the meta-groups, right?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Joerg Roedel
On Tue, Aug 23, 2011 at 02:54:43AM -0400, Benjamin Herrenschmidt wrote:
 Possibly, the question that interest me the most is what interface will
 KVM end up using. I'm also not terribly fan with the (perceived)
 discrepancy between using uiommu to create groups but using the group fd
 to actually do the mappings, at least if that is still the plan.
 
 If the separate uiommu interface is kept, then anything that wants to be
 able to benefit from the ability to put multiple devices (or existing
 groups) into such a meta group would need to be explicitly modified to
 deal with the uiommu APIs.
 
 I tend to prefer such meta groups as being something you create
 statically using a configuration interface, either via sysfs, netlink or
 ioctl's to a control vfio device driven by a simple command line tool
 (which can have the configuration stored in /etc and re-apply it at
 boot).

Hmm, I don't think that these groups are static for the system's
run-time. They only exist for the lifetime of a guest by default, at
least on x86. That's why I prefer to do this grouping using VFIO and not
some sysfs interface (which would be the third interface besides the
ioctls and netlink a VFIO user needs to be aware of). Doing this in the
ioctl interface just makes things easier.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
 
  I am in favour of /dev/vfio/$GROUP. If multiple devices should be
  assigned to a guest, there can also be an ioctl to bind a group to an
  address-space of another group (certainly needs some care to not allow
  that both groups belong to different processes).
  
  Btw, a problem we havn't talked about yet entirely is
  driver-deassignment. User space can decide to de-assign the device from
  vfio while a fd is open on it. With PCI there is no way to let this fail
  (the .release function returns void last time i checked). Is this a
  problem, and yes, how we handle that?
 
 We can treat it as a hard unplug (like a cardbus gone away).
 
 IE. Dispose of the direct mappings (switch to MMIO emulation) and return
 all ff's from reads (& ignore writes).
 
 Then send an unplug event via whatever mechanism the platform provides
 (ACPI hotplug controller on x86 for example, we haven't quite sorted out
 what to do on power for hotplug yet).

Hmm, good idea. But as far as I know the hotplug-event needs to be in
the guest _before_ the device is actually unplugged (so that the guest
can unbind its driver first). That somehow brings back the sleep-idea
and the timeout in the .release function.
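
Something along these lines is what I have in mind, i.e. a bounded wait in
the release/unbind path before we escalate (sketch only; the wait queue, the
open counter and the ten second timeout are all made up, wait_event_timeout()
is the stock kernel primitive):

#include <linux/wait.h>
#include <linux/jiffies.h>
#include <linux/atomic.h>
#include <linux/kernel.h>

static DECLARE_WAIT_QUEUE_HEAD(vfio_release_wq);    /* hypothetical */
static atomic_t vfio_open_count = ATOMIC_INIT(0);   /* hypothetical refcount */

static void vfio_wait_for_release(void)
{
    /* step 1: notify user space (netlink/eventfd), not shown here */

    /* step 2: give the guest a bounded grace period to unbind its driver */
    if (!wait_event_timeout(vfio_release_wq,
                            atomic_read(&vfio_open_count) == 0,
                            msecs_to_jiffies(10000)))
        /* step 3: escalate, i.e. hard-unplug / revoke the mappings */
        pr_warn("vfio: user space did not release the device\n");
}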

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
 On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:

  I am in favour of /dev/vfio/$GROUP. If multiple devices should be
  assigned to a guest, there can also be an ioctl to bind a group to an
  address-space of another group (certainly needs some care to not allow
  that both groups belong to different processes).
 
 That's an interesting idea.  Maybe an interface similar to the current
 uiommu interface, where you open() the 2nd group fd and pass the fd via
 ioctl to the primary group.  IOMMUs that don't support this would fail
 the attach device callback, which would fail the ioctl to bind them.  It
 will need to be designed so any group can be removed from the super-set
 and the remaining group(s) still works.  This feels like something that
 can be added after we get an initial implementation.

Handling it through fds is a good idea. This makes sure that everything
belongs to one process. I am not really sure yet if we go the way to
just bind plain groups together or if we create meta-groups. The
meta-groups thing seems somewhat cleaner, though.

  Btw, a problem we havn't talked about yet entirely is
  driver-deassignment. User space can decide to de-assign the device from
  vfio while a fd is open on it. With PCI there is no way to let this fail
  (the .release function returns void last time i checked). Is this a
  problem, and yes, how we handle that?
 
 The current vfio has the same problem, we can't unbind a device from
 vfio while it's attached to a guest.  I think we'd use the same solution
 too; send out a netlink packet for a device removal and have the .remove
 call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
 and SIGBUS the PIDs holding the device if they don't return it
 willingly.  Thanks,

Putting the process to sleep (which would be uninterruptible) seems bad.
The process would sleep until the guest releases the device-group, which
can take days or months.
The best thing (and the most intrusive :-) ) is to change PCI core to
allow unbindings to fail, I think. But this probably further complicates
the way to upstream VFIO...

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 12:38 +1000, David Gibson wrote:
 On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote:
  On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
   On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
capture the plan that I think we agreed to:

We need to address both the description and enforcement of device
groups.  Groups are formed any time the iommu does not have resolution
between a set of devices.  On x86, this typically happens when a
PCI-to-PCI bridge exists between the set of devices and the iommu.  For
Power, partitionable endpoints define a group.  Grouping information
needs to be exposed for both userspace and kernel internal usage.  This
will be a sysfs attribute setup by the iommu drivers.  Perhaps:

# cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
42

(I use a PCI example here, but attribute should not be PCI specific)
   
   Ok.  Am I correct in thinking these group IDs are representing the
   minimum granularity, and are therefore always static, defined only by
   the connected hardware, not by configuration?
  
  Yes, that's the idea.  An open question I have towards the configuration
  side is whether we might add iommu driver specific options to the
  groups.  For instance on x86 where we typically have B:D.F granularity,
  should we have an option not to trust multi-function devices and use a
  B:D granularity for grouping?
 
 Right.  And likewise I can see a place for configuration parameters
 like the present 'allow_unsafe_irqs'.  But these would be more-or-less
 global options which affected the overall granularity, rather than
 detailed configuration such as explicitly binding some devices into a
 group, yes?

Yes, currently the interrupt remapping support is a global iommu
capability.  I suppose it's possible that this could be an iommu option,
where the iommu driver would not advertise a group if the interrupt
remapping constraint isn't met.

From there we have a few options.  In the BoF we discussed a model 
where
binding a device to vfio creates a /dev/vfio$GROUP character device
file.  This group fd provides provides dma mapping ioctls as well as
ioctls to enumerate and return a device fd for each attached member of
the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
returning an error on open() of the group fd if there are members of the
group not bound to the vfio driver.  Each device fd would then support a
similar set of ioctls and mapping (mmio/pio/config) interface as current
vfio, except for the obvious domain and dma ioctls superseded by the
group fd.
   
   It seems a slightly strange distinction that the group device appears
   when any device in the group is bound to vfio, but only becomes usable
   when all devices are bound.
   
Another valid model might be that /dev/vfio/$GROUP is created for all
groups when the vfio module is loaded.  The group fd would allow open()
and some set of iommu querying and device enumeration ioctls, but would
error on dma mapping and retrieving device fds until all of the group
devices are bound to the vfio driver.
   
   Which is why I marginally prefer this model, although it's not a big
   deal.
  
  Right, we can also combine models.  Binding a device to vfio
  creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
  device access until all the group devices are also bound.  I think
  the /dev/vfio/$GROUP might help provide an enumeration interface as well
  though, which could be useful.
 
 I'm not entirely sure what you mean here.  But, that's now several
 weak votes in favour of the always-present group devices, and none in
 favour of the created-when-first-device-bound model, so I suggest we
 take the /dev/vfio/$GROUP as our tentative approach.

Yep

In either case, the uiommu interface is removed entirely since dma
mapping is done via the group fd.  As necessary in the future, we can
define a more high performance dma mapping interface for streaming dma
via the group fd.  I expect we'll also include architecture specific
group ioctls to describe features and capabilities of the iommu.  The
group fd will need to prevent concurrent open()s to maintain a 1:1 group
to userspace process ownership model.
   
   A 1:1 group-process correspondance seems wrong to me. But there are
   many ways you could legitimately write the userspace side of the code,
   many of them involving some sort of concurrency.  Implementing that
   concurrency as multiple processes (using explicit shared memory and/or
   other IPC mechanisms to co-ordinate) seems a valid choice that we
   shouldn't arbitrarily prohibit.
   
   Obviously, only one UID may be permitted to have the group open at a
   time, and I think that's enough to prevent them doing any 

Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread aafabbri



On 8/23/11 4:04 AM, Joerg Roedel joerg.roe...@amd.com wrote:

 On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
 You have to enforce group/iommu domain assignment whether you have the
 existing uiommu API, or if you change it to your proposed
 ioctl(inherit_iommu) API.
 
 The only change needed to VFIO here should be to make uiommu fd assignment
 happen on the groups instead of on device fds.  That operation fails or
 succeeds according to the group semantics (all-or-none assignment/same
 uiommu).
 
 That is makes uiommu basically the same as the meta-groups, right?

Yes, functionality seems the same, thus my suggestion to keep uiommu
explicit.  Is there some need for group-groups besides defining sets of
groups which share IOMMU resources?

I do all this stuff (bringing up sets of devices which may share IOMMU
domain) dynamically from C applications.  I don't really want some static
(boot-time or sysfs fiddling) supergroup config unless there is a good
reason KVM/power needs it.

As you say in your next email, doing it all from ioctls is very easy,
programmatically.

-Aaron Fabbri



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
 
  I'm not following you.
  
  You have to enforce group/iommu domain assignment whether you have the
  existing uiommu API, or if you change it to your proposed
  ioctl(inherit_iommu) API.
  
  The only change needed to VFIO here should be to make uiommu fd assignment
  happen on the groups instead of on device fds.  That operation fails or
  succeeds according to the group semantics (all-or-none assignment/same
  uiommu).
 
 Ok, so I missed that part where you change uiommu to operate on group
 fd's rather than device fd's, my apologies if you actually wrote that
 down :-) It might be obvious ... bare with me I just flew back from the
 US and I am badly jet lagged ...

I missed it too, the model I'm proposing entirely removes the uiommu
concept.

 So I see what you mean, however...
 
  I think the question is: do we force 1:1 iommu/group mapping, or do we allow
  arbitrary mapping (satisfying group constraints) as we do today.
  
  I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
  ability and definitely think the uiommu approach is cleaner than the
  ioctl(inherit_iommu) approach.  We considered that approach before but it
  seemed less clean so we went with the explicit uiommu context.
 
 Possibly, the question that interest me the most is what interface will
 KVM end up using. I'm also not terribly fan with the (perceived)
 discrepancy between using uiommu to create groups but using the group fd
 to actually do the mappings, at least if that is still the plan.

Current code: uiommu creates the domain, we bind a vfio device to that
domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
mappings via MAP_DMA on the vfio device (affecting all the vfio devices
bound to the domain)

My current proposal: groups are predefined.  groups ~= iommu domain.
The iommu domain would probably be allocated when the first device is
bound to vfio.  As each device is bound, it gets attached to the group.
DMAs are done via an ioctl on the group.
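
From user space that flow would look roughly like this (all of the ioctl
names, request numbers and the mapping struct are placeholders for
illustration, not a settled ABI; error handling is mostly omitted):

#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdint.h>

#define VFIO_GROUP_GET_DEVICE_FD  _IO(';', 100)   /* made-up requests */
#define VFIO_GROUP_MAP_DMA        _IO(';', 102)

struct dma_map_req {                    /* illustrative only */
    uint64_t iova;
    uint64_t vaddr;
    uint64_t size;
    uint32_t flags;
};

int assign_device(const char *group_path, const char *bdf,
                  void *buf, uint64_t iova, uint64_t size)
{
    int group_fd = open(group_path, O_RDWR);    /* e.g. /dev/vfio/42 */
    if (group_fd < 0)
        return -1;

    /* device fds are handed out by the group... */
    int dev_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, bdf);

    /* ...and DMA mappings are programmed through the group itself,
     * which owns the iommu domain */
    struct dma_map_req req = {
        .iova = iova, .vaddr = (uintptr_t)buf, .size = size,
    };
    ioctl(group_fd, VFIO_GROUP_MAP_DMA, &req);

    return dev_fd;
}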

I think group + uiommu leads to effectively reliving most of the
problems with the current code.  The only benefit is the group
assignment to enforce hardware restrictions.  We still have the problem
that uiommu open() = iommu_domain_alloc(), whose properties are
meaningless without attached devices (groups).  Which I think leads to
the same awkward model of attaching groups to define the domain, then we
end up doing mappings via the group to enforce ordering.

 If the separate uiommu interface is kept, then anything that wants to be
 able to benefit from the ability to put multiple devices (or existing
 groups) into such a meta group would need to be explicitly modified to
 deal with the uiommu APIs.
 
 I tend to prefer such meta groups as being something you create
 statically using a configuration interface, either via sysfs, netlink or
 ioctl's to a control vfio device driven by a simple command line tool
 (which can have the configuration stored in /etc and re-apply it at
 boot).

I cringe anytime there's a mention of static.  IMHO, we have to
support hotplug.  That means meta groups change dynamically.  Maybe
this supports the idea that we should be able to retrieve a new fd from
the group to do mappings.  Any groups bound together will return the
same fd and the fd will persist so long as any member of the group is
open.

 That way, any program capable of exploiting VFIO groups will
 automatically be able to exploit those meta groups (or groups of
 groups) as well as long as they are supported on the system.
 
 If we ever have system specific constraints as to how such groups can be
 created, then it can all be handled at the level of that configuration
 tool without impact on whatever programs know how to exploit them via
 the VFIO interfaces.

I'd prefer to have the constraints be represented in the ioctl to bind
groups.  It works or not and the platform gets to define what it
considers compatible.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
 On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote:
  On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote:
 
   I am in favour of /dev/vfio/$GROUP. If multiple devices should be
   assigned to a guest, there can also be an ioctl to bind a group to an
   address-space of another group (certainly needs some care to not allow
   that both groups belong to different processes).
  
  That's an interesting idea.  Maybe an interface similar to the current
  uiommu interface, where you open() the 2nd group fd and pass the fd via
  ioctl to the primary group.  IOMMUs that don't support this would fail
  the attach device callback, which would fail the ioctl to bind them.  It
  will need to be designed so any group can be removed from the super-set
  and the remaining group(s) still works.  This feels like something that
  can be added after we get an initial implementation.
 
 Handling it through fds is a good idea. This makes sure that everything
 belongs to one process. I am not really sure yet if we go the way to
 just bind plain groups together or if we create meta-groups. The
 meta-groups thing seems somewhat cleaner, though.

I'm leaning towards binding because we need to make it dynamic, but I
don't really have a good picture of the lifecycle of a meta-group.

   Btw, a problem we havn't talked about yet entirely is
   driver-deassignment. User space can decide to de-assign the device from
   vfio while a fd is open on it. With PCI there is no way to let this fail
   (the .release function returns void last time i checked). Is this a
   problem, and yes, how we handle that?
  
  The current vfio has the same problem, we can't unbind a device from
  vfio while it's attached to a guest.  I think we'd use the same solution
  too; send out a netlink packet for a device removal and have the .remove
  call sleep on a wait_event(, refcnt == 0).  We could also set a timeout
  and SIGBUS the PIDs holding the device if they don't return it
  willingly.  Thanks,
 
 Putting the process to sleep (which would be uninterruptible) seems bad.
 The process would sleep until the guest releases the device-group, which
 can take days or months.
 The best thing (and the most intrusive :-) ) is to change PCI core to
 allow unbindings to fail, I think. But this probably further complicates
 the way to upstream VFIO...

Yes, it's not ideal but I think it's sufficient for now and if we later
get support for returning an error from release, we can set a timeout
after notifying the user to make use of that.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Aaron Fabbri



On 8/23/11 10:01 AM, Alex Williamson alex.william...@redhat.com wrote:

 On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
 
 I'm not following you.
 
 You have to enforce group/iommu domain assignment whether you have the
 existing uiommu API, or if you change it to your proposed
 ioctl(inherit_iommu) API.
 
 The only change needed to VFIO here should be to make uiommu fd assignment
 happen on the groups instead of on device fds.  That operation fails or
 succeeds according to the group semantics (all-or-none assignment/same
 uiommu).
 
 Ok, so I missed that part where you change uiommu to operate on group
 fd's rather than device fd's, my apologies if you actually wrote that
 down :-) It might be obvious ... bare with me I just flew back from the
 US and I am badly jet lagged ...
 
 I missed it too, the model I'm proposing entirely removes the uiommu
 concept.
 
 So I see what you mean, however...
 
 I think the question is: do we force 1:1 iommu/group mapping, or do we allow
 arbitrary mapping (satisfying group constraints) as we do today.
 
 I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
 ability and definitely think the uiommu approach is cleaner than the
 ioctl(inherit_iommu) approach.  We considered that approach before but it
 seemed less clean so we went with the explicit uiommu context.
 
 Possibly, the question that interest me the most is what interface will
 KVM end up using. I'm also not terribly fan with the (perceived)
 discrepancy between using uiommu to create groups but using the group fd
 to actually do the mappings, at least if that is still the plan.
 
 Current code: uiommu creates the domain, we bind a vfio device to that
 domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
 mappings via MAP_DMA on the vfio device (affecting all the vfio devices
 bound to the domain)
 
 My current proposal: groups are predefined.  groups ~= iommu domain.

This is my main objection.  I'd rather not lose the ability to have multiple
devices (which are all predefined as singleton groups on x86 w/o PCI
bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

 The iommu domain would probably be allocated when the first device is
 bound to vfio.  As each device is bound, it gets attached to the group.
 DMAs are done via an ioctl on the group.
 
 I think group + uiommu leads to effectively reliving most of the
 problems with the current code.  The only benefit is the group
 assignment to enforce hardware restrictions.  We still have the problem
 that uiommu open() = iommu_domain_alloc(), whose properties are
 meaningless without attached devices (groups).  Which I think leads to
 the same awkward model of attaching groups to define the domain, then we
 end up doing mappings via the group to enforce ordering.

Is there a better way to allow groups to share an IOMMU domain?

Maybe, instead of having an ioctl to allow a group A to inherit the same
iommu domain as group B, we could have an ioctl to fully merge two groups
(could be what Ben was thinking):

A.ioctl(MERGE_TO_GROUP, B)

The group A now goes away and its devices join group B.  If A ever had an
iommu domain assigned (and buffers mapped?) we fail.

Groups cannot get smaller (they are defined as minimum granularity of an
IOMMU, initially).  They can get bigger if you want to share IOMMU
resources, though.

Any downsides to this approach?
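
In code the whole thing could be as small as this (the request number and the
errno convention are made up for the example):

#include <sys/ioctl.h>
#include <errno.h>
#include <stdio.h>

#define VFIO_GROUP_MERGE_TO_GROUP  _IO(';', 104)   /* hypothetical */

/* Fold group A into group B so they share one IOMMU domain.  Afterwards
 * only b_fd is used; A's devices now belong to B. */
int merge_groups(int a_fd, int b_fd)
{
    int ret = ioctl(a_fd, VFIO_GROUP_MERGE_TO_GROUP, b_fd);

    if (ret < 0 && errno == EBUSY)
        fprintf(stderr, "group A already had a domain/mappings\n");
    return ret;
}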

-AF

 
 If the separate uiommu interface is kept, then anything that wants to be
 able to benefit from the ability to put multiple devices (or existing
 groups) into such a meta group would need to be explicitly modified to
 deal with the uiommu APIs.
 
 I tend to prefer such meta groups as being something you create
 statically using a configuration interface, either via sysfs, netlink or
 ioctl's to a control vfio device driven by a simple command line tool
 (which can have the configuration stored in /etc and re-apply it at
 boot).
 
 I cringe anytime there's a mention of static.  IMHO, we have to
 support hotplug.  That means meta groups change dynamically.  Maybe
 this supports the idea that we should be able to retrieve a new fd from
 the group to do mappings.  Any groups bound together will return the
 same fd and the fd will persist so long as any member of the group is
 open.
 
 That way, any program capable of exploiting VFIO groups will
 automatically be able to exploit those meta groups (or groups of
 groups) as well as long as they are supported on the system.
 
 If we ever have system specific constraints as to how such groups can be
 created, then it can all be handled at the level of that configuration
 tool without impact on whatever programs know how to exploit them via
 the VFIO interfaces.
 
 I'd prefer to have the constraints be represented in the ioctl to bind
 groups.  It works or not and 

Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 10:33 -0700, Aaron Fabbri wrote:
 
 
 On 8/23/11 10:01 AM, Alex Williamson alex.william...@redhat.com wrote:
 
  On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
  On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
  
  I'm not following you.
  
  You have to enforce group/iommu domain assignment whether you have the
  existing uiommu API, or if you change it to your proposed
  ioctl(inherit_iommu) API.
  
  The only change needed to VFIO here should be to make uiommu fd assignment
  happen on the groups instead of on device fds.  That operation fails or
  succeeds according to the group semantics (all-or-none assignment/same
  uiommu).
  
  Ok, so I missed that part where you change uiommu to operate on group
  fd's rather than device fd's, my apologies if you actually wrote that
  down :-) It might be obvious ... bare with me I just flew back from the
  US and I am badly jet lagged ...
  
  I missed it too, the model I'm proposing entirely removes the uiommu
  concept.
  
  So I see what you mean, however...
  
  I think the question is: do we force 1:1 iommu/group mapping, or do we 
  allow
  arbitrary mapping (satisfying group constraints) as we do today.
  
  I'm saying I'm an existing user who wants the arbitrary iommu/group 
  mapping
  ability and definitely think the uiommu approach is cleaner than the
  ioctl(inherit_iommu) approach.  We considered that approach before but it
  seemed less clean so we went with the explicit uiommu context.
  
  Possibly, the question that interest me the most is what interface will
  KVM end up using. I'm also not terribly fan with the (perceived)
  discrepancy between using uiommu to create groups but using the group fd
  to actually do the mappings, at least if that is still the plan.
  
  Current code: uiommu creates the domain, we bind a vfio device to that
  domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
  mappings via MAP_DMA on the vfio device (affecting all the vfio devices
  bound to the domain)
  
  My current proposal: groups are predefined.  groups ~= iommu domain.
 
 This is my main objection.  I'd rather not lose the ability to have multiple
 devices (which are all predefined as singleton groups on x86 w/o PCI
 bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
 require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

We do care, I just wasn't prioritizing it as heavily since I think the
typical model is probably closer to 1 device per guest.

  The iommu domain would probably be allocated when the first device is
  bound to vfio.  As each device is bound, it gets attached to the group.
  DMAs are done via an ioctl on the group.
  
  I think group + uiommu leads to effectively reliving most of the
  problems with the current code.  The only benefit is the group
  assignment to enforce hardware restrictions.  We still have the problem
  that uiommu open() = iommu_domain_alloc(), whose properties are
  meaningless without attached devices (groups).  Which I think leads to
  the same awkward model of attaching groups to define the domain, then we
  end up doing mappings via the group to enforce ordering.
 
 Is there a better way to allow groups to share an IOMMU domain?
 
 Maybe, instead of having an ioctl to allow a group A to inherit the same
 iommu domain as group B, we could have an ioctl to fully merge two groups
 (could be what Ben was thinking):
 
 A.ioctl(MERGE_TO_GROUP, B)
 
 The group A now goes away and its devices join group B.  If A ever had an
 iommu domain assigned (and buffers mapped?) we fail.
 
 Groups cannot get smaller (they are defined as minimum granularity of an
 IOMMU, initially).  They can get bigger if you want to share IOMMU
 resources, though.
 
 Any downsides to this approach?

That's sort of the way I'm picturing it.  When groups are bound
together, they effectively form a pool, where all the groups are peers.
When the MERGE/BIND ioctl is called on group A and passed the group B
fd, A can check compatibility of the domain associated with B, unbind
devices from the B domain and attach them to the A domain.  The B domain
would then be freed and it would bump the refcnt on the A domain.  If we
need to remove A from the pool, we call UNMERGE/UNBIND on B with the A
fd, it will remove the A devices from the shared object, disassociate A
with the shared object, re-alloc a domain for A and rebind A devices to
that domain. 

This is where it seems like it might be helpful to make a GET_IOMMU_FD
ioctl so that an iommu object is ubiquitous and persistent across the
pool.  Operations on any group fd work on the pool as a whole.  Thanks,

Alex
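
To make the pool model above a bit more concrete, here is a rough userspace
sketch. None of the ioctl names or numbers below exist; VFIO_GROUP_MERGE,
VFIO_GROUP_UNMERGE and VFIO_GROUP_GET_IOMMU_FD are hypothetical placeholders
for whatever the final interface ends up being:

#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* Hypothetical request codes, for illustration only. */
#define VFIO_GROUP_MERGE        _IO(';', 100)   /* arg: fd of the group to absorb */
#define VFIO_GROUP_UNMERGE      _IO(';', 101)   /* arg: fd of the group to detach */
#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 102)

int pool_example(void)
{
        int a = open("/dev/vfio/26", O_RDWR);
        int b = open("/dev/vfio/42", O_RDWR);

        /* A checks B's domain for compatibility, rebinds B's devices into
         * A's domain and frees B's domain; fails if that isn't possible. */
        if (ioctl(a, VFIO_GROUP_MERGE, b) < 0)
                return -1;

        /* The iommu object is shared by the whole pool; either group fd
         * should hand back the same persistent iommu fd. */
        int iommu = ioctl(a, VFIO_GROUP_GET_IOMMU_FD);

        /* Pull A back out of the pool: call unmerge on B with A's fd. */
        ioctl(b, VFIO_GROUP_UNMERGE, a);
        return iommu;
}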



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alex Williamson
On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote:
 
  Yes, that's the idea.  An open question I have towards the configuration
  side is whether we might add iommu driver specific options to the
  groups.  For instance on x86 where we typically have B:D.F granularity,
  should we have an option not to trust multi-function devices and use a
  B:D granularity for grouping?
 
 Or even B or range of busses... if you want to enforce strict isolation
 you really can't trust anything below a bus level :-)
 
  Right, we can also combine models.  Binding a device to vfio
  creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
  device access until all the group devices are also bound.  I think
  the /dev/vfio/$GROUP might help provide an enumeration interface as well
  though, which could be useful.
 
 Could be tho in what form ? returning sysfs paths ?

I'm at a loss there, please suggest.  I think we need an ioctl that
returns some kind of array of devices within the group and another that
maybe takes an index from that array and returns an fd for that device.
A sysfs path string might be a reasonable array element, but it sounds
like a pain to work with.
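
As a straw man for that enumeration interface, something along these lines
might do; both ioctl names are made up, and the sysfs path (or any other
identifier) could be fetched through a further device-level ioctl:

#include <fcntl.h>
#include <linux/ioctl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Hypothetical request codes, for illustration only. */
#define VFIO_GROUP_GET_NUM_DEVICES _IO(';', 110)
#define VFIO_GROUP_GET_DEVICE_FD   _IO(';', 111)   /* arg: index within the group */

static void list_group(const char *group_path)
{
        int group = open(group_path, O_RDWR);
        int i, n = ioctl(group, VFIO_GROUP_GET_NUM_DEVICES);

        for (i = 0; i < n; i++) {
                /* KVM_CREATE_VCPU style: the ioctl returns a new device fd */
                int dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, i);
                printf("device %d -> fd %d\n", i, dev);
        }
}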

  1:1 group-process is probably too strong.  Not allowing concurrent
  open()s on the group file enforces a single userspace entity is
  responsible for that group.  Device fds can be passed to other
  processes, but only retrieved via the group fd.  I suppose we could even
  branch off the dma interface into a different fd, but it seems like we
  would logically want to serialize dma mappings at each iommu group
  anyway.  I'm open to alternatives, this just seemed an easy way to do
  it.  Restricting on UID implies that we require isolated qemu instances
  to run as different UIDs.  I know that's a goal, but I don't know if we
  want to make it an assumption in the group security model.
 
 1:1 process has the advantage of linking to an ->mm which makes the whole
 mmu notifier business doable. How do you want to track down mappings and
 do the second level translation in the case of explicit map/unmap (like
 on power) if you are not tied to an mm_struct ?

Right, I threw away the mmu notifier code that was originally part of
vfio because we can't do anything useful with it yet on x86.  I
definitely don't want to prevent it where it makes sense though.  Maybe
we just record current->mm on open and restrict subsequent opens to the
same.
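
A minimal sketch of what that could look like on the kernel side, assuming a
per-group structure (none of this is existing vfio code, and reference
counting on the mm is omitted for brevity):

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/sched.h>

struct vfio_group {
        struct mutex      lock;
        struct mm_struct *mm;           /* set by the first opener */
        int               refcnt;
};

static int vfio_group_attach(struct vfio_group *group)
{
        int ret = 0;

        mutex_lock(&group->lock);
        if (!group->mm)
                group->mm = current->mm;        /* first open claims the group */
        else if (group->mm != current->mm)
                ret = -EBUSY;                   /* different address space: refuse */
        if (!ret)
                group->refcnt++;
        mutex_unlock(&group->lock);
        return ret;
}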

  Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
  to assume 1 device per guest is a typical model and that the iotlb is
  large enough that we might improve thrashing to see both a resource and
  performance benefit from it.  I'm open to suggestions for how we could
  include it though.
 
 Sharing may or may not be possible depending on setups so yes, it's a
 bit tricky.
 
 My preference is to have a static interface (and that's actually where
 your pet netlink might make some sense :-) to create synthetic groups
 made of other groups if the arch allows it. But that might not be the
 best approach. In another email I also proposed an option for a group to
 capture another one...

I already made some comments on this in a different thread, so I won't
repeat here.

   If that's
   not what you're saying, how would the domains - now made up of a
   user's selection of groups, rather than individual devices - be
   configured?
   
Hope that captures it, feel free to jump in with corrections and
suggestions.  Thanks,
   
 
 Another aspect I don't see discussed is how we represent these things to
 the guest.
 
 On Power for example, I have a requirement that a given iommu domain is
 represented by a single dma window property in the device-tree. What
 that means is that that property needs to be either in the node of the
 device itself if there's only one device in the group or in a parent
 node (ie a bridge or host bridge) if there are multiple devices.
 
 Now I do -not- want to go down the path of simulating P2P bridges,
 besides we'll quickly run out of bus numbers if we go there.
 
 For us the most simple and logical approach (which is also what pHyp
 uses and what Linux handles well) is really to expose a given PCI host
 bridge per group to the guest. Believe it or not, it makes things
 easier :-)

I'm all for easier.  Why does exposing the bridge use less bus numbers
than emulating a bridge?

On x86, I want to maintain that our default assignment is at the device
level.  A user should be able to pick single or multiple devices from
across several groups and have them all show up as individual,
hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
also seen cases where users try to attach a bridge to the guest,
assuming they'll get all the devices below the bridge, so I'd be in
favor of making this just work if possible too, though we may have to
prevent hotplug of those.

Given the device 

Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Benjamin Herrenschmidt
On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:
 On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
  
   I am in favour of /dev/vfio/$GROUP. If multiple devices should be
   assigned to a guest, there can also be an ioctl to bind a group to an
   address-space of another group (certainly needs some care to not allow
   that both groups belong to different processes).
   
   Btw, a problem we haven't entirely talked about yet is
   driver-deassignment. User space can decide to de-assign the device from
   vfio while a fd is open on it. With PCI there is no way to let this fail
   (the .release function returns void last time I checked). Is this a
   problem, and if yes, how do we handle that?
  
  We can treat it as a hard unplug (like a cardbus gone away).
  
  IE. Dispose of the direct mappings (switch to MMIO emulation) and return
  all ff's from reads (& ignore writes).
  
  Then send an unplug event via whatever mechanism the platform provides
  (ACPI hotplug controller on x86 for example, we haven't quite sorted out
  what to do on power for hotplug yet).
 
 Hmm, good idea. But as far as I know the hotplug-event needs to be in
 the guest _before_ the device is actually unplugged (so that the guest
 can unbind its driver first). That somehow brings back the sleep-idea
 and the timeout in the .release function.

That's for normal assisted hotplug, but don't we support hard hotplug ?
I mean, things like cardbus, thunderbolt (if we ever support that)
etc... will need it and some platforms do support hard hotplug of PCIe
devices.

(That's why drivers should never spin on MMIO waiting for a 1 bit to
clear without a timeout :-)

Cheers,
Ben.




Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Benjamin Herrenschmidt
On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote:
 
 Yeah.  Joerg's idea of binding groups internally (pass the fd of one
 group to another via ioctl) is one option.  The tricky part will be
 implementing it to support hot unplug of any group from the
 supergroup.
 I believe Ben had a suggestion that supergroups could be created in
 sysfs, but I don't know what the mechanism to do that looks like.  It
 would also be an extra management step to dynamically bind and unbind
 groups to the supergroup around hotplug.  Thanks, 

I don't really care that much what the method for creating them is, to
be honest, I just prefer this concept of meta groups or super groups
or synthetic groups (whatever you want to name them) to having a
separate uiommu file descriptor.

The one reason I have a slight preference for creating them statically
using some kind of separate interface (again, I don't care whether it's
sysfs, netlink, etc...) is that it means things like qemu don't have to
care about them.

In general, apps that want to use vfio can just get passed the path to
such a group or the /dev/ path or the group number (whatever we chose as
the way to identify a group), and don't need to know anything about
super groups, how to manipulate them, create them, possible
constraints etc...

Now, libvirt might want to know about that other API in order to provide
control on the creation of these things, but that's a different issue.

By static I mean they persist, they aren't tied to the lifetime of an
fd.

Now that's purely a preference on my side because I believe it will make
life easier for actual programs wanting to use vfio to not have to care
about those super-groups, but as I said earlier, I don't actually care
that much :-)

Cheers,
Ben.




Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Benjamin Herrenschmidt

  For us the most simple and logical approach (which is also what pHyp
  uses and what Linux handles well) is really to expose a given PCI host
  bridge per group to the guest. Believe it or not, it makes things
  easier :-)
 
 I'm all for easier.  Why does exposing the bridge use less bus numbers
 than emulating a bridge?

Because a host bridge doesn't look like a PCI to PCI bridge at all for
us. It's an entirely separate domain with its own bus number space
(unlike most x86 setups).

In fact we have some problems afaik in qemu today with the concept of
PCI domains, for example, I think qemu has assumptions about a single
shared IO space domain which isn't true for us (each PCI host bridge
provides a distinct IO space domain starting at 0). We'll have to fix
that, but it's not a huge deal.

So for each group we'd expose in the guest an entire separate PCI
domain space with its own IO, MMIO etc... spaces, handed off from a
single device-tree host bridge which doesn't itself appear in the
config space, doesn't need any emulation of any config space etc...

 On x86, I want to maintain that our default assignment is at the device
 level.  A user should be able to pick single or multiple devices from
 across several groups and have them all show up as individual,
 hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
 also seen cases where users try to attach a bridge to the guest,
 assuming they'll get all the devices below the bridge, so I'd be in
 favor of making this just work if possible too, though we may have to
 prevent hotplug of those.

 Given the device requirement on x86 and since everything is a PCI device
 on x86, I'd like to keep a qemu command line something like -device
 vfio,host=00:19.0.  I assume that some of the iommu properties, such as
 dma window size/address, will be query-able through an architecture
 specific (or general if possible) ioctl on the vfio group fd.  I hope
 that will help the specification, but I don't fully understand what all
 remains.  Thanks,

Well, for iommu there's a couple of different issues here but yes,
basically on one side we'll have some kind of ioctl to know what segment
of the device(s) DMA address space is assigned to the group and we'll
need to represent that to the guest via a device-tree property in some
kind of parent node of all the devices in that group.
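
For illustration, the query could be as simple as an ioctl filling in a small
struct; the name and layout below are invented, just to show the kind of
information qemu would turn into a device-tree property:

#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* Hypothetical; nothing like this exists yet. */
struct vfio_dma_window {
        uint64_t start;         /* first usable DMA (bus) address */
        uint64_t size;          /* size of the window in bytes */
        uint64_t page_size;     /* iommu page size backing the window */
};

#define VFIO_GROUP_GET_DMA_WINDOW _IOR(';', 120, struct vfio_dma_window)

/* qemu would issue this on the group fd and emit the result as a
 * dma-window style property in the node parenting the group's devices. */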

We -might- be able to implement some kind of hotplug of individual
devices of a group under such a PHB (PCI Host Bridge), I don't know for
sure yet, some of that PAPR stuff is pretty arcane, but basically, for
all intents and purposes, we really want a group to be represented as a
PHB in the guest.

We cannot arbitrarily have individual devices of separate groups be
represented in the guest as siblings on a single simulated PCI bus.

Cheers,
Ben.



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alexander Graf

On 23.08.2011, at 18:41, Benjamin Herrenschmidt wrote:

 On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote:
 
 Yeah.  Joerg's idea of binding groups internally (pass the fd of one
 group to another via ioctl) is one option.  The tricky part will be
 implementing it to support hot unplug of any group from the
 supergroup.
 I believe Ben had a suggestion that supergroups could be created in
 sysfs, but I don't know what the mechanism to do that looks like.  It
 would also be an extra management step to dynamically bind and unbind
 groups to the supergroup around hotplug.  Thanks, 
 
 I don't really care that much what the method for creating them is, to
 be honest, I just prefer this concept of meta groups or super groups
 or synthetic groups (whatever you want to name them) to having a
 separate uiommu file descriptor.
 
 The one reason I have a slight preference for creating them statically
 using some kind of separate interface (again, I don't care whether it's
 sysfs, netlink, etc...) is that it means things like qemu don't have to
 care about them.
 
 In general, apps that want to use vfio can just get passed the path to
 such a group or the /dev/ path or the group number (whatever we chose as
 the way to identify a group), and don't need to know anything about
 super groups, how to manipulate them, create them, possible
 constraints etc...
 
 Now, libvirt might want to know about that other API in order to provide
 control on the creation of these things, but that's a different issue.
 
 By static I mean they persist, they aren't tied to the lifetime of an
 fd.
 
 Now that's purely a preference on my side because I believe it will make
 life easier for actual programs wanting to use vfio to not have to care
 about those super-groups, but as I said earlier, I don't actually care
 that much :-)

Oh I think it's one of the building blocks we need for a sane user space device 
exposure API. If I want to pass user X a few devices that are all behind a 
single IOMMU, I just chown that device node to user X and be done with it.

The user space tool actually using the VFIO interface wouldn't be in 
configuration business then - and it really shouldn't. That's what system 
configuration is there for :).

But I'm fairly sure we managed to persuade Alex that this is the right path at 
the BoF :)


Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-23 Thread Alexander Graf

On 23.08.2011, at 18:51, Benjamin Herrenschmidt wrote:

 
 For us the most simple and logical approach (which is also what pHyp
 uses and what Linux handles well) is really to expose a given PCI host
 bridge per group to the guest. Believe it or not, it makes things
 easier :-)
 
 I'm all for easier.  Why does exposing the bridge use less bus numbers
 than emulating a bridge?
 
 Because a host bridge doesn't look like a PCI to PCI bridge at all for
 us. It's an entirely separate domain with its own bus number space
 (unlike most x86 setups).
 
 In fact we have some problems afaik in qemu today with the concept of
 PCI domains, for example, I think qemu has assumptions about a single
 shared IO space domain which isn't true for us (each PCI host bridge
 provides a distinct IO space domain starting at 0). We'll have to fix
 that, but it's not a huge deal.
 
 So for each group we'd expose in the guest an entire separate PCI
 domain space with its own IO, MMIO etc... spaces, handed off from a
 single device-tree host bridge which doesn't itself appear in the
 config space, doesn't need any emulation of any config space etc...
 
 On x86, I want to maintain that our default assignment is at the device
 level.  A user should be able to pick single or multiple devices from
 across several groups and have them all show up as individual,
 hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
 also seen cases where users try to attach a bridge to the guest,
 assuming they'll get all the devices below the bridge, so I'd be in
 favor of making this just work if possible too, though we may have to
 prevent hotplug of those.
 
 Given the device requirement on x86 and since everything is a PCI device
 on x86, I'd like to keep a qemu command line something like -device
 vfio,host=00:19.0.  I assume that some of the iommu properties, such as
 dma window size/address, will be query-able through an architecture
 specific (or general if possible) ioctl on the vfio group fd.  I hope
 that will help the specification, but I don't fully understand what all
 remains.  Thanks,
 
 Well, for iommu there's a couple of different issues here but yes,
 basically on one side we'll have some kind of ioctl to know what segment
 of the device(s) DMA address space is assigned to the group and we'll
 need to represent that to the guest via a device-tree property in some
 kind of parent node of all the devices in that group.
 
 We -might- be able to implement some kind of hotplug of individual
 devices of a group under such a PHB (PCI Host Bridge), I don't know for
 sure yet, some of that PAPR stuff is pretty arcane, but basically, for
 all intents and purposes, we really want a group to be represented as a
 PHB in the guest.
 
 We cannot arbitrarily have individual devices of separate groups be
 represented in the guest as siblings on a single simulated PCI bus.

So would it make sense for you to go the same route that we need to go on 
embedded power, with a separate VFIO style interface that simply exports memory 
ranges and irq bindings, but doesn't know anything about PCI? For e500, we'll 
be using something like that to pass through a full PCI bus into the system.


Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/20/2011 07:51 PM, Alex Williamson wrote:

We need to address both the description and enforcement of device
groups.  Groups are formed any time the iommu does not have resolution
between a set of devices.  On x86, this typically happens when a
PCI-to-PCI bridge exists between the set of devices and the iommu.  For
Power, partitionable endpoints define a group.  Grouping information
needs to be exposed for both userspace and kernel internal usage.  This
will be a sysfs attribute setup by the iommu drivers.  Perhaps:

# cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
42



$ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
../../../path/to/device/which/represents/the/resource/constraint

(the pci-to-pci bridge on x86, or whatever node represents partitionable 
endpoints on power)
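
Assuming iommu_group ends up being such a symlink (it is only a proposal at
this point), userspace could tell whether two devices are constrained together
by comparing the canonicalized link targets, e.g.:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if both sysfs device directories point at the same
 * constraint node through their iommu_group link. */
static int same_group(const char *dev_a, const char *dev_b)
{
        char link_a[PATH_MAX], link_b[PATH_MAX];
        char real_a[PATH_MAX], real_b[PATH_MAX];

        snprintf(link_a, sizeof(link_a), "%s/iommu_group", dev_a);
        snprintf(link_b, sizeof(link_b), "%s/iommu_group", dev_b);
        if (!realpath(link_a, real_a) || !realpath(link_b, real_b))
                return 0;
        return strcmp(real_a, real_b) == 0;
}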


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Joerg Roedel
On Mon, Aug 22, 2011 at 02:30:26AM -0400, Avi Kivity wrote:
 On 08/20/2011 07:51 PM, Alex Williamson wrote:
  We need to address both the description and enforcement of device
  groups.  Groups are formed any time the iommu does not have resolution
  between a set of devices.  On x86, this typically happens when a
  PCI-to-PCI bridge exists between the set of devices and the iommu.  For
  Power, partitionable endpoints define a group.  Grouping information
  needs to be exposed for both userspace and kernel internal usage.  This
  will be a sysfs attribute setup by the iommu drivers.  Perhaps:
 
  # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
  42
 
 
 $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
 ../../../path/to/device/which/represents/the/resource/constraint
 
 (the pci-to-pci bridge on x86, or whatever node represents partitionable 
 endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).

Regards,

Joerg




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 01:46 PM, Joerg Roedel wrote:

   $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
  ../../../path/to/device/which/represents/the/resource/constraint

  (the pci-to-pci bridge on x86, or whatever node represents partitionable
  endpoints on power)

That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).



How does the kernel detect that devices behind the invisible bridge must 
be assigned as a unit?


--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
 On 08/22/2011 01:46 PM, Joerg Roedel wrote:
  That does not work. The bridge in question may not even be visible as a
  PCI device, so you can't link to it. This is the case on a few PCIe
  cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
  the PCIe interface (yes, I have seen those cards).
 
 How does the kernel detect that devices behind the invisible bridge must 
 be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.

Joerg




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 03:36 PM, Roedel, Joerg wrote:

On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote:
  On 08/22/2011 01:46 PM, Joerg Roedel wrote:
That does not work. The bridge in question may not even be visible as a
PCI device, so you can't link to it. This is the case on a few PCIe
cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement
the PCIe interface (yes, I have seen those cards).

  How does the kernel detect that devices behind the invisible bridge must
  be assigned as a unit?

On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.



I see.  There is no sysfs node representing it?

I'd rather not add another meaningless identifier.

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
 On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
  On the AMD IOMMU side this information is stored in the IVRS ACPI table.
  Not sure about the VT-d side, though.
 
 I see.  There is no sysfs node representing it?

No. It also doesn't exist as a 'struct pci_dev'. This caused problems in
the AMD IOMMU driver in the past and I needed to fix that. That's how I
know :)

 I'd rather not add another meaningless identifier.

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.

Joerg




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote:
  On 08/22/2011 03:36 PM, Roedel, Joerg wrote:
On the AMD IOMMU side this information is stored in the IVRS ACPI table.
Not sure about the VT-d side, though.

  I see.  There is no sysfs node representing it?

No. It also doesn't exist as a 'struct pci_dev'. This caused problems in
the AMD IOMMU driver in the past and I needed to fix that. There I know
that from :)


Well, too bad.



  I'd rather not add another meaningless identifier.

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.



I mean the contents of the group descriptor.  There are enough 42s in 
the kernel, it's better if we can replace a synthetic number with 
something meaningful.


--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
 On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

  Well, I don't think it's really meaningless, but we need some way to
  communicate the information about device groups to userspace.
 
 I mean the contents of the group descriptor.  There are enough 42s in 
 the kernel, it's better if we can replace a synthetic number with 
 something meaningful.

If we only look at PCI than a Segment:Bus:Dev.Fn Number would be
sufficient, of course. But the idea was to make it generic enough so
that it works with !PCI too.

Joerg




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Avi Kivity

On 08/22/2011 04:15 PM, Roedel, Joerg wrote:

On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
  On 08/22/2011 03:55 PM, Roedel, Joerg wrote:

Well, I don't think it's really meaningless, but we need some way to
communicate the information about device groups to userspace.

  I mean the contents of the group descriptor.  There are enough 42s in
  the kernel, it's better if we can replace a synthetic number with
  something meaningful.

If we only look at PCI than a Segment:Bus:Dev.Fn Number would be
sufficient, of course. But the idea was to make it generic enough so
that it works with !PCI too.



We could make it an arch defined string instead of a symlink.  So it 
doesn't return 42, rather something that can be used by the admin to 
figure out what the problem was.


--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Roedel, Joerg
On Mon, Aug 22, 2011 at 09:17:41AM -0400, Avi Kivity wrote:
 On 08/22/2011 04:15 PM, Roedel, Joerg wrote:
  On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote:
On 08/22/2011 03:55 PM, Roedel, Joerg wrote:
 
  Well, I don't think it's really meaningless, but we need some way to
  communicate the information about device groups to userspace.
  
I mean the contents of the group descriptor.  There are enough 42s in
the kernel, it's better if we can replace a synthetic number with
something meaningful.
 
  If we only look at PCI than a Segment:Bus:Dev.Fn Number would be
  sufficient, of course. But the idea was to make it generic enough so
  that it works with !PCI too.
 
 
 We could make it an arch defined string instead of a symlink.  So it 
 doesn't return 42, rather something that can be used by the admin to 
 figure out what the problem was.

Well, ok, it would certainly differ from the in-kernel representation
then and introduce new architecture dependencies into libvirt. But if
the 'group-string' is more meaningful to users then it's certainly good.
Suggestions?

Joerg




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Alex Williamson
On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
 On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
  We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
  capture the plan that I think we agreed to:
  
  We need to address both the description and enforcement of device
  groups.  Groups are formed any time the iommu does not have resolution
  between a set of devices.  On x86, this typically happens when a
  PCI-to-PCI bridge exists between the set of devices and the iommu.  For
  Power, partitionable endpoints define a group.  Grouping information
  needs to be exposed for both userspace and kernel internal usage.  This
  will be a sysfs attribute setup by the iommu drivers.  Perhaps:
  
  # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
  42
  
  (I use a PCI example here, but attribute should not be PCI specific)
 
 Ok.  Am I correct in thinking these group IDs are representing the
 minimum granularity, and are therefore always static, defined only by
 the connected hardware, not by configuration?

Yes, that's the idea.  An open question I have towards the configuration
side is whether we might add iommu driver specific options to the
groups.  For instance on x86 where we typically have B:D.F granularity,
should we have an option not to trust multi-function devices and use a
B:D granularity for grouping?
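
Just to illustrate the idea (this is not how any iommu driver currently
computes it), such an option could simply mask out the function bits when
forming the group number:

#include <linux/pci.h>

/* Illustrative only: derive a group id from segment:bus:devfn, optionally
 * collapsing all functions of a device into one group. */
static u32 example_group_id(struct pci_dev *pdev, bool trust_multifunction)
{
        unsigned int devfn = trust_multifunction ?
                pdev->devfn : PCI_DEVFN(PCI_SLOT(pdev->devfn), 0);

        return (pci_domain_nr(pdev->bus) << 16) |
               (pdev->bus->number << 8) | devfn;
}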

  From there we have a few options.  In the BoF we discussed a model where
  binding a device to vfio creates a /dev/vfio$GROUP character device
  file.  This group fd provides dma mapping ioctls as well as
  ioctls to enumerate and return a device fd for each attached member of
  the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
  returning an error on open() of the group fd if there are members of the
  group not bound to the vfio driver.  Each device fd would then support a
  similar set of ioctls and mapping (mmio/pio/config) interface as current
  vfio, except for the obvious domain and dma ioctls superseded by the
  group fd.
 
 It seems a slightly strange distinction that the group device appears
 when any device in the group is bound to vfio, but only becomes usable
 when all devices are bound.
 
  Another valid model might be that /dev/vfio/$GROUP is created for all
  groups when the vfio module is loaded.  The group fd would allow open()
  and some set of iommu querying and device enumeration ioctls, but would
  error on dma mapping and retrieving device fds until all of the group
  devices are bound to the vfio driver.
 
 Which is why I marginally prefer this model, although it's not a big
 deal.

Right, we can also combine models.  Binding a device to vfio
creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
device access until all the group devices are also bound.  I think
the /dev/vfio/$GROUP might help provide an enumeration interface as well
though, which could be useful.

  In either case, the uiommu interface is removed entirely since dma
  mapping is done via the group fd.  As necessary in the future, we can
  define a more high performance dma mapping interface for streaming dma
  via the group fd.  I expect we'll also include architecture specific
  group ioctls to describe features and capabilities of the iommu.  The
  group fd will need to prevent concurrent open()s to maintain a 1:1 group
  to userspace process ownership model.
 
 A 1:1 group-process correspondence seems wrong to me. But there are
 many ways you could legitimately write the userspace side of the code,
 many of them involving some sort of concurrency.  Implementing that
 concurrency as multiple processes (using explicit shared memory and/or
 other IPC mechanisms to co-ordinate) seems a valid choice that we
 shouldn't arbitrarily prohibit.
 
 Obviously, only one UID may be permitted to have the group open at a
 time, and I think that's enough to prevent them doing any worse than
 shooting themselves in the foot.

1:1 group-process is probably too strong.  Not allowing concurrent
open()s on the group file enforces a single userspace entity is
responsible for that group.  Device fds can be passed to other
processes, but only retrieved via the group fd.  I suppose we could even
branch off the dma interface into a different fd, but it seems like we
would logically want to serialize dma mappings at each iommu group
anyway.  I'm open to alternatives, this just seemed an easy way to do
it.  Restricting on UID implies that we require isolated qemu instances
to run as different UIDs.  I know that's a goal, but I don't know if we
want to make it an assumption in the group security model.

  Also on the table is supporting non-PCI devices with vfio.  To do this,
  we need to generalize the read/write/mmap and irq eventfd interfaces.
  We could keep the same model of segmenting the device fd address space,
  perhaps adding ioctls to define the segment offset bit position or we
  could split each region into its own fd 

Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Joerg Roedel
On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote:
 We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
 capture the plan that I think we agreed to:
 
 We need to address both the description and enforcement of device
 groups.  Groups are formed any time the iommu does not have resolution
 between a set of devices.  On x86, this typically happens when a
 PCI-to-PCI bridge exists between the set of devices and the iommu.  For
 Power, partitionable endpoints define a group.  Grouping information
 needs to be exposed for both userspace and kernel internal usage.  This
 will be a sysfs attribute setup by the iommu drivers.  Perhaps:
 
 # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
 42

Right, that is mainly for libvirt to provide that information to the
user in a meaningful way. So userspace is aware that other devices might
not work anymore when it assigns one to a guest.

 
 (I use a PCI example here, but attribute should not be PCI specific)
 
 From there we have a few options.  In the BoF we discussed a model where
 binding a device to vfio creates a /dev/vfio$GROUP character device
 file.  This group fd provides dma mapping ioctls as well as
 ioctls to enumerate and return a device fd for each attached member of
 the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
 returning an error on open() of the group fd if there are members of the
 group not bound to the vfio driver.  Each device fd would then support a
 similar set of ioctls and mapping (mmio/pio/config) interface as current
 vfio, except for the obvious domain and dma ioctls superseded by the
 group fd.
 
 Another valid model might be that /dev/vfio/$GROUP is created for all
 groups when the vfio module is loaded.  The group fd would allow open()
 and some set of iommu querying and device enumeration ioctls, but would
 error on dma mapping and retrieving device fds until all of the group
 devices are bound to the vfio driver.

I am in favour of /dev/vfio/$GROUP. If multiple devices should be
assigned to a guest, there can also be an ioctl to bind a group to an
address-space of another group (certainly needs some care to not allow
that both groups belong to different processes).

Btw, a problem we haven't entirely talked about yet is
driver-deassignment. User space can decide to de-assign the device from
vfio while a fd is open on it. With PCI there is no way to let this fail
(the .release function returns void last time I checked). Is this a
problem, and if yes, how do we handle that?
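
For reference, the constraint is simply that the removal hook in struct
pci_driver returns void, so a vfio-style driver can only mark the device as
gone and tear things down; a rough sketch (not actual vfio code):

#include <linux/pci.h>
#include <linux/types.h>

struct vfio_dev {
        struct pci_dev *pdev;
        bool            gone;   /* reads emulated, writes dropped from now on */
};

static void vfio_pci_remove(struct pci_dev *pdev)
{
        struct vfio_dev *vdev = pci_get_drvdata(pdev);

        /* No return value to veto the unbind with: just flag the device,
         * zap the user's direct mappings and (eventually) signal a hotplug
         * event towards the guest. */
        vdev->gone = true;
}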


Joerg




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread aafabbri



On 8/20/11 9:51 AM, Alex Williamson alex.william...@redhat.com wrote:

 We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
 capture the plan that I think we agreed to:
 
 We need to address both the description and enforcement of device
 groups.  Groups are formed any time the iommu does not have resolution
 between a set of devices.  On x86, this typically happens when a
 PCI-to-PCI bridge exists between the set of devices and the iommu.  For
 Power, partitionable endpoints define a group.  Grouping information
 needs to be exposed for both userspace and kernel internal usage.  This
 will be a sysfs attribute setup by the iommu drivers.  Perhaps:
 
 # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
 42
 
 (I use a PCI example here, but attribute should not be PCI specific)
 
 From there we have a few options.  In the BoF we discussed a model where
 binding a device to vfio creates a /dev/vfio$GROUP character device
 file.  This group fd provides dma mapping ioctls as well as
 ioctls to enumerate and return a device fd for each attached member of
 the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
 returning an error on open() of the group fd if there are members of the
 group not bound to the vfio driver.

Sounds reasonable.

 Each device fd would then support a
 similar set of ioctls and mapping (mmio/pio/config) interface as current
 vfio, except for the obvious domain and dma ioctls superseded by the
 group fd.
 
 Another valid model might be that /dev/vfio/$GROUP is created for all
 groups when the vfio module is loaded.  The group fd would allow open()
 and some set of iommu querying and device enumeration ioctls, but would
 error on dma mapping and retrieving device fds until all of the group
 devices are bound to the vfio driver.
 
 In either case, the uiommu interface is removed entirely since dma
 mapping is done via the group fd.

The loss in generality is unfortunate. I'd like to be able to support
arbitrary iommu domain - device assignment.  One way to do this would be
to keep uiommu, but to return an error if someone tries to assign more than
one uiommu context to devices in the same group.
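
A kernel-side sketch of that rule could be as small as the check below; the
structures are made up, only the idea (one uiommu context per group) is the
point:

#include <linux/errno.h>

struct uiommu_ctx;                      /* one per uiommu fd */

struct vfio_group {
        struct uiommu_ctx *uiommu;      /* context the whole group is bound to */
};

static int vfio_group_set_uiommu(struct vfio_group *group,
                                 struct uiommu_ctx *ctx)
{
        if (group->uiommu && group->uiommu != ctx)
                return -EBUSY;  /* a second context within one group: refuse */
        group->uiommu = ctx;
        return 0;
}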


-Aaron

 As necessary in the future, we can
 define a more high performance dma mapping interface for streaming dma
 via the group fd.  I expect we'll also include architecture specific
 group ioctls to describe features and capabilities of the iommu.  The
 group fd will need to prevent concurrent open()s to maintain a 1:1 group
 to userspace process ownership model.
 
 Also on the table is supporting non-PCI devices with vfio.  To do this,
 we need to generalize the read/write/mmap and irq eventfd interfaces.
 We could keep the same model of segmenting the device fd address space,
 perhaps adding ioctls to define the segment offset bit position or we
 could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
 VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
 suffering some degree of fd bloat (group fd, device fd(s), interrupt
 event fd(s), per resource fd, etc).  For interrupts we can overload
 VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
 devices support MSI?).
 
 For qemu, these changes imply we'd only support a model where we have a
 1:1 group to iommu domain.  The current vfio driver could probably
 become vfio-pci as we might end up with more target specific vfio
 drivers for non-pci.  PCI should be able to maintain a simple -device
 vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
 need to come up with extra options when we need to expose groups to
 guest for pvdma.
 
 Hope that captures it, feel free to jump in with corrections and
 suggestions.  Thanks,
 
 Alex
 



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:

  Each device fd would then support a
  similar set of ioctls and mapping (mmio/pio/config) interface as current
  vfio, except for the obvious domain and dma ioctls superseded by the
  group fd.
  
  Another valid model might be that /dev/vfio/$GROUP is created for all
  groups when the vfio module is loaded.  The group fd would allow open()
  and some set of iommu querying and device enumeration ioctls, but would
  error on dma mapping and retrieving device fds until all of the group
  devices are bound to the vfio driver.
  
  In either case, the uiommu interface is removed entirely since dma
  mapping is done via the group fd.
 
 The loss in generality is unfortunate. I'd like to be able to support
 arbitrary iommu domain - device assignment.  One way to do this would be
 to keep uiommu, but to return an error if someone tries to assign more than
 one uiommu context to devices in the same group.

I wouldn't use uiommu for that. If the HW or underlying kernel drivers
support it, what I'd suggest is that you have an (optional) ioctl to
bind two groups (you have to have both opened already) or for one group
to capture another one.

The binding means under the hood the iommus get shared, with the
lifetime being that of the owning group.

Another option is to make that static configuration APIs via special
ioctls (or even netlink if you really like it), to change the grouping
on architectures that allow it.

Cheers.
Ben.

 
 -Aaron
 
  As necessary in the future, we can
  define a more high performance dma mapping interface for streaming dma
  via the group fd.  I expect we'll also include architecture specific
  group ioctls to describe features and capabilities of the iommu.  The
  group fd will need to prevent concurrent open()s to maintain a 1:1 group
  to userspace process ownership model.
  
  Also on the table is supporting non-PCI devices with vfio.  To do this,
  we need to generalize the read/write/mmap and irq eventfd interfaces.
  We could keep the same model of segmenting the device fd address space,
  perhaps adding ioctls to define the segment offset bit position or we
  could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
  VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
  suffering some degree of fd bloat (group fd, device fd(s), interrupt
  event fd(s), per resource fd, etc).  For interrupts we can overload
  VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
  devices support MSI?).
  
  For qemu, these changes imply we'd only support a model where we have a
  1:1 group to iommu domain.  The current vfio driver could probably
  become vfio-pci as we might end up with more target specific vfio
  drivers for non-pci.  PCI should be able to maintain a simple -device
  vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
  need to come up with extra options when we need to expose groups to
  guest for pvdma.
  
  Hope that captures it, feel free to jump in with corrections and
  suggestions.  Thanks,
  
  Alex
  




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 09:30 +0300, Avi Kivity wrote:
 On 08/20/2011 07:51 PM, Alex Williamson wrote:
  We need to address both the description and enforcement of device
  groups.  Groups are formed any time the iommu does not have resolution
  between a set of devices.  On x86, this typically happens when a
  PCI-to-PCI bridge exists between the set of devices and the iommu.  For
  Power, partitionable endpoints define a group.  Grouping information
  needs to be exposed for both userspace and kernel internal usage.  This
  will be a sysfs attribute setup by the iommu drivers.  Perhaps:
 
  # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
  42
 
 
 $ readlink /sys/devices/pci0000:00/0000:00:19.0/iommu_group
 ../../../path/to/device/which/represents/the/resource/constraint
 
 (the pci-to-pci bridge on x86, or whatever node represents partitionable 
 endpoints on power)

The constraint might not necessarily be a device.

The PCI bridge is just an example. There are other possible constraints.
On POWER for example, it could be a limit in how far I can segment the
DMA address space, forcing me to arbitrarily put devices together, or it
could be a similar constraint related to how the MMIO space is broken
up.

So either that remains a path in which case we do have a separate set of
sysfs nodes representing the groups themselves which may or may not
itself contain a pointer to the constraining device, or we just make
that an arbitrary number (in my case the PE#)

Cheers,
Ben



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt
On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote:

 Yes, that's the idea.  An open question I have towards the configuration
 side is whether we might add iommu driver specific options to the
 groups.  For instance on x86 where we typically have B:D.F granularity,
 should we have an option not to trust multi-function devices and use a
 B:D granularity for grouping?

Or even B or range of busses... if you want to enforce strict isolation
you really can't trust anything below a bus level :-)

 Right, we can also combine models.  Binding a device to vfio
 creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
 device access until all the group devices are also bound.  I think
 the /dev/vfio/$GROUP might help provide an enumeration interface as well
 though, which could be useful.

Could be tho in what form ? returning sysfs paths ?

 1:1 group-process is probably too strong.  Not allowing concurrent
 open()s on the group file enforces a single userspace entity is
 responsible for that group.  Device fds can be passed to other
 processes, but only retrieved via the group fd.  I suppose we could even
 branch off the dma interface into a different fd, but it seems like we
 would logically want to serialize dma mappings at each iommu group
 anyway.  I'm open to alternatives, this just seemed an easy way to do
 it.  Restricting on UID implies that we require isolated qemu instances
 to run as different UIDs.  I know that's a goal, but I don't know if we
 want to make it an assumption in the group security model.

1:1 process has the advantage of linking to an ->mm which makes the whole
mmu notifier business doable. How do you want to track down mappings and
do the second level translation in the case of explicit map/unmap (like
on power) if you are not tied to an mm_struct ?

 Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
 to assume 1 device per guest is a typical model and that the iotlb is
 large enough that we might improve thrashing to see both a resource and
 performance benefit from it.  I'm open to suggestions for how we could
 include it though.

Sharing may or may not be possible depending on setups so yes, it's a
bit tricky.

My preference is to have a static interface (and that's actually where
your pet netlink might make some sense :-) to create synthetic groups
made of other groups if the arch allows it. But that might not be the
best approach. In another email I also proposed an option for a group to
capture another one...

  If that's
  not what you're saying, how would the domains - now made up of a
  user's selection of groups, rather than individual devices - be
  configured?
  
   Hope that captures it, feel free to jump in with corrections and
   suggestions.  Thanks,
  

Another aspect I don't see discussed is how we represent these things to
the guest.

On Power for example, I have a requirement that a given iommu domain is
represented by a single dma window property in the device-tree. What
that means is that that property needs to be either in the node of the
device itself if there's only one device in the group or in a parent
node (ie a bridge or host bridge) if there are multiple devices.

Now I do -not- want to go down the path of simulating P2P bridges,
besides we'll quickly run out of bus numbers if we go there.

For us the most simple and logical approach (which is also what pHyp
uses and what Linux handles well) is really to expose a given PCI host
bridge per group to the guest. Believe it or not, it makes things
easier :-)

Cheers,
Ben.




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt

 I am in favour of /dev/vfio/$GROUP. If multiple devices should be
 assigned to a guest, there can also be an ioctl to bind a group to an
 address-space of another group (certainly needs some care to not allow
 that both groups belong to different processes).
 
 Btw, a problem we haven't entirely talked about yet is
 driver-deassignment. User space can decide to de-assign the device from
 vfio while a fd is open on it. With PCI there is no way to let this fail
 (the .release function returns void last time I checked). Is this a
 problem, and if yes, how do we handle that?

We can treat it as a hard unplug (like a cardbus gone away).

IE. Dispose of the direct mappings (switch to MMIO emulation) and return
all ff's from reads (& ignore writes).

Then send an unplug event via whatever mechanism the platform provides
(ACPI hotplug controller on x86 for example, we haven't quite sorted out
what to do on power for hotplug yet).
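
As a trivial sketch of that MMIO emulation fallback (purely illustrative):

#include <stdint.h>
#include <string.h>

/* After a hard unplug every read returns all ones ... */
static void unplugged_read(void *buf, size_t len)
{
        memset(buf, 0xff, len);
}

/* ... and writes are silently discarded. */
static void unplugged_write(const void *buf, size_t len)
{
        (void)buf;
        (void)len;
}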

Cheers,
Ben.




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread aafabbri



On 8/22/11 1:49 PM, Benjamin Herrenschmidt b...@kernel.crashing.org
wrote:

 On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:
 
 Each device fd would then support a
 similar set of ioctls and mapping (mmio/pio/config) interface as current
 vfio, except for the obvious domain and dma ioctls superseded by the
 group fd.
 
 Another valid model might be that /dev/vfio/$GROUP is created for all
 groups when the vfio module is loaded.  The group fd would allow open()
 and some set of iommu querying and device enumeration ioctls, but would
 error on dma mapping and retrieving device fds until all of the group
 devices are bound to the vfio driver.
 
 In either case, the uiommu interface is removed entirely since dma
 mapping is done via the group fd.
 
 The loss in generality is unfortunate. I'd like to be able to support
 arbitrary iommu domain - device assignment.  One way to do this would be
 to keep uiommu, but to return an error if someone tries to assign more than
 one uiommu context to devices in the same group.
 
 I wouldn't use uiommu for that.

Any particular reason besides saving a file descriptor?

We use it today, and it seems like a cleaner API than what you propose
changing it to.

 If the HW or underlying kernel drivers
 support it, what I'd suggest is that you have an (optional) ioctl to
 bind two groups (you have to have both opened already) or for one group
 to capture another one.

You'll need other rules there too.. both opened already, but zero mappings
performed yet as they would have instantiated a default IOMMU domain.

Keep in mind the only case I'm using is singleton groups, a.k.a. devices.

Since what I want is to specify which devices can do things like share
network buffers (in a way that conserves IOMMU hw resources), it seems
cleanest to expose this explicitly, versus some inherit iommu domain from
another device ioctl.  What happens if I do something like this:

dev1_fd = open (/dev/vfio0)
dev2_fd = open (/dev/vfio1)
dev2_fd.inherit_iommu(dev1_fd)

error = close(dev1_fd)

There are other gross cases as well.

 
 The binding means under the hood the iommus get shared, with the
 lifetime being that of the owning group.

So what happens in the close() above?  EINUSE?  Reset all children?  Still
seems less clean than having an explicit iommu fd.  Without some benefit I'm
not sure why we'd want to change this API.

If we in singleton-group land were building our own groups which were sets
of devices sharing the IOMMU domains we wanted, I suppose we could do away
with uiommu fds, but it sounds like the current proposal would create 20
singleton groups (x86 iommu w/o PCI bridges = all devices are partitionable
endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
worse than the current explicit uiommu API.

Thanks,
Aaron

 
 Another option is to make that static configuration APIs via special
 ioctls (or even netlink if you really like it), to change the grouping
 on architectures that allow it.
 
 Cheers.
 Ben.
 
 
 -Aaron
 
 As necessary in the future, we can
 define a more high performance dma mapping interface for streaming dma
 via the group fd.  I expect we'll also include architecture specific
 group ioctls to describe features and capabilities of the iommu.  The
 group fd will need to prevent concurrent open()s to maintain a 1:1 group
 to userspace process ownership model.
 
 Also on the table is supporting non-PCI devices with vfio.  To do this,
 we need to generalize the read/write/mmap and irq eventfd interfaces.
 We could keep the same model of segmenting the device fd address space,
 perhaps adding ioctls to define the segment offset bit position or we
 could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
 VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
 suffering some degree of fd bloat (group fd, device fd(s), interrupt
 event fd(s), per resource fd, etc).  For interrupts we can overload
 VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
 devices support MSI?).
 
 For qemu, these changes imply we'd only support a model where we have a
 1:1 group to iommu domain.  The current vfio driver could probably
 become vfio-pci as we might end up with more target specific vfio
 drivers for non-pci.  PCI should be able to maintain a simple -device
 vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
 need to come up with extra options when we need to expose groups to
 guest for pvdma.
 
 Hope that captures it, feel free to jump in with corrections and
 suggestions.  Thanks,
 
 Alex
 
 
 



Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread Benjamin Herrenschmidt

  I wouldn't use uiommu for that.
 
 Any particular reason besides saving a file descriptor?
 
 We use it today, and it seems like a cleaner API than what you propose
 changing it to.

Well for one, we are back to square one vs. grouping constraints.

 .../...

 If we in singleton-group land were building our own groups which were sets
 of devices sharing the IOMMU domains we wanted, I suppose we could do away
 with uiommu fds, but it sounds like the current proposal would create 20
 singleton groups (x86 iommu w/o PCI bridges = all devices are partitionable
 endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
 worse than the current explicit uiommu API.

I'd rather have an API to create super-groups (groups of groups)
statically and then you can use such groups as normal groups using the
same interface. That create/management process could be done via a
simple command line utility or via sysfs banging, whatever...

Cheers,
Ben.

 Thanks,
 Aaron
 
  
  Another option is to make that static configuration APIs via special
  ioctls (or even netlink if you really like it), to change the grouping
  on architectures that allow it.
  
  Cheers.
  Ben.
  
  
  -Aaron
  
  As necessary in the future, we can
  define a more high performance dma mapping interface for streaming dma
  via the group fd.  I expect we'll also include architecture specific
  group ioctls to describe features and capabilities of the iommu.  The
  group fd will need to prevent concurrent open()s to maintain a 1:1 group
  to userspace process ownership model.
  
  Also on the table is supporting non-PCI devices with vfio.  To do this,
  we need to generalize the read/write/mmap and irq eventfd interfaces.
  We could keep the same model of segmenting the device fd address space,
  perhaps adding ioctls to define the segment offset bit position or we
  could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
  VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
  suffering some degree of fd bloat (group fd, device fd(s), interrupt
  event fd(s), per resource fd, etc).  For interrupts we can overload
  VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
  devices support MSI?).
  
  For qemu, these changes imply we'd only support a model where we have a
  1:1 group to iommu domain.  The current vfio driver could probably
  become vfio-pci as we might end up with more target specific vfio
  drivers for non-pci.  PCI should be able to maintain a simple -device
  vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
  need to come up with extra options when we need to expose groups to
  guest for pvdma.
  
  Hope that captures it, feel free to jump in with corrections and
  suggestions.  Thanks,
  
  Alex
  
  
  




Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread aafabbri



On 8/22/11 2:49 PM, Benjamin Herrenschmidt b...@kernel.crashing.org
wrote:

 
 I wouldn't use uiommu for that.
 
 Any particular reason besides saving a file descriptor?
 
 We use it today, and it seems like a cleaner API than what you propose
 changing it to.
 
 Well for one, we are back to square one vs. grouping constraints.

I'm not following you.

You have to enforce group/iommu domain assignment whether you have the
existing uiommu API, or if you change it to your proposed
ioctl(inherit_iommu) API.

The only change needed to VFIO here should be to make uiommu fd assignment
happen on the groups instead of on device fds.  That operation fails or
succeeds according to the group semantics (all-or-none assignment/same
uiommu).

I think the question is: do we force 1:1 iommu/group mapping, or do we allow
arbitrary mapping (satisfying group constraints) as we do today.

I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
ability and definitely think the uiommu approach is cleaner than the
ioctl(inherit_iommu) approach.  We considered that approach before but it
seemed less clean so we went with the explicit uiommu context.

  .../...
 
 If we in singleton-group land were building our own groups which were sets
 of devices sharing the IOMMU domains we wanted, I suppose we could do away
 with uiommu fds, but it sounds like the current proposal would create 20
 singleton groups (x86 iommu w/o PCI bridges = all devices are partitionable
 endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
 worse than the current explicit uiommu API.
 
 I'd rather have an API to create super-groups (groups of groups)
 statically and then you can use such groups as normal groups using the
 same interface. That create/management process could be done via a
 simple command line utility or via sysfs banging, whatever...






Re: kvm PCI assignment VFIO ramblings

2011-08-22 Thread David Gibson
On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote:
 On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote:
  On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
   We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
   capture the plan that I think we agreed to:
   
   We need to address both the description and enforcement of device
   groups.  Groups are formed any time the iommu does not have resolution
   between a set of devices.  On x86, this typically happens when a
   PCI-to-PCI bridge exists between the set of devices and the iommu.  For
   Power, partitionable endpoints define a group.  Grouping information
   needs to be exposed for both userspace and kernel internal usage.  This
   will be a sysfs attribute setup by the iommu drivers.  Perhaps:
   
   # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
   42
   
   (I use a PCI example here, but attribute should not be PCI specific)
  
  Ok.  Am I correct in thinking these group IDs are representing the
  minimum granularity, and are therefore always static, defined only by
  the connected hardware, not by configuration?
 
 Yes, that's the idea.  An open question I have towards the configuration
 side is whether we might add iommu driver specific options to the
 groups.  For instance on x86 where we typically have B:D.F granularity,
 should we have an option not to trust multi-function devices and use a
 B:D granularity for grouping?

Right.  And likewise I can see a place for configuration parameters
like the present 'allow_unsafe_irqs'.  But these would be more-or-less
global options which affected the overall granularity, rather than
detailed configuration such as explicitly binding some devices into a
group, yes?
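To illustrate the kind of global option being discussed, here is a small
sketch (the function name and policy flag are hypothetical, not an existing
interface) of how an iommu driver might derive a group key at B:D.F
granularity, or fall back to B:D granularity when multi-function devices are
not trusted:

#include <stdint.h>
#include <stdbool.h>

/* devfn packs the PCI slot in bits 7:3 and the function in bits 2:0. */
static uint32_t pci_group_key(uint16_t segment, uint8_t bus, uint8_t devfn,
                              bool trust_multifunction)
{
    if (!trust_multifunction)
        devfn &= 0xf8;  /* drop the function bits, group per B:D */
    return ((uint32_t)segment << 16) | ((uint32_t)bus << 8) | devfn;
}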

   From there we have a few options.  In the BoF we discussed a model where
   binding a device to vfio creates a /dev/vfio$GROUP character device
   file.  This group fd provides dma mapping ioctls as well as
   ioctls to enumerate and return a device fd for each attached member of
   the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
   returning an error on open() of the group fd if there are members of the
   group not bound to the vfio driver.  Each device fd would then support a
   similar set of ioctls and mapping (mmio/pio/config) interface as current
   vfio, except for the obvious domain and dma ioctls superseded by the
   group fd.
  
  It seems a slightly strange distinction that the group device appears
  when any device in the group is bound to vfio, but only becomes usable
  when all devices are bound.
  
   Another valid model might be that /dev/vfio/$GROUP is created for all
   groups when the vfio module is loaded.  The group fd would allow open()
   and some set of iommu querying and device enumeration ioctls, but would
   error on dma mapping and retrieving device fds until all of the group
   devices are bound to the vfio driver.
  
  Which is why I marginally prefer this model, although it's not a big
  deal.
 
 Right, we can also combine models.  Binding a device to vfio
 creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
 device access until all the group devices are also bound.  I think
 the /dev/vfio/$GROUP might help provide an enumeration interface as well
 though, which could be useful.

I'm not entirely sure what you mean here.  But, that's now several
weak votes in favour of the always-present group devices, and none in
favour of the created-when-first-device-bound model, so I suggest we
take the /dev/vfio/$GROUP as our tentative approach.

   In either case, the uiommu interface is removed entirely since dma
   mapping is done via the group fd.  As necessary in the future, we can
   define a more high performance dma mapping interface for streaming dma
   via the group fd.  I expect we'll also include architecture specific
   group ioctls to describe features and capabilities of the iommu.  The
   group fd will need to prevent concurrent open()s to maintain a 1:1 group
   to userspace process ownership model.
  
  A 1:1 group-process correspondence seems wrong to me. But there are
  many ways you could legitimately write the userspace side of the code,
  many of them involving some sort of concurrency.  Implementing that
  concurrency as multiple processes (using explicit shared memory and/or
  other IPC mechanisms to co-ordinate) seems a valid choice that we
  shouldn't arbitrarily prohibit.
  
  Obviously, only one UID may be permitted to have the group open at a
  time, and I think that's enough to prevent them doing any worse than
  shooting themselves in the foot.
 
 1:1 group-process is probably too strong.  Not allowing concurrent
 open()s on the group file enforces a single userspace entity is
 responsible for that group.  Device fds can be passed to other
 processes, but only retrieved via the group fd.  I suppose we could even
 branch off the dma interface into a different fd, but it seems like we
 

Re: kvm PCI assignment VFIO ramblings

2011-08-21 Thread David Gibson
On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote:
 We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
 capture the plan that I think we agreed to:
 
 We need to address both the description and enforcement of device
 groups.  Groups are formed any time the iommu does not have resolution
 between a set of devices.  On x86, this typically happens when a
 PCI-to-PCI bridge exists between the set of devices and the iommu.  For
 Power, partitionable endpoints define a group.  Grouping information
 needs to be exposed for both userspace and kernel internal usage.  This
 will be a sysfs attribute setup by the iommu drivers.  Perhaps:
 
 # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
 42
 
 (I use a PCI example here, but attribute should not be PCI specific)

Ok.  Am I correct in thinking these group IDs are representing the
minimum granularity, and are therefore always static, defined only by
the connected hardware, not by configuration?

 From there we have a few options.  In the BoF we discussed a model where
 binding a device to vfio creates a /dev/vfio$GROUP character device
 file.  This group fd provides dma mapping ioctls as well as
 ioctls to enumerate and return a device fd for each attached member of
 the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
 returning an error on open() of the group fd if there are members of the
 group not bound to the vfio driver.  Each device fd would then support a
 similar set of ioctls and mapping (mmio/pio/config) interface as current
 vfio, except for the obvious domain and dma ioctls superseded by the
 group fd.

It seems a slightly strange distinction that the group device appears
when any device in the group is bound to vfio, but only becomes usable
when all devices are bound.

 Another valid model might be that /dev/vfio/$GROUP is created for all
 groups when the vfio module is loaded.  The group fd would allow open()
 and some set of iommu querying and device enumeration ioctls, but would
 error on dma mapping and retrieving device fds until all of the group
 devices are bound to the vfio driver.

Which is why I marginally prefer this model, although it's not a big
deal.

 In either case, the uiommu interface is removed entirely since dma
 mapping is done via the group fd.  As necessary in the future, we can
 define a more high performance dma mapping interface for streaming dma
 via the group fd.  I expect we'll also include architecture specific
 group ioctls to describe features and capabilities of the iommu.  The
 group fd will need to prevent concurrent open()s to maintain a 1:1 group
 to userspace process ownership model.

A 1:1 group-process correspondence seems wrong to me. But there are
many ways you could legitimately write the userspace side of the code,
many of them involving some sort of concurrency.  Implementing that
concurrency as multiple processes (using explicit shared memory and/or
other IPC mechanisms to co-ordinate) seems a valid choice that we
shouldn't arbitrarily prohibit.

Obviously, only one UID may be permitted to have the group open at a
time, and I think that's enough to prevent them doing any worse than
shooting themselves in the foot.

 Also on the table is supporting non-PCI devices with vfio.  To do this,
 we need to generalize the read/write/mmap and irq eventfd interfaces.
 We could keep the same model of segmenting the device fd address space,
 perhaps adding ioctls to define the segment offset bit position or we
 could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
 VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
 suffering some degree of fd bloat (group fd, device fd(s), interrupt
 event fd(s), per resource fd, etc).  For interrupts we can overload
 VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq 

Sounds reasonable.

 (do non-PCI
 devices support MSI?).

They can.  Obviously they might not have exactly the same semantics as
PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices
whose interrupts are treated by the (also on-die) root interrupt
controller in the same way as PCI MSIs.

 For qemu, these changes imply we'd only support a model where we have a
 1:1 group to iommu domain.  The current vfio driver could probably
 become vfio-pci as we might end up with more target specific vfio
 drivers for non-pci.  PCI should be able to maintain a simple -device
 vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
 need to come up with extra options when we need to expose groups to
 guest for pvdma.

Are you saying that you'd no longer support the current x86 usage of
putting all of one guest's devices into a single domain?  If that's
not what you're saying, how would the domains - now made up of a
user's selection of groups, rather than individual devices - be
configured?

 Hope that captures it, feel free to jump in with corrections and
 suggestions.  Thanks,

-- 
David Gibson   

Re: kvm PCI assignment VFIO ramblings

2011-08-20 Thread Alex Williamson
We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
capture the plan that I think we agreed to:

We need to address both the description and enforcement of device
groups.  Groups are formed any time the iommu does not have resolution
between a set of devices.  On x86, this typically happens when a
PCI-to-PCI bridge exists between the set of devices and the iommu.  For
Power, partitionable endpoints define a group.  Grouping information
needs to be exposed for both userspace and kernel internal usage.  This
will be a sysfs attribute setup by the iommu drivers.  Perhaps:

# cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
42

(I use a PCI example here, but attribute should not be PCI specific)

From there we have a few options.  In the BoF we discussed a model where
binding a device to vfio creates a /dev/vfio$GROUP character device
file.  This group fd provides dma mapping ioctls as well as
ioctls to enumerate and return a device fd for each attached member of
the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
returning an error on open() of the group fd if there are members of the
group not bound to the vfio driver.  Each device fd would then support a
similar set of ioctls and mapping (mmio/pio/config) interface as current
vfio, except for the obvious domain and dma ioctls superseded by the
group fd.

Another valid model might be that /dev/vfio/$GROUP is created for all
groups when the vfio module is loaded.  The group fd would allow open()
and some set of iommu querying and device enumeration ioctls, but would
error on dma mapping and retrieving device fds until all of the group
devices are bound to the vfio driver.
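For illustration only, here's roughly what the flow could look like from
userspace under either model. Every path, constant and ioctl name below is
made up for the sketch, since the actual interface is exactly what's being
designed here:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Placeholder request numbers; nothing like this exists yet. */
#define VFIO_GROUP_MAP_DMA        0x100
#define VFIO_GROUP_GET_DEVICE_FD  0x101

int main(void)
{
    /* Group 42 as reported by the proposed iommu_group sysfs attribute. */
    int group_fd = open("/dev/vfio/42", O_RDWR);
    if (group_fd < 0) {
        /* Expected to fail while some group member isn't bound to vfio. */
        perror("open group");
        return 1;
    }

    /* DMA mappings would be established through the group fd ... */
    /* ioctl(group_fd, VFIO_GROUP_MAP_DMA, &mapping); */

    /* ... and device fds are only handed out via the group fd,
     * much like KVM_CREATE_VCPU returns a vcpu fd. */
    int dev_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");
    if (dev_fd < 0)
        perror("get device fd");

    return 0;
}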

In either case, the uiommu interface is removed entirely since dma
mapping is done via the group fd.  As necessary in the future, we can
define a more high performance dma mapping interface for streaming dma
via the group fd.  I expect we'll also include architecture specific
group ioctls to describe features and capabilities of the iommu.  The
group fd will need to prevent concurrent open()s to maintain a 1:1 group
to userspace process ownership model.
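As a userspace model of that rule (an assumption about the semantics, not
kernel code): the first open() claims the group and any concurrent attempt
fails until the owner releases it.

#include <errno.h>
#include <stdatomic.h>

struct group_owner {
    atomic_flag claimed;    /* initialize with ATOMIC_FLAG_INIT; set = held */
};

/* Called from the group fd's open path: first caller wins. */
static int group_claim(struct group_owner *g)
{
    if (atomic_flag_test_and_set(&g->claimed))
        return -EBUSY;      /* someone else already owns the group */
    return 0;
}

/* Called on the final release of the group fd. */
static void group_unclaim(struct group_owner *g)
{
    atomic_flag_clear(&g->claimed);
}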

Also on the table is supporting non-PCI devices with vfio.  To do this,
we need to generalize the read/write/mmap and irq eventfd interfaces.
We could keep the same model of segmenting the device fd address space,
perhaps adding ioctls to define the segment offset bit position or we
could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
suffering some degree of fd bloat (group fd, device fd(s), interrupt
event fd(s), per resource fd, etc).  For interrupts we can overload
VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
devices support MSI?).

For qemu, these changes imply we'd only support a model where we have a
1:1 group to iommu domain.  The current vfio driver could probably
become vfio-pci as we might end up with more target specific vfio
drivers for non-pci.  PCI should be able to maintain a simple -device
vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
need to come up with extra options when we need to expose groups to
guest for pvdma.

Hope that captures it, feel free to jump in with corrections and
suggestions.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-09 Thread Alex Williamson
On Mon, 2011-08-08 at 11:28 +0300, Avi Kivity wrote:
 On 08/03/2011 05:04 AM, David Gibson wrote:
  I still don't understand the distinction you're making.  We're saying
  the group is owned by a given user or guest in the sense that no-one
  else may use anything in the group (including host drivers).  At that
  point none, some or all of the devices in the group may actually be
  used by the guest.
 
  You seem to be making a distinction between "owned by" and "assigned
  to" and "used by" and I really don't see what it is.
 
 
 Alex (and I) think that we should work with device/function granularity, 
 as is common with other archs, and that the group thing is just a 
 constraint on which functions may be assigned where, while you think 
 that we should work at group granularity, with 1-function groups for 
 archs which don't have constraints.
 
 Is this an accurate way of putting it?

Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
We lose resolution of devices behind the bridge.  As you state though, I
think of this as only a constraint on what we're able to do with those
devices.

Perhaps part of the difference is that on x86 the constraints don't
really affect how we expose devices to the guest.  We need to hold
unused devices in the group hostage and use the same iommu domain for
any devices assigned, but that's not visible to the guest.  AIUI, POWER
probably needs to expose the bridge (or at least an emulated bridge) to
the guest, any devices in the group need to show up behind that bridge,
some kind of pvDMA needs to be associated with that group, there might
be MMIO segments and IOVA windows, etc.  Effectively you want to
transplant the entire group into the guest.  Is that right?  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-09 Thread Benjamin Herrenschmidt

 Mostly correct, yes.  x86 isn't immune to the group problem, it shows up
 for us any time there's a PCIe-to-PCI bridge in the device hierarchy.
 We lose resolution of devices behind the bridge.  As you state though, I
 think of this as only a constraint on what we're able to do with those
 devices.
 
 Perhaps part of the difference is that on x86 the constraints don't
 really affect how we expose devices to the guest.  We need to hold
 unused devices in the group hostage and use the same iommu domain for
 any devices assigned, but that's not visible to the guest.  AIUI, POWER
 probably needs to expose the bridge (or at least an emulated bridge) to
 the guest, any devices in the group need to show up behind that bridge,

Yes, pretty much, essentially because a group must have a shared iommu
domain and so due to the way our PV representation works, that means the
iommu DMA window is to be exposed by a bridge that covers all the
devices of that group.

 some kind of pvDMA needs to be associated with that group, there might
 be MMIO segments and IOVA windows, etc.  

The MMIO segments are mostly transparent to the guest, we just tell it
where the BARs are and it leaves them alone, at least that's how it
works under pHyp.

Currently on our qemu/vfio experiments, we do let the guest do the BAR
assignment via the emulated stuff using a hack to work around the guest
expectation that the BARs have already been set up (I can fill you in on the
details if you really care but it's not very interesting). It works
because we only ever used that on setups where we had a device == a
group, but it's nasty. But in any case, because they are going to be
always in separate pages, it's not too hard for KVM to remap them
wherever we want so MMIO is basically a non-issue.

 Effectively you want to
 transplant the entire group into the guest.  Is that right?  Thanks,

Well, at least we want to have a bridge for the group (it could and
probably should be a host bridge, ie, an entire PCI domain, that's a lot
easier than trying to mess around with virtual P2P bridges).

From there, I don't care if we need to expose explicitly each device of
that group one by one. IE. It would be a nice optimization to have the
ability to just specify the group and have qemu pick them all up but it
doesn't really matter in the grand scheme of things.

Currently, we do expose individual devices, but again, it's hacks and it
won't work on many setups etc... with horrid consequences :-) We need to
sort that before we can even think of merging that code on our side.

Cheers,
Ben.

 Alex
 




Re: kvm PCI assignment VFIO ramblings

2011-08-08 Thread David Gibson
On Fri, Aug 05, 2011 at 09:10:09AM -0600, Alex Williamson wrote:
 On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote:
  Right. In fact to try to clarify the problem for everybody, I think we
  can distinguish two different classes of constraints that can
  influence the grouping of devices:
  
   1- Hard constraints. These are typically devices using the same RID or
  where the RID cannot be reliably guaranteed (the latter is the case with
  some PCIe-PCIX bridges which will take ownership of some transactions
  such as split but not all). Devices like that must be in the same
  domain. This is where PowerPC adds to what x86 does today the concept
  that the domains are pre-existing, since we use the RID for error
  isolation & MMIO segmenting as well, so we need to create those domains
  at boot time.
  
   2- Softer constraints. Those constraints derive from the fact that not
  applying them risks enabling the guest to create side effects outside of
  its sandbox. To some extent, there can be degrees of badness between
  the various things that can cause such constraints. Examples are shared
  LSIs (since trusting DisINTx can be chancy, see earlier discussions),
  potentially any set of functions in the same device can be problematic
  due to the possibility to get backdoor access to the BARs etc...
 
 This is what I've been trying to get to, hardware constraints vs system
 policy constraints.
 
  Now, what I derive from the discussion we've had so far, is that we need
  to find a proper fix for #1, but Alex and Avi seem to prefer that #2
  remains a matter of libvirt/user doing the right thing (basically
  keeping a loaded gun aimed at the user's foot with a very very very
  sweet trigger but heh, let's not start a flamewar here :-)
 
 Doesn't your own uncertainty of whether or not to allow this lead to the
 same conclusion, that it belongs in userspace policy?  I don't think we
 want to make white lists of which devices we trust to do DisINTx
 correctly part of the kernel interface, do we?  Thanks,

Yes, but the overall point is that both the hard and soft constraints
are much easier to handle if a group or iommu domain or whatever is a
persistent entity that can be set up once-per-boot by the admin with
whatever degree of safety they want, rather than a transient entity
tied to an fd's lifetime, which must be set up correctly, every time,
by the thing establishing it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: kvm PCI assignment VFIO ramblings

2011-08-08 Thread Avi Kivity

On 08/03/2011 05:04 AM, David Gibson wrote:

I still don't understand the distinction you're making.  We're saying
the group is owned by a given user or guest in the sense that no-one
else may use anything in the group (including host drivers).  At that
point none, some or all of the devices in the group may actually be
used by the guest.

You seem to be making a distinction between "owned by" and "assigned
to" and "used by" and I really don't see what it is.



Alex (and I) think that we should work with device/function granularity, 
as is common with other archs, and that the group thing is just a 
constraint on which functions may be assigned where, while you think 
that we should work at group granularity, with 1-function groups for 
archs which don't have constraints.


Is this an accurate way of putting it?

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-05 Thread Benjamin Herrenschmidt
On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote:
 On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
  It's not clear to me how we could skip it.  With VT-d, we'd have to
  implement an emulated interrupt remapper and hope that the guest picks
  unused indexes in the host interrupt remapping table before it could do
  anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
  makes this easier?
 
 AMD IOMMU provides remapping tables per-device, and not a global one.
 But that does not make direct guest-access to the MSI-X table safe. The
 table contains the interrupt-type and the vector
 which is used as an index into the remapping table by the IOMMU. So when
 the guest writes into its MSI-X table the remapping-table in the host
 needs to be updated too.

Right, you need paravirt to avoid filtering :-)

IE the problem is two fold:

 - Getting the right value in the table / remapper so things work
(paravirt)

 - Protecting against the guest somehow managing to change the value in
the table (either directly or via a backdoor access to its own config
space).

The latter for us comes from the HW PE filtering of the MSI transactions.

Cheers,
Ben.



Re: kvm PCI assignment VFIO ramblings

2011-08-05 Thread Benjamin Herrenschmidt
On Thu, 2011-08-04 at 12:27 +0200, Joerg Roedel wrote:
 Hi Ben,
 
 thanks for your detailed introduction to the requirements for POWER. It's
 good to know that the granularity problem is not x86-only.

I'm happy to see your reply :-) I had the feeling I was a bit alone
here...

 On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
  In IBM POWER land, we call this a partitionable endpoint (the term
  endpoint here is historic, such a PE can be made of several PCIe
  endpoints). I think partitionable is a pretty good name tho to
  represent the constraints, so I'll call this a partitionable group
  from now on.
 
 On x86 this is mostly an issue of the IOMMU and which set of devices use
 the same request-id. I used to call that an alias-group because the
 devices have a request-id alias to the pci-bridge.

Right. In fact to try to clarify the problem for everybody, I think we
can distinguish two different classes of constraints that can
influence the grouping of devices:

 1- Hard constraints. These are typically devices using the same RID or
where the RID cannot be reliably guaranteed (the latter is the case with
some PCIe-PCIX bridges which will take ownership of some transactions
such as split but not all). Devices like that must be in the same
domain. This is where PowerPC adds to what x86 does today the concept
that the domains are pre-existing, since we use the RID for error
isolation & MMIO segmenting as well, so we need to create those domains
at boot time.

 2- Softer constraints. Those constraints derive from the fact that not
applying them risks enabling the guest to create side effects outside of
its sandbox. To some extent, there can be degrees of badness between
the various things that can cause such constraints. Examples are shared
LSIs (since trusting DisINTx can be chancy, see earlier discussions),
potentially any set of functions in the same device can be problematic
due to the possibility to get backdoor access to the BARs etc...

Now, what I derive from the discussion we've had so far, is that we need
to find a proper fix for #1, but Alex and Avi seem to prefer that #2
remains a matter of libvirt/user doing the right thing (basically
keeping a loaded gun aimed at the user's foot with a very very very
sweet trigger but heh, let's not start a flamewar here :-)

So let's try to find a proper solution for #1 now, and leave #2 alone
for the time being.

Maybe the right option is for x86 to move toward pre-existing domains
like powerpc does, or maybe we can just expose some kind of ID.

Because #1 is a mix of generic constraints (nasty bridges) and very
platform specific ones (whatever capacity limits in our MMIO segmenting
forced us to put two devices in the same hard domain on power), I
believe it's really something the kernel must solve, not libvirt nor
qemu user or anything else.

I am open to suggestions here. I can easily expose my PE# (it's just a
number) somewhere in sysfs, in fact I'm considering doing it in the PCI
devices sysfs directory, simply because it can/will be useful for other
things such as error reporting, so we could maybe build on that.

The crux for me is really the need for pre-existence of the iommu
domains as my PE's imply a shared iommu space.

  - The -minimum- granularity of pass-through is not always a single
  device and not always under SW control
 
 Correct.
  
  - Having a magic heuristic in libvirt to figure out those constraints is
  WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
  knowledge of PCI resource management and getting it wrong in many many
  cases, something that took years to fix essentially by ripping it all
  out. This is kernel knowledge and thus we need the kernel to expose in a
  way or another what those constraints are, what those partitionable
  groups are.
 
 I agree. Managing the ownership of a group should be done in the kernel.
 Doing this in userspace is just too dangerous.
 
 The problem to be solved here is how to present these PEs inside the
 kernel and to userspace. I thought a bit about making this visible
 through the iommu-api for in-kernel users. That is probably the most
 logical place.

Ah you started answering to my above questions :-)

We could do what you propose. It depends what we want to do with
domains. Practically speaking, we could make domains pre-existing (with
the ability to group several PEs into larger domains) or we could keep
the concepts different, possibly with the limitation that on powerpc, a
domain == a PE.

I suppose we -could- make arbitrary domains on ppc as well by making the
various PE's iommu's in HW point to the same in-memory table, but that's
a bit nasty in practice due to the way we manage those, and it would to
some extent increase the risk of a failing device/driver stomping on
another one and thus taking it down with itself. IE. isolation of errors
is an important feature for us.

So I'd rather avoid the whole domain thing for now and keep the

Re: kvm PCI assignment VFIO ramblings

2011-08-05 Thread Joerg Roedel
On Fri, Aug 05, 2011 at 08:26:11PM +1000, Benjamin Herrenschmidt wrote:
 On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote:
  On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
   It's not clear to me how we could skip it.  With VT-d, we'd have to
   implement an emulated interrupt remapper and hope that the guest picks
   unused indexes in the host interrupt remapping table before it could do
   anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
   makes this easier?
  
  AMD IOMMU provides remapping tables per-device, and not a global one.
  But that does not make direct guest-access to the MSI-X table safe. The
  table contains the interrupt-type and the vector
  which is used as an index into the remapping table by the IOMMU. So when
  the guest writes into its MSI-X table the remapping-table in the host
  needs to be updated too.
 
 Right, you need paravirt to avoid filtering :-)

Or a shadow MSI-X table like done on x86. How to handle this seems to be
platform specific. As you indicate there is a standardized paravirt
interface for that on Power.
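A very rough sketch of that shadow-table idea, with the caveat that the
structures and the remap hook below are assumptions, not an existing
implementation: guest writes land in a shadow copy that the guest reads
back, while the host writes a sanitized/remapped entry into the real table.

#include <stdint.h>

struct msix_entry {
    uint32_t addr_lo, addr_hi, data, ctrl;
};

struct shadow_msix {
    struct msix_entry *guest_view;      /* what the guest sees on reads */
    struct msix_entry *hw;              /* the real, host-owned table */
    struct msix_entry (*remap)(struct msix_entry);  /* host-side translation */
};

static void shadow_msix_write(struct shadow_msix *t, unsigned int idx,
                              struct msix_entry val)
{
    t->guest_view[idx] = val;           /* guest reads back its own values */
    t->hw[idx] = t->remap(val);         /* hardware gets the remapped entry */
}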

 IE the problem is two fold:
 
  - Getting the right value in the table / remapper so things work
 (paravirt)
 
  - Protecting against the guest somehow managing to change the value in
 the table (either directly or via a backdoor access to its own config
 space).
 
 The latter for us comes from the HW PE filtering of the MSI transactions.

Right. The second part of the problem can be avoided with
interrupt-remapping/filtering hardware in the IOMMUs.

Joerg


Re: kvm PCI assignment VFIO ramblings

2011-08-05 Thread Joerg Roedel
On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:

 Right. In fact to try to clarify the problem for everybody, I think we
 can distinguish two different classes of constraints that can
 influence the grouping of devices:
 
  1- Hard constraints. These are typically devices using the same RID or
 where the RID cannot be reliably guaranteed (the latter is the case with
 some PCIe-PCIX bridges which will take ownership of some transactions
 such as split but not all). Devices like that must be in the same
 domain. This is where PowerPC adds to what x86 does today the concept
 that the domains are pre-existing, since we use the RID for error
 isolation & MMIO segmenting as well, so we need to create those domains
 at boot time.

Domains (in the iommu-sense) are created at boot time on x86 today.
Every device needs at least a domain to provide dma-mapping
functionality to the drivers. So all the grouping is done too at
boot-time. This is specific to the iommu-drivers today but can be
generalized I think.

  2- Softer constraints. Those constraints derive from the fact that not
 applying them risks enabling the guest to create side effects outside of
 its sandbox. To some extent, there can be degrees of badness between
 the various things that can cause such constraints. Examples are shared
 LSIs (since trusting DisINTx can be chancy, see earlier discussions),
 potentially any set of functions in the same device can be problematic
 due to the possibility to get backdoor access to the BARs etc...

Hmm, there is no sane way to handle such constraints in a safe way,
right? We can either blacklist devices which are known to have such
backdoors or we just ignore the problem.

 Now, what I derive from the discussion we've had so far, is that we need
 to find a proper fix for #1, but Alex and Avi seem to prefer that #2
 remains a matter of libvirt/user doing the right thing (basically
 keeping a loaded gun aimed at the user's foot with a very very very
 sweet trigger but heh, let's not start a flamewar here :-)
 
 So let's try to find a proper solution for #1 now, and leave #2 alone
 for the time being.

Yes, and the solution for #1 should be entirely in the kernel. The
question is how to do that. Probably the most sane way is to introduce a
concept of device ownership. The ownership can either be a kernel driver
or a userspace process. Giving ownership of a device to userspace is
only possible if all devices in the same group are unbound from its
respective drivers. This is a very intrusive concept, no idea if it
has a chance of acceptance :-)
But the advantage is clearly that this allows better semantics in the
IOMMU drivers and a more stable handover of devices from host drivers to
kvm guests.

 Maybe the right option is for x86 to move toward pre-existing domains
 like powerpc does, or maybe we can just expose some kind of ID.

As I said, the domains are created at iommu driver initialization time
(usually boot time). But the groups are internal to the iommu drivers
and not visible somewhere else.

 Ah you started answering to my above questions :-)
 
 We could do what you propose. It depends what we want to do with
 domains. Practically speaking, we could make domains pre-existing (with
 the ability to group several PEs into larger domains) or we could keep
 the concepts different, possibly with the limitation that on powerpc, a
 domain == a PE.
 
 I suppose we -could- make arbitrary domains on ppc as well by making the
 various PE's iommu's in HW point to the same in-memory table, but that's
 a bit nasty in practice due to the way we manage those, and it would to
 some extent increase the risk of a failing device/driver stomping on
 another one and thus taking it down with itself. IE. isolation of errors
 is an important feature for us.

These arbitrary domains exist in the iommu-api. It would be good to
emulate them on Power too. Can't you put a PE into an isolated
error-domain when something goes wrong with it? This should provide the
same isolation as before.
What you derive the group number from is your business :-) On x86 it is
certainly the best to use the RID these devices share together with the
PCI segment number.

Regards,

Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-05 Thread Alex Williamson
On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote:
 Right. In fact to try to clarify the problem for everybody, I think we
 can distinguish two different classes of constraints that can
 influence the grouping of devices:
 
  1- Hard constraints. These are typically devices using the same RID or
 where the RID cannot be reliably guaranteed (the latter is the case with
 some PCIe-PCIX bridges which will take ownership of some transactions
 such as split but not all). Devices like that must be in the same
 domain. This is where PowerPC adds to what x86 does today the concept
 that the domains are pre-existing, since we use the RID for error
 isolation & MMIO segmenting as well, so we need to create those domains
 at boot time.
 
  2- Softer constraints. Those constraints derive from the fact that not
 applying them risks enabling the guest to create side effects outside of
 its sandbox. To some extent, there can be degrees of badness between
 the various things that can cause such constraints. Examples are shared
 LSIs (since trusting DisINTx can be chancy, see earlier discussions),
 potentially any set of functions in the same device can be problematic
 due to the possibility to get backdoor access to the BARs etc...

This is what I've been trying to get to, hardware constraints vs system
policy constraints.

 Now, what I derive from the discussion we've had so far, is that we need
 to find a proper fix for #1, but Alex and Avi seem to prefer that #2
 remains a matter of libvirt/user doing the right thing (basically
 keeping a loaded gun aimed at the user's foot with a very very very
 sweet trigger but heh, let's not start a flamewar here :-)

Doesn't your own uncertainty of whether or not to allow this lead to the
same conclusion, that it belongs in userspace policy?  I don't think we
want to make white lists of which devices we trust to do DisINTx
correctly part of the kernel interface, do we?  Thanks,

Alex




Re: kvm PCI assignment VFIO ramblings

2011-08-05 Thread Benjamin Herrenschmidt
On Fri, 2011-08-05 at 15:44 +0200, Joerg Roedel wrote:
 On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:
 
  Right. In fact to try to clarify the problem for everybody, I think we
  can distinguish two different classes of constraints that can
  influence the grouping of devices:
  
   1- Hard constraints. These are typically devices using the same RID or
  where the RID cannot be reliably guaranteed (the latter is the case with
  some PCIe-PCIX bridges which will take ownership of some transactions
  such as split but not all). Devices like that must be in the same
  domain. This is where PowerPC adds to what x86 does today the concept
  that the domains are pre-existing, since we use the RID for error
  isolation & MMIO segmenting as well, so we need to create those domains
  at boot time.
 
 Domains (in the iommu-sense) are created at boot time on x86 today.
 Every device needs at least a domain to provide dma-mapping
 functionality to the drivers. So all the grouping is done too at
 boot-time. This is specific to the iommu-drivers today but can be
 generalized I think.

Ok, let's go there then.

   2- Softer constraints. Those constraints derive from the fact that not
  applying them risks enabling the guest to create side effects outside of
  its sandbox. To some extent, there can be degrees of badness between
  the various things that can cause such constraints. Examples are shared
  LSIs (since trusting DisINTx can be chancy, see earlier discussions),
  potentially any set of functions in the same device can be problematic
  due to the possibility to get backdoor access to the BARs etc...
 
 Hmm, there is no sane way to handle such constraints in a safe way,
 right? We can either blacklist devices which are known to have such
 backdoors or we just ignore the problem.

Arguably they probably all do have such backdoors. A debug register,
JTAG register, ... My point is you don't really know unless you get a
manufacturer guarantee that there is no undocumented register somewhere,
or a way to change the microcode so that it does it, etc. The more
complex the device, the less likely you are to have such a guarantee.

The safe way is what pHyp does and basically boils down to only
allowing pass-through of entire 'slots', ie, things that are behind a
P2P bridge (virtual one typically, ie, a PCIe switch) and disallowing
pass-through with shared interrupts.

That way, even if the guest can move the BARs around, it cannot make
them overlap somebody else device because the parent bridge restricts
the portion of MMIO space that is forwarded down to that device anyway.

  Now, what I derive from the discussion we've had so far, is that we need
  to find a proper fix for #1, but Alex and Avi seem to prefer that #2
  remains a matter of libvirt/user doing the right thing (basically
  keeping a loaded gun aimed at the user's foot with a very very very
  sweet trigger but heh, let's not start a flamewar here :-)
  
  So let's try to find a proper solution for #1 now, and leave #2 alone
  for the time being.
 
 Yes, and the solution for #1 should be entirely in the kernel. The
 question is how to do that. Probably the most sane way is to introduce a
 concept of device ownership. The ownership can either be a kernel driver
 or a userspace process. Giving ownership of a device to userspace is
 only possible if all devices in the same group are unbound from its
 respective drivers. This is a very intrusive concept, no idea if it
 has a chance of acceptance :-)
 But the advantage is clearly that this allows better semantics in the
 IOMMU drivers and a more stable handover of devices from host drivers to
 kvm guests.

I tend to think along those lines too, but the ownership concept
doesn't necessarily have to be core-kernel enforced itself, it can be in
VFIO.

If we have a common API to expose the domain number, it can perfectly
be a matter of VFIO itself not allowing pass-through until it has
attached its stub driver to all the devices with that domain number, and
it can handle exclusion of iommu domains from there.

  Maybe the right option is for x86 to move toward pre-existing domains
  like powerpc does, or maybe we can just expose some kind of ID.
 
  As I said, the domains are created at iommu driver initialization time
 (usually boot time). But the groups are internal to the iommu drivers
 and not visible somewhere else.

That's what we need to fix :-)

  Ah you started answering to my above questions :-)
  
  We could do what you propose. It depends what we want to do with
  domains. Practically speaking, we could make domains pre-existing (with
  the ability to group several PEs into larger domains) or we could keep
  the concepts different, possibly with the limitation that on powerpc, a
  domain == a PE.
  
  I suppose we -could- make arbitrary domains on ppc as well by making the
  various PE's iommu's in HW point to the same in-memory table, but that's
  a bit nasty in practice due to the way we 

Re: kvm PCI assignment VFIO ramblings

2011-08-04 Thread Joerg Roedel
Hi Ben,

thanks for your detailed introduction to the requirements for POWER. It's
good to know that the granularity problem is not x86-only.

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
 In IBM POWER land, we call this a partitionable endpoint (the term
 endpoint here is historic, such a PE can be made of several PCIe
 endpoints). I think partitionable is a pretty good name tho to
 represent the constraints, so I'll call this a partitionable group
 from now on.

On x86 this is mostly an issue of the IOMMU and which set of devices use
the same request-id. I used to call that an alias-group because the
devices have a request-id alias to the pci-bridge.

 - The -minimum- granularity of pass-through is not always a single
 device and not always under SW control

Correct.
 
 - Having a magic heuristic in libvirt to figure out those constraints is
 WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
 knowledge of PCI resource management and getting it wrong in many many
 cases, something that took years to fix essentially by ripping it all
 out. This is kernel knowledge and thus we need the kernel to expose in a
 way or another what those constraints are, what those partitionable
 groups are.

I agree. Managing the ownership of a group should be done in the kernel.
Doing this in userspace is just too dangerous.

The problem to be solved here is how to present these PEs inside the
kernel and to userspace. I thought a bit about making this visible
through the iommu-api for in-kernel users. That is probably the most
logical place.

For userspace I would like to propose a new device attribute in sysfs.
This attribute contains the group number. All devices with the same
group number belong to the same PE. Libvirt needs to scan the whole
device tree to build the groups but that is probably not a big deal.
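To give an idea of what that scan could look like from userspace, here is a
sketch under the assumption that the attribute ends up named something like
iommu_group in each PCI device's sysfs directory (the name and location are
exactly what is being proposed, not an existing file):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the proposed per-device group attribute; returns -1 if absent. */
static long read_group_id(const char *devname)
{
    char path[256], buf[32];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/iommu_group", devname);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long id = fgets(buf, sizeof(buf), f) ? strtol(buf, NULL, 0) : -1;
    fclose(f);
    return id;
}

int main(void)
{
    DIR *dir = opendir("/sys/bus/pci/devices");
    struct dirent *de;

    if (!dir)
        return 1;
    /* Print "group device" pairs; a real consumer (libvirt) would build
     * a map from group id to the list of member devices instead. */
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;
        long id = read_group_id(de->d_name);
        if (id >= 0)
            printf("%ld %s\n", id, de->d_name);
    }
    closedir(dir);
    return 0;
}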


Joerg

 
 - That does -not- mean that we cannot specify for each individual device
 within such a group where we want to put it in qemu (what devfn etc...).
 As long as there is a clear understanding that the ownership of the
 device goes with the group, this is somewhat orthogonal to how they are
 represented in qemu. (Not completely... if the iommu is exposed to the
 guest ,via paravirt for example, some of these constraints must be
 exposed but I'll talk about that more later).
 
 The interface currently proposed for VFIO (and associated uiommu)
 doesn't handle that problem at all. Instead, it is entirely centered
 around a specific feature of the VTd iommu's for creating arbitrary
 domains with arbitrary devices (tho those devices -do- have the same
 constraints exposed above, don't try to put 2 legacy PCI devices behind
 the same bridge into 2 different domains !), but the API totally ignores
 the problem, leaves it to libvirt magic foo and focuses on something
 that is both quite secondary in the grand scheme of things, and quite
 x86 VTd specific in the implementation and API definition.
 
 Now, I'm not saying these programmable iommu domains aren't a nice
 feature and that we shouldn't exploit them when available, but as it is,
 it is too much a central part of the API.
 
 I'll talk a little bit more about recent POWER iommu's here to
 illustrate where I'm coming from with my idea of groups:
 
 On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
 of domain and a per-RID filtering. However it differs from VTd in a few
 ways:
 
 The domains (aka PEs) encompass more than just an iommu filtering
 scheme. The MMIO space and PIO space are also segmented, and those
 segments assigned to domains. Interrupts (well, MSI ports at least) are
 assigned to domains. Inbound PCIe error messages are targeted to
 domains, etc...
 
 Basically, the PEs provide a very strong isolation feature which
 includes errors, and has the ability to immediately isolate a PE on
 the first occurrence of an error. For example, if an inbound PCIe error
 is signaled by a device on a PE or such a device does a DMA to a
 non-authorized address, the whole PE gets into error state. All
 subsequent stores (both DMA and MMIO) are swallowed and reads return all
 1's, interrupts are blocked. This is designed to prevent any propagation
 of bad data, which is a very important feature in large high reliability
 systems.
 
 Software then has the ability to selectively turn back on MMIO and/or
 DMA, perform diagnostics, reset devices etc...
 
 Because the domains encompass more than just DMA, but also segment the
 MMIO space, it is not practical at all to dynamically reconfigure them
 at runtime to move devices into domains. The firmware or early kernel
 code (it depends) will assign devices BARs using an algorithm that keeps
 them within PE segment boundaries, etc
 
 Additionally (and this is indeed a restriction compared to VTd, though
 I expect our future IO chips to lift it to some extent), PEs don't get
 separate DMA address spaces. There is one 64-bit DMA address space per

Re: kvm PCI assignment VFIO ramblings

2011-08-04 Thread Joerg Roedel
On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
 On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
  - The -minimum- granularity of pass-through is not always a single
  device and not always under SW control
 
 But IMHO, we need to preserve the granularity of exposing a device to a
 guest as a single device.  That might mean some devices are held hostage
 by an agent on the host.

That's true. There is a difference between unassigning a group from the host
and making single devices in that PE visible to the guest. But we need
to make sure that no device in a PE is used by the host while at least
one device is assigned to a guest.

Unlike the other proposals to handle this in libvirt, I think this
belongs into the kernel. Doing this in userspace may break the entire
system if done wrong.

For example, if one device from a PE is assigned to a guest while
another one is still bound to its host driver, that driver may get very
confused when DMA just stops working. This may crash the entire system
or lead to silent data corruption in the guest. The behavior is
basically undefined then. The kernel must not allow that.
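
The check itself is cheap. Roughly (plain C model, all names invented,
"vfio-pci" standing in for whatever the stub/vfio driver ends up being
called):

    #include <string.h>

    struct device_model {
            const char *name;
            const char *driver;     /* NULL if no driver is bound */
    };

    struct group_model {
            struct device_model *devs;
            int ndevs;
    };

    /* Allow handing the group to a guest only if every device in it is
     * either unbound or bound to the vfio stub driver. */
    static int group_assignable(const struct group_model *g)
    {
            int i;

            for (i = 0; i < g->ndevs; i++) {
                    const char *drv = g->devs[i].driver;

                    if (drv && strcmp(drv, "vfio-pci") != 0)
                            return 0;   /* still in use by the host */
            }
            return 1;
    }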


Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-04 Thread Joerg Roedel
On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote:
 It's not clear to me how we could skip it.  With VT-d, we'd have to
 implement an emulated interrupt remapper and hope that the guest picks
 unused indexes in the host interrupt remapping table before it could do
 anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
 makes this easier?

AMD IOMMU provides remapping tables per-device, and not a global one.
But that does not make direct guest-access to the MSI-X table safe. The
table contains the interrupt-type and the vector
which is used as an index into the remapping table by the IOMMU. So when
the guest writes into its MSI-X table the remapping-table in the host
needs to be updated too.
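
So the flow on a trapped MSI-X table write has to be roughly like the
model below (plain C, every structure and helper here is invented for
illustration; only the fact that the vector indexes the remapping table
comes from the hardware):

    #include <stdint.h>

    struct msix_entry_model {
            uint64_t addr;          /* message address the guest wrote */
            uint32_t data;          /* message data, low bits = vector */
    };

    struct irte_model {
            uint32_t vector;
            uint32_t dest;
            int      valid;
    };

    #define IRTE_ENTRIES 256

    /* per-device remapping table on AMD IOMMU */
    static struct irte_model remap_table[IRTE_ENTRIES];

    static void msix_write_trapped(const struct msix_entry_model *guest,
                                   uint32_t host_vector, uint32_t host_dest)
    {
            uint32_t idx = guest->data & 0xff;  /* vector = table index */

            /* refresh the host remapping entry first ... */
            remap_table[idx].vector = host_vector;
            remap_table[idx].dest   = host_dest;
            remap_table[idx].valid  = 1;

            /* ... and only then program the physical MSI-X entry, with
             * values the host controls, never the raw guest values. */
    }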

Joerg



Re: kvm PCI assignment VFIO ramblings

2011-08-03 Thread David Gibson
On Tue, Aug 02, 2011 at 09:44:49PM -0600, Alex Williamson wrote:
 On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote:
  On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
   On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
 On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
  On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
 [snip]
  On x86, the USB controllers don't typically live behind a 
  PCIe-to-PCI
  bridge, so don't suffer the source identifier problem, but they do 
  often
  share an interrupt.  But even then, we can count on most modern 
  devices
  supporting PCI2.3, and thus the DisINTx feature, which allows us to
  share interrupts.  In any case, yes, it's more rare but we need to 
  know
  how to handle devices behind PCI bridges.  However I disagree that 
  we
  need to assign all the devices behind such a bridge to the guest.
  There's a difference between removing the device from the host and
  exposing the device to the guest.
 
 I think you're arguing only over details of what words to use for
 what, rather than anything of substance here.  The point is that an
 entire partitionable group must be assigned to host (in which case
 kernel drivers may bind to it) or to a particular guest partition (or
 at least to a single UID on the host).  Which of the assigned devices
 the partition actually uses is another matter of course, as is at
 exactly which level they become de-exposed if you don't want to use
 all of then.

Well first we need to define what a partitionable group is, whether it's
based on hardware requirements or user policy.  And while I agree that
we need unique ownership of a partition, I disagree that qemu is
necessarily the owner of the entire partition vs individual devices.
   
   Sorry, I didn't intend to have such circular logic.  ... I disagree
   that qemu is necessarily the owner of the entire partition vs granted
   access to devices within the partition.  Thanks,
  
  I still don't understand the distinction you're making.  We're saying
  the group is owned by a given user or guest in the sense that no-one
  else may use anything in the group (including host drivers).  At that
  point none, some or all of the devices in the group may actually be
  used by the guest.
  
  You seem to be making a distinction between owned by and assigned
  to and used by and I really don't see what it is.
 
 How does a qemu instance that uses none of the devices in a group still
 own that group?

?? In the same way that you still own a file you don't have open..?

  Aren't we at that point free to move the group to a
 different qemu instance or return ownership to the host?

Of course.  But until you actually do that, the group is still
notionally owned by the guest.

  Who does that?

The admin.  Possibly by poking sysfs, or possibly by frobbing some
character device, or maybe something else.  Naturally libvirt or
whatever could also do this.

 In my mental model, there's an intermediary that owns the group and
 just as kernel drivers bind to devices when the host owns the group,
 qemu is a userspace device driver that binds to sets of devices when the
 intermediary owns it.  Obviously I'm thinking libvirt, but it doesn't
 have to be.  Thanks,

Well sure, but I really don't see how such an intermediary fits into
the kernel's model of ownership.

So, first, take a step back and look at what sort of entities can
own a group (or device or whatever).  I notice that when I've said
owned by the guest you seem to have read this as owned by qemu
which is not necessarily the same thing.

What I had in mind is that each group is either owned by host, in
which case host kernel drivers can bind to it, or it's in guest mode
in which case it has a user, group and mode and can be bound by user
drivers (and therefore guests) with the right permission.  From the
kernel's perspective there is therefore no distinction between owned
by qemu and owned by libvirt.
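
In code terms the per-group state I'm imagining is no more than this
(sketch only, field names invented):

    #include <sys/types.h>

    enum group_mode {
            GROUP_HOST,     /* host kernel drivers may bind            */
            GROUP_GUEST,    /* user drivers (and thus guests) may bind */
    };

    struct group_state {
            int             id;
            enum group_mode mode;
            /* only meaningful in GROUP_GUEST mode: */
            uid_t           uid;
            gid_t           gid;
            mode_t          perm;
    };

Whether that state gets flipped by a chown/chmod on a sysfs directory or
by writing to a magic file is a detail; the kernel only cares about the
mode and, in guest mode, who is allowed to bind a user driver.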


-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Avi Kivity

On 08/01/2011 11:27 PM, Alex Williamson wrote:

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
  On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
Due to our paravirt nature, we don't need to masquerade the MSI-X table
for example. At all. If the guest configures crap into it, too bad, it
can only shoot itself in the foot since the host bridge enforces
validation anyways as I explained earlier. Because it's all paravirt, we
don't need to translate the interrupt vectors & addresses, the guest
will call hyercalls to configure things anyways.

  So, you have interrupt redirection?  That is, MSI-x table values encode
  the vcpu, not pcpu?

  Alex, with interrupt redirection, we can skip this as well?  Perhaps
  only if the guest enables interrupt redirection?

It's not clear to me how we could skip it.  With VT-d, we'd have to
implement an emulated interrupt remapper and hope that the guest picks
unused indexes in the host interrupt remapping table before it could do
anything useful with direct access to the MSI-X table.


Yeah.  We need the interrupt remapping hardware to indirect based on the 
source of the message, not just the address and data.



Maybe AMD IOMMU
makes this easier?  Thanks,



No idea.

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Avi Kivity

On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:


  I have a feeling you'll be getting the same capabilities sooner or
  later, or you won't be able to make use of S/R IOV VFs.

I'm not sure what you mean. We can do SR/IOV just fine (well, with some
limitations due to constraints with how our MMIO segmenting works and
indeed some of those are being lifted in our future chipsets but
overall, it works).


Don't those limitations include all VFs must be assigned to the same 
guest?


PCI on x86 has function granularity, SRIOV reduces this to VF 
granularity, but I thought power has partition or group granularity 
which is much coarser?



In -theory-, one could do the grouping dynamically with some kind of API
for us as well. However the constraints are such that it's not
practical. Filtering on RID is based on number of bits to match in the
bus number and whether to match the dev and fn. So it's not arbitrary
(but works fine for SR-IOV).

The MMIO segmentation is a bit special too. There is a single MMIO
region in 32-bit space (size is configurable but that's not very
practical so for now we stick it to 1G) which is evenly divided into N
segments (where N is the number of PE# supported by the host bridge,
typically 128 with the current bridges).

Each segment goes through a remapping table to select the actual PE# (so
large BARs use consecutive segments mapped to the same PE#).
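
With the numbers above (1G of PCI-side MMIO at 3G, split into 128
segments) the lookup is nothing more than this; purely illustrative:

    #include <stdint.h>

    #define M32_BASE  0xC0000000ULL           /* 3G, PCI side   */
    #define M32_SIZE  0x40000000ULL           /* 1G             */
    #define NUM_SEGS  128
    #define SEG_SIZE  (M32_SIZE / NUM_SEGS)   /* 8M per segment */

    static uint8_t seg_to_pe[NUM_SEGS];       /* remapping table */

    static int pci_addr_to_pe(uint64_t pci_addr)
    {
            if (pci_addr < M32_BASE || pci_addr >= M32_BASE + M32_SIZE)
                    return -1;                /* outside the M32 window */
            return seg_to_pe[(pci_addr - M32_BASE) / SEG_SIZE];
    }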

For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
regions which act as some kind of accordions, they are evenly divided
into segments in different PE# and there's several of them which we can
move around and typically use to map VF BARs.


So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
technical details with no ppc background to put them to, I can't say I'm 
making any sense of this.



  
VFIO here is basically designed for one and only one thing: expose the
entire guest physical address space to the device more/less 1:1.

  A single level iommu cannot be exposed to guests.  Well, it can be
  exposed as an iommu that does not provide per-device mapping.

Well, x86 ones can't maybe but on POWER we can and must thanks to our
essentially paravirt model :-) Even if it wasn't and we used trapping
of accesses to the table, it would work because in practice, even with
filtering, what we end up having is a per-device (or rather per-PE#
table).

  A two level iommu can be emulated and exposed to the guest.  See
  http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

What you mean by 2-level is two passes through two trees (ie 6 or 8 levels
right ?).


(16 or 25)


We don't have that and probably never will. But again, because
we have a paravirt interface to the iommu, it's less of an issue.


Well, then, I guess we need an additional interface to expose that to 
the guest.



This means:
  
   - It only works with iommu's that provide complete DMA address spaces
to devices. Won't work with a single 'segmented' address space like we
have on POWER.
  
   - It requires the guest to be pinned. Pass-through -> no more swap

  Newer iommus (and devices, unfortunately) (will) support I/O page faults
  and then the requirement can be removed.

No. -Some- newer devices will. Out of these, a bunch will have so many
bugs that they're not usable. Some never will. It's a mess really and I
wouldn't design my stuff based on those premises just yet. Making it
possible to support it for sure, having it in mind, but not making it
the foundation on which the whole API is designed.


The API is not designed around pinning.  It's a side effect of how the 
IOMMU works.  If your IOMMU only maps pages which are under active DMA, 
then it would only pin those pages.


But I see what you mean, the API is designed around up-front 
specification of all guest memory.
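
Concretely that up-front specification amounts to one big map call per
RAM slot on the qemu side, something like the model below (the ioctl
number and structure here are placeholders, not the real VFIO ABI):

    #include <stdint.h>
    #include <sys/ioctl.h>

    struct dma_map_req {            /* placeholder for the real struct */
            uint64_t vaddr;         /* qemu userspace address */
            uint64_t iova;          /* guest physical address */
            uint64_t size;
    };

    #define PSEUDO_DMA_MAP  _IOW(';', 100, struct dma_map_req)

    struct ram_slot { uint64_t gpa; void *hva; uint64_t size; };

    static int map_all_guest_ram(int fd, struct ram_slot *slots, int n)
    {
            int i;

            for (i = 0; i < n; i++) {
                    struct dma_map_req req = {
                            .vaddr = (uintptr_t)slots[i].hva,
                            .iova  = slots[i].gpa,
                            .size  = slots[i].size,
                    };
                    if (ioctl(fd, PSEUDO_DMA_MAP, &req) < 0)
                            return -1;  /* the pages get pinned here */
            }
            return 0;
    }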



   - It doesn't work for POWER server anyways because of our need to
provide a paravirt iommu interface to the guest since that's how pHyp
works today and how existing OSes expect to operate.

  Then you need to provide that same interface, and implement it using the
  real iommu.

Yes. Working on it. It's not very practical due to how VFIO interacts in
terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
almost entirely real-mode for performance reasons.


The original kvm device assignment code was (and is) part of kvm 
itself.  We're trying to move to vfio to allow sharing with non-kvm 
users, but it does reduce flexibility.  We can have an internal vfio-kvm 
interface to update mappings in real time.



- Performance sucks of course, the vfio map ioctl wasn't mean for that
and has quite a bit of overhead. However we'll want to do the paravirt
call directly in the kernel eventually ...

  Does the guest iomap each request?  Why?

Not sure what you mean... the guest calls h-calls for every iommu page
mapping/unmapping, yes. So the performance of these is critical. So 

Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread David Gibson
On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
 On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
[snip]
 On x86, the USB controllers don't typically live behind a PCIe-to-PCI
 bridge, so don't suffer the source identifier problem, but they do often
 share an interrupt.  But even then, we can count on most modern devices
 supporting PCI2.3, and thus the DisINTx feature, which allows us to
 share interrupts.  In any case, yes, it's more rare but we need to know
 how to handle devices behind PCI bridges.  However I disagree that we
 need to assign all the devices behind such a bridge to the guest.
 There's a difference between removing the device from the host and
 exposing the device to the guest.

I think you're arguing only over details of what words to use for
what, rather than anything of substance here.  The point is that an
entire partitionable group must be assigned to host (in which case
kernel drivers may bind to it) or to a particular guest partition (or
at least to a single UID on the host).  Which of the assigned devices
the partition actually uses is another matter of course, as is at
exactly which level they become de-exposed if you don't want to use
all of them.

[snip]
  Maybe something like /sys/devgroups ? This probably warrants involving
  more kernel people into the discussion.
 
 I don't yet buy into passing groups to qemu since I don't buy into the
 idea of always exposing all of those devices to qemu.  Would it be
 sufficient to expose iommu nodes in sysfs that link to the devices
 behind them and describe properties and capabilities of the iommu
 itself?  More on this at the end.

Again, I don't think you're making a distinction of any substance.
Ben is saying the group as a whole must be set to allow partition
access, whether or not you call that assigning.  There's no reason
that passing a sysfs descriptor to qemu couldn't be the qemu
developer's quick-and-dirty method of putting the devices in, while
also allowing full assignment of the devices within the groups by
libvirt.

[snip]
  Now some of this can be fixed with tweaks, and we've started doing it
  (we have a working pass-through using VFIO, forgot to mention that, it's
  just that we don't like what we had to do to get there).
 
 This is a result of wanting to support *unmodified* x86 guests.  We
 don't have the luxury of having a predefined pvDMA spec that all x86
 OSes adhere to.  The 32bit problem is unfortunate, but the priority use
 case for assigning devices to guests is high performance I/O, which
 usually entails modern, 64bit hardware.  I'd like to see us get to the
 point of having emulated IOMMU hardware on x86, which could then be
 backed by VFIO, but for now guest pinning is the most practical and
 useful.

No-one's suggesting that this isn't a valid mode of operation.  It's
just that right now conditionally disabling it for us is fairly ugly
because of the way the qemu code is structured.

[snip]
  The above means we need arch specific APIs. So arch specific vfio
  ioctl's, either that or kvm ones going to vfio or something ... the
  current structure of vfio/kvm interaction doesn't make it easy.
 
 FYI, we also have large page support for x86 VT-d, but it seems to only
 be opportunistic right now.  I'll try to come back to the rest of this
 below.

Incidentally there seems to be a hugepage leak bug in the current
kernel code (which I haven't had a chance to track down yet).  Our
qemu code currently has bugs (working on it..) which means it has
unbalanced maps and unmaps of the pages.  But when qemu quits they
should all be released but somehow they're not.

[snip]
   - I don't like too much the fact that VFIO provides yet another
  different API to do what we already have at least 2 kernel APIs for, ie,
  BAR mapping and config space access. At least it should be better at
  using the backend infrastructure of the 2 others (sysfs & procfs). I
  understand it wants to filter in some case (config space) and -maybe-
  yet another API is the right way to go but allow me to have my doubts.
 
 The use of PCI sysfs is actually one of my complaints about current
 device assignment.  To do assignment with an unprivileged guest we need
 to open the PCI sysfs config file for it, then change ownership on a
 handful of other PCI sysfs files, then there's this other pci-stub thing
 to maintain ownership, but the kvm ioctls don't actually require it and
 can grab onto any free device...  We are duplicating some of that in
 VFIO, but we also put the ownership of the device behind a single device
 file.  We do have the uiommu problem that we can't give an unprivileged
 user ownership of that, but your usage model may actually make that
 easier.  More below...

Hrm.  I was assuming that a sysfs groups interface would provide a
single place to set the ownership of the whole group.  Whether that's
echoing a uid to a magic file or doing a chown on the directory or
whatever is a matter of 

Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Benjamin Herrenschmidt
On Tue, 2011-08-02 at 12:12 +0300, Avi Kivity wrote:
 On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
  
I have a feeling you'll be getting the same capabilities sooner or
later, or you won't be able to make use of S/R IOV VFs.
 
  I'm not sure why you mean. We can do SR/IOV just fine (well, with some
  limitations due to constraints with how our MMIO segmenting works and
  indeed some of those are being lifted in our future chipsets but
  overall, it works).
 
 Don't those limitations include all VFs must be assigned to the same 
 guest?

No, not at all. We put them in different PE# and because the HW is
SR-IOV we know we can trust it to the extent that it won't have nasty
hidden side effects between them. We have 64-bit windows for MMIO that
are also segmented and that we can resize to map over the VF BAR
region, the limitations are more about the allowed sizes, number of
segments supported etc...  for these things which can cause us to play
interesting games with the system page size setting to find a good
match.

 PCI on x86 has function granularity, SRIOV reduces this to VF 
 granularity, but I thought power has partition or group granularity 
 which is much coarser?

The granularity of a Group really depends on what the HW is like. On
pure PCIe SR-IOV we can go down to function granularity.

In fact I currently go down to function granularity on anything pure
PCIe as well, though as I explained earlier, that's a bit chancy since
some adapters -will- allow to create side effects such as side band
access to config space.

pHyp doesn't allow that granularity as far as I can tell, one slot is
always fully assigned to a PE.

However, we might have resource constraints as in reaching max number of
segments or iommu regions that may force us to group a bit more coarsely
under some circumstances.

The main point is that the grouping is pre-existing, so an API designed
around the idea of: 1- create domain, 2- add random devices to it, 3-
use it, won't work for us very well :-)

Since the grouping implies the sharing of iommu's, from a VFIO point of
view it really matches well with the idea of having the domains
pre-existing.

That's why I think a good fit is to have a static representation of the
grouping, with tools allowing to create/manipulate the groups (or
domains) for archs that allow this sort of manipulations, separately
from qemu/libvirt, avoiding those on the fly groups whose lifetime is
tied to an instance of a file descriptor.

  In -theory-, one could do the grouping dynamically with some kind of API
  for us as well. However the constraints are such that it's not
  practical. Filtering on RID is based on number of bits to match in the
  bus number and whether to match the dev and fn. So it's not arbitrary
  (but works fine for SR-IOV).
 
  The MMIO segmentation is a bit special too. There is a single MMIO
  region in 32-bit space (size is configurable but that's not very
  practical so for now we stick it to 1G) which is evenly divided into N
  segments (where N is the number of PE# supported by the host bridge,
  typically 128 with the current bridges).
 
  Each segment goes through a remapping table to select the actual PE# (so
  large BARs use consecutive segments mapped to the same PE#).
 
  For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
  regions which act as some kind of accordions, they are evenly divided
  into segments in different PE# and there's several of them which we can
  move around and typically use to map VF BARs.
 
 So, SRIOV VFs *don't* have the group limitation?  Sorry, I'm deluged by 
 technical details with no ppc background to put them to, I can't say I'm 
 making any sense of this.

:-)

Don't worry, it took me a while to get my head around the HW :-) SR-IOV
VFs will generally not have limitations like that no, but on the other
hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
take a bunch of VFs and put them in the same 'domain'.

I think the main deal is that VFIO/qemu sees domains as guests and
tries to put all devices for a given guest into a domain.

On POWER, we have a different view of things where domains/groups are
defined to be the smallest granularity we can (down to a single VF) and
we give several groups to a guest (ie we avoid sharing the iommu in most
cases)

This is driven by the HW design but that design is itself driven by the
idea that the domains/groups are also error isolation groups and we don't
want to take all of the IOs of a guest down if one adapter in that guest
is having an error.

The x86 domains are conceptually different as they are about sharing the
iommu page tables with the clear long term intent of then sharing those
page tables with the guest CPU's own. We aren't going in that direction
(at this point at least) on POWER..

  VFIO here is basically designed for one and only one thing: expose the
  entire guest physical address space to the device more/less 1:1.
  
A 

Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Avi Kivity

On 08/02/2011 03:58 PM, Benjamin Herrenschmidt wrote:

  
What you mean 2-level is two passes through two trees (ie 6 or 8 levels
right ?).

  (16 or 25)

25 levels ? You mean 25 loads to get to a translation ? And you get any
kind of performance out of that ? :-)



Aggressive partial translation caching.  Even then, performance does 
suffer on memory intensive workloads.  The fix was transparent 
hugepages; that makes the page table walks much faster since they're 
fully cached, the partial translation caches become more effective, and 
the tlb itself becomes more effective.  On some workloads, THP on both 
guest and host was faster than no-THP on bare metal.



  
Not sure what you mean... the guest calls h-calls for every iommu page
mapping/unmapping, yes. So the performance of these is critical. So yes,
we'll eventually do it in kernel. We just haven't yet.

  I see.  x86 traditionally doesn't do it for every request.  We had some
  proposals to do a pviommu that does map every request, but none reached
  maturity.

It's quite performance critical, you don't want to go anywhere near a
full exit. On POWER we plan to handle that in real mode (ie MMU off)
straight off the interrupt handlers, with the CPU still basically
operating in guest context with HV permission. That is basically do the
permission check, translation and whack the HW iommu immediately. If for
some reason one step fails (!present PTE or something like that), we'd
then fallback to an exit to Linux to handle it in a more common
environment where we can handle page faults etc...
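
The fast path is basically the model below (plain C sketch, every helper
is invented; only the H_PUT_TCE arguments, liobn/ioba/tce, are as PAPR
defines them, and the liobn-to-table lookup is omitted):

    #include <stdint.h>

    #define H_SUCCESS    0
    #define H_TOO_HARD  (-1)    /* placeholder: punt to the slow path */
    #define TCE_SHIFT    12     /* 4K IOMMU pages */

    struct tce_table_model {
            uint64_t *entries;  /* HW TCE table, one per IOMMU page */
            uint64_t  nentries;
    };

    /* Meant to run straight off the interrupt with the MMU off:
     * validate, translate guest-physical to host-physical, write the
     * HW TCE.  Anything we can't handle here (!present PTE, etc...)
     * returns H_TOO_HARD and gets retried from the normal Linux
     * environment. */
    static long h_put_tce_fast(struct tce_table_model *tbl,
                               uint64_t ioba, uint64_t tce,
                               uint64_t (*gpa_to_hpa)(uint64_t gpa))
    {
            uint64_t idx = ioba >> TCE_SHIFT;
            uint64_t hpa;

            if (idx >= tbl->nentries)
                    return H_TOO_HARD;

            hpa = gpa_to_hpa(tce & ~((1ULL << TCE_SHIFT) - 1));
            if (!hpa)
                    return H_TOO_HARD;      /* e.g. page not present */

            tbl->entries[idx] = hpa | (tce & 0x3);  /* keep R/W bits */
            return H_SUCCESS;
    }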


I guess we can hack some kind of private interface, though I'd hoped to 
avoid it (and so far we succeeded - we can even get vfio to inject 
interrupts into kvm from the kernel without either knowing anything 
about the other).



   Does the BAR value contain the segment base address?  Or is that added
   later?
  
It's a shared address space. With a basic configuration on p7ioc for
example we have MMIO going from 3G to 4G (PCI side addresses). BARs
contain the normal PCI address there. But that 1G is divided in 128
segments of equal size which can separately be assigned to PE#'s.
  
So BARs are allocated by firmware or the kernel PCI code so that devices
in different PEs don't share segments.

  Okay, and config space virtualization ensures that the guest can't remap?

Well, so it depends :-)

With KVM we currently use whatever config space virtualization you do
and so we somewhat rely on this but it's not very fool proof.

I believe pHyp doesn't even bother filtering config space. As I said in
another note, you can't trust adapters anyway. Plenty of them (video
cards come to mind) have ways to get to their own config space via MMIO
registers for example.


Yes, we've seen that.


So what pHyp does is that it always creates PE's (aka groups) that are
below a bridge. With PCIe, everything mostly is below a bridge so that's
easy, but that does mean that you always have all functions of a device
in the same PE (and thus in the same partition). SR-IOV is an exception
to this rule since in that case the HW is designed to be trusted.

That way, being behind a bridge, the bridge windows are going to define
what can be forwarded to the device, and thus the system is immune to
the guest putting crap into the BARs. It can't be remapped to overlap a
neighbouring device.

Note that the bridge itself isn't visible to the guest, so yes, config
space is -somewhat- virtualized, typically pHyp make every pass-through
PE look like a separate PCI host bridge with the devices below it.


I think I see, yes.

--
error compiling committee.c: too many arguments to function



Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Alex Williamson
On Tue, 2011-08-02 at 11:27 +1000, Benjamin Herrenschmidt wrote:
 It's a shared address space. With a basic configuration on p7ioc for
 example we have MMIO going from 3G to 4G (PCI side addresses). BARs
 contain the normal PCI address there. But that 1G is divided in 128
 segments of equal size which can separately be assigned to PE#'s.
 
 So BARs are allocated by firmware or the kernel PCI code so that devices
 in different PEs don't share segments.
 
 Of course there's always the risk that a device can be hacked via a
 sideband access to BARs to move out of its allocated segment. That
 means that the guest owning that device won't be able to access it
 anymore and can potentially disturb a guest or host owning whatever is
 in that other segment.

Wait, what?  I thought the MMIO segments were specifically so that if
the device BARs moved out of the segment the guest only hurts itself and
not the new segments overlapped.

 The only way to enforce isolation here is to ensure that PE# are
 entirely behind P2P bridges, since those would then ensure that even if
 you put crap into your BARs you won't be able to walk over a neighbour.

Ok, so the MMIO segments are really just a configuration nuance of the
platform and being behind a P2P bridge is what allows you to hand off
BARs to a guest (which needs to know the bridge window to do anything
useful with them).  Is that right?

 I believe pHyp enforces that, for example, if you have a slot, all
 devices & functions behind that slot pertain to the same PE# under pHyp.
 
 That means you cannot put individual functions of a device into
 different PE# with pHyp.
 
 We plan to be a bit less restrictive here for KVM, assuming that if you
 use a device that allows such a back-channel to the BARs, then it's your
 problem to not trust such a device for virtualization. And most of the
 time, you -will- have a P2P to protect you anyways.
 
 The problem doesn't exist (or is assumed as non-existing) for SR-IOV
 since in that case, the VFs are meant to be virtualized, so pHyp assumes
 there is no such back-channel and it can trust them to be in different
 PE#.

But you still need the P2P bridge to protect MMIO segments?  Or do
SR-IOV BARs need to be virtualized?  I'm having trouble with the mental
model of how you can do both.  Thanks,

Alex




Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Alex Williamson
On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
 
 Don't worry, it took me a while to get my head around the HW :-) SR-IOV
 VFs will generally not have limitations like that no, but on the other
 hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
 take a bunch of VFs and put them in the same 'domain'.
 
 I think the main deal is that VFIO/qemu sees domains as guests and
 tries to put all devices for a given guest into a domain.

Actually, that's only a recent optimization, before that each device got
its own iommu domain.  It's actually completely configurable on the
qemu command line which devices get their own iommu and which share.
The default optimizes the number of domains (one) and thus the number of
mapping callbacks since we pin the entire guest.

 On POWER, we have a different view of things were domains/groups are
 defined to be the smallest granularity we can (down to a single VF) and
 we give several groups to a guest (ie we avoid sharing the iommu in most
 cases)
 
 This is driven by the HW design but that design is itself driven by the
 idea that the domains/group are also error isolation groups and we don't
 want to take all of the IOs of a guest down if one adapter in that guest
 is having an error.
 
 The x86 domains are conceptually different as they are about sharing the
 iommu page tables with the clear long term intent of then sharing those
 page tables with the guest CPU own. We aren't going in that direction
 (at this point at least) on POWER..

Yes and no.  The x86 domains are pretty flexible and used a few
different ways.  On the host we do dynamic DMA with a domain per device,
mapping only the inflight DMA ranges.  In order to achieve the
transparent device assignment model, we have to flip that around and map
the entire guest.  As noted, we can continue to use separate domains for
this, but since each maps the entire guest, it doesn't add a lot of
value and uses more resources and requires more mapping callbacks (and
x86 doesn't have the best error containment anyway).  If we had a well
supported IOMMU model that we could adapt for pvDMA, then it would make
sense to keep each device in its own domain again.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Alex Williamson
On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
 On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
  On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
 [snip]
  On x86, the USB controllers don't typically live behind a PCIe-to-PCI
  bridge, so don't suffer the source identifier problem, but they do often
  share an interrupt.  But even then, we can count on most modern devices
  supporting PCI2.3, and thus the DisINTx feature, which allows us to
  share interrupts.  In any case, yes, it's more rare but we need to know
  how to handle devices behind PCI bridges.  However I disagree that we
  need to assign all the devices behind such a bridge to the guest.
  There's a difference between removing the device from the host and
  exposing the device to the guest.
 
 I think you're arguing only over details of what words to use for
 what, rather than anything of substance here.  The point is that an
 entire partitionable group must be assigned to host (in which case
 kernel drivers may bind to it) or to a particular guest partition (or
 at least to a single UID on the host).  Which of the assigned devices
 the partition actually uses is another matter of course, as is at
 exactly which level they become de-exposed if you don't want to use
 all of then.

Well first we need to define what a partitionable group is, whether it's
based on hardware requirements or user policy.  And while I agree that
we need unique ownership of a partition, I disagree that qemu is
necessarily the owner of the entire partition vs individual devices.
But feel free to dismiss it as unsubstantial.

 [snip]
   Maybe something like /sys/devgroups ? This probably warrants involving
   more kernel people into the discussion.
  
  I don't yet buy into passing groups to qemu since I don't buy into the
  idea of always exposing all of those devices to qemu.  Would it be
  sufficient to expose iommu nodes in sysfs that link to the devices
  behind them and describe properties and capabilities of the iommu
  itself?  More on this at the end.
 
 Again, I don't think you're making a distinction of any substance.
 Ben is saying the group as a whole must be set to allow partition
 access, whether or not you call that assigning.  There's no reason
 that passing a sysfs descriptor to qemu couldn't be the qemu
 developer's quick-and-dirty method of putting the devices in, while
 also allowing full assignment of the devices within the groups by
 libvirt.

Well, there is a reason for not passing a sysfs descriptor to qemu if
qemu isn't the one defining the policy about how the members of that
group are exposed.  I tend to envision a userspace entity defining
policy and granting devices to qemu.  Do we really want separate
developer vs production interfaces?

 [snip]
   Now some of this can be fixed with tweaks, and we've started doing it
   (we have a working pass-through using VFIO, forgot to mention that, it's
   just that we don't like what we had to do to get there).
  
  This is a result of wanting to support *unmodified* x86 guests.  We
  don't have the luxury of having a predefined pvDMA spec that all x86
  OSes adhere to.  The 32bit problem is unfortunate, but the priority use
  case for assigning devices to guests is high performance I/O, which
  usually entails modern, 64bit hardware.  I'd like to see us get to the
  point of having emulated IOMMU hardware on x86, which could then be
  backed by VFIO, but for now guest pinning is the most practical and
  useful.
 
 No-one's suggesting that this isn't a valid mode of operation.  It's
 just that right now conditionally disabling it for us is fairly ugly
 because of the way the qemu code is structured.

It really shouldn't be any more than skipping the
cpu_register_phys_memory_client() and calling the map/unmap routines
elsewhere.

 [snip]
- I don't like too much the fact that VFIO provides yet another
   different API to do what we already have at least 2 kernel APIs for, ie,
   BAR mapping and config space access. At least it should be better at
   using the backend infrastructure of the 2 others (sysfs & procfs). I
   understand it wants to filter in some case (config space) and -maybe-
   yet another API is the right way to go but allow me to have my doubts.
  
  The use of PCI sysfs is actually one of my complaints about current
  device assignment.  To do assignment with an unprivileged guest we need
  to open the PCI sysfs config file for it, then change ownership on a
  handful of other PCI sysfs files, then there's this other pci-stub thing
  to maintain ownership, but the kvm ioctls don't actually require it and
  can grab onto any free device...  We are duplicating some of that in
  VFIO, but we also put the ownership of the device behind a single device
  file.  We do have the uiommu problem that we can't give an unprivileged
  user ownership of that, but your usage model may actually make that
  easier.  More below...
 
 Hrm.  I was assuming that 

Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Alex Williamson
On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
 On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
  On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
   On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
  [snip]
   On x86, the USB controllers don't typically live behind a PCIe-to-PCI
   bridge, so don't suffer the source identifier problem, but they do often
   share an interrupt.  But even then, we can count on most modern devices
   supporting PCI2.3, and thus the DisINTx feature, which allows us to
   share interrupts.  In any case, yes, it's more rare but we need to know
   how to handle devices behind PCI bridges.  However I disagree that we
   need to assign all the devices behind such a bridge to the guest.
   There's a difference between removing the device from the host and
   exposing the device to the guest.
  
  I think you're arguing only over details of what words to use for
  what, rather than anything of substance here.  The point is that an
  entire partitionable group must be assigned to host (in which case
  kernel drivers may bind to it) or to a particular guest partition (or
  at least to a single UID on the host).  Which of the assigned devices
  the partition actually uses is another matter of course, as is at
  exactly which level they become de-exposed if you don't want to use
  all of then.
 
 Well first we need to define what a partitionable group is, whether it's
 based on hardware requirements or user policy.  And while I agree that
 we need unique ownership of a partition, I disagree that qemu is
 necessarily the owner of the entire partition vs individual devices.

Sorry, I didn't intend to have such circular logic.  ... I disagree
that qemu is necessarily the owner of the entire partition vs granted
access to devices within the partition.  Thanks,

Alex



Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Konrad Rzeszutek Wilk
On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
 On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
  
  Don't worry, it took me a while to get my head around the HW :-) SR-IOV
  VFs will generally not have limitations like that no, but on the other
  hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
  take a bunch of VFs and put them in the same 'domain'.
  
  I think the main deal is that VFIO/qemu sees domains as guests and
  tries to put all devices for a given guest into a domain.
 
 Actually, that's only a recent optimization, before that each device got
 it's own iommu domain.  It's actually completely configurable on the
 qemu command line which devices get their own iommu and which share.
 The default optimizes the number of domains (one) and thus the number of
 mapping callbacks since we pin the entire guest.
 
  On POWER, we have a different view of things were domains/groups are
  defined to be the smallest granularity we can (down to a single VF) and
  we give several groups to a guest (ie we avoid sharing the iommu in most
  cases)
  
  This is driven by the HW design but that design is itself driven by the
  idea that the domains/group are also error isolation groups and we don't
  want to take all of the IOs of a guest down if one adapter in that guest
  is having an error.
  
  The x86 domains are conceptually different as they are about sharing the
  iommu page tables with the clear long term intent of then sharing those
  page tables with the guest CPU own. We aren't going in that direction
  (at this point at least) on POWER..
 
 Yes and no.  The x86 domains are pretty flexible and used a few
 different ways.  On the host we do dynamic DMA with a domain per device,
 mapping only the inflight DMA ranges.  In order to achieve the
 transparent device assignment model, we have to flip that around and map
 the entire guest.  As noted, we can continue to use separate domains for
 this, but since each maps the entire guest, it doesn't add a lot of
 value and uses more resources and requires more mapping callbacks (and
 x86 doesn't have the best error containment anyway).  If we had a well
 supported IOMMU model that we could adapt for pvDMA, then it would make
 sense to keep each device in it's own domain again.  Thanks,

Could you have a PV IOMMU (in the guest) that would set up those
maps?


Re: kvm PCI assignment VFIO ramblings

2011-08-02 Thread Alex Williamson
On Tue, 2011-08-02 at 17:29 -0400, Konrad Rzeszutek Wilk wrote:
 On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
  On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
   
   Don't worry, it took me a while to get my head around the HW :-) SR-IOV
   VFs will generally not have limitations like that no, but on the other
   hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
   take a bunch of VFs and put them in the same 'domain'.
   
   I think the main deal is that VFIO/qemu sees domains as guests and
   tries to put all devices for a given guest into a domain.
  
  Actually, that's only a recent optimization, before that each device got
  it's own iommu domain.  It's actually completely configurable on the
  qemu command line which devices get their own iommu and which share.
  The default optimizes the number of domains (one) and thus the number of
  mapping callbacks since we pin the entire guest.
  
   On POWER, we have a different view of things were domains/groups are
   defined to be the smallest granularity we can (down to a single VF) and
   we give several groups to a guest (ie we avoid sharing the iommu in most
   cases)
   
   This is driven by the HW design but that design is itself driven by the
   idea that the domains/group are also error isolation groups and we don't
   want to take all of the IOs of a guest down if one adapter in that guest
   is having an error.
   
   The x86 domains are conceptually different as they are about sharing the
   iommu page tables with the clear long term intent of then sharing those
   page tables with the guest CPU own. We aren't going in that direction
   (at this point at least) on POWER..
  
  Yes and no.  The x86 domains are pretty flexible and used a few
  different ways.  On the host we do dynamic DMA with a domain per device,
  mapping only the inflight DMA ranges.  In order to achieve the
  transparent device assignment model, we have to flip that around and map
  the entire guest.  As noted, we can continue to use separate domains for
  this, but since each maps the entire guest, it doesn't add a lot of
  value and uses more resources and requires more mapping callbacks (and
  x86 doesn't have the best error containment anyway).  If we had a well
  supported IOMMU model that we could adapt for pvDMA, then it would make
  sense to keep each device in it's own domain again.  Thanks,
 
 Could you have a PV IOMMU (in the guest) that would set up those
 maps?

Yep, definitely.  That's effectively what power wants to do.  We could
do it on x86, but as others have noted, the map/unmap interface isn't
tuned to do this at that granularity and our target guest OS audience is
effectively reduced to Linux.  Thanks,
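
For completeness, the guest side of such a pv-iommu is just a pair of
per-range hypercalls on the dma map/unmap path, roughly like this (all
names invented, hypercalls stubbed out):

    #include <stdint.h>

    /* stubs standing in for the actual hypercalls into the host */
    static long hc_iommu_map(uint64_t iova, uint64_t gpa, uint64_t len)
    {
            (void)iova; (void)gpa; (void)len;
            return 0;
    }

    static long hc_iommu_unmap(uint64_t iova, uint64_t len)
    {
            (void)iova; (void)len;
            return 0;
    }

    /* Map exactly the in-flight range; the host pins just these pages. */
    static uint64_t pv_dma_map(uint64_t gpa, uint64_t len)
    {
            uint64_t iova = gpa;    /* trivial allocator: identity iova */

            if (hc_iommu_map(iova, gpa, len) < 0)
                    return 0;       /* treat as a mapping error */
            return iova;
    }

    static void pv_dma_unmap(uint64_t iova, uint64_t len)
    {
            hc_iommu_unmap(iova, len);  /* host unpins the pages here */
    }

which is exactly why the per-page map/unmap cost matters so much.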

Alex


