RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-19 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, August 18, 2020 7:50 PM
> 
> On Tue, Aug 18, 2020 at 01:09:01AM +, Tian, Kevin wrote:
> > The difference in my reply is not just about the implementation gap
> > of growing a userspace DMA framework to a passthrough framework.
> > My real point is about the different goals that each wants to achieve.
> > Userspace DMA is purely about allowing userspace to directly access
> > the portal and do DMA, but the wq configuration is always under kernel
> > driver's control. On the other hand, passthrough means delegating full
> > control of the wq to the guest and then associated mock-up (live migration,
> > vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> > value of mixing them together when there is already a good candidate
> > to handle passthrough...
> 
> In Linux a 'VM' and virtualization has always been a normal system
> process that uses a few extra kernel features. This has been more or
> less the cornerstone of that design since the start.
> 
> In that view it doesn't make any sense to say that uAPI from idxd that
> is useful for virtualization somehow doesn't belong as part of the
> standard uAPI.

The point is that we already have a more standard uAPI (VFIO) which
is unified and vendor-agnostic to userspace. Creating an idxd-specific
uAPI to absorb similar requirements that VFIO already handles is not
compelling, and it instead causes more trouble for QEMU or other VMMs,
as they would need to deal with every such driver uAPI even though QEMU
itself has no interest in the device details (since the real user is inside
the guest).

> 
> Especially when it is such a small detail like what APIs are used to
> configure the wq.
> 
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
> 

I'm not familiar with how container suspend/resume is done today.
But my gut feeling is that it's different from virtualization. For
virtualization, the whole wq is assigned to the guest, so the uAPI
must provide a way to save the wq state, including its configuration,
at suspend, and then restore that state to what the guest expects at
resume. However, in the container case, which does userspace DMA, the
wq is managed by the host kernel and could be shared between multiple
containers, so the wq state is irrelevant to the container. The only
relevant state is the in-flight work, which needs a draining interface.
In this view I think the two have a major difference.
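To make the contrast concrete, here is a rough sketch of how the two
interfaces could differ. Every ioctl and struct below is hypothetical,
invented purely for illustration; none of it is existing idxd uAPI.

#include <linux/ioctl.h>
#include <stdint.h>

/* Userspace-DMA / container case (hypothetical): the client only needs
 * to quiesce its own in-flight descriptors before a checkpoint; the wq
 * configuration stays with the host kernel. */
#define WQ_PORTAL_DRAIN		_IO('W', 0x01)

/* Passthrough case (hypothetical): the VMM must capture and re-create
 * the wq configuration that the guest driver owns. */
struct wq_saved_state {
	uint32_t mode;		/* dedicated or shared */
	uint32_t sva_enabled;
	uint32_t threshold;
	uint32_t priv_allowed;
};
#define WQ_SAVE_STATE		_IOR('W', 0x02, struct wq_saved_state)
#define WQ_RESTORE_STATE	_IOW('W', 0x03, struct wq_saved_state)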

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-18 Thread Jason Gunthorpe
On Tue, Aug 18, 2020 at 07:05:16PM +0200, Paolo Bonzini wrote:
> On 18/08/20 18:49, Jason Gunthorpe wrote:
> > On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> >> On 18/08/20 13:50, Jason Gunthorpe wrote:
> >>> For instance, what about suspend/resume of containers using idxd?
> >>> Wouldn't you want to have the same basic approach of controlling the
> >>> wq from userspace that virtualization uses?
> >>
> >> The difference is that VFIO more or less standardizes the approach you
> >> use for live migration.  With another interface you'd have to come up
> >> with something for every driver, and add support in CRIU for every
> >> driver as well.
> > 
> > VFIO is very unsuitable for use as a general userspace interface. It is only 1:1
> > with a single process and just can't absorb what the existing idxd
> > userspace is doing.
> 
> The point of mdev is that it's not 1:1 anymore.

The lifecycle model of mdev is not compatible with how something like
idxd works, it needs a multi-open cdev.
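A "multi-open cdev" here means a character device where each open()
gets its own private context (its own portal mapping, SVA binding,
etc.), so many processes can share one parent device, whereas an mdev
instance is created up front and handed to a single VFIO user. A
minimal sketch of the multi-open pattern, with hypothetical idxd-like
naming:

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/slab.h>

/* Hypothetical per-open state: every client that opens the cdev gets
 * its own context, independent of all other opens. */
struct wq_client_ctx {
	struct mm_struct *mm;	/* for SVA binding, if used */
	void __iomem *portal;	/* portal this client may mmap */
};

static int wq_cdev_open(struct inode *inode, struct file *filp)
{
	struct wq_client_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

	if (!ctx)
		return -ENOMEM;
	/* tie the context to this open file only */
	filp->private_data = ctx;
	return 0;
}

static int wq_cdev_release(struct inode *inode, struct file *filp)
{
	kfree(filp->private_data);
	return 0;
}

static const struct file_operations wq_cdev_fops = {
	.owner   = THIS_MODULE,
	.open    = wq_cdev_open,
	.release = wq_cdev_release,
};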

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-18 Thread Paolo Bonzini
On 18/08/20 18:49, Jason Gunthorpe wrote:
> On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
>> On 18/08/20 13:50, Jason Gunthorpe wrote:
>>> For instance, what about suspend/resume of containers using idxd?
>>> Wouldn't you want to have the same basic approach of controlling the
>>> wq from userspace that virtualization uses?
>>
>> The difference is that VFIO more or less standardizes the approach you
>> use for live migration.  With another interface you'd have to come up
>> with something for every driver, and add support in CRIU for every
>> driver as well.
> 
> VFIO is very unsuitable for use as a general userspace interface. It is only 1:1
> with a single process and just can't absorb what the existing idxd
> userspace is doing.

The point of mdev is that it's not 1:1 anymore.

Paolo

> So VFIO is already not a solution for normal userspace idxd where CRIU
> becomes interesting. Not sure what you are trying to say?
> 
> My point was the opposite, if you want to enable CRIU for idxd then
> you probably need all the same stuff as for qemu/VFIO except in the
> normal idxd user API.
> 
> Jason
> 



Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-18 Thread Jason Gunthorpe
On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> On 18/08/20 13:50, Jason Gunthorpe wrote:
> > For instance, what about suspend/resume of containers using idxd?
> > Wouldn't you want to have the same basic approach of controlling the
> > wq from userspace that virtualization uses?
> 
> The difference is that VFIO more or less standardizes the approach you
> use for live migration.  With another interface you'd have to come up
> with something for every driver, and add support in CRIU for every
> driver as well.

VFIO is very unsuitable for use as a general userspace interface. It is only 1:1
with a single process and just can't absorb what the existing idxd
userspace is doing.

So VFIO is already not a solution for normal userspace idxd where CRIU
becomes interesting. Not sure what you are trying to say?

My point was the opposite, if you want to enable CRIU for idxd then
you probably need all the same stuff as for qemu/VFIO except in the
normal idxd user API.

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-18 Thread Paolo Bonzini
On 18/08/20 13:50, Jason Gunthorpe wrote:
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?

The difference is that VFIO more or less standardizes the approach you
use for live migration.  With another interface you'd have to come up
with something for every driver, and add support in CRIU for every
driver as well.

Paolo



Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-18 Thread Jason Gunthorpe
On Tue, Aug 18, 2020 at 01:09:01AM +, Tian, Kevin wrote:
> The difference in my reply is not just about the implementation gap
> of growing a userspace DMA framework to a passthrough framework.
> My real point is about the different goals that each wants to achieve.
> Userspace DMA is purely about allowing userspace to directly access
> the portal and do DMA, but the wq configuration is always under kernel 
> driver's control. On the other hand, passthrough means delegating full 
> control of the wq to the guest and then associated mock-up (live migration,
> vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> value of mixing them together when there is already a good candidate
> to handle passthrough...

In Linux a 'VM' and virtualization has always been a normal system
process that uses a few extra kernel features. This has been more or
less the cornerstone of that design since the start.

In that view it doesn't make any sense to say that uAPI from idxd that
is useful for virtualization somehow doesn't belong as part of the
standard uAPI.

Especially when it is such a small detail like what APIs are used to
configure the wq.

For instance, what about suspend/resume of containers using idxd?
Wouldn't you want to have the same basic approach of controlling the
wq from userspace that virtualization uses?

Jason


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-17 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, August 18, 2020 8:44 AM
> 
> On Mon, Aug 17, 2020 at 02:12:44AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Friday, August 14, 2020 9:35 PM
> > >
> > > On Mon, Aug 10, 2020 at 07:32:24AM +, Tian, Kevin wrote:
> > >
> > > > > I would prefer to see that the existing userspace interface have the
> > > > > extra needed bits for virtualization (eg by having appropriate
> > > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > > the synthetic PCI device be done in userspace.
> > > >
> > > > In the end what decides the direction is the amount of changes that
> > > > we have to put in kernel, not whether we call it 'emulation'.
> > >
> > > No, this is not right. The decision should be based on what will end
> > > up more maintainable in the long run.
> > >
> > > Yes it would be more code to dis-aggregate some of the things
> > > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > > above) but once it is disaggregated the maintainability of the whole
> > > solution will be better overall, and more drivers will be able to use
> > > this functionality.
> > >
> >
> > Disaggregation is an orthogonal topic to the main divergence in
> > this thread, which is passthrough vs. userspace DMA. I gave a detailed
> > explanation of the difference between the two in my last reply.
> 
> You said the first issue was related to SVA, which is understandable
> because we have no SVA uAPIs outside VFIO.
> 
> Imagine if we had some /dev/sva that provided this API and user space
> DMA drivers could simply accept an FD and work properly. It is not
> such a big leap anymore, nor is it customized code in idxd.
> 
> The other pass through issue was IRQ, which last time I looked, was
> fairly trivial to connect via interrupt remapping in the kernel, or
> could be made extremely trivial.
> 
> The last seemed to be a concern that the current uAPI for idxd was
> lacking some idxd-specific features, which seems like quite a weak
> reason to use VFIO.
> 

The difference in my reply is not just about the implementation gap
of growing a userspace DMA framework to a passthrough framework.
My real point is about the different goals that each wants to achieve.
Userspace DMA is purely about allowing userspace to directly access
the portal and do DMA, but the wq configuration is always under kernel 
driver's control. On the other hand, passthrough means delegating full 
control of the wq to the guest and then associated mock-up (live migration,
vSVA, posted interrupt, etc.) for that to work. I really didn't see the
value of mixing them together when there is already a good candidate
to handle passthrough...

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-17 Thread Jason Gunthorpe
On Mon, Aug 17, 2020 at 02:12:44AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, August 14, 2020 9:35 PM
> > 
> > On Mon, Aug 10, 2020 at 07:32:24AM +, Tian, Kevin wrote:
> > 
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.
> > >
> > > In the end what decides the direction is the amount of changes that
> > > we have to put in kernel, not whether we call it 'emulation'.
> > 
> > No, this is not right. The decision should be based on what will end
> > up more maintainable in the long run.
> > 
> > Yes it would be more code to dis-aggregate some of the things
> > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > above) but once it is disaggregated the maintainability of the whole
> > solution will be better overall, and more drivers will be able to use
> > this functionality.
> > 
> 
> Disaggregation is an orthogonal topic to the main divergence in 
> this thread, which is passthrough vs. userspace DMA. I gave a detailed
> explanation of the difference between the two in my last reply.

You said the first issue was related to SVA, which is understandable
because we have no SVA uAPIs outside VFIO.

Imagine if we had some /dev/sva that provided this API and user space
DMA drivers could simply accept an FD and work properly. It is not
such a big leap anymore, nor is it customized code in idxd.
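As a thought experiment only, here is a userspace-view sketch of what
such a /dev/sva FD could look like. The device node, ioctl name and
struct below are all hypothetical; no such uAPI exists today.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical /dev/sva uAPI: bind an address space and get a PASID
 * once, then hand the FD to whatever driver-specific uAPI needs it. */
struct sva_bind_mm {
	uint32_t flags;
	uint32_t pasid;		/* returned by the kernel */
};
#define SVA_BIND_CURRENT_MM	_IOWR('S', 0x00, struct sva_bind_mm)

int example(void)
{
	struct sva_bind_mm bind = { .flags = 0 };
	int sva_fd = open("/dev/sva", O_RDWR);	/* hypothetical node */

	ioctl(sva_fd, SVA_BIND_CURRENT_MM, &bind);

	/* A userspace DMA driver (idxd, uacce, vDPA, ...) would accept
	 * sva_fd through its own uAPI instead of every subsystem
	 * reinventing PASID allocation and mm binding. */
	return bind.pasid;
}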

The other pass through issue was IRQ, which last time I looked, was
fairly trivial to connect via interrupt remapping in the kernel, or
could be made extremely trivial.

The last seemed to be a concern that the current uAPI for idxd was
lacking some idxd-specific features, which seems like quite a weak
reason to use VFIO.

Jason


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-16 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, August 14, 2020 9:24 PM
> 
> The same basic argument goes for all the points - the issue is really
> the only uAPI we have for this stuff is under VFIO, and the better
> solution is to disagregate that uAPI, not to try and make everything
> pretend to be a VFIO device.
> 

Nobody is proposing to make everything VFIO. There must be some
criteria, which can be brainstormed at LPC. But the opposite also holds -
the fact that we should not make everything VFIO doesn't imply a
prohibition on anyone using it. There is a clear difference between
passthrough and userspace DMA requirements in the idxd context, and we
see good reasons to use VFIO for our passthrough requirements.


Thanks
Kevin


RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-16 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Friday, August 14, 2020 9:35 PM
> 
> On Mon, Aug 10, 2020 at 07:32:24AM +, Tian, Kevin wrote:
> 
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'.
> 
> No, this is not right. The decision should be based on what will end
> up more maintainable in the long run.
> 
> Yes it would be more code to dis-aggregate some of the things
> currently only bundled as uAPI inside VFIO (eg your vSVA argument
> above) but once it is disaggregated the maintainability of the whole
> solution will be better overall, and more drivers will be able to use
> this functionality.
> 

Disaggregation is an orthogonal topic to the main divergence in 
this thread, which is passthrough vs. userspace DMA. I gave a detailed
explanation of the difference between the two in my last reply.
The possibility of disaggregating something between passthrough
frameworks (e.g. VFIO and vDPA) is not a reason for growing
every userspace DMA framework into a passthrough framework.
Doing that instead hurts maintainability in general...

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-14 Thread Jason Gunthorpe
On Mon, Aug 10, 2020 at 07:32:24AM +, Tian, Kevin wrote:

> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.
> 
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. 

No, this is not right. The decision should be based on what will end
up more maintainable in the long run.

Yes it would be more code to dis-aggregate some of the things
currently only bundled as uAPI inside VFIO (eg your vSVA argument
above) but once it is disaggregated the maintainability of the whole
solution will be better overall, and more drivers will be able to use
this functionality.

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-14 Thread Jason Gunthorpe
On Thu, Aug 13, 2020 at 02:01:58PM +0800, Jason Wang wrote:
> 
> On 2020/8/13 1:26 PM, Tian, Kevin wrote:
> > > From: Jason Wang 
> > > Sent: Thursday, August 13, 2020 12:34 PM
> > > 
> > > 
> > > On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> > > > > The problem is that if we tie all controls via VFIO uAPI, the other
> > > > > subsystem like vDPA is likely to duplicate them. I wonder if there is 
> > > > > a
> > > > > way to decouple the vSVA out of VFIO uAPI?
> > > > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > > > be managed by its device driver (VFIO or vDPA). From this angle some
> > > > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > > > frameworks. Within the kernel the majority of vSVA handling is done by
> > > > IOMMU and IOASID modules thus most of the logic is shared.
> > > 
> > > So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> > One may ask a similar question why IOMMU doesn't expose map/unmap
> > as uAPI...
> 
> 
> I think this is probably a good idea as well. If there's anything missing in
> the infrastructure, we can invent it. Besides vhost-vDPA, there are other
> subsystems that relay their uAPI to the IOMMU API. Duplicating uAPIs is
> usually a hint of code duplication. Simple map/unmap could be easy, but
> the vSVA uAPI is much more complicated.

A way to create the vSVA objects unrelated to VFIO and then pass those
objects for device use into existing uAPIs, to me, makes a lot of
sense.

You should not have to use the VFIO API just to get vSVA.

Or stated another way, the existing user driver should be able to get
a PASID from the vSVA components as well as just create a PASID from
the local mm_struct.
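For the local-mm case the kernel already has an SVA API that a native
user driver path can call; a minimal sketch (signatures as of the
~2020 IOMMU SVA API, which have shifted in later kernels, and error
paths trimmed):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/sched.h>

/* Bind the calling process's address space to the device and report
 * the PASID the IOMMU assigned to it. */
static int bind_current_mm(struct device *dev, int *pasid)
{
	struct iommu_sva *handle;

	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	*pasid = iommu_sva_get_pasid(handle);
	return 0;
}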

The same basic argument goes for all the points - the issue is really
the only uAPI we have for this stuff is under VFIO, and the better
solution is to disagregate that uAPI, not to try and make everything
pretend to be a VFIO device.

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-13 Thread Jason Wang



On 2020/8/13 1:26 PM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Thursday, August 13, 2020 12:34 PM


On 2020/8/12 12:05 PM, Tian, Kevin wrote:

The problem is that if we tie all controls via VFIO uAPI, the other
subsystem like vDPA is likely to duplicate them. I wonder if there is a
way to decouple the vSVA out of VFIO uAPI?

vSVA is a per-device (either pdev or mdev) feature thus naturally should
be managed by its device driver (VFIO or vDPA). From this angle some
duplication is inevitable given VFIO and vDPA are orthogonal passthrough
frameworks. Within the kernel the majority of vSVA handling is done by
IOMMU and IOASID modules thus most of the logic is shared.


So why not introduce vSVA uAPI at IOMMU or IOASID layer?

One may ask a similar question why IOMMU doesn't expose map/unmap
as uAPI...



I think this is probably a good idea as well. If there's anything missing
in the infrastructure, we can invent it. Besides vhost-vDPA, there are
other subsystems that relay their uAPI to the IOMMU API. Duplicating
uAPIs is usually a hint of code duplication. Simple map/unmap could
be easy, but the vSVA uAPI is much more complicated.
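As a reference point, the in-kernel IOASID allocator is one layer that
VFIO, vDPA and others could share underneath such a uAPI; a rough
sketch of allocating and freeing a PASID through it (signatures taken
from the ~v5.8 ioasid code, used here as an assumption):

#include <linux/ioasid.h>

/* Carve a PASID out of a common range; both a VFIO-managed and a
 * vDPA-managed device could draw from the same allocator, which is
 * what makes a shared uAPI on top of it attractive. */
static ioasid_t alloc_guest_pasid(struct ioasid_set *set, void *priv)
{
	/* the 1..1023 range is arbitrary, for the example only */
	return ioasid_alloc(set, 1, 1023, priv);
}

static void free_guest_pasid(ioasid_t pasid)
{
	ioasid_free(pasid);
}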








If an userspace DMA interface can be easily
adapted to be a passthrough one, it might be the choice.

It's not that easy even for VFIO which requires a lot of new uAPIs and
infrastructures (e.g. mdev) to be invented.



But for idxd,
we see mdev a much better fit here, given the big difference between
what userspace DMA requires and what guest driver requires in this hw.

A weak point for mdev is that it can't serve kernel subsystem other than
VFIO. In this case, you need some other infrastructures (like [1]) to do
this.

mdev does not exclude kernel usages. It's perfectly fine for a driver
to reserve some work queues for host usages, while wrapping others
into mdevs.


I meant you may want slices to be an independent device from the kernel
point of view:

E.g for ethernet devices, you may want 10K mdevs to be passed to guest.

Similarly, you may want 10K net devices which is connected to the kernel
networking subsystems.

In this case it's not simply reserving queues but you need some other
type of device abstraction. There could be some kind of duplication
between this and mdev.


Yes, some abstraction is required, but isn't that what the driver should
care about instead of the mdev framework itself?



With mdev you present a "PCI" device, but what kind of device does it try
to present to the kernel? If it's still PCI, there's duplication with mdev;
if it's something new, maybe we can switch to that API.




If the driver reports
the same set of resources to both mdev and networking, it needs to
make sure that when a resource is claimed through one interface it
is marked in-use in the other. E.g. each mdev type includes an
available_instances attribute; the driver could report 10K available
instances initially and then update it to 5K when the other 5K are used
for net devices later.



Right but this probably means you need another management layer under mdev.




Mdev definitely has its usage limitations. Some may be improved
in the future, some may not. But those are distracting from the
original purpose of this thread (mdev vs. userspace DMA) and are
better discussed elsewhere, e.g. at LPC...



Ok.

Thanks




Thanks
Kevin




RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-12 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, August 13, 2020 12:34 PM
> 
> 
> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> >> The problem is that if we tie all controls via VFIO uAPI, the other
> >> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> >> way to decouple the vSVA out of VFIO uAPI?
> > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > be managed by its device driver (VFIO or vDPA). From this angle some
> > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > frameworks. Within the kernel the majority of vSVA handling is done by
> > IOMMU and IOASID modules thus most of the logic is shared.
> 
> 
> So why not introduce vSVA uAPI at IOMMU or IOASID layer?

One may ask a similar question why IOMMU doesn't expose map/unmap
as uAPI...

> 
> 
> >
> >>>If an userspace DMA interface can be easily
> >>> adapted to be a passthrough one, it might be the choice.
> >> It's not that easy even for VFIO which requires a lot of new uAPIs and
> >> infrastructures (e.g. mdev) to be invented.
> >>
> >>
> >>> But for idxd,
> >>> we see mdev a much better fit here, given the big difference between
> >>> what userspace DMA requires and what guest driver requires in this hw.
> >> A weak point for mdev is that it can't serve kernel subsystem other than
> >> VFIO. In this case, you need some other infrastructures (like [1]) to do
> >> this.
> > mdev does not exclude kernel usages. It's perfectly fine for a driver
> > to reserve some work queues for host usages, while wrapping others
> > into mdevs.
> 
> 
> I meant you may want slices to be an independent device from the kernel
> point of view:
> 
> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
> 
> Similarly, you may want 10K net devices which is connected to the kernel
> networking subsystems.
> 
> In this case it's not simply reserving queues but you need some other
> type of device abstraction. There could be some kind of duplication
> between this and mdev.
> 

Yes, some abstraction is required, but isn't that what the driver should
care about instead of the mdev framework itself? If the driver reports
the same set of resources to both mdev and networking, it needs to
make sure that when a resource is claimed through one interface it
is marked in-use in the other. E.g. each mdev type includes an
available_instances attribute; the driver could report 10K available
instances initially and then update it to 5K when the other 5K are used
for net devices later (see the sketch below).
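A sketch of what that accounting could look like on the parent driver
side (the type-attribute callback signature follows the ~2020 mdev
interface, which later kernels have changed; the numbers are only
illustrative):

#include <linux/atomic.h>
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/mdev.h>

#define TOTAL_QUEUES	10000

/* queues currently claimed by the host networking path */
static atomic_t queues_in_host_use = ATOMIC_INIT(0);

static ssize_t available_instances_show(struct kobject *kobj,
					struct device *dev, char *buf)
{
	/* whatever the host side has taken is no longer available
	 * for mdev instances */
	return sprintf(buf, "%d\n",
		       TOTAL_QUEUES - atomic_read(&queues_in_host_use));
}
MDEV_TYPE_ATTR_RO(available_instances);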

Mdev definitely has its usage limitations. Some may be improved
in the future, some may not. But those are distracting from the
original purpose of this thread (mdev vs. userspace DMA) and are
better discussed elsewhere, e.g. at LPC...

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-12 Thread Jason Wang



On 2020/8/12 12:05 PM, Tian, Kevin wrote:

The problem is that if we tie all controls via VFIO uAPI, the other
subsystem like vDPA is likely to duplicate them. I wonder if there is a
way to decouple the vSVA out of VFIO uAPI?

vSVA is a per-device (either pdev or mdev) feature thus naturally should
be managed by its device driver (VFIO or vDPA). From this angle some
duplication is inevitable given VFIO and vDPA are orthogonal passthrough
frameworks. Within the kernel the majority of vSVA handling is done by
IOMMU and IOASID modules thus most of the logic is shared.



So why not introduce vSVA uAPI at IOMMU or IOASID layer?





   If an userspace DMA interface can be easily
adapted to be a passthrough one, it might be the choice.

It's not that easy even for VFIO which requires a lot of new uAPIs and
infrastructures (e.g. mdev) to be invented.



But for idxd,
we see mdev a much better fit here, given the big difference between
what userspace DMA requires and what guest driver requires in this hw.

A weak point for mdev is that it can't serve kernel subsystem other than
VFIO. In this case, you need some other infrastructures (like [1]) to do
this.

mdev does not exclude kernel usages. It's perfectly fine for a driver
to reserve some work queues for host usages, while wrapping others
into mdevs.



I meant you may want slices to be an independent device from the kernel 
point of view:


E.g for ethernet devices, you may want 10K mdevs to be passed to guest.

Similarly, you may want 10K net devices which is connected to the kernel 
networking subsystems.


In this case it's not simply reserving queues but you need some other 
type of device abstraction. There could be some kind of duplication 
between this and mdev.


Thanks




Thanks
Kevin





RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Wednesday, August 12, 2020 11:28 AM
> 
> 
> On 2020/8/10 3:32 PM, Tian, Kevin wrote:
> >> From: Jason Gunthorpe 
> >> Sent: Friday, August 7, 2020 8:20 PM
> >>
> >> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >>
> >>> If you see this as an abuse of the framework, then let's identify those
> >>> specific issues and come up with a better approach.  As we've discussed
> >>> before, things like basic PCI config space emulation are acceptable
> >>> overhead and low risk (imo) and some degree of register emulation is
> >>> well within the territory of an mdev driver.
> >> What troubles me is that idxd already has a direct userspace interface
> >> to its HW, and does userspace DMA. The purpose of this mdev is to
> >> provide a second direct userspace interface that is a little different
> >> and trivially plugs into the virtualization stack.
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > * PASID allocation/free (on some platforms);
> > * bind/unbind guest mm/page table (nested translation);
> > * invalidate IOMMU cache/iotlb for guest page table changes;
> > * report page request from device to guest;
> > * forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > - put in dedicated or shared mode;
> > - enable/disable SVA;
> > - Associate guest-provided PASID to MSI/IMS entry;
> > - set threshold;
> > - allow/deny privileged access;
> > - allocate/free interrupt handle (enlightened for guest);
> > - collect error status;
> > - ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> 
> 
> So for userspace drivers like DPDK, it can use both of the two uAPIs?

yes.

> 
> 
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve a simple application-oriented userspace
> > DMA interface to a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> >> I don't think VFIO should be the only entry point to
> >> virtualization. If we say the universe of devices doing user space DMA
> >> must also implement a VFIO mdev to plug into virtualization then it
> >> will be a lot of mdevs.
> > Certainly VFIO will not be the only entry point, and this has to be a
> > case-by-case decision.
> 
> 
> The problem is that if we tie all controls via VFIO uAPI, the other
> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> way to decouple the vSVA out of VFIO uAPI?

vSVA is a per-device (either pdev or mdev) feature thus naturally should 
be managed by its device driver (VFIO or vDPA). From this angle some
duplication is inevitable given VFIO and vDPA are orthogonal passthrough
frameworks. Within the kernel the majority of vSVA handling is done by
IOMMU and IOASID modules thus most of the logic is shared.

> 
> 
> >   If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice.
> 
> 
> It's not that easy even for VFIO which requires a lot of new uAPIs and
> infrastructures (e.g. mdev) to be invented.
> 
> 
> > But for idxd,
> > we see mdev a much better 

RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, August 12, 2020 10:36 AM
> On Wed, 12 Aug 2020 01:58:00 +
> "Tian, Kevin"  wrote:
> 
> > >
> > > I'll also remind folks that LPC is coming up in just a couple short
> > > weeks and this might be something we should discuss (virtually)
> > > in-person.  uconf CfPs are currently open.Thanks,
> > >
> >
> > Yes, LPC is a good place to reach consensus. btw I saw there is
> > already one VFIO topic called "device assignment/sub-assignment".
> > Do you think whether this can be covered under that topic, or
> > makes more sense to be a new one?
> 
> All the things listed in the CFP are only potential topics to get ideas
> flowing, there is currently no proposal to talk about sub-assignment.
> I'd suggest submitting separate topics for each and if we run into time
> constraints we can ask that they might be combined together.  Thanks,
> 

Done.
--
title: Criteria of using VFIO mdev (vs. userspace DMA)

Content:
VFIO mdev provides a framework for subdevice assignment and reuses
the existing VFIO uAPI to handle common passthrough-related requirements.
However, a subdevice (e.g. an ADI as defined in Intel Scalable IOV) might not
be a PCI endpoint (e.g. just a work queue), and thus requires some degree of
emulation/mediation in the kernel to fit into the VFIO device API. This raises
a concern about putting emulation in the kernel, and how to judge abuse of
the mdev framework as merely an easy path to hook into the
virtualization stack. An associated open question is how to differentiate mdev
from a userspace DMA framework (such as uacce), and whether building
passthrough features on top of a userspace DMA framework is a better
choice than using mdev.

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Jason Wang



On 2020/8/10 3:32 PM, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Friday, August 7, 2020 8:20 PM

On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:


If you see this as an abuse of the framework, then let's identify those
specific issues and come up with a better approach.  As we've discussed
before, things like basic PCI config space emulation are acceptable
overhead and low risk (imo) and some degree of register emulation is
well within the territory of an mdev driver.

What troubles me is that idxd already has a direct userspace interface
to its HW, and does userspace DMA. The purpose of this mdev is to
provide a second direct userspace interface that is a little different
and trivially plugs into the virtualization stack.

No. Userspace DMA and subdevice passthrough (what mdev provides)
are two distinct usages IMO (at least in idxd context). and this might
be the main divergence between us, thus let me put more words here.
If we could reach consensus in this matter, which direction to go
would be clearer.

First, a passthrough interface requires some unique requirements
which are not commonly observed in an userspace DMA interface, e.g.:

- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside guest;
* PASID allocation/free (on some platforms);
* bind/unbind guest mm/page table (nested translation);
* invalidate IOMMU cache/iotlb for guest page table changes;
* report page request from device to guest;
* forward page response from guest to device;
- Configuring irqbypass for posted interrupt;
- ...

Second, a passthrough interface requires delegating raw controllability
of subdevice to guest driver, while the same delegation might not be
required for implementing an userspace DMA interface (especially for
modern devices which support SVA). For example, idxd allows following
setting per wq (guest driver may configure them in any combination):
- put in dedicated or shared mode;
- enable/disable SVA;
- Associate guest-provided PASID to MSI/IMS entry;
- set threshold;
- allow/deny privileged access;
- allocate/free interrupt handle (enlightened for guest);
- collect error status;
- ...

We plan to support idxd userspace DMA with SVA. The driver just needs
to prepare a wq with a predefined configuration (e.g. shared, SVA,
etc.), bind the process mm to IOMMU (non-nested) and then map
the portal to userspace. The goal that userspace can do DMA to
associated wq doesn't change the fact that the wq is still *owned*
and *controlled* by kernel driver. However as far as passthrough
is concerned, the wq is considered 'owned' by the guest driver thus
we need an interface which can support low-level *controllability*
from guest driver. It is sort of a mess in uAPI when mixing the
two together.



So for userspace drivers like DPDK, it can use both of the two uAPIs?




Based on above two reasons, we see distinct requirements between
userspace DMA and passthrough interfaces, at least in idxd context
(though other devices may have less distinction in-between). Therefore,
we didn't see the value/necessity of reinventing the wheel that mdev
already handles well to evolve a simple application-oriented userspace
DMA interface to a complex guest-driver-oriented passthrough interface.
The complexity of doing so would incur far more kernel-side changes
than the portion of emulation code that you've been concerned about...
  

I don't think VFIO should be the only entry point to
virtualization. If we say the universe of devices doing user space DMA
must also implement a VFIO mdev to plug into virtualization then it
will be a lot of mdevs.

Certainly VFIO will not be the only entry point, and this has to be a
case-by-case decision.



The problem is that if we tie all controls via VFIO uAPI, the other 
subsystem like vDPA is likely to duplicate them. I wonder if there is a 
way to decouple the vSVA out of VFIO uAPI?




  If an userspace DMA interface can be easily
adapted to be a passthrough one, it might be the choice.



It's not that easy even for VFIO which requires a lot of new uAPIs and 
infrastructures (e.g. mdev) to be invented.




But for idxd,
we see mdev a much better fit here, given the big difference between
what userspace DMA requires and what guest driver requires in this hw.



A weak point for mdev is that it can't serve kernel subsystem other than 
VFIO. In this case, you need some other infrastructures (like [1]) to do 
this.


(For idxd, you probably don't need this, but it's pretty common in the 
case of networking or storage device.)


Thanks

[1] https://patchwork.kernel.org/patch/11280547/





I would prefer to see that the existing userspace interface have the
extra needed bits for virtualization (eg by having appropriate
internal kernel APIs to make this easy) and all the emulation to build
the synthetic PCI device be done in userspace.

In the 

Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Alex Williamson
On Wed, 12 Aug 2020 01:58:00 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Wednesday, August 12, 2020 1:01 AM
> > 
> > On Mon, 10 Aug 2020 07:32:24 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Jason Gunthorpe 
> > > > Sent: Friday, August 7, 2020 8:20 PM
> > > >
> > > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > > >  
> > > > > If you see this as an abuse of the framework, then let's identify 
> > > > > those
> > > > > specific issues and come up with a better approach.  As we've 
> > > > > discussed
> > > > > before, things like basic PCI config space emulation are acceptable
> > > > > overhead and low risk (imo) and some degree of register emulation is
> > > > > well within the territory of an mdev driver.  
> > > >
> > > > What troubles me is that idxd already has a direct userspace interface
> > > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > > provide a second direct userspace interface that is a little different
> > > > and trivially plugs into the virtualization stack.  
> > >
> > > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > > are two distinct usages IMO (at least in idxd context). and this might
> > > be the main divergence between us, thus let me put more words here.
> > > If we could reach consensus in this matter, which direction to go
> > > would be clearer.
> > >
> > > First, a passthrough interface requires some unique requirements
> > > which are not commonly observed in an userspace DMA interface, e.g.:
> > >
> > > - Tracking DMA dirty pages for live migration;
> > > - A set of interfaces for using SVA inside guest;
> > >   * PASID allocation/free (on some platforms);
> > >   * bind/unbind guest mm/page table (nested translation);
> > >   * invalidate IOMMU cache/iotlb for guest page table changes;
> > >   * report page request from device to guest;
> > >   * forward page response from guest to device;
> > > - Configuring irqbypass for posted interrupt;
> > > - ...
> > >
> > > Second, a passthrough interface requires delegating raw controllability
> > > of subdevice to guest driver, while the same delegation might not be
> > > required for implementing an userspace DMA interface (especially for
> > > modern devices which support SVA). For example, idxd allows following
> > > setting per wq (guest driver may configure them in any combination):
> > >   - put in dedicated or shared mode;
> > >   - enable/disable SVA;
> > >   - Associate guest-provided PASID to MSI/IMS entry;
> > >   - set threshold;
> > >   - allow/deny privileged access;
> > >   - allocate/free interrupt handle (enlightened for guest);
> > >   - collect error status;
> > >   - ...
> > >
> > > We plan to support idxd userspace DMA with SVA. The driver just needs
> > > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > > etc.), bind the process mm to IOMMU (non-nested) and then map
> > > the portal to userspace. The goal that userspace can do DMA to
> > > associated wq doesn't change the fact that the wq is still *owned*
> > > and *controlled* by kernel driver. However as far as passthrough
> > > is concerned, the wq is considered 'owned' by the guest driver thus
> > > we need an interface which can support low-level *controllability*
> > > from guest driver. It is sort of a mess in uAPI when mixing the
> > > two together.
> > >
> > > Based on above two reasons, we see distinct requirements between
> > > userspace DMA and passthrough interfaces, at least in idxd context
> > > (though other devices may have less distinction in-between). Therefore,
> > > we didn't see the value/necessity of reinventing the wheel that mdev
> > > already handles well to evolve a simple application-oriented userspace
> > > DMA interface to a complex guest-driver-oriented passthrough interface.
> > > The complexity of doing so would incur far more kernel-side changes
> > > than the portion of emulation code that you've been concerned about...
> > >  
> > > >
> > > > I don't think VFIO should be the only entry point to
> > > > virtualization. If we say the universe of devices doing user space DMA
> > > > must also implement a VFIO mdev to plug into virtualization then it
> > > > will be a lot of mdevs.
> > >
> > > Certainly VFIO will not be the only entry point, and this has to be a
> > > case-by-case decision.  If an userspace DMA interface can be easily
> > > adapted to be a passthrough one, it might be the choice. But for idxd,
> > > we see mdev a much better fit here, given the big difference between
> > > what userspace DMA requires and what guest driver requires in this hw.
> > >  
> > > >
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.  
> > >
> > > In the end what decides the direction 

RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, August 12, 2020 1:01 AM
> 
> On Mon, 10 Aug 2020 07:32:24 +
> "Tian, Kevin"  wrote:
> 
> > > From: Jason Gunthorpe 
> > > Sent: Friday, August 7, 2020 8:20 PM
> > >
> > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > >
> > > > If you see this as an abuse of the framework, then let's identify those
> > > > specific issues and come up with a better approach.  As we've discussed
> > > > before, things like basic PCI config space emulation are acceptable
> > > > overhead and low risk (imo) and some degree of register emulation is
> > > > well within the territory of an mdev driver.
> > >
> > > What troubles me is that idxd already has a direct userspace interface
> > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > provide a second direct userspace interface that is a little different
> > > and trivially plugs into the virtualization stack.
> >
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > * PASID allocation/free (on some platforms);
> > * bind/unbind guest mm/page table (nested translation);
> > * invalidate IOMMU cache/iotlb for guest page table changes;
> > * report page request from device to guest;
> > * forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > - put in dedicated or shared mode;
> > - enable/disable SVA;
> > - Associate guest-provided PASID to MSI/IMS entry;
> > - set threshold;
> > - allow/deny privileged access;
> > - allocate/free interrupt handle (enlightened for guest);
> > - collect error status;
> > - ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve a simple application-oriented userspace
> > DMA interface to a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> > >
> > > I don't think VFIO should be the only entry point to
> > > virtualization. If we say the universe of devices doing user space DMA
> > > must also implement a VFIO mdev to plug into virtualization then it
> > > will be a lot of mdevs.
> >
> > Certainly VFIO will not be the only entry point, and this has to be a
> > case-by-case decision.  If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice. But for idxd,
> > we see mdev a much better fit here, given the big difference between
> > what userspace DMA requires and what guest driver requires in this hw.
> >
> > >
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > adding special passthrough requirements (guest SVA, dirty tracking,
> > etc.) and raw controllability to the simple userspace DMA interface
> > is 

Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-11 Thread Alex Williamson
On Mon, 10 Aug 2020 07:32:24 +
"Tian, Kevin"  wrote:

> > From: Jason Gunthorpe 
> > Sent: Friday, August 7, 2020 8:20 PM
> > 
> > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >   
> > > If you see this as an abuse of the framework, then let's identify those
> > > specific issues and come up with a better approach.  As we've discussed
> > > before, things like basic PCI config space emulation are acceptable
> > > overhead and low risk (imo) and some degree of register emulation is
> > > well within the territory of an mdev driver.  
> > 
> > What troubles me is that idxd already has a direct userspace interface
> > to its HW, and does userspace DMA. The purpose of this mdev is to
> > provide a second direct userspace interface that is a little different
> > and trivially plugs into the virtualization stack.  
> 
> No. Userspace DMA and subdevice passthrough (what mdev provides)
> are two distinct usages IMO (at least in idxd context). and this might 
> be the main divergence between us, thus let me put more words here. 
> If we could reach consensus in this matter, which direction to go 
> would be clearer.
> 
> First, a passthrough interface requires some unique requirements 
> which are not commonly observed in an userspace DMA interface, e.g.:
> 
> - Tracking DMA dirty pages for live migration;
> - A set of interfaces for using SVA inside guest;
>   * PASID allocation/free (on some platforms);
>   * bind/unbind guest mm/page table (nested translation);
>   * invalidate IOMMU cache/iotlb for guest page table changes;
>   * report page request from device to guest;
>   * forward page response from guest to device;
> - Configuring irqbypass for posted interrupt;
> - ...
> 
> Second, a passthrough interface requires delegating raw controllability
> of subdevice to guest driver, while the same delegation might not be
> required for implementing an userspace DMA interface (especially for
> modern devices which support SVA). For example, idxd allows following
> setting per wq (guest driver may configure them in any combination):
>   - put in dedicated or shared mode;
>   - enable/disable SVA;
>   - Associate guest-provided PASID to MSI/IMS entry;
>   - set threshold;
>   - allow/deny privileged access;
>   - allocate/free interrupt handle (enlightened for guest);
>   - collect error status;
>   - ...
> 
> We plan to support idxd userspace DMA with SVA. The driver just needs 
> to prepare a wq with a predefined configuration (e.g. shared, SVA, 
> etc.), bind the process mm to IOMMU (non-nested) and then map 
> the portal to userspace. The goal that userspace can do DMA to 
> associated wq doesn't change the fact that the wq is still *owned* 
> and *controlled* by kernel driver. However as far as passthrough 
> is concerned, the wq is considered 'owned' by the guest driver thus 
> we need an interface which can support low-level *controllability* 
> from guest driver. It is sort of a mess in uAPI when mixing the
> two together.
> 
> Based on above two reasons, we see distinct requirements between 
> userspace DMA and passthrough interfaces, at least in idxd context 
> (though other devices may have less distinction in-between). Therefore,
> we didn't see the value/necessity of reinventing the wheel that mdev 
> already handles well to evolve a simple application-oriented userspace
> DMA interface to a complex guest-driver-oriented passthrough interface. 
> The complexity of doing so would incur far more kernel-side changes 
> than the portion of emulation code that you've been concerned about...
>  
> > 
> > I don't think VFIO should be the only entry point to
> > virtualization. If we say the universe of devices doing user space DMA
> > must also implement a VFIO mdev to plug into virtualization then it
> > will be a lot of mdevs.
> 
> Certainly VFIO will not be the only entry point, and this has to be a
> case-by-case decision.  If an userspace DMA interface can be easily 
> adapted to be a passthrough one, it might be the choice. But for idxd, 
> we see mdev a much better fit here, given the big difference between 
> what userspace DMA requires and what guest driver requires in this hw.
> 
> > 
> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.  
> 
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. For idxd,
> adding special passthrough requirements (guest SVA, dirty tracking,
> etc.) and raw controllability to the simple userspace DMA interface 
> is for sure making kernel more complex than reusing the mdev
> framework (plus some degree of emulation mockup behind). Not to
> mention the merit of uAPI compatibility with mdev...

I agree 

RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-10 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, August 7, 2020 8:20 PM
> 
> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> 
> > If you see this as an abuse of the framework, then let's identify those
> > specific issues and come up with a better approach.  As we've discussed
> > before, things like basic PCI config space emulation are acceptable
> > overhead and low risk (imo) and some degree of register emulation is
> > well within the territory of an mdev driver.
> 
> What troubles me is that idxd already has a direct userspace interface
> to its HW, and does userspace DMA. The purpose of this mdev is to
> provide a second direct userspace interface that is a little different
> and trivially plugs into the virtualization stack.

No. Userspace DMA and subdevice passthrough (what mdev provides)
are two distinct usages IMO (at least in idxd context). and this might 
be the main divergence between us, thus let me put more words here. 
If we could reach consensus in this matter, which direction to go 
would be clearer.

First, a passthrough interface requires some unique requirements 
which are not commonly observed in an userspace DMA interface, e.g.:

- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside guest;
* PASID allocation/free (on some platforms);
* bind/unbind guest mm/page table (nested translation);
* invalidate IOMMU cache/iotlb for guest page table changes;
* report page request from device to guest;
* forward page response from guest to device;
- Configuring irqbypass for posted interrupt;
- ...

Second, a passthrough interface requires delegating raw controllability
of subdevice to guest driver, while the same delegation might not be
required for implementing an userspace DMA interface (especially for
modern devices which support SVA). For example, idxd allows following
setting per wq (guest driver may configure them in any combination):
- put in dedicated or shared mode;
- enable/disable SVA;
- Associate guest-provided PASID to MSI/IMS entry;
- set threshold;
- allow/deny privileged access;
- allocate/free interrupt handle (enlightened for guest);
- collect error status;
- ...

We plan to support idxd userspace DMA with SVA. The driver just needs
to prepare a wq with a predefined configuration (e.g. shared, SVA,
etc.), bind the process mm to the IOMMU (non-nested) and then map
the portal to userspace. The fact that userspace can do DMA to the
associated wq doesn't change the fact that the wq is still *owned*
and *controlled* by the kernel driver. As far as passthrough is
concerned, however, the wq is considered 'owned' by the guest driver,
so we need an interface which can support low-level *controllability*
from the guest driver. Mixing the two together makes a mess of the
uAPI. A rough sketch of the userspace DMA flow follows.
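
For contrast, the application side of the userspace DMA usage could look
roughly like this (the device path and portal size are illustrative, and
descriptor submission via ENQCMD/MOVDIR64B is omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative path: the wq char device exposed by the kernel driver. */
	int fd = open("/dev/dsa/wq0.0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * The kernel driver has already configured the wq (shared, SVA) and
	 * bound this process's mm to the IOMMU; all the application gets is
	 * a mapped portal to submit work through.
	 */
	void *portal = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
	if (portal == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* Descriptors would be submitted here with ENQCMD (shared wq) or
	 * MOVDIR64B (dedicated wq); omitted for brevity. */

	munmap(portal, 4096);
	close(fd);
	return 0;
}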

Based on the above two reasons, we see distinct requirements between
userspace DMA and passthrough interfaces, at least in the idxd context
(though other devices may show less of a distinction). Therefore, we
don't see the value or necessity of reinventing the wheel that mdev
already handles well in order to evolve a simple application-oriented
userspace DMA interface into a complex guest-driver-oriented passthrough
interface. Doing so would incur far more kernel-side changes than the
portion of emulation code that you've been concerned about...
 
> 
> I don't think VFIO should be the only entry point to
> virtualization. If we say the universe of devices doing user space DMA
> must also implement a VFIO mdev to plug into virtualization then it
> will be a lot of mdevs.

Certainly VFIO will not be the only entry point, and this has to be a
case-by-case decision. If a userspace DMA interface can be easily
adapted to be a passthrough one, it might be the choice. But for idxd,
we see mdev as a much better fit here, given the big difference between
what userspace DMA requires and what guest driver requires in this hw.

> 
> I would prefer to see that the existing userspace interface have the
> extra needed bits for virtualization (eg by having appropriate
> internal kernel APIs to make this easy) and all the emulation to build
> the synthetic PCI device be done in userspace.

In the end what decides the direction is the amount of change that
we have to put in the kernel, not whether we call it 'emulation'. For idxd,
adding the special passthrough requirements (guest SVA, dirty tracking,
etc.) and raw controllability to the simple userspace DMA interface
certainly makes the kernel more complex than reusing the mdev
framework (plus some degree of emulation mock-up behind it). Not to
mention the merit of uAPI compatibility with mdev...

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-07 Thread Jason Gunthorpe
On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:

> If you see this as an abuse of the framework, then let's identify those
> specific issues and come up with a better approach.  As we've discussed
> before, things like basic PCI config space emulation are acceptable
> overhead and low risk (imo) and some degree of register emulation is
> well within the territory of an mdev driver.  

What troubles me is that idxd already has a direct userspace interface
to its HW, and does userspace DMA. The purpose of this mdev is to
provide a second direct userspace interface that is a little different
and trivially plugs into the virtualization stack.

I don't think VFIO should be the only entry point to
virtualization. If we say the universe of devices doing user space DMA
must also implement a VFIO mdev to plug into virtualization then it
will be a lot of mdevs.

I would prefer to see that the existing userspace interface have the
extra needed bits for virtualization (eg by having appropriate
internal kernel APIs to make this easy) and all the emulation to build
the synthetic PCI device be done in userspace.

Not only is it better for security, it keeps things to one device
driver per device..

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-08-05 Thread Alex Williamson
On Thu, 23 Jul 2020 21:19:30 -0300
Jason Gunthorpe  wrote:

> On Tue, Jul 21, 2020 at 11:54:49PM +, Tian, Kevin wrote:
> > In a nutshell, applications don't require raw WQ controllability as guest
> > kernel drivers may expect. Extending DSA user space interface to be another
> > passthrough interface just for virtualization needs is less compelling than
> > leveraging established VFIO/mdev framework (with the major merit that
> > existing user space VMMs just work w/o any change as long as they already
> > support VFIO uAPI).  
> 
> Sure, but the above is how the cover letter should have summarized
> that discussion, not as "it is not much code difference"
> 
> > In the last review you said that you didn't hard-NAK this approach and
> > would like to hear opinions from the virtualization guys. In this version
> > we CCed the KVM mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel
> > (Rust-VMM/Cloud hypervisor), etc. Let's see how they feel about this
> > approach.  
> 
> Yes, the VFIO community should decide.
> 
> If we are doing emulation tasks in the kernel now, then I can think of
> several nice semi-emulated mdevs to propose.
> 
> This will not be some one off, but the start of a widely copied
> pattern.

And that's definitely a concern, there should be a reason for
implementing device emulation in the kernel beyond an easy path to get
a device exposed up through a virtualization stack.  The entire idea of
mdev is the mediation of access to a device to make it safe for a user
and to fit within the vfio device API.  Mediation, emulation, and
virtualization can be hard to differentiate, and there is some degree of
emulation required to fill out the device API, for vfio-pci itself
included.  So I struggle with a specific measure of where to draw the
line, and also whose authority it is to draw that line.  I don't think
it's solely mine, that's something we need to decide as a community.

If you see this as an abuse of the framework, then let's identify those
specific issues and come up with a better approach.  As we've discussed
before, things like basic PCI config space emulation are acceptable
overhead and low risk (imo) and some degree of register emulation is
well within the territory of an mdev driver.  Drivers are accepting
some degree of increased attack surface by each addition of a uAPI and
the complexity of those uAPIs, but it seems largely a decision for
those drivers whether they're willing to take on that responsibility
and burden.

At some point, possibly in the near-ish future, we might have a
vfio-user interface with userspace vfio-over-socket servers that might
be able to consume existing uAPIs and offload some of this complexity
and emulation to userspace while still providing an easy path to insert
devices into the virtualization stack.  Hopefully if/when that comes
along, it would provide these sorts of drivers an opportunity to
offload some of the current overhead out to userspace, but I'm not sure
it's worth denying a mainline implementation now.  Thanks,

Alex



Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-23 Thread Jason Gunthorpe
On Tue, Jul 21, 2020 at 11:54:49PM +, Tian, Kevin wrote:
> In a nutshell, applications don't require raw WQ controllability as guest
> kernel drivers may expect. Extending DSA user space interface to be another
> passthrough interface just for virtualization needs is less compelling than
> leveraging established VFIO/mdev framework (with the major merit that
> existing user space VMMs just work w/o any change as long as they already
> support VFIO uAPI).

Sure, but the above is how the cover letter should have summarized
that discussion, not as "it is not much code difference"

> In the last review you said that you didn't hard-NAK this approach and
> would like to hear opinions from the virtualization guys. In this version
> we CCed the KVM mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel
> (Rust-VMM/Cloud hypervisor), etc. Let's see how they feel about this
> approach.

Yes, the VFIO community should decide.

If we are doing emulation tasks in the kernel now, then I can think of
several nice semi-emulated mdevs to propose.

This will not be some one off, but the start of a widely copied
pattern.

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2020 at 10:31:28AM -0700, Dey, Megha wrote:
> > > I didn't notice any of this in the patch series? What is the calling
> > > context for the platform_msi_ops ? I think I already mentioned that
> > > ideally we'd need blocking/sleeping. The big selling point is that IMS
> > > allows this data to move off-chip, which means accessing it is no
> > > longer just an atomic write to some on-chip memory.
> > > 
> > > These details should be documented in the comment on top of
> > > platform_msi_ops
> 
> So the platform_msi_ops are called from the same context as the existing
> msi_ops, for instance; we are not adding anything new. I think the above
> comment is a little misleading, I will remove it next time around.

If it is true that all calls are under driver control then I think it
would be good to document that. I actually don't know offhand if
mask/unmask are restricted like that.

As this is an op a driver has to implement, vs. the arch, it probably
needs a bit more hand-holding.

> Also, I thought even the current write to on-chip memory is not
> atomic..

The writel to the MSI-X table in MMIO memory is 'atomic'

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-22 Thread Dey, Megha




On 7/21/2020 11:00 AM, Dave Jiang wrote:



On 7/21/2020 9:45 AM, Jason Gunthorpe wrote:

On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:

v2:
IMS (now dev-msi):
With recommendations from Jason/Thomas/Dan on making IMS more generic:
Pass a non-pci generic device (struct device) for IMS management instead of mdev

Remove all references to mdev and symbol_get/put
Remove all references to IMS in common code and replace with dev-msi
Remove dynamic allocation of platform-msi interrupts: no groups, no new msi list or list helpers
Create a generic dev-msi domain with and without interrupt remapping 
enabled.

Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis


I didn't dig into the details of irq handling to really check this,
but the big picture of this is much more in line with what I would
expect for this kind of ability.


Link to previous discussions with Jason:
https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0...@intel.com/ 

The emulation part that can be moved to user space is very small, due to
the majority of the emulations being control bits that need to reside in
the kernel. We can revisit the necessity of moving the small emulation
part to userspace, and the required architectural changes, at a later time.


The point here is that you already have a user space interface for
these queues that already has kernel support to twiddle the control
bits. Generally I'd expect extending that existing kernel code to do
the small bit more needed for mapping the queue through to PCI
emulation to be smaller than the 2kloc of new code here to put all the
emulation and support framework in the kernel, and exposes a lower
attack surface of kernel code to the guest.


The kernel can specify the requirements for these callback functions
(e.g., the driver is not expected to block, or not expected to take
a lock in the callback function).


I didn't notice any of this in the patch series? What is the calling
context for the platform_msi_ops ? I think I already mentioned that
ideally we'd need blocking/sleeping. The big selling point is that IMS
allows this data to move off-chip, which means accessing it is no
longer just an atomic write to some on-chip memory.

These details should be documented in the comment on top of
platform_msi_ops


So the platform_msi_ops are called from the same context as the existing
msi_ops, for instance; we are not adding anything new. I think the above
comment is a little misleading, I will remove it next time around.


Also, I thought even the current write to on-chip memory is not atomic.. 
could you let me know which piece of code you are referring to?
Since the driver gets to write to the off-chip memory, shouldn't it be
the driver's responsibility to call it from a sleeping/blocking context?




I'm actually a little confused how idxd_ims_irq_mask() manages this -
I thought IRQ masking should be synchronous, shouldn't there at least
be a flushing read to ensure that new MSI's are stopped and any in
flight are flushed to the APIC?


You are right Jason. It's missing a flushing read.
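
Something along these lines should address it (sketch only, not the
actual patch code; ims_slot_ctrl() and IMS_CTRL_VECTOR_MASKBIT are
placeholder names for the per-vector IMS control register lookup and
its mask bit):

#include <linux/io.h>
#include <linux/irq.h>

static void idxd_ims_irq_mask(struct irq_data *data)
{
	void __iomem *ctrl = ims_slot_ctrl(data);	/* placeholder helper */
	u32 val = readl(ctrl);

	writel(val | IMS_CTRL_VECTOR_MASKBIT, ctrl);

	/*
	 * Flushing read: make sure the posted write has reached the device
	 * before returning, so no new MSIs are raised once we're back.
	 */
	readl(ctrl);
}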



Jason




RE: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, July 22, 2020 12:45 AM
> 
> > Link to previous discussions with Jason:
> > https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-
> 46d823ca0...@intel.com/
> > The emulation part that can be moved to user space is very small, due to
> > the majority of the emulations being control bits that need to reside in
> > the kernel. We can revisit the necessity of moving the small emulation
> > part to userspace, and the required architectural changes, at a later time.
> 
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
> 

We discussed in v1 why extending the user space interface is not strongly
motivated at the current stage:

https://lore.kernel.org/lkml/20200513124016.gg19...@mellanox.com/

In a nutshell, applications don't require raw WQ controllability as guest
kernel drivers may expect. Extending DSA user space interface to be another
passthrough interface just for virtualization needs is less compelling than
leveraging established VFIO/mdev framework (with the major merit that
existing user space VMMs just work w/o any change as long as they already
support VFIO uAPI).

And in this version we split the previous 2kloc mdev patch into three parts:
[09] mdev framework and callbacks; [10] mmio/pci_cfg emulation; and
[11] handling of control commands. Only patch [10] is purely about
emulation (~500 LOC), while the other two parts are tightly coupled to
physical resource management.

In the last review you said that you didn't hard-NAK this approach and would
like to hear opinions from the virtualization guys. In this version we CCed
the KVM mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
hypervisor), etc. Let's see how they feel about this approach.

Thanks
Kevin


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Dan Williams
On Tue, Jul 21, 2020 at 9:29 AM Greg KH  wrote:
>
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> > v2:
>
> "RFC" to me means "I don't really think this is mergable, so I'm
> throwing it out there."  Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(

There's a full-blown Reviewed-by from me on the irq changes. The VFIO /
mdev changes looked ok to me, but I did not feel comfortable / did not
have time to sign off on them. At the same time I did not see much to
be gained by keeping those internal. So "RFC" in this case is a bit
modest. It's more that an internal reviewer said this looks like it is
going in the right direction but wants more community discussion on the
approach.

> So, back-of-the-queue you go...

Let's consider this not RFC in that context. The drivers/base/ pieces
have my review for you, the rest are dmaengine and vfio subsystem
concerns that could use some commentary.


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Dave Jiang




On 7/21/2020 9:45 AM, Jason Gunthorpe wrote:

On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:

v2:
IMS (now dev-msi):
With recommendations from Jason/Thomas/Dan on making IMS more generic:
Pass a non-pci generic device (struct device) for IMS management instead of mdev
Remove all references to mdev and symbol_get/put
Remove all references to IMS in common code and replace with dev-msi
Remove dynamic allocation of platform-msi interrupts: no groups, no new msi list or list helpers
Create a generic dev-msi domain with and without interrupt remapping enabled.
Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis


I didn't dig into the details of irq handling to really check this,
but the big picture of this is much more in line with what I would
expect for this kind of ability.


Link to previous discussions with Jason:
https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0...@intel.com/
The emulation part that can be moved to user space is very small, due to
the majority of the emulations being control bits that need to reside in
the kernel. We can revisit the necessity of moving the small emulation
part to userspace, and the required architectural changes, at a later time.


The point here is that you already have a user space interface for
these queues that already has kernel support to twiddle the control
bits. Generally I'd expect extending that existing kernel code to do
the small bit more needed for mapping the queue through to PCI
emulation to be smaller than the 2kloc of new code here to put all the
emulation and support framework in the kernel, and exposes a lower
attack surface of kernel code to the guest.


The kernel can specify the requirements for these callback functions
(e.g., the driver is not expected to block, or not expected to take
a lock in the callback function).


I didn't notice any of this in the patch series? What is the calling
context for the platform_msi_ops ? I think I already mentioned that
ideally we'd need blocking/sleeping. The big selling point is that IMS
allows this data to move off-chip, which means accessing it is no
longer just an atomic write to some on-chip memory.

These details should be documented in the comment on top of
platform_msi_ops

I'm actually a little confused how idxd_ims_irq_mask() manages this -
I thought IRQ masking should be synchronous, shouldn't there at least be a
flushing read to ensure that new MSI's are stopped and any in flight
are flushed to the APIC?


You are right Jason. It's missing a flushing read.



Jason



Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Dave Jiang




On 7/21/2020 9:28 AM, Greg KH wrote:

On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:

v2:


"RFC" to me means "I don't really think this is mergable, so I'm
throwing it out there."  Which implies you know it needs more work
before others should review it as you are not comfortable with it :(

So, back-of-the-queue you go...

greg k-h



Hi Greg! Yes this absolutely needs more work! I think it's in pretty good shape, 
but it has reached the point where it needs the talented eyes of reviewers from 
outside of Intel. I was really hoping to get feedback from folks like Jason 
(Thanks Jason!!) and KVM and VFIO experts like Alex, Paolo, Eric, and Kirti.


I understand that you are quite busy and cannot necessarily provide a
detailed review at this phase. Would you prefer to be cc'd on code at this
phase in the future? Or should we reserve putting you on the cc for times
when we know it's ready for merge?


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Jason Gunthorpe
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:
> IMS (now dev-msi):
> With recommendations from Jason/Thomas/Dan on making IMS more generic:
> Pass a non-pci generic device (struct device) for IMS management instead of mdev
> Remove all references to mdev and symbol_get/put
> Remove all references to IMS in common code and replace with dev-msi
> Remove dynamic allocation of platform-msi interrupts: no groups, no new msi list or list helpers
> Create a generic dev-msi domain with and without interrupt remapping enabled.
> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis

I didn't dig into the details of irq handling to really check this,
but the big picture of this is much more in line with what I would
expect for this kind of ability.

> Link to previous discussions with Jason:
> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0...@intel.com/
> The emulation part that can be moved to user space is very small, due to
> the majority of the emulations being control bits that need to reside in
> the kernel. We can revisit the necessity of moving the small emulation
> part to userspace, and the required architectural changes, at a later time.

The point here is that you already have a user space interface for
these queues that already has kernel support to twiddle the control
bits. Generally I'd expect extending that existing kernel code to do
the small bit more needed for mapping the queue through to PCI
emulation to be smaller than the 2kloc of new code here to put all the
emulation and support framework in the kernel, and exposes a lower
attack surface of kernel code to the guest.

> The kernel can specify the requirements for these callback functions
> (e.g., the driver is not expected to block, or not expected to take
> a lock in the callback function).

I didn't notice any of this in the patch series? What is the calling
context for the platform_msi_ops ? I think I already mentioned that
ideally we'd need blocking/sleeping. The big selling point is that IMS
allows this data to move off-chip, which means accessing it is no
longer just an atomic write to some on-chip memory.

These details should be documented in the comment on top of
platform_msi_ops
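
To make the ask concrete, a sketch of the kind of comment meant here
(the member set comes from this discussion; the exact signatures and
wording are assumed, not copied from the patches):

/**
 * struct platform_msi_ops - device-specific (IMS) interrupt storage ops
 * @irq_mask:	mask the vector in device storage; called from the irq chip
 *		path, so it must not sleep, and it must not return until the
 *		masking is observable by the device (flushing read if posted)
 * @irq_unmask:	unmask the vector; same calling-context rules as @irq_mask
 * @write_msg:	program address/data into device storage; if the storage can
 *		live off-chip (e.g. in memory managed by the driver), spell
 *		out whether sleeping/blocking is allowed here
 */
struct platform_msi_ops {
	void (*irq_mask)(struct msi_desc *desc);
	void (*irq_unmask)(struct msi_desc *desc);
	void (*write_msg)(struct msi_desc *desc, struct msi_msg *msg);
};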

I'm actually a little confused how idxd_ims_irq_mask() manages this -
I thought IRQ masking should be synchronous, shouldn't there at least be a
flushing read to ensure that new MSI's are stopped and any in flight
are flushed to the APIC?

Jason


Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-07-21 Thread Greg KH
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:

"RFC" to me means "I don't really think this is mergable, so I'm
throwing it out there."  Which implies you know it needs more work
before others should review it as you are not comfortable with it :(

So, back-of-the-queue you go...

greg k-h