Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-17 Thread Jacob Pan (Jun)
Hi Jason,
On Thu, 17 Sep 2020 11:53:49 +0800, Jason Wang 
wrote:

> On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote:
> > Hi Jason,
> > On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe 
> > wrote:
> >  
> >> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:  
> >>> Hi Jason,
> >>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe
> >>>  wrote:
> >>>  
> >>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:  
> >>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe
> >>>>> wrote:  
> >>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun)
> >>>>>> wrote:  
> >>>>>>>> If user space wants to bind page tables, create the PASID
> >>>>>>>> with /dev/sva, use ioctls there to setup the page table
> >>>>>>>> the way it wants, then pass the now configured PASID to a
> >>>>>>>> driver that can use it.  
> >>>>>>> Are we talking about bare metal SVA?  
> >>>>>> What a weird term.  
> >>>>> Glad you noticed it at v7 :-)
> >>>>>
> >>>>> Any suggestions on something less weird than
> >>>>> Shared Virtual Addressing? There is a reason why we moved from
> >>>>> SVM to SVA.  
> >>>> SVA is fine, what is "bare metal" supposed to mean?
> >>>>  
> >>> What I meant here is sharing a virtual address space between DMA
> >>> and a host process. This requires devices to perform DMA requests
> >>> with a PASID and use the IOMMU first-level/stage-1 page tables.
> >>> This can be further divided into 1) user SVA and 2) supervisor SVA
> >>> (sharing init_mm).
> >>>
> >>> My point is that /dev/sva is not useful here since the driver can
> >>> perform PASID allocation while doing SVA bind.  
> >> No, you are thinking too small.
> >>
> >> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the
> >> SVA. 
> > Could you point me to the SVA uAPI? I couldn't find it in
> > mainline. Seems VDPA uses the VHOST interface?
> 
> 
> It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h.
> 
Thanks for the pointer. For complete vSVA functionality we would need:
1. TLB flush (IOTLB, PASID cache, etc.)
2. PASID alloc/free
3. Bind/unbind page tables or PASID tables
4. Page request service

Seems vhost_iotlb_msg can partially cover #1. And the proposal is to
pluck out the rest into /dev/sva? That seems awkward, as Alex pointed
out earlier for a similar situation in VFIO.
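To make the comparison concrete, below is a sketch of what a /dev/sva
ioctl surface covering those four operations might look like. This is
purely illustrative: none of these names, structs, or ioctl numbers
exist in mainline.

/* Hypothetical /dev/sva uAPI sketch -- illustrative only. */
#include <linux/ioctl.h>
#include <linux/types.h>

struct sva_pasid_alloc {
	__u32 min;	/* lowest acceptable PASID */
	__u32 max;	/* highest acceptable PASID */
	__u32 pasid;	/* out: allocated PASID */
};

struct sva_bind_table {
	__u32 pasid;
	__u64 pgtbl_ptr;	/* guest page-table or PASID-table pointer */
	__u32 format;		/* vendor-specific table format */
};

struct sva_cache_inv {
	__u32 pasid;
	__u64 addr;
	__u64 size;
	__u32 granularity;	/* IOTLB, PASID cache, device TLB, ... */
};

#define SVA_IOC_PASID_ALLOC	_IOWR('S', 0, struct sva_pasid_alloc)
#define SVA_IOC_PASID_FREE	_IOW('S', 1, __u32)
#define SVA_IOC_BIND_TABLE	_IOW('S', 2, struct sva_bind_table)
#define SVA_IOC_UNBIND_TABLE	_IOW('S', 3, __u32)
#define SVA_IOC_CACHE_INV	_IOW('S', 4, struct sva_cache_inv)
/* #4, page request service, would more likely be an eventfd/queue. */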

> 
> >  
> >> When VDPA is used by DPDK it makes sense that the PASID will be SVA
> >> and 1:1 with the mm_struct.
> >>  
> > I still don't see why bare metal DPDK needs to get a handle to the
> > PASID.
> 
> 
> My understanding is that it may:
> 
> - have a unified uAPI with vSVA: alloc, bind, unbind, free
Got your point, but vSVA needs more than these.

> - leave the binding policy to userspace instead of using an implied
> one in the kernel
> 
Only if necessary.

> 
> > Perhaps the SVA patch would explain. Or are you talking about
> > vDPA DPDK process that is used to support virtio-net-pmd in the
> > guest? 
> >> When VDPA is used by qemu it makes sense that the PASID will be an
> >> arbitrary IOVA map constructed to be 1:1 with the guest vCPU
> >> physical map. /dev/sva allows a single uAPI to do this kind of
> >> setup, and qemu can support it while supporting a range of SVA
> >> kernel drivers. VDPA and vfio-mdev are obvious initial targets.
> >>
> >> *BOTH* are needed.
> >>
> >> In general any uAPI for PASID should have the option to use either
> >> the mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs
> >> virtually nothing to implement this in the driver as PASID is just
> >> a number, and gives so much more flexibility.
> >>  
> > Not really nothing in terms of PASID life cycle. For example, if
> > the user opens an accelerator via the uacce interface, it gets an
> > FD_acc. Then it opens /dev/sva to allocate a PASID and gets another
> > FD_pasid. Then we pass FD_pasid to the driver to bind page tables,
> > perhaps multiple drivers. Now we have to worry about FD_pasid
> > getting closed before the FD_acc(s) are closed, and all the race
> > conditions that follow.
> 
> 
> I'm not sure I understand this. But this demonstrates the flexibility
> of a unified uAPI. E.g. it allows vDPA and VFIO device

Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-16 Thread Jacob Pan (Jun)
Hi Jason,
On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe 
wrote:

> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:
> > Hi Jason,
> > On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe 
> > wrote:
> >   
> > > On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:  
> > > > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe
> > > > wrote:    
> > > > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun)
> > > > > wrote:
> > > > > > > If user space wants to bind page tables, create the PASID
> > > > > > > with /dev/sva, use ioctls there to setup the page table
> > > > > > > the way it wants, then pass the now configured PASID to a
> > > > > > > driver that can use it. 
> > > > > > 
> > > > > > Are we talking about bare metal SVA? 
> > > > > 
> > > > > What a weird term.
> > > > 
> > > > Glad you noticed it at v7 :-) 
> > > > 
> > > > Any suggestions on something less weird than 
> > > > Shared Virtual Addressing? There is a reason why we moved from
> > > > SVM to SVA.
> > > 
> > > SVA is fine, what is "bare metal" supposed to mean?
> > >   
> > What I meant here is sharing a virtual address space between DMA
> > and a host process. This requires devices to perform DMA requests
> > with a PASID and use the IOMMU first-level/stage-1 page tables.
> > This can be further divided into 1) user SVA and 2) supervisor SVA
> > (sharing init_mm).
> > 
> > My point is that /dev/sva is not useful here since the driver can
> > perform PASID allocation while doing SVA bind.  
> 
> No, you are thinking too small.
> 
> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA.
> 
Could you point me to the SVA uAPI? I couldn't find it in mainline.
Seems VDPA uses the VHOST interface?

> When VDPA is used by DPDK it makes sense that the PASID will be SVA
> and 1:1 with the mm_struct.
> 
I still don't see why bare metal DPDK needs to get a handle to the
PASID. Perhaps the SVA patch would explain. Or are you talking about
the vDPA DPDK process that is used to support virtio-net-pmd in the
guest?

> When VDPA is used by qemu it makes sense that the PASID will be an
> arbitrary IOVA map constructed to be 1:1 with the guest vCPU physical
> map. /dev/sva allows a single uAPI to do this kind of setup, and qemu
> can support it while supporting a range of SVA kernel drivers. VDPA
> and vfio-mdev are obvious initial targets.
> 
> *BOTH* are needed.
> 
> In general any uAPI for PASID should have the option to use either the
> mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually
> nothing to implement this in the driver as PASID is just a number, and
> gives so much more flexibility.
> 
Not really nothing in terms of PASID life cycle. For example, if the
user opens an accelerator via the uacce interface, it gets an FD_acc.
Then it opens /dev/sva to allocate a PASID and gets another FD_pasid.
Then we pass FD_pasid to the driver to bind page tables, perhaps
multiple drivers. Now we have to worry about FD_pasid getting closed
before the FD_acc(s) are closed, and all the race conditions that
follow.

If we do not expose FD_pasid to the user, the teardown is much simpler
and more streamlined: following each FD_acc close, a PASID unbind is
performed.
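As a sketch of that streamlined model (all names illustrative, not an
existing driver; iommu_sva_unbind_device() is the real kernel API):

struct acc_ctx {
	struct iommu_sva *sva;	/* handle from iommu_sva_bind_device() */
};

static int acc_release(struct inode *inode, struct file *filp)
{
	struct acc_ctx *ctx = filp->private_data;

	acc_stop_dma(ctx);		/* quiesce the device queues */
	acc_drain_page_requests(ctx);	/* no PRQ may still reference the PASID */
	iommu_sva_unbind_device(ctx->sva); /* unbind; the PASID goes away with it */
	kfree(ctx);
	return 0;
}

With this shape there is no FD_pasid that can outlive or predecease
FD_acc; the PASID lifetime is exactly the FD_acc lifetime.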

> > Yi can correct me but this set is about VFIO-PCI; VFIO-mdev will
> > be introduced later.
> 
> Last patch is:
> 
>   vfio/type1: Add vSVA support for IOMMU-backed mdevs
> 
> So pretty hard to see how this is not about vfio-mdev, at least a
> little..
> 
> Jason


Thanks,

Jacob


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-16 Thread Jacob Pan (Jun)
Hi Jason,
On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe 
wrote:

> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:
> > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote:  
> > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:  
> > > > > If user space wants to bind page tables, create the PASID with
> > > > > /dev/sva, use ioctls there to setup the page table the way it
> > > > > wants, then pass the now configured PASID to a driver that
> > > > > can use it.   
> > > > 
> > > > Are we talking about bare metal SVA?   
> > > 
> > > What a weird term.  
> > 
> > Glad you noticed it at v7 :-) 
> > 
> > Any suggestions on something less weird than 
> > Shared Virtual Addressing? There is a reason why we moved from SVM
> > to SVA.  
> 
> SVA is fine, what is "bare metal" supposed to mean?
> 
What I meant here is sharing a virtual address space between DMA and a
host process. This requires devices to perform DMA requests with a
PASID and use the IOMMU first-level/stage-1 page tables.
This can be further divided into 1) user SVA and 2) supervisor SVA
(sharing init_mm).

My point is that /dev/sva is not useful here since the driver can
perform PASID allocation while doing SVA bind.
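A minimal sketch of that one-step flow, using the kernel SVA API as it
looks in this timeframe (iommu_sva_bind_device() and
iommu_sva_get_pasid() are real; the surrounding driver flow is
illustrative):

	struct iommu_sva *handle;
	int pasid;

	/* Bind the current process mm; the PASID is allocated inside. */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	pasid = iommu_sva_get_pasid(handle);
	/* Program the PASID into the device; user space never sees it. */

User space only observes "my mm is now bound to the device"; the PASID
stays a driver-internal detail.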

> PASID is about constructing an arbitary DMA IOVA map for PCI-E
> devices, being able to intercept device DMA faults, etc.
> 
An arbitrary IOVA map does not need a PASID. With IOVA, you map/unmap
explicitly, so why would you need to handle IO page faults?

To me, a PASID identifies an address space that is associated with a
mm_struct.

> SVA is doing DMA IOVA 1:1 with the mm_struct CPU VA. DMA faults
> trigger the same thing as CPU page faults. If is it not 1:1 then there
> is no "shared". When SVA is done using PCI-E PASID it is "PASID for
> SVA". Lots of existing devices already have SVA without PASID or
> IOMMU, so lets not muddy the terminology.
> 
I agree. This conversation is about "PASID for SVA", not "SVA without
PASID".


> vPASID/vIOMMU is allowing a guest to control the DMA IOVA map and
> manipulate the PASIDs.
> 
> vSVA is when a guest uses a vPASID to provide SVA, not sure this is
> an informative term.
> 
I agree.

> This particular patch series seems to be about vPASID/vIOMMU for
> vfio-mdev vs the other vPASID/vIOMMU patch which was about vPASID for
> vfio-pci.
> 
Yi can correct me but this set is about VFIO-PCI; VFIO-mdev will be
introduced later.

> > > > If so, I don't see the need for userspace to know there is a
> > > > PASID. All user space needs is that its current mm is bound to
> > > > a device by the driver. So it can be a one-step process for the
> > > > user instead of two.
> > > 
> > > You've missed the entire point of the conversation, VDPA already
> > > needs more than "my current mm is bound to a device"  
> > 
> > You mean current version of vDPA? or a potential future version of
> > vDPA?  
> 
> Future VDPA drivers, it was made clear this was important to Intel
> during the argument about VDPA as a mdev.
> 
> Jason


Thanks,

Jacob


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-15 Thread Jacob Pan (Jun)
Hi Jason,
On Tue, 15 Sep 2020 20:51:26 -0300, Jason Gunthorpe 
wrote:

> On Tue, Sep 15, 2020 at 03:08:51PM -0700, Jacob Pan wrote:
> > > A PASID vIOMMU solution sharable with VDPA and VFIO, based on a
> > > PASID control char dev (eg /dev/sva, or maybe /dev/iommu) seems
> > > like a reasonable starting point for discussion.  
> > 
> > I am not sure what can really be consolidated in /dev/sva.   
> 
> More or less, everything in this patch. All the manipulations of PASID
> that are required for vIOMMU use case/etc. Basically all PASID control
> that is not just a 1:1 mapping of the mm_struct.
> 
> > will have their own kernel-user interfaces anyway for their usage
> > models. They are just providing the specific transport while
> > sharing generic IOMMU UAPIs and IOASID management.  
> 
> > As I mentioned, PASID management is already consolidated in the
> > IOASID layer, so for VDPA or other users it is just a matter of
> > creating its own ioasid_set and doing allocation.
> 
> Creating the PASID is not the problem, managing what the PASID maps to
> is the issue. That is all uAPI that we don't really have today.
> 
> > IOASID is also available to the in-kernel users which does not
> > need /dev/sva AFAICT. For bare metal SVA, I don't see a need to
> > create this 'floating' state of the PASID when created by /dev/sva.
> > PASID allocation could happen behind the scene when users need to
> > bind page tables to a device DMA stream.  
> 
> My point is I would like to see one set of uAPI ioctls to bind page
> tables. I don't want to have VFIO, VDPA, etc, etc uAPIs to do the
> exact same things only slightly differently.
> 
Got your point. I am not familiar with VDPA, but the VFIO uAPI is very
thin, mostly passing IOMMU uAPI structs through as opaque data.
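Roughly the shape of that thin layer (iommu_sva_bind_gpasid() and
struct iommu_gpasid_bind_data are from the IOMMU uAPI of this era; the
VFIO wrapper shown is a simplified, illustrative sketch with the ioctl
plumbing omitted):

static int vfio_iommu_bind_gpasid(struct iommu_domain *domain,
				  struct device *dev, void __user *uarg)
{
	struct iommu_gpasid_bind_data data;

	if (copy_from_user(&data, uarg, sizeof(data)))
		return -EFAULT;

	/* VFIO adds no policy of its own; the IOMMU driver interprets it. */
	return iommu_sva_bind_gpasid(domain, dev, &data);
}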

> If user space wants to bind page tables, create the PASID with
> /dev/sva, use ioctls there to setup the page table the way it wants,
> then pass the now configured PASID to a driver that can use it. 
> 
Are we talking about bare metal SVA? If so, I don't see the need for
userspace to know there is a PASID. All user space needs is that its
current mm is bound to a device by the driver. So it can be a one-step
process for the user instead of two.

> Driver does not do page table binding. Do not duplicate all the
> control plane uAPI in every driver.
> 
> PASID management and binding is separated from the driver(s) that are
> using the PASID.
> 
Why separate? Drivers need to be involved in PASID life cycle
management. For example, when tearing down a PASID, the driver needs to
stop DMA and the IOMMU driver needs to unbind, etc. If the driver is
the control point, these steps happen in the right order. I am
referring to bare metal SVA.

For guest SVA, I agree that binding is separate from PASID allocation.
Could you review this doc. in terms of life cycle?
https://lkml.org/lkml/2020/8/22/13

My point is that /dev/sva has no value for bare metal SVA; we are just
talking about whether guest SVA uAPIs can be consolidated. Or am I
missing something?

> Jason


Thanks,

Jacob


Re: [PATCH 5/7] x86/mmu: Allocate/free PASID

2020-04-28 Thread Jacob Pan (Jun)
On Tue, 28 Apr 2020 13:59:43 -0700
"Luck, Tony"  wrote:

> >> So the driver needs to use flush/drain operations to make sure all
> >> the in-flight work has completed before releasing/re-using the
> >> PASID. 
> > Are you suggesting we should let the driver also hold a reference
> > to the PASID?
> 
> The sequence for bare metal is:
> 
>   process is queuing requests to DSA
>   process exits (either deliberately, or crashes, or is killed)
>   kernel does exit processing
>   DSA driver is called as part of tear down of "mm"
>   issues drain/flush commands to ensure that all
>   queued operations on the PASID for this mm have
>   completed
>   PASID can be freed
> 
> There's a 1:1 map from "mm" to PASID ... so reference counting seems
> like overkill. Once the kernel is in the "exit" path, we know that no
> more work can be queued using this PASID.
> 
There are two users of a PASID: the mm and the device driver (FD). If
either one is not done with the PASID, it cannot be reclaimed. As you
mentioned, it could take a long time for the driver to abort. If the
abort ends *after* the final mmdrop, we are in trouble.
If the driver drops its reference only after the PASID abort/drain is
done, then we are safe.
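An illustrative-only sketch of that rule (ioasid_free() and refcount_t
are real; the wrapper is hypothetical):

struct pasid_ref {
	refcount_t refs;	/* one ref for the mm, one for the driver FD */
	ioasid_t pasid;
};

static void pasid_ref_put(struct pasid_ref *p)
{
	/* Free only when both the mm and the driver are done. */
	if (refcount_dec_and_test(&p->refs))
		ioasid_free(p->pasid);
}

The mm teardown path and the driver's abort/drain path each call
pasid_ref_put(), so their relative ordering no longer matters.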


> -Tony



Re: [PATCH 5/7] x86/mmu: Allocate/free PASID

2020-04-28 Thread Jacob Pan (Jun)
On Tue, 28 Apr 2020 12:07:25 -0700
"Luck, Tony"  wrote:

> > If fd release cleans up then how should there be something in
> > flight at the final mmdrop?  
> 
> ENQCMD from the user is only synchronous in that it lets the user
> know their request has been added to a queue (or not).  Execution of
> the request may happen later (if the device is busy working on
> requests for other users).  The request will take some time to
> complete. Someone told me the theoretical worst case once, which I've
> since forgotten, but it can be a long time.
> 
> So the driver needs to use flush/drain operations to make sure all
> the in-flight work has completed before releasing/re-using the PASID.
> 
Are you suggesting we should let the driver also hold a reference to
the PASID?

> -Tony



Re: [PATCH 5/7] x86/mmu: Allocate/free PASID

2020-04-28 Thread Jacob Pan (Jun)
On Tue, 28 Apr 2020 20:54:01 +0200
Thomas Gleixner  wrote:

> "Jacob Pan (Jun)"  writes:
> > On Sun, 26 Apr 2020 16:55:25 +0200
> > Thomas Gleixner  wrote:  
> >> Fenghua Yu  writes:  
> >> > The PASID is freed when the process exits (so no need to keep
> >> > reference counts on how many SVM devices are sharing the
> >> > PASID).
> >> 
> >> I'm not buying that. If there is an outstanding request with the
> >> PASID of a process then tearing down the process address space and
> >> freeing the PASID (which might be reused) is fundamentally broken.
> >>   
> > Device driver PASID unbind is tied to FD release. So when a process
> > exits, the FD close causes the driver to do the following:
> >
> > 1. stop DMA
> > 2. unbind the PASID (clear the PASID entry in the IOMMU, flush all
> > TLBs, drain in-flight page requests)
> 
> Fair enough. Explaining that somewhere might be helpful.
> 
Will do. I plan to document this in a kernel doc for IOASID/PASID
lifecycle management.

> > For bare metal SVM, if the last mmdrop always happens after FD
> > release, we can ensure no outstanding requests at the point of
> > ioasid_free(). Perhaps this is a wrong assumption?  
> 
> If fd release cleans up then how should there be something in flight
> at the final mmdrop?
> 
> > For guest SVM, there will be more users of a PASID. I am also
> > working on adding refcounting to ioasid. ioasid_free() will not
> > release the PASID back to the pool until all references are
> > dropped.  
> 
> What does more users mean?
For VT-d, a PASID can be used by VFIO, the IOMMU driver, KVM, and the
Virtual Device Composition Module (VDCM*) at the same time.

*https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification

There is HW context associated with the PASID in the IOMMU, KVM, and
VDCM. So before the lifetime of the PASID is over, cleanup must be done
in all of the above. The PASID cannot be reclaimed until the last user
drops its reference. Our plan is to use notification and refcounting.
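A sketch (assumption, not mainline code) of how one such user could
plug into the planned notification-plus-refcount scheme; the event
code, the KVM cleanup helper, and ioasid_put() are hypothetical names
from the proposal, not existing APIs:

static int kvm_pasid_notifier(struct notifier_block *nb,
			      unsigned long event, void *data)
{
	ioasid_t pasid = *(ioasid_t *)data;

	if (event == IOASID_NOTIFY_FREE) {
		kvm_clear_pasid_state(pasid);	/* tear down KVM's HW context */
		ioasid_put(pasid);		/* drop this user's reference */
	}
	return NOTIFY_OK;
}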


> 
> >> > +	if (mm && mm->context.pasid && !(flags & SVM_FLAG_PRIVATE_PASID)) {
> >> > +		/*
> >> > +		 * Once a PASID is allocated for this mm, the PASID
> >> > +		 * stays with the mm until the mm is dropped. Reuse
> >> > +		 * the PASID which has been already allocated for the
> >> > +		 * mm instead of allocating a new one.
> >> > +		 */
> >> > +		ioasid_set_data(mm->context.pasid, svm);
> >> 
> >> So if the PASID is reused several times for different SVMs then
> >> every time ioasid_data->private is set to a different SVM. How is
> >> that supposed to work?
> >>   
> > For the lifetime of the mm, there is only one PASID.
> > svm_bind/unbind_mm could happen many times with different private
> > data: intel_svm. Multiple devices can bind to the same PASID as
> > well. But the private data does not change between the first bind
> > and the last unbind.
> 
> Ok. I read through that spaghetti of intel_svm_bind_mm() again and
> now I start to get an idea how that is supposed to work. What a mess.
> 
> That function really wants to be restructured in a way so it is
> understandable to mere mortals. 
> 

Agreed. We are adding many new features and converging with the generic
sva_bind_device. Things will get clearer once we have fewer moving
pieces.


> Thanks,
> 
> tglx



Re: [PATCH 5/7] x86/mmu: Allocate/free PASID

2020-04-28 Thread Jacob Pan (Jun)
On Sun, 26 Apr 2020 16:55:25 +0200
Thomas Gleixner  wrote:

> Fenghua Yu  writes:
> 
> > PASID is shared by all threads in a process. So the logical place
> > to keep track of it is in the "mm". Add the field to the
> > architecture specific mm_context_t structure.
> >
> > A PASID is allocated for an "mm" the first time any thread attaches
> > to an SVM capable device. Later device atatches (whether to the
> > same  
> 
> atatches?
> 
> > device or another SVM device) will re-use the same PASID.
> >
> > The PASID is freed when the process exits (so no need to keep
> > reference counts on how many SVM devices are sharing the PASID).  
> 
> I'm not buying that. If there is an outstanding request with the PASID
> of a process then tearing down the process address space and freeing
> the PASID (which might be reused) is fundamentally broken.
> 
Device driver PASID unbind is tied to FD release. So when a process
exits, the FD close causes the driver to do the following:
1. stop DMA
2. unbind the PASID (clear the PASID entry in the IOMMU, flush all
TLBs, drain in-flight page requests)

For bare metal SVM, if the last mmdrop always happens after FD release,
we can ensure no outstanding requests at the point of ioasid_free().
Perhaps this is a wrong assumption?

For guest SVM, there will be more users of a PASID. I am also
working on adding refcounting to ioasid. ioasid_free() will not release
the PASID back to the pool until all references are dropped.

> > +void __free_pasid(struct mm_struct *mm);
> > +
> >  #endif /* _ASM_X86_IOMMU_H */
> > diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> > index bdeae9291e5c..137bf51f19e6 100644
> > --- a/arch/x86/include/asm/mmu.h
> > +++ b/arch/x86/include/asm/mmu.h
> > @@ -50,6 +50,10 @@ typedef struct {
> > u16 pkey_allocation_map;
> > s16 execute_only_pkey;
> >  #endif
> > +
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +   int pasid;  
> 
> int? It's a value which gets programmed into the MSR along with the
> valid bit (bit 31) set. 
> 
> >  extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..da718a49e91e 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,45 @@ static LIST_HEAD(global_svm_list);
> > list_for_each_entry((sdev), &(svm)->devs, list) \
> > if ((d) != (sdev)->dev) {} else
> >  
> > +/*
> > + * If this mm already has a PASID we can use it. Otherwise allocate
> > + * a new one. Let the caller know if we did an allocation via
> > + * 'new_pasid'.
> > + */
> > +static int alloc_pasid(struct intel_svm *svm, struct mm_struct *mm,
> > +		       int pasid_max, bool *new_pasid, int flags)
> 
> Again, data types please. flags are generally unsigned and not plain
> int. Also pasid_max is certainly not plain int either.
> 
> > +{
> > +	int pasid;
> > +
> > +	/*
> > +	 * Reuse the PASID if the mm already has a PASID and not a
> > +	 * private PASID is requested.
> > +	 */
> > +	if (mm && mm->context.pasid && !(flags & SVM_FLAG_PRIVATE_PASID)) {
> > +		/*
> > +		 * Once a PASID is allocated for this mm, the PASID
> > +		 * stays with the mm until the mm is dropped. Reuse
> > +		 * the PASID which has been already allocated for the
> > +		 * mm instead of allocating a new one.
> > +		 */
> > +		ioasid_set_data(mm->context.pasid, svm);
> 
> So if the PASID is reused several times for different SVMs then every
> time ioasid_data->private is set to a different SVM. How is that
> supposed to work?
> 
For the lifetime of the mm, there is only one PASID. svm_bind/unbind_mm
could happen many times with different private data: intel_svm.
Multiple devices can bind to the same PASID as well. But the private
data does not change between the first bind and the last unbind.
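For reference, the pairing in question (ioasid_set_data() and
ioasid_find() are the real IOASID helpers; the usage shown is a
simplified sketch):

	/* First bind: attach the driver context to the mm's PASID. */
	ioasid_set_data(mm->context.pasid, svm);

	/* Later lookups return the same intel_svm until the last unbind. */
	svm = ioasid_find(NULL, pasid, NULL);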

> > +		*new_pasid = false;
> > +
> > +		return mm->context.pasid;
> > +	}
> > +
> > +	/*
> > +	 * Allocate a new pasid. Do not use PASID 0, reserved for RID to
> > +	 * PASID.
> > +	 */
> > +	pasid = ioasid_alloc(NULL, PASID_MIN, pasid_max - 1, svm);
> 
> ioasid_alloc() uses ioasid_t which is
> 
> typedef unsigned int ioasid_t;
> 
> Can we please have consistent types and behaviour all over the place?
> 
> > +	if (pasid == INVALID_IOASID)
> > +		return -ENOSPC;
> > +
> > +	*new_pasid = true;
> > +
> > +	return pasid;
> > +}
> > +
> >  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> >  		      struct svm_dev_ops *ops)
> >  {
> >  	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > @@ -324,6 +363,8 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
> >  	init_rcu_head(&svm->rcu);
> >  
> >  	if (!svm) {
> > +		bool new_pasid;
> > +
> >  		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
> >  		if (!svm) {
> > 

Re: [PATCH v2 1/3] iommu/virtio: Add topology description to virtio-iommu config space

2020-03-04 Thread Jacob Pan (Jun)
On Wed, 4 Mar 2020 18:40:46 +0100
Joerg Roedel  wrote:

> On Wed, Mar 04, 2020 at 04:38:21PM +0100, Jean-Philippe Brucker wrote:
> > I agree with this. The problem is I don't know how to get a new
> > ACPI table or change an existing one. It needs to go through the
> > UEFI forum in order to be accepted, and I don't have any weight
> > there. I've been trying to get the tiny change into IORT for ages.
> > I haven't been given any convincing reason against it or offered
> > any alternative, it's just stalled. The topology description
> > introduced here wasn't my first choice either but unless someone
> > can help finding another way into ACPI, I don't have a better
> > idea.  
> 
> A quote from the ACPI Specification (Version 6.3, Section 5.2.6,
> Page 119):
> 
>   Table signatures will be reserved by the ACPI promoters and
>   posted independently of this specification in ACPI errata and
>   clarification documents on the ACPI web site. Requests to
>   reserve a 4-byte alphanumeric table signature should be sent to
>   the email address i...@acpi.info and should include the purpose
>   of the table and reference URL to a document that describes the
>   table format. Tables defined outside of the ACPI specification
>   may define data value encodings in either little endian or big
>   endian format. For the purpose of clarity, external table
>   definition documents should include the endian-ness of their
>   data value encodings.
> 
> So it sounds like you need to specify the table format and send a
> request to i...@acpi.info to get a table signature for it.
> 
+Mike and Erik, who work closely on UEFI and ACPICA.

Copy-pasting Erik's initial response below on how to get a new table;
it seems to conform to the process you stated above.

"Fairly easy. You reserve a 4-letter symbol by sending a message
requesting to reserve the signature to Mike or the ASWG mailing list or
i...@acpi.info

There is also another option. You can have ASWG own this new table so
that not one entity or company owns the new table."

> Regards,
> 
>   Joerg



Re: [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-12-20 Thread Jacob Pan (Jun)
On Wed, 18 Dec 2019 12:20:44 +0100
Jean-Philippe Brucker  wrote:

> On Tue, Dec 03, 2019 at 07:01:36PM -0800, Jacob Pan (Jun) wrote:
> > Hi Jean,
> > 
> > Sorry for the delay, I was out last week. Comments inline below.
> > 
> > On Mon, 25 Nov 2019 19:02:47 +0100
> > Jean-Philippe Brucker  wrote:
> >   
> > > On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote:  
> > > > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for
> > > > > AMD and IORT for Arm). From my point of view IORT is easier to
> > > > > extend, since we just need to introduce a new node type. There
> > > > > are no dependencies to Arm in the Linux IORT driver, so it
> > > > > works well with CONFIG_X86.   
> > > > From my limited understanding, IORT and VIOT are meant to solve
> > > > device topology enumeration only? I am not sure how they can be
> > > > expanded to cover information beyond device topology, e.g. DMAR
> > > > has NUMA information and root port ATS; I guess they are not
> > > > used today in the guest but might be added in the future.
> > > 
> > > The PCI root-complex node of IORT has an ATS attribute, which we
> > > can already use. However its scope is the root complex, not
> > > individual root ports like with DMAR.
> > > 
> > > I'm not very familiar with NUMA, but it looks like we just need to
> > > specify a proximity domain in relation to the SRAT table, for each
> > > viommu? The SMMUv3 node in IORT has a 4-bytes "proximity domain"
> > > field for this. We can add the same to the VIOT virtio-iommu nodes
> > > later, since the structures are extensible.
> > >   
> > I think the proximity domain is more about each assigned device
> > than the vIOMMU. A vIOMMU in the guest can have assigned devices
> > belonging to different pIOMMUs and proximity domains. If the guest
> > owns the first-level page tables (gIOVA or SVA), we want to make
> > sure page tables are allocated from the closest proximity domain.
> > 
> > My understanding is that virtio-iommu supports both virtio devices
> > and assigned devices. We care less about the former in terms of
> > NUMA.
> > 
> > In ACPI, we have the _PXM method to retrieve a device's proximity
> > domain. I don't know if there is something equivalent or a generic
> > way to get _PXM information. I think the VMM also needs to make
> > sure that when an assigned device is used with a vIOMMU, some
> > memory is allocated from the device's proximity domain.
> >   
> > > But it might be better to keep the bare minimum information in
> > > the FW descriptor, and put the rest in the virtio-iommu. So yes
> > > topology enumeration is something the device cannot do itself
> > > (not fully that is, see (2)) but for the rest, virtio-iommu's
> > > PROBE request can provide details about each endpoint in relation
> > > to their physical IOMMU.
> > > 
> > > We could for example add a bit in a PROBE property saying that the
> > > whole path between the IOMMU and the endpoint supports ATS. For
> > > NUMA it might also be more interesting to have a finer
> > > granularity, since one viommu could be managing endpoints that
> > > are behind different physical IOMMUs. If in the future we want to
> > > allocate page tables close to the physical IOMMU for example, we
> > > might need to describe multiple NUMA nodes per viommu, using the
> > > PROBE request. 
> > Should we reinvent something for NUMA or use ACPI's SRAT, _PXM?   
> 
> Regardless whether we put it in the VIOT or in the virtio-iommu PROBE
> request, we necessarily need to reuse the node IDs defined by SRAT (or
> numa-node-id on devicetree, also a 32-bit value). A virtio-pci based
> virtio-iommu already has the _PXM of its closest bridge and wouldn't
> need anything more in the VIOT, while a virtio-mmio based
> virtio-iommu would need a proximity domain field in the VIOT. That
> could be added later since the table is extensible, but as you
> pointed out, that information might not be very useful.
> 
> > I am not sure how it is handled today in QEMU in terms of guest-host
> > NUMA proximity domain mapping.  
> 
> It looks like the user can specify this guest-host mapping on the
> command-line:
> 
>   -object memory-backend-ram,id=mem0,size=4G,host-nodes=3,policy=bind
>   -object memory-backend-ram,id=mem1,size=4G,host-nodes=4,policy=bind
>   -numa node,memdev=mem0,nodeid=numa0
>   -numa node,memdev=mem1,nodeid=numa1
>   -numa cpu,node-id=numa0,

Re: [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-12-03 Thread Jacob Pan (Jun)
Hi Jean,

Sorry for the delay, I was out last week. Comments inline below.

On Mon, 25 Nov 2019 19:02:47 +0100
Jean-Philippe Brucker  wrote:

> On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote:
> > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD
> > > and IORT for Arm). From my point of view IORT is easier to
> > > extend, since we just need to introduce a new node type. There
> > > are no dependencies to Arm in the Linux IORT driver, so it works
> > > well with CONFIG_X86. 
> > From my limited understanding, IORT and VIOT are meant to solve
> > device topology enumeration only? I am not sure how they can be
> > expanded to cover information beyond device topology, e.g. DMAR has
> > NUMA information and root port ATS; I guess they are not used today
> > in the guest but might be added in the future.
> 
> The PCI root-complex node of IORT has an ATS attribute, which we can
> already use. However its scope is the root complex, not individual
> root ports like with DMAR.
> 
> I'm not very familiar with NUMA, but it looks like we just need to
> specify a proximity domain in relation to the SRAT table, for each
> viommu? The SMMUv3 node in IORT has a 4-bytes "proximity domain"
> field for this. We can add the same to the VIOT virtio-iommu nodes
> later, since the structures are extensible.
> 
I think the proximity domain is more about each assigned device than
the vIOMMU. A vIOMMU in the guest can have assigned devices belonging
to different pIOMMUs and proximity domains. If the guest owns the
first-level page tables (gIOVA or SVA), we want to make sure page
tables are allocated from the closest proximity domain.

My understanding is that virtio-iommu supports both virtio devices and
assigned devices. We care less about the former in terms of NUMA.

In ACPI, we have the _PXM method to retrieve a device's proximity
domain. I don't know if there is something equivalent or a generic way
to get _PXM information. I think the VMM also needs to make sure that
when an assigned device is used with a vIOMMU, some memory is allocated
from the device's proximity domain.
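For the guest-side allocation, a sketch of the intended effect
(dev_to_node(), alloc_pages_node(), and friends are real kernel
helpers; the wrapper is illustrative):

static void *alloc_pgtbl_near(struct device *dev)
{
	/* On ACPI systems this reflects the device's _PXM. */
	int nid = dev_to_node(dev);
	struct page *page;

	if (nid == NUMA_NO_NODE)
		nid = numa_node_id();	/* no _PXM given; use the local node */

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO, 0);
	return page ? page_address(page) : NULL;
}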

> But it might be better to keep the bare minimum information in the FW
> descriptor, and put the rest in the virtio-iommu. So yes topology
> enumeration is something the device cannot do itself (not fully that
> is, see (2)) but for the rest, virtio-iommu's PROBE request can
> provide details about each endpoint in relation to their physical
> IOMMU.
> 
> We could for example add a bit in a PROBE property saying that the
> whole path between the IOMMU and the endpoint supports ATS. For NUMA
> it might also be more interesting to have a finer granularity, since
> one viommu could be managing endpoints that are behind different
> physical IOMMUs. If in the future we want to allocate page tables
> close to the physical IOMMU for example, we might need to describe
> multiple NUMA nodes per viommu, using the PROBE request.
> 
Should we reinvent something for NUMA or use ACPI's SRAT, _PXM? I am
not sure how it is handled today in QEMU in terms of guest-host NUMA
proximity domain mapping.

> Thanks,
> Jean



Re: [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-11-22 Thread Jacob Pan (Jun)
On Fri, 22 Nov 2019 11:49:47 +0100
Jean-Philippe Brucker  wrote:

> I'm seeking feedback on multi-platform support for virtio-iommu. At
> the moment only devicetree (DT) is supported and we don't have a
> pleasant solution for other platforms. Once we figure out the topology
> description, x86 support is trivial.
> 
> Since the IOMMU manages memory accesses from other devices, the guest
> kernel needs to initialize the IOMMU before endpoints start issuing
> DMA. It's a solved problem: firmware or hypervisor describes through
> DT or ACPI tables the device dependencies, and probe of endpoints is
> deferred until the IOMMU is probed. But:
> 
> (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD and
> IORT for Arm). From my point of view IORT is easier to extend, since
> we just need to introduce a new node type. There are no dependencies
> to Arm in the Linux IORT driver, so it works well with CONFIG_X86.
> 
From my limited understanding, IORT and VIOT are meant to solve device
topology enumeration only? I am not sure how they can be expanded to
cover information beyond device topology, e.g. DMAR has NUMA
information and root port ATS; I guess they are not used today in the
guest but might be added in the future.

> However, there are concerns about other OS vendors feeling
> obligated to implement this new node, so Arm proposed introducing
> another ACPI table, that can wrap any of DMAR, IVRS and IORT to
> extend it with new virtual nodes. A draft of this VIOT table
> specification is available at
> http://jpbrucker.net/virtio-iommu/viot/viot-v5.pdf
> 
> I'm afraid this could increase fragmentation as guests would need
> to implement or modify their support for all of DMAR, IVRS and IORT.
> If we end up doing VIOT, I suggest limiting it to IORT.
> 
> (2) In addition, there are some concerns about having virtio depend on
> ACPI or DT. Some hypervisors (Firecracker, QEMU microvm, kvmtool
> x86 [1]) don't currently implement those methods.
> 
> It was suggested to embed the topology description into the
> device. It can work, as demonstrated at the end of this RFC, with the
> following limitations:
> 
> - The topology description must be read before any endpoint
> managed by the IOMMU is probed, and even before the virtio module is
>   loaded. This RFC uses a PCI quirk to manually parse the virtio
>   configuration. It assumes that all endpoints managed by the
> IOMMU are under this same PCI host.
> 

> - I don't have a solution for the virtio-mmio transport at the
>   moment, because I haven't had time to modify a host to test it.
> I think it could either use a notifier on the platform bus, or
>   better, a new 'iommu' command-line argument to the virtio-mmio
>   driver. So the current prototype doesn't work for firecracker
> and microvm, which rely on virtio-mmio.
> 
> - For Arm, if the platform has an ITS, the hypervisor needs IORT
> or DT to describe it anyway. More generally, not using either ACPI or
>   DT might prevent from supporting other features as well. I
> suspect the above users will have to implement a standard method
> sooner or later.
> 
> - Even when reusing as much existing code as possible, guest
> support is still going to be around a few hundred lines since we can't
>   rely on the normal virtio infrastructure to be loaded at that
>   point. As you can see below, the diffstat for the incomplete
>   topology implementation is already bigger than the exhaustive
> IORT support, even when jumping through the VIOT hoop.
> 
> So it's a lightweight solution for very specific use-cases, and we
> should still support ACPI for the general case. Multi-platform
> guests such as Linux will then need to support three topology
> descriptions instead of two.
> 
> In this RFC I present both solutions, but I'd rather not keep all of
> it. Please see the individual patches for details:
> 
> (1) Patches 1, 3-10 add support for virtio-iommu to the Linux IORT
> driver and patches 2, 11 add the VIOT glue.
> 
> (2) Patch 12 adds the built-in topology description to the
> virtio-iommu specification. Patch 13 is a partial implementation for
> the Linux virtio-iommu driver. It only supports PCI, not platform
> devices.
> 
> You can find Linux and QEMU code on my virtio-iommu/devel branches at
> http://jpbrucker.net/git/linux and http://jpbrucker.net/git/qemu
> 
> 
> I split the diffstat since there are two independent features. The
> first one is for patches 1-11, and the second one for patch 13.
> 
> Jean-Philippe Brucker (11):
>   ACPI/IORT: Move IORT to the ACPI folder
>   ACPI: Add VIOT definitions
>   ACPI/IORT: Allow registration of external tables
>   ACPI/IORT: Add node categories
>   ACPI/IORT: Support VIOT virtio-mmio node
>   ACPI/IORT: Support VIOT virtio-pci node
>   ACPI/IORT: Defer probe until virtio-iommu-pci has registered a fwnode
>   ACPI/IORT: Add callback to update a device's fwnode
>