Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
> 
> > > > Do we have any isolation requirements here? its the same process. So if 
> > > > the
> > > > page-request it sent to guest and even if you report it for mdev1, after
> > > > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > > > should simply work?
> > > 
> > > I think we already talked about this and said it should not be done.
> > 
> > I get the should not be done, I'm wondering where should that be
> > implemented?
> 
> The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> must have only one device attached to it. Attempting to connect two
> devices to the same slot fails on the iommu layer.

I guess we are talking about two different things. I was referring to SVM
side of things. Maybe you are referring to the mdev.

A single guest process should be allowed to work with 2 different
accelerators. The PASID for the process is just 1. Limiting that to just
one accelerator per process seems wrong.

Unless there is something else to prevent this, the best way seems never
expose more than 1 mdev from same pdev to the same guest. I think this is a
reasonable restriction compared to limiting a process to bind to no more
than 1 accelerator.


> 
> So the 2nd mdev will fail during IOASID binding when it tries to bind
> to the same PASID that the first mdev is already bound to.
> 
> Jason

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 02:18:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> > > > > 
> > > > > > No. You are right on this case. I don't think there is a way to 
> > > > > > differentiate one mdev from the other if they come from the
> > > > > > same parent and attached by the same guest process. In this
> > > > > > case the fault could be reported on either mdev (e.g. the first
> > > > > > matching one) to get it fixed in the guest.
> > > > > 
> > > > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > > > and would have to share a group. Since group sharing is not supported
> > > > > today this seems like a non-issue
> > > > 
> > > > Does this mean we have to prevent 2 mdev's from same pdev being 
> > > > assigned to
> > > > the same guest? 
> > > 
> > > No, it means that the IOMMU layer has to be able to distinguish them.
> > 
> > Ok, the guest has no control over it, as it see 2 separate pci devices and
> > thinks they are all different.
> > 
> > Only time when it can fail is during the bind operation. From guest
> > perspective a bind in vIOMMU just turns into a write to local table and a
> > invalidate will cause the host to update the real copy from the shadow.
> > 
> > There is no way to fail the bind? and Allocation of the PASID is also a
> > separate operation and has no clue how its going to be used in the guest.
> 
> You can't attach the same RID to the same PASID twice. The IOMMU code
> should prevent this.
> 
> As we've talked about several times, it seems to me the vIOMMU
> interface is misdesigned for the requirements you have. The hypervisor
> should have a role in allocating the PASID since there are invisible
> hypervisor restrictions. This is one of them.

Allocating a PASID is a separate step from binding, isn't it? In vt-d we
have a virtual command interface that can fail an allocation of PASID. But
which device its bound to is a dynamic thing that only gets at bind_mm()
right?

> 
> > Do we have any isolation requirements here? its the same process. So if the
> > page-request it sent to guest and even if you report it for mdev1, after
> > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > should simply work?
> 
> I think we already talked about this and said it should not be done.

I get the should not be done, I'm wondering where should that be
implemented?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> > > 
> > > > No. You are right on this case. I don't think there is a way to 
> > > > differentiate one mdev from the other if they come from the
> > > > same parent and attached by the same guest process. In this
> > > > case the fault could be reported on either mdev (e.g. the first
> > > > matching one) to get it fixed in the guest.
> > > 
> > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > and would have to share a group. Since group sharing is not supported
> > > today this seems like a non-issue
> > 
> > Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> > the same guest? 
> 
> No, it means that the IOMMU layer has to be able to distinguish them.

Ok, the guest has no control over it, as it see 2 separate pci devices and
thinks they are all different.

Only time when it can fail is during the bind operation. From guest
perspective a bind in vIOMMU just turns into a write to local table and a
invalidate will cause the host to update the real copy from the shadow.

There is no way to fail the bind? and Allocation of the PASID is also a
separate operation and has no clue how its going to be used in the guest.

> 
> This either means they are "SW mdevs" which does not involve the IOMMU
> layer and puts both the responsibility for isolation and idenfication
> on the mdev driver.

When you mean SW mdev, is it the GPU like case where mdev is purely a SW
construct? or SIOV type where RID+PASID case?

> 
> Or they are some "PASID mdev" which does allow the IOMMU to isolate
> them.
> 
> What can't happen is to comingle /dev/iommu control over the pdev
> between two mdevs.
> 
> ie we can't talk about faults for IOMMU on SW mdevs - faults do not
> come from the IOMMU layer, they have to come from inside the mdev it
> self, somehow.

Recoverable faults for guest needs to be sent to guest? A page-request from
mdev1 and from mdev2 will both look alike when the process is sharing it.

Do we have any isolation requirements here? its the same process. So if the
page-request it sent to guest and even if you report it for mdev1, after
the PRQ is resolved by guest, the request from mdev2 from the same guest
should simply work?


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> 
> > No. You are right on this case. I don't think there is a way to 
> > differentiate one mdev from the other if they come from the
> > same parent and attached by the same guest process. In this
> > case the fault could be reported on either mdev (e.g. the first
> > matching one) to get it fixed in the guest.
> 
> If the IOMMU can't distinguish the two mdevs they are not isolated
> and would have to share a group. Since group sharing is not supported
> today this seems like a non-issue

Does this mean we have to prevent 2 mdev's from same pdev being assigned to
the same guest? 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: Plan for /dev/ioasid RFC v2

2021-06-18 Thread Raj, Ashok
On Fri, Jun 18, 2021 at 12:15:06PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 18, 2021 at 03:47:51PM +0200, Joerg Roedel wrote:
> > Hi Kevin,
> > 
> > On Thu, Jun 17, 2021 at 07:31:03AM +, Tian, Kevin wrote:
> > > Now let's talk about the new IOMMU behavior:
> > > 
> > > -   A device is blocked from doing DMA to any resource outside of
> > > its group when it's probed by the IOMMU driver. This could be a
> > > special state w/o attaching to any domain, or a new special domain
> > > type which differentiates it from existing domain types (identity, 
> > > dma, or unmanged). Actually existing code already includes a
> > > IOMMU_DOMAIN_BLOCKED type but nobody uses it.
> > 
> > There is a reason for the default domain to exist: Devices which require
> > RMRR mappings to be present. You can't just block all DMA from devices
> > until a driver takes over, we put much effort into making sure there is
> > not even a small window in time where RMRR regions (unity mapped regions
> > on AMD) are not mapped.
> 
> Yes, I think the DMA blocking can only start around/after a VFIO type
> driver has probed() and bound to a device in the group, not much
> different from today.

Does this mean when a device has a required "RMRR" that requires a unity
mapping we block assigning those devices to guests? I remember we had some
restriction but there was a need to go around it at some point in time.

- Either we disallow assigning devices with RMRR
- Break that unity map when the device is probed and after which any RMRR
  access from device will fault.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RESEND PATACH 1/1] iommu/vt-d: Use user privilege for RID2PASID translation

2021-05-12 Thread Raj, Ashok
On Wed, May 12, 2021 at 02:44:26PM +0800, Lu Baolu wrote:
> When first-level page tables are used for IOVA translation, we use user
> privilege by setting U/S bit in the page table entry. This is to make it
> consistent with the second level translation, where the U/S enforcement
> is not available. Clear the SRE (Supervisor Request Enable) field in the
> pasid table entry of RID2PASID so that requests requesting the supervisor
> privilege are blocked and treated as DMA remapping faults.
> 
> Suggested-by: Jacob Pan 
> Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/iommu.c | 7 +--
>  drivers/iommu/intel/pasid.c | 3 ++-
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 708f430af1c4..f1742da42478 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -2525,9 +2525,9 @@ static int domain_setup_first_level(struct intel_iommu 
> *iommu,
>   struct device *dev,
>   u32 pasid)
>  {
> - int flags = PASID_FLAG_SUPERVISOR_MODE;
>   struct dma_pte *pgd = domain->pgd;
>   int agaw, level;
> + int flags = 0;
>  
>   /*
>* Skip top levels of page tables for iommu which has
> @@ -2543,7 +2543,10 @@ static int domain_setup_first_level(struct intel_iommu 
> *iommu,
>   if (level != 4 && level != 5)
>   return -EINVAL;
>  
> - flags |= (level == 5) ? PASID_FLAG_FL5LP : 0;
> + if (pasid != PASID_RID2PASID)
> + flags |= PASID_FLAG_SUPERVISOR_MODE;
> + if (level == 5)
> + flags |= PASID_FLAG_FL5LP;

Given that we are still not bought into the supervisor PASID, should we make 
this an 
explicit request before turning on SUPERVISOR mode always? Since
pasid_set_sre() has no return, we have no way to fail it.



>  
>   if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
>   flags |= PASID_FLAG_PAGE_SNOOP;
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 72646bafc52f..72dc84821dad 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -699,7 +699,8 @@ int intel_pasid_setup_second_level(struct intel_iommu 
> *iommu,
>* Since it is a second level only translation setup, we should
>* set SRE bit as well (addresses are expected to be GPAs).
>*/
> - pasid_set_sre(pte);
> + if (pasid != PASID_RID2PASID)
> + pasid_set_sre(pte);
>   pasid_set_present(pte);
>   pasid_flush_caches(iommu, pte, pasid, did);
>  
> -- 
> 2.25.1
> 

-- 
Cheers,
Ashok

[Forgiveness is the attribute of the STRONG - Gandhi]
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: Tweak the description of a DMA fault

2021-05-12 Thread Raj, Ashok
On Wed, May 12, 2021 at 02:50:12PM +0800, Lu Baolu wrote:
> The Intel IOMMU driver reports the DMA fault reason in a decimal number
> while the VT-d specification uses a hexadecimal one. It's inconvenient
> that users need to covert them everytime before consulting the spec.
> Let's use hexadecimal number for a DMA fault reason.
> 
> The fault message uses 0x as PASID for DMA requests w/o PASID.
> This is confusing. Tweak this by adding "w/o PASID" explicitly.
> 
> Signed-off-by: Lu Baolu 

Maybe simpler to call it NO_PASID, and just PASID 0x instead?

with the minor suggestions below

Reviewed-by: Ashok Raj 

> ---
>  drivers/iommu/intel/dmar.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 1757ac1e1623..11e37d2c2af2 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1911,15 +1911,21 @@ static int dmar_fault_do_one(struct intel_iommu 
> *iommu, int type,
>   reason = dmar_get_fault_reason(fault_reason, _type);
>  
>   if (fault_type == INTR_REMAP)
> - pr_err("[INTR-REMAP] Request device [%02x:%02x.%d] fault index 
> %llx [fault reason %02d] %s\n",
> - source_id >> 8, PCI_SLOT(source_id & 0xFF),
> - PCI_FUNC(source_id & 0xFF), addr >> 48,
> - fault_reason, reason);
> - else
> - pr_err("[%s] Request device [%02x:%02x.%d] PASID %x fault addr 
> %llx [fault reason %02d] %s\n",
> + pr_err("[INTR-REMAP] Request device [%02x:%02x.%d] fault index 
> %llx [fault reason %02xh] %s\n",
> +source_id >> 8, PCI_SLOT(source_id & 0xFF),
> +PCI_FUNC(source_id & 0xFF), addr >> 48,
> +fault_reason, reason);
> + else if (pasid == INVALID_IOASID)
> + pr_err("[%s w/o PASID] Request device [%02x:%02x.%d] fault addr 
> %llx [fault reason %02xh] %s\n",
>  type ? "DMA Read" : "DMA Write",
>  source_id >> 8, PCI_SLOT(source_id & 0xFF),
> -PCI_FUNC(source_id & 0xFF), pasid, addr,
> +PCI_FUNC(source_id & 0xFF), addr,
> +fault_reason, reason);
> + else
> + pr_err("[%s w/ PASID %x] Request device [%02x:%02x.%d] fault 
> addr %llx [fault reason %02xh] %s\n",

Can you always lead hex values with 0x?

> +type ? "DMA Read" : "DMA Write", pasid,
> +source_id >> 8, PCI_SLOT(source_id & 0xFF),
> +PCI_FUNC(source_id & 0xFF), addr,
>  fault_reason, reason);
>   return 0;
>  }
> @@ -1987,7 +1993,7 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>   if (!ratelimited)
>   /* Using pasid -1 if pasid is not present */
>   dmar_fault_do_one(iommu, type, fault_reason,
> -   pasid_present ? pasid : -1,
> +   pasid_present ? pasid : 
> INVALID_IOASID,
> source_id, guest_addr);
>  
>   fault_index++;
> -- 
> 2.25.1
> 

-- 
Cheers,
Ashok

[Forgiveness is the attribute of the STRONG - Gandhi]
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Raj, Ashok
On Mon, May 10, 2021 at 12:31:11PM -0300, Jason Gunthorpe wrote:
> On Mon, May 10, 2021 at 08:25:02AM -0700, Raj, Ashok wrote:
> 
> > Global PASID doesn't break anything, but giving that control to vIOMMU
> > doesn't seem right. When we have mixed uses cases like hardware that
> > supports shared wq and SRIOV devices that need PASIDs we need to 
> > comprehend how they will work without having a backend to migrate PASIDs 
> > to new destination.
> 
> Why wouldn't there be a backend? SRIOV live migration is a real thing
> now (see Max's VFIO patches). The PASID space of the entire dedicated
> RID needs to be migratable, which means the destination vIOMMU must be
> able to program its local hardware with the same PASID numbers and any
> kind of global PASID scheme at all will interfere with it.

The way I'm imagining it works is as follows. We have 2 types of platforms.
Let me know if i missed something.

- no shared wq, meaning RID local PASID allocation is all that's needed.
  Say platform type p1
- Shared WQ configurations that require global PASIDs.
  Say platform type p2

vIOMMU might have a capability to indicate if vIOMMU has a PASID allocation
capability. If there is none, guest is free to obtain its own PASID per
RID since they are fully isolated. For p1 type platforms this should work.
Since there is no Qemu interface even required or /dev/iommu for that
instance. Guest kernel can manage it fully per-RID based.

Platform type p2 that has both SIOV (with enqcmd) and SRIOV that requires
PASID. My thought was to reserve say some number of PASID's for per-RID
use. When you request a RID local you get PASID in that range. When you ask
for global, you get in the upper PASID range. 

Say 0-4k is reserved for any RID local allocation. This is also the guest
PASID range.  4k->2^19 are for shared WQ. (I'm not implying the size, but 
just for discussion sake that we need a separation.)

Those vIOMMU's will have a capability that it supports PASID allocation
interface. When you allocate you can say what type of PASID you need
(shared vs local) and Qemu will obtain either within the local range, or in
the shared range. When they are allocated in the shared range, you still
end up using one in guest PASID range, but mapped to a different
host-pasid using the VMCS like PASID redirection per-guest (not-perRID).

Only the shared allocation requires /dev/iommu interface. All allocation in
the guest range is fully in Qemu control.

For supporting migration, the target also needs to have this capability for
migration. 
> 
> > When we have both SRIOV and shared WQ exposed to the same guest, we 
> > do have an issue. The simplest way that I thought was to have a guest 
> > and host PASID separation.  Where the guest has its own PASID  space 
> > and host has its own carved out.  Guest can do what ever it wants within 
> > that allocated space without fear of any collition with any other device. 
> 
> And how do you reliably migrate if the target kernel has a PASID
> already allocated in that range?

For any shared range, remember there is a mapping table. And since those
ranges are always reserved in the new host it isn't a problem. For shared
WQ that requires a PASID remapping to new host PASID, the VMCS remapping
per guest that does gPASID->hPASID does this job. So the guest PASID
remains unchanged. 

Does this make sense? 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Raj, Ashok
On Mon, May 10, 2021 at 09:37:29AM -0300, Jason Gunthorpe wrote:
> On Sat, May 08, 2021 at 09:56:59AM +, Tian, Kevin wrote:
> > > From: Raj, Ashok 
> > > Sent: Friday, May 7, 2021 12:33 AM
> > > 
> > > > Basically it means when the guest's top level IOASID is created for
> > > > nesting that IOASID claims all PASID's on the RID and excludes any
> > > > PASID IOASIDs from existing on the RID now or in future.
> > > 
> > > The way to look at it this is as follows:
> > > 
> > > For platforms that do not have a need to support shared work queue model
> > > support for ENQCMD or similar, PASID space is naturally per RID. There is 
> > > no
> > > complication with this. Every RID has the full range of PASID's and no 
> > > need
> > > for host to track which PASIDs are allocated now or in future in the 
> > > guest.
> > > 
> > > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > > global across the entire system. Maybe its better to call them gPASID for
> > > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > > wide mapping for ENQCMD and not a per-RID based mapping. (We covered
> > > that
> > > in earlier responses)
> > > 
> > > In our current implementation we actually don't separate this space, and
> > > gPASID == hPASID. The iommu driver enforces that by using the custom
> > > allocator and the architected interface that allows all guest vIOMMU
> > > allocations to be proxied to host. Nothing but a glorified hypercall like
> > > interface. In fact some OS's do use hypercall to get a hPASID vs using
> > > the vCMD style interface.
> > > 
> > 
> > After more thinking about the new interface, I feel gPASID==hPASID 
> > actually causes some confusion in uAPI design. In concept an ioasid
> > is not active until it's attached to a device, because it's just an ID
> > if w/o a device. So supposedly an ioasid should reject all user commands
> > before attach. However an guest likely asks for a new gPASID before
> > attaching it to devices and vIOMMU. if gPASID==hPASID then Qemu 
> > must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't 
> > been attached to any device, with the assumption on kernel knowledge 
> > that this hw_id is from an global allocator w/o dependency on any 
> > device. This doesn't sound a clean design, not to say it also conflicts 
> > with live migration.
> 
> Everything must be explicit. The situation David pointed to of
> qemu emulating a vIOMMU while running on a host with a different
> platform/physical IOMMU must be considered.
> 
> If the vIOMMU needs specific behavior it must use /dev/iommu to ask
> for it specifically and not just make wild assumptions about how the
> platform works.

I think the right way is for pIOMMU to enforce the right behavior. vIOMMU
can ask for a PASID and physical IOMMU driver would give what is optimal
for the platform. if vIOMMU says give me per-device PASID, but that can
lead to conflicts in PASID name space, its best to avoid it. 

Global PASID doesn't break anything, but giving that control to vIOMMU
doesn't seem right. When we have mixed uses cases like hardware that
supports shared wq and SRIOV devices that need PASIDs we need to 
comprehend how they will work without having a backend to migrate PASIDs 
to new destination.

for ENQCMD we have the gPASID->hPASID translation in the VMCS control.
For devices that support SIOV, programming a PASID to  a device is also
mediated, so its possible for something like the mediated interface to
assist with that migration for the dedicated WQ.


When we have both SRIOV and shared WQ exposed to the same guest, we 
do have an issue. The simplest way that I thought was to have a guest 
and host PASID separation.  Where the guest has its own PASID  space 
and host has its own carved out.  Guest can do what ever it wants within 
that allocated space without fear of any collition with any other device. 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Raj, Ashok
Hi Jason

- Removed lizefan's email due to bounces... 

On Fri, May 07, 2021 at 03:20:50PM -0300, Jason Gunthorpe wrote:
> On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:
> > On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:
> > > 
> > > > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > > > global across the entire system. Maybe its better to call them gPASID 
> > > > for
> > > > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > > > wide mapping for ENQCMD and not a per-RID based mapping. (We covered 
> > > > that
> > > > in earlier responses)
> > > 
> > > I don't think it is actually ENQCMD that forces this, ENQCMD can use a
> > > per-RID PASID in the translation table as well.
> > 
> > When using ENQCMD the PASID that needs to be sent on the wire is picked
> > from an MSR setup by kernel. This is context switched along with the
> > process. So each process has only 1 PASID that can go out when using
> > ENQCMD. ENQCMD takes one mmio address specific to the acclerator and a
> > source for the descriptor.
> 
> Oh. I forgot this also globally locked the PASID to a single
> MSR. Sigh. That makes the whole mechanism useless for anything except
> whole process SVA.

Is there another kind of SVA? Our mapping from that each process requires a
single mm, and PASID for SVM was a direct map from that. 

> 
> It also make it a general kernel problem and not just related to the
> vIOMMU scenario.
> 
> > > I think at the uAPI level the callpaths that require allocating a
> > > PASID from a group of RIDs should be explicit in their intention and
> > > not implicitly rely on a certain allocator behavior.
> > 
> > The difficult part I see is, when one application establishes a path
> > to one acclerator, we have no knowledge if its going to connect to a
> > second, third or such. I don't see how this can work reasonably
> > well. What if PASIDx is allocated for one, but the second RID its
> > trying to attach already has this PASID allocated?
> 
> You mean like some kind of vIOMMU hot plug?

Not vIOMMU hot plug. but an application opens accel1, does a bind to
allocate a PASID. What i meant was kernel has no information if this needs
to be a per-RID PASID, or a global PASID. Keeping this global solves the
other problems or more complex mechanisms to say "Reserve this PASID on all
accelerators" which seems pretty complicated to implement.

Now are we loosing anything by keeping the PASIDs global? 

As we discussed there is no security issue since the PASID table that hosts 
these PASIDs for SVM are still per-RID.  For e.g.

app establishes connection to accl1, allocates PASID-X
   RID for accel1 now has PASID-X and the process mm plummed 
later app also connects with accl2, now the PASID-X is plummed in for RID
of accel2.


> 
> > > If you want to get a PASID that can be used with every RID on in your
> > > /dev/ioasid then ask for that exactly.
> > 
> > Correct, but how does guest through vIOMMU driver communicate that intent 
> > so uAPI
> > plumbing can do this? I mean architecturally via IOMMU interfaces? 
> 
> I would have to ask for a PASID that has the property it needs. You
> are saying the property is even bigger than "usable on a group of
> RIDs" but is actually "global for every RID and IOMMU in the system so
> it can go into a MSR". Gross, but fine, ask for that explicitly when
> allocating the PASID.

If one process has a single mm, is that also gross? :-) So a single process
having a PASID is just an identifier for IOMMU. It just seems like what a
mm is for a process == PASID for SVM-IOMMU support.

The unanswered question is how do we plumb from vIOMMU without a custom
allocator to get a system wide PASID? 

The way it works today is if we have a custom allocator registered, that's
the mechanics to get PASIDs allocated. for Intel vIOMMU it happens to be a
global unique allocation. If a particular vIOMMU doesn't require, it does
not have vIOMMU interface, and those naturally get a guest local PASID name
space. (Im not sure if that's how the allocator works today, but I guess its
extensible to accomplish a RID local PASID if that's exactly what is
required)

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Raj, Ashok
On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:
> 
> > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > global across the entire system. Maybe its better to call them gPASID for
> > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > wide mapping for ENQCMD and not a per-RID based mapping. (We covered that
> > in earlier responses)
> 
> I don't think it is actually ENQCMD that forces this, ENQCMD can use a
> per-RID PASID in the translation table as well.

When using ENQCMD the PASID that needs to be sent on the wire is picked
from an MSR setup by kernel. This is context switched along with the
process. So each process has only 1 PASID that can go out when using
ENQCMD. ENQCMD takes one mmio address specific to the acclerator and a
source for the descriptor. 

When one application is connecting to more than one accelerator since this
is MSR based filled in by the cpu instruction automaticaly requires both
accelerators to use the same PASID. 

Did you refer to this implementation? or something else?
> 
> You get forced here only based on the design of the vIOMMU
> communication channel. If the guest can demand any RID is attached to
> a specific guest determined PASID then the hypervisor must accommodate
> that.

True, but when we have guest using vSVM, and enabling vENQCMD the
requirement is the same inside a guest.
> 
> > > Which would be a different behavior than something like Intel's top
> > > level IOASID that doesn't claim all the PASIDs.
> > 
> > isn't this simple, if we can say ioasid allocator can provide 
> > 
> > - system wide PASID
> > - RID local PASID
> > 
> > Based on platform capabilities that require such differentiation?
> 
> I think at the uAPI level the callpaths that require allocating a
> PASID from a group of RIDs should be explicit in their intention and
> not implicitly rely on a certain allocator behavior.

The difficult part I see is, when one application establishes a path to one
acclerator, we have no knowledge if its going to connect to a second, third 
or such. I don't see how this can work reasonably well. What if PASIDx is 
allocated
for one, but the second RID its trying to attach already has this
PASID allocated? 

> 
> If you want to get a PASID that can be used with every RID on in your
> /dev/ioasid then ask for that exactly.

Correct, but how does guest through vIOMMU driver communicate that intent so 
uAPI
plumbing can do this? I mean architecturally via IOMMU interfaces? 


Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-06 Thread Raj, Ashok
Hi Jason

On Thu, May 06, 2021 at 09:27:30AM -0300, Jason Gunthorpe wrote:
> On Thu, May 06, 2021 at 09:23:48AM +0200, Jean-Philippe Brucker wrote:
> > On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > > > > For ARM, since the guest owns the per device PASID table. There is no
> > > > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ,
> > > > > there is no need for global PASID/SSID either. So PASID being global
> > > > > for ARM is for simplicity in case of host PASID/SSID.  
> > > > 
> > > > It isn't clear how ARM can support PASID and mdev but that is an
> > > > unrelated issue..
> > > > 
> > > AFAIK, the current SMMU device assignment is per RID, since only one 
> > > stage2
> > > page tables per RID, not per PASID. This is equivalent to the older VT-d
> > > spec. prior to scalable mode.
> > 
> > Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it
> > doesn't support assigning level-1 page tables to guests for mdevs (sub-VF
> > devices). So no PASIDs for mdevs, which also means each guest has its own
> > PASID space and the host doesn't track guest PASIDs.
> 
> Basically it means when the guest's top level IOASID is created for
> nesting that IOASID claims all PASID's on the RID and excludes any
> PASID IOASIDs from existing on the RID now or in future.

The way to look at it this is as follows:

For platforms that do not have a need to support shared work queue model
support for ENQCMD or similar, PASID space is naturally per RID. There is no
complication with this. Every RID has the full range of PASID's and no need
for host to track which PASIDs are allocated now or in future in the guest.

For platforms that support ENQCMD, it is required to mandate PASIDs are
global across the entire system. Maybe its better to call them gPASID for
guest and hPASID for host. Short reason being gPASID->hPASID is a guest
wide mapping for ENQCMD and not a per-RID based mapping. (We covered that
in earlier responses)

In our current implementation we actually don't separate this space, and
gPASID == hPASID. The iommu driver enforces that by using the custom
allocator and the architected interface that allows all guest vIOMMU
allocations to be proxied to host. Nothing but a glorified hypercall like
interface. In fact some OS's do use hypercall to get a hPASID vs using
the vCMD style interface.

For cases where there is full PASID range for every RID and completely
managed by the guest that requires no assist from host to ensure
uniqueness, they don't need to have a custom allocator. Maybe the general
allocator can have ways to ensure global uniqueness vs. RID wide
uniqueness. This is still managed by the iommu driver (vIOMMU) + the
backend for vCMD in the host IOMMU driver.

> 
> Which would be a different behavior than something like Intel's top
> level IOASID that doesn't claim all the PASIDs.

isn't this simple, if we can say ioasid allocator can provide 

- system wide PASID
- RID local PASID

Based on platform capabilities that require such differentiation?

And based on the other threads, if ioasid is just a pgtable representation,
it doesn't need a PASID per-se. But when you want to use SVM or such, you
can associate a PASID with it for the IOMMU to plumb things with hardware.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Raj, Ashok
On Wed, May 05, 2021 at 07:21:20PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:
> > 
> > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> > > 
> > > > Global and pluggable are for slightly separate reasons.
> > > > - We need global PASID on VT-d in that we need to support shared
> > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > > > assigned to two VMs. Each VM uses its private guest PASID to submit
> > > > work but each guest PASID must be translated to a global (system-wide)
> > > > host PASID to avoid conflict. Also, since PASID table storage is per
> > > > PF, if two mdevs of the same PF are assigned to different VMs, the
> > > > PASIDs must be unique.  
> > > 
> > > From a protocol perspective each RID has a unique PASID table, and
> > > RIDs can have overlapping PASIDs.
> > > 
> > True, per RID or per PF as I was referring to.
> > 
> > > Since your SWQ is connected to a single RID the requirement that
> > > PASIDs are unique to the RID ensures they are sufficiently unique.
> > > 
> > True, but one process can submit work to multiple mdevs from different
> > RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
> > The same PASID is used for all the PASID tables of each RID.
> 
> If the model is "assign this PASID to this RID" then yes, there is a
> big problem keeping everything straight that can only be solved with a
> global table.
> 
> But if the model is "give me a PASID for this RID" then it isn't such
> a problem.

Correct, since we have usage with ENQCMD, its more like

- Give me a PASID1 (not attached to any RID)
- Bind/attach PASID1 with RID1
- Bind/attach PASID1 with RID2

and ENQCMD isn't just for Intel, with the DMWr spec in PCI, it brings it to
all devices as long as routing is supported by interim switches and such.

> 
> Basically trying to enforce a uniform PASID for an IOASID across all
> RIDs attached to it is not such a nice choice.
> 
> > > That is fine, but all this stuff should be inside the Intel vIOMMU
> > > driver not made into a global resource of the entire iommu subsystem.
> > > 
> > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic
> > code need to have this option. I guess you are saying we should also have a
> > per RID allocation option in addition to global?
> 
> There always has to be a RID involvement for the PASID, for security,
> this issue really boils down to where the PASID lives.

We do have a RID involvement with PASID always for security. Every RID has
its own PASID table, but the PASID name space is global. 

So if you have RID1 associated with PASID1, another RID2 doesn't have the
PASID1 in its PASID table. Until when the app binds PASID1 with RID2 as
well. Then you have PASID1 plumbed in the PASID table for RID2.

Is this what you refer to for security? 


> 
> If you need the PASID attached to the IOASID then it has to be global
> because the IOASID can be attached to any RID and must keep the same
> PASID.
> 
> If the PASID is learned when the IOASID is attached to a RID then the
> PASID is more flexible and isn't attached to the IOASID.
> 
> Honestly I'm a little leary to bake into a UAPI a specific HW choice
> that Intel made here.

Like I mentioned, this isn't just Intel going forward. The specs are public
in PCIe. I just can't comment which other vendors are adopting it.

> 
> I would advise making the "attach a global PASID to this IOASID"
> operation explicit and opt into for case that actually need it.
> 
> Which implies the API to the iommu driver should be more like:
> 
>   'assign an IOASID to this RID and return the PASID'
>   'reserve a PASID from every RID'

I don't think this has any decent change of success. Its rather round about
way to get a global PASID namespace.

>   'assign an IOASID to this RID and use this specific PASID'

This seems a bit complicated. Another way to specify this.

- IOASID is a logical construct to specify a page table.
- You can bind a global PASID to an IOASID

We aren't loosing any security by using a global PASID name space. 
Until the application asks for it, that is not bound to any other RID without 
an explicit
request.


-- 
Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-08 Thread Raj, Ashok
Hi Joerg

On Mon, Mar 08, 2021 at 09:58:26AM +0800, Lu Baolu wrote:
> Hi Joerg,
> 
> On 3/4/21 8:26 PM, Joerg Roedel wrote:
> >On Thu, Feb 25, 2021 at 02:26:51PM +0800, Lu Baolu wrote:
> >>When the first level page table is used for IOVA translation, it only
> >>supports Read-Only and Read-Write permissions. The Write-Only permission
> >>is not supported as the PRESENT bit (implying Read permission) should
> >>always set. When using second level, we still give separate permissions
> >>that allows WriteOnly which seems inconsistent and awkward. There is no
> >>use case we can think off, hence remove that configuration to make it
> >>consistent.
> >
> >No use-case for WriteOnly mappings? How about DMA_FROM_DEVICE mappings?
> >
> 
> The statement of no use case is not correct. Sorry about it.
> 
> As we have moved to use first level for IOVA translation, the first
> level page table entry only provides RO and RW permissions. So if any
> device driver specifies DMA_FROM_DEVICE attribution, it will get RW
> permission in the page table. This patch aims to make the permissions
> of second level and first level consistent. No impact on the use of
> DMA_FROM_DEVICE attribution.
> 

That is the primary motivation, given that we have moved to 1st level for
general IOVA, first level doesn't have a WO mapping. I didn't know enough
about the history to determine if a WO without a READ is very useful. I
guess the ZLR was invented to support those cases without a READ in PCIe. I

Early Intel IOMMU's didn't handle ZLR properly, until we fixed it in the
next generation. It just seemed opposite to the CPU page-tables, and we
wanted to have consistent behavior. After moving to 1st level, we don't
want things to work sometimes, and break if we use 2nd level for the same
mappings.

Hope this helps

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v13 06/10] iommu: Add a page fault handler

2021-03-02 Thread Raj, Ashok
On Tue, Mar 02, 2021 at 10:26:42AM +0100, Jean-Philippe Brucker wrote:
[snip]

> +
> +static enum iommu_page_response_code
> +iopf_handle_single(struct iopf_fault *iopf)
> +{
> + vm_fault_t ret;
> + struct mm_struct *mm;
> + struct vm_area_struct *vma;
> + unsigned int access_flags = 0;
> + unsigned int fault_flags = FAULT_FLAG_REMOTE;
> + struct iommu_fault_page_request *prm = >fault.prm;
> + enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> +
> + if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID))
> + return status;
> +
> + mm = iommu_sva_find(prm->pasid);
> + if (IS_ERR_OR_NULL(mm))
> + return status;
> +
> + mmap_read_lock(mm);
> +
> + vma = find_extend_vma(mm, prm->addr);
> + if (!vma)
> + /* Unmapped area */
> + goto out_put_mm;
> +
> + if (prm->perm & IOMMU_FAULT_PERM_READ)
> + access_flags |= VM_READ;
> +
> + if (prm->perm & IOMMU_FAULT_PERM_WRITE) {
> + access_flags |= VM_WRITE;
> + fault_flags |= FAULT_FLAG_WRITE;
> + }
> +
> + if (prm->perm & IOMMU_FAULT_PERM_EXEC) {
> + access_flags |= VM_EXEC;
> + fault_flags |= FAULT_FLAG_INSTRUCTION;
> + }
> +
> + if (!(prm->perm & IOMMU_FAULT_PERM_PRIV))
> + fault_flags |= FAULT_FLAG_USER;
> +
> + if (access_flags & ~vma->vm_flags)
> + /* Access fault */
> + goto out_put_mm;
> +
> + ret = handle_mm_fault(vma, prm->addr, fault_flags, NULL);

Should we add a trace similar to trace_page_fault_user() or kernel in
arch/x86/kernel/mm/fault.c 

or maybe add a perf_sw_event() for device faults? 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/3] iommu/vt-d: Parse SATC reporting structure

2021-02-02 Thread Raj, Ashok
On Tue, Feb 02, 2021 at 12:40:56PM +0800, Lu Baolu wrote:
> From: Yian Chen 
> 
> Software should parse every SATC table and all devices in the tables
> reported by the BIOS and keep the information in kernel list for further
> SATC policy deployment.
> 
The last part seems bit vague? Are you trying to imply, 

if kernel is booted with noats for instance, a device listed in SATC table
as "requires ATS" will fail to load?

Otherwise its not clear what the policy enforcement means.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC

2021-02-02 Thread Raj, Ashok
On Tue, Feb 02, 2021 at 12:40:55PM +0800, Lu Baolu wrote:
> From: Yian Chen 
> 
> Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping
> structure SATC for SOC integrated devices, according to section 8.8 of
> Intel VT-d architecture specification v3.2. The SATC structure reports
> a list of the devices that require SATC enabling via ATS capacity.

nit: s/require SATC/require ATS for normal device operation. This is a
functional requirement that these devices will not work without OS enabling
ATS capability.

> 
> This patch introduces the new enum value and structure to represent the
> remapping information. Kernel should parse the information from the
> reporting structure and enable ATC for the devices as needed.
> 
> Signed-off-by: Yian Chen 
> ---
>  include/acpi/actbl1.h | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
> index 43549547ed3e..b7ca802b66d2 100644
> --- a/include/acpi/actbl1.h
> +++ b/include/acpi/actbl1.h
> @@ -514,7 +514,8 @@ enum acpi_dmar_type {
>   ACPI_DMAR_TYPE_ROOT_ATS = 2,
>   ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,
>   ACPI_DMAR_TYPE_NAMESPACE = 4,
> - ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */
> + ACPI_DMAR_TYPE_SATC = 5,
> + ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */
>  };
>  

Think Joerg spotted the comment update.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Raj, Ashok
On Wed, Oct 21, 2020 at 08:32:18PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 21, 2020 at 01:03:15PM -0700, Raj, Ashok wrote:
> 
> > I'm not sure why you tie in IDXD and VDPA here. How IDXD uses native
> > SVM is orthogonal to how we achieve mdev passthrough to guest and
> > vSVM.
> 
> Everyone assumes that vIOMMU and SIOV aka PASID is going to be needed
> on the VDPA side as well, I think that is why JasonW brought this up
> in the first place.

True, to that effect we are working on trying to move PASID allocation
outside of VFIO, so both agents VFIO and vDPA with PASID, when that comes
available can support one way to allocate and manage PASID's from user
space.

Since the IOASID is almost standalone, this is possible.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Raj, Ashok
On Wed, Oct 21, 2020 at 03:24:42PM -0300, Jason Gunthorpe wrote:
> 
> > Contrary to your argument, vDPA went with a half blown device only 
> > iommu user without considering existing abstractions like containers
> 
> VDPA IOMMU was done *for Intel*, as the kind of half-architected thing
> you are advocating should be allowed for IDXD here. Not sure why you
> think bashing that work is going to help your case here.

I'm not bashing that work, sorry if it comes out that way, 
but just feels like double standards.

I'm not sure why you tie in IDXD and VDPA here. How IDXD uses native
SVM is orthogonal to how we achieve mdev passthrough to guest and vSVM. 
We visited that exact thing multiple times. Doing SVM is quite simple and 
doesn't carry the weight of other (Kevin explained this in detail 
not too long ago) long list of things we need to accomplish for mdev pass 
through. 

For SVM, just access to hw, mmio and bind_mm to get a PASID bound with
IOMMU. 

For IDXD that creates passthough devices for guest access and vSVM is
through the VFIO path. 

For guest SVM, we expose mdev's to guest OS, idxd in the guest provides vSVM
services. vSVM is *not* build around native SVM interfaces. 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Raj, Ashok
On Wed, Oct 21, 2020 at 08:48:29AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
> > On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > > > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > > > I think we agreed (or agree to disagree and commit) for device 
> > > > > > types that 
> > > > > > we have for SIOV, VFIO based approach works well without having to 
> > > > > > re-invent 
> > > > > > another way to do the same things. Not looking for a shortcut by 
> > > > > > any means, 
> > > > > > but we need to plan around existing hardware though. Looks like 
> > > > > > vDPA took 
> > > > > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > > > > necessary hardware was available.. This would be a solved puzzle. 
> > > > > 
> > > > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > > > the "home" and needs to stand alone.
> > > > > 
> > > > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > > > 
> > > > So just to make this clear, I did check internally if there are any 
> > > > plans
> > > > for vDPA + SVM. There are none at the moment. 
> > > 
> > > Not SVM, SIOV.
> > 
> > ... And that included.. I should have said vDPA + PASID, No current plans. 
> > I have no idea who set expectations with you. Can you please put me in 
> > touch 
> > with that person, privately is fine.
> 
> It was the team that aruged VDPA had to be done through VFIO - SIOV
> and PASID was one of their reasons it had to be VFIO, check the list
> archives

Humm... I could search the arhives, but the point is I'm confirming that
there is no forward looking plan!

And who ever did was it was based on probably strawman hypothetical argument 
that wasn't
grounded in reality. 

> 
> If they didn't plan to use it, bit of a strawman argument, right?

This doesn't need to continue like the debates :-) Pun intended :-)

I don't think it makes any sense to have an abstract strawman argument
design discussion. Yi is looking into for pasid management alone. Rest 
of the IOMMU related topics should wait until we have another 
*real* use requiring consolidation. 

Contrary to your argument, vDPA went with a half blown device only 
iommu user without considering existing abstractions like containers 
and such in VFIO is part of the reason the gap is big at the moment.
And you might not agree, but that's beside the point. 

Rather than pivot ourselves around hypothetical, strawman,
non-intersecting, suggesting architecture without having done a proof of
concept to validate the proposal should stop. We have to ground ourselves
in reality.

The use cases we have so far for SIOV, VFIO and mdev seem to be the right
candidates and addresses them well. Now you might disagree, but as noted we
all agreed to move past this.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-20 Thread Raj, Ashok
On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > I think we agreed (or agree to disagree and commit) for device types 
> > > > that 
> > > > we have for SIOV, VFIO based approach works well without having to 
> > > > re-invent 
> > > > another way to do the same things. Not looking for a shortcut by any 
> > > > means, 
> > > > but we need to plan around existing hardware though. Looks like vDPA 
> > > > took 
> > > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > > necessary hardware was available.. This would be a solved puzzle. 
> > > 
> > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > the "home" and needs to stand alone.
> > > 
> > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > 
> > So just to make this clear, I did check internally if there are any plans
> > for vDPA + SVM. There are none at the moment. 
> 
> Not SVM, SIOV.

... And that included.. I should have said vDPA + PASID, No current plans. 
I have no idea who set expectations with you. Can you please put me in touch 
with that person, privately is fine.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-20 Thread Raj, Ashok
On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > I think we agreed (or agree to disagree and commit) for device types that 
> > we have for SIOV, VFIO based approach works well without having to 
> > re-invent 
> > another way to do the same things. Not looking for a shortcut by any means, 
> > but we need to plan around existing hardware though. Looks like vDPA took 
> > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > necessary hardware was available.. This would be a solved puzzle. 
> 
> I think it is the opposite, vIOMMU and related has outgrown VFIO as
> the "home" and needs to stand alone.
> 
> Apparently the HW that will need PASID for vDPA is Intel HW, so if

So just to make this clear, I did check internally if there are any plans
for vDPA + SVM. There are none at the moment. It seems like you have
better insight into our plans ;-). Please do let me know who confirmed vDPA
roadmap with you and I would love to talk to them to clear the air.


Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-20 Thread Raj, Ashok
On Tue, Oct 20, 2020 at 02:03:36PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 09:24:30AM -0700, Raj, Ashok wrote:
> > Hi Jason,
> > 
> > 
> > On Tue, Oct 20, 2020 at 11:02:17AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 10:21:41AM +, Liu, Yi L wrote:
> > > 
> > > > > I'm sure there will be some
> > > > > weird overlaps because we can't delete any of the existing VFIO APIs, 
> > > > > but
> > > > > that
> > > > > should not be a blocker.
> > > > 
> > > > but the weird thing is what we should consider. And it perhaps not just
> > > > overlap, it may be a re-definition of VFIO container. As I mentioned, 
> > > > VFIO
> > > > container is IOMMU context from the day it was defined. It could be the
> > > > blocker. :-(
> > > 
> > > So maybe you have to broaden the VFIO container to be usable by other
> > > subsystems. The discussion here is about what the uAPI should look
> > > like in a fairly abstract way. When we say 'dev/sva' it just some
> > > placeholder for a shared cdev that provides the necessary
> > > dis-aggregated functionality 
> > > 
> > > It could be an existing cdev with broader functionaltiy, it could
> > > really be /dev/iommu, etc. This is up to the folks building it to
> > > decide.
> > > 
> > > > I'm not expert on vDPA for now, but I saw you three open source
> > > > veterans have a similar idea for a place to cover IOMMU handling,
> > > > I think it may be a valuable thing to do. I said "may be" as I'm not
> > > > sure about Alex's opinion on such idea. But the sure thing is this
> > > > idea may introduce weird overlap even re-definition of existing
> > > > thing as I replied above. We need to evaluate the impact and mature
> > > > the idea step by step. 
> > > 
> > > This has happened before, uAPIs do get obsoleted and replaced with
> > > more general/better versions. It is often too hard to create a uAPI
> > > that lasts for decades when the HW landscape is constantly changing
> > > and sometime a reset is needed. 
> > 
> > I'm throwing this out with a lot of hesitation, but I'm going to :-)
> > 
> > So we have been disussing this for months now, with some high level vision
> > trying to get the uAPI's solidified with a vDPA hardware that might
> > potentially have SIOV/SVM like extensions in hardware which actualy doesn't
> > exist today. Understood people have plans. 
> 
> > Given that vDPA today has diverged already with duplicating use of IOMMU
> > api's without making an effort to gravitate to /dev/iommu as how you are
> > proposing.
> 
> I see it more like, given that we already know we have multiple users
> of IOMMU, adding new IOMMU focused features has to gravitate toward
> some kind of convergance.
> 
> Currently things are not so bad, VDPA is just getting started and the
> current IOMMU feature set is not so big.
> 
> PASID/vIOMMU/etc/et are all stressing this more, I think the
> responsibility falls to the people proposing these features to do the
> architecture work.
> 
> > The question is should we hold hostage the current vSVM/vIOMMU efforts
> > without even having made an effort for current vDPA/VFIO convergence. 
> 
> I don't think it is "held hostage" it is a "no shortcuts" approach,
> there was always a recognition that future VDPA drivers will need some
> work to integrated with vIOMMU realted stuff.

I think we agreed (or agree to disagree and commit) for device types that 
we have for SIOV, VFIO based approach works well without having to re-invent 
another way to do the same things. Not looking for a shortcut by any means, 
but we need to plan around existing hardware though. Looks like vDPA took 
some shortcuts then to not abstract iommu uAPI instead :-)? When all
necessary hardware was available.. This would be a solved puzzle. 


> 
> This is no different than the IMS discussion. The first proposed patch
> was really simple, but a layering violation.
> 
> The correct solution was some wild 20 patch series modernizing how x86

That was more like 48 patches, not 20 :-). But we had a real device with
IMS to model and create these new abstractions and test them against. 

For vDPA+SVM we have non-intersecting conversations at the moment with no
real hardware to model our discussion around. 

> interrupts works because it had outgrown itself. This general approach
> to use the shared MSI infrastructure was pointed out at the very
> beginning of 

Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-20 Thread Raj, Ashok
Hi Jason,


On Tue, Oct 20, 2020 at 11:02:17AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 10:21:41AM +, Liu, Yi L wrote:
> 
> > > I'm sure there will be some
> > > weird overlaps because we can't delete any of the existing VFIO APIs, but
> > > that
> > > should not be a blocker.
> > 
> > but the weird thing is what we should consider. And it perhaps not just
> > overlap, it may be a re-definition of VFIO container. As I mentioned, VFIO
> > container is IOMMU context from the day it was defined. It could be the
> > blocker. :-(
> 
> So maybe you have to broaden the VFIO container to be usable by other
> subsystems. The discussion here is about what the uAPI should look
> like in a fairly abstract way. When we say 'dev/sva' it just some
> placeholder for a shared cdev that provides the necessary
> dis-aggregated functionality 
> 
> It could be an existing cdev with broader functionaltiy, it could
> really be /dev/iommu, etc. This is up to the folks building it to
> decide.
> 
> > I'm not expert on vDPA for now, but I saw you three open source
> > veterans have a similar idea for a place to cover IOMMU handling,
> > I think it may be a valuable thing to do. I said "may be" as I'm not
> > sure about Alex's opinion on such idea. But the sure thing is this
> > idea may introduce weird overlap even re-definition of existing
> > thing as I replied above. We need to evaluate the impact and mature
> > the idea step by step. 
> 
> This has happened before, uAPIs do get obsoleted and replaced with
> more general/better versions. It is often too hard to create a uAPI
> that lasts for decades when the HW landscape is constantly changing
> and sometime a reset is needed. 

I'm throwing this out with a lot of hesitation, but I'm going to :-)

So we have been disussing this for months now, with some high level vision
trying to get the uAPI's solidified with a vDPA hardware that might
potentially have SIOV/SVM like extensions in hardware which actualy doesn't
exist today. Understood people have plans. 

Given that vDPA today has diverged already with duplicating use of IOMMU
api's without making an effort to gravitate to /dev/iommu as how you are
proposing.

I think we all understand creating a permanent uAPI is hard, and they can
evolve in future. 

Maybe  we should start work on how to converge on generalizing the IOMMU
story first with what we have today (vDPA + VFIO) convergence and let it evolve 
with real hardware and new features like SVM/SIOV in mind. This is going 
to take time and we can start with what we have today for pulling vDPA and 
VFIO pieces first.

The question is should we hold hostage the current vSVM/vIOMMU efforts
without even having made an effort for current vDPA/VFIO convergence. 

> 
> The jump to shared PASID based IOMMU feels like one of those moments here.

As we have all noted, even without PASID we have divergence today?


> 
> > > Whoever provides the vIOMMU emulation and relays the page fault to the 
> > > guest
> > > has to translate the RID -
> > 
> > that's the point. But the device info (especially the sub-device info) is
> > within the passthru framework (e.g. VFIO). So page fault reporting needs
> > to go through passthru framework.
> >
> > > what does that have to do with VFIO?
> > > 
> > > How will VPDA provide the vIOMMU emulation?
> > 
> > a pardon here. I believe vIOMMU emulation should be based on IOMMU vendor
> > specification, right? you may correct me if I'm missing anything.
> 
> I'm asking how will VDPA translate the RID when VDPA triggers a page
> fault that has to be relayed to the guest. VDPA also has virtual PCI
> devices it creates.
> 
> We can't rely on VFIO to be the place that the vIOMMU lives because it
> excludes/complicates everything that is not VFIO from using that
> stuff.
> 
> Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-19 Thread Raj, Ashok
Hi Jean

On Mon, Oct 19, 2020 at 04:08:24PM +0200, Jean-Philippe Brucker wrote:
> On Sat, Oct 17, 2020 at 04:25:25AM -0700, Raj, Ashok wrote:
> > > For devices that *don't* use a stop marker, the PCIe spec says (10.4.1.2):
> > > 
> > >   To stop [using a PASID] without using a Stop Marker Message, the
> > >   function shall:
> > >   1. Stop queueing new Page Request Messages for this PASID.
> > 
> > The device driver would need to tell stop sending any new PR's.
> > 
> > >   2. Finish transmitting any multi-page Page Request Messages for this
> > >  PASID (i.e. send the Page Request Message with the L bit Set).
> > >   3. Wait for PRG Response Messages associated any outstanding Page
> > >  Request Messages for the PASID.
> > > 
> > > So they have to flush their PR themselves. And since the device driver
> > > completes this sequence before calling unbind(), then there shouldn't be
> > > any oustanding PR for the PASID, and unbind() doesn't need to flush,
> > > right?
> > 
> > I can see how the device can complete #2,3 above. But the device driver
> > isn't the one managing page-responses right. So in order for the device to
> > know the above sequence is complete, it would need to get some assist from
> > IOMMU driver?
> 
> No the device driver just waits for the device to indicate that it has
> completed the sequence. That's what the magic stop-PASID mechanism
> described by PCIe does. In 6.20.1 "Managing PASID TLP Prefix Usage" it
> says:

The goal is we do this when the device is in a messup up state. So we can't
take for granted the device is properly operating which is why we are going
to wack the device with a flr().

The only thing that's supposed to work without a brain freeze is the
invalidation logic. Spec requires devices to respond to invalidations even when
they are in the process of flr().

So when IOMMU does an invalidation wait with a Page-Drain, IOMMU waits till
the response for that arrives before completing the descriptor. Due to 
the posted semantics it will ensure any PR's issued and in the fabric are 
flushed 
out to memory. 

I suppose you can wait for the device to vouch for all responses, but that
is assuming the device is still functioning properly. Given that we use it
in two places,

* Reclaiming a PASID - only during a tear down sequence, skipping it
  doesn't really buy us much.
* During FLR this can't be skipped anyway due to the above sequence
  requirement. 

> 
> "A Function must have a mechanism to request that it gracefully stop using
>  a specific PASID. This mechanism is device specific but must satisfy the
>  following rules:
>  [...]
>  * When the stop request mechanism indicates completion, the Function has:
>[...]
>* Complied with additional rules described in Address Translation
>  Services (Chapter 10 [10.4.1.2 quoted above]) if Address Translations
>  or Page Requests were issued on the behalf of this PASID."
> 
> So after the device driver initiates this mechanism in the device, the
> device must be able to indicate completion of the mechanism, which
> includes completing all in-flight Page Requests. At that point the device
> driver can call unbind() knowing there is no pending PR for this PASID.
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-17 Thread Raj, Ashok
Hi Jean

On Fri, Oct 16, 2020 at 09:59:23AM +0200, Jean-Philippe Brucker wrote:
> On Thu, Oct 15, 2020 at 11:22:11AM -0700, Raj, Ashok wrote:
> > Hi Jean
> > 
> > + Baolu who is looking into this.
> > 
> > 
> > On Thu, Oct 15, 2020 at 11:00:27AM +0200, Jean-Philippe Brucker wrote:
> > > Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
> > > whether the PRI queue needs flushing. When looking at the PCIe spec
> > > again I noticed that most of the time the SMMUv3 driver doesn't actually
> > > need to flush the PRI queue. Does this make sense for Intel VT-d as well
> > > or did I overlook something?
> > > 
> > > Before calling iommu_sva_unbind_device(), device drivers must stop the
> > > device from using the PASID. For PCIe devices, that consists of
> > > completing any pending DMA, and completing any pending page request
> > > unless the device uses Stop Markers. So unless the device uses Stop
> > > Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
> > > means completing all stall events, so we never need to flush the event
> > > queue.
> > 
> > I don't think this is true. Baolu is working on an enhancement to this,
> > I'll quickly summarize this below:
> > 
> > Stop markers are weird, I'm not certain there is any device today that
> > sends STOP markers. Even if they did, markers don't have a required
> > response, they are fire and forget from the device pov.
> 
> Definitely agree with this, and I hope none will implement stop marker.
> For devices that *don't* use a stop marker, the PCIe spec says (10.4.1.2):
> 
>   To stop [using a PASID] without using a Stop Marker Message, the
>   function shall:
>   1. Stop queueing new Page Request Messages for this PASID.

The device driver would need to tell stop sending any new PR's.

>   2. Finish transmitting any multi-page Page Request Messages for this
>  PASID (i.e. send the Page Request Message with the L bit Set).
>   3. Wait for PRG Response Messages associated any outstanding Page
>  Request Messages for the PASID.
> 
> So they have to flush their PR themselves. And since the device driver
> completes this sequence before calling unbind(), then there shouldn't be
> any oustanding PR for the PASID, and unbind() doesn't need to flush,
> right?

I can see how the device can complete #2,3 above. But the device driver
isn't the one managing page-responses right. So in order for the device to
know the above sequence is complete, it would need to get some assist from
IOMMU driver?

How does the driver know that everything host received has been responded
back to device?

> 
> > I'm not sure about other IOMMU's how they behave, When there is no space in
> > the PRQ, IOMMU auto-responds to the device. This puts the device in a
> > while (1) loop. The fake successful response will let the device do a ATS
> > lookup, and that would fail forcing the device to do another PRQ.
> 
> But in the sequence above, step 1 should ensure that the device will not
> send another PR for any successful response coming back at step 3.

True, but there could be some page-request in flight on its way to the
IOMMU. By draining and getting that round trip back to IOMMU we gaurantee
things in flight are flushed to PRQ after that Drain completes.
> 
> So I agree with the below if we suspect there could be pending PR, but
> given that pending PR are a stop marker thing and we don't know any device
> using stop markers, I wondered why I bothered implementing PRIq flush at
> all for SMMUv3, hence this RFC.
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-15 Thread Raj, Ashok
Hi Jean

+ Baolu who is looking into this.


On Thu, Oct 15, 2020 at 11:00:27AM +0200, Jean-Philippe Brucker wrote:
> Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
> whether the PRI queue needs flushing. When looking at the PCIe spec
> again I noticed that most of the time the SMMUv3 driver doesn't actually
> need to flush the PRI queue. Does this make sense for Intel VT-d as well
> or did I overlook something?
> 
> Before calling iommu_sva_unbind_device(), device drivers must stop the
> device from using the PASID. For PCIe devices, that consists of
> completing any pending DMA, and completing any pending page request
> unless the device uses Stop Markers. So unless the device uses Stop
> Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
> means completing all stall events, so we never need to flush the event
> queue.

I don't think this is true. Baolu is working on an enhancement to this,
I'll quickly summarize this below:

Stop markers are weird, I'm not certain there is any device today that
sends STOP markers. Even if they did, markers don't have a required
response, they are fire and forget from the device pov.

I'm not sure about other IOMMU's how they behave, When there is no space in
the PRQ, IOMMU auto-responds to the device. This puts the device in a
while (1) loop. The fake successful response will let the device do a ATS
lookup, and that would fail forcing the device to do another PRQ. The idea
is somewhere there the OS has repeated the others and this will find a way
in the PRQ. The point is this is less reliable and can't be the only
indication. PRQ draining has a specific sequence. 

The detailed steps are outlined in 
"Chapter 7.10 "Software Steps to Drain Page Requests & Responses"

- Submit invalidation wait with fence flag to ensure all prior
  invalidations are processed.
- submit iotlb followed by devtlb invalidation
- Submit invalidation wait with page-drain to make sure any page-requests
  issued by the device are flushed when this invalidation wait completes.
- If during the above process there was a queue overflow SW can assume no
  outstanding page-requests are there. If we had a queue full condition,
  then sw must repeat step 2,3 above.


To that extent the proposal is as follows: names are suggestive :-) I'm
making this up as I go!

- iommu_stop_page_req() - Kernel needs to make sure we respond to all
  outstanding requests (since we can't drop responses) 
  - Ensure we respond immediatly for anything that comes before the drain
sequence completes
- iommu_drain_page_req() - Which does the above invalidation with
  Page_drain set.

Once the driver has performed a reset and before issuing any new request,
it does iommu_resume_page_req() this will ensure we start processing
incoming page-req after this point.


> 
> First patch adds flags to unbind(), and the second one lets device
> drivers tell whether the PRI queue needs to be flushed.
> 
> Other remarks:
> 
> * The PCIe spec (see quote on patch 2), says that the device signals
>   whether it has sent a Stop Marker or not during Stop PASID. In reality
>   it's unlikely that a given device will sometimes use one stop method
>   and sometimes the other, so it could be a device-wide flag rather than
>   passing it at each unbind(). I don't want to speculate too much about
>   future implementation so I prefer having the flag in unbind().
> 
> * In patch 1, uacce passes 0 to unbind(). To pass the right flag I'm
>   thinking that uacce->ops->stop_queue(), which tells the device driver
>   to stop DMA, should return whether faults are pending. This can be
>   added later once uacce has an actual PCIe user, but we need to
>   remember to do it.

I think intel iommmu driver does this today for SVA when pasid is being
freed. Its still important to go through the drain before that PASID can be
re-purposed.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [Patch V8 0/3] iommu: Add support to change default domain of an iommu group

2020-10-01 Thread Raj, Ashok
Hi Joerg

On Thu, Oct 01, 2020 at 02:58:41PM +0200, Joerg Roedel wrote:
> Hi Ashok,
> 
> On Fri, Sep 25, 2020 at 12:06:17PM -0700, Ashok Raj wrote:
> > Sai Praneeth Prakhya (3):
> >   iommu: Add support to change default domain of an iommu group
> >   iommu: Take lock before reading iommu group default domain type
> >   iommu: Document usage of "/sys/kernel/iommu_groups//type" file
> > 
> >  .../ABI/testing/sysfs-kernel-iommu_groups  |  30 +++
> >  drivers/iommu/iommu.c  | 227 
> > -
> >  2 files changed, 256 insertions(+), 1 deletion(-)
> 
> Thanks for the repost, I can grab it just fine with b4. But this nees
> some more testing on my side and some time in linux-next, so it is too
> late now to queue it for v5.10. Can you please remind me after the next
> merge window? I'll pick it up then and do the testing and it will
> hopefully spend enough time in linux-next.

Yes, I'll try to remind you after the next merge window.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V7 0/3] iommu: Add support to change default domain of an iommu group

2020-09-28 Thread Raj, Ashok
Hi Joerg

On Fri, Sep 25, 2020 at 09:34:23AM +0200, Joerg Roedel wrote:
> Hi Ashok,
> 
> On Thu, Sep 24, 2020 at 10:21:48AM -0700, Raj, Ashok wrote:
> > Just trying to followup on this series.
> > 
> > Sai has moved out of Intel, hence I'm trying to followup on his behalf.
> > 
> > Let me know if you have queued this for the next release.
> 
> Not yet, but I think this is mostly ready. Can you please send a new
> version in a new mail thread so that I can pick it up with b4?

I reposted Sai's patches again here. 

https://lore.kernel.org/linux-iommu/20200925190620.18732-2-ashok@intel.com/

Can you let me know if they are in a format you can pickup> via b4? 

With recent issues we have sending mails out of Intel, wanted to make sure 
everything
made it out in one piece.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V7 0/3] iommu: Add support to change default domain of an iommu group

2020-09-25 Thread Raj, Ashok
Hi Joerg,

thanks!

On Fri, Sep 25, 2020 at 09:34:23AM +0200, Joerg Roedel wrote:
> Hi Ashok,
> 
> On Thu, Sep 24, 2020 at 10:21:48AM -0700, Raj, Ashok wrote:
> > Just trying to followup on this series.
> > 
> > Sai has moved out of Intel, hence I'm trying to followup on his behalf.
> > 
> > Let me know if you have queued this for the next release.
> 
> Not yet, but I think this is mostly ready. Can you please send a new
> version in a new mail thread so that I can pick it up with b4?

So just another version bump, no other changes?

I thought v6-v7 was a version bump...
> 
> Thanks,
> 
>   Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [bugzilla-dae...@bugzilla.kernel.org: [Bug 209149] New: "iommu/vt-d: Enable PCI ACS for platform opt in hint" makes NVMe config space not accessible after S3]

2020-09-24 Thread Raj, Ashok
On Wed, Sep 23, 2020 at 12:45:11PM -0700, Rajat Jain wrote:
> On Wed, Sep 23, 2020 at 9:19 AM Raj, Ashok  wrote:
> >
> > Hi Bjorn
> >
> >
> > On Wed, Sep 23, 2020 at 11:03:27AM -0500, Bjorn Helgaas wrote:
> > > [+cc IOMMU and NVMe folks]
> > >
> > > Sorry, I forgot to forward this to linux-pci when it was first
> > > reported.
> > >
> > > Apparently this happens with v5.9-rc3, and may be related to
> > > 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint"),
> > > which appeared in v5.8-rc3.
> > >
> > > There are several dmesg logs and proposed patches in the bugzilla, but
> > > no analysis yet of what the problem is.  From the first dmesg
> > > attachment (https://bugzilla.kernel.org/attachment.cgi?id=292327):
> >
> > We have been investigating this internally as well. It appears maybe the
> > specupdate for Cometlake is missing the errata documention. The offsets
> > were wrong in some of them, and if its the same issue its likely cause.
> 
> Can you please also confirm if errata applies to Tigerlake ?
> 

We confirmed ICL/TGL isn't affected.


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [bugzilla-dae...@bugzilla.kernel.org: [Bug 209149] New: "iommu/vt-d: Enable PCI ACS for platform opt in hint" makes NVMe config space not accessible after S3]

2020-09-24 Thread Raj, Ashok
Hi Alex

> > Apparently it also requires to disable RR, and I'm not able to confirm if
> > CML requires that as well. 
> > 
> > pci_quirk_disable_intel_spt_pch_acs_redir() also seems to consult the same
> > table, so i'm not sure why we need the other patch in bugzilla is required.
> 
> If we're talking about the Intel bug where PCH root ports implement
> the ACS capability and control registers as dword rather than word
> registers, then how is ACS getting enabled in order to generate an ACS
> violation?  The standard ACS code would write to the control register
> word at offset 6, which is still the read-only capability register on
> those devices.  Thanks,


Right... Maybe we need header log to figure out what exatly is happening. 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [bugzilla-dae...@bugzilla.kernel.org: [Bug 209149] New: "iommu/vt-d: Enable PCI ACS for platform opt in hint" makes NVMe config space not accessible after S3]

2020-09-24 Thread Raj, Ashok
Hi Kai

+ Alex, since he had some of the early quirks authored.

On Thu, Sep 24, 2020 at 12:31:53AM +0800, Kai-Heng Feng wrote:
> [+Cc Christoph]
> 
> > On Sep 24, 2020, at 00:03, Bjorn Helgaas  wrote:
> > 
> > [+cc IOMMU and NVMe folks]
> > 
> > Sorry, I forgot to forward this to linux-pci when it was first
> > reported.
> > 
> > Apparently this happens with v5.9-rc3, and may be related to
> > 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint"),
> > which appeared in v5.8-rc3.
> > 
> > There are several dmesg logs and proposed patches in the bugzilla, but
> > no analysis yet of what the problem is.  From the first dmesg
> > attachment (https://bugzilla.kernel.org/attachment.cgi?id=292327):
> 
> AFAIK Intel is working on it internally.
> Comet Lake probably needs ACS quirk like older generation chips.

I have confirmed with Internal documentation that the problem exists on
Comet Lake. But its fixed ICL and TGL generations.

Unfortunately I do not see if the public specupdate documents are for these
generation chipsets to makes sure all root port id's can be captured.

There is also another entry in bugzilla that was forwarded that referred to
Request Redirect Capability to be always disabled as well. This same
workaround also seems to be turning off RR for the root port. I believe it
should fix it as well. But i saw another patch attached.

Can you tell how you reproduce this? just doing a

#echo mem > /sys/power/state

is sufficient with an attached NVMe drive? 

> 
> > 
> >  [   50.434945] PM: suspend entry (deep)
> >  [   50.802086] nvme :01:00.0: saving config space at offset 0x0 
> > (reading 0x11e0f)
> >  [   50.842775] ACPI: Preparing to enter system sleep state S3
> >  [   50.858922] ACPI: Waking up from system sleep state S3
> >  [   50.883622] nvme :01:00.0: can't change power state from D3hot to 
> > D0 (config space inaccessible)
> >  [   50.947352] nvme :01:00.0: restoring config space at offset 0x0 
> > (was 0x, writing 0x11e0f)
> >  [   50.947816] pcieport :00:1b.0: DPC: containment event, 
> > status:0x1f01 source:0x
> >  [   50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error 
> > detected
> >  [   50.947829] pcieport :00:1b.0: PCIe Bus Error: severity=Uncorrected 
> > (Non-Fatal), type=Transaction Layer, (Receiver ID)
> >  [   50.947830] pcieport :00:1b.0:   device [8086:06ac] error 
> > status/mask=0020/0001
> >  [   50.947831] pcieport :00:1b.0:[21] ACSViol
> > (First)
> >  [   50.947841] pcieport :00:1b.0: AER: broadcast error_detected message
> >  [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > 
> > I suspect the nvme "can't change power state" and restore config space
> > errors are a consequence of the DPC event.  If DPC disables the link,
> > the device is inaccessible.
> > 
> > I don't know what caused the ACS Violation.  The AER TLP Header Log
> > might have a clue, but unfortunately we didn't print it.
> > 

Apparently it also requires to disable RR, and I'm not able to confirm if
CML requires that as well. 

pci_quirk_disable_intel_spt_pch_acs_redir() also seems to consult the same
table, so i'm not sure why we need the other patch in bugzilla is required.


Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V7 0/3] iommu: Add support to change default domain of an iommu group

2020-09-24 Thread Raj, Ashok
Hi Joerg,


On Mon, Sep 07, 2020 at 08:54:44PM -0700, Prakhya, Sai Praneeth wrote:
> Presently, the default domain of an iommu group is allocated during boot time
> and it cannot be changed later. So, the device would typically be either in
> identity (pass_through) mode or the device would be in DMA mode as long as the
> system is up and running. There is no way to change the default domain type
> dynamically i.e. after booting, a device cannot switch between identity mode 
> and
> DMA mode.
> 
> Assume a use case wherein the privileged user would want to use the device in
> pass-through mode when the device is used for host so that it would be high
> performing. Presently, this is not supported. Hence add support to change the
> default domain of an iommu group dynamically.
> 
> Support this by writing to a sysfs file, namely
> "/sys/kernel/iommu_groups//type".
> 
> Testing:
> 
> Tested by dynamically changing storage device (nvme) from
> 1. identity mode to DMA and making sure file transfer works
> 2. DMA mode to identity mode and making sure file transfer works
> Tested only for intel_iommu/vt-d. Would appreciate if someone could test on 
> AMD
> and ARM based machines.
> 

Just trying to followup on this series.

Sai has moved out of Intel, hence I'm trying to followup on his behalf.

Let me know if you have queued this for the next release.

> Based on iommu maintainer's 'next' branch.
> 
> Changes from V6:
> 
> 1. None except for version bump.
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [bugzilla-dae...@bugzilla.kernel.org: [Bug 209149] New: "iommu/vt-d: Enable PCI ACS for platform opt in hint" makes NVMe config space not accessible after S3]

2020-09-23 Thread Raj, Ashok
Hi Bjorn


On Wed, Sep 23, 2020 at 11:03:27AM -0500, Bjorn Helgaas wrote:
> [+cc IOMMU and NVMe folks]
> 
> Sorry, I forgot to forward this to linux-pci when it was first
> reported.
> 
> Apparently this happens with v5.9-rc3, and may be related to
> 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint"),
> which appeared in v5.8-rc3.
> 
> There are several dmesg logs and proposed patches in the bugzilla, but
> no analysis yet of what the problem is.  From the first dmesg
> attachment (https://bugzilla.kernel.org/attachment.cgi?id=292327):

We have been investigating this internally as well. It appears maybe the
specupdate for Cometlake is missing the errata documention. The offsets
were wrong in some of them, and if its the same issue its likely cause. 

Will nudge the hw folks to hunt that down :-(.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v8 3/9] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)

2020-09-17 Thread Raj, Ashok
Thanks Boris.

multiple "again" makes it funny again :-)


On Thu, Sep 17, 2020 at 07:18:49PM +0200, Borislav Petkov wrote:
> On Thu, Sep 17, 2020 at 07:56:09AM -0700, Raj, Ashok wrote:
> > Just tweaked it a bit: 
> > 
> > "When ATS lookup fails for a virtual address, device should use PRI in
> > order to request the virtual address to be paged into the CPU page tables.
> > The device must use ATS again in order the fetch the translation again

s/translation again/translation

> > before use"
> 
> Thanks, amended.
> 


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v8 3/9] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)

2020-09-17 Thread Raj, Ashok
Hi Boris,

On Thu, Sep 17, 2020 at 09:53:38AM +0200, Borislav Petkov wrote:
> On Tue, Sep 15, 2020 at 09:30:07AM -0700, Fenghua Yu wrote:
> > +Background
> > +==
> > +
> > +Shared Virtual Addressing (SVA) allows the processor and device to use the
> > +same virtual addresses avoiding the need for software to translate virtual
> > +addresses to physical addresses. SVA is what PCIe calls Shared Virtual
> > +Memory (SVM).
> > +
> > +In addition to the convenience of using application virtual addresses
> > +by the device, it also doesn't require pinning pages for DMA.
> > +PCIe Address Translation Services (ATS) along with Page Request Interface
> > +(PRI) allow devices to function much the same way as the CPU handling
> > +application page-faults. For more information please refer to the PCIe
> > +specification Chapter 10: ATS Specification.
> > +
> > +Use of SVA requires IOMMU support in the platform. IOMMU also is required
> > +to support PCIe features ATS and PRI. ATS allows devices to cache
> > +translations for virtual addresses. The IOMMU driver uses the 
> > mmu_notifier()
> > +support to keep the device TLB cache and the CPU cache in sync. PRI allows
> > +the device to request paging the virtual address by using the CPU page 
> > tables
> > +before accessing the address.
> 
> That still reads funny, the "the device to request paging the virtual
> address" part. Do you mean that per chance here:
> 
> "Before the device can access that address, the device uses the PRI in
> order to request the virtual address to be paged in into the CPU page
> tables."
> 
Agree, this reads a bit funny.

Just tweaked it a bit: 

"When ATS lookup fails for a virtual address, device should use PRI in
order to request the virtual address to be paged into the CPU page tables.
The device must use ATS again in order the fetch the translation again
before use"

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-16 Thread Raj, Ashok
On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:
> > > If user space wants to bind page tables, create the PASID with
> > > /dev/sva, use ioctls there to setup the page table the way it wants,
> > > then pass the now configured PASID to a driver that can use it. 
> > 
> > Are we talking about bare metal SVA? 
> 
> What a weird term.

Glad you noticed it at v7 :-) 

Any suggestions on something less weird than 
Shared Virtual Addressing? There is a reason why we moved from SVM to SVA.
> 
> > If so, I don't see the need for userspace to know there is a
> > PASID. All user space need is that my current mm is bound to a
> > device by the driver. So it can be a one-step process for user
> > instead of two.
> 
> You've missed the entire point of the conversation, VDPA already needs
> more than "my current mm is bound to a device"

You mean current version of vDPA? or a potential future version of vDPA?

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-15 Thread Raj, Ashok
On Tue, Sep 15, 2020 at 03:45:10PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > > PASID applies widely to many device and needs to be introduced with a
> > > wide community agreement so all scenarios will be supportable.
> > 
> > True, reading some of the earlier replies I was clearly confused as I
> > thought you were talking about mdev again. But now that you stay it, you
> > have moved past mdev and its the PASID interfaces correct?
> 
> Yes, we agreed mdev for IDXD at LPC, didn't talk about PASID.
> 
> > For the native user applications have just 1 PASID per
> > process. There is no need for a quota management.
> 
> Yes, there is. There is a limited pool of HW PASID's. If one user fork
> bombs it can easially claim an unreasonable number from that pool as
> each process will claim a PASID. That can DOS the rest of the system.

Not sure how you had this played out.. For PASID used in ENQCMD today for
our SVM usages, we *DO* not automatically propagate or allocate new PASIDs. 

The new process needs to bind to get a PASID for its own use. For threads
of same process the PASID is inherited. For forks(), we do not
auto-allocate them. Since PASID isn't a sharable resource much like how you
would not pass mmio mmap's to forked processes that cannot be shared correct?
Such as your doorbell space for e.g. 

> 
> If PASID DOS is a worry then it must be solved at the IOMMU level for
> all user applications that might trigger a PASID allocation. VFIO is
> not special.

Feels like you can simply avoid the PASID DOS rather than permit it to
happen. 
> 
> > IIUC, you are asking that part of the interface to move to a API interface
> > that potentially the new /dev/sva and VFIO could share? I think the API's
> > for PASID management themselves are generic (Jean's patchset + Jacob's
> > ioasid set management).
> 
> Yes, the in kernel APIs are pretty generic now, and can be used by
> many types of drivers.

Good, so there is no new requirements here I suppose.
> 
> As JasonW kicked this off, VDPA will need all this identical stuff
> too. We already know this, and I think Intel VDPA HW will need it, so
> it should concern you too :)

This is one of those things that I would disagree and commit :-).. 

> 
> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> reasonable starting point for discussion.

Looks like now we are getting closer to what we need. :-)

Given that PASID api's are general purpose today and any driver can use it
to take advantage. VFIO fortunately or unfortunately has the IOMMU things
abstracted. I suppose that support is also mostly built on top of the
generic iommu* api abstractions in a vendor neutral way? 

I'm still lost on what is missing that vDPA can't build on top of what is
available?

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-15 Thread Raj, Ashok
On Tue, Sep 15, 2020 at 08:33:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 03:44:38PM -0700, Raj, Ashok wrote:
> > Hi Jason,
> > 
> > I thought we discussed this at LPC, but still seems to be going in
> > circles :-(.
> 
> We discused mdev at LPC, not PASID.
> 
> PASID applies widely to many device and needs to be introduced with a
> wide community agreement so all scenarios will be supportable.

True, reading some of the earlier replies I was clearly confused as I
thought you were talking about mdev again. But now that you stay it, you
have moved past mdev and its the PASID interfaces correct?

> 
> > As you had suggested earlier in the mail thread could Jason Wang maybe
> > build out what it takes to have a full fledged /dev/sva interface for vDPA
> > and figure out how the interfaces should emerge? otherwise it appears
> > everyone is talking very high level and with that limited understanding of
> > how things work at the moment. 
> 
> You want Jason Wang to do the work to get Intel PASID support merged?
> Seems a bit of strange request.

I was reading mdev in my head. Not PASID, sorry.

For the native user applications have just 1 PASID per process. There is no
need for a quota management. VFIO being the one used for guest where there
is more PASID's per guest is where this is enforced today. 

IIUC, you are asking that part of the interface to move to a API interface
that potentially the new /dev/sva and VFIO could share? I think the API's
for PASID management themselves are generic (Jean's patchset + Jacob's
ioasid set management). 

Possibly what you need is already available, but not in a specific way that
you expect maybe? 

Let me check with Jacob and let him/Jean pick that up.

Cheers,
Ashok

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-14 Thread Raj, Ashok
Hi Jason,

I thought we discussed this at LPC, but still seems to be going in
circles :-(.


On Mon, Sep 14, 2020 at 04:00:57PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 12:23:28PM -0600, Alex Williamson wrote:
> > On Mon, 14 Sep 2020 14:41:21 -0300
> > Jason Gunthorpe  wrote:
> > 
> > > On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
> > >  
> > > > "its own special way" is arguable, VFIO is just making use of what's
> > > > being proposed as the uapi via its existing IOMMU interface.  
> > > 
> > > I mean, if we have a /dev/sva then it makes no sense to extend the
> > > VFIO interfaces with the same stuff. VFIO should simply accept a PASID
> > > created from /dev/sva and use it just like any other user-DMA driver
> > > would.
> > 
> > I don't think that's absolutely true.  By the same logic, we could say
> > that pci-sysfs provides access to PCI BAR and config space
> > resources,
> 
> No, it is the reverse, VFIO is a better version of pci-sysfs, so
> pci-sysfs is the one that is obsoleted by VFIO. Similarly a /dev/sva
> would be the superset interface for PASID, so whatver VFIO has would
> be obsoleted.

As you had suggested earlier in the mail thread could Jason Wang maybe
build out what it takes to have a full fledged /dev/sva interface for vDPA
and figure out how the interfaces should emerge? otherwise it appears
everyone is talking very high level and with that limited understanding of
how things work at the moment. 

As Kevin pointed out there are several aspects, and a real prototype from
interested people would be the best way to understand the easy/hard aspects
of moving between the proposals.

- PASID allocation and life cycle management
  Managing both 1-1 (as its done today) and also support a guest PASID
  space. (Supporting guest PASID range is required for migration I suppose)
- Page request processing.
- Interaction with vIOMMU, vSVA requires vIOMMU for supporting
  invalidations, forwarding prq and such.
- Supporting ENQCMD in guest. (Today its just in Intel products, but its
  also submitted to PCIe SIG) and if you are a member should be able to see
  that. FWIW, it might already be open for public review, it not now maybe
  pretty soon.
  
  For Intel we have some KVM interaction setting up the guest pasid->host
  pasid interaces.

This has to move ahead of these email discussions, hoping somone with the
right ideas would help move this forward.

Cheers,
Ashok


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-09-14 Thread Raj, Ashok
Hi Jason,

On Mon, Sep 14, 2020 at 10:47:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:
> 
> > > Jason suggest something like /dev/sva. There will be a lot of other
> > > subsystems that could benefit from this (e.g vDPA).
> > 
> > Do you have a more precise idea of the interface /dev/sva would provide,
> > how it would interact with VFIO and others?  vDPA could transport the
> > generic iommu.h structures via its own uAPI, and call the IOMMU API
> > directly without going through an intermediate /dev/sva handle.
> 
> Prior to PASID IOMMU really only makes sense as part of vfio-pci
> because the iommu can only key on the BDF. That can't work unless the
> whole PCI function can be assigned. It is hard to see how a shared PCI
> device can work with IOMMU like this, so may as well use vfio.
> 
> SVA and various vIOMMU models change this, a shared PCI driver can
> absoultely work with a PASID that is assigned to a VM safely, and
> actually don't need to know if their PASID maps a mm_struct or
> something else.

Well, IOMMU does care if its a native mm_struct or something that belongs
to guest. Because you need ability to forward page-requests and pickup
page-responses from guest. Since there is just one PRQ on the IOMMU and
responses can't be sent directly. You have to depend on vIOMMU type
interface in guest to make all of this magic work right?

> 
> So, some /dev/sva is another way to allocate a PASID that is not 1:1
> with mm_struct, as the existing SVA stuff enforces. ie it is a way to
> program the DMA address map of the PASID.
> 
> This new PASID allocator would match the guest memory layout and

Not sure what you mean by "match guest memory layout"? 
Probably, meaning first level is gVA or gIOVA? 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] intel-iommu: don't disable ATS for device without page aligned request

2020-09-10 Thread Raj, Ashok
On Wed, Sep 09, 2020 at 10:17:35PM -0400, Jason Wang wrote:
> 
> 
> - Original Message -
> > Hi Jason
> > 
> > On Wed, Sep 09, 2020 at 04:34:32PM +0800, Jason Wang wrote:
> > > Commit 61363c1474b1 ("iommu/vt-d: Enable ATS only if the device uses
> > > page aligned address.") disables ATS for device that can do unaligned
> > > page request.
> > 
> > Did you take a look at the PCI specification?
> > Page Aligned Request is in the ATS capability Register.
> > 
> > ATS Capability Register (Offset 0x04h)
> > 
> > bit (5):
> > Page Aligned Request - If Set, indicates the Untranslated address is always
> > aligned to 4096 byte boundary. Setting this field is recommended. This
> > field permits software to distinguish between implemntations compatible
> > with this specification and those compatible with an earlier version of
> > this specification in which a Requester was permitted to supply anything in
> > bits [11:2].
> 
> Yes, my understanding is that this is optional not mandatory.

Correct, but optional on the device side. An IOMMU might *require* this for
proper normal operation. Our IOMMU's do not get the low 12 bits. Which is
why the spec gives SW a way to detect if the device is compatible for this
IOMMU implementation.

> 
> > 
> > > 
> > > This looks wrong, since the commit log said it's because the page
> > > request descriptor doesn't support reporting unaligned request.
> > 
> > I don't think you can change the definition from ATS to PRI. Both are
> > orthogonal feature.
> 
> I may miss something, here's my understanding is that:
> 
> - page request descriptor will only be used when PRS is enabled
> - ATS spec allows unaligned request
> 
> So any reason for disabling ATS for unaligned request even if PRS is
> not enabled?

I think you are getting confused between the 2 different PCIe features.

ATS - Address Translation Services. Used by device to simply request the
Host Physical Address for some DMA operation. 

When ATS response indicates failed, then the device can request a
page-request (PRS this is like a device page-fault), and then IOMMU driver
would work with the kernel to fault a page then respond with
(Page-response) success/failure. Then the device will send a new ATS 
to get the new translation. 


> 
> > 
> > > 
> > > A victim is Qemu's virtio-pci which doesn't advertise the page aligned
> > > address. Fixing by disable PRI instead of ATS if device doesn't have
> > > page aligned request.
> > 
> > This is a requirement for the Intel IOMMU's.
> > 
> > You say virtio, so is it all emulated device or you talking about some
> > hardware that implemented virtio-pci compliant hw? If you are sure the
> > device actually does comply with the requirement, but just not enumerating
> > the capability, you can maybe work a quirk to overcome that?
> 
> So far only emulated devices. But we are helping some vendor to
> implement virtio hardware so  we need to understand the connection
> between ATS alignment and page request descriptor.

ATS and PRS are 2 separate orthogonal features. 
PRS requires ATS, but not the other way around. 

> 
> > 
> > Now PRI also has an alignment requirement, and Intel IOMMU's requires that
> > as well. If your device supports SRIOV as well, PASID and PRI are
> > enumerated just on the PF and not the VF. You might want to pay attension
> > to that. We are still working on a solution for that problem.
> 
> Thanks for the reminding, but it looks to me according to the ATS
> spec, all PRI message is 4096 byte aligned? E.g lower bites were used
> for group index etc.

Right, I should have been clear. The issue with PRI is we require responses
to have PASID field set. There is another capability on the device that
exposes that. pci_prg_resp_pasid_required(). This is required to enable PRI
for a device.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] intel-iommu: don't disable ATS for device without page aligned request

2020-09-09 Thread Raj, Ashok
Hi Jason

On Wed, Sep 09, 2020 at 04:34:32PM +0800, Jason Wang wrote:
> Commit 61363c1474b1 ("iommu/vt-d: Enable ATS only if the device uses
> page aligned address.") disables ATS for device that can do unaligned
> page request.

Did you take a look at the PCI specification?
Page Aligned Request is in the ATS capability Register.

ATS Capability Register (Offset 0x04h)

bit (5):
Page Aligned Request - If Set, indicates the Untranslated address is always
aligned to 4096 byte boundary. Setting this field is recommended. This
field permits software to distinguish between implemntations compatible
with this specification and those compatible with an earlier version of
this specification in which a Requester was permitted to supply anything in
bits [11:2].

> 
> This looks wrong, since the commit log said it's because the page
> request descriptor doesn't support reporting unaligned request.

I don't think you can change the definition from ATS to PRI. Both are
orthogonal feature.

> 
> A victim is Qemu's virtio-pci which doesn't advertise the page aligned
> address. Fixing by disable PRI instead of ATS if device doesn't have
> page aligned request.

This is a requirement for the Intel IOMMU's.

You say virtio, so is it all emulated device or you talking about some
hardware that implemented virtio-pci compliant hw? If you are sure the
device actually does comply with the requirement, but just not enumerating
the capability, you can maybe work a quirk to overcome that?

Now PRI also has an alignment requirement, and Intel IOMMU's requires that
as well. If your device supports SRIOV as well, PASID and PRI are
enumerated just on the PF and not the VF. You might want to pay attension
to that. We are still working on a solution for that problem.

I don't think this is the right fix for your problem. 

> 
> Cc: sta...@vger.kernel.org
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Keith Busch 
> Cc: Kuppuswamy Sathyanarayanan 
> Signed-off-by: Jason Wang 
> ---
>  drivers/iommu/intel/iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

2020-09-03 Thread Raj, Ashok
Hi Thomas,

Thanks a ton for jumping in helping on straightening it for IMS!!!


On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)

s/Storm/Store

maybe pun intended :-)

> based devices in a halfways architecture independent way.

You mean "halfways" because the message addr and data follow guidelines
per arch (x86 or such), but the location of the storage isn't dictated
by architecture? or did you have something else in mind? 

> 
> The first version can be found here:
> 
> https://lore.kernel.org/r/20200821002424.119492...@linutronix.de
> 

[snip]

> 
> Changes vs. V1:
> 
>- Addressed various review comments and addressed the 0day fallout.
>  - Corrected the XEN logic (Jürgen)
>  - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
> 
>- Fixed the compose MSI message inconsistency
> 
>- Ensure that the necessary flags are set for device SMI

is that supposed to be MSI? 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Raj, Ashok
On Mon, Aug 03, 2020 at 08:12:18AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen  wrote:
> >
> > On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> > >> Thomas suggested to provide a reason for the #GP caused by executing 
> > >> ENQCMD
> > >> without a valid PASID value programmed. #GP error codes are 16 bits and 
> > >> all
> > >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The 
> > >> other
> > >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> > >> when loading from the source operand, so its not fully comprehending all
> > >> the reasons. Rather than special case the ENQCMD, in future Intel may
> > >> choose a different fault mechanism for such cases if recovery is needed 
> > >> on
> > >> #GP.
> > > Decoding the user instruction is ugly and sets a bad architecture
> > > precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.
> >
> > I'll try to do one more bit of convincing. :)
> >
> > In the end, we need a way to figure out if the #GP was from a known "OK"
> > source that we can fix up.  You're right that we could fire up the
> > instruction decoder to help answer that question.  But, it (also)
> > doesn't easily yield a perfect answer as to the source of the #GP, it
> > always involves a user copy, and it's a larger code impact than what
> > we've got.
> >
> > I think I went and looked at fixup_umip_exception(), and compared it to
> > the alternative which is essentially just these three lines of code:
> >
> > > + /*
> > > +  * If the current task already has a valid PASID in the MSR,
> > > +  * the #GP must be for some other reason.
> > > +  */
> > > + if (current->has_valid_pasid)
> > > + return false;
> > ...> +  /* Now the current task has a valid PASID in the MSR. */
> > > + current->has_valid_pasid = 1;
> >
> > and *I* was convinced that instruction decoding wasn't worth it.
> >
> > There's a lot of stuff that fixup_umip_exception() does which we don't
> > have to duplicate, but it's going to be really hard to get it anywhere
> > near as compact as what we've got.
> >
> 
> I could easily be convinced that the PASID fixup is so trivial and so
> obviously free of misfiring in a way that causes an infinite loop that
> this code is fine.  But I think we first need to answer the bigger
> question of why we're doing a lazy fixup in the first place.

We choose lazy fixup for 2 reasons. 

- If some threads were already created before the MSR is programmed, then
  we need to fixup those in a race free way. scheduling some task-work etc.
  We did do that early on, but decided it was ugly. 
- Not all threads need to submit ENQCMD, force feeding the MSR probably
  isn't even required for all. Yes the overhead isn't probably big, but
  might not even be required for all threads.

We needed to fixup MSR in two different way. To keep the code simple, the
choice was to only fixup on #GP, that eliminated the extra code we need to
support case1.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 1/1] PCI/ATS: Check PRI supported on the PF device when SRIOV is enabled

2020-07-28 Thread Raj, Ashok
Hi Sasha

On Mon, Jul 27, 2020 at 09:24:35PM +, Sasha Levin wrote:
> Hi
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag
> fixing commit: b16d0cb9e2fc ("iommu/vt-d: Always enable PASID/PRI PCI 
> capabilities before ATS").
> 
> The bot has tested the following trees: v5.7.10, v5.4.53, v4.19.134, 
> v4.14.189, v4.9.231, v4.4.231.

Looks like the dependency is making this more involved with the backport. I

We could pursue a simpler fix for these older versions where there is a
conflict, but I'm not sure if that's recommended. 

In addition from our perspective 5.7 and above if there are other products
that require PASID/PRI on prior versions for SRIOV devices  we can drop the
backports. I see the same issue with other IOMMU's, for e.g. AMD as
well, I'm not sure if there are real regressions. 

> 
> v5.7.10: Build OK!
> v5.4.53: Failed to apply! Possible dependencies:
> 2b0ae7cc3bfc ("PCI/ATS: Handle sharing of PF PASID Capability with all 
> VFs")
> 751035b8dc06 ("PCI/ATS: Cache PASID Capability offset")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> 
> v4.19.134: Failed to apply! Possible dependencies:
> 2b0ae7cc3bfc ("PCI/ATS: Handle sharing of PF PASID Capability with all 
> VFs")
> 4f802170a861 ("PCI/DPC: Save and restore config state")
> 6e1ffbb7c2ab ("PCI: Move ATS declarations outside of CONFIG_PCI")
> 751035b8dc06 ("PCI/ATS: Cache PASID Capability offset")
> 8c938ddc6df3 ("PCI/ATS: Add pci_ats_page_aligned() interface")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> 9c2120090586 ("PCI: Provide pci_match_id() with CONFIG_PCI=n")
> b92b512a435d ("PCI: Make pci_ats_init() private")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_required() interface.")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> fff42928ade5 ("PCI/ATS: Add inline to pci_prg_resp_pasid_required()")
> 
> v4.14.189: Failed to apply! Possible dependencies:
> 1b79c5284439 ("PCI: cadence: Add host driver for Cadence PCIe controller")
> 1e4511604dfa ("PCI/AER: Expose internal API for obtaining AER 
> information")
> 3133e6dd07ed ("PCI: Tidy Makefiles")
> 37dddf14f1ae ("PCI: cadence: Add EndPoint Controller driver for Cadence 
> PCIe controller")
> 4696b828ca37 ("PCI/AER: Hoist aerdrv.c, aer_inject.c up to 
> drivers/pci/pcie/")
> 4f802170a861 ("PCI/DPC: Save and restore config state")
> 8c938ddc6df3 ("PCI/ATS: Add pci_ats_page_aligned() interface")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> 9de0eec29c07 ("PCI: Regroup all PCI related entries into 
> drivers/pci/Makefile")
> b92b512a435d ("PCI: Make pci_ats_init() private")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> d3252ace0bc6 ("PCI: Restore resized BAR state on resume")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_required() interface.")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> fff42928ade5 ("PCI/ATS: Add inline to pci_prg_resp_pasid_required()")
> 
> v4.9.231: Failed to apply! Possible dependencies:
> 4ebeb1ec56d4 ("PCI: Restore PRI and PASID state after Function-Level 
> Reset")
> 8c938ddc6df3 ("PCI/ATS: Add pci_ats_page_aligned() interface")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> a4f4fa681add ("PCI: Cache PRI and PASID bits in pci_dev")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_required() interface.")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> fff42928ade5 ("PCI/ATS: Add inline to pci_prg_resp_pasid_required()")
> 
> v4.4.231: Failed to apply! Possible dependencies:
> 2a2aca316aed ("PCI: Include  for isa_dma_bridge_buggy")
> 4d3f13845957 ("PCI: Add pci_unmap_iospace() to unmap I/O resources")
> 4ebeb1ec56d4 ("PCI: Restore PRI and PASID state after Function-Level 
> Reset")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> a4f4fa681add ("PCI: Cache PRI and PASID bits in pci_dev")
> c5076cfe7689 ("PCI, of: Move PCI I/O space management to PCI core code")
> e5567f5f6762 ("PCI/ATS: Add 

Re: [PATCH] PCI/ATS: PASID and PRI are only enumerated in PF devices.

2020-07-23 Thread Raj, Ashok
Hi Bjorn

On Tue, Jul 21, 2020 at 09:54:01AM -0500, Bjorn Helgaas wrote:
> On Mon, Jul 20, 2020 at 09:43:00AM -0700, Ashok Raj wrote:
> > PASID and PRI capabilities are only enumerated in PF devices. VF devices
> > do not enumerate these capabilites. IOMMU drivers also need to enumerate
> > them before enabling features in the IOMMU. Extending the same support as
> > PASID feature discovery (pci_pasid_features) for PRI.
> > 
> > Signed-off-by: Ashok Raj 
> 
> Hi Ashok,
> 
> When you update this for the 0-day implicit declaration thing, can you
> update the subject to say what the patch *does*, as opposed to what it
> is solving?  Also, no need for a period at the end.

Yes, will update and resend. Goofed up a couple things, i'll update those
as well.


> 
> Does this fix a regression?  Is it associated with a commit that we
> could add as a "Fixes:" tag so we know how far back to try to apply
> to stable kernels?

Yes, but the iommu files moved location and git fixes tags only generates
for a few handful of commits and doesn't show the old ones. 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 4/4] PCI/ACS: Enable PCI_ACS_TB for untrusted/external-facing devices

2020-07-10 Thread Raj, Ashok
Hi Bjorn


On Fri, Jul 10, 2020 at 03:29:22PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 07, 2020 at 03:46:04PM -0700, Rajat Jain wrote:
> > When enabling ACS, enable translation blocking for external facing ports
> > and untrusted devices.
> > 
> > Signed-off-by: Rajat Jain 
> > ---
> > v4: Add braces to avoid warning from kernel robot
> > print warning for only external-facing devices.
> > v3: print warning if ACS_TB not supported on external-facing/untrusted 
> > ports.
> > Minor code comments fixes.
> > v2: Commit log change
> > 
> >  drivers/pci/pci.c|  8 
> >  drivers/pci/quirks.c | 15 +++
> >  2 files changed, 23 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 73a8627822140..a5a6bea7af7ce 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -876,6 +876,14 @@ static void pci_std_enable_acs(struct pci_dev *dev)
> > /* Upstream Forwarding */
> > ctrl |= (cap & PCI_ACS_UF);
> >  
> > +   /* Enable Translation Blocking for external devices */
> > +   if (dev->external_facing || dev->untrusted) {
> > +   if (cap & PCI_ACS_TB)
> > +   ctrl |= PCI_ACS_TB;
> > +   else if (dev->external_facing)
> > +   pci_warn(dev, "ACS: No Translation Blocking on 
> > external-facing dev\n");
> > +   }
> 
> IIUC, this means that external devices can *never* use ATS and can
> never cache translations.  And (I guess, I'm not an expert) it can
> also never use the Page Request Services?

Yep, sounds like it.

> 
> Is this what we want?  Do we have any idea how many external devices
> this will affect or how much of a performance impact they will see?
> 
> Do we need some kind of override or mechanism to authenticate certain
> devices so they can use ATS and PRI?

Sounds like we would need some form of an allow-list to start with so we
can have something in the interim. 

I suppose a future platform might have a facilty to ensure ATS is secure and
authenticated we could enable for all of devices in the system, in addition
to PCI CMA/IDE. 

I think having a global override to enable all devices so platform can
switch to current behavior, or maybe via a cmdline switch.. as much as we
have a billion of those, it still gives an option in case someone needs it.



> 
> If we do decide this is the right thing to do, I think we need to
> expand the commit log a bit, because this is potentially a significant
> user-visible change.
> 
> > pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
> >  }
> >  
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index b341628e47527..bb22b46c1d719 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4934,6 +4934,13 @@ static void pci_quirk_enable_intel_rp_mpc_acs(struct 
> > pci_dev *dev)
> > }
> >  }
> >  
> > +/*
> > + * Currently this quirk does the equivalent of
> > + * PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF
> > + *
> > + * TODO: This quirk also needs to do equivalent of PCI_ACS_TB,
> > + * if dev->external_facing || dev->untrusted
> > + */
> >  static int pci_quirk_enable_intel_pch_acs(struct pci_dev *dev)
> >  {
> > if (!pci_quirk_intel_pch_acs_match(dev))
> > @@ -4973,6 +4980,14 @@ static int pci_quirk_enable_intel_spt_pch_acs(struct 
> > pci_dev *dev)
> > ctrl |= (cap & PCI_ACS_CR);
> > ctrl |= (cap & PCI_ACS_UF);
> >  
> > +   /* Enable Translation Blocking for external devices */
> > +   if (dev->external_facing || dev->untrusted) {
> > +   if (cap & PCI_ACS_TB)
> > +   ctrl |= PCI_ACS_TB;
> > +   else if (dev->external_facing)
> > +   pci_warn(dev, "ACS: No Translation Blocking on 
> > external-facing dev\n");
> > +   }
> > +
> > pci_write_config_dword(dev, pos + INTEL_SPT_ACS_CTRL, ctrl);
> >  
> > pci_info(dev, "Intel SPT PCH root port ACS workaround enabled\n");
> > -- 
> > 2.27.0.212.ge8ba1cc988-goog
> > 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] pci: acs: Enable PCI_ACS_TB for untrusted/external-facing devices

2020-06-19 Thread Raj, Ashok
Hi Rajat


On Mon, Jun 15, 2020 at 06:17:41PM -0700, Rajat Jain wrote:
> When enabling ACS, currently the bit "translation blocking" was
> not getting changed at all. Set it to disable translation blocking

Maybe you meant "enable translation blocking" ?

Keep the commit log simple:

When enabling ACS, enable translation blocking for external facing ports
and untrusted devices.

> too for all external facing or untrusted devices. This is OK
> because ATS is only allowed on internal devces.
> 
> Signed-off-by: Rajat Jain 
> ---
>  drivers/pci/pci.c|  4 
>  drivers/pci/quirks.c | 11 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d2ff987585855..79853b52658a2 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3330,6 +3330,10 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   /* Upstream Forwarding */
>   ctrl |= (cap & PCI_ACS_UF);
>  
> + if (dev->external_facing || dev->untrusted)
> + /* Translation Blocking */
> + ctrl |= (cap & PCI_ACS_TB);
> +
>   pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index b341628e47527..6294adeac4049 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4934,6 +4934,13 @@ static void pci_quirk_enable_intel_rp_mpc_acs(struct 
> pci_dev *dev)
>   }
>  }
>  
> +/*
> + * Currently this quirk does the equivalent of
> + * PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_SV
> + *
> + * Currently missing, it also needs to do equivalent of PCI_ACS_TB,
> + * if dev->external_facing || dev->untrusted
> + */
>  static int pci_quirk_enable_intel_pch_acs(struct pci_dev *dev)
>  {
>   if (!pci_quirk_intel_pch_acs_match(dev))
> @@ -4973,6 +4980,10 @@ static int pci_quirk_enable_intel_spt_pch_acs(struct 
> pci_dev *dev)
>   ctrl |= (cap & PCI_ACS_CR);
>   ctrl |= (cap & PCI_ACS_UF);
>  
> + if (dev->external_facing || dev->untrusted)
> + /* Translation Blocking */
> + ctrl |= (cap & PCI_ACS_TB);
> +
>   pci_write_config_dword(dev, pos + INTEL_SPT_ACS_CTRL, ctrl);
>  
>   pci_info(dev, "Intel SPT PCH root port ACS workaround enabled\n");
> -- 
> 2.27.0.290.gba653c62da-goog
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 4/4] pci: export untrusted attribute in sysfs

2020-06-18 Thread Raj, Ashok
Hi Greg,


On Thu, Jun 18, 2020 at 06:02:12PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 18, 2020 at 08:03:49AM -0700, Rajat Jain wrote:
> > Hello,
> > 
> > On Thu, Jun 18, 2020 at 2:14 AM Andy Shevchenko
> >  wrote:
> > >
> > > On Thu, Jun 18, 2020 at 11:36 AM Greg Kroah-Hartman
> > >  wrote:
> > > >
> > > > On Thu, Jun 18, 2020 at 11:12:56AM +0300, Andy Shevchenko wrote:
> > > > > On Wed, Jun 17, 2020 at 10:56 PM Rajat Jain  
> > > > > wrote:
> > > > > > On Wed, Jun 17, 2020 at 12:31 AM Christoph Hellwig 
> > > > > >  wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > (and likely call it "external" instead of "untrusted".
> > > > >
> > > > > Which is not okay. 'External' to what? 'untrusted' has been carefully
> > > > > chosen by the meaning of it.
> > > > > What external does mean for M.2. WWAN card in my laptop? It's in ACPI
> > > > > tables, but I can replace it.
> > > >
> > > > Then your ACPI tables should show this, there is an attribute for it,
> > > > right?
> > >
> > > There is a _PLD() method, but it's for the USB devices (or optional
> > > for others, I don't remember by heart). So, most of the ACPI tables,
> > > alas, don't show this.
> > >
> > > > > This is only one example. Or if firmware of some device is altered,
> > > > > and it's internal (whatever it means) is it trusted or not?
> > > >
> > > > That is what people are using policy for today, if you object to this,
> > > > please bring it up to those developers :)
> > >
> > > > > So, please leave it as is (I mean name).
> > > >
> > > > firmware today exports this attribute, why do you not want userspace to
> > > > also know it?
> > 
> > To clarify, the attribute exposed by the firmware today is
> > "ExternalFacingPort" and "external-facing" respectively:
> > 
> > 617654aae50e ("PCI / ACPI: Identify untrusted PCI devices")
> > 9cb30a71ac45d("PCI: OF: Support "external-facing" property")
> > 
> > The kernel flag was named "untrusted" though, hence the assumption
> > that "external=untrusted" is currently baked into the kernel today.
> > IMHO, using "external" would fix that (The assumption can thus be
> > contained in the IOMMU drivers) and at the same time allow more use of
> > this attribute.
> > 
> > > >
> > > > Trust is different, yes, don't get the two mixed up please.  That should
> > > > be a different sysfs attribute for obvious reasons.
> > >
> > > Yes, as a bottom line that's what I meant as well.
> > 
> > So what is the consensus here? I don't have a strong opinion - but it
> > seemed to me Greg is saying "external" and Andy is saying "untrusted"?
> 
> Those two things are totally separate things when it comes to a device.

Agree that these are two separate attributes, and better not mixed.

> 
> One (external) describes the location of the device in the system.
> 
> The other (untrusted) describes what you want the kernel to do with this
> device (trust or not trust it).
> 
> One you can change (from trust to untrusted or back), the other you can
> not, it is a fixed read-only property that describes the hardware device
> as defined by the firmware.

The genesis is due to lack of a mechanism to establish if the device 
is trusted or not was the due lack of some specs and implementation around
Component Measurement And Authentication (CMA). Treating external as
untrusted was the best first effort. i.e trust internal 
devices and don't trust external devices for enabling ATS.

But that said external is just describing topology, and if Linux wants to
use that in the policy that's different. Some day external device may also
use CMA to estabilish trust. FWIW even internal devices aren't trust
worthy, except maybe RCIEP's. 

> 
> Depending on the policy you wish to define, you can use the location of
> the device to determine if you want to trust the device or not.
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Raj, Ashok
On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> On Tue, Jun 16, 2020 at 02:26:38AM +, Tian, Kevin wrote:
> > > From: Stefan Hajnoczi 
> > > Sent: Monday, June 15, 2020 6:02 PM
> > > 
> > > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > > Intel platforms allows address space sharing between device DMA and
> > > > applications. SVA can reduce programming complexity and enhance
> > > security.
> > > >
> > > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > > guest application address space with passthru devices. This is called
> > > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > > > in the "Related series").
> > > >
> > > > The high-level architecture for SVA virtualization is as below, the key
> > > > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > > > also known as IOMMU nesting translation) capability in host IOMMU.
> > > >
> > > >
> > > > .-.  .---.
> > > > |   vIOMMU|  | Guest process CR3, FL only|
> > > > | |  '---'
> > > > ./
> > > > | PASID Entry |--- PASID cache flush -
> > > > '-'   |
> > > > | |   V
> > > > | |CR3 in GPA
> > > > '-'
> > > > Guest
> > > > --| Shadow |--|
> > > >   vv  v
> > > > Host
> > > > .-.  .--.
> > > > |   pIOMMU|  | Bind FL for GVA-GPA  |
> > > > | |  '--'
> > > > ./  |
> > > > | PASID Entry | V (Nested xlate)
> > > > '\.--.
> > > > | |   |SL for GPA-HPA, default domain|
> > > > | |   '--'
> > > > '-'
> > > > Where:
> > > >  - FL = First level/stage one page tables
> > > >  - SL = Second level/stage two page tables
> > > 
> > > Hi,
> > > Looks like an interesting feature!
> > > 
> > > To check I understand this feature: can applications now pass virtual
> > > addresses to devices instead of translating to IOVAs?
> > > 
> > > If yes, can guest applications restrict the vSVA address space so the
> > > device only has access to certain regions?
> > > 
> > > On one hand replacing IOVA translation with virtual addresses simplifies
> > > the application programming model, but does it give up isolation if the
> > > device can now access all application memory?
> > > 
> > 
> > with SVA each application is allocated with a unique PASID to tag its
> > virtual address space. The device that claims SVA support must guarantee 
> > that one application can only program the device to access its own virtual
> > address space (i.e. all DMAs triggered by this application are tagged with
> > the application's PASID, and are translated by IOMMU's PASID-granular
> > page table). So, isolation is not sacrificed in SVA.
> 
> Isolation between applications is preserved but there is no isolation
> between the device and the application itself. The application needs to
> trust the device.

Right. With all convenience comes security trust. With SVA there is an
expectation that the device has the required security boundaries properly
implemented. FWIW, what is our guarantee today that VF's are secure from
one another or even its own PF? They can also generate transactions with
any of its peer id's and there is nothing an IOMMU can do today. Other than
rely on ACS. Even BusMaster enable can be ignored and devices (malicious
or otherwise) can generate after the BM=0. With SVM you get the benefits of

* Not having to register regions
* Don't need to pin application space for DMA.

> 
> Examples:
> 
> 1. The device can snoop secret data from readable pages in the
>application's virtual memory space.

Aren't there other security technologies that can address this?

> 
> 2. The device can gain arbitrary execution on the CPU by overwriting
>control flow addresses (e.g. function pointers, stack return
>addresses) in writable pages.

I suppose technology like CET might be able to guard. The general
expectation is code pages and anything that needs to be protected should be
mapped nor writable.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Raj, Ashok
On Mon, Jun 15, 2020 at 06:03:57PM +0200, Peter Zijlstra wrote:
> 
> I don't get why you need a rdmsr here, or why not having one would
> require a TIF flag. Is that because this MSR is XSAVE/XRSTOR managed?
> 
> > > > +*/
> > > > +   rdmsrl(MSR_IA32_PASID, pasid_msr);
> > > > +   if (pasid_msr & MSR_IA32_PASID_VALID)
> > > > +   return false;
> > > > +
> > > > +   /* Fix up the MSR if the MSR doesn't have a valid PASID. */
> > > > +   wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
> 
> How much more expensive is the wrmsr over the rdmsr? Can't we just
> unconditionally write the current PASID and call it a day?

The reason to check the rdmsr() is because we are using a hueristic taking
GP faults. If we already setup the MSR, but we get it a second time it
means the reason is something other than PASID_MSR not being set.

Ideally we should use the TIF_ to track this would be cheaper, but we were
told those bits aren't easy to give out. 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3] iommu/vt-d: Don't apply gfx quirks to untrusted devices

2020-06-02 Thread Raj, Ashok
On Tue, Jun 02, 2020 at 04:26:02PM -0700, Rajat Jain wrote:
> Currently, an external malicious PCI device can masquerade the VID:PID
> of faulty gfx devices, and thus apply iommu quirks to effectively
> disable the IOMMU restrictions for itself.
> 
> Thus we need to ensure that the device we are applying quirks to, is
> indeed an internal trusted device.
> 
> Signed-off-by: Rajat Jain 
> Acked-by: Lu Baolu 

With these changes

Reviewed-by: Ashok Raj 

> ---
> v3: - Separate out the warning mesage in a function to be called from
>   other places. Change the warning string as suggested.
> v2: - Change the warning print strings.
> - Add Lu Baolu's acknowledgement.
> 
>  drivers/iommu/intel-iommu.c | 37 +
>  1 file changed, 37 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ef0a5246700e5..dc859f02985a0 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6185,6 +6185,23 @@ intel_iommu_domain_set_attr(struct iommu_domain 
> *domain,
>   return ret;
>  }
>  
> +/*
> + * Check that the device does not live on an external facing PCI port that is
> + * marked as untrusted. Such devices should not be able to apply quirks and
> + * thus not be able to bypass the IOMMU restrictions.
> + */
> +static bool risky_device(struct pci_dev *pdev)
> +{
> + if (pdev->untrusted) {
> + pci_warn(pdev,
> +  "Skipping IOMMU quirk for dev (%04X:%04X) on untrusted"
> +  " PCI link. Please check with your BIOS/Platform"
> +  " vendor about this\n", pdev->vendor, pdev->device);
> + return true;
> + }
> + return false;
> +}
> +
>  const struct iommu_ops intel_iommu_ops = {
>   .capable= intel_iommu_capable,
>   .domain_alloc   = intel_iommu_domain_alloc,
> @@ -6214,6 +6231,9 @@ const struct iommu_ops intel_iommu_ops = {
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
>  {
> + if (risky_device(dev))
> + return;
> +
>   pci_info(dev, "Disabling IOMMU for graphics on this chipset\n");
>   dmar_map_gfx = 0;
>  }
> @@ -6255,6 +6275,9 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163D, 
> quirk_iommu_igfx);
>  
>  static void quirk_iommu_rwbf(struct pci_dev *dev)
>  {
> + if (risky_device(dev))
> + return;
> +
>   /*
>* Mobile 4 Series Chipset neglects to set RWBF capability,
>* but needs it. Same seems to hold for the desktop versions.
> @@ -6285,6 +6308,9 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev 
> *dev)
>  {
>   unsigned short ggc;
>  
> + if (risky_device(dev))
> + return;
> +
>   if (pci_read_config_word(dev, GGC, ))
>   return;
>  
> @@ -6318,6 +6344,12 @@ static void __init check_tylersburg_isoch(void)
>   pdev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x3a3e, NULL);
>   if (!pdev)
>   return;
> +
> + if (risky_device(pdev)) {
> + pci_dev_put(pdev);
> + return;
> + }
> +
>   pci_dev_put(pdev);
>  
>   /* System Management Registers. Might be hidden, in which case
> @@ -6327,6 +6359,11 @@ static void __init check_tylersburg_isoch(void)
>   if (!pdev)
>   return;
>  
> + if (risky_device(pdev)) {
> + pci_dev_put(pdev);
> + return;
> + }
> +
>   if (pci_read_config_dword(pdev, 0x188, )) {
>   pci_dev_put(pdev);
>   return;
> -- 
> 2.27.0.rc2.251.g90737beb825-goog
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/vt-d: Don't apply gfx quirks to untrusted devices

2020-06-02 Thread Raj, Ashok
On Tue, Jun 02, 2020 at 06:43:00PM +, Rajat Jain wrote:
> Hi MIka,
> 
> Thanks for taking a look.
> 
> On Tue, Jun 2, 2020 at 2:50 AM Mika Westerberg
>  wrote:
> >
> > On Mon, Jun 01, 2020 at 10:45:17PM -0700, Rajat Jain wrote:
> > > Currently, an external malicious PCI device can masquerade the VID:PID
> > > of faulty gfx devices, and thus apply iommu quirks to effectively
> > > disable the IOMMU restrictions for itself.
> > >
> > > Thus we need to ensure that the device we are applying quirks to, is
> > > indeed an internal trusted device.
> > >
> > > Signed-off-by: Rajat Jain 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 28 
> > >  1 file changed, 28 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > index ef0a5246700e5..f2a480168a02f 100644
> > > --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -6214,6 +6214,11 @@ const struct iommu_ops intel_iommu_ops = {
> > >
> > >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > >  {
> > > + if (dev->untrusted) {
> > > + pci_warn(dev, "skipping iommu quirk for untrusted gfx 
> > > dev\n");
> >
> > I think you should be consistent with other messages. For example iommu
> > should be spelled IOMMU as done below.
> >
> > Also this is visible to users so maybe put bit more information there:
> >
> >   pci_warn(dev, "Will not apply IOMMU quirk for untrusted graphics 
> > device\n");
> >
> > Ditto for all the other places. Also is "untrusted" good word here? If
> > an ordinary user sees this will it trigger some sort of panic reaction.
> > Perhaps we should call it "potentially untrusted" or something like
> > that?

Wish we called it external_facing rather than untrusted attribute, so its
description is consistent with the spec that defines it. Once we have
Platform Component Security Enhancements. 

to be correct, maybe call "Device located on an untrusted link" rather than
assert blame on the device.

Since the information is harvsted from BIOS tables and there are chances
this could be wrongly grouped such, add "Check with your BIOS/Platform
Vendor.



> 
> Fixed it, posted new patch at
> https://lkml.org/lkml/2020/6/2/822
> 
> Thanks,
> 
> Rajat
> 
> >
> > > + return;
> > > + }
> > > +
> > >   pci_info(dev, "Disabling IOMMU for graphics on this chipset\n");
> > >   dmar_map_gfx = 0;
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] iommu/vt-d: Don't apply gfx quirks to untrusted devices

2020-06-02 Thread Raj, Ashok
Hi Rajat

On Tue, Jun 02, 2020 at 11:41:33AM -0700, Rajat Jain wrote:
> Currently, an external malicious PCI device can masquerade the VID:PID
> of faulty gfx devices, and thus apply iommu quirks to effectively
> disable the IOMMU restrictions for itself.
> 
> Thus we need to ensure that the device we are applying quirks to, is
> indeed an internal trusted device.
> 
> Signed-off-by: Rajat Jain 
> Acked-by: Lu Baolu 
> ---
> V2: - Change the warning print strings.
> - Add Lu Baolu's acknowledgement.
> 
>  drivers/iommu/intel-iommu.c | 38 +
>  1 file changed, 38 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ef0a5246700e5..fdfbea4ff8cb3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6214,6 +6214,13 @@ const struct iommu_ops intel_iommu_ops = {
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
>  {
> + if (dev->untrusted) {
> + pci_warn(dev,
> +  "Skipping IOMMU quirk %s() for potentially untrusted 
> device\n",
> +  __func__);
> + return;
> + }
> +

This check and code seems to be happening several times. Maybe add a simple
function to do the test and use in all places?

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] PCI: Relax ACS requirement for Intel RCiEP devices.

2020-06-01 Thread Raj, Ashok
On Mon, Jun 01, 2020 at 04:25:19PM -0500, Bjorn Helgaas wrote:
> On Thu, May 28, 2020 at 01:57:42PM -0700, Ashok Raj wrote:
> > All Intel platforms guarantee that all root complex implementations
> > must send transactions up to IOMMU for address translations. Hence for
> > RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> > ACS support.
> > 
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the
> >   translated HPA. Hardware implementations are free to further limit
> >   peer-to-peer accesses to specific host physical address regions
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of
> >   the PCI Express Transaction Layer Packet (TLP), for PCI Express
> >   peer-to-peer requests with ECRC, the Root-Complex hardware must use
> >   the new ECRC (re-computed with the translated address) if it
> >   decides to forward the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in
> >   PCI Express specifications for details.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > Signed-off-by: Ashok Raj 
> > To: Joerg Roedel 
> > To: Bjorn Helgaas 
> > Cc: linux-ker...@vger.kernel.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: Lu Baolu 
> > Cc: Alex Williamson 
> > Cc: Darrel Goeddel 
> > Cc: Mark Scott ,
> > Cc: Romil Sharma 
> > Cc: Ashok Raj 
> 
> Tentatively applied to pci/virtualization for v5.8, thanks!
> 
> The spec says this handling must apply "when DMA remapping is
> enabled".  The patch does not check whether DMA remapping is enabled.
> 
> Is there any case where DMA remapping is *not* enabled, and we rely on
> this patch to tell us whether the device is isolated?  It sounds like
> it may give the wrong answer in such a case?
> 
> Can you confirm that I don't need to worry about this?  

I think all of this makes sense only when DMA remapping is enabled.
Otherwise there is no enforcement for isolation. 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] PCI: Relax ACS requirement for Intel RCiEP devices.

2020-05-28 Thread Raj, Ashok
On Thu, May 28, 2020 at 03:38:26PM -0600, Alex Williamson wrote:
> On Thu, 28 May 2020 13:57:42 -0700
> Ashok Raj  wrote:
> 
> > All Intel platforms guarantee that all root complex implementations
> > must send transactions up to IOMMU for address translations. Hence for
> > RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> > ACS support.
> > 
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the
> >   translated HPA. Hardware implementations are free to further limit
> >   peer-to-peer accesses to specific host physical address regions
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of
> >   the PCI Express Transaction Layer Packet (TLP), for PCI Express
> >   peer-to-peer requests with ECRC, the Root-Complex hardware must use
> >   the new ECRC (re-computed with the translated address) if it
> >   decides to forward the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in
> >   PCI Express specifications for details.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> 
> I don't really understand this Fixes tag.  This seems like a feature,
> not a fix.  If you want it in stable releases as a feature, request it
> via Cc: sta...@vger.kernel.org.  I'd drop that tag, that's my nit.
> Otherwise:

Yes, i should have Cced Stable instead. 

Bjorn: Can you massage this in? or i can resend with Alex's Reviewed-by +
adding stable in cc list.

> 
> Reviewed-by: Alex Williamson 
> 
> > Signed-off-by: Ashok Raj 
> > To: Joerg Roedel 
> > To: Bjorn Helgaas 
> > Cc: linux-ker...@vger.kernel.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: Lu Baolu 
> > Cc: Alex Williamson 
> > Cc: Darrel Goeddel 
> > Cc: Mark Scott ,
> > Cc: Romil Sharma 
> > Cc: Ashok Raj 
> > ---
> > v2: Moved functionality from iommu to pci quirks - Alex Williamson
> > 
> >  drivers/pci/quirks.c | 15 +++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 28c9a2409c50..63373ca0a3fe 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4682,6 +4682,20 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev 
> > *dev, u16 acs_flags)
> > PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
> >  }
> >  
> > +static int pci_quirk_rciep_acs(struct pci_dev *dev, u16 acs_flags)
> > +{
> > +   /*
> > +* RCiEP's are required to allow p2p only on translated addresses.
> > +* Refer to Intel VT-d specification Section 3.16 Root-Complex Peer
> > +* to Peer Considerations
> > +*/
> > +   if (pci_pcie_type(dev) != PCI_EXP_TYPE_RC_END)
> > +   return -ENOTTY;
> > +
> > +   return pci_acs_ctrl_enabled(acs_flags,
> > +   PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > +}
> > +
> >  static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
> >  {
> > /*
> > @@ -4764,6 +4778,7 @@ static const struct pci_dev_acs_enabled {
> > /* I219 */
> > { PCI_VENDOR_ID_INTEL, 0x15b7, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x15b8, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_quirk_rciep_acs },
> > /* QCOM QDF2xxx root ports */
> > { PCI_VENDOR_ID_QCOM, 0x0400, pci_quirk_qcom_rp_acs },
> > { PCI_VENDOR_ID_QCOM, 0x0401, pci_quirk_qcom_rp_acs },
> 
___
iommu mailing list

Re: [PATCH] iommu: Relax ACS requirement for Intel RCiEP devices.

2020-05-28 Thread Raj, Ashok
Hi Alex

On Tue, May 26, 2020 at 05:07:15PM -0600, Alex Williamson wrote:
> > ---
> >  drivers/iommu/iommu.c | 13 -
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2b471419e26c..31b595dfedde 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1187,7 +1187,18 @@ static struct iommu_group 
> > *get_pci_function_alias_group(struct pci_dev *pdev,
> > struct pci_dev *tmp = NULL;
> > struct iommu_group *group;
> >  
> > -   if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > +   /*
> > +* Intel VT-d Specification Section 3.16, Root-Complex Peer to Peer
> > +* Considerations manadate that all transactions in RCiEP's and
> > +* even Integrated MFD's *must* be sent up to the IOMMU. P2P is
> > +* only possible on translated addresses. This gives enough
> > +* guarantee that such devices can be forgiven for lack of ACS
> > +* support.
> > +*/
> > +   if (!pdev->multifunction ||
> > +   (pdev->vendor == PCI_VENDOR_ID_INTEL &&
> > +pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> > +pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > return NULL;
> >  
> > for_each_pci_dev(tmp) {
> 
> Hi Ashok,
> 
> As this is an Intel/VT-d standard, not a PCIe standard, why not
> implement this in pci_dev_specific_acs_enabled() with all the other
> quirks?  Thanks,

Yes, that sounds like the right place.. I have a new patch, once its tested
i'll resend it. Thanks for pointing it out.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-26 Thread Raj, Ashok
On Tue, May 26, 2020 at 12:26:54PM -0600, Alex Williamson wrote:
> > > 
> > > I don't think the language in the spec is anything sufficient to handle
> > > RCiEP uniquely.  We've previously rejected kernel command line opt-outs
> > > for ACS, and the extent to which those patches still float around the
> > > user community and are blindly used to separate IOMMU groups are a
> > > testament to the failure of this approach.  Users do not have a basis
> > > for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
> > > grouping, but the risk is entirely unknown.  A kconfig option is even
> > > worse as that means if you consume a downstream kernel, the downstream
> > > maintainers might have decided universally that isolation is less
> > > important than functionality.  
> > 
> > We discussed this internally, and Intel vt-d spec does spell out clearly 
> > in Section 3.16 Root-Complex Peer to Peer Considerations. The spec clearly
> > calls out that all p2p must be done on translated addresses and therefore
> > must go through the IOMMU.
> > 
> > I suppose they should also have some similar platform gauranteed behavior
> > for RCiEP's or MFD's *Must* behave as follows. The language is strict and
> > when IOMMU is enabled in the platform, everything is sent up north to the
> > IOMMU agent.
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the 
> >   translated HPA. Hardware implementations are free to further limit 
> >   peer-to-peer accesses to specific host physical address regions 
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of the PCI
> >   Express Transaction Layer Packet (TLP), for PCI Express peer-to-peer 
> >   requests with ECRC, the Root-Complex hardware must use the new ECRC 
> >   (re-computed with the translated address) if it decides to forward 
> >   the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in 
> >   PCI Express specifications for details.
> 
> That sounds like it might be a reasonable basis for quirking all RCiEPs
> on VT-d platforms if Intel is willing to stand behind it.  Thanks,
> 

Sounds good.. that's what i hear from our platform teams. If there is a
violation it would be a bug in silicon.  
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-26 Thread Raj, Ashok
Hi Alex,

I was able to find better language in the IOMMU spec that gaurantees 
the behavior we need. See below.


On Tue, May 05, 2020 at 09:34:14AM -0600, Alex Williamson wrote:
> On Tue, 5 May 2020 07:56:06 -0700
> "Raj, Ashok"  wrote:
> 
> > On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> > > On Mon, 4 May 2020 23:11:07 -0700
> > > "Raj, Ashok"  wrote:
> > >   
> > > > Hi Alex
> > > > 
> > > > + Joerg, accidently missed in the Cc.
> > > > 
> > > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:  
> > > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > > Ashok Raj  wrote:
> > > > > 
> > > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > > > 
> > > > > > PCIe 5.0 Specification.
> > > > > > 6.12 Access Control Services (ACS)
> > > > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > > > implement ACS and some do not. It is strongly recommended that Root 
> > > > > > Complex
> > > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > > (PFs and VFs) without ACS capability are first subjected to 
> > > > > > processing by
> > > > > > the Translation Agent (TA) in the Root Complex before further 
> > > > > > decoding and
> > > > > > processing. The details of such Root Complex handling are outside 
> > > > > > the scope
> > > > > > of this specification.
> > > > > >   
> > > > > 
> > > > > Is the language here really strong enough to make this change?  ACS is
> > > > > an optional feature, so being permitted but not required is rather
> > > > > meaningless.  The spec is also specifically avoiding the words "must"
> > > > > or "shall" and even when emphasized with "strongly", we still only 
> > > > > have
> > > > > a recommendation that may or may not be honored.  This seems like a
> > > > > weak basis for assuming that RCiEPs universally honor this
> > > > > recommendation.  Thanks,
> > > > > 
> > > > 
> > > > We are speaking about PCIe spec, where people write it about 5 years 
> > > > ahead
> > > > and every vendor tries to massage their product behavior with vague
> > > > words like this..  :)
> > > > 
> > > > But honestly for any any RCiEP, or even integrated endpoints, there 
> > > > is no way to send them except up north. These aren't behind a RP.  
> > > 
> > > But they are multi-function devices and the spec doesn't define routing
> > > within multifunction packages.  A single function RCiEP will already be
> > > assumed isolated within its own group.  
> > 
> > That's right. The other two devices only have legacy PCI headers. So 
> > they can't claim to be RCiEP's but just integrated endpoints. The legacy
> > devices don't even have a PCIe header.
> > 
> > I honestly don't know why these are groped as MFD's in the first place.
> > 
> > >
> > > > I did check with couple folks who are part of the SIG, and seem to agree
> > > > that ACS treatment for RCiEP's doesn't mean much. 
> > > > 
> > > > I understand the language isn't strong, but it doesn't seem like ACS 
> > > > should
> > > > be a strong requirement for RCiEP's and reasonable to relax.
> > > > 
> > > > What are your thoughts?   
> > > 
> > > I think hardware vendors have ACS at their disposal to clarify when
> > > isolation is provided, otherwise vendors can submit quirks, but I don't
> > > see that the "strongly recommended" phrasing is sufficient to assume
> > > isolation between multifunction RCiEPs.  Thanks,  
> > 
> > You point is that integrated MFD endpoints, without ACS, there is no 
> > gaurantee to SW that they are isolated.
> > 
> > As far as a quirk, do you think:
> > - a cmdline optput for integrated endpoints, and RCiEP's suffice?
> >   along with a compile time default that is strict enforcement
> > - typical vid/did type exception list?
> > 
> > A more generic way t

Re: [PATCH 0/4] PCI, iommu: Factor 'untrusted' check for ATS enablement

2020-05-18 Thread Raj, Ashok
On Mon, May 18, 2020 at 04:47:17PM +0100, David Woodhouse wrote:
> On Fri, 2020-05-15 at 10:19 -0700, Raj, Ashok wrote:
> > Hi Christoph
> > 
> > On Fri, May 15, 2020 at 08:43:51AM -0700, Christoph Hellwig wrote:
> > > Can you please lift the untrusted flag into struct device?  It really
> > > isn't a PCI specific concept, and we should not have code poking into
> > > pci_dev all over the iommu code.
> > 
> > Just for clarification: All IOMMU's today mostly pertain to only PCI devices
> > and for devices that aren't PCI like HPET for instance we give a PCI
> > identifier. Facilities like ATS for instance require the platform to have 
> > an IOMMU.
> > 
> > what additional problems does moving this to struct device solve?
> 
> Even the Intel IOMMU supports ACPI devices for which there is no struct
> pci_dev; only a B/D/F in the ANDD record indicating which entry in the
> context table to use.

Yes, spaced out :-).. just don't work on those platforms on a daily
basis and its easy to forget :-(
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 0/4] PCI, iommu: Factor 'untrusted' check for ATS enablement

2020-05-15 Thread Raj, Ashok
Hi Christoph

On Fri, May 15, 2020 at 08:43:51AM -0700, Christoph Hellwig wrote:
> Can you please lift the untrusted flag into struct device?  It really
> isn't a PCI specific concept, and we should not have code poking into
> pci_dev all over the iommu code.

Just for clarification: All IOMMU's today mostly pertain to only PCI devices
and for devices that aren't PCI like HPET for instance we give a PCI
identifier. Facilities like ATS for instance require the platform to have 
an IOMMU.

what additional problems does moving this to struct device solve?

Cheers,
Ashok

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-05-15 Thread Raj, Ashok
On Fri, May 15, 2020 at 12:39:14AM -0700, Tian, Kevin wrote:
> Hi, Alex,
> 
> When working on an updated version Yi and I found an design open
> which needs your guidance.
> 
> In concept nested translation can be incarnated as one GPA->HPA page
> table and multiple GVA->GPA page tables per VM. It means one container
> is sufficient to include all SVA-capable devices assigned to the same guest,
> as there is just one iova space (GPA->HPA) to be managed by vfio iommu
> map/unmap api. GVA->GPA page tables are bound through the new api
> introduced in this patch. It is different from legacy shadow translation
> which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device requires
> its own iova space and must be in a separate container.
> 
> However supporting multiple SVA-capable devices in one container
> imposes one challenge. Now the bind_guest_pgtbl is implemented as
> iommu type1 api. From the definition of iommu type 1 any operation
> should be applied to all devices within the same container, just like
> dma map/unmap. However this philosophy is incorrect regarding to
> page table binding. We must follow the guest binding requirements
> between its page tables and assigned devices, otherwise every bound
> address space is suddenly accessible by all assigned devices thus creating
> security holes.

The above 2 paragraphs are bit confusing :-) but what really matters
is the below: 
> 
> Do you think whether it's possible to extend iommu api to accept
> device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
> also problematic, as PASID and page tables are IOMMU things which
> are not touched by vfio device drivers today.

All you are referring to is when admin groups multiple devices in a
single container, you are saying you can't give isolation to them
for SVA purposes. This is logically equivalent to a single group with
multiple devices.

- Such as devices behind p2p bridge
- MFD's or devices behind switches or hieararchies without ACS
  support for instance.

By limitation you mean, disable PASID on those devices in a single 
container?

what about ATS? 

> 
> Alternatively can we impose the limitation that nesting APIs can be
> only enabled for singleton containers which contains only one device?
> This basically falls back to the state of legacy shadow translation
> vIOMMU. and our current Qemu vIOMMU implementation actually
> does this way regardless of whether nesting is used. Do you think
> whether such tradeoff is acceptable as a starting point?
> 
> Thanks
> Kevin
> 
> > -Original Message-
> > From: Liu, Yi L 
> > Sent: Sunday, March 22, 2020 8:32 PM
> > To: alex.william...@redhat.com; eric.au...@redhat.com
> > Cc: Tian, Kevin ; jacob.jun@linux.intel.com;
> > j...@8bytes.org; Raj, Ashok ; Liu, Yi L
> > ; Tian, Jun J ; Sun, Yi Y
> > ; jean-phili...@linaro.org; pet...@redhat.com;
> > iommu@lists.linux-foundation.org; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; Wu, Hao 
> > Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> >
> > From: Liu Yi L 
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > hardware
> > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > translation). For such hardware IOMMUs, there are two stages/levels of
> > address translation, and software may let userspace/VM to own the first-
> > level/stage-1 translation structures. Example of such usage is vSVA (
> > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > translation structures and bind the structures to host, then hardware
> > IOMMU would utilize nesting translation when doing DMA translation fo
> > the devices behind such hardware IOMMU.
> >
> > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> > bind
> > guest page table is needed, it also requires to expose interface to guest
> > for iommu cache invalidation when guest modified the first-level/stage-1
> > translation structures since hardware needs to be notified to flush stale
> > iotlbs. This would be introduced in next patch.
> >
> > In this patch, guest page table bind and unbind are done by using flags
> > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > VM should have got a PASID allocated by host via
> > VFIO_IOMMU_PASID_REQUEST.
> >
> > Bind guest translation 

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-05 Thread Raj, Ashok
On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> On Mon, 4 May 2020 23:11:07 -0700
> "Raj, Ashok"  wrote:
> 
> > Hi Alex
> > 
> > + Joerg, accidently missed in the Cc.
> > 
> > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> > > On Mon,  4 May 2020 21:42:16 -0700
> > > Ashok Raj  wrote:
> > >   
> > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > 
> > > > PCIe 5.0 Specification.
> > > > 6.12 Access Control Services (ACS)
> > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > implement ACS and some do not. It is strongly recommended that Root 
> > > > Complex
> > > > implementations ensure that all accesses originating from RCiEPs
> > > > (PFs and VFs) without ACS capability are first subjected to processing 
> > > > by
> > > > the Translation Agent (TA) in the Root Complex before further decoding 
> > > > and
> > > > processing. The details of such Root Complex handling are outside the 
> > > > scope
> > > > of this specification.
> > > > 
> > > 
> > > Is the language here really strong enough to make this change?  ACS is
> > > an optional feature, so being permitted but not required is rather
> > > meaningless.  The spec is also specifically avoiding the words "must"
> > > or "shall" and even when emphasized with "strongly", we still only have
> > > a recommendation that may or may not be honored.  This seems like a
> > > weak basis for assuming that RCiEPs universally honor this
> > > recommendation.  Thanks,
> > >   
> > 
> > We are speaking about PCIe spec, where people write it about 5 years ahead
> > and every vendor tries to massage their product behavior with vague
> > words like this..  :)
> > 
> > But honestly for any any RCiEP, or even integrated endpoints, there 
> > is no way to send them except up north. These aren't behind a RP.
> 
> But they are multi-function devices and the spec doesn't define routing
> within multifunction packages.  A single function RCiEP will already be
> assumed isolated within its own group.

That's right. The other two devices only have legacy PCI headers. So 
they can't claim to be RCiEP's but just integrated endpoints. The legacy
devices don't even have a PCIe header.

I honestly don't know why these are groped as MFD's in the first place.

>  
> > I did check with couple folks who are part of the SIG, and seem to agree
> > that ACS treatment for RCiEP's doesn't mean much. 
> > 
> > I understand the language isn't strong, but it doesn't seem like ACS should
> > be a strong requirement for RCiEP's and reasonable to relax.
> > 
> > What are your thoughts? 
> 
> I think hardware vendors have ACS at their disposal to clarify when
> isolation is provided, otherwise vendors can submit quirks, but I don't
> see that the "strongly recommended" phrasing is sufficient to assume
> isolation between multifunction RCiEPs.  Thanks,

You point is that integrated MFD endpoints, without ACS, there is no 
gaurantee to SW that they are isolated.

As far as a quirk, do you think:
- a cmdline optput for integrated endpoints, and RCiEP's suffice?
  along with a compile time default that is strict enforcement
- typical vid/did type exception list?

A more generic way to ask for exception would be scalable until we can stop
those type of integrated devices. Or we need to maintain these device lists
for eternity. 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-05 Thread Raj, Ashok
Hi Alex

+ Joerg, accidently missed in the Cc.

On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> On Mon,  4 May 2020 21:42:16 -0700
> Ashok Raj  wrote:
> 
> > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > 
> > PCIe 5.0 Specification.
> > 6.12 Access Control Services (ACS)
> > Implementation of ACS in RCiEPs is permitted but not required. It is
> > explicitly permitted that, within a single Root Complex, some RCiEPs
> > implement ACS and some do not. It is strongly recommended that Root Complex
> > implementations ensure that all accesses originating from RCiEPs
> > (PFs and VFs) without ACS capability are first subjected to processing by
> > the Translation Agent (TA) in the Root Complex before further decoding and
> > processing. The details of such Root Complex handling are outside the scope
> > of this specification.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > Signed-off-by: Ashok Raj 
> > To: Joerg Roedel 
> > To: Bjorn Helgaas 
> > Cc: linux-ker...@vger.kernel.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: Lu Baolu 
> > Cc: Alex Williamson 
> > Cc: Darrel Goeddel 
> > Cc: Mark Scott ,
> > Cc: Romil Sharma 
> > Cc: Ashok Raj 
> > ---
> >  drivers/iommu/iommu.c | 15 ++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2b471419e26c..5744bd65f3e2 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1187,7 +1187,20 @@ static struct iommu_group 
> > *get_pci_function_alias_group(struct pci_dev *pdev,
> > struct pci_dev *tmp = NULL;
> > struct iommu_group *group;
> >  
> > -   if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > +   /*
> > +* PCI Spec 5.0, Section 6.12 Access Control Service
> > +* Implementation of ACS in RCiEPs is permitted but not required.
> > +* It is explicitly permitted that, within a single Root
> > +* Complex, some RCiEPs implement ACS and some do not. It is
> > +* strongly recommended that Root Complex implementations ensure
> > +* that all accesses originating from RCiEPs (PFs and VFs) without
> > +* ACS capability are first subjected to processing by the Translation
> > +* Agent (TA) in the Root Complex before further decoding and
> > +* processing.
> > +*/
> 
> Is the language here really strong enough to make this change?  ACS is
> an optional feature, so being permitted but not required is rather
> meaningless.  The spec is also specifically avoiding the words "must"
> or "shall" and even when emphasized with "strongly", we still only have
> a recommendation that may or may not be honored.  This seems like a
> weak basis for assuming that RCiEPs universally honor this
> recommendation.  Thanks,
> 

We are speaking about PCIe spec, where people write it about 5 years ahead
and every vendor tries to massage their product behavior with vague
words like this..  :)

But honestly for any any RCiEP, or even integrated endpoints, there 
is no way to send them except up north. These aren't behind a RP.

I did check with couple folks who are part of the SIG, and seem to agree
that ACS treatment for RCiEP's doesn't mean much. 

I understand the language isn't strong, but it doesn't seem like ACS should
be a strong requirement for RCiEP's and reasonable to relax.

What are your thoughts? 

Cheers,
Ashok
> 
> > +   if (!pdev->multifunction ||
> > +   (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> > +pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > return NULL;
> >  
> > for_each_pci_dev(tmp) {
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 6/7] x86/traps: Fix up invalid PASID

2020-04-27 Thread Raj, Ashok
Hi Thomas,

On Tue, Apr 28, 2020 at 02:54:59AM +0200, Thomas Gleixner wrote:
> Ashok,
> 
> "Raj, Ashok"  writes:
> > On Sun, Apr 26, 2020 at 05:25:06PM +0200, Thomas Gleixner wrote:
> >> Just for the record I also suggested to have a proper errorcode in the
> >> #GP for ENQCMD and I surely did not suggest to avoid decoding the user
> >> instructions.
> >
> > We certainly discussed the possiblity of adding an error code to 
> > identiy #GP due to ENQCMD with our HW architects. 
> >
> > There are only a few cases that have an error code, like move to segment
> > with an invalid value for instance. There were a few but i don't
> > recall that entire list. 
> >
> > Since the error code is 0 in most places, there isn't plumbing in hw to 
> > return
> > this value in all cases. It appeared that due to some uarch reasons it
> > wasn't as simple as it appears to /me sw kinds :-)
> 
> Sigh.
> 
> > So after some internal discussion we decided to take the current
> > approach. Its possible that if the #GP was due to some other reason
> > we might #GP another time. Since this wasn't perf or speed path we took
> > this lazy approach.
> 
> I know that the HW people's mantra is that everything can be fixed in
> software and therefore slapping new features into the CPUs can be done
> without thinking about the consequeses.
> 
> But we all know from painful experience that this is fundamentally wrong
> unless there is a really compelling reason.

:-)... I'm still looking for the quote from Linus about RAS before
he went to behavior school.


> 
> For new features there is absolutely no reason at all.
> 
> Can HW people pretty please understand that hardware and software have
> to be co-designed and not dictated by 'some uarch reasons'. This is
> nothing fundamentally new. This problem existed 30+ years ago, is well
> documented and has been ignored forever. I'm tired of that, really.
> 
> But as this seems to be unsolvable for the problem at hand can you
> please document the inability, unwillingness or whatever in the
> changelog?

Most certainly!

> 
> The question why this brand new_ ENQCMD + invalid PASID induced #GP does
> not generate an useful error code and needs heuristics to be dealt with
> is pretty obvious.
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 6/7] x86/traps: Fix up invalid PASID

2020-04-27 Thread Raj, Ashok
Hi Thomas

On Sun, Apr 26, 2020 at 05:25:06PM +0200, Thomas Gleixner wrote:
> Fenghua Yu  writes:
> > A #GP fault is generated when ENQCMD instruction is executed without
> > a valid PASID value programmed in.
> 
> Programmed in what?
> 
> > The #GP fault handler will initialize the current thread's PASID MSR.
> >
> > The following heuristic is used to avoid decoding the user instructions
> > to determine the precise reason for the #GP fault:
> > 1) If the mm for the process has not been allocated a PASID, this #GP
> >cannot be fixed.
> > 2) If the PASID MSR is already initialized, then the #GP was for some
> >other reason
> > 3) Try initializing the PASID MSR and returning. If the #GP was from
> >an ENQCMD this will fix it. If not, the #GP fault will be repeated
> >and we will hit case "2".
> >
> > Suggested-by: Thomas Gleixner 
> 
> Just for the record I also suggested to have a proper errorcode in the
> #GP for ENQCMD and I surely did not suggest to avoid decoding the user
> instructions.

We certainly discussed the possiblity of adding an error code to 
identiy #GP due to ENQCMD with our HW architects. 

There are only a few cases that have an error code, like move to segment
with an invalid value for instance. There were a few but i don't
recall that entire list. 

Since the error code is 0 in most places, there isn't plumbing in hw to return
this value in all cases. It appeared that due to some uarch reasons it
wasn't as simple as it appears to /me sw kinds :-)

So after some internal discussion we decided to take the current
approach. Its possible that if the #GP was due to some other reason
we might #GP another time. Since this wasn't perf or speed path we took
this lazy approach. 

We will keep tabs with HW folks for future consideration. 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 0/2] vfio/pci: expose device's PASID capability to VMs

2020-04-16 Thread Raj, Ashok
Hi Zhao


On Thu, Apr 16, 2020 at 06:12:26PM -0400, Yan Zhao wrote:
> On Tue, Mar 31, 2020 at 03:08:25PM +0800, Lu, Baolu wrote:
> > On 2020/3/31 14:35, Tian, Kevin wrote:
> > >> From: Liu, Yi L
> > >> Sent: Sunday, March 22, 2020 8:33 PM
> > >>
> > >> From: Liu Yi L
> > >>
> > >> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > >> Intel platforms allows address space sharing between device DMA and
> > >> applications. SVA can reduce programming complexity and enhance security.
> > >>
> > >> To enable SVA, device needs to have PASID capability, which is a key
> > >> capability for SVA. This patchset exposes the device's PASID capability
> > >> to guest instead of hiding it from guest.
> > >>
> > >> The second patch emulates PASID capability for VFs (Virtual Function) 
> > >> since
> > >> VFs don't implement such capability per PCIe spec. This patch emulates 
> > >> such
> > >> capability and expose to VM if the capability is enabled in PF (Physical
> > >> Function).
> > >>
> > >> However, there is an open for PASID emulation. If PF driver disables 
> > >> PASID
> > >> capability at runtime, then it may be an issue. e.g. PF should not 
> > >> disable
> > >> PASID capability if there is guest using this capability on any VF 
> > >> related
> > >> to this PF. To solve it, may need to introduce a generic communication
> > >> framework between vfio-pci driver and PF drivers. Please feel free to 
> > >> give
> > >> your suggestions on it.
> > > I'm not sure how this is addressed on bate metal today, i.e. between 
> > > normal
> > > kernel PF and VF drivers. I look at pasid enable/disable code in 
> > > intel-iommu.c.
> > > There is no check on PF/VF dependency so far. The cap is toggled when
> > > attaching/detaching the PF to its domain. Let's see how IOMMU guys
> > > respond, and if there is a way for VF driver to block PF driver from 
> > > disabling
> > > the pasid cap when it's being actively used by VF driver, then we may
> > > leverage the same trick in VFIO when emulation is provided to guest.
> > 
> > IOMMU subsystem doesn't expose any APIs for pasid enabling/disabling.
> > The PCI subsystem does. It handles VF/PF like below.
> > 
> > /**
> >   * pci_enable_pasid - Enable the PASID capability
> >   * @pdev: PCI device structure
> >   * @features: Features to enable
> >   *
> >   * Returns 0 on success, negative value on error. This function checks
> >   * whether the features are actually supported by the device and returns
> >   * an error if not.
> >   */
> > int pci_enable_pasid(struct pci_dev *pdev, int features)
> > {
> >  u16 control, supported;
> >  int pasid = pdev->pasid_cap;
> > 
> >  /*
> >   * VFs must not implement the PASID Capability, but if a PF
> >   * supports PASID, its VFs share the PF PASID configuration.
> >   */
> >  if (pdev->is_virtfn) {
> >  if (pci_physfn(pdev)->pasid_enabled)
> >  return 0;
> >  return -EINVAL;
> >  }
> > 
> > /**
> >   * pci_disable_pasid - Disable the PASID capability
> >   * @pdev: PCI device structure
> >   */
> > void pci_disable_pasid(struct pci_dev *pdev)
> > {
> >  u16 control = 0;
> >  int pasid = pdev->pasid_cap;
> > 
> >  /* VFs share the PF PASID configuration */
> >  if (pdev->is_virtfn)
> >  return;
> > 
> > 
> > It doesn't block disabling PASID on PF even VFs are possibly using it.
> >
> hi
> I'm not sure, but is it possible for pci_enable_pasid() and
> pci_disable_pasid() to do the same thing as pdev->driver->sriov_configure,
> e.g. pci_sriov_configure_simple() below.
> 
> It checks whether there are VFs are assigned in pci_vfs_assigned(dev).
> and we can set the VF in assigned status if vfio_pci_open() is performed
> on the VF.

But you can still unbind the PF driver that magically causes the VF's to be
removed from the guest image too correct? 

Only the IOMMU mucks with pasid_enable/disable. And it doesn't look like
we have a path to disable without tearing down the PF binding. 

We originally had some refcounts and such and would do the real disable only
when the refcount drops to 0, but we found it wasn't actually necessary 
to protect these resources like that.

> 
> 
> int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
> {
> int rc;
> 
> might_sleep();
> 
> if (!dev->is_physfn)
> return -ENODEV;
> 
> if (pci_vfs_assigned(dev)) {
> pci_warn(dev, "Cannot modify SR-IOV while VFs are 
> assigned\n");
> return -EPERM;
> }
> 
> if (nr_virtfn == 0) {
> sriov_disable(dev);
> return 0;
> }
> 
> rc = sriov_enable(dev, nr_virtfn);
> if (rc < 0)
> return rc;
> 
> return nr_virtfn;
> }
> 
> Thanks
> Yan
___
iommu mailing list

Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-12 Thread Raj, Ashok
Hi Alex

Going through the PCIe Spec, there seems a lot of such capabilities
that are different between PF and VF. Some that make sense
and some don't.


On Sun, Apr 12, 2020 at 08:10:43PM -0700, Raj, Ashok wrote:
> 
> > 
> > I agree though, I don't know why the SIG would preclude implementing
> > per VF control of these features.  Thanks,
> > 

For e.g. 

VF doesn't have I/O and Mem space enables, but has BME
Interrupt Status
Correctable Error Reporting
Almost all of Device Control Register.

So it seems like there is a ton of them we have to deal with today for 
VF's. How do we manage to emulate them without any support for them 
in VF's? 


> 
> Cheers,
> Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-12 Thread Raj, Ashok
On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> On Tue, 7 Apr 2020 21:00:21 -0700
> "Raj, Ashok"  wrote:
> 
> > Hi Alex
> > 
> > + Bjorn
> 
>  + Don
> 
> > FWIW I can't understand why PCI SIG went different ways with ATS, 
> > where its enumerated on PF and VF. But for PASID and PRI its only
> > in PF. 
> > 
> > I'm checking with our internal SIG reps to followup on that.
> > 
> > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > Is there vendor guarantee that hidden registers will locate at the
> > > > same offset between PF and VF config space?   
> > > 
> > > I'm not sure if the spec really precludes hidden registers, but the
> > > fact that these registers are explicitly outside of the capability
> > > chain implies they're only intended for device specific use, so I'd say
> > > there are no guarantees about anything related to these registers.  
> > 
> > As you had suggested in the other thread, we could consider
> > using the same offset as in PF, but even that's a better guess
> > still not reliable.
> > 
> > The other option is to maybe extend driver ops in the PF to expose
> > where the offsets should be. Sort of adding the quirk in the 
> > implementation. 
> > 
> > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is 
> > resisting 
> > making VF's first class citizen, we might ask them to add some verbiage
> > to suggest leave the same offsets as PF open to help emulation software.
> 
> Even if we know where to expose these capabilities on the VF, it's not
> clear to me how we can actually virtualize the capability itself.  If
> the spec defines, for example, an enable bit as r/w then software that
> interacts with that register expects the bit is settable.  There's no
> protocol for "try to set the bit and re-read it to see if the hardware
> accepted it".  Therefore a capability with a fixed enable bit
> representing the state of the PF, not settable by the VF, is
> disingenuous to the spec.

I think we are all in violent agreement. A lot of times the pci spec gets
defined several years ahead of real products and no one remembers
the justification on why they restricted things the way they did.

Maybe someone early product wasn't quite exposing these features to the VF
and hence the spec is bug compatible :-)

> 
> If what we're trying to do is expose that PASID and PRI are enabled on
> the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> without the ability to control it is not the right approach.  Maybe we

As long as the capability enable is only provided when the PF has enabled
the feature. Then it seems the hardware seems to do the right thing.

Assume we expose PASID/PRI only when PF has enabled it. It will be the
case since the PF driver needs to exist, and IOMMU would have set the 
PASID/PRI/ATS on PF.

If the emulation is purely spoofing the capability. Once vIOMMU driver
enables PASID, the context entries for the VF are completely independent
from the PF context entries. 

vIOMMU would enable PASID, and we just spoof the PASID capability.

If vIOMMU or guest for some reason does disable_pasid(), then the 
vIOMMU driver can disaable PASID on the VF context entries. So the VF
although the capability is blanket enabled on PF, IOMMU gaurantees the
transactions are blocked.


In the interim, it seems like the intent of the virtual capability
can be honored via help from the IOMMU for the controlling aspect.. 

Did i miss anything? 

> need new capabilities exposing these as slave features that cannot be
> controlled?  We could define our own vendor capability for this, but of
> course we have both the where to put it in config space issue, as well
> as the issue of trying to push an ad-hoc standard.  vfio could expose
> these as device features rather than emulating capabilities, but that
> still leaves a big gap between vfio in the hypervisor and the driver in
> the guest VM.  That might still help push the responsibility and policy
> for how to expose it to the VM as a userspace problem though.

I think this is a good long term solution, but if the vIOMMU implenentations
can carry us for the time being, we can probably defer them unless
we are stuck.

> 
> I agree though, I don't know why the SIG would preclude implementing
> per VF control of these features.  Thanks,
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-08 Thread Raj, Ashok
Hi Alex

On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> On Tue, 7 Apr 2020 21:00:21 -0700
> "Raj, Ashok"  wrote:
> 
> > Hi Alex
> > 
> > + Bjorn
> 
>  + Don
> 
> > FWIW I can't understand why PCI SIG went different ways with ATS, 
> > where its enumerated on PF and VF. But for PASID and PRI its only
> > in PF. 
> > 
> > I'm checking with our internal SIG reps to followup on that.
> > 
> > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > Is there vendor guarantee that hidden registers will locate at the
> > > > same offset between PF and VF config space?   
> > > 
> > > I'm not sure if the spec really precludes hidden registers, but the
> > > fact that these registers are explicitly outside of the capability
> > > chain implies they're only intended for device specific use, so I'd say
> > > there are no guarantees about anything related to these registers.  
> > 
> > As you had suggested in the other thread, we could consider
> > using the same offset as in PF, but even that's a better guess
> > still not reliable.
> > 
> > The other option is to maybe extend driver ops in the PF to expose
> > where the offsets should be. Sort of adding the quirk in the 
> > implementation. 
> > 
> > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is 
> > resisting 
> > making VF's first class citizen, we might ask them to add some verbiage
> > to suggest leave the same offsets as PF open to help emulation software.
> 
> Even if we know where to expose these capabilities on the VF, it's not
> clear to me how we can actually virtualize the capability itself.  If
> the spec defines, for example, an enable bit as r/w then software that
> interacts with that register expects the bit is settable.  There's no
> protocol for "try to set the bit and re-read it to see if the hardware
> accepted it".  Therefore a capability with a fixed enable bit
> representing the state of the PF, not settable by the VF, is
> disingenuous to the spec.
> 
> If what we're trying to do is expose that PASID and PRI are enabled on
> the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> without the ability to control it is not the right approach.  Maybe we
> need new capabilities exposing these as slave features that cannot be
> controlled?  We could define our own vendor capability for this, but of

The other option is to say, VFIO would never emulate these
fake capablities. If exposing a VF with PASID/PRI is required
the PF driver would simply wrap it into a VDCM like model which we do
today for Scalable IOV devices. So PF handles all aspects of this
interface.

I also like the suggestion you propose, maybe an offset where these
capabilities are exposed to VF's. Maybe have an architected DEVCAPx
which exposes these RO capabilities. No control, and the 
offset should be preserved by the SIG, so VMM can have a safe place
to stash them.

> course we have both the where to put it in config space issue, as well
> as the issue of trying to push an ad-hoc standard.  vfio could expose
> these as device features rather than emulating capabilities, but that
> still leaves a big gap between vfio in the hypervisor and the driver in
> the guest VM.  That might still help push the responsibility and policy
> for how to expose it to the VM as a userspace problem though.
> 
> I agree though, I don't know why the SIG would preclude implementing
> per VF control of these features.  Thanks,

Even if we ask SIG for clarification, it might affect today's devices
So might not be useful to solve our current situation.

Ashok

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-07 Thread Raj, Ashok
Hi Alex

+ Bjorn

FWIW I can't understand why PCI SIG went different ways with ATS, 
where its enumerated on PF and VF. But for PASID and PRI its only
in PF. 

I'm checking with our internal SIG reps to followup on that.

On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > Is there vendor guarantee that hidden registers will locate at the
> > same offset between PF and VF config space? 
> 
> I'm not sure if the spec really precludes hidden registers, but the
> fact that these registers are explicitly outside of the capability
> chain implies they're only intended for device specific use, so I'd say
> there are no guarantees about anything related to these registers.

As you had suggested in the other thread, we could consider
using the same offset as in PF, but even that's a better guess
still not reliable.

The other option is to maybe extend driver ops in the PF to expose
where the offsets should be. Sort of adding the quirk in the 
implementation. 

I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is resisting 
making VF's first class citizen, we might ask them to add some verbiage
to suggest leave the same offsets as PF open to help emulation software.


> 
> FWIW, vfio started out being more strict about restricting config space
> access to defined capabilities, until...
> 
> commit a7d1ea1c11b33bda2691f3294b4d735ed635535a
> Author: Alex Williamson 
> Date:   Mon Apr 1 09:04:12 2013 -0600
> 

Cheers,
Ashok

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] iommu/vt-d: Replace intel SVM APIs with generic SVA APIs

2020-03-23 Thread Raj, Ashok
Hi Jean

On Fri, Mar 20, 2020 at 10:29:55AM +0100, Jean-Philippe Brucker wrote:
> > +#define to_intel_svm_dev(handle) container_of(handle, struct 
> > intel_svm_dev, sva)
> > +struct iommu_sva *
> > +intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
> > +{
> > +   struct iommu_sva *sva = ERR_PTR(-EINVAL);
> > +   struct intel_svm_dev *sdev = NULL;
> > +   int flags = 0;
> > +   int ret;
> > +
> > +   /*
> > +* TODO: Consolidate with generic iommu-sva bind after it is merged.
> > +* It will require shared SVM data structures, i.e. combine io_mm
> > +* and intel_svm etc.
> > +*/
> > +   if (drvdata)
> > +   flags = *(int *)drvdata;
> 
> drvdata is more for storing device driver contexts that can be passed to
> iommu_sva_ops, but I get that this is temporary.
> 
> As usual I'm dreading supervisor mode making it into the common API. What
> are your plans regarding SUPERVISOR_MODE and PRIVATE_PASID flags?  The
> previous discussion on the subject [1] had me hoping that you could
> replace supervisor mode with normal mappings (auxiliary domains?)
> I'm less worried about PRIVATE_PASID, it would just add complexity into

We don't seem to have an immediate need for PRIVATE_PASID. There are some talks
about potential usages, but nothing concrete. I think it might be good to
get rid of it now and add when we really need.

For SUPERVISOR_MODE, the idea is to have aux domain. Baolu is working on
something to replace. Certainly the entire kernel address is opening up 
the whole kimono.. so we are looking at dynamically creating mappings on 
demand. 
It might take some of the benefits of SVA in general with no need to create
mappings, but for security somebody has to pay the price :-)

Cheers,
Ashok


> the API and iommu-sva implementation, but doesn't really have security
> implications.
> 
> [1] 
> https://lore.kernel.org/linux-iommu/20190228220449.ga12...@araj-mobl1.jf.intel.com/
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v12 2/4] uacce: add uacce driver

2020-02-24 Thread Raj, Ashok
Hi Kenneth,

sorry for waking up late on this patchset.


On Wed, Jan 15, 2020 at 10:12:46PM +0800, Zhangfei Gao wrote:
[... trimmed]

> +
> +static int uacce_fops_open(struct inode *inode, struct file *filep)
> +{
> + struct uacce_mm *uacce_mm = NULL;
> + struct uacce_device *uacce;
> + struct uacce_queue *q;
> + int ret = 0;
> +
> + uacce = xa_load(_xa, iminor(inode));
> + if (!uacce)
> + return -ENODEV;
> +
> + q = kzalloc(sizeof(struct uacce_queue), GFP_KERNEL);
> + if (!q)
> + return -ENOMEM;
> +
> + mutex_lock(>mm_lock);
> + uacce_mm = uacce_mm_get(uacce, q, current->mm);

I think having this at open time is a bit unnatural. Since when a process
does fork, we do not inherit the PASID. Although it inherits the fd
but cannot use the mmaped address in the child.

If you move this to the mmap time, its more natural. The child could
do a mmap() get a new PASID + mmio space to work with the hardware.


> + mutex_unlock(>mm_lock);
> + if (!uacce_mm) {
> + ret = -ENOMEM;
> + goto out_with_mem;
> + }
> +
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 03/10] iommu/vt-d: Replace Intel specific PASID allocator with IOASID

2019-10-22 Thread Raj, Ashok
On Tue, Oct 22, 2019 at 04:53:16PM -0700, Jacob Pan wrote:
> Make use of generic IOASID code to manage PASID allocation,
> free, and lookup. Replace Intel specific code.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 12 ++--
>  drivers/iommu/intel-pasid.c | 36 
>  drivers/iommu/intel-svm.c   | 37 +
>  3 files changed, 27 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3aff0141c522..72febcf2c48f 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5311,7 +5311,7 @@ static void auxiliary_unlink_device(struct dmar_domain 
> *domain,
>   domain->auxd_refcnt--;
>  
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - intel_pasid_free_id(domain->default_pasid);
> + ioasid_free(domain->default_pasid);

if the domain is gauranteed to be torn down, its ok.. but otherwise
do you need to clear domain->default_pasid to avoid accidental reference
in some other code path?

>  }
>  
>  static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5329,10 +5329,10 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   if (domain->default_pasid <= 0) {
>   int pasid;
>  
> - pasid = intel_pasid_alloc_id(domain, PASID_MIN,
> -  pci_max_pasids(to_pci_dev(dev)),
> -  GFP_KERNEL);
> - if (pasid <= 0) {
> + /* No private data needed for the default pasid */
> + pasid = ioasid_alloc(NULL, PASID_MIN, 
> pci_max_pasids(to_pci_dev(dev)) - 1,

If we ensure IOMMU max-pasid is full 20bit width, its good. In a vIOMMU
if iommu max pasid is restricted for some kind of partitioning in future
you want to consider limiting to no more than what's provisioned in the vIOMMU 
right?


> + NULL);
> + if (pasid == INVALID_IOASID) {
>   pr_err("Can't allocate default pasid\n");
>   return -ENODEV;
>   }
> @@ -5368,7 +5368,7 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   spin_unlock(>lock);
>   spin_unlock_irqrestore(_domain_lock, flags);
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - intel_pasid_free_id(domain->default_pasid);
> + ioasid_free(domain->default_pasid);
>  
>   return ret;
>  }
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 76bcbb21e112..c0d1f28d3412 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -26,42 +26,6 @@
>   */
>  static DEFINE_SPINLOCK(pasid_lock);
>  u32 intel_pasid_max_id = PASID_MAX;
> -static DEFINE_IDR(pasid_idr);
> -
> -int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp)
> -{
> - int ret, min, max;
> -
> - min = max_t(int, start, PASID_MIN);
> - max = min_t(int, end, intel_pasid_max_id);
> -
> - WARN_ON(in_interrupt());
> - idr_preload(gfp);
> - spin_lock(_lock);
> - ret = idr_alloc(_idr, ptr, min, max, GFP_ATOMIC);
> - spin_unlock(_lock);
> - idr_preload_end();
> -
> - return ret;
> -}
> -
> -void intel_pasid_free_id(int pasid)
> -{
> - spin_lock(_lock);
> - idr_remove(_idr, pasid);
> - spin_unlock(_lock);
> -}
> -
> -void *intel_pasid_lookup_id(int pasid)
> -{
> - void *p;
> -
> - spin_lock(_lock);
> - p = idr_find(_idr, pasid);
> - spin_unlock(_lock);
> -
> - return p;
> -}
>  
>  static int check_vcmd_pasid(struct intel_iommu *iommu)
>  {
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 9b159132405d..5aef5b7bf561 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "intel-pasid.h"
> @@ -318,16 +319,15 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, 
> int flags, struct svm_dev_
>   if (pasid_max > intel_pasid_max_id)
>   pasid_max = intel_pasid_max_id;
>  
> - /* Do not use PASID 0 in caching mode (virtualised IOMMU) */
> - ret = intel_pasid_alloc_id(svm,
> -!!cap_caching_mode(iommu->cap),
> -pasid_max - 1, GFP_KERNEL);
> - if (ret < 0) {
> + /* Do not use PASID 0, reserved for RID to PASID */
> + svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid_max - 1, svm);
> + if (svm->pasid == INVALID_IOASID) {
>   kfree(svm);
>   kfree(sdev);
> + ret = ENOSPC;
>   goto out;
>   }
> - svm->pasid = ret;
>   

Re: [PATCH v6 02/10] iommu/vt-d: Add custom allocator for IOASID

2019-10-22 Thread Raj, Ashok
On Tue, Oct 22, 2019 at 04:53:15PM -0700, Jacob Pan wrote:
> When VT-d driver runs in the guest, PASID allocation must be
> performed via virtual command interface. This patch registers a
> custom IOASID allocator which takes precedence over the default
> XArray based allocator. The resulting IOASID allocation will always
> come from the host. This ensures that PASID namespace is system-
> wide.
> 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Liu, Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/Kconfig   |  1 +
>  drivers/iommu/intel-iommu.c | 67 
> +
>  include/linux/intel-iommu.h |  2 ++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index fd50ddbf..961fe5795a90 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -211,6 +211,7 @@ config INTEL_IOMMU_SVM
>   bool "Support for Shared Virtual Memory with Intel IOMMU"
>   depends on INTEL_IOMMU && X86
>   select PCI_PASID
> + select IOASID
>   select MMU_NOTIFIER
>   help
> Shared Virtual Memory (SVM) provides a facility for devices
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3f974919d3bd..3aff0141c522 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1706,6 +1706,8 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
>   if (ecap_prs(iommu->ecap))
>   intel_svm_finish_prq(iommu);
>   }
> + ioasid_unregister_allocator(>pasid_allocator);
> +
>  #endif
>  }
>  
> @@ -4910,6 +4912,46 @@ static int __init probe_acpi_namespace_devices(void)
>   return 0;
>  }
>  
> +#ifdef CONFIG_INTEL_IOMMU_SVM

Maybe move them to intel-svm.c instead? that's where the bulk
of the svm support is?

> +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max, void *data)
> +{
> + struct intel_iommu *iommu = data;
> + ioasid_t ioasid;
> +
> + /*
> +  * VT-d virtual command interface always uses the full 20 bit
> +  * PASID range. Host can partition guest PASID range based on
> +  * policies but it is out of guest's control.
> +  */
> + if (min < PASID_MIN || max > PASID_MAX)
> + return INVALID_IOASID;

What are these PASID_MIN/MAX? Do you check if these are within the 
limits supported by the iommu/vIOMMU as its enumerated?


> +
> + if (vcmd_alloc_pasid(iommu, ))
> + return INVALID_IOASID;
> +
> + return ioasid;
> +}
> +
> +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> +{
> + struct iommu_pasid_alloc_info *svm;
> + struct intel_iommu *iommu = data;
> +
> + if (!iommu)
> + return;
> + /*
> +  * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> +  * We can only free the PASID when all the devices are unbond.
> +  */
> + svm = ioasid_find(NULL, ioasid, NULL);
> + if (!svm) {
> + pr_warn("Freeing unbond IOASID %d\n", ioasid);
> + return;
> + }
> + vcmd_free_pasid(iommu, ioasid);
> +}
> +#endif
> +
>  int __init intel_iommu_init(void)
>  {
>   int ret = -ENODEV;
> @@ -5020,6 +5062,31 @@ int __init intel_iommu_init(void)
>  "%s", iommu->name);
>   iommu_device_set_ops(>iommu, _iommu_ops);
>   iommu_device_register(>iommu);
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> + if (cap_caching_mode(iommu->cap) && sm_supported(iommu)) {

do you need to check against cap_caching_mode() or ecap_vcmd?


> + /*
> +  * Register a custom ASID allocator if we are running
> +  * in a guest, the purpose is to have a system wide 
> PASID
> +  * namespace among all PASID users.
> +  * There can be multiple vIOMMUs in each guest but only
> +  * one allocator is active. All vIOMMU allocators will
> +  * eventually be calling the same host allocator.
> +  */
> + iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> + iommu->pasid_allocator.free = intel_ioasid_free;
> + iommu->pasid_allocator.pdata = (void *)iommu;
> + ret = 
> ioasid_register_allocator(>pasid_allocator);
> + if (ret) {
> + pr_warn("Custom PASID allocator registeration 
> failed\n");
> + /*
> +  * Disable scalable mode on this IOMMU if there
> +  * is no custom allocator. Mixing SM capable 
> vIOMMU
> +  * and non-SM vIOMMU are not supported.
> +  */
> + intel_iommu_sm = 0;
> + }
> + }
> +#endif
>   }
>  
>   bus_set_iommu(_bus_type, 

Re: [PATCH v6 01/10] iommu/vt-d: Enlightened PASID allocation

2019-10-22 Thread Raj, Ashok
On Tue, Oct 22, 2019 at 04:53:14PM -0700, Jacob Pan wrote:
> From: Lu Baolu 
> 
> If Intel IOMMU runs in caching mode, a.k.a. virtual IOMMU, the
> IOMMU driver should rely on the emulation software to allocate
> and free PASID IDs. The Intel vt-d spec revision 3.0 defines a
> register set to support this. This includes a capability register,
> a virtual command register and a virtual response register. Refer
> to section 10.4.42, 10.4.43, 10.4.44 for more information.

The above paragraph seems a bit confusing, there is no reference
to caching mode for for VCMD... some suggestion below.

Enabling IOMMU in a guest requires communication with the host
driver for certain aspects. Use of PASID ID to enable Shared Virtual
Addressing (SVA) requires managing PASID's in the host. VT-d 3.0 spec
provides a Virtual Command Register (VCMD) to facilitate this.
Writes to this register in the guest are trapped by Qemu and 
proxies the call to the host driver 


> 
> This patch adds the enlightened PASID allocation/free interfaces
> via the virtual command register.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Jacob Pan 
> Reviewed-by: Eric Auger 
> ---
>  drivers/iommu/intel-pasid.c | 83 
> +
>  drivers/iommu/intel-pasid.h | 13 ++-
>  include/linux/intel-iommu.h |  2 ++
>  3 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 040a445be300..76bcbb21e112 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -63,6 +63,89 @@ void *intel_pasid_lookup_id(int pasid)
>   return p;
>  }
>  
> +static int check_vcmd_pasid(struct intel_iommu *iommu)
> +{
> + u64 cap;
> +
> + if (!ecap_vcs(iommu->ecap)) {
> + pr_warn("IOMMU: %s: Hardware doesn't support virtual command\n",
> + iommu->name);
> + return -ENODEV;
> + }
> +
> + cap = dmar_readq(iommu->reg + DMAR_VCCAP_REG);
> + if (!(cap & DMA_VCS_PAS)) {
> + pr_warn("IOMMU: %s: Emulation software doesn't support PASID 
> allocation\n",
> + iommu->name);
> + return -ENODEV;
> + }
> +
> + return 0;
> +}
> +
> +int vcmd_alloc_pasid(struct intel_iommu *iommu, unsigned int *pasid)
> +{
> + u64 res;
> + u8 status_code;
> + unsigned long flags;
> + int ret = 0;
> +
> + ret = check_vcmd_pasid(iommu);

Do you have to check this everytime? every dmar_readq is going to trap
to the other side ...

> + if (ret)
> + return ret;
> +
> + raw_spin_lock_irqsave(>register_lock, flags);
> + dmar_writeq(iommu->reg + DMAR_VCMD_REG, VCMD_CMD_ALLOC);
> + IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +   !(res & VCMD_VRSP_IP), res);
> + raw_spin_unlock_irqrestore(>register_lock, flags);
> +
> + status_code = VCMD_VRSP_SC(res);
> + switch (status_code) {
> + case VCMD_VRSP_SC_SUCCESS:
> + *pasid = VCMD_VRSP_RESULT(res);
> + break;
> + case VCMD_VRSP_SC_NO_PASID_AVAIL:
> + pr_info("IOMMU: %s: No PASID available\n", iommu->name);
> + ret = -ENOMEM;
> + break;
> + default:
> + ret = -ENODEV;
> + pr_warn("IOMMU: %s: Unexpected error code %d\n",
> + iommu->name, status_code);
> + }
> +
> + return ret;
> +}
> +
> +void vcmd_free_pasid(struct intel_iommu *iommu, unsigned int pasid)
> +{
> + u64 res;
> + u8 status_code;
> + unsigned long flags;
> +
> + if (check_vcmd_pasid(iommu))
> + return;
> +
> + raw_spin_lock_irqsave(>register_lock, flags);
> + dmar_writeq(iommu->reg + DMAR_VCMD_REG, (pasid << 8) | VCMD_CMD_FREE);
> + IOMMU_WAIT_OP(iommu, DMAR_VCRSP_REG, dmar_readq,
> +   !(res & VCMD_VRSP_IP), res);
> + raw_spin_unlock_irqrestore(>register_lock, flags);
> +
> + status_code = VCMD_VRSP_SC(res);
> + switch (status_code) {
> + case VCMD_VRSP_SC_SUCCESS:
> + break;
> + case VCMD_VRSP_SC_INVALID_PASID:
> + pr_info("IOMMU: %s: Invalid PASID\n", iommu->name);
> + break;
> + default:
> + pr_warn("IOMMU: %s: Unexpected error code %d\n",
> + iommu->name, status_code);
> + }
> +}
> +
>  /*
>   * Per device pasid table management:
>   */
> diff --git a/drivers/iommu/intel-pasid.h b/drivers/iommu/intel-pasid.h
> index fc8cd8f17de1..e413e884e685 100644
> --- a/drivers/iommu/intel-pasid.h
> +++ b/drivers/iommu/intel-pasid.h
> @@ -23,6 +23,16 @@
>  #define is_pasid_enabled(entry)  (((entry)->lo >> 3) & 0x1)
>  #define get_pasid_dir_size(entry)(1 << entry)->lo >> 9) & 0x7) + 7))
>  
> +/* Virtual command interface for enlightened pasid management. */
> +#define VCMD_CMD_ALLOC   0x1
> 

Re: [PATCH v6 00/10] Nested Shared Virtual Address (SVA) VT-d support

2019-10-22 Thread Raj, Ashok
Hi Jacob

On Tue, Oct 22, 2019 at 04:53:13PM -0700, Jacob Pan wrote:
> Shared virtual address (SVA), a.k.a, Shared virtual memory (SVM) on Intel
> platforms allow address space sharing between device DMA and applications.
> SVA can reduce programming complexity and enhance security.
> This series is intended to enable SVA virtualization, i.e. shared guest
> application address space and physical device DMA address. Only IOMMU portion

Last line "i.e shared guest application address space and physical device 
DMA address" doesn't read well. 

Simply rephrase as "enable use of SVA within a guest user application"

> of the changes are included in this series. Additional support is needed in
> VFIO and QEMU (will be submitted separately) to complete this functionality.
> 
> To make incremental changes and reduce the size of each patchset. This series
> does not inlcude support for page request services.
> 
> In VT-d implementation, PASID table is per device and maintained in the host.
> Guest PASID table is shadowed in VMM where virtual IOMMU is emulated.
> 
> .-.  .---.
> |   vIOMMU|  | Guest process CR3, FL only|
> | |  '---'
> ./
> | PASID Entry |--- PASID cache flush -
> '-'   |
> | |   V
> | |CR3 in GPA
> '-'
> Guest
> --| Shadow |--|
>   vv  v
> Host
> .-.  .--.
> |   pIOMMU|  | Bind FL for GVA-GPA  |
> | |  '--'
> ./  |
> | PASID Entry | V (Nested xlate)
> '\.--.
> | |   |SL for GPA-HPA, default domain|
> | |   '--'
> '-'

is the SL for GPA-HPA default domain? When you assign a device to 
Guest, this isn't default domain right?

> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> This is the remaining VT-d only portion of V5 since the uAPIs and IOASID 
> common
> code have been applied to Joerg's IOMMU core branch.
> (https://lkml.org/lkml/2019/10/2/833)
> 
> The complete set with VFIO patches are here:
> https://github.com/jacobpan/linux.git:siov_sva
> 
> The complete nested SVA upstream patches are divided into three phases:
> 1. Common APIs and PCI device direct assignment
> 2. Page Request Services (PRS) support
> 3. Mediated device assignment
> 
> With this set and the accompanied VFIO code, we will achieve phase #1.
> 
> Thanks,
> 
> Jacob
> 
> ChangeLog:
>   - V6
> - Rebased on top of Joerg's core branch
> (git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core)
> - Adapt to new uAPIs and IOASID allocators
> 
>   - V5
> Rebased on v5.3-rc4 which has some of the IOMMU fault APIs merged.
> Addressed v4 review comments from Eric Auger, Baolu Lu, and
>   Jonathan Cameron. Specific changes are as follows:
> - Refined custom IOASID allocator to support multiple vIOMMU, hotplug
>   cases.
> - Extracted vendor data from IOMMU guest PASID bind data, for VT-d
>   will support all necessary guest PASID entry fields for PASID
>   bind.
> - Support non-identity host-guest PASID mapping
> - Exception handling in various cases
> 
>   - V4
> - Redesigned IOASID allocator such that it can support custom
> allocators with shared helper functions. Use separate XArray
> to store IOASIDs per allocator. Took advice from Eric Auger to
> have default allocator use the generic allocator structure.
> Combined into one patch in that the default allocator is just
> "another" allocator now. Can be built as a module in case of
> driver use without IOMMU.
> - Extended bind guest PASID data to support SMMU and non-identity
> guest to host PASID mapping https://lkml.org/lkml/2019/5/21/802
> - Rebased on Jean's sva/api common tree, new patches starts with
>  [PATCH v4 10/22]
> 
>   - V3
> - Addressed thorough review comments from Eric Auger (Thank you!)
> - Moved IOASID allocator from driver core to IOMMU code per
>   suggestion by Christoph Hellwig
>   (https://lkml.org/lkml/2019/4/26/462)
> - Rebased on top of Jean's SVA API branch and Eric's v7[1]
>   (git://linux-arm.org/linux-jpb.git sva/api)
> - All IOMMU APIs are unmodified (except the new bind guest PASID
>   call in patch 9/16)
> 
>   - V2
> - Rebased on Joerg's IOMMU x86/vt-d branch v5.1-rc4
> - Integrated with Eric Auger's new v7 series for common APIs
> 

Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-23 Thread Raj, Ashok
On Mon, Sep 23, 2019 at 08:24:52PM +0800, Lu Baolu wrote:
> This adds functions to manipulate first level page tables
> which could be used by a scalale mode capable IOMMU unit.

s/scalale/scalable

> 
> intel_mmmap_range(domain, addr, end, phys_addr, prot)

Maybe think of a different name..? mmmap seems a bit weird :-)

>  - Map an iova range of [addr, end) to the physical memory
>started at @phys_addr with the @prot permissions.
> 
> intel_mmunmap_range(domain, addr, end)
>  - Tear down the map of an iova range [addr, end). A page
>list will be returned which will be freed after iotlb
>flushing.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Liu Yi L 
> Cc: Yi Sun 
> Signed-off-by: Lu Baolu 


Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest

2019-09-23 Thread Raj, Ashok
Hi Jacob

On Mon, Sep 23, 2019 at 12:27:15PM -0700, Jacob Pan wrote:
> > 
> > In VT-d 3.0, scalable mode is introduced, which offers two level
> > translation page tables and nested translation mode. Regards to
> > GIOVA support, it can be simplified by 1) moving the GIOVA support
> > over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> > 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> > second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> > dual stage) translation in host. Compared with current shadow GIOVA
> > support, the new approach is more secure and software is simplified
> > as we only need to flush the pIOMMU IOTLB and possible device-IOTLB
> > when an IOVA mapping in vIOMMU is torn down.
> > 
> >  .---.
> >  |  vIOMMU   |
> >  |---| .---.
> >  |   |IOTLB flush trap |   QEMU|
> >  .---.(unmap)  |---|
> >  | GVA->GPA  |>|   |
> >  '---' '---'
> >  |   |   |
> >  '---'   |
> ><--
> >|  VFIO/IOMMU  
> >|  cache invalidation and  
> >| guest gpd bind interfaces
> >v
> For vSVA, the guest PGD bind interface will mark the PASID as guest
> PASID and will inject page request into the guest. In FL gIOVA case, I
> guess we are assuming there is no page fault for GIOVA. I will need to
> add a flag in the gpgd bind such that any PRS will be auto responded
> with invalid.

Is there real need to enforce this? I'm not sure if there is any
limitation in the spec, and if so, can the guest check that instead?

Also i believe the idea is to overcommit PASID#0 such uses. Thought
we had a capability to expose this to the vIOMMU as well. Not sure if this
is already documented, if not should be up in the next rev.


> 
> Also, native use of IOVA FL map is not to be supported? i.e. IOMMU API
> and DMA API for native usage will continue to be SL only?
> >  .---.
> >  |  pIOMMU   |
> >  |---|
> >  .---.
> >  | GVA->GPA  |<---First level
> >  '---'
> >  | GPA->HPA  |<---Scond level

s/Scond/Second

> >  '---'
> >  '---'
> > 
> > This patch series only aims to achieve the first goal, a.k.a using
> > first level translation for IOVA mappings in vIOMMU. I am sending
> > it out for your comments. Any comments, suggestions and concerns are
> > welcomed.
> > 
> 
> 
> > Based-on-idea-by: Ashok Raj 
> > Based-on-idea-by: Kevin Tian 
> > Based-on-idea-by: Liu Yi L 
> > Based-on-idea-by: Lu Baolu 
> > Based-on-idea-by: Sanjay Kumar 
> > 
> > Lu Baolu (4):
> >   iommu/vt-d: Move domain_flush_cache helper into header
> >   iommu/vt-d: Add first level page table interfaces
> >   iommu/vt-d: Map/unmap domain with mmmap/mmunmap
> >   iommu/vt-d: Identify domains using first level page table
> > 
> >  drivers/iommu/Makefile |   2 +-
> >  drivers/iommu/intel-iommu.c| 142 ++--
> >  drivers/iommu/intel-pgtable.c  | 342
> > + include/linux/intel-iommu.h|
> > 31 ++- include/trace/events/intel_iommu.h |  60 +
> >  5 files changed, 553 insertions(+), 24 deletions(-)
> >  create mode 100644 drivers/iommu/intel-pgtable.c
> > 
> 
> [Jacob Pan]


Re: [PATCH v4 20/22] iommu/vt-d: Add bind guest PASID support

2019-07-16 Thread Raj, Ashok
Hi Eric

Jacob is on sabbatical, so i'll give it my best shot :-)

Yi/Kevin can jump in...

On Tue, Jul 16, 2019 at 06:45:51PM +0200, Auger Eric wrote:
> Hi Jacob,
> 
> On 6/9/19 3:44 PM, Jacob Pan wrote:
> > When supporting guest SVA with emulated IOMMU, the guest PASID
> > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > will result in PASID cache flush which will be passed down to
> > the host as bind guest PASID calls.
> > 
> > For the SL page tables, it will be harvested from device's
> > default domain (request w/o PASID), or aux domain in case of
> > mediated device.
> > 
> > .-.  .---.
> > |   vIOMMU|  | Guest process CR3, FL only|
> > | |  '---'
> > ./
> > | PASID Entry |--- PASID cache flush -
> > '-'   |
> > | |   V
> > | |CR3 in GPA
> > '-'
> > Guest
> > --| Shadow |--|
> >   vv  v
> > Host
> > .-.  .--.
> > |   pIOMMU|  | Bind FL for GVA-GPA  |
> > | |  '--'
> > ./  |
> > | PASID Entry | V (Nested xlate)
> > '\.--.
> > | |   |SL for GPA-HPA, default domain|
> > | |   '--'
> > '-'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan 
> > Signed-off-by: Liu, Yi L 
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 187 
> > 
> >  include/linux/intel-iommu.h |  13 ++-
> >  include/linux/intel-svm.h   |  17 
> >  4 files changed, 219 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index 7cfa0eb..3b4d712 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5782,6 +5782,10 @@ const struct iommu_ops intel_iommu_ops = {
> > .dev_enable_feat= intel_iommu_dev_enable_feat,
> > .dev_disable_feat   = intel_iommu_dev_disable_feat,
> > .pgsize_bitmap  = INTEL_IOMMU_PGSIZES,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +   .sva_bind_gpasid= intel_svm_bind_gpasid,
> > +   .sva_unbind_gpasid  = intel_svm_unbind_gpasid,
> > +#endif
> >  };
> >  
> >  static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index 66d98e1..f06a82f 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -229,6 +229,193 @@ static LIST_HEAD(global_svm_list);
> > list_for_each_entry(sdev, >devs, list) \
> > if (dev == sdev->dev)   \
> >  
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev,
> > +   struct gpasid_bind_data *data)
> > +{
> > +   struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +   struct intel_svm_dev *sdev;
> > +   struct intel_svm *svm = NULL;
> not requested
> > +   struct dmar_domain *ddomain;
> > +   int ret = 0;
> > +
> > +   if (WARN_ON(!iommu) || !data)
> > +   return -EINVAL;
> > +
> > +   if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > +   data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > +   return -EINVAL;
> > +
> > +   if (dev_is_pci(dev)) {
> > +   /* VT-d supports devices with full 20 bit PASIDs only */
> > +   if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > +   return -EINVAL;
> > +   }
> > +
> > +   /*
> > +* We only check host PASID range, we have no knowledge to check
> > +* guest PASID range nor do we use the guest PASID.
> guest pasid is set below in svm->gpasid.
> So I am confused, do you handle gpasid or not? If you don't use gpasid
> at the moment, then you may return -EINVAL if data->flags &
> IOMMU_SVA_GPASID_VAL.
> 
> I confess I don't really understand gpasid as I thought you use
> enlightened PASID allocation to use a system wide PASID allocator on
> host so in which case guest and host PASID do differ?

Correct, from within the guest, we only use the enlightned allocator.

Guest PASID can be managed via Qemu, and it will also associate
guest pasid's with appropriate host pasids. Sort of how guest bdf
and host bdf are managed for page-request etc.



Re: Device specific pass through in host systems - discuss user interface

2019-06-10 Thread Raj, Ashok
On Mon, Jun 10, 2019 at 09:38:11PM -0700, Sai Praneeth Prakhya wrote:
> Hi All,
> 
> + Sohil and Rob Clark (as there are dropped from CC'list)
> 
> > > > Most iommu vendor drivers have switched from per-device to per-group
> > > > domain (a.k.a. default domain). So per-group pass-through mode makes
> > more sense?
> > > >
> > > > By the way, can we extend this to "per-group default domain type",
> > > > instead of only "per-group pass-through mode"? Currently we have
> > > > system level default domain type, if we have finer granularity of
> > > > default domain type, both iommu drivers and end users will benefit from 
> > > > it.
> > >
> > > Sure! Makes sense.. per-group default domain type sounds good.
> 
> I am planning to implement an RFC (supporting only runtime case for now) 
> which works as below
> 
> 1. User unbinds the driver by writing to sysfs
> 2. User puts a group in pass through mode by writing "1" to
> /sys/kernel/iommu_groups//pt

might be better to read current value of default domain for that group.. 
/sys/kernel/iommu_groups//default_domain

reading the above value shows current setting.
provide a differnet file next_def_domain, and you can echo "pt" or "dma_domain"
to switch to pass-through, or normal dma isolation mode.

For devices that automatically set to pass through today like graphics, or 
isoch audio
you can show "pt" as default_domain.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: Device specific pass through in host systems - discuss user interface

2019-06-10 Thread Raj, Ashok
Hi Sai

On Sun, Jun 09, 2019 at 10:41:10PM -0700, Sai Praneeth Prakhya wrote:
> > > I am working on an IOMMU driver feature that allows a user to specify
> > > if the DMA from a device should be translated by IOMMU or not.
> > > Presently, we support only all devices or none mode i.e. if user
> > > specifies "iommu=pt" [X86] or "iommu.passthrough" [ARM64] through
> > > kernel command line, all the devices would be in pass through mode and
> > > we don't have per device granularity, but, we were requested by a
> > > customer to selectively put devices in pass through mode and not all.
> > 
> > Most iommu vendor drivers have switched from per-device to per-group domain
> > (a.k.a. default domain). So per-group pass-through mode makes more sense?
> > 
> > By the way, can we extend this to "per-group default domain type", instead 
> > of
> > only "per-group pass-through mode"? Currently we have system level default
> > domain type, if we have finer granularity of default domain type, both iommu
> > drivers and end users will benefit from it.
> 
> Sure! Makes sense.. per-group default domain type sounds good.
> 
> > > I am looking for a consensus on **how the kernel command line argument
> > > should look like and path for sysfs entry**. Also, please note that if
> > > a device is put in pass through mode it won't be available for the
> > > guest and that's ok.
> > 
> > Just out of curiosity, what's the limitation for a device using pass- 
> > through DMA
> > domain to be assignable.
> 
> Sorry! I don't know about assignable devices. Probably, Ashok or Jacob could 
> answer this question

We don't switch the domain for assigned devices. Only the "type" of the default 
domain is 
changed from dma-protected to passthrough type.

When assigning devices to user-space, there is no change in this proposal.

> 
> Regards,
> Sai
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu: Bind process address spaces to devices

2019-02-28 Thread Raj, Ashok
On Thu, Feb 28, 2019 at 01:15:49PM -0800, Jacob Pan wrote:
> On Thu, 28 Feb 2019 15:09:50 +0100
> Joerg Roedel  wrote:
> 
> > Hi Jacob,
> > 
> > On Wed, Feb 27, 2019 at 01:41:29PM -0800, Jacob Pan wrote:
> > > On Tue, 26 Feb 2019 12:17:43 +0100
> > > Joerg Roedel  wrote:  
> > 
> > > Just trying to understand how to use this API.
> > > So if we bind the same mm to two different devices, we should get
> > > two different iommu_sva handle, right?
> > > I think intel-svm still needs a flag argument for supervisor pasid
> > > etc. Other than that, I think both interface should work for vt-d.  
> > 
> > I second Jean's question here, is supervisor pasid still needed with
> > scalable mode? What is the use-case and which mm_struct will be used
> > for supervisor accesses?
> > 
> I will delegate this to Ashok.

Supervisor PASID is still required for some kernel clients. Some of our
IB folks had asked for it. Current implementation uses init_mm, but
we know this is dangerous. Plus since the kernel has no support for
mmu_notifiers for kernel memory we were not able to invalidate
device tlb after memory was freed.

Suppose we could just build regular page-tables much like how the map/unmap
does today, but bind it with a Supervisor PASID. This way we don't open
up the kimono to the device, but only open select portions on request.

we haven't spent enough time on it lately, but will focus once the core
pieces are completed for the baseline support for Scalable mode.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 2/2] iommu/vt-d: Enable PASID only if device expects PASID in PRG Response.

2019-02-13 Thread Raj, Ashok
On Wed, Feb 13, 2019 at 12:26:33AM -0800, Tian, Kevin wrote:
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index 1457f931218e..af2e4a011787 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -1399,7 +1399,8 @@ static void iommu_enable_dev_iotlb(struct
> > device_domain_info *info)
> >undefined. So always enable PASID support on devices which
> >have it, even if we can't yet know if we're ever going to
> >use it. */
> > -   if (info->pasid_supported && !pci_enable_pasid(pdev, info-
> > >pasid_supported & ~1))
> > +   if (info->pasid_supported && pci_prg_resp_pasid_required(pdev)
> > &&
> > +   !pci_enable_pasid(pdev, info->pasid_supported & ~1))
> > info->pasid_enabled = 1;
> 
> Above logic looks problematic. As Dave commented in another thread,
> PRI and PASID are orthogonal capabilities. Especially with introduction
> of VT-d scalable mode, PASID will be used alone even w/o PRI...
> 
> Why not doing the check when PRI is actually enabled? At that point
> you can fail the request if above condition is false. 
> 

That makes sense. 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 1/2] PCI: ATS: Add function to check ATS page aligned request status.

2019-02-11 Thread Raj, Ashok
On Fri, Feb 08, 2019 at 11:49:55PM -0500, Sinan Kaya wrote:
> On 2/8/2019 8:02 PM, sathyanarayanan kuppuswamy wrote:
> >>This means that you should probably have some kind of version check
> >>here.
> >
> >There is no version field in ATS v1.0 spec. Also, If I follow the history
> >log in PCI spec, I think ATS if first added at v1.2. Please correct me if
> >I am wrong.
> 
> v1.2 was incorporated into PCIe spec at that time. However, the ATS spec
> is old and there could be some HW that could claim to be ATS compatible.
> I know AMD GPUs declare ATS capability.

It seems rather odd we have to check for ATS version.

I always assumed unspecified bits (Reserved) must be 0. We only check
this if ATS is enabled, and this particular bit wasn't given away for another
feature.

Is it really required to check for ATS version before consuming this?


> 
> See this ECN
> 
> https://composter.com.ua/documents/ats_r1.1_26Jan09.pdf
> 
> You need to validate the version field from ATS capability header to be
> 1 before reading this register.
> 
> See Table 5-1:  ATS Extended Capability Header
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 1/1] iommu/vt-d: Enable PRI only if the device enables PASID.

2019-02-07 Thread Raj, Ashok
On Thu, Feb 07, 2019 at 09:15:24PM +, David Woodhouse wrote:
> On Thu, 2019-02-07 at 13:09 -0800, Raj, Ashok wrote:
> > You are right.. they are completely orthogonal. We just don't have
> > a way to handle the page-requests for request without PASID's.
> > 
> > There are some of the vIOMMU work to pass the PRI to who owns
> > the device, and we can certainly relax it then. This is just to reflect
> > what support exists today. FWIW, even the native driver maybe be able
> > to resolve this if supported.
> 
> As things stand, if a device makes a PRI request without a PASID, it'll
> get told that we didn't manage to bring the page in for it. Which is
> true.

That's true...it does seem to be covered already.. I can't remember
why I thought this was required :-(.. 

We can drop this patch.
> 
> What's the actual problem being fixed by this patch? Yes, we're going
> to want to hook up a way to pass the PRI to the right place... but why
> add *another* thing that's just going to have to be fixed, by reverting
> this patch?
> 


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 1/1] iommu/vt-d: Enable PRI only if the device enables PASID.

2019-02-07 Thread Raj, Ashok
On Thu, Feb 07, 2019 at 08:08:06PM +, David Woodhouse wrote:
> On Thu, 2019-02-07 at 10:44 -0800, sathyanarayanan.kuppusw...@linux.intel.com 
> wrote:
> > From: Kuppuswamy Sathyanarayanan 
> > 
> > 
> > Intel IOMMU Page Request Services (PRS) only works with devices which
> > supports/uses PASID. So enable PRI only if the device also enables
> > PASID support. For more details, Please check the implementation of PRQ
> > handler(prq_event_thread()) in intel-svm driver.
> > 
> > Cc: Jacob Pan 
> > Cc: Ashok Raj 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
> 
> Hm, that shouldn't be true. PRI and PASID support are orthogonal.
> 
> If we get a PRI request without PASID, we should currently report it as
> non-serviced which is the right thing to do. We can hook this up to KVM
> etc. to actually allow paging of guests with devices attached, *if* the
> devices attached to those guests support PRI.
> 
> Nothing fundamentally stops us using PRI without PASID support.

You are right.. they are completely orthogonal. We just don't have
a way to handle the page-requests for request without PASID's.

There are some of the vIOMMU work to pass the PRI to who owns
the device, and we can certainly relax it then. This is just to reflect
what support exists today. FWIW, even the native driver maybe be able
to resolve this if supported.

> 
> 


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/4] iommu/vt-d: Force IOMMU on for platform opt in hint

2018-11-12 Thread Raj, Ashok
On Mon, Nov 12, 2018 at 11:09:00AM -0700, Alex Williamson wrote:
> On Mon, 12 Nov 2018 19:06:26 +0300
> Mika Westerberg  wrote:
> 
> > From: Lu Baolu 
> > 
> > Intel VT-d spec added a new DMA_CTRL_PLATFORM_OPT_IN_FLAG flag
> > in DMAR ACPI table for BIOS to report compliance about platform
> > initiated DMA restricted to RMRR ranges when transferring control
> > to the OS. The OS treats this as a hint that the IOMMU should be
> > enabled to prevent DMA attacks from possible malicious devices.
> 
> Does this in any way suggest that there are additional recommended uses
> cases from Intel for RMRRs?  My concern here is the incompatibility we
> have with RMRRs and device assignment as we currently cannot assign
> devices where the IOVA address space is encumbered by RMRR
> requirements.  Unfortunately RMRRs do not indicate any sort or
> lifespan, so firmware enabling an RMRR simply to support some boot-time
> DMA encumbers the device with that RMRR for the life of that boot,
> unless we have VT-d code that decides it knows better.  Thanks,

IMO any new platform that requires RMRR should be a bug. It was designed 
originally for some legacy keyboard emulation etc.

The best behavior is to continue to not allow devices with RMRR be direct 
assigned.

Technically ignoring RMRR's and continuing to assign those devices is risky.
The problem is IF BIOS/SMM initiates some IO in the RMRR range and it happens 
to be
mapped by the direct assigned GPA its going to be ugly failure.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] iommu/vt-d: Do not enable ATS for external devices

2018-11-12 Thread Raj, Ashok
On Mon, Nov 12, 2018 at 07:06:27PM +0300, Mika Westerberg wrote:
> Currently Linux automatically enables ATS (Address Translation Service)
> for any device that supports it (and IOMMU is turned on). ATS is used to
> accelerate DMA access as the device can cache translations locally so
> there is no need to do full translation on IOMMU side. However, as
> pointed out in [1] ATS can be used to bypass IOMMU based security
> completely by simply sending PCIe read/write transaction with AT
> (Address Translation) field set to "translated".
> 
> To mitigate this modify the Intel IOMMU code so that it does not enable
> ATS for any device that is marked as being external. In case this turns
> out to cause performance issues we may selectively allow ATS based on
> user decision but currently use big hammer and disable it completely to
> be on the safe side.
> 
> [1] https://www.repository.cam.ac.uk/handle/1810/274352
> 
> Signed-off-by: Mika Westerberg 

Reviewed-by: Ashok Raj 

> ---
>  drivers/iommu/intel-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ada786b05a59..b79788da6971 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1473,7 +1473,8 @@ static void iommu_enable_dev_iotlb(struct 
> device_domain_info *info)
>   if (info->pri_supported && !pci_reset_pri(pdev) && 
> !pci_enable_pri(pdev, 32))
>   info->pri_enabled = 1;
>  #endif
> - if (info->ats_supported && !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
> + if (!pdev->is_external && info->ats_supported &&
> + !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
>   info->ats_enabled = 1;
>   domain_update_iotlb(info->domain);
>   info->ats_qdep = pci_ats_queue_depth(pdev);
> -- 
> 2.19.1
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/4] iommu/vt-d: Force IOMMU on for platform opt in hint

2018-11-12 Thread Raj, Ashok
On Mon, Nov 12, 2018 at 07:06:26PM +0300, Mika Westerberg wrote:
> From: Lu Baolu 
> 
> Intel VT-d spec added a new DMA_CTRL_PLATFORM_OPT_IN_FLAG flag
> in DMAR ACPI table for BIOS to report compliance about platform
> initiated DMA restricted to RMRR ranges when transferring control
> to the OS. The OS treats this as a hint that the IOMMU should be
> enabled to prevent DMA attacks from possible malicious devices.
> 
> A use of this flag is Kernel DMA protection for Thunderbolt[1]
> which in practice means that IOMMU should be enabled for PCIe
> devices connected to the Thunderbolt ports. With IOMMU enabled
> for these devices, all DMA operations are limited in the range
> reserved for it, thus the DMA attacks are prevented. All these
> devices are enumerated in the PCI/PCIe module and marked with
> an is_external flag.
> 
> This forces IOMMU to be enabled if DMA_CTRL_PLATFORM_OPT_IN_FLAG
> is set in DMAR ACPI table and there are PCIe devices marked as
> is_external in the system. This can be turned off by adding
> "intel_iommu=off" in the kernel command line, if any problems are
> found.
> 
> [1] 
> https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Sohil Mehta 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Mika Westerberg 

Looks good to me

Reviewed-by: Ashok Raj 

> ---
>  drivers/iommu/dmar.c| 25 +
>  drivers/iommu/intel-iommu.c | 55 +++--
>  include/linux/dmar.h|  8 ++
>  3 files changed, 86 insertions(+), 2 deletions(-)
 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-10-23 Thread Raj, Ashok
On Mon, 2018-10-22 at 17:03 +0100, Jean-Philippe Brucker wrote:
> On 22/10/2018 11:07, Raj, Ashok wrote:
> > > For my own convenience I've been using the SVA infrastructure
> > > since
> > > I already had the locking and IOMMU ops in place. The
> > > proposed
> > > interface is also missing min_pasid and max_pasid parameters,
> > > which
> > > could be needed by device drivers to enforce PASID limits.
> > 
> > Variable PASID limits per-device is hard to manage. I suppose we
> > can play
> > some games to get this right, but I suspect that wont be very
> > useful in 
> > the long run.
> 
> Devices may have PASID limits that are not discoverable with the PCI
> PASID capability (https://patchwork.kernel.org/patch/9989307/#2138957
> 1).
> Even if we simply reject devices that don't support enough PASIDs, I
> think it's still better to return an error on bind or init_device
> than
> to return a valid PASID that they can't use

That sounds reasonable. What I meant was that trying to do similar
things like what we do for IOVA (Reserving from high etc) may not work
when a process might need to share the same PASID between 2
accelerators.


> 
> > #1: Given that PASID is a system wide resource, during boot iommu
> > driver needs 
> > to scan and set the max PASID to be no more than
> > min(iommu_supported_max_pasid) 
> > of all available IOMMU's. 
> > 
> > #2: In addition if any device supports less than the choosen system
> > max PASID
> > we should simply disable PASID on that device.
> > 
> > The reason is if the process were to bind to 2 or more accelerators
> > and
> > the second device has a limit smaller than the first that the
> > application
> > already attached, the second attemt to bind would fail. (Because
> > we would use the same PASID for a single process)
> 
> But that's not reason enough to completely disable PASID for the
> device,
> it could be the only one bound to that process, or PASID could be
> only
> used privately by the host device driver.

Agree, there could be other use cases. 

If the device was already attached during boot the driver comes early
to get the low PASID space. If the device was hot-added and the PASID
supported by device wasn't available its going to fail.

Enforcing something that will always work will be more reliable. But i
agree it maybe too strict for some cases. 

Maybe its a IOMMU enforced limit for the platform on the minimum
requirement for consistency. 

> 
> > In addition for Intel IOMMU's PASID is a system wide resource. We
> > have
> > a virtual capability for vIOMMU's to allocate PASID's. At the time
> > of
> > allocation we don't know what device this PASID is even being
> > allocated 
> > for. 
> 
> Ah, I had missed that part, I thought the PV allocation had the
> device
> ID as well. That's a significant change.
> 
> I was still hoping that we could go back to per-device PASID spaces
> in
> the host, since I still haven't seen any convincing argument in favor
> of
> system-wide PASID. Get rid of #1 in order to solve #2, and as a bonus
> support more PASIDs in total. Until now PASID was a per-device
> resource
> except for choices made when writing the IOMMU driver.

A global shared PASID table is designed with a future ISA extension for
PASID isolation. 
> 
> > Certainly we could add other intelligence to pass a hint for 
> > max_pasid in the virtiual interface. 
> > 
> > I suppose when this becomes real, most serious accelerators will 
> > support the full width.
> 
> I don't know if that's obvious from the device perspective. If I had
> to
> implement one, I would simply size my PASIDs to the number of
> contexts
> supported by my device (which might be especially small in the
> embedded
> space), given that technically nothing prevents software from
> supporting
> that and the specification allows me to choose a width.

That's very reasonable. But supporting a smaller number of contexts
is different from supporting smaller number of PASID's. A device could
support the full PASID width, but limit the number of contexts to a
smaller number. Certainly if the device vendors don't know upfront that
could be an issue. 

> 
> This was the intended model for PCI, but thankfully version 4.0 added
> an
> implementation note (6.20.2.1 - PASID Field) advising against this
> approach, and to instead support 20 bits for interoperability. Maybe
> it
> will set a trend for non-PCI devices as well.

That's a great suggestion. In fact we have already made this suggestion
 to our PCI WG rep. For Scalable IOV devices the white-paper we
published should hint that we expect devices to support the full 20 bit
PASID width.

> 
> Thanks,
> Jean
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 2/6] drivers core: Add I/O ASID allocator

2018-10-22 Thread Raj, Ashok
On Mon, Oct 22, 2018 at 12:49:47PM +0800, Lu Baolu wrote:
> Hi,
> 
> On 10/20/18 2:11 AM, Jean-Philippe Brucker wrote:
> > Some devices might support multiple DMA address spaces, in particular
> > those that have the PCI PASID feature. PASID (Process Address Space ID)
> > allows to share process address spaces with devices (SVA), partition a
> > device into VM-assignable entities (VFIO mdev) or simply provide
> > multiple DMA address space to kernel drivers. Add a global PASID
> > allocator usable by different drivers at the same time. Name it I/O ASID
> > to avoid confusion with ASIDs allocated by arch code, which are usually
> > a separate ID space.
> > 
> > The IOASID space is global. Each device can have its own PASID space,
> > but by convention the IOMMU ended up having a global PASID space, so
> > that with SVA, each mm_struct is associated to a single PASID.
> > 
> > The allocator doesn't really belong in drivers/iommu because some
> > drivers would like to allocate PASIDs for devices that aren't managed by
> > an IOMMU, using the same ID space as IOMMU. It doesn't really belong in
> > drivers/pci either since platform device also support PASID. Add the
> > allocator in drivers/base.
> 
> One concern of moving pasid allocator here is about paravirtual
> allocation of pasid.
> 
> Since there is only a single set of pasid tables which is controlled by

Minor correction: Single system wide PASID namespace, but PASID tables
would be created ideally per-bdf for isolation purposes. 

I'm sure you meant name space, but didn't want that to be mis-interpreted.


> the host, the pasid is a system wide resource. When a driver running in
> a guest VM wants to consume a pasid, it must be intercepted by the
> simulation software and routed the allocation to the host via VFIO. Some
> iommu arch's provide mechanisms to aid this, for example, the virtual
> command interfaces defined in vt-d 3.0. Any pasid used in guest VM
> should go through the virtual command interfaces.
> 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-10-22 Thread Raj, Ashok
On Fri, Oct 19, 2018 at 07:11:52PM +0100, Jean-Philippe Brucker wrote:
> This is a first prototype adding auxiliary domain support to Arm SMMUv3,
> following Lu Baolu's latest proposal for IOMMU aware mediated devices
> [1]. It works, but the attach() API still doesn't feel right. See (2)
> below.
> 
> Patch 1 adapts iommu.c to the current proposal for auxiliary domains.
> Patches 2-4 rework the PASID allocator to make it usable for SVA and
> AUXD simultaneously. Patches 5-6 add AUXD support to SMMUv3.
> 
> 
> When a device can have multiple address space, for instance with PCI
> PASID, an auxiliary domain (AUXD) is the IOMMU representation of one
> address space. I distinguish auxiliary from "main" domain, which
> represents the non-PASID address space but also (at least for SMMUv3)
> the whole device context, PASID tables etc.
> 
> Auxiliary domains will be used by VFIO for IOMMU-aware mdev, and by any
> other device driver that wants to use PASID for private address spaces
> (as opposed to SVA [2]). The following API is available to device
> drivers:
> 
> (1) Enable AUXD for a device. Enable PASID if necessary and set an AUXD
> flag on the IOMMU data associated to a device.
> 
> For my own convenience I've been using the SVA infrastructure since
> I already had the locking and IOMMU ops in place. The proposed
> interface is also missing min_pasid and max_pasid parameters, which
> could be needed by device drivers to enforce PASID limits.

Variable PASID limits per-device is hard to manage. I suppose we can play
some games to get this right, but I suspect that wont be very useful in 
the long run.

#1: Given that PASID is a system wide resource, during boot iommu driver needs 
to scan and set the max PASID to be no more than min(iommu_supported_max_pasid) 
of all available IOMMU's. 

#2: In addition if any device supports less than the choosen system max PASID
we should simply disable PASID on that device.

The reason is if the process were to bind to 2 or more accelerators and
the second device has a limit smaller than the first that the application
already attached, the second attemt to bind would fail. (Because
we would use the same PASID for a single process)

In addition for Intel IOMMU's PASID is a system wide resource. We have
a virtual capability for vIOMMU's to allocate PASID's. At the time of
allocation we don't know what device this PASID is even being allocated 
for. Certainly we could add other intelligence to pass a hint for 
max_pasid in the virtiual interface. 

I suppose when this becomes real, most serious accelerators will 
support the full width. Hence doing #1 and #2 above should be good
for most cases.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: source-id verification failures

2018-10-05 Thread Raj, Ashok
On Thu, Oct 04, 2018 at 03:07:46PM -0700, Jacob Pan wrote:
> On Thu, 4 Oct 2018 13:57:24 -0700
> Jerry Snitselaar  wrote:
> 
> > On Thu Oct 04 18, Joerg Roedel wrote:
> > >Hi Jerry,
> > >
> > >thanks for the report.
> > >
> > >On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar wrote:  
> > >> I've been trying to track down a problem where an hp dl380 gen8
> > >> with a Cavium QLogic BR-1860 Fabric Adapter is getting source-id
> > >> verification failures when running dhclient against that
> > >> interface. This started showing up when I backported the iova
> > >> deferred flushing patches. So far this has only been seen on this
> > >> one system, but I'm trying to understand why it appears with the
> > >> new deferred flushing code. I also see it with both 4.18.10, and
> > >> 4.19.0-rc6 kernels.

Weird.. IRC, these were there to accomodate phantom functions.
Thought PCIe allowed 8bit tag, so if the device needs to allow
more than 256 outstanding transactions, one could use the extra
functions to account for.

I assumed Linux didn't enable phantom functions. If that's the case
we also need to ensure all the DMA is aliased properly.

I'm assuming if interrupts are generated by other aliases we could
block them. 

Is this device one such?

Cheers,
Ashok

> > >>
> > >> [35645.282021] bna :24:00.1 ens5f1: link down
> > >> [35645.298396] bna :24:00.0 ens5f0: link down
> > >> [35650.313210] DMAR: DRHD: handling fault status reg 2
> > >> [35650.332477] DMAR: [INTR-REMAP] Request device [24:00.0] fault
> > >> index 14 [fault reason 38] Blocked an interrupt request due to
> > >> source-id verification failure [35655.137667] bna :24:00.0
> > >> ens5f0: link up [35657.532454] bna :24:00.1 ens5f1: link up
> > >> [35664.281563] bna :24:00.1 ens5f1: link down [35664.298103]
> > >> bna :24:00.0 ens5f0: link down [35669.313568] DMAR: DRHD:
> > >> handling fault status reg 102 [35669.333198] DMAR: [INTR-REMAP]
> > >> Request device [24:00.0] fault index 14 [fault reason 38] Blocked
> > >> an interrupt request due to source-id verification failure
> > >> [35674.081212] bna :24:00.0 ens5f0: link up [35674.981280] bna
> > >> :24:00.1 ens5f1: link up
> > >>
> > >>
> > >> Any ideas?  
> > >
> > >No, not yet. Can you please post the output of lscpi -vvv?
> > >
> > >Jacob, can you or someone from your team please also have a look into
> > >this problem report?
> > >
> yep.
> +Ashok
> 
> Jerry,
> Could you also dump the interrupt remapping table with this patchset?
> https://lkml.org/lkml/2018/9/12/44
> 
> Thanks,
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V3] PCI: Enable PASID when End-to-End TLP is supported by all bridges

2018-09-26 Thread Raj, Ashok
Hi Sinan

+ IOMMU list.

On Sat, Jun 30, 2018 at 11:24:24AM -0400, Sinan Kaya wrote:
> A PCIe endpoint carries the process address space identifier (PASID) in
> the TLP prefix as part of the memory read/write transaction. The address
> information in the TLP is relevant only for a given PASID context.
> 
> An IOMMU takes PASID value and the address information from the
> TLP to look up the physical address in the system.
> 
> PASID is an End-End TLP Prefix (PCIe r4.0, sec 6.20).  Sec 2.2.10.2 says
> 
>   It is an error to receive a TLP with an End-End TLP Prefix by a
>   Receiver that does not support End-End TLP Prefixes. A TLP in
>   violation of this rule is handled as a Malformed TLP. This is a
>   reported error associated with the Receiving Port (see Section 6.2).
> 
> Prevent error condition by proactively requiring End-to-End TLP
> prefix to be supported on the entire data path between the endpoint and
> the root port before enabling PASID.
> 
> Signed-off-by: Sinan Kaya 
> ---
>  
> +static void pci_configure_eetlp_prefix(struct pci_dev *dev)
> +{
> +#ifdef CONFIG_PCI_PASID
> + struct pci_dev *bridge;
> + u32 cap;
> +
> + if (!pci_is_pcie(dev))
> + return;
> +
> + pcie_capability_read_dword(dev, PCI_EXP_DEVCAP2, );
> + if (!(cap & PCI_EXP_DEVCAP2_E2ETLP))
> + return;

Forgot to notice this.. I'm not sure if the same enforcement is 
required for devices that are RCIEP. The spec isn't clear about calling
any excemption. Although it should be simple for devices to expose
e2etlp support, but if they don't that should be ok, since there are
nothing between itself and the root port.

We are seeking help from our SIG reps, but thought I'll ask here as well
if there are other opinions.

> +
> + if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT)
> + dev->eetlp_prefix_path = 1;
> + else {
> + bridge = pci_upstream_bridge(dev);
> + if (bridge && bridge->eetlp_prefix_path)
> + dev->eetlp_prefix_path = 1;
> + }
> +#endif
> +}

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-09-19 Thread Raj, Ashok
Hi Alex


On Tue, Sep 18, 2018 at 09:59:57PM -0600, Alex Williamson wrote:
> On Wed, 12 Sep 2018 10:46:19 -0700
> "Raj, Ashok"  wrote:
> 
> > On Thu, Aug 09, 2018 at 01:44:17PM -0600, Alex Williamson wrote:
> > > On Thu,  9 Aug 2018 12:37:06 -0700
> > > Ashok Raj  wrote:
> > >   
> > > > PCI_INTERRUPT_PIN should always read  0 for SRIOV Virtual
> > > > Functions.
> > > > 
> > > > Some SRIOV devices have some bugs in RTL and VF's end up reading 1
> > > > instead of 0 for the PIN.  
> > > 
> > > Hi Ashok,
> > > 
> > > One question, can we identify which VFs are known to have this
> > > issue so that users (and downstreams) can know how to prioritize
> > > this patch?  
> > 
> > Hi Alex
> > 
> > Sorry it took some time to hunt this down. 
> > 
> > The offending VF has a device ID : 0x270C
> > The corresponding PF has a device ID: 0x270B.
> 
> Ok, I interpret Alan's previous comment about the patch[1] to suggest a
> structure a bit more like that below.  IOW, we know that 8086:270c
> inspires this change, but once it's included we won't know who else
> relies on it.  We can perhaps encourage better hardware validation, or
> at least better tracking of who needs this with a warning and
> whitelist.  Testing, especially positive and negative testing against
> the warning, and reviews welcome.  Thanks,
> 
> Alex
> 
> [1]https://lkml.org/lkml/2018/8/10/462
> 
> commit d780da26a81c6f47522ae0aeff03abd4d08b89b5
> Author: Alex Williamson 
> Date:   Tue Sep 18 21:27:57 2018 -0600
> 
> vfio/pci: Mask buggy SR-IOV VF INTx support
> 
> The SR-IOV spec requires that VFs must report zero for the INTx pin
> register as VFs are precluded from INTx support.  It's much easier for
> the host kernel to understand whether a device is a VF and therefore
> whether a non-zero pin register value is bogus than it is to do the
> same in userspace.  Override the INTx count for such devices and
> virtualize the pin register to provide a consistent view of the device
> to the user.
> 
> As this is clearly a spec violation, warn about it to support hardware
> validation, but also provide a known whitelist as it doesn't do much
> good to continue complaining if the hardware vendor doesn't plan to
> fix it.
> 
> Known devices with this issue: 8086:270c
> 
> Signed-off-by: Alex Williamson 

This looks good and also addresses Alan's concern that don't silently 
hide under the rug for all devices. We'll also queue it for testing
just to confirm and keep you posted.

Reviewed-by: Ashok Raj 

> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index cddb453a1ba5..8af3f6f35f32 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -430,14 +430,41 @@ static int vfio_pci_open(void *device_data)
>   return ret;
>  }
>  
> +static const struct pci_device_id known_bogus_vf_intx_pin[] = {
> + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x270c) },
> + {}
> +};
> +
>  static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
>  {
>   if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
>   u8 pin;
> +
> + if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
> + return 0;
> +
>   pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
> - if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin)
> - return 1;
>  
> + /*
> +  * Per SR-IOV spec rev 1.1, 3.4.1.18 the interrupt pin register
> +  * does not apply to VFs and VFs must implement this register
> +  * as read-only with value zero.  Userspace is not readily
> +  * able to identify a device as a VF and thus that the pin
> +  * definition on the device is bogus should a device violate
> +  * this requirement.  For such devices, override the bogus
> +  * value and provide a warning to support hardware validation
> +  * (or be quite if it's known).  PCI config space emulation
> +  * will virtualize this register for all VFs.
> +  */
> + if (pin && vdev->pdev->is_virtfn) {
> + if (!pci_match_id(known_bogus_vf_intx_pin, vdev->pdev))
> + dev_warn_once(>pdev->dev,
> +   "VF reports bogus INTx pin %d\n",
> +   

Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

2018-09-13 Thread Raj, Ashok
On Thu, Sep 13, 2018 at 04:03:01PM +0100, Jean-Philippe Brucker wrote:
> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer granularity
> >>> second level translation with no impact on architectures which only
> >>> support Source ID or the similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> Nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU aware mediated device
> >> (assigning mdevs to userspace drivers, for example)
> > 
> > yes. aux domain concept applies only to vt-d rev3 which introduces
> > scalable mode. Care is taken to avoid breaking usages on existing
> > architectures.
> > 
> > one note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods to
> > isolate DMA. There is only one potential IOMMU aware mdev usage 
> > which we talked not rely on vt-d rev3 scalable mode - wrap a random 
> > PCI device into a single mdev instance (no sharing). In that case mdev 
> > inherits RID from parent PCI device, thus is isolated by IOMMU in RID 
> > granular. Our RFC supports this usage too. In VFIO two usages (PASID-
> > based and RID-based) use same code path, i.e. always binding domain to
> > the parent device of mdev. But within IOMMU they go different paths.
> > PASID-based will go to aux-domain as iommu_enable_aux_domain
> > has been called on that device. RID-based will follow existing 
> > unmanaged domain path, as if it is parent device assignment.
> 
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode, the following example can be implemented with an SMMUv3, since it
> only needs PASID-granular first-level translation:

You are right, you can simply use the first level as IOVA for every PASID.

Only issue becomes when you need to assign that to a guest, you would be 
required
to shadow the 1st level. If you have a 2nd level per-pasid first level can
be managed in guest and don't require to shadow them.

> 
> We have a PCI function that supports PASID, and can be partitioned into
> multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
> vector and a PASID.
> 
> Different processes (userspace drivers, not QEMU) each open one mdev. A
> process controlling one mdev has two ways of doing DMA:
> 
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables associated
> with PASID #35.
> 
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
> space to the device, gets PASID #35. Simpler, but not everyone wants to
> use SVA, especially not userspace drivers which need the highest
> performance.
> 
> 
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the guest.
> 
> So (2) would use iommu_sva_bind_device(), but (1) needs something else.
> Aren't auxiliary domains suitable for (1)? Why limit auxiliary domain to
> second-level or nested translation? It seems silly to use a different
> API for first-level, since the flow in userspace and VFIO is the same as
> your second-level case as far as MAP_DMA ioctl goes. The difference is
> that in your case the auxiliary domain supports an additional operation
> which binds first-level page tables. An auxiliary domain that only
> supports first-level wouldn't support this operation, but it can still
> implement iommu_map/unmap/etc.
> 
> 
> Another note: if for some reason you did want to allow userspace to
> choose between first-level or second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
> ioctl on a NESTING container would populate second-level, and DMA_MAP on
> a normal container populates first-level. But if you're always going to
> use second-level by default, the distinction isn't necessary.

Where is the nesting attribute specified? in vt-d2 it was part of context
entry, so also meant all PASID's are nested now. In vt-d3 its part of
PASID context.

It seems unsafe to share PASID's with different VM's since any request 
W/O PASID has only one mapping.

> 
> 
> >> Sounds good, I'll drop the private PASID patch if we can figure out a
> >> 

Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-09-12 Thread Raj, Ashok
On Thu, Aug 09, 2018 at 01:44:17PM -0600, Alex Williamson wrote:
> On Thu,  9 Aug 2018 12:37:06 -0700
> Ashok Raj  wrote:
> 
> > PCI_INTERRUPT_PIN should always read  0 for SRIOV Virtual Functions.
> > 
> > Some SRIOV devices have some bugs in RTL and VF's end up reading 1
> > instead of 0 for the PIN.
> 
> Hi Ashok,
> 
> One question, can we identify which VFs are known to have this issue so
> that users (and downstreams) can know how to prioritize this patch?

Hi Alex

Sorry it took some time to hunt this down. 

The offending VF has a device ID : 0x270C
The corresponding PF has a device ID: 0x270B.

> Thanks,
> 
> Alex
> 
> > Since this is a spec required value, rather than having a device specific
> > quirk, we could fix it permanently in vfio.
> > 
> > Reworked suggestions from Alex https://lkml.org/lkml/2018/7/16/1052
> > 
> > Reported-by: Gage Eads 
> > Tested-by: Gage Eads 
> > Signed-off-by: Ashok Raj 
> > Cc: k...@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: Joerg Roedel 
> > Cc: Bjorn Helgaas 
> > Cc: Gage Eads 
> > ---
> >  drivers/vfio/pci/vfio_pci.c| 12 +---
> >  drivers/vfio/pci/vfio_pci_config.c |  3 ++-
> >  2 files changed, 11 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index b423a30..32943dd 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -433,10 +433,16 @@ static int vfio_pci_get_irq_count(struct 
> > vfio_pci_device *vdev, int irq_type)
> >  {
> > if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
> > u8 pin;
> > -   pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
> > -   if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin)
> > -   return 1;
> > +   /*
> > +* INTx must be 0 for all VF's. Enforce that for all
> > +* VF's since this is a spec requirement.
> > +*/
> > +   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
> > +   vdev->pdev->is_virtfn)
> > +   return 0;
> >  
> > +   pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
> > +   return (pin ? 1 : 0);
> > } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
> > u8 pos;
> > u16 flags;
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> > b/drivers/vfio/pci/vfio_pci_config.c
> > index 115a36f..e36b7c3 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -1676,7 +1676,8 @@ int vfio_config_init(struct vfio_pci_device *vdev)
> > *(__le16 *)[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
> > }
> >  
> > -   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
> > +   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
> > +   pdev->is_virtfn)
> > vconfig[PCI_INTERRUPT_PIN] = 0;
> >  
> > ret = vfio_cap_init(vdev);
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


  1   2   >