Re: [PATCH v7 3/3] iommu/tegra-smmu: Add PCI support

2020-10-15 Thread Nicolin Chen
On Thu, Oct 15, 2020 at 10:55:52AM +0100, Robin Murphy wrote:
> On 2020-10-15 05:13, Nicolin Chen wrote:
> > On Wed, Oct 14, 2020 at 06:42:36PM +0100, Robin Murphy wrote:
> > > On 2020-10-09 17:19, Nicolin Chen wrote:
> > > > This patch simply adds support for PCI devices.
> > > > 
> > > > Reviewed-by: Dmitry Osipenko 
> > > > Tested-by: Dmitry Osipenko 
> > > > Signed-off-by: Nicolin Chen 
> > > > ---
> > > > 
> > > > Changelog
> > > > v6->v7
> > > >* Renamed goto labels, suggested by Thierry.
> > > > v5->v6
> > > >* Added Dmitry's Reviewed-by and Tested-by.
> > > > v4->v5
> > > >* Added Dmitry's Reviewed-by
> > > > v3->v4
> > > >* Dropped !iommu_present() check
> > > >* Added CONFIG_PCI check in the exit path
> > > > v2->v3
> > > >* Replaced ternary conditional operator with if-else in 
> > > > .device_group()
> > > >* Dropped change in tegra_smmu_remove()
> > > > v1->v2
> > > >* Added error-out labels in tegra_smmu_probe()
> > > >* Dropped pci_request_acs() since IOMMU core would call it.
> > > > 
> > > >drivers/iommu/tegra-smmu.c | 35 +--
> > > >1 file changed, 25 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
> > > > index be29f5977145..2941d6459076 100644
> > > > --- a/drivers/iommu/tegra-smmu.c
> > > > +++ b/drivers/iommu/tegra-smmu.c
> > > > @@ -10,6 +10,7 @@
> > > >#include 
> > > >#include 
> > > >#include 
> > > > +#include <linux/pci.h>
> > > >#include 
> > > >#include 
> > > >#include 
> > > > @@ -865,7 +866,11 @@ static struct iommu_group 
> > > > *tegra_smmu_device_group(struct device *dev)
> > > > group->smmu = smmu;
> > > > group->soc = soc;
> > > > -   group->group = iommu_group_alloc();
> > > > +   if (dev_is_pci(dev))
> > > > +   group->group = pci_device_group(dev);
> > > 
> > > Just to check, is it OK to have two or more swgroups "owning" the same
> > > iommu_group if an existing one gets returned here? It looks like that 
> > > might
> > > not play nice with the use of iommu_group_set_iommudata().
> > 
> > Do you mean by "gets returned here" the "IS_ERR" check below?
> 
> I mean that unlike iommu_group_alloc()/generic_device_group(),
> pci_device_group() may give you back a group that already contains another
> device and has already been set up from that device's perspective. This can
> happen for topological reasons like requester ID aliasing through a PCI-PCIe
> bridge or lack of isolation between functions.

Okay, but we don't really have two swgroups owning the same group
in the case of PCI devices. For Tegra210, all PCI devices inherit the
same swgroup from the PCI controller, and I'd think previous chips
do the same. The only current use case of 2+ swgroups owning the
same iommu_group is the display controller.

Or do you suggest we need an additional check for pci_device_group?
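
A minimal sketch of what such an additional check could look like, assuming
the group iommudata is only installed the first time the group is seen (the
helper name below is made up, not part of the driver):

static struct iommu_group *tegra_smmu_pci_group_get(struct tegra_smmu *smmu,
						    struct device *dev)
{
	struct iommu_group *grp = pci_device_group(dev);

	if (IS_ERR(grp))
		return grp;

	/*
	 * pci_device_group() may return a group that another swgroup has
	 * already set up; in that case keep its existing iommudata instead
	 * of overwriting it.
	 */
	if (iommu_group_get_iommudata(grp))
		return grp;

	/* otherwise allocate struct tegra_smmu_group and call
	 * iommu_group_set_iommudata() as in the non-PCI path */
	return grp;
}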


Re: [GIT PULL] dma-mapping updates for 5.10

2020-10-15 Thread pr-tracker-bot
The pull request you sent on Thu, 15 Oct 2020 19:47:43 +0200:

> git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-5.10

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5a32c3413d3340f90c82c84b375ad4b335a59f28

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH 07/13] x86: Secure Launch kernel early boot stub

2020-10-15 Thread Daniel Kiper
On Tue, Sep 29, 2020 at 10:03:47AM -0400, Ross Philipson wrote:
> On 9/25/20 3:18 PM, Arvind Sankar wrote:

[...]

> > You should see them if you do
> > readelf -r arch/x86/boot/compressed/vmlinux
> >
> > In terms of the code, things like:
> >
> > addl%ebx, (sl_gdt_desc + 2)(%ebx)
> >
> > will create a relocation, because the linker interprets this as wanting
> > the runtime address of sl_gdt_desc, rather than just the offset from
> > startup_32.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/arch/x86/boot/compressed/head_64.S#n48
> >
> > has a comment with some explanation and a macro that the 32-bit code in
> > startup_32 uses to avoid creating relocations.
> >
> > Since the SL code is in a different assembler file (and a different
> > section), you can't directly use the same macro. I would suggest getting
> > rid of sl_stub_entry and entering directly at sl_stub, and then the code
> > in sl_stub.S can use sl_stub for the base address, defining the rva()
> > macro there as
> >
> > #define rva(X) ((X) - sl_stub)
> >
> > You will also need to avoid initializing data with symbol addresses.
> >
> > .long mle_header
> > .long sl_stub_entry
> > .long sl_gdt
> >
> > will create relocations. The third one is easy, just replace it with
> > sl_gdt - sl_gdt_desc and initialize it at runtime with
> >
> > lealrva(sl_gdt_desc)(%ebx), %eax
> > addl%eax, 2(%eax)
> > lgdt(%eax)
> >
> > The other two are more messy, unfortunately there is no easy way to tell
> > the linker what we want here. The other entry point addresses (for the
> > EFI stub) are populated in a post-processing step after the compressed
> > kernel has been linked, we could teach it to also update kernel_info.
> >
> > Without that, for kernel_info, you could change it to store the offset
> > of the MLE header from kernel_info, instead of from the start of the
> > image.
> >
> > For the MLE header, it could be moved to .head.text in head_64.S, and
> > initialized with
> > .long rva(sl_stub)
> > This will also let it be placed at a fixed offset from startup_32, so
> > that kernel_info can just be populated with a constant.

I am discussing with Ross the other option. We can create
.rodata.mle_header section and put it at fixed offset as
kernel_info is. So, we would have, e.g.:

arch/x86/boot/compressed/vmlinux.lds.S:
.rodata.kernel_info KERNEL_INFO_OFFSET : {
*(.rodata.kernel_info)
}
ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, "kernel_info at bad 
address!")

.rodata.mle_header MLE_HEADER_OFFSET : {
*(.rodata.mle_header)
}
ASSERT(ABSOLUTE(mle_header) == MLE_HEADER_OFFSET, "mle_header at bad 
address!")

arch/x86/boot/compressed/sl_stub.S:
#define mleh_rva(X) (((X) - mle_header) + MLE_HEADER_OFFSET)

.section ".rodata.mle_header", "a"

SYM_DATA_START(mle_header)
.long   0x9082ac5a/* UUID0 */
.long   0x74a7476f/* UUID1 */
.long   0xa2555c0f/* UUID2 */
.long   0x42b651cb/* UUID3 */
.long   0x0034/* MLE header size */
.long   0x00020002/* MLE version 2.2 */
.long   mleh_rva(sl_stub_entry)/* Linear entry point of MLE (virt. 
address) */
.long   0x/* First valid page of MLE */
.long   0x/* Offset within binary of first byte of MLE */
.long   0x/* Offset within binary of last byte + 1 of MLE */
.long   0x0223/* Bit vector of MLE-supported capabilities */
.long   0x/* Starting linear address of command line 
(unused) */
.long   0x/* Ending linear address of command line (unused) 
*/
SYM_DATA_END(mle_header)

Of course MLE_HEADER_OFFSET has to be defined as a constant somewhere.
Anyway, is it acceptable?

There is also another problem. We have to put into mle_header size of
the Linux kernel image. Currently it is done by the bootloader but
I think it is not a role of the bootloader. The kernel image should
provide all data describing its properties and do not rely on the
bootloader to do that. Ross and I investigated various options but we
did not find a good/simple way to do that. Could you suggest how we
should do that or at least where we should take a look to get some
ideas?

Daniel


Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-15 Thread Raj, Ashok
Hi Jean

+ Baolu who is looking into this.


On Thu, Oct 15, 2020 at 11:00:27AM +0200, Jean-Philippe Brucker wrote:
> Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
> whether the PRI queue needs flushing. When looking at the PCIe spec
> again I noticed that most of the time the SMMUv3 driver doesn't actually
> need to flush the PRI queue. Does this make sense for Intel VT-d as well
> or did I overlook something?
> 
> Before calling iommu_sva_unbind_device(), device drivers must stop the
> device from using the PASID. For PCIe devices, that consists of
> completing any pending DMA, and completing any pending page request
> unless the device uses Stop Markers. So unless the device uses Stop
> Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
> means completing all stall events, so we never need to flush the event
> queue.

I don't think this is true. Baolu is working on an enhancement to this;
I'll quickly summarize it below.

Stop markers are weird. I'm not certain there is any device today that
sends STOP markers, and even if there were, markers don't have a required
response; they are fire and forget from the device's point of view.

I'm not sure how other IOMMUs behave, but when there is no space in
the PRQ, the IOMMU auto-responds to the device. This puts the device in a
while (1) loop: the fake successful response lets the device do an ATS
lookup, which fails, forcing the device to issue another page request. The
idea is that at some point the OS will have serviced the earlier requests
and this one will find room in the PRQ. The point is that this is less
reliable and can't be the only indication. PRQ draining has a specific
sequence.

The detailed steps are outlined in the VT-d specification,
Chapter 7.10, "Software Steps to Drain Page Requests & Responses":

- Submit an invalidation wait with the fence flag to ensure all prior
  invalidations are processed.
- Submit an IOTLB invalidation followed by a devTLB invalidation.
- Submit an invalidation wait with page-drain to make sure any page requests
  issued by the device are flushed when this invalidation wait completes.
- If during the above process there was a queue overflow, SW can assume no
  outstanding page requests remain. If we hit a queue-full condition,
  then SW must repeat steps 2 and 3 above.


To that extent the proposal is as follows (names are suggestive :-) I'm
making this up as I go!):

- iommu_stop_page_req() - the kernel needs to make sure we respond to all
  outstanding requests (since we can't drop responses)
  - Ensure we respond immediately to anything that comes in before the
drain sequence completes
- iommu_drain_page_req() - performs the above invalidation sequence with
  Page_drain set.

Once the driver has performed a reset and before issuing any new request,
it calls iommu_resume_page_req(); this ensures we start processing
incoming page requests again from that point (see the sketch below).
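
A rough sketch of how the three (purely hypothetical, as noted above) calls
would line up around a PASID teardown:

/* None of these functions exist today; the names are only suggestive. */
void iommu_stop_page_req(struct device *dev, u32 pasid);
void iommu_drain_page_req(struct device *dev, u32 pasid);
void iommu_resume_page_req(struct device *dev, u32 pasid);

static void teardown_pasid(struct device *dev, u32 pasid)
{
	/* 1. auto-respond to anything that arrives from here on */
	iommu_stop_page_req(dev, pasid);

	/* 2. fence prior invalidations, submit IOTLB + devTLB invalidation,
	 *    then an invalidation wait with page-drain; repeat on overflow */
	iommu_drain_page_req(dev, pasid);

	/* device driver resets the device / frees the PASID here */

	/* 3. resume normal page-request processing */
	iommu_resume_page_req(dev, pasid);
}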


> 
> First patch adds flags to unbind(), and the second one lets device
> drivers tell whether the PRI queue needs to be flushed.
> 
> Other remarks:
> 
> * The PCIe spec (see quote on patch 2), says that the device signals
>   whether it has sent a Stop Marker or not during Stop PASID. In reality
>   it's unlikely that a given device will sometimes use one stop method
>   and sometimes the other, so it could be a device-wide flag rather than
>   passing it at each unbind(). I don't want to speculate too much about
>   future implementation so I prefer having the flag in unbind().
> 
> * In patch 1, uacce passes 0 to unbind(). To pass the right flag I'm
>   thinking that uacce->ops->stop_queue(), which tells the device driver
>   to stop DMA, should return whether faults are pending. This can be
>   added later once uacce has an actual PCIe user, but we need to
>   remember to do it.

I think the Intel IOMMU driver does this today for SVA when the PASID is
being freed. It's still important to go through the drain before that PASID
can be re-purposed.

Cheers,
Ashok


Re: [PATCH v3 7/8] arm64: mm: Set ZONE_DMA size based on early IORT scan

2020-10-15 Thread Catalin Marinas
On Thu, Oct 15, 2020 at 10:26:18PM +0800, Hanjun Guo wrote:
> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
> > From: Ard Biesheuvel 
> > 
> > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> > incorporating masters that can address less than 32 bits of DMA, in
> > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> > peripherals that can only address up to 1 GB (and its PCIe host
> > bridge can only access the bottom 3 GB)
> > 
> > Instructing the DMA layer about these limitations is straight-forward,
> > even though we had to fix some issues regarding memory limits set in
> > the IORT for named components, and regarding the handling of ACPI _DMA
> > methods. However, the DMA layer also needs to be able to allocate
> > memory that is guaranteed to meet those DMA constraints, for bounce
> > buffering as well as allocating the backing for consistent mappings.
> > 
> > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> > it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> > problems with kdump, and potentially in other places where allocations
> > cannot cross zone boundaries. Therefore, we should avoid having two
> > separate DMA zones when possible.
> > 
> > So let's do an early scan of the IORT, and only create the ZONE_DMA
> > if we encounter any devices that need it. This puts the burden on
> > the firmware to describe such limitations in the IORT, which may be
> > redundant (and less precise) if _DMA methods are also being provided.
> > However, it should be noted that this situation is highly unusual for
> > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> > the _DMA method if implemented, and so we will not lose the ability to
> > perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> > it.
> 
> Sorry, I'm still a little bit confused. With this patch, if we have
> a device which sets the right _DMA method (DMA size >= 32) but the
> wrong DMA size in the IORT, do we still have the ZONE_DMA created even
> though it is actually not needed?

With the current kernel, we already get a ZONE_DMA with an arbitrary
size of 1GB that matches what RPi4 needs. We are trying to eliminate
such an unnecessary ZONE_DMA based on some heuristics (well, something that
looks "better" than an OEM ID based quirk). Now, if we learn that the IORT
for platforms in the field is so broken as to describe DMA masks only a few
bits wide, we may have to go back to the OEM ID quirk.

-- 
Catalin


[GIT PULL] dma-mapping updates for 5.10

2020-10-15 Thread Christoph Hellwig
Hi Linus,

please pull the dma-mapping update for 5.10.  It contains a (clean) merge
of a later 5.9-rc commit, as I had to pull in a media subsystem revert
that is required for later commits.

There are a few trivial contextual conflicts due to the dma-mapping.h
split. For those, simply keep both sides, that is, the new headers added
by other trees as well as the ones added here, or the entire file
deletions in the case of the arm tree.


The following changes since commit 171d4ff79f965c1f164705ef0aaea102a6ad238b:

  Merge tag 'mmc-v5.9-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc (2020-09-24 09:09:47 -0700)

are available in the Git repository at:

  git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-5.10

for you to fetch changes up to 2a410d09417b5344ab1f3cf001ac73a1daf8dcce:

  ARM/ixp4xx: add a missing include of dma-map-ops.h (2020-10-13 13:28:22 +0200)


dma-mapping updates for 5.10

 - rework the non-coherent DMA allocator
 - move private definitions out of 
 - lower CMA_ALIGNMENT (Paul Cercueil)
 - remove the omap1 dma address translation in favor of the common
   code
 - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)
 - support per-node DMA CMA areas (Barry Song)
 - increase the default seg boundary limit (Nicolin Chen)
 - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)
 - various cleanups


Barry Song (3):
  dma-contiguous: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA to localize coherent dma buffers
  mm: cma: use CMA_MAX_NAME to define the length of cma name array

Christoph Hellwig (54):
  dma-mapping: remove the dma_dummy_ops export
  dma-debug: remove most exports
  MIPS: make dma_sync_*_for_cpu a little less overzealous
  MIPS/jazzdma: remove the unused vdma_remap function
  MIPS/jazzdma: decouple from dma-direct
  dma-mapping: fix DMA_OPS dependencies
  dma-mapping: add (back) arch_dma_mark_clean for ia64
  dma-direct: remove dma_direct_{alloc,free}_pages
  dma-direct: lift gfp_t manipulation out of __dma_direct_alloc_pages
  dma-direct: use phys_to_dma_direct in dma_direct_alloc
  dma-direct: remove __dma_to_phys
  dma-direct: rename and cleanup __phys_to_dma
  dma-mapping: move dma_common_{mmap,get_sgtable} out of mapping.c
  dma-mapping: move the dma_declare_coherent_memory documentation
  ARM/dma-mapping: remove a __arch_page_to_dma #error
  ARM/dma-mapping: remove dma_to_virt
  ARM/dma-mapping: move various helpers from dma-mapping.h to dma-direct.h
  ARM/keystone: move the DMA offset handling under ifdef CONFIG_ARM_LPAE
  usb: don't inherity DMA properties for USB devices
  dma-mapping: remove DMA_MASK_NONE
  dma-mapping: move valid_dma_direction to dma-direction.h
  dma-mapping: better document dma_addr_t and DMA_MAPPING_ERROR
  ARM/omap1: switch to use dma_direct_set_offset for lbus DMA offsets
  Merge branch 'master' of https://git.kernel.org/.../torvalds/linux into 
dma-mapping-for-next
  mm: turn alloc_pages into an inline function
  drm/exynos: stop setting DMA_ATTR_NON_CONSISTENT
  drm/nouveau/gk20a: stop setting DMA_ATTR_NON_CONSISTENT
  net/au1000-eth: stop using DMA_ATTR_NON_CONSISTENT
  lib82596: move DMA allocation into the callers of i82596_probe
  53c700: improve non-coherent DMA handling
  dma-mapping: add a new dma_alloc_noncoherent API
  sgiwd93: convert to dma_alloc_noncoherent
  hal2: convert to dma_alloc_noncoherent
  lib82596: convert to dma_alloc_noncoherent
  sgiseeq: convert to dma_alloc_noncoherent
  53c700: convert to dma_alloc_noncoherent
  dma-mapping: remove dma_cache_sync
  dma-mapping: add a new dma_alloc_pages API
  dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
  dma-iommu: implement ->alloc_noncoherent
  firewire-ohci: use dma_alloc_pages
  dma-mapping: split 
  dma-contiguous: remove dma_declare_contiguous
  dma-contiguous: remove dev_set_cma_area
  dma-contiguous: remove dma_contiguous_set_default
  dma-mapping: merge  into 
  dma-mapping: remove 
  dma-mapping: move dma-debug.h to kernel/dma/
  dma-mapping: move large parts of  to kernel/dma
  dma-mapping: merge  into 
  dma-direct: check for highmem pages in dma_direct_alloc_pages
  dma-direct: factor out a dma_direct_alloc_from_pool helper
  dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
  ARM/ixp4xx: add a missing include of dma-map-ops.h

Jim Quinlan (1):
  dma-mapping: introduce DMA range map, supplanting dma_pfn_offset

Nicolin Chen (2):
  dma-mapping: introduce dma_get_seg_boundary_nr_pages()
  dma-mapping: set default segment_boundary_mask to ULONG_MAX

Paul Cercueil (1):
  cma: decrease CMA_ALIGNMENT lower 

Re: [PATCH 2/3] drm/msm: add DRM_MSM_GEM_SYNC_CACHE for non-coherent cache maintenance

2020-10-15 Thread Christoph Hellwig
On Thu, Oct 15, 2020 at 05:33:34PM +0200, Daniel Vetter wrote:
> On Thu, Oct 15, 2020 at 07:55:32AM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 13, 2020 at 02:42:38PM +0100, Robin Murphy wrote:
> > > I still think this situation would be best handled with a variant of
> > > dma_ops_bypass that also guarantees to bypass SWIOTLB, and can be set
> > > automatically when attaching to an unmanaged IOMMU domain.
> > 
> > dma_ops_bypass should mostly do the right thing as-is.  swiotlb bouncing
> > is triggered by two things:
> > 
> >  1) the dma_mask.  This is under control of the driver, and obviously
> > if it is too small for a legit reason we can't just proceed
> 
> Somewhat related, but is there a way to tell the dma-api to fail instead
> of falling back to swiotlb? In many cases for GPU drivers it's much better
> if we fall back to dma_alloc_coherent and manage the copying ourselves
> instead of abstracting this away in the dma-api. Currently that's "solved"
> rather pessimistically by always allocating from dma_alloc_coherent if
> swiotlb could be in the picture (at least for ttm based drivers, i915 just
> falls over).

Is this for the alloc_pages plus manual mapping logic in various drivers?

They should switch to the new dma_alloc_pages API that I'll send to
Linus for 5.10 soon.
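
For reference, a rough sketch of what that switch could look like, going by
the dma_alloc_pages()/dma_free_pages() signatures in the 5.10 series (error
handling and DMA direction are simplified here):

#include <linux/dma-mapping.h>

static struct page *buf_alloc(struct device *dev, size_t size,
			      dma_addr_t *dma_handle)
{
	struct page *page;

	/* replaces the open-coded alloc_pages() + dma_map_page() pattern */
	page = dma_alloc_pages(dev, size, dma_handle, DMA_BIDIRECTIONAL,
			       GFP_KERNEL);
	if (!page)
		return NULL;

	/* hand ownership to the CPU before touching the buffer */
	dma_sync_single_for_cpu(dev, *dma_handle, size, DMA_BIDIRECTIONAL);
	return page;
}

static void buf_free(struct device *dev, size_t size, struct page *page,
		     dma_addr_t dma_handle)
{
	dma_free_pages(dev, size, page, dma_handle, DMA_BIDIRECTIONAL);
}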


Re: [PATCH 2/3] drm/msm: add DRM_MSM_GEM_SYNC_CACHE for non-coherent cache maintenance

2020-10-15 Thread Daniel Vetter
On Thu, Oct 15, 2020 at 07:55:32AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 13, 2020 at 02:42:38PM +0100, Robin Murphy wrote:
> > I still think this situation would be best handled with a variant of
> > dma_ops_bypass that also guarantees to bypass SWIOTLB, and can be set
> > automatically when attaching to an unmanaged IOMMU domain.
> 
> dma_ops_bypass should mostly do the right thing as-is.  swiotlb bouncing
> is triggered by two things:
> 
>  1) the dma_mask.  This is under control of the driver, and obviously
> if it is too small for a legit reason we can't just proceed

Somewhat related, but is there a way to tell the dma-api to fail instead
of falling back to swiotlb? In many cases for GPU drivers it's much better
if we fall back to dma_alloc_coherent and manage the copying ourselves
instead of abstracting this away in the dma-api. Currently that's "solved"
rather pessimistically by always allocating from dma_alloc_coherent if
swiotlb could be in the picture (at least for ttm based drivers, i915 just
falls over).
-Daniel

>  2) force_dma_unencrypted() - we'd need to do an opt-out here, either
> by a flag or by being smart and looking for an attached iommu on
> the device
> 
> > That way the
> > device driver can make DMA API calls in the appropriate places that do the
> > right thing either way, and only needs logic to decide whether to use the
> > returned DMA addresses directly or ignore them if it knows they're
> > overridden by its own IOMMU mapping.
> 
> I'd be happy to review patches for this.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v3 7/8] arm64: mm: Set ZONE_DMA size based on early IORT scan

2020-10-15 Thread Nicolas Saenz Julienne
On Thu, 2020-10-15 at 22:26 +0800, Hanjun Guo wrote:
> On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:
> > From: Ard Biesheuvel 
> > 
> > We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
> > incorporating masters that can address less than 32 bits of DMA, in
> > particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
> > peripherals that can only address up to 1 GB (and its PCIe host
> > bridge can only access the bottom 3 GB)
> > 
> > Instructing the DMA layer about these limitations is straight-forward,
> > even though we had to fix some issues regarding memory limits set in
> > the IORT for named components, and regarding the handling of ACPI _DMA
> > methods. However, the DMA layer also needs to be able to allocate
> > memory that is guaranteed to meet those DMA constraints, for bounce
> > buffering as well as allocating the backing for consistent mappings.
> > 
> > This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
> > it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
> > problems with kdump, and potentially in other places where allocations
> > cannot cross zone boundaries. Therefore, we should avoid having two
> > separate DMA zones when possible.
> > 
> > So let's do an early scan of the IORT, and only create the ZONE_DMA
> > if we encounter any devices that need it. This puts the burden on
> > the firmware to describe such limitations in the IORT, which may be
> > redundant (and less precise) if _DMA methods are also being provided.
> > However, it should be noted that this situation is highly unusual for
> > arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
> > the _DMA method if implemented, and so we will not lose the ability to
> > perform streaming DMA outside the ZONE_DMA if the _DMA method permits
> > it.
> 
> Sorry, I'm still a little bit confused. With this patch, if we have
> a device which sets the right _DMA method (DMA size >= 32) but the
> wrong DMA size in the IORT, do we still have the ZONE_DMA created even
> though it is actually not needed?

Yes, that would be the case.

Regards,
Nicolas




Re: [PATCH v3 7/8] arm64: mm: Set ZONE_DMA size based on early IORT scan

2020-10-15 Thread Hanjun Guo

On 2020/10/15 3:12, Nicolas Saenz Julienne wrote:

From: Ard Biesheuvel 

We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB)

Instructing the DMA layer about these limitations is straight-forward,
even though we had to fix some issues regarding memory limits set in
the IORT for named components, and regarding the handling of ACPI _DMA
methods. However, the DMA layer also needs to be able to allocate
memory that is guaranteed to meet those DMA constraints, for bounce
buffering as well as allocating the backing for consistent mappings.

This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
problems with kdump, and potentially in other places where allocations
cannot cross zone boundaries. Therefore, we should avoid having two
separate DMA zones when possible.

So let's do an early scan of the IORT, and only create the ZONE_DMA
if we encounter any devices that need it. This puts the burden on
the firmware to describe such limitations in the IORT, which may be
redundant (and less precise) if _DMA methods are also being provided.
However, it should be noted that this situation is highly unusual for
arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
the _DMA method if implemented, and so we will not lose the ability to
perform streaming DMA outside the ZONE_DMA if the _DMA method permits
it.


Sorry, I'm still a little bit confused. With this patch, if we have
a device which sets the right _DMA method (DMA size >= 32) but the
wrong DMA size in the IORT, do we still have the ZONE_DMA created even
though it is actually not needed?



Cc: Jeremy Linton 
Cc: Lorenzo Pieralisi 
Cc: Nicolas Saenz Julienne 
Cc: Rob Herring 
Cc: Christoph Hellwig 
Cc: Robin Murphy 
Cc: Hanjun Guo 
Cc: Sudeep Holla 
Cc: Anshuman Khandual 
Signed-off-by: Ard Biesheuvel 
[nsaenz: Rebased, removed documentation change, warnings and add
declaration in acpi_iort.h]
Signed-off-by: Nicolas Saenz Julienne 
---
  arch/arm64/mm/init.c  |  6 +
  drivers/acpi/arm64/iort.c | 51 +++
  include/linux/acpi_iort.h |  4 +++
  3 files changed, 61 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 97b0d2768349..f321761eedb2 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -29,6 +29,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -196,6 +197,11 @@ static void __init zone_sizes_init(unsigned long min, 
unsigned long max)
  #ifdef CONFIG_ZONE_DMA
zone_dma_bits = min(zone_dma_bits,
(unsigned 
int)ilog2(of_dma_get_max_cpu_address(NULL)));
+
+   if (IS_ENABLED(CONFIG_ACPI))
+   zone_dma_bits = min(zone_dma_bits,
+   acpi_iort_get_zone_dma_size());
+
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
  #endif
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 9929ff50c0c0..8f530bf3c03b 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1718,3 +1718,54 @@ void __init acpi_iort_init(void)
  
  	iort_init_platform_devices();

  }
+
+#ifdef CONFIG_ZONE_DMA
+/*
+ * Check the IORT whether any devices exist whose DMA mask is < 32 bits.
+ * If so, return the smallest value encountered, or 32 otherwise.
+ */
+unsigned int __init acpi_iort_get_zone_dma_size(void)
+{
+   struct acpi_table_iort *iort;
+   struct acpi_iort_node *node, *end;
+   acpi_status status;
+   u8 limit = 32;
+   int i;
+
+   if (acpi_disabled)
+   return limit;
+
+   status = acpi_get_table(ACPI_SIG_IORT, 0,
+   (struct acpi_table_header **)&iort);
+   if (ACPI_FAILURE(status))
+   return limit;
+
+   node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+   end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+   for (i = 0; i < iort->node_count; i++) {
+   if (node >= end)
+   break;
+
+   switch (node->type) {
+   struct acpi_iort_named_component *ncomp;
+   struct acpi_iort_root_complex *rc;
+
+   case ACPI_IORT_NODE_NAMED_COMPONENT:
+   ncomp = (struct acpi_iort_named_component 
*)node->node_data;
+   if (ncomp->memory_address_limit)
+   limit = min(limit, ncomp->memory_address_limit);
+   break;
+
+   case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
+   rc = (struct acpi_iort_root_complex 

Re: [git pull] IOMMU Updates for Linux v5.10

2020-10-15 Thread Joerg Roedel
On Wed, Oct 14, 2020 at 12:15:55PM -0700, Linus Torvalds wrote:
> As a result, "git show" doesn't ever really understand the notion of a
> "merge conflict", and all it shows is really "whee, this end result
> looks like neither side" as a kind of pseudo-conflict diff.

Ah okay, that makes sense. Thanks for the explanation.

> Anyway, thanks for the describing the conflict, it was indeed not
> complicated, this email is just to explain your "but somehow git
> [show|log] didn't show it to me" thing.

Yeah, next time I'll try to get the diff before committing the merge; that
will probably work better. A diff is better than describing conflict
resolutions verbally.

Regards,

Joerg


Re: [PATCH v3 7/8] arm64: mm: Set ZONE_DMA size based on early IORT scan

2020-10-15 Thread Lorenzo Pieralisi
On Wed, Oct 14, 2020 at 09:12:09PM +0200, Nicolas Saenz Julienne wrote:

[...]

> +unsigned int __init acpi_iort_get_zone_dma_size(void)
> +{
> + struct acpi_table_iort *iort;
> + struct acpi_iort_node *node, *end;
> + acpi_status status;
> + u8 limit = 32;
> + int i;
> +
> + if (acpi_disabled)
> + return limit;
> +
> + status = acpi_get_table(ACPI_SIG_IORT, 0,
> + (struct acpi_table_header **)&iort);
> + if (ACPI_FAILURE(status))
> + return limit;
> +
> + node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
> + end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
> +
> + for (i = 0; i < iort->node_count; i++) {
> + if (node >= end)
> + break;
> +
> + switch (node->type) {
> + struct acpi_iort_named_component *ncomp;
> + struct acpi_iort_root_complex *rc;
> +
> + case ACPI_IORT_NODE_NAMED_COMPONENT:
> + ncomp = (struct acpi_iort_named_component 
> *)node->node_data;
> + if (ncomp->memory_address_limit)
> + limit = min(limit, ncomp->memory_address_limit);
> + break;
> +
> + case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
> + rc = (struct acpi_iort_root_complex *)node->node_data;
> + if (rc->memory_address_limit)

You need to add a node revision check here, see rc_dma_get_range() in
drivers/acpi/arm64/iort.c, otherwise we may be reading junk data
in older IORT tables - acpica structures are always referring to the
latest specs.

Thanks,
Lorenzo

> + limit = min(limit, rc->memory_address_limit);
> + break;
> + }
> + node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
> + }
> + acpi_put_table(&iort->header);
> + return limit;
> +}
> +#endif
> diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
> index 20a32120bb88..7d2e184f0d4d 100644
> --- a/include/linux/acpi_iort.h
> +++ b/include/linux/acpi_iort.h
> @@ -38,6 +38,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 
> *size);
>  const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
>   const u32 *id_in);
>  int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
> *head);
> +unsigned int acpi_iort_get_zone_dma_size(void);
>  #else
>  static inline void acpi_iort_init(void) { }
>  static inline u32 iort_msi_map_id(struct device *dev, u32 id)
> @@ -55,6 +56,9 @@ static inline const struct iommu_ops 
> *iort_iommu_configure_id(
>  static inline
>  int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
> *head)
>  { return 0; }
> +
> +static inline unsigned int acpi_iort_get_zone_dma_size(void)
> +{ return 32; }
>  #endif
>  
>  #endif /* __ACPI_IORT_H__ */
> -- 
> 2.28.0
> 
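
A sketch of the kind of guard Lorenzo refers to, modeled loosely on
rc_dma_get_range(); the exact revision threshold below is an assumption and
should be taken from that function / the IORT spec rather than from this
snippet:

	case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
		rc = (struct acpi_iort_root_complex *)node->node_data;
		/*
		 * memory_address_limit only exists in newer IORT node
		 * revisions; older tables carry undefined data there, so
		 * skip the field for old nodes (revision value here is an
		 * assumption, see rc_dma_get_range()).
		 */
		if (node->revision < 1)
			break;
		if (rc->memory_address_limit)
			limit = min(limit, rc->memory_address_limit);
		break;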


RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-15 Thread Liu, Yi L
> From: Jason Wang 
> Sent: Thursday, October 15, 2020 4:41 PM
> 
> 
> On 2020/10/15 3:58, Tian, Kevin wrote:
> >> From: Jason Wang 
> >> Sent: Thursday, October 15, 2020 2:52 PM
> >>
> >>
> >> On 2020/10/14 11:08, Tian, Kevin wrote:
>  From: Jason Wang 
>  Sent: Tuesday, October 13, 2020 2:22 PM
> 
> 
>  On 2020/10/12 4:38, Tian, Kevin wrote:
> >> From: Jason Wang 
> >> Sent: Monday, September 14, 2020 12:20 PM
> >>
> > [...]
> >> If it's possible, I would suggest a generic uAPI instead of a VFIO
> >> specific one.
> >>
> >> Jason suggest something like /dev/sva. There will be a lot of
> >> other subsystems that could benefit from this (e.g vDPA).
> >>
> >> Have you ever considered this approach?
> >>
> > Hi, Jason,
> >
> > We did some study on this approach and below is the output. It's a
> > long writing but I didn't find a way to further abstract w/o
> > losing necessary context. Sorry about that.
> >
> > Overall the real purpose of this series is to enable IOMMU nested
> > translation capability with vSVA as one major usage, through below
> > new uAPIs:
> > 1) Report/enable IOMMU nested translation capability;
> > 2) Allocate/free PASID;
> > 3) Bind/unbind guest page table;
> > 4) Invalidate IOMMU cache;
> > 5) Handle IOMMU page request/response (not in this series);
> > 1/3/4) is the minimal set for using IOMMU nested translation, with
> > the other two optional. For example, the guest may enable vSVA on
> > a device without using PASID. Or, it may bind its gIOVA page table
> > which doesn't require page fault support. Finally, all operations
> > can be applied to either physical device or subdevice.
> >
> > Then we evaluated each uAPI whether generalizing it is a good
> > thing both in concept and regarding to complexity.
> >
> > First, unlike other uAPIs which are all backed by iommu_ops, PASID
> > allocation/free is through the IOASID sub-system.
>  A question here, is IOASID expected to be the single management
>  interface for PASID?
> >>> yes
> >>>
>  (I'm asking since there're already vendor specific IDA based PASID
>  allocator e.g amdgpu_pasid_alloc())
> >>> That comes before IOASID core was introduced. I think it should be
> >>> changed to use the new generic interface. Jacob/Jean can better
> >>> comment if other reason exists for this exception.
> >>
> >> If there's no exception it should be fixed.
> >>
> >>
> > From this angle
> > we feel generalizing PASID management does make some sense.
> > First, PASID is just a number and not related to any device before
> > it's bound to a page table and IOMMU domain. Second, PASID is a
> > global resource (at least on Intel VT-d),
>  I think we need a definition of "global" here. It looks to me for
>  vt-d the PASID table is per device.
> >>> PASID table is per device, thus VT-d could support per-device PASIDs
> >>> in concept.
> >>
> >> I think that's the requirement of PCIE spec which said PASID + RID
> >> identifies the process address space ID.
> >>
> >>
> >>>However on Intel platform we require PASIDs to be managed in
> >>> system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
> >>> and ENQCMD together.
> >>
> >> Any reason for such requirement? (I'm not familiar with ENQCMD, but
> >> my understanding is that vSVA, SIOV or SR-IOV doesn't have the
> >> requirement for system-wide PASID).
> > ENQCMD is a new instruction to allow multiple processes submitting
> > workload to one shared workqueue. Each process has an unique PASID
> > saved in a MSR, which is included in the ENQCMD payload to indicate
> > the address space when the CPU sends to the device. As one process
> > might issue ENQCMD to multiple devices, OS-wide PASID allocation is
> > required both in host and guest side.
> >
> > When executing ENQCMD in the guest to a SIOV device, the guest
> > programmed value in the PASID_MSR must be translated to a host PASID
> > value for proper function/isolation as PASID represents the address
> > space. The translation is done through a new VMCS PASID translation
> > structure (per-VM, and 1:1 mapping). From this angle the host PASIDs
> > must be allocated 'globally' cross all assigned devices otherwise it
> > may lead to 1:N mapping when a guest process issues ENQCMD to multiple
> > assigned devices/subdevices.
> >
> > There will be a KVM forum session for this topic btw.
> 
> 
> Thanks for the background. Now I see the restriction comes from ENQCMD.
> 
> 
> >
> >>
> >>> Thus the host creates only one 'global' PASID namespace but do use
> >>> per-device PASID table to assure isolation between devices on Intel
> >>> platforms. But ARM does it differently as Jean explained.
> >>> They have a global namespace for host processes on all host-owned
> >>> devices 

Re: [PATCH v3 6/8] arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges

2020-10-15 Thread Nicolas Saenz Julienne
On Thu, 2020-10-15 at 07:39 +0200, Christoph Hellwig wrote:
> On Wed, Oct 14, 2020 at 09:12:08PM +0200, Nicolas Saenz Julienne wrote:
> > +   zone_dma_bits = min(zone_dma_bits,
> > +   (unsigned 
> > int)ilog2(of_dma_get_max_cpu_address(NULL)));
> 
> Plase avoid pointlessly long lines.  Especially if it is completely trivial
> by using either min_t or not overindenting like here.

Noted




Re: [PATCH v3 5/8] dma-direct: Turn zone_dma_bits default value into a define

2020-10-15 Thread Nicolas Saenz Julienne
On Thu, 2020-10-15 at 07:38 +0200, Christoph Hellwig wrote:
> On Wed, Oct 14, 2020 at 09:12:07PM +0200, Nicolas Saenz Julienne wrote:
> > Set zone_dma_bits default value through a define so as for architectures
> > to be able to override it with their default value.
> 
> Architectures can do that already by assigning a value to zone_dma_bits
> at runtime.  I really do not want to add the extra clutter.

I'll remove it then.




Re: [PATCH v3 3/8] of/address: Introduce of_dma_get_max_cpu_address()

2020-10-15 Thread Nicolas Saenz Julienne
On Thu, 2020-10-15 at 07:42 +0200, Christoph Hellwig wrote:
> > +phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
> > +{
> > +   phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
> > +   struct of_range_parser parser;
> > +   phys_addr_t subtree_max_addr;
> > +   struct device_node *child;
> > +   phys_addr_t cpu_end = 0;
> > +   struct of_range range;
> > +   const __be32 *ranges;
> > +   int len;
> > +
> > +   if (!np)
> > +   np = of_root;
> 
> Requiring of_root to be passed explicitly would seem more natural
> to me than the magic NULL argument.  There doesn't seem to be any
> precedent for that kind of calling convention either.

I inspired that behavior from __of_find_all_nodes(). I'll change it.




Re: [PATCH v7 3/3] iommu/tegra-smmu: Add PCI support

2020-10-15 Thread Robin Murphy

On 2020-10-15 05:13, Nicolin Chen wrote:

On Wed, Oct 14, 2020 at 06:42:36PM +0100, Robin Murphy wrote:

On 2020-10-09 17:19, Nicolin Chen wrote:

This patch simply adds support for PCI devices.

Reviewed-by: Dmitry Osipenko 
Tested-by: Dmitry Osipenko 
Signed-off-by: Nicolin Chen 
---

Changelog
v6->v7
   * Renamed goto labels, suggested by Thierry.
v5->v6
   * Added Dmitry's Reviewed-by and Tested-by.
v4->v5
   * Added Dmitry's Reviewed-by
v3->v4
   * Dropped !iommu_present() check
   * Added CONFIG_PCI check in the exit path
v2->v3
   * Replaced ternary conditional operator with if-else in .device_group()
   * Dropped change in tegra_smmu_remove()
v1->v2
   * Added error-out labels in tegra_smmu_probe()
   * Dropped pci_request_acs() since IOMMU core would call it.

   drivers/iommu/tegra-smmu.c | 35 +--
   1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index be29f5977145..2941d6459076 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -10,6 +10,7 @@
   #include 
   #include 
   #include 
+#include <linux/pci.h>
   #include 
   #include 
   #include 
@@ -865,7 +866,11 @@ static struct iommu_group *tegra_smmu_device_group(struct 
device *dev)
group->smmu = smmu;
group->soc = soc;
-   group->group = iommu_group_alloc();
+   if (dev_is_pci(dev))
+   group->group = pci_device_group(dev);


Just to check, is it OK to have two or more swgroups "owning" the same
iommu_group if an existing one gets returned here? It looks like that might
not play nice with the use of iommu_group_set_iommudata().


Do you mean by "gets returned here" the "IS_ERR" check below?


I mean that unlike iommu_group_alloc()/generic_device_group(), 
pci_device_group() may give you back a group that already contains 
another device and has already been set up from that device's 
perspective. This can happen for topological reasons like requester ID 
aliasing through a PCI-PCIe bridge or lack of isolation between functions.


Robin.


+   else
+   group->group = generic_device_group(dev);
+
if (IS_ERR(group->group)) {
devm_kfree(smmu->dev, group);
	mutex_unlock(&smmu->lock);



Re: [PATCH v3 4/8] of: unittest: Add test for of_dma_get_max_cpu_address()

2020-10-15 Thread Nicolas Saenz Julienne
On Wed, 2020-10-14 at 17:04 -0500, Rob Herring wrote:
> On Wed, Oct 14, 2020 at 2:12 PM Nicolas Saenz Julienne
>  wrote:
> > Introduce a test for of_dma_get_max_cup_address(), it uses the same DT
> > data as the rest of dma-ranges unit tests.
> > 
> > Signed-off-by: Nicolas Saenz Julienne 
> > ---
> >  drivers/of/unittest.c | 20 
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
> > index 06cc988faf78..2cbf2a585c9f 100644
> > --- a/drivers/of/unittest.c
> > +++ b/drivers/of/unittest.c
> > @@ -869,6 +869,25 @@ static void __init of_unittest_changeset(void)
> >  #endif
> >  }
> > 
> > +static void __init of_unittest_dma_get_max_cpu_address(void)
> > +{
> > +#ifdef CONFIG_HAS_DMA
> 
> Can't the unittest run without this? I run the unittests under UML.

It was cargo-culted from its sibling of_unittest_dma_ranges_one(). Now that you
mention it, I can't seem to find the reason why it's here in the first place,
nor for other similar usages in OF code.

I ran the test in UML with all HAS_DMA conditionals removed from OF code and
things went well. I'll prepare a fix for that.

> > +   struct device_node *np;
> > +   phys_addr_t cpu_addr;
> > +
> > +   np = of_find_node_by_path("/testcase-data/address-tests");
> > +   if (!np) {
> > +   pr_err("missing testcase data\n");
> > +   return;
> > +   }
> > +
> > +   cpu_addr = of_dma_get_max_cpu_address(np);
> > +   unittest(cpu_addr == 0x5000ULL,
> > +"of_dma_get_max_cpu_address: wrong CPU addr %pad 
> > (expecting %llx)\n",
> > +&cpu_addr, 0x5000ULL);
> > +#endif
> > +}
> > +
> >  static void __init of_unittest_dma_ranges_one(const char *path,
> > u64 expect_dma_addr, u64 expect_paddr)
> >  {
> > @@ -3266,6 +3285,7 @@ static int __init of_unittest(void)
> > of_unittest_changeset();
> > of_unittest_parse_interrupts();
> > of_unittest_parse_interrupts_extended();
> > +   of_unittest_dma_get_max_cpu_address();
> > of_unittest_parse_dma_ranges();
> > of_unittest_pci_dma_ranges();
> > of_unittest_match_node();
> > --
> > 2.28.0
> > 




Re: [PATCH v3 3/8] of/address: Introduce of_dma_get_max_cpu_address()

2020-10-15 Thread Ard Biesheuvel
On Thu, 15 Oct 2020 at 11:16, Nicolas Saenz Julienne
 wrote:
>
> On Thu, 2020-10-15 at 08:56 +0200, Ard Biesheuvel wrote:
> > On Thu, 15 Oct 2020 at 00:03, Rob Herring  wrote:
> > > On Wed, Oct 14, 2020 at 2:12 PM Nicolas Saenz Julienne
> > >  wrote:
> > > > Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
> > > > physical address addressable by all DMA masters in the system. It's
> > > > specially useful for setting memory zones sizes at early boot time.
> > > >
> > > > Signed-off-by: Nicolas Saenz Julienne 
> > > >
> > > > ---
> > > >
> > > > Changes since v2:
> > > >  - Use PHYS_ADDR_MAX
> > > >  - return phys_dma_t
> > > >  - Rename function
> > > >  - Correct subject
> > > >  - Add support to start parsing from an arbitrary device node in order
> > > >for the function to work with unit tests
> > > >
> > > >  drivers/of/address.c | 42 ++
> > > >  include/linux/of.h   |  7 +++
> > > >  2 files changed, 49 insertions(+)
> > > >
> > > > diff --git a/drivers/of/address.c b/drivers/of/address.c
> > > > index eb9ab4f1e80b..b5a9695aaf82 100644
> > > > --- a/drivers/of/address.c
> > > > +++ b/drivers/of/address.c
> > > > @@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, 
> > > > const struct bus_dma_region **map)
> > > >  }
> > > >  #endif /* CONFIG_HAS_DMA */
> > > >
> > > > +/**
> > > > + * of_dma_get_max_cpu_address - Gets highest CPU address suitable for 
> > > > DMA
> > > > + * @np: The node to start searching from or NULL to start from the root
> > > > + *
> > > > + * Gets the highest CPU physical address that is addressable by all 
> > > > DMA masters
> > > > + * in the system (or subtree when np is non-NULL). If no DMA 
> > > > constrained device
> > > > + * is found, it returns PHYS_ADDR_MAX.
> > > > + */
> > > > +phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
> > > > +{
> > > > +   phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
> > >
> > > One issue with using phys_addr_t is it may be 32-bit even though the
> > > DT is 64-bit addresses. LPAE capable system with LPAE disabled. Maybe
> > > the truncation is fine here? Maybe not.
> > >
> >
> > PHYS_ADDR_MAX is the max addressable CPU address on the system, and so
> > it makes sense to use it for the return type, and for the preliminary
> > return value: this is actually what /prevents/ truncation, because we
> > will only overwrite max_cpu_addr if the new u64 value is lower.
> >
>
> Actually I now see how things might go south.
>
> > > > +   if (ranges && len) {
> > > > +   of_dma_range_parser_init(&parser, np);
> > > > +   for_each_of_range(&parser, &range)
> > > > +   if (range.cpu_addr + range.size > cpu_end)
> > > > +   cpu_end = range.cpu_addr + range.size;
>
> If cpu_end hits 0x1_0000_0000, it'll overflow to 0. This is possible on 32-bit
> systems (LPAE or not). And something similar might happen on LPAE disabled
> systems.
>
> I could add some extra logic, something like:
>
> /* We overflowed */
> if (cpu_end < range.cpu_addr)
> cpu_end = PHYS_ADDR_MAX;
>
> Which is not perfect but will cover most sensible cases.
>
> Or simply deal internally in u64s, and upon returning, check if "max_cpu_addr"
> falls higher than PHYS_ADDR_MAX.
>

Just use a u64 for cpu_end

> > > > +
> > > > +   if (max_cpu_addr > cpu_end)
> > > > +   max_cpu_addr = cpu_end;

... then this comparison and assignment will work as expected.

> > > > +   }
> > > > +
> > > > +   for_each_available_child_of_node(np, child) {
> > > > +   subtree_max_addr = of_dma_get_max_cpu_address(child);
> > > > +   if (max_cpu_addr > subtree_max_addr)
> > > > +   max_cpu_addr = subtree_max_addr;
> > > > +   }
> > > > +
> > > > +   return max_cpu_addr;
> > > > +}
>
> Regards,
> Nicolas
>
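
For reference, a sketch of the loop with cpu_end widened to u64, as Ard
suggests above (only the type changes, the flow stays the same):

	u64 cpu_end = 0;	/* u64 so 0x1_0000_0000 doesn't wrap on 32-bit */

	if (ranges && len) {
		of_dma_range_parser_init(&parser, np);
		for_each_of_range(&parser, &range)
			if (range.cpu_addr + range.size > cpu_end)
				cpu_end = range.cpu_addr + range.size;

		/*
		 * max_cpu_addr starts at PHYS_ADDR_MAX and only shrinks, so
		 * the assignment below can never truncate: it only happens
		 * when cpu_end already fits in a phys_addr_t.
		 */
		if (max_cpu_addr > cpu_end)
			max_cpu_addr = cpu_end;
	}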


Re: [PATCH v3 3/8] of/address: Introduce of_dma_get_max_cpu_address()

2020-10-15 Thread Nicolas Saenz Julienne
On Thu, 2020-10-15 at 08:56 +0200, Ard Biesheuvel wrote:
> On Thu, 15 Oct 2020 at 00:03, Rob Herring  wrote:
> > On Wed, Oct 14, 2020 at 2:12 PM Nicolas Saenz Julienne
> >  wrote:
> > > Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
> > > physical address addressable by all DMA masters in the system. It's
> > > specially useful for setting memory zones sizes at early boot time.
> > > 
> > > Signed-off-by: Nicolas Saenz Julienne 
> > > 
> > > ---
> > > 
> > > Changes since v2:
> > >  - Use PHYS_ADDR_MAX
> > >  - return phys_dma_t
> > >  - Rename function
> > >  - Correct subject
> > >  - Add support to start parsing from an arbitrary device node in order
> > >for the function to work with unit tests
> > > 
> > >  drivers/of/address.c | 42 ++
> > >  include/linux/of.h   |  7 +++
> > >  2 files changed, 49 insertions(+)
> > > 
> > > diff --git a/drivers/of/address.c b/drivers/of/address.c
> > > index eb9ab4f1e80b..b5a9695aaf82 100644
> > > --- a/drivers/of/address.c
> > > +++ b/drivers/of/address.c
> > > @@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, const 
> > > struct bus_dma_region **map)
> > >  }
> > >  #endif /* CONFIG_HAS_DMA */
> > > 
> > > +/**
> > > + * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
> > > + * @np: The node to start searching from or NULL to start from the root
> > > + *
> > > + * Gets the highest CPU physical address that is addressable by all DMA 
> > > masters
> > > + * in the system (or subtree when np is non-NULL). If no DMA constrained 
> > > device
> > > + * is found, it returns PHYS_ADDR_MAX.
> > > + */
> > > +phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
> > > +{
> > > +   phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
> > 
> > One issue with using phys_addr_t is it may be 32-bit even though the
> > DT is 64-bit addresses. LPAE capable system with LPAE disabled. Maybe
> > the truncation is fine here? Maybe not.
> > 
> 
> PHYS_ADDR_MAX is the max addressable CPU address on the system, and so
> it makes sense to use it for the return type, and for the preliminary
> return value: this is actually what /prevents/ truncation, because we
> will only overwrite max_cpu_addr if the new u64 value is lower.
> 

Actually I now see how things might go south.

> > > +   if (ranges && len) {
> > > +   of_dma_range_parser_init(&parser, np);
> > > +   for_each_of_range(&parser, &range)
> > > +   if (range.cpu_addr + range.size > cpu_end)
> > > +   cpu_end = range.cpu_addr + range.size;

If cpu_end hits 0x1_0000_0000, it'll overflow to 0. This is possible on 32-bit
systems (LPAE or not). And something similar might happen on LPAE disabled
systems.

I could add some extra logic, something like:

/* We overflowed */
if (cpu_end < range.cpu_addr)
cpu_end = PHYS_ADDR_MAX;

Which is not perfect but will cover most sensible cases.

Or simply deal internally in u64s, and upon returning, check if "max_cpu_addr"
falls higher than PHYS_ADDR_MAX.

> > > +
> > > +   if (max_cpu_addr > cpu_end)
> > > +   max_cpu_addr = cpu_end;
> > > +   }
> > > +
> > > +   for_each_available_child_of_node(np, child) {
> > > +   subtree_max_addr = of_dma_get_max_cpu_address(child);
> > > +   if (max_cpu_addr > subtree_max_addr)
> > > +   max_cpu_addr = subtree_max_addr;
> > > +   }
> > > +
> > > +   return max_cpu_addr;
> > > +}

Regards,
Nicolas




[RFC PATCH 2/2] iommu: Add IOMMU_UNBIND_FAULT_PENDING flag

2020-10-15 Thread Jean-Philippe Brucker
IOMMU drivers only need to flush their PRI queue when faults might be
pending. According to the PCIe spec (quoted below) this only happens
when using the "Stop Marker" method. Otherwise the function waits for
pending faults before signaling to the device driver that it can
unbind().

Add the IOMMU_UNBIND_FAULT_PENDING flags to unbind(), to tell the IOMMU
driver whether it's worth flushing the queue.

Signed-off-by: Jean-Philippe Brucker 
---
 include/linux/iommu.h | 31 +++
 drivers/iommu/intel/svm.c |  3 ++-
 drivers/iommu/iommu.c |  5 -
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 26c1358a2a37..fd9630b1240d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -163,6 +163,37 @@ enum iommu_dev_features {
 
 #define IOMMU_PASID_INVALID(-1U)
 
+/*
+ * Indicate that a device stops using a PASID by issuing a Stop Marker Message.
+ * From PCIe 4.0r1.0, 10.4.1.2 Managing PASID TLP Prefix Usage:
+ *
+ * "To stop without using a Stop Marker Message, the Function shall:
+ *  1. Stop queueing new Page Request Messages for this PASID.
+ *  2. Finish transmitting any multi-page Page Request Messages for this PASID
+ * (i.e. send the Page Request Message with the L bit Set).
+ *  3. Wait for PRG Response Messages associated any outstanding Page Request
+ * Messages for the PASID.
+ *  4. Indicate that the PASID has stopped using a device specific mechanism.
+ * This mechanism must indicate that a Stop Marker Message will not be
+ * generated.
+ *  To stop with the use of a Stop Marker Message the Function shall:
+ * [1. and 2. are the same]
+ *  3. Internally mark all outstanding Page Request Messages for this PASID as
+ * stale. PRG Response Messages associated with these requests will return
+ * Page Request Allocation credits and PRG Index values but are otherwise
+ * ignored.
+ *  4. Indicate that the PASID has stopped using a device specific mechanism.
+ * This mechanism must indicate that a Stop Marker Message will be
+ * generated.
+ *  5. Send a Stop Marker Message to indicate to the host that all subsequent
+ * Page Request Messages for this PASID are for a new use of the PASID
+ * value."
+ *
+ * If the device indicates that the Stop Marker Message will be generated, the
+ * device driver should set the IOMMU_UNBIND_FAULT_PENDING flag.
+ */
+#define IOMMU_UNBIND_FAULT_PENDING (1UL << 0)
+
 #ifdef CONFIG_IOMMU_API
 
 /**
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 700b05612af9..aa1fcb66fa95 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -680,7 +680,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 
pasid,
 * hard to be as defensive as we might like. */
intel_pasid_tear_down_entry(iommu, dev,
svm->pasid, false);
-   intel_svm_drain_prq(dev, svm->pasid);
+   if (flags & IOMMU_UNBIND_FAULT_PENDING)
+   intel_svm_drain_prq(dev, svm->pasid);
kfree_rcu(sdev, rcu);
 
if (list_empty(>devs)) {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 741c463095a8..eede0592a2c0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2995,7 +2995,10 @@ EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
  *
  * Put reference to a bond between device and address space. The device should
  * not be issuing any more transaction for this PASID. All outstanding page
- * requests for this PASID must have been flushed to the IOMMU.
+ * requests for this PASID must have been completed, or flushed to the IOMMU. If
+ * they have not been completed, for example when using a Stop Marker Message to
+ * stop PASID in a PCIe device, then the caller must set the flag
+ * %IOMMU_UNBIND_FAULT_PENDING.
  *
  * Returns 0 on success, or an error value
  */
-- 
2.28.0



[RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-15 Thread Jean-Philippe Brucker
Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
whether the PRI queue needs flushing. When looking at the PCIe spec
again I noticed that most of the time the SMMUv3 driver doesn't actually
need to flush the PRI queue. Does this make sense for Intel VT-d as well
or did I overlook something?

Before calling iommu_sva_unbind_device(), device drivers must stop the
device from using the PASID. For PCIe devices, that consists of
completing any pending DMA, and completing any pending page request
unless the device uses Stop Markers. So unless the device uses Stop
Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
means completing all stall events, so we never need to flush the event
queue.
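
For illustration, a driver for a device that uses Stop Markers might end
up doing something like this (hypothetical driver code; the mydev_*
helpers and fields are not part of this series):

	/* Hypothetical example of a driver tearing down a PASID */
	static void mydev_stop_pasid(struct mydev_device *mydev)
	{
		unsigned long flags = 0;

		/* Device-specific: stop queueing new work and quiesce DMA */
		mydev_stop_dma(mydev);

		/* Device-specific: will the device send a Stop Marker Message? */
		if (mydev_uses_stop_marker(mydev))
			flags |= IOMMU_UNBIND_FAULT_PENDING;

		iommu_sva_unbind_device(mydev->sva_handle, flags);
	}

With flags == 0 the IOMMU driver knows it can skip flushing the PRI
queue; with IOMMU_UNBIND_FAULT_PENDING set it still flushes it before
freeing the PASID.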

The first patch adds a flags argument to unbind(), and the second one
lets device drivers tell whether the PRI queue needs to be flushed.

Other remarks:

* The PCIe spec (see quote on patch 2) says that the device signals
  whether it has sent a Stop Marker or not during Stop PASID. In reality
  it's unlikely that a given device will sometimes use one stop method
  and sometimes the other, so it could be a device-wide flag rather than
  passing it at each unbind(). I don't want to speculate too much about
  future implementation so I prefer having the flag in unbind().

* In patch 1, uacce passes 0 to unbind(). To pass the right flag I'm
  thinking that uacce->ops->stop_queue(), which tells the device driver
  to stop DMA, should return whether faults are pending. This can be
  added later once uacce has an actual PCIe user, but we need to
  remember to do it.
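
  As a rough sketch (assuming stop_queue() is changed to report whether
  faults are pending, which is not the case today):

	bool faults_pending;

	/* hypothetical: stop_queue() reports whether page faults are pending */
	faults_pending = q->uacce->ops->stop_queue(q);
	iommu_sva_unbind_device(q->handle,
				faults_pending ? IOMMU_UNBIND_FAULT_PENDING : 0);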

Jean-Philippe Brucker (2):
  iommu: Add flags to sva_unbind()
  iommu: Add IOMMU_UNBIND_FAULT_PENDING flag

 include/linux/intel-iommu.h |  2 +-
 include/linux/iommu.h   | 38 ++---
 drivers/iommu/intel/svm.c   | 10 ++
 drivers/iommu/iommu.c   | 10 +++---
 drivers/misc/uacce/uacce.c  |  4 ++--
 5 files changed, 51 insertions(+), 13 deletions(-)

-- 
2.28.0



[RFC PATCH 1/2] iommu: Add flags to sva_unbind()

2020-10-15 Thread Jean-Philippe Brucker
Provide a way for device drivers to tell IOMMU drivers about the device
state and the cleanup work to be done, when unbinding. No functional
change.

Signed-off-by: Jean-Philippe Brucker 
---
 include/linux/intel-iommu.h | 2 +-
 include/linux/iommu.h   | 7 ---
 drivers/iommu/intel/svm.c   | 7 ---
 drivers/iommu/iommu.c   | 5 +++--
 drivers/misc/uacce/uacce.c  | 4 ++--
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index fbf5b3e7707e..5b66b23d591d 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -747,7 +747,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 int intel_svm_unbind_gpasid(struct device *dev, u32 pasid);
 struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
 void *drvdata);
-void intel_svm_unbind(struct iommu_sva *handle);
+void intel_svm_unbind(struct iommu_sva *handle, unsigned long flags);
 u32 intel_svm_get_pasid(struct iommu_sva *handle);
 int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
struct iommu_page_response *msg);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b95a6f8db6ff..26c1358a2a37 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -285,7 +285,7 @@ struct iommu_ops {
 
struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
  void *drvdata);
-   void (*sva_unbind)(struct iommu_sva *handle);
+   void (*sva_unbind)(struct iommu_sva *handle, unsigned long flags);
u32 (*sva_get_pasid)(struct iommu_sva *handle);
 
int (*page_response)(struct device *dev,
@@ -636,7 +636,7 @@ int iommu_aux_get_pasid(struct iommu_domain *domain, struct device *dev);
 struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm,
void *drvdata);
-void iommu_sva_unbind_device(struct iommu_sva *handle);
+void iommu_sva_unbind_device(struct iommu_sva *handle, unsigned long flags);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
 #else /* CONFIG_IOMMU_API */
@@ -1026,7 +1026,8 @@ iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata)
return NULL;
 }
 
-static inline void iommu_sva_unbind_device(struct iommu_sva *handle)
+static inline void iommu_sva_unbind_device(struct iommu_sva *handle,
+  unsigned long flags)
 {
 }
 
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index f1861fa3d0e4..700b05612af9 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -651,7 +651,8 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 }
 
 /* Caller must hold pasid_mutex */
-static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
+static int intel_svm_unbind_mm(struct device *dev, u32 pasid,
+  unsigned long flags)
 {
struct intel_svm_dev *sdev;
struct intel_iommu *iommu;
@@ -1091,13 +1092,13 @@ intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
return sva;
 }
 
-void intel_svm_unbind(struct iommu_sva *sva)
+void intel_svm_unbind(struct iommu_sva *sva, unsigned long flags)
 {
struct intel_svm_dev *sdev;
 
mutex_lock(&pasid_mutex);
sdev = to_intel_svm_dev(sva);
-   intel_svm_unbind_mm(sdev->dev, sdev->pasid);
+   intel_svm_unbind_mm(sdev->dev, sdev->pasid, flags);
mutex_unlock(&pasid_mutex);
 }
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c470f451a32..741c463095a8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2991,6 +2991,7 @@ EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
 /**
  * iommu_sva_unbind_device() - Remove a bond created with iommu_sva_bind_device
  * @handle: the handle returned by iommu_sva_bind_device()
+ * @flags: IOMMU_UNBIND_* flags
  *
  * Put reference to a bond between device and address space. The device should
  * not be issuing any more transaction for this PASID. All outstanding page
@@ -2998,7 +2999,7 @@ EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
  *
  * Returns 0 on success, or an error value
  */
-void iommu_sva_unbind_device(struct iommu_sva *handle)
+void iommu_sva_unbind_device(struct iommu_sva *handle, unsigned long flags)
 {
struct iommu_group *group;
struct device *dev = handle->dev;
@@ -3012,7 +3013,7 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
return;
 
mutex_lock(&group->mutex);
-   ops->sva_unbind(handle);
+   ops->sva_unbind(handle, flags);
mutex_unlock(&group->mutex);
 
iommu_group_put(group);
diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index 56dd98ab5a81..0800566a6656 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -105,7 +105,7 @@ 

Re: [PATCH v3 1/8] arm64: mm: Move reserve_crashkernel() into mem_init()

2020-10-15 Thread Nicolas Saenz Julienne
On Thu, 2020-10-15 at 09:40 +0100, Will Deacon wrote:
> On Wed, Oct 14, 2020 at 09:12:03PM +0200, Nicolas Saenz Julienne wrote:
> > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > boot table initialization, so move it later in the boot process.
> > Specifically into mem_init(), this is the last place crashkernel will be
> > able to reserve the memory before the page allocator kicks in and there is
> > no need to do it earlier.
> > 
> > Signed-off-by: Nicolas Saenz Julienne 
> > ---
> >  arch/arm64/mm/init.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Please can you cc me on the whole series next time? I know different
> maintainers have different preferences here, but I find it much easier to
> figure out what's happening when I can see all of the changes together.

Will do.

Regards,
Nicolas





Re: [PATCH v3 3/8] of/address: Introduce of_dma_get_max_cpu_address()

2020-10-15 Thread Nicolas Saenz Julienne
On Wed, 2020-10-14 at 17:02 -0500, Rob Herring wrote:
> On Wed, Oct 14, 2020 at 2:12 PM Nicolas Saenz Julienne
>  wrote:
> > Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
> > physical address addressable by all DMA masters in the system. It's
> > specially useful for setting memory zones sizes at early boot time.
> > 
> > Signed-off-by: Nicolas Saenz Julienne 
> > 
> > ---

[...]

> > +   struct of_range_parser parser;
> > +   phys_addr_t subtree_max_addr;
> > +   struct device_node *child;
> > +   phys_addr_t cpu_end = 0;
> > +   struct of_range range;
> > +   const __be32 *ranges;
> > +   int len;
> > +
> > +   if (!np)
> > +   np = of_root;
> > +
> > +   ranges = of_get_property(np, "dma-ranges", &len);
> 
> I'm not really following why you changed the algorithm here. You're
> skipping disabled nodes which is good. Was there some other reason?

Yes, it's a little more complex. But I had to change it in order to be able to
start parsing down from an arbitrary device node, which is needed for the unit
tests.

for_each_of_allnodes() and friends will traverse the whole tree, regardless of
the starting point. I couldn't find a similar function that would just iterate
over a subsection of the tree, so I went with this recursive approach.

Regards,
Nicolas




Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-15 Thread Jason Wang


On 2020/10/15 3:58 PM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Thursday, October 15, 2020 2:52 PM


On 2020/10/14 11:08 AM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Tuesday, October 13, 2020 2:22 PM


On 2020/10/12 4:38 PM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Monday, September 14, 2020 12:20 PM


[...]
> If it's possible, I would suggest a generic uAPI instead of a VFIO

specific one.

Jason suggest something like /dev/sva. There will be a lot of other
subsystems that could benefit from this (e.g vDPA).

Have you ever considered this approach?


Hi, Jason,

We did some study on this approach and below is the output. It's a
long writing but I didn't find a way to further abstract w/o losing
necessary context. Sorry about that.

Overall the real purpose of this series is to enable IOMMU nested
translation capability with vSVA as one major usage, through
below new uAPIs:
1) Report/enable IOMMU nested translation capability;
2) Allocate/free PASID;
3) Bind/unbind guest page table;
4) Invalidate IOMMU cache;
5) Handle IOMMU page request/response (not in this series);
1/3/4) is the minimal set for using IOMMU nested translation, with
the other two optional. For example, the guest may enable vSVA on
a device without using PASID. Or, it may bind its gIOVA page table
which doesn't require page fault support. Finally, all operations can
be applied to either physical device or subdevice.

Then we evaluated each uAPI whether generalizing it is a good thing
both in concept and regarding to complexity.

First, unlike other uAPIs which are all backed by iommu_ops, PASID
allocation/free is through the IOASID sub-system.

A question here, is IOASID expected to be the single management
interface for PASID?

yes


(I'm asking since there're already vendor specific IDA based PASID
allocator e.g amdgpu_pasid_alloc())

That comes before IOASID core was introduced. I think it should be
changed to use the new generic interface. Jacob/Jean can better
comment if other reason exists for this exception.


If there's no exception it should be fixed.



From this angle
we feel generalizing PASID management does make some sense.
First, PASID is just a number and not related to any device before
it's bound to a page table and IOMMU domain. Second, PASID is a
global resource (at least on Intel VT-d),

I think we need a definition of "global" here. It looks to me for vt-d
the PASID table is per device.

PASID table is per device, thus VT-d could support per-device PASIDs
in concept.


I think that's the requirement of PCIE spec which said PASID + RID
identifies the process address space ID.



   However on Intel platform we require PASIDs to be managed
in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
and ENQCMD together.


Any reason for such requirement? (I'm not familiar with ENQCMD, but my
understanding is that vSVA, SIOV or SR-IOV doesn't have the requirement
for system-wide PASID).

ENQCMD is a new instruction that allows multiple processes to submit
work to one shared workqueue. Each process has a unique PASID saved
in an MSR, which is included in the ENQCMD payload to indicate the
address space when the CPU sends the command to the device. As one
process might issue ENQCMD to multiple devices, OS-wide PASID
allocation is required on both the host and guest sides.

When executing ENQCMD in the guest to a SIOV device, the guest
programmed value in the PASID_MSR must be translated to a host PASID
value for proper function/isolation as PASID represents the address
space. The translation is done through a new VMCS PASID translation
structure (per-VM, and 1:1 mapping). From this angle the host PASIDs
must be allocated 'globally' across all assigned devices; otherwise it may
lead to 1:N mapping when a guest process issues ENQCMD to multiple
assigned devices/subdevices.

There will be a KVM forum session for this topic btw.



Thanks for the background. Now I see the restrict comes from ENQCMD.







Thus the host creates only one 'global' PASID
namespace but do use per-device PASID table to assure isolation between
devices on Intel platforms. But ARM does it differently as Jean explained.
They have a global namespace for host processes on all host-owned
devices (same as Intel), but then per-device namespace when a device
(and its PASID table) is assigned to userspace.


Another question, is this possible to have two DMAR hardware unit(at
least I can see two even in my laptop). In this case, is PASID still a
global resource?

yes


while having separate VFIO/
VDPA allocation interfaces may easily cause confusion in userspace,
e.g. which interface to be used if both VFIO/VDPA devices exist.
Moreover, an unified interface allows centralized control over how
many PASIDs are allowed per process.

Yes.



One unclear part with this generalization is about the permission.
Do we open this interface to any process or only to those which
have assigned devices? If the latter, 

Re: [PATCH v3 1/8] arm64: mm: Move reserve_crashkernel() into mem_init()

2020-10-15 Thread Will Deacon
On Wed, Oct 14, 2020 at 09:12:03PM +0200, Nicolas Saenz Julienne wrote:
> crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> boot table initialization, so move it later in the boot process.
> Specifically into mem_init(), this is the last place crashkernel will be
> able to reserve the memory before the page allocator kicks in and there is
> no need to do it earlier.
> 
> Signed-off-by: Nicolas Saenz Julienne 
> ---
>  arch/arm64/mm/init.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Please can you cc me on the whole series next time? I know different
maintainers have different preferences here, but I find it much easier to
figure out what's happening when I can see all of the changes together.

Thanks,

Will


RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-15 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, October 15, 2020 2:52 PM
> 
> 
> On 2020/10/14 11:08 AM, Tian, Kevin wrote:
> >> From: Jason Wang 
> >> Sent: Tuesday, October 13, 2020 2:22 PM
> >>
> >>
> >> On 2020/10/12 4:38 PM, Tian, Kevin wrote:
>  From: Jason Wang 
>  Sent: Monday, September 14, 2020 12:20 PM
> 
> >>> [...]
> >>>> If it's possible, I would suggest a generic uAPI instead of a VFIO
>  specific one.
> 
>  Jason suggest something like /dev/sva. There will be a lot of other
>  subsystems that could benefit from this (e.g vDPA).
> 
>  Have you ever considered this approach?
> 
> >>> Hi, Jason,
> >>>
> >>> We did some study on this approach and below is the output. It's a
> >>> long writing but I didn't find a way to further abstract w/o losing
> >>> necessary context. Sorry about that.
> >>>
> >>> Overall the real purpose of this series is to enable IOMMU nested
> >>> translation capability with vSVA as one major usage, through
> >>> below new uAPIs:
> >>>   1) Report/enable IOMMU nested translation capability;
> >>>   2) Allocate/free PASID;
> >>>   3) Bind/unbind guest page table;
> >>>   4) Invalidate IOMMU cache;
> >>>   5) Handle IOMMU page request/response (not in this series);
> >>> 1/3/4) is the minimal set for using IOMMU nested translation, with
> >>> the other two optional. For example, the guest may enable vSVA on
> >>> a device without using PASID. Or, it may bind its gIOVA page table
> >>> which doesn't require page fault support. Finally, all operations can
> >>> be applied to either physical device or subdevice.
> >>>
> >>> Then we evaluated each uAPI whether generalizing it is a good thing
> >>> both in concept and regarding to complexity.
> >>>
> >>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
> >>> allocation/free is through the IOASID sub-system.
> >>
> >> A question here, is IOASID expected to be the single management
> >> interface for PASID?
> > yes
> >
> >> (I'm asking since there're already vendor specific IDA based PASID
> >> allocator e.g amdgpu_pasid_alloc())
> > That comes before IOASID core was introduced. I think it should be
> > changed to use the new generic interface. Jacob/Jean can better
> > comment if other reason exists for this exception.
> 
> 
> If there's no exception it should be fixed.
> 
> 
> >
> >>
> >>>From this angle
> >>> we feel generalizing PASID management does make some sense.
> >>> First, PASID is just a number and not related to any device before
> >>> it's bound to a page table and IOMMU domain. Second, PASID is a
> >>> global resource (at least on Intel VT-d),
> >>
> >> I think we need a definition of "global" here. It looks to me for vt-d
> >> the PASID table is per device.
> > PASID table is per device, thus VT-d could support per-device PASIDs
> > in concept.
> 
> 
> I think that's the requirement of PCIE spec which said PASID + RID
> identifies the process address space ID.
> 
> 
> >   However on Intel platform we require PASIDs to be managed
> > in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
> > and ENQCMD together.
> 
> 
> Any reason for such requirement? (I'm not familiar with ENQCMD, but my
> understanding is that vSVA, SIOV or SR-IOV doesn't have the requirement
> for system-wide PASID).

ENQCMD is a new instruction that allows multiple processes to submit
work to one shared workqueue. Each process has a unique PASID saved
in an MSR, which is included in the ENQCMD payload to indicate the
address space when the CPU sends the command to the device. As one
process might issue ENQCMD to multiple devices, OS-wide PASID
allocation is required on both the host and guest sides.

When executing ENQCMD in the guest to a SIOV device, the guest
programmed value in the PASID_MSR must be translated to a host PASID
value for proper function/isolation as PASID represents the address
space. The translation is done through a new VMCS PASID translation 
structure (per-VM, and 1:1 mapping). From this angle the host PASIDs 
must be allocated 'globally' across all assigned devices; otherwise it may
lead to 1:N mapping when a guest process issues ENQCMD to multiple 
assigned devices/subdevices. 

There will be a KVM forum session for this topic btw.

> 
> 
> > Thus the host creates only one 'global' PASID
> > namespace but do use per-device PASID table to assure isolation between
> > devices on Intel platforms. But ARM does it differently as Jean explained.
> > They have a global namespace for host processes on all host-owned
> > devices (same as Intel), but then per-device namespace when a device
> > (and its PASID table) is assigned to userspace.
> >
> >> Another question, is this possible to have two DMAR hardware unit(at
> >> least I can see two even in my laptop). In this case, is PASID still a
> >> global resource?
> > yes
> >
> >>
> >>>while having separate VFIO/
> >>> VDPA allocation interfaces may easily cause confusion in userspace,
> >>> e.g. which interface to be 

Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-15 Thread Jason Wang


On 2020/10/15 7:10 AM, Alex Williamson wrote:

On Wed, 14 Oct 2020 03:08:31 +
"Tian, Kevin"  wrote:


From: Jason Wang 
Sent: Tuesday, October 13, 2020 2:22 PM


On 2020/10/12 4:38 PM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Monday, September 14, 2020 12:20 PM
  

[...]
   > If it's possible, I would suggest a generic uAPI instead of a VFIO

specific one.

Jason suggest something like /dev/sva. There will be a lot of other
subsystems that could benefit from this (e.g vDPA).

Have you ever considered this approach?
  

Hi, Jason,

We did some study on this approach and below is the output. It's a
long writing but I didn't find a way to further abstract w/o losing
necessary context. Sorry about that.

Overall the real purpose of this series is to enable IOMMU nested
translation capability with vSVA as one major usage, through
below new uAPIs:
1) Report/enable IOMMU nested translation capability;
2) Allocate/free PASID;
3) Bind/unbind guest page table;
4) Invalidate IOMMU cache;
5) Handle IOMMU page request/response (not in this series);
1/3/4) is the minimal set for using IOMMU nested translation, with
the other two optional. For example, the guest may enable vSVA on
a device without using PASID. Or, it may bind its gIOVA page table
which doesn't require page fault support. Finally, all operations can
be applied to either physical device or subdevice.

Then we evaluated each uAPI whether generalizing it is a good thing
both in concept and regarding to complexity.

First, unlike other uAPIs which are all backed by iommu_ops, PASID
allocation/free is through the IOASID sub-system.


A question here, is IOASID expected to be the single management
interface for PASID?

yes


(I'm asking since there're already vendor specific IDA based PASID
allocator e.g amdgpu_pasid_alloc())

That comes before IOASID core was introduced. I think it should be
changed to use the new generic interface. Jacob/Jean can better
comment if other reason exists for this exception.

   

   From this angle
we feel generalizing PASID management does make some sense.
First, PASID is just a number and not related to any device before
it's bound to a page table and IOMMU domain. Second, PASID is a
global resource (at least on Intel VT-d),


I think we need a definition of "global" here. It looks to me for vt-d
the PASID table is per device.

PASID table is per device, thus VT-d could support per-device PASIDs
in concept. However on Intel platform we require PASIDs to be managed
in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
and ENQCMD together. Thus the host creates only one 'global' PASID
namespace but do use per-device PASID table to assure isolation between
devices on Intel platforms. But ARM does it differently as Jean explained.
They have a global namespace for host processes on all host-owned
devices (same as Intel), but then per-device namespace when a device
(and its PASID table) is assigned to userspace.


Another question, is this possible to have two DMAR hardware unit(at
least I can see two even in my laptop). In this case, is PASID still a
global resource?

yes

   

   while having separate VFIO/
VDPA allocation interfaces may easily cause confusion in userspace,
e.g. which interface to be used if both VFIO/VDPA devices exist.
Moreover, an unified interface allows centralized control over how
many PASIDs are allowed per process.


Yes.

   

One unclear part with this generalization is about the permission.
Do we open this interface to any process or only to those which
have assigned devices? If the latter, what would be the mechanism
to coordinate between this new interface and specific passthrough
frameworks?


I'm not sure, but if you just want a permission, you probably can
introduce new capability (CAP_XXX) for this.
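
e.g. something along the lines of (illustration only, the exact
capability is undecided):

	if (!capable(CAP_SYS_RAWIO))	/* or a new dedicated CAP_* */
		return -EPERM;

in whatever ioctl ends up allocating the PASIDs.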

   

   A more tricky case, vSVA support on ARM (Eric/Jean
please correct me) plans to do per-device PASID namespace which
is built on a bind_pasid_table iommu callback to allow guest fully
manage its PASIDs on a given passthrough device.


I see, so I think the answer is to prepare for the namespace support
from the start. (btw, I don't see how namespace is handled in current
IOASID module?)

The PASID table is based on GPA when nested translation is enabled
on ARM SMMU. This design implies that the guest manages PASID
table thus PASIDs instead of going through host-side API on assigned
device. From this angle we don't need explicit namespace in the host
API. Just need a way to control how many PASIDs a process is allowed
to allocate in the global namespace. btw IOASID module already has
'set' concept per-process and PASIDs are managed per-set. Then the
quota control can be easily introduced in the 'set' level.
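
Roughly, from a caller's point of view (a sketch only; the per-process
set and its quota are hypothetical, not what the IOASID code does today):

	ioasid_t pasid;

	/* hypothetical: process_ioasid_set enforces a quota internally */
	pasid = ioasid_alloc(process_ioasid_set, 1, max_pasid, NULL);
	if (pasid == INVALID_IOASID)
		return -ENOSPC;	/* quota exhausted or PASID space full */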

   

   I'm not sure
how such requirement can be unified w/o involving passthrough
frameworks, or whether ARM could also switch to global PASID
style...

Second, IOMMU nested translation is a per IOMMU domain
capability. Since 

Re: [PATCH v3 3/8] of/address: Introduce of_dma_get_max_cpu_address()

2020-10-15 Thread Ard Biesheuvel
On Thu, 15 Oct 2020 at 00:03, Rob Herring  wrote:
>
> On Wed, Oct 14, 2020 at 2:12 PM Nicolas Saenz Julienne
>  wrote:
> >
> > Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
> > physical address addressable by all DMA masters in the system. It's
> > specially useful for setting memory zones sizes at early boot time.
> >
> > Signed-off-by: Nicolas Saenz Julienne 
> >
> > ---
> >
> > Changes since v2:
> >  - Use PHYS_ADDR_MAX
> >  - return phys_dma_t
> >  - Rename function
> >  - Correct subject
> >  - Add support to start parsing from an arbitrary device node in order
> >for the function to work with unit tests
> >
> >  drivers/of/address.c | 42 ++
> >  include/linux/of.h   |  7 +++
> >  2 files changed, 49 insertions(+)
> >
> > diff --git a/drivers/of/address.c b/drivers/of/address.c
> > index eb9ab4f1e80b..b5a9695aaf82 100644
> > --- a/drivers/of/address.c
> > +++ b/drivers/of/address.c
> > @@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map)
> >  }
> >  #endif /* CONFIG_HAS_DMA */
> >
> > +/**
> > + * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
> > + * @np: The node to start searching from or NULL to start from the root
> > + *
> > + * Gets the highest CPU physical address that is addressable by all DMA masters
> > + * in the system (or subtree when np is non-NULL). If no DMA constrained device
> > + * is found, it returns PHYS_ADDR_MAX.
> > + */
> > +phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
> > +{
> > +   phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
>
> One issue with using phys_addr_t is it may be 32-bit even though the
> DT is 64-bit addresses. LPAE capable system with LPAE disabled. Maybe
> the truncation is fine here? Maybe not.
>

PHYS_ADDR_MAX is the max addressable CPU address on the system, and so
it makes sense to use it for the return type, and for the preliminary
return value: this is actually what /prevents/ truncation, because we
will only overwrite max_cpu_addr if the new u64 value is lower.


> > +   struct of_range_parser parser;
> > +   phys_addr_t subtree_max_addr;
> > +   struct device_node *child;
> > +   phys_addr_t cpu_end = 0;
> > +   struct of_range range;
> > +   const __be32 *ranges;
> > +   int len;
> > +
> > +   if (!np)
> > +   np = of_root;
> > +
> > +   ranges = of_get_property(np, "dma-ranges", &len);
>
> I'm not really following why you changed the algorithm here. You're
> skipping disabled nodes which is good. Was there some other reason?
>
> > +   if (ranges && len) {
> > +   of_dma_range_parser_init(&parser, np);
> > +   for_each_of_range(&parser, &range)
> > +   if (range.cpu_addr + range.size > cpu_end)
> > +   cpu_end = range.cpu_addr + range.size;
> > +
> > +   if (max_cpu_addr > cpu_end)
> > +   max_cpu_addr = cpu_end;
> > +   }
> > +
> > +   for_each_available_child_of_node(np, child) {
> > +   subtree_max_addr = of_dma_get_max_cpu_address(child);
> > +   if (max_cpu_addr > subtree_max_addr)
> > +   max_cpu_addr = subtree_max_addr;
> > +   }
> > +
> > +   return max_cpu_addr;
> > +}
> > +
> >  /**
> >   * of_dma_is_coherent - Check if device is coherent
> >   * @np:device node
> > diff --git a/include/linux/of.h b/include/linux/of.h
> > index 481ec0467285..db8db8f2c967 100644
> > --- a/include/linux/of.h
> > +++ b/include/linux/of.h
> > @@ -558,6 +558,8 @@ int of_map_id(struct device_node *np, u32 id,
> >const char *map_name, const char *map_mask_name,
> >struct device_node **target, u32 *id_out);
> >
> > +phys_addr_t of_dma_get_max_cpu_address(struct device_node *np);
> > +
> >  #else /* CONFIG_OF */
> >
> >  static inline void of_core_init(void)
> > @@ -995,6 +997,11 @@ static inline int of_map_id(struct device_node *np, u32 id,
> > return -EINVAL;
> >  }
> >
> > +static inline phys_addr_t of_dma_get_max_cpu_address(struct device_node *np)
> > +{
> > +   return PHYS_ADDR_MAX;
> > +}
> > +
> >  #define of_match_ptr(_ptr) NULL
> >  #define of_match_node(_matches, _node) NULL
> >  #endif /* CONFIG_OF */
> > --
> > 2.28.0
> >


Re: [PATCH 2/3] drm/msm: add DRM_MSM_GEM_SYNC_CACHE for non-coherent cache maintenance

2020-10-15 Thread Christoph Hellwig
On Tue, Oct 13, 2020 at 02:42:38PM +0100, Robin Murphy wrote:
> I still think this situation would be best handled with a variant of
> dma_ops_bypass that also guarantees to bypass SWIOTLB, and can be set
> automatically when attaching to an unmanaged IOMMU domain.

dma_ops_bypass should mostly do the right thing as-is.  swiotlb bouncing
is triggered by two things:

 1) the dma_mask.  This is under control of the driver, and obviously
if it is too small for a legit reason we can't just proceed
 2) force_dma_unencrypted() - we'd need to do an opt-out here, either
by a flag or by being smart and looking for an attached iommu on
the device
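
As an illustration of those two conditions (a simplified sketch, not the
exact dma-direct code):

	static bool dev_needs_bounce(struct device *dev, phys_addr_t phys,
				     size_t size)
	{
		/* 1) the dma_mask: bounce if the address isn't reachable */
		if (!dma_capable(dev, phys_to_dma(dev, phys), size, true))
			return true;

		/* 2) memory encryption: the device needs unencrypted memory */
		if (force_dma_unencrypted(dev))
			return true;

		return false;
	}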

> That way the
> device driver can make DMA API calls in the appropriate places that do the
> right thing either way, and only needs logic to decide whether to use the
> returned DMA addresses directly or ignore them if it knows they're
> overridden by its own IOMMU mapping.

I'd be happy to review patches for this.


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-15 Thread Jason Wang


On 2020/10/14 11:08 AM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Tuesday, October 13, 2020 2:22 PM


On 2020/10/12 4:38 PM, Tian, Kevin wrote:

From: Jason Wang 
Sent: Monday, September 14, 2020 12:20 PM


[...]
   > If it's possible, I would suggest a generic uAPI instead of a VFIO

specific one.

Jason suggest something like /dev/sva. There will be a lot of other
subsystems that could benefit from this (e.g vDPA).

Have you ever considered this approach?


Hi, Jason,

We did some study on this approach and below is the output. It's a
long writing but I didn't find a way to further abstract w/o losing
necessary context. Sorry about that.

Overall the real purpose of this series is to enable IOMMU nested
translation capability with vSVA as one major usage, through
below new uAPIs:
1) Report/enable IOMMU nested translation capability;
2) Allocate/free PASID;
3) Bind/unbind guest page table;
4) Invalidate IOMMU cache;
5) Handle IOMMU page request/response (not in this series);
1/3/4) is the minimal set for using IOMMU nested translation, with
the other two optional. For example, the guest may enable vSVA on
a device without using PASID. Or, it may bind its gIOVA page table
which doesn't require page fault support. Finally, all operations can
be applied to either physical device or subdevice.

Then we evaluated each uAPI whether generalizing it is a good thing
both in concept and regarding to complexity.

First, unlike other uAPIs which are all backed by iommu_ops, PASID
allocation/free is through the IOASID sub-system.


A question here, is IOASID expected to be the single management
interface for PASID?

yes


(I'm asking since there're already vendor specific IDA based PASID
allocator e.g amdgpu_pasid_alloc())

That comes before IOASID core was introduced. I think it should be
changed to use the new generic interface. Jacob/Jean can better
comment if other reason exists for this exception.



If there's no exception it should be fixed.







   From this angle
we feel generalizing PASID management does make some sense.
First, PASID is just a number and not related to any device before
it's bound to a page table and IOMMU domain. Second, PASID is a
global resource (at least on Intel VT-d),


I think we need a definition of "global" here. It looks to me for vt-d
the PASID table is per device.

PASID table is per device, thus VT-d could support per-device PASIDs
in concept.



I think that's the requirement of PCIE spec which said PASID + RID 
identifies the process address space ID.




  However on Intel platform we require PASIDs to be managed
in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
and ENQCMD together.



Any reason for such requirement? (I'm not familiar with ENQCMD, but my 
understanding is that vSVA, SIOV or SR-IOV doesn't have the requirement 
for system-wide PASID).




Thus the host creates only one 'global' PASID
namespace but do use per-device PASID table to assure isolation between
devices on Intel platforms. But ARM does it differently as Jean explained.
They have a global namespace for host processes on all host-owned
devices (same as Intel), but then per-device namespace when a device
(and its PASID table) is assigned to userspace.


Another question, is this possible to have two DMAR hardware unit(at
least I can see two even in my laptop). In this case, is PASID still a
global resource?

yes




   while having separate VFIO/
VDPA allocation interfaces may easily cause confusion in userspace,
e.g. which interface to be used if both VFIO/VDPA devices exist.
Moreover, an unified interface allows centralized control over how
many PASIDs are allowed per process.


Yes.



One unclear part with this generalization is about the permission.
Do we open this interface to any process or only to those which
have assigned devices? If the latter, what would be the mechanism
to coordinate between this new interface and specific passthrough
frameworks?


I'm not sure, but if you just want a permission, you probably can
introduce new capability (CAP_XXX) for this.



   A more tricky case, vSVA support on ARM (Eric/Jean
please correct me) plans to do per-device PASID namespace which
is built on a bind_pasid_table iommu callback to allow guest fully
manage its PASIDs on a given passthrough device.


I see, so I think the answer is to prepare for the namespace support
from the start. (btw, I don't see how namespace is handled in current
IOASID module?)

The PASID table is based on GPA when nested translation is enabled
on ARM SMMU. This design implies that the guest manages PASID
table thus PASIDs instead of going through host-side API on assigned
device. From this angle we don't need explicit namespace in the host
API. Just need a way to control how many PASIDs a process is allowed
to allocate in the global namespace. btw IOASID module already has
'set' concept per-process and PASIDs are managed per-set. Then the
quota control can be