Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation

2019-08-16 Thread John Garry

On 15/08/2019 14:55, Will Deacon wrote:

On Thu, Aug 15, 2019 at 12:19:58PM +0100, John Garry wrote:

On 14/08/2019 18:56, Will Deacon wrote:

If you'd like to play with the patches, then I've also pushed them here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap

but they should behave as a no-op on their own.


As anticipated, my storage testing scenarios roughly give parity throughput
and CPU loading before and after this series.

Patches to convert the Arm SMMUv3 driver to the new API are here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq


I quickly tested this again and now I see a performance lift:

                   before (5.3-rc1)   after
D05 8x SAS disks   907K IOPS          970K IOPS
D05 1x NVMe        450K IOPS          466K IOPS
D06 1x NVMe        467K IOPS          466K IOPS

The CPU loading seems to track throughput, so nothing much to say there.

Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for
!IOMMU.


Cheers, John. For interest, how do things look if you pass iommu.strict=0?
That might give some indication about how much the invalidation is still
hurting us.


So I tested the iommu/cmdq branch with NVMe only, and I see:

           !SMMU       5.3-rc4 strict/!strict   cmdq strict/!strict
D05 NVMe   750K IOPS   456K/540K IOPS           466K/537K
D06 NVMe   750K IOPS   456K/740K IOPS           466K/745K

I don't know why the D06 iommu.strict performance is roughly the same as 
that of D05, while !strict is so much better. The D06 SMMU implementation 
is supposed to be generally much better than that of D05, so I would have 
expected the strict performance to be better as well.

BTW, what were your thoughts on changing
arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It seems
suitable, but looks untouched. Were you waiting for a resolution to the
performance issue which Leizhen reported?


In principle, I'm supportive of such a change, but I'm not currently able
to test any ATS stuff so somebody else would need to write the patch.
Jean-Philippe is on holiday at the moment, but I'd be happy to review
something from you if you send it out.


Unfortunately I don't have anything ATS-enabled either. Not many do, it 
seems.
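
For reference, here is the sort of shape I had in mind, as a rough and 
untested sketch (the function name arm_smmu_atc_inv_domain_batched() is 
just a placeholder; the other helper and field names are taken from the 
current arm-smmu-v3.c): queue all of the ATC_INV commands for the domain 
and only then issue a single CMD_SYNC, rather than syncing per master as 
arm_smmu_atc_inv_master() does today.

/*
 * Untested sketch only (no ATS hardware here to try it on). Queue one
 * ATC_INV per StreamID for every master in the domain, then issue a
 * single CMD_SYNC, instead of syncing per master.
 */
static int arm_smmu_atc_inv_domain_batched(struct arm_smmu_domain *smmu_domain,
					   int ssid, unsigned long iova,
					   size_t size)
{
	int i;
	unsigned long flags;
	struct arm_smmu_cmdq_ent cmd;
	struct arm_smmu_master *master;
	struct arm_smmu_device *smmu = smmu_domain->smmu;

	if (!(smmu->features & ARM_SMMU_FEAT_ATS))
		return 0;

	arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);

	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
	list_for_each_entry(master, &smmu_domain->devices, domain_head) {
		if (!master->ats_enabled)
			continue;

		/* Queue the invalidations, but defer the sync. */
		for (i = 0; i < master->num_sids; i++) {
			cmd.atc.sid = master->sids[i];
			arm_smmu_cmdq_issue_cmd(smmu, &cmd);
		}
	}
	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);

	/* One CMD_SYNC covers everything queued above. */
	return arm_smmu_cmdq_issue_sync(smmu);
}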


Cheers,
John



Will



Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation

2019-08-15 Thread Will Deacon
On Thu, Aug 15, 2019 at 12:19:58PM +0100, John Garry wrote:
> On 14/08/2019 18:56, Will Deacon wrote:
> > If you'd like to play with the patches, then I've also pushed them here:
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap
> > 
> > but they should behave as a no-op on their own.
> 
> As anticipated, my storage testing scenarios roughly give parity throughput
> and CPU loading before and after this series.
> 
> Patches to convert the
> > Arm SMMUv3 driver to the new API are here:
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq
> 
> I quickly tested this again and now I see a performance lift:
> 
>                    before (5.3-rc1)   after
> D05 8x SAS disks   907K IOPS          970K IOPS
> D05 1x NVMe        450K IOPS          466K IOPS
> D06 1x NVMe        467K IOPS          466K IOPS
> 
> The CPU loading seems to track throughput, so nothing much to say there.
> 
> Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for
> !IOMMU.

Cheers, John. For interest, how do things look if you pass iommu.strict=0?
That might give some indication about how much the invalidation is still
hurting us.

> BTW, what were your thoughts on changing
> arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It seems
> suitable, but looks untouched. Were you waiting for a resolution to the
> performance issue which Leizhen reported?

In principle, I'm supportive of such a change, but I'm not currently able
to test any ATS stuff so somebody else would need to write the patch.
Jean-Philippe is on holiday at the moment, but I'd be happy to review
something from you if you send it out.

Will


Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation

2019-08-15 Thread John Garry

On 14/08/2019 18:56, Will Deacon wrote:

Hi everybody,

These are the core IOMMU changes that I have posted previously as part
of my ongoing effort to reduce the lock contention of the SMMUv3 command
queue. I thought it would be better to split this out as a separate
series, since I think it's ready to go and all the driver conversions
mean that it's quite a pain for me to maintain out of tree!

The idea of the patch series is to allow TLB invalidation to be batched
up into a new 'struct iommu_iotlb_gather' structure, which tracks the
properties of the virtual address range being invalidated so that it
can be deferred until the driver's ->iotlb_sync() function is called.
This allows for more efficient invalidation on hardware that can submit
multiple invalidations in one go.
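
As a rough illustration of the intended caller-side pattern (helper names 
as introduced by this series; error handling trimmed, so treat it as a 
sketch rather than final code):

#include <linux/iommu.h>

/*
 * Illustrative sketch only: how a caller of the IOMMU API might batch
 * TLB invalidations with the new gather structure.
 */
static size_t unmap_range_batched(struct iommu_domain *domain,
				  unsigned long iova, size_t size,
				  size_t pgsize)
{
	struct iommu_iotlb_gather gather;
	size_t unmapped = 0, ret;

	iommu_iotlb_gather_init(&gather);

	/* Each unmap only records the range to invalidate; nothing is flushed yet. */
	while (unmapped < size) {
		ret = iommu_unmap_fast(domain, iova + unmapped, pgsize, &gather);
		if (!ret)
			break;
		unmapped += ret;
	}

	/* A single sync at the end invalidates the whole accumulated range. */
	iommu_tlb_sync(domain, &gather);

	return unmapped;
}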

The previous series was included in:

  https://lkml.kernel.org/r/20190711171927.28803-1-w...@kernel.org

The only real change since then is incorporating the newly merged
virtio-iommu driver.

If you'd like to play with the patches, then I've also pushed them here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap

but they should behave as a no-op on their own.


Hi Will,

As anticipated, my storage testing scenarios roughly give parity 
throughput and CPU loading before and after this series.


Patches to convert the Arm SMMUv3 driver to the new API are here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq


I quickly tested this again and now I see a performance lift:

                   before (5.3-rc1)   after
D05 8x SAS disks   907K IOPS          970K IOPS
D05 1x NVMe        450K IOPS          466K IOPS
D06 1x NVMe        467K IOPS          466K IOPS

The CPU loading seems to track throughput, so nothing much to say there.

Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for 
!IOMMU.


BTW, what were your thoughts on changing 
arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It 
seems suitable, but looks untouched. Were you waiting for a resolution 
to the performance issue which Leizhen reported?


Thanks,
John



Cheers,

Will

--->8

Cc: Jean-Philippe Brucker 
Cc: Robin Murphy 
Cc: Jayachandran Chandrasekharan Nair 
Cc: Jan Glauber 
Cc: Jon Masters 
Cc: Eric Auger 
Cc: Zhen Lei 
Cc: Jonathan Cameron 
Cc: Vijay Kilary 
Cc: Joerg Roedel 
Cc: John Garry 
Cc: Alex Williamson 
Cc: Marek Szyprowski 
Cc: David Woodhouse 

Will Deacon (13):
  iommu: Remove empty iommu_tlb_range_add() callback from iommu_ops
  iommu/io-pgtable-arm: Remove redundant call to io_pgtable_tlb_sync()
  iommu/io-pgtable: Rename iommu_gather_ops to iommu_flush_ops
  iommu: Introduce struct iommu_iotlb_gather for batching TLB flushes
  iommu: Introduce iommu_iotlb_gather_add_page()
  iommu: Pass struct iommu_iotlb_gather to ->unmap() and ->iotlb_sync()
  iommu/io-pgtable: Introduce tlb_flush_walk() and tlb_flush_leaf()
  iommu/io-pgtable: Hook up ->tlb_flush_walk() and ->tlb_flush_leaf() in
drivers
  iommu/io-pgtable-arm: Call ->tlb_flush_walk() and ->tlb_flush_leaf()
  iommu/io-pgtable: Replace ->tlb_add_flush() with ->tlb_add_page()
  iommu/io-pgtable: Remove unused ->tlb_sync() callback
  iommu/io-pgtable: Pass struct iommu_iotlb_gather to ->unmap()
  iommu/io-pgtable: Pass struct iommu_iotlb_gather to ->tlb_add_page()

 drivers/gpu/drm/panfrost/panfrost_mmu.c |  24 +---
 drivers/iommu/amd_iommu.c   |  11 ++--
 drivers/iommu/arm-smmu-v3.c |  52 +++-
 drivers/iommu/arm-smmu.c| 103 
 drivers/iommu/dma-iommu.c   |   9 ++-
 drivers/iommu/exynos-iommu.c|   3 +-
 drivers/iommu/intel-iommu.c |   3 +-
 drivers/iommu/io-pgtable-arm-v7s.c  |  57 +-
 drivers/iommu/io-pgtable-arm.c  |  48 ---
 drivers/iommu/iommu.c   |  24 
 drivers/iommu/ipmmu-vmsa.c  |  28 +
 drivers/iommu/msm_iommu.c   |  42 +
 drivers/iommu/mtk_iommu.c   |  45 +++---
 drivers/iommu/mtk_iommu_v1.c|   3 +-
 drivers/iommu/omap-iommu.c  |   2 +-
 drivers/iommu/qcom_iommu.c  |  44 +++---
 drivers/iommu/rockchip-iommu.c  |   2 +-
 drivers/iommu/s390-iommu.c  |   3 +-
 drivers/iommu/tegra-gart.c  |  12 +++-
 drivers/iommu/tegra-smmu.c  |   2 +-
 drivers/iommu/virtio-iommu.c|   5 +-
 drivers/vfio/vfio_iommu_type1.c |  27 +
 include/linux/io-pgtable.h  |  57 --
 include/linux/iommu.h   |  92 +---
 24 files changed, 483 insertions(+), 215 deletions(-)



