Re: [PATCH 0/5] Optimize iommu_map_sg() performance

2021-01-11 Thread isaacm

On 2021-01-10 23:52, Sai Prakash Ranjan wrote:

On 2021-01-11 11:52, Sai Prakash Ranjan wrote:

Hi Isaac,

I gave this series a go on chromebook and saw these warnings
and several device probe failures, logs attached below:

WARN corresponds to this code in arm_lpae_map_by_pgsize()

	if (WARN_ON(iaext || (paddr + size) >> cfg->oas))
		return -ERANGE;

Logs:

[2.411391] ------------[ cut here ]------------
[2.416149] WARNING: CPU: 6 PID: 56 at drivers/iommu/io-pgtable-arm.c:492 arm_lpae_map_sg+0x234/0x248
[2.425606] Modules linked in:
[2.428749] CPU: 6 PID: 56 Comm: kworker/6:1 Not tainted 5.10.5 #970
[2.440287] Workqueue: events deferred_probe_work_func
[2.445563] pstate: 20c9 (nzCv daif +PAN +UAO -TCO BTYPE=--)
[2.451726] pc : arm_lpae_map_sg+0x234/0x248
[2.456112] lr : arm_lpae_map_sg+0xe0/0x248
[2.460410] sp : ffc010513750
[2.463820] x29: ffc010513790 x28: ffb943332000
[2.469281] x27: 000ff000 x26: ffb943d14900
[2.474738] x25: 1000 x24: 000103465000
[2.480196] x23: 0001 x22: 000103466000
[2.485645] x21: 0003 x20: 0a20
[2.491103] x19: ffc010513850 x18: 0001
[2.496562] x17: 0002 x16: 
[2.502021] x15:  x14: 
[2.507479] x13: 0001 x12: 
[2.512928] x11: 0010 x10: 
[2.518385] x9 : 0001 x8 : 40201000
[2.523844] x7 : 0a20 x6 : ffb943463000
[2.529302] x5 : 0003 x4 : 1000
[2.534760] x3 : 0001 x2 : ffb941f605a0
[2.540219] x1 : 0003 x0 : 0e40
[2.545679] Call trace:
[2.548196]  arm_lpae_map_sg+0x234/0x248
[2.552225]  arm_smmu_map_sg+0x80/0xc4
[2.556078]  __iommu_map_sg+0x6c/0x188
[2.559931]  iommu_map_sg_atomic+0x18/0x20
[2.564144]  iommu_dma_alloc_remap+0x26c/0x34c
[2.568703]  iommu_dma_alloc+0x9c/0x268
[2.572647]  dma_alloc_attrs+0x88/0xfc
[2.576503]  gsi_ring_alloc+0x50/0x144
[2.580356]  gsi_init+0x2c4/0x5c4
[2.583766]  ipa_probe+0x14c/0x2b4
[2.587263]  platform_drv_probe+0x94/0xb4
[2.591377]  really_probe+0x138/0x348
[2.595145]  driver_probe_device+0x80/0xb8
[2.599358]  __device_attach_driver+0x90/0xa8
[2.603829]  bus_for_each_drv+0x84/0xcc
[2.607772]  __device_attach+0xc0/0x148
[2.611713]  device_initial_probe+0x18/0x20
[2.616012]  bus_probe_device+0x38/0x94
[2.619953]  deferred_probe_work_func+0x78/0xb0
[2.624611]  process_one_work+0x210/0x3dc
[2.628726]  worker_thread+0x284/0x3e0
[2.632578]  kthread+0x148/0x1a8
[2.635891]  ret_from_fork+0x10/0x18
[2.639562] ---[ end trace 9bac18cad6a9862e ]---
[2.644414] ipa 1e4.ipa: error -12 allocating channel 0 event ring
[2.651656] ipa: probe of 1e4.ipa failed with error -12
[2.660072] dwc3 a60.dwc3: Adding to iommu group 8
[2.668632] xhci-hcd xhci-hcd.13.auto: xHCI Host Controller
[2.674680] xhci-hcd xhci-hcd.13.auto: new USB bus registered, assigned bus number 1



...

Isaac provided a fix which he will post as v2 and no warnings were
observed with that fix.

Tested-by: Sai Prakash Ranjan 

Thanks,
Sai


Thanks for testing out the patches. I've added the fix (there was an
off-by-one error in the calculation used to check if the IOVA/physical
addresses are within limits) to version 2 of the series:

https://lore.kernel.org/linux-iommu/1610376862-927-1-git-send-email-isa...@codeaurora.org/T/#t
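
For readers following along, here is a minimal sketch of how an off-by-one
can creep into a limit check of this shape. It is an illustration only and
not the v2 patch: the helper names are hypothetical, and it assumes
size > 0, oas < 64, and no wraparound in paddr + size.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: tests the address one past the end of the range.
 * A range whose last byte is the highest address representable in 'oas'
 * bits is flagged, even though every byte in it is addressable.
 */
static bool range_exceeds_oas_exclusive(uint64_t paddr, uint64_t size,
					unsigned int oas)
{
	return ((paddr + size) >> oas) != 0;
}

/*
 * Hypothetical helper: tests the last byte actually covered by the range,
 * so a range ending exactly at the limit is accepted.
 */
static bool range_exceeds_oas_inclusive(uint64_t paddr, uint64_t size,
					unsigned int oas)
{
	return ((paddr + size - 1) >> oas) != 0;
}
```

With oas = 32, a 4K range starting at 0xfffff000 ends on the last
addressable byte: the exclusive form flags it, the inclusive form does not.
Both forms flag a range that starts at 0x100000000.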

Thanks,
Isaac


Re: [PATCH 0/5] Optimize iommu_map_sg() performance

2021-01-10 Thread Sai Prakash Ranjan

On 2021-01-11 11:52, Sai Prakash Ranjan wrote:

Hi Isaac,

I gave this series a go on chromebook and saw these warnings
and several device probe failures, logs attached below:

WARN corresponds to this code in arm_lpae_map_by_pgsize()

	if (WARN_ON(iaext || (paddr + size) >> cfg->oas))
		return -ERANGE;

Logs:

[2.411391] ------------[ cut here ]------------
[2.416149] WARNING: CPU: 6 PID: 56 at drivers/iommu/io-pgtable-arm.c:492 arm_lpae_map_sg+0x234/0x248
[2.425606] Modules linked in:
[2.428749] CPU: 6 PID: 56 Comm: kworker/6:1 Not tainted 5.10.5 #970
[2.440287] Workqueue: events deferred_probe_work_func
[2.445563] pstate: 20c9 (nzCv daif +PAN +UAO -TCO BTYPE=--)
[2.451726] pc : arm_lpae_map_sg+0x234/0x248
[2.456112] lr : arm_lpae_map_sg+0xe0/0x248
[2.460410] sp : ffc010513750
[2.463820] x29: ffc010513790 x28: ffb943332000
[2.469281] x27: 000ff000 x26: ffb943d14900
[2.474738] x25: 1000 x24: 000103465000
[2.480196] x23: 0001 x22: 000103466000
[2.485645] x21: 0003 x20: 0a20
[2.491103] x19: ffc010513850 x18: 0001
[2.496562] x17: 0002 x16: 
[2.502021] x15:  x14: 
[2.507479] x13: 0001 x12: 
[2.512928] x11: 0010 x10: 
[2.518385] x9 : 0001 x8 : 40201000
[2.523844] x7 : 0a20 x6 : ffb943463000
[2.529302] x5 : 0003 x4 : 1000
[2.534760] x3 : 0001 x2 : ffb941f605a0
[2.540219] x1 : 0003 x0 : 0e40
[2.545679] Call trace:
[2.548196]  arm_lpae_map_sg+0x234/0x248
[2.552225]  arm_smmu_map_sg+0x80/0xc4
[2.556078]  __iommu_map_sg+0x6c/0x188
[2.559931]  iommu_map_sg_atomic+0x18/0x20
[2.564144]  iommu_dma_alloc_remap+0x26c/0x34c
[2.568703]  iommu_dma_alloc+0x9c/0x268
[2.572647]  dma_alloc_attrs+0x88/0xfc
[2.576503]  gsi_ring_alloc+0x50/0x144
[2.580356]  gsi_init+0x2c4/0x5c4
[2.583766]  ipa_probe+0x14c/0x2b4
[2.587263]  platform_drv_probe+0x94/0xb4
[2.591377]  really_probe+0x138/0x348
[2.595145]  driver_probe_device+0x80/0xb8
[2.599358]  __device_attach_driver+0x90/0xa8
[2.603829]  bus_for_each_drv+0x84/0xcc
[2.607772]  __device_attach+0xc0/0x148
[2.611713]  device_initial_probe+0x18/0x20
[2.616012]  bus_probe_device+0x38/0x94
[2.619953]  deferred_probe_work_func+0x78/0xb0
[2.624611]  process_one_work+0x210/0x3dc
[2.628726]  worker_thread+0x284/0x3e0
[2.632578]  kthread+0x148/0x1a8
[2.635891]  ret_from_fork+0x10/0x18
[2.639562] ---[ end trace 9bac18cad6a9862e ]---
[2.644414] ipa 1e4.ipa: error -12 allocating channel 0 event ring
[2.651656] ipa: probe of 1e4.ipa failed with error -12
[2.660072] dwc3 a60.dwc3: Adding to iommu group 8
[2.668632] xhci-hcd xhci-hcd.13.auto: xHCI Host Controller
[2.674680] xhci-hcd xhci-hcd.13.auto: new USB bus registered, assigned bus number 1



...

Isaac provided a fix which he will post as v2 and no warnings were
observed with that fix.

Tested-by: Sai Prakash Ranjan 

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 0/5] Optimize iommu_map_sg() performance

2021-01-10 Thread Sai Prakash Ranjan
Hi Isaac,

On 2021-01-09 07:20, Isaac J. Manjarres wrote:
> The iommu_map_sg() code currently iterates through the given
> scatter-gather list, and in the worst case, invokes iommu_map()
> for each element in the scatter-gather list, which calls into
> the IOMMU driver through an indirect call. For an IOMMU driver
> that uses a format supported by the io-pgtable code, the IOMMU
> driver will then call into the io-pgtable code to map the chunk.
> 
> Jumping between the IOMMU core code, the IOMMU driver, and the
> io-pgtable code and back for each element in a scatter-gather list
> is not efficient.
> 
> Instead, add a map_sg() hook in both the IOMMU driver ops and the
> io-pgtable ops. iommu_map_sg() can then call into the IOMMU driver's
> map_sg() hook with the entire scatter-gather list, which can call
> into the io-pgtable map_sg() hook, which can process the entire
> scatter-gather list, significantly reducing the number of indirect
> calls and jumps between these layers, boosting performance.
> 
> On a system that uses the ARM SMMU driver, and the ARM LPAE format,
> the current implementation of iommu_map_sg() yields the following
> latencies for mapping scatter-gather lists of various sizes. These
> latencies are calculated by repeating the mapping operation 10 times:
> 
> size    iommu_map_sg latency
>   4K        0.624 us
>  64K        9.468 us
>   1M      122.557 us
>   2M      239.807 us
>  12M     1435.979 us
>  24M     2884.968 us
>  32M     3832.979 us
> 
> On the same system, the proposed modifications yield the following
> results:
> 
> size    iommu_map_sg latency
>   4K        3.645 us
>  64K        4.198 us
>   1M       11.010 us
>   2M       17.125 us
>  12M       82.416 us
>  24M      158.677 us
>  32M      210.468 us
> 
> The procedure for collecting the iommu_map_sg latencies is
> the same in both experiments. Clearly, reducing the jumps
> between the different layers in the IOMMU code offers a
> significant performance boost in iommu_map_sg() latency.
> 

I gave this series a go on chromebook and saw these warnings
and several device probe failures, logs attached below:

WARN corresponds to this code in arm_lpae_map_by_pgsize()

	if (WARN_ON(iaext || (paddr + size) >> cfg->oas))
		return -ERANGE;

Logs:

[2.411391] ------------[ cut here ]------------
[2.416149] WARNING: CPU: 6 PID: 56 at drivers/iommu/io-pgtable-arm.c:492 arm_lpae_map_sg+0x234/0x248
[2.425606] Modules linked in:
[2.428749] CPU: 6 PID: 56 Comm: kworker/6:1 Not tainted 5.10.5 #970
[2.440287] Workqueue: events deferred_probe_work_func
[2.445563] pstate: 20c9 (nzCv daif +PAN +UAO -TCO BTYPE=--)
[2.451726] pc : arm_lpae_map_sg+0x234/0x248
[2.456112] lr : arm_lpae_map_sg+0xe0/0x248
[2.460410] sp : ffc010513750
[2.463820] x29: ffc010513790 x28: ffb943332000 
[2.469281] x27: 000ff000 x26: ffb943d14900 
[2.474738] x25: 1000 x24: 000103465000 
[2.480196] x23: 0001 x22: 000103466000 
[2.485645] x21: 0003 x20: 0a20 
[2.491103] x19: ffc010513850 x18: 0001 
[2.496562] x17: 0002 x16:  
[2.502021] x15:  x14:  
[2.507479] x13: 0001 x12:  
[2.512928] x11: 0010 x10:  
[2.518385] x9 : 0001 x8 : 40201000 
[2.523844] x7 : 0a20 x6 : ffb943463000 
[2.529302] x5 : 0003 x4 : 1000 
[2.534760] x3 : 0001 x2 : ffb941f605a0 
[2.540219] x1 : 0003 x0 : 0e40 
[2.545679] Call trace:
[2.548196]  arm_lpae_map_sg+0x234/0x248
[2.552225]  arm_smmu_map_sg+0x80/0xc4
[2.556078]  __iommu_map_sg+0x6c/0x188
[2.559931]  iommu_map_sg_atomic+0x18/0x20
[2.564144]  iommu_dma_alloc_remap+0x26c/0x34c
[2.568703]  iommu_dma_alloc+0x9c/0x268
[2.572647]  dma_alloc_attrs+0x88/0xfc
[2.576503]  gsi_ring_alloc+0x50/0x144
[2.580356]  gsi_init+0x2c4/0x5c4
[2.583766]  ipa_probe+0x14c/0x2b4
[2.587263]  platform_drv_probe+0x94/0xb4
[2.591377]  really_probe+0x138/0x348
[2.595145]  driver_probe_device+0x80/0xb8
[2.599358]  __device_attach_driver+0x90/0xa8
[2.603829]  bus_for_each_drv+0x84/0xcc
[2.607772]  __device_attach+0xc0/0x148
[2.611713]  device_initial_probe+0x18/0x20
[2.616012]  bus_probe_device+0x38/0x94
[2.619953]  deferred_probe_work_func+0x78/0xb0
[2.624611]  process_one_work+0x210/0x3dc
[2.628726]  worker_thread+0x284/0x3e0
[2.632578]  kthread+0x148/0x1a8
[2.635891]  ret_from_fork+0x10/0x18
[2.639562] ---[ end trace 9bac18cad6a9862e ]---
[2.644414] ipa 1e4.ipa: error -12 allocating channel 0 event ring
[

[PATCH 0/5] Optimize iommu_map_sg() performance

2021-01-08 Thread Isaac J. Manjarres
The iommu_map_sg() code currently iterates through the given
scatter-gather list, and in the worst case, invokes iommu_map()
for each element in the scatter-gather list, which calls into
the IOMMU driver through an indirect call. For an IOMMU driver
that uses a format supported by the io-pgtable code, the IOMMU
driver will then call into the io-pgtable code to map the chunk.

Jumping between the IOMMU core code, the IOMMU driver, and the
io-pgtable code and back for each element in a scatter-gather list
is not efficient.

Instead, add a map_sg() hook in both the IOMMU driver ops and the
io-pgtable ops. iommu_map_sg() can then call into the IOMMU driver's
map_sg() hook with the entire scatter-gather list, which can call
into the io-pgtable map_sg() hook, which can process the entire
scatter-gather list, significantly reducing the number of indirect
calls and jumps between these layers, boosting performance.
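
The call-count argument above can be sketched with a toy model. All names
here (toy_sg, toy_ops, toy_map_sg_dispatch) are hypothetical; they are not
the kernel's structures or the interface added by this series. The
per-element loop models the worst case, where no elements can be merged.

```c
#include <stddef.h>

/* Toy scatter-gather element: a physical address and a length. */
struct toy_sg {
	unsigned long phys;
	size_t len;
};

/* Toy driver ops: map() is mandatory, map_sg() is an optional batch hook. */
struct toy_ops {
	int (*map)(unsigned long iova, unsigned long phys, size_t len);
	int (*map_sg)(unsigned long iova, const struct toy_sg *sg,
		      size_t nents, size_t *mapped);
};

/* Counts how many times the dispatcher goes through a function pointer. */
static unsigned long toy_indirect_calls;

/* Stand-in for a driver's single-chunk map; does no real work. */
static int toy_map(unsigned long iova, unsigned long phys, size_t len)
{
	(void)iova; (void)phys; (void)len;
	return 0;
}

/* Stand-in for a driver's batch map: walks the whole list in one call. */
static int toy_map_sg(unsigned long iova, const struct toy_sg *sg,
		      size_t nents, size_t *mapped)
{
	size_t i, total = 0;

	(void)iova;
	for (i = 0; i < nents; i++)
		total += sg[i].len;
	*mapped = total;
	return 0;
}

/* Prefer the batch hook when present, else fall back to one call per element. */
static size_t toy_map_sg_dispatch(const struct toy_ops *ops, unsigned long iova,
				  const struct toy_sg *sg, size_t nents)
{
	size_t i, mapped = 0;

	if (ops->map_sg) {
		toy_indirect_calls++;
		return ops->map_sg(iova, sg, nents, &mapped) ? 0 : mapped;
	}

	for (i = 0; i < nents; i++) {
		toy_indirect_calls++;
		if (ops->map(iova + mapped, sg[i].phys, sg[i].len))
			return 0;
		mapped += sg[i].len;
	}
	return mapped;
}
```

In this model a list of N elements costs N indirect calls through map()
but a single indirect call through map_sg(), which is the effect the
series is after.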

On a system that uses the ARM SMMU driver, and the ARM LPAE format,
the current implementation of iommu_map_sg() yields the following
latencies for mapping scatter-gather lists of various sizes. These
latencies are calculated by repeating the mapping operation 10 times:

size    iommu_map_sg latency
  4K        0.624 us
 64K        9.468 us
  1M      122.557 us
  2M      239.807 us
 12M     1435.979 us
 24M     2884.968 us
 32M     3832.979 us

On the same system, the proposed modifications yield the following
results:

size    iommu_map_sg latency
  4K        3.645 us
 64K        4.198 us
  1M       11.010 us
  2M       17.125 us
 12M       82.416 us
 24M      158.677 us
 32M      210.468 us

The procedure for collecting the iommu_map_sg latencies is
the same in both experiments. Clearly, reducing the jumps
between the different layers in the IOMMU code offers a
significant performance boost in iommu_map_sg() latency.
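
To put the two tables side by side, the snippet below divides the "before"
latency by the "after" latency for each size. The figures are copied from
the tables above; the struct and function names are illustrative only.

```c
#include <stddef.h>

/* One row per scatter-gather list size, latencies in microseconds. */
struct latency_row {
	const char *size;
	double before_us;	/* current iommu_map_sg() */
	double after_us;	/* with the proposed map_sg() hooks */
};

static const struct latency_row latency_rows[] = {
	{ "4K",     0.624,   3.645 },
	{ "64K",    9.468,   4.198 },
	{ "1M",   122.557,  11.010 },
	{ "2M",   239.807,  17.125 },
	{ "12M", 1435.979,  82.416 },
	{ "24M", 2884.968, 158.677 },
	{ "32M", 3832.979, 210.468 },
};

/* Ratio above 1.0 means the proposed path is faster for that size. */
static double speedup(const struct latency_row *row)
{
	return row->before_us / row->after_us;
}
```

By these figures the ratio generally grows with list size, from roughly 2x
at 64K to roughly 18x at 32M, while the single 4K case is slower with the
new path (0.624 us before, 3.645 us after).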

Thanks,
Isaac

Isaac J. Manjarres (5):
  iommu/io-pgtable: Introduce map_sg() as a page table op
  iommu/io-pgtable-arm: Hook up map_sg()
  iommu/io-pgtable-arm-v7s: Hook up map_sg()
  iommu: Introduce map_sg() as an IOMMU op for IOMMU drivers
  iommu/arm-smmu: Hook up map_sg()

 drivers/iommu/arm/arm-smmu/arm-smmu.c | 19
 drivers/iommu/io-pgtable-arm-v7s.c    | 90
 drivers/iommu/io-pgtable-arm.c        | 86
 drivers/iommu/iommu.c                 | 25
 include/linux/io-pgtable.h            |  6
 include/linux/iommu.h                 | 13
 6 files changed, 234 insertions(+), 5 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project