Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-08-08 Thread Leizhen (ThunderTown)


On 2017/8/9 11:24, Ganapatrao Kulkarni wrote:
> On Wed, Aug 9, 2017 at 7:12 AM, Leizhen (ThunderTown)
>  wrote:
>>
>>
>> On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
>>> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
>>>  wrote:


 On 2017/7/26 19:08, Joerg Roedel wrote:
> Hi Robin.
>
> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
>
> Do you have some numbers here? How big was the impact before these
> patches and how is it with the patches?
 Here are some numbers:

 (before)$ iperf -s
 
 Server listening on TCP port 5001
 TCP window size: 85.3 KByte (default)
 
 [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
 [ ID] Interval   Transfer Bandwidth
 [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
 [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
 [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
 [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
 [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec

 (after)$ iperf -s
 
 Server listening on TCP port 5001
 TCP window size: 85.3 KByte (default)
 
 [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
 [ ID] Interval   Transfer Bandwidth
 [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
 [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
 [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
 [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
 [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec

>>>
>>> Is this testing done on Host or on Guest/VM?
>> Host
> 
> As per your log, iperf throughput improved to 938 Mbits/sec from 6.43 Mbits/sec.
> IMO, this seems unrealistic; is something wrong with the testing?
For 64-bit non-PCI devices, the IOVA allocation always starts its search from the
last rb-tree node. When many IOVAs are allocated and held for a long time, the
search has to walk a large number of rb nodes before it finds suitable free
space. In my tracing, the average number of nodes checked exceeded 10K.

[free-space][free][used][...][used]
     ^               ^          ^
     |               |          |- rb_last
     |               |- possibly more than 10K allocated iova nodes
     |- for 32-bit devices, cached32_node remembers the most recently freed
        node, which helps reduce the number of checks

This patch series adds a new member, "cached_node", to serve 64-bit devices in
the same way that cached32_node serves 32-bit devices.
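
To make the idea concrete, here is a minimal user-space sketch of the caching
trick. It is not the real drivers/iommu/iova.c code: the rbtree is replaced by
a sorted doubly-linked list, and all names (struct iova_range, cached_node,
top_pfn, ...) are illustrative. The point is only the search-start hint, which
lets the next walk begin where the previous allocation or free happened instead
of at the very top every time.

/*
 * Minimal sketch of the cached-node idea, under the assumptions above.
 * Allocations are placed top-down below top_pfn.
 */
#include <stdio.h>
#include <stdlib.h>

struct iova_range {
        unsigned long lo, hi;           /* allocated pfn range [lo, hi) */
        struct iova_range *prev, *next; /* sorted by address, highest last */
};

struct iova_domain {
        struct iova_range *last;        /* highest allocated range, or NULL */
        struct iova_range *cached_node; /* search-start hint, or NULL */
        unsigned long top_pfn;          /* allocate strictly below this pfn */
};

/* Allocate 'size' pfns top-down; returns the new range or NULL. */
static struct iova_range *alloc_iova(struct iova_domain *d, unsigned long size)
{
        /* Without the hint, every search would start at the topmost node. */
        struct iova_range *above = d->cached_node;
        unsigned long limit = above ? above->lo : d->top_pfn;
        struct iova_range *below = above ? above->prev : d->last;
        struct iova_range *n;

        /* Walk downwards until the hole [below->hi, limit) is big enough. */
        while (below && limit - below->hi < size) {
                above = below;
                limit = below->lo;
                below = below->prev;
        }
        if (!below && limit < size)
                return NULL;            /* no room left at the bottom */

        n = malloc(sizeof(*n));
        if (!n)
                return NULL;
        n->hi = limit;
        n->lo = limit - size;

        /* Splice into the sorted list, directly below 'above'. */
        n->next = above;
        n->prev = below;
        if (above)
                above->prev = n;
        else
                d->last = n;
        if (below)
                below->next = n;

        d->cached_node = n;     /* the next walk starts here, not at the top */
        return n;
}

/* Free a range and point the hint just above the hole it leaves behind. */
static void free_iova(struct iova_domain *d, struct iova_range *n)
{
        /*
         * The real code is more careful about when the hint may move; this
         * sketch simply remembers the neighbour above the latest freed range.
         */
        d->cached_node = n->next;

        if (n->next)
                n->next->prev = n->prev;
        else
                d->last = n->prev;
        if (n->prev)
                n->prev->next = n->next;
        free(n);
}

int main(void)
{
        struct iova_domain d = { .last = NULL, .cached_node = NULL,
                                 .top_pfn = 1UL << 20 };
        struct iova_range *a = alloc_iova(&d, 16);
        struct iova_range *b = alloc_iova(&d, 16);

        printf("a = [0x%lx, 0x%lx)  b = [0x%lx, 0x%lx)\n",
               a->lo, a->hi, b->lo, b->hi);

        free_iova(&d, b);               /* hint now sits just above b's hole */
        struct iova_range *c = alloc_iova(&d, 8);
        printf("c = [0x%lx, 0x%lx)  (reuses b's hole without a full walk)\n",
               c->lo, c->hi);
        return 0;
}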

> 
>>
>>>
>
>
>   Joerg
>
>
> .
>

 --
 Thanks!
 Best Regards


 ___
 linux-arm-kernel mailing list
 linux-arm-ker...@lists.infradead.org
 http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>>
>>> thanks
>>> Ganapat
>>>
>>> .
>>>
>>
>> --
>> Thanks!
>> Best Regards
>>
> 
> thanks
> Ganapat
> 
> .
> 

-- 
Thanks!
Best Regards

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-08-08 Thread Ganapatrao Kulkarni
On Wed, Aug 9, 2017 at 7:12 AM, Leizhen (ThunderTown)
 wrote:
>
>
> On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
>> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
>>  wrote:
>>>
>>>
>>> On 2017/7/26 19:08, Joerg Roedel wrote:
 Hi Robin.

 On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
> Hi all,
>
> In the wake of the ARM SMMU optimisation efforts, it seems that certain
> workloads (e.g. storage I/O with large scatterlists) probably remain quite
> heavily influenced by IOVA allocation performance. Separately, Ard also
> reported massive performance drops for a graphical desktop on AMD Seattle
> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
> ops domain getting initialised differently for ACPI vs. DT, and exposing
> the overhead of the rbtree slow path. Whilst we could go around trying to
> close up all the little gaps that lead to hitting the slowest case, it
> seems a much better idea to simply make said slowest case a lot less slow.

 Do you have some numbers here? How big was the impact before these
 patches and how is it with the patches?
>>> Here are some numbers:
>>>
>>> (before)$ iperf -s
>>> 
>>> Server listening on TCP port 5001
>>> TCP window size: 85.3 KByte (default)
>>> 
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>>> [ ID] Interval   Transfer Bandwidth
>>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>>
>>> (after)$ iperf -s
>>> 
>>> Server listening on TCP port 5001
>>> TCP window size: 85.3 KByte (default)
>>> 
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>>> [ ID] Interval   Transfer Bandwidth
>>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>>
>>
>> Is this testing done on Host or on Guest/VM?
> Host

As per your log, iperf throughput improved to 938 Mbits/sec from 6.43 Mbits/sec.
IMO, this seems unrealistic; is something wrong with the testing?

>
>>


   Joerg


 .

>>>
>>> --
>>> Thanks!
>>> Best Regards
>>>
>>>
>>> ___
>>> linux-arm-kernel mailing list
>>> linux-arm-ker...@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>
>> thanks
>> Ganapat
>>
>> .
>>
>
> --
> Thanks!
> Best Regards
>

thanks
Ganapat
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-08-08 Thread Leizhen (ThunderTown)


On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
>  wrote:
>>
>>
>> On 2017/7/26 19:08, Joerg Roedel wrote:
>>> Hi Robin.
>>>
>>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
 Hi all,

 In the wake of the ARM SMMU optimisation efforts, it seems that certain
 workloads (e.g. storage I/O with large scatterlists) probably remain quite
 heavily influenced by IOVA allocation performance. Separately, Ard also
 reported massive performance drops for a graphical desktop on AMD Seattle
 when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
 ops domain getting initialised differently for ACPI vs. DT, and exposing
 the overhead of the rbtree slow path. Whilst we could go around trying to
 close up all the little gaps that lead to hitting the slowest case, it
 seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> Do you have some numbers here? How big was the impact before these
>>> patches and how is it with the patches?
>> Here are some numbers:
>>
>> (before)$ iperf -s
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>
>> (after)$ iperf -s
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>
> 
> Is this testing done on Host or on Guest/VM?
Host

> 
>>>
>>>
>>>   Joerg
>>>
>>>
>>> .
>>>
>>
>> --
>> Thanks!
>> Best Regards
>>
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> thanks
> Ganapat
> 
> .
> 

-- 
Thanks!
Best Regards

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-08-08 Thread Ganapatrao Kulkarni
On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
 wrote:
>
>
> On 2017/7/26 19:08, Joerg Roedel wrote:
>> Hi Robin.
>>
>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>> Hi all,
>>>
>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>> close up all the little gaps that lead to hitting the slowest case, it
>>> seems a much better idea to simply make said slowest case a lot less slow.
>>
>> Do you have some numbers here? How big was the impact before these
>> patches and how is it with the patches?
> Here are some numbers:
>
> (before)$ iperf -s
> 
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> 
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
> [ ID] Interval   Transfer Bandwidth
> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>
> (after)$ iperf -s
> 
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> 
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
> [ ID] Interval   Transfer Bandwidth
> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>

Is this testing done on Host or on Guest/VM?

>>
>>
>>   Joerg
>>
>>
>> .
>>
>
> --
> Thanks!
> Best Regards
>
>
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

thanks
Ganapat
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-07-26 Thread Leizhen (ThunderTown)


On 2017/7/26 19:08, Joerg Roedel wrote:
> Hi Robin.
> 
> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
> 
> Do you have some numbers here? How big was the impact before these
> patches and how is it with the patches?
Here are some numbers:

(before)$ iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
[  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
[  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
[  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec

(after)$ iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
[  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
[  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
[  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec

> 
> 
>   Joerg
> 
> 
> .
> 

-- 
Thanks!
Best Regards

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-07-26 Thread Joerg Roedel
Hi Robin.

On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
> Hi all,
> 
> In the wake of the ARM SMMU optimisation efforts, it seems that certain
> workloads (e.g. storage I/O with large scatterlists) probably remain quite
> heavily influenced by IOVA allocation performance. Separately, Ard also
> reported massive performance drops for a graphical desktop on AMD Seattle
> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
> ops domain getting initialised differently for ACPI vs. DT, and exposing
> the overhead of the rbtree slow path. Whilst we could go around trying to
> close up all the little gaps that lead to hitting the slowest case, it
> seems a much better idea to simply make said slowest case a lot less slow.

Do you have some numbers here? How big was the impact before these
patches and how is it with the patches?


Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-07-21 Thread Robin Murphy
Hi all,

In the wake of the ARM SMMU optimisation efforts, it seems that certain
workloads (e.g. storage I/O with large scatterlists) probably remain quite
heavily influenced by IOVA allocation performance. Separately, Ard also
reported massive performance drops for a graphical desktop on AMD Seattle
when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
ops domain getting initialised differently for ACPI vs. DT, and exposing
the overhead of the rbtree slow path. Whilst we could go around trying to
close up all the little gaps that lead to hitting the slowest case, it
seems a much better idea to simply make said slowest case a lot less slow.

I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
the changes rather too hard to follow, so I've taken the liberty here of
picking the whole thing up and reimplementing the main part in a rather
less invasive manner.

Robin.

Changes from v1:
 - Fix overflow with 32-bit dma_addr_t
 - Add tested-bys

[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html

Robin Murphy (1):
  iommu/iova: Extend rbtree node caching

Zhen Lei (3):
  iommu/iova: Optimise rbtree searching
  iommu/iova: Optimise the padding calculation
  iommu/iova: Make dma_32bit_pfn implicit

 drivers/gpu/drm/tegra/drm.c      |   3 +-
 drivers/gpu/host1x/dev.c         |   3 +-
 drivers/iommu/amd_iommu.c        |   7 +--
 drivers/iommu/dma-iommu.c        |  18 +--
 drivers/iommu/intel-iommu.c      |  11 ++--
 drivers/iommu/iova.c             | 112 ---
 drivers/misc/mic/scif/scif_rma.c |   3 +-
 include/linux/iova.h             |   8 +--
 8 files changed, 60 insertions(+), 105 deletions(-)
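
As a rough illustration of the idea behind the last patch above
("iommu/iova: Make dma_32bit_pfn implicit"): rather than every caller passing
a precomputed 32-bit boundary pfn, which is how the ACPI vs. DT initialisation
difference mentioned earlier crept in, the domain can derive it from its own
granule. This is a simplified, standalone C sketch, not the actual
init_iova_domain() signature or kernel code; the field and function names only
mirror the kernel's naming for readability.

#include <stdio.h>

struct iova_domain {
        unsigned long granule;       /* smallest IOVA page size, power of two */
        unsigned long start_pfn;     /* lowest allocatable pfn */
        unsigned long dma_32bit_pfn; /* first pfn at or above the 4 GiB mark */
};

static unsigned long iova_shift(const struct iova_domain *d)
{
        unsigned long shift = 0;

        while ((1UL << shift) < d->granule)
                shift++;
        return shift;                /* 12 for a 4 KiB granule */
}

static void init_iova_domain(struct iova_domain *d, unsigned long granule,
                             unsigned long start_pfn)
{
        d->granule = granule;
        d->start_pfn = start_pfn;
        /* Derived once here instead of being supplied by every caller. */
        d->dma_32bit_pfn = 1UL << (32 - iova_shift(d));
}

int main(void)
{
        struct iova_domain d;

        init_iova_domain(&d, 4096, 1);                        /* 4 KiB granule */
        printf("dma_32bit_pfn = 0x%lx\n", d.dma_32bit_pfn);   /* 0x100000 */
        return 0;
}

With the boundary derived internally, the DT and ACPI probe paths cannot end up
with different values, which is what exposed the rbtree slow-path overhead
described in the cover letter above.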

-- 
2.12.2.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu