Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-17 Thread Sai Prakash Ranjan
On 2021-06-15 17:21, Sai Prakash Ranjan wrote:
> Hi Krishna,
> 
> On 2021-06-14 23:18, Krishna Reddy wrote:
>>> Right but we won't know until we profile the specific usecases or try
>>> them in a generic workload to see if they affect the performance. Sure,
>>> over-invalidation is a concern where multiple buffers can be mapped to
>>> the same context and the cache is not usable at the time for lookup and
>>> such, but we don't do it for small buffers, only for large buffers which
>>> means thousands of TLB entry mappings, in which case TLBIASID is
>>> preferred (note: I mentioned the HW team recommendation to use it for
>>> anything greater than 128 TLB entries in my earlier reply). And also
>>> note that we do this only for the partial walk flush; we are not
>>> arbitrarily changing all the TLBIs to ASID based.
>>
>> Most of the heavy BW use cases do involve processing larger buffers.
>> When the physical memory is allocated discontiguously at page_size
>> (let's use 4KB here) granularity, each aligned 2MB chunk's IOVA unmap
>> would involve performing a TLBIASID, as 2MB is not a leaf. Essentially,
>> it happens all the time during large buffer unmaps and potentially
>> impacts active traffic on other large buffers. Depending on how much
>> latency the HW engines can absorb, the overflow/underflow issues for
>> ISO engines can be sporadic and vendor specific.
>> Performing TLBIASID as the default for all SoCs is not a safe operation.
>>
> 
> OK, so what I gather from this is that it's not easy to test for the
> negative impact, you don't have data on it yet, and the behaviour is
> very vendor specific. To add on the QCOM implementation, we have several
> performance improvements for TLB cache invalidations in HW, like
> wait-for-safe (for realtime clients such as camera and display) and a
> few others that allow cache lookups/updates while a TLBI is in progress
> for the same context bank, so at least we are good here.
> 
>>
>>> I am no camera expert but from what the camera team mentioned, there is
>>> a thread which frees memory (large unused memory buffers) periodically,
>>> which ends up taking around 100+ms and causing some camera test failures
>>> with frame drops. Parallel efforts are already being made to optimize
>>> this usage of the thread but, as I mentioned previously, this is *not
>>> camera specific*: let's say someone else invokes such large unmaps, they
>>> are going to face the same issue.
>>
>> From the above, it doesn't look like the root cause of the frame drops
>> is fully understood. Why is a 100+ms delay causing camera frame drops?
>> Is the same thread submitting the buffers to the camera after the unmap
>> is complete? If not, how is the unmap latency causing an issue here?
>>
> 
> OK, since you are interested in the camera usecase, I have requested
> more details from the camera team and will share them once they come
> back. However, I don't think it's good to have this unmap latency at
> all, and that is being addressed by this patch.
> 

As promised, here are some more details shared by the camera team:

Mapping of a framework buffer happens at the time of a process request,
and unmapping happens once the buffer is available from hardware and the
result is notified to the camera framework.
 * When there is a delay in unmapping a buffer, the result notification
   to the framework is delayed, and based on the pipeline delay depth,
   new requests from the framework are delayed.
 * The camera stack uses internal buffer managers for internal and
   framework buffers. These managers are accessed while mapping and
   unmapping under a common lock, so the calls are blocking. An unmap
   delay therefore delays the mapping of a new request and leads to
   frame drops.

Map and unmap happen in the camera service process context. There is no
separate perf path to perform unmapping.

In the camera stack, in addition to the map/unmap delay, there are
delays due to HW, so the HW should receive requests from SW in time to
avoid frame drops.

Thanks,
Sai
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-17 Thread Sai Prakash Ranjan

Hi Krishna,

On 2021-06-18 02:48, Krishna Reddy wrote:
Instead of flush_ops in the init_context hook, perhaps an io-pgtable
quirk, since this is related to TLB. Probably a bad name, but
IO_PGTABLE_QUIRK_TLB_INV, which will be set in the init_context impl
hook, and the previous condition in io_pgtable_tlb_flush_walk() becomes
something like below. Seems very minimal and neat instead of poking
into the tlb_flush_walk functions or touching dma strict with some flag?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT ||
    iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) {
	iop->cfg.tlb->tlb_flush_all(iop->cookie);
	return;
}


Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or
IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID?


tlb_flush_all() callback implementations can use TLBIALL or TLBIASID,
so having ASID in the quirk name doesn't sound right, given this quirk
should be generic enough to be usable on other implementations as well.
Instead I will go with IO_PGTABLE_QUIRK_TLB_INV_ALL and will be happy to
change it if others have some other preference.
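
For illustration, a minimal sketch of how this could look, assuming a
hypothetical IO_PGTABLE_QUIRK_TLB_INV_ALL bit (the name, the bit
position and the helper shown here are illustrative assumptions, not
upstream code):

/* include/linux/io-pgtable.h: hypothetical new quirk bit */
#define IO_PGTABLE_QUIRK_TLB_INV_ALL	BIT(7)

/* The existing io_pgtable_tlb_flush_walk() helper with the proposed
 * check folded in: one flush_all instead of many per-VA TLBIs. */
static inline void
io_pgtable_tlb_flush_walk(struct io_pgtable *iop, unsigned long iova,
			  size_t size, size_t granule)
{
	if (iop->cfg.quirks & (IO_PGTABLE_QUIRK_NON_STRICT |
			       IO_PGTABLE_QUIRK_TLB_INV_ALL)) {
		iop->cfg.tlb->tlb_flush_all(iop->cookie);
		return;
	}
	iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
}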

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-17 Thread Krishna Reddy
> Instead of flush_ops in the init_context hook, perhaps an io-pgtable
> quirk, since this is related to TLB. Probably a bad name, but
> IO_PGTABLE_QUIRK_TLB_INV, which will be set in the init_context impl
> hook, and the previous condition in io_pgtable_tlb_flush_walk() becomes
> something like below. Seems very minimal and neat instead of poking
> into the tlb_flush_walk functions or touching dma strict with some flag?
>
> if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT ||
>     iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) {
> 	iop->cfg.tlb->tlb_flush_all(iop->cookie);
> 	return;
> }

Can you name it as IO_PGTABLE_QUIRK_TLB_INV_ASID or
IO_PGTABLE_QUIRK_TLB_INV_ALL_ASID?

-KR


Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-16 Thread Sai Prakash Ranjan

On 2021-06-16 12:28, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-15 19:23, Robin Murphy wrote:

On 2021-06-15 12:51, Sai Prakash Ranjan wrote:

...

Hi @Robin, from these discussions it seems they are not OK with the
change for all SoC vendor implementations and do not have any data on
such impact. As I mentioned above, on QCOM platforms we do have several
optimizations in HW for TLBIs and would like to make use of them and
reduce the unmap latency.

What do you think, should this be made implementation specific?


Yes, it sounds like there's enough uncertainty for now that this needs
to be an opt-in feature. However, I still think that non-strict mode
could use it generically, since that's all about over-invalidating to
save time on individual unmaps - and relatively non-deterministic -
already.

So maybe we have a second set of iommu_flush_ops, or just a flag
somewhere to control the tlb_flush_walk functions internally, and the
choice can be made in the iommu_get_dma_strict() test, but also forced
on all the time by your init_context hook. What do you reckon?


Sounds good to me. Since you mentioned non-strict mode using it
generically, can't we just set tlb_flush_all() in
io_pgtable_tlb_flush_walk() like below based on the quirk, so that we
don't need to add any check in iommu_get_dma_strict(), and just force
the new flush_ops in the init_context hook?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
	iop->cfg.tlb->tlb_flush_all(iop->cookie);
	return;
}


Instead of flush_ops in the init_context hook, perhaps an io-pgtable
quirk, since this is related to TLB. Probably a bad name, but
IO_PGTABLE_QUIRK_TLB_INV, which will be set in the init_context impl
hook, and the previous condition in io_pgtable_tlb_flush_walk() becomes
something like below. Seems very minimal and neat instead of poking
into the tlb_flush_walk functions or touching dma strict with some flag?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT ||
    iop->cfg.quirks & IO_PGTABLE_QUIRK_TLB_INV) {
	iop->cfg.tlb->tlb_flush_all(iop->cookie);
	return;
}

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-16 Thread Sai Prakash Ranjan

Hi Robin,

On 2021-06-15 19:23, Robin Murphy wrote:

On 2021-06-15 12:51, Sai Prakash Ranjan wrote:

...

Hi @Robin, from these discussions it seems they are not OK with the
change for all SoC vendor implementations and do not have any data on
such impact. As I mentioned above, on QCOM platforms we do have several
optimizations in HW for TLBIs and would like to make use of them and
reduce the unmap latency.

What do you think, should this be made implementation specific?


Yes, it sounds like there's enough uncertainty for now that this needs
to be an opt-in feature. However, I still think that non-strict mode
could use it generically, since that's all about over-invalidating to
save time on individual unmaps - and relatively non-deterministic -
already.

So maybe we have a second set of iommu_flush_ops, or just a flag
somewhere to control the tlb_flush_walk functions internally, and the
choice can be made in the iommu_get_dma_strict() test, but also forced
on all the time by your init_context hook. What do you reckon?


Sounds good to me. Since you mentioned non-strict mode using it
generically, can't we just set tlb_flush_all() in
io_pgtable_tlb_flush_walk() like below based on the quirk, so that we
don't need to add any check in iommu_get_dma_strict(), and just force
the new flush_ops in the init_context hook?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
	iop->cfg.tlb->tlb_flush_all(iop->cookie);
	return;
}

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-15 Thread Robin Murphy

On 2021-06-15 12:51, Sai Prakash Ranjan wrote:

Hi Krishna,

On 2021-06-14 23:18, Krishna Reddy wrote:

Right but we won't know until we profile the specific usecases or try
them in a generic workload to see if they affect the performance. Sure,
over-invalidation is a concern where multiple buffers can be mapped to
the same context and the cache is not usable at the time for lookup and
such, but we don't do it for small buffers, only for large buffers which
means thousands of TLB entry mappings, in which case TLBIASID is
preferred (note: I mentioned the HW team recommendation to use it for
anything greater than 128 TLB entries in my earlier reply). And also
note that we do this only for the partial walk flush; we are not
arbitrarily changing all the TLBIs to ASID based.


Most of the heavy BW use cases do involve processing larger buffers.
When the physical memory is allocated discontiguously at page_size
(let's use 4KB here) granularity, each aligned 2MB chunk's IOVA unmap
would involve performing a TLBIASID, as 2MB is not a leaf. Essentially,
it happens all the time during large buffer unmaps and potentially
impacts active traffic on other large buffers. Depending on how much
latency the HW engines can absorb, the overflow/underflow issues for
ISO engines can be sporadic and vendor specific.
Performing TLBIASID as the default for all SoCs is not a safe operation.


OK, so what I gather from this is that it's not easy to test for the
negative impact, you don't have data on it yet, and the behaviour is
very vendor specific. To add on the QCOM implementation, we have several
performance improvements for TLB cache invalidations in HW, like
wait-for-safe (for realtime clients such as camera and display) and a
few others that allow cache lookups/updates while a TLBI is in progress
for the same context bank, so at least we are good here.


I am no camera expert but from what the camera team mentioned, there is
a thread which frees memory (large unused memory buffers) periodically,
which ends up taking around 100+ms and causing some camera test failures
with frame drops. Parallel efforts are already being made to optimize
this usage of the thread but, as I mentioned previously, this is *not
camera specific*: let's say someone else invokes such large unmaps, they
are going to face the same issue.


From the above, it doesn't look like the root cause of the frame drops
is fully understood. Why is a 100+ms delay causing camera frame drops?
Is the same thread submitting the buffers to the camera after the unmap
is complete? If not, how is the unmap latency causing an issue here?


OK, since you are interested in the camera usecase, I have requested
more details from the camera team and will share them once they come
back. However, I don't think it's good to have this unmap latency at
all, and that is being addressed by this patch.


> If unmap is queued and performed on a background thread, would it
> resolve the frame drops?

Not sure I understand what you mean by queuing on a background thread,
but with that or not, we still do the same number of TLBIs and hop
through iommu->io-pgtable->arm-smmu to perform the unmap, so how will
that help?


I mean adding the unmap requests into a queue and processing them from
a different thread. It is not to reduce the TLBIs, but to avoid blocking
subsequent buffer allocation and IOVA map requests, if they are being
requested from the same thread that is performing the unmap. If unmap is
already performed from a different thread, then the issue still needs to
be root-caused to understand it fully; check for any serialization
issues.


This patch is to optimize unmap latency because of the large number of
MMIO writes (TLBIVAs) wasting CPU cycles, and not to fix the camera
issue, which can probably be solved by parallelization. It seems to me
like you are OK with the unmap latency in general, which we are not, and
we want to avoid that latency.

Hi @Robin, from these discussions it seems they are not OK with the
change for all SoC vendor implementations and do not have any data on
such impact. As I mentioned above, on QCOM platforms we do have several
optimizations in HW for TLBIs and would like to make use of them and
reduce the unmap latency.

What do you think, should this be made implementation specific?


Yes, it sounds like there's enough uncertainty for now that this needs
to be an opt-in feature. However, I still think that non-strict mode
could use it generically, since that's all about over-invalidating to
save time on individual unmaps - and relatively non-deterministic -
already.

So maybe we have a second set of iommu_flush_ops, or just a flag
somewhere to control the tlb_flush_walk functions internally, and the
choice can be made in the iommu_get_dma_strict() test, but also forced
on all the time by your init_context hook. What do you reckon?
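
To make the second-ops-set idea concrete, a rough sketch of what it
could look like in arm-smmu.c, reusing the existing stage 1 context
invalidation callback; the _asid names are assumptions for
illustration, not code from this patch:

/* Alternative S1 walk flush that over-invalidates by ASID instead of
 * issuing per-VA invalidations (a sketch built on the existing
 * arm_smmu_tlb_inv_context_s1/arm_smmu_tlb_add_page_s1 callbacks). */
static void arm_smmu_tlb_inv_walk_s1_asid(unsigned long iova, size_t size,
					  size_t granule, void *cookie)
{
	arm_smmu_tlb_inv_context_s1(cookie);	/* TLBIASID + sync */
}

static const struct iommu_flush_ops arm_smmu_s1_tlb_ops_asid = {
	.tlb_flush_all	= arm_smmu_tlb_inv_context_s1,
	.tlb_flush_walk	= arm_smmu_tlb_inv_walk_s1_asid,
	.tlb_add_page	= arm_smmu_tlb_add_page_s1,
};

An init_context implementation could then point the domain's flush_ops
at the _asid variant unconditionally, while the iommu_get_dma_strict()
test could select it only for non-strict domains.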


Robin.

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-15 Thread Sai Prakash Ranjan

Hi Krishna,

On 2021-06-14 23:18, Krishna Reddy wrote:
Right but we won't know until we profile the specific usecases or try
them in a generic workload to see if they affect the performance. Sure,
over-invalidation is a concern where multiple buffers can be mapped to
the same context and the cache is not usable at the time for lookup and
such, but we don't do it for small buffers, only for large buffers which
means thousands of TLB entry mappings, in which case TLBIASID is
preferred (note: I mentioned the HW team recommendation to use it for
anything greater than 128 TLB entries in my earlier reply). And also
note that we do this only for the partial walk flush; we are not
arbitrarily changing all the TLBIs to ASID based.


Most of the heavy BW use cases do involve processing larger buffers.
When the physical memory is allocated discontiguously at page_size
(let's use 4KB here) granularity, each aligned 2MB chunk's IOVA unmap
would involve performing a TLBIASID, as 2MB is not a leaf. Essentially,
it happens all the time during large buffer unmaps and potentially
impacts active traffic on other large buffers. Depending on how much
latency the HW engines can absorb, the overflow/underflow issues for
ISO engines can be sporadic and vendor specific.
Performing TLBIASID as the default for all SoCs is not a safe operation.


OK, so what I gather from this is that it's not easy to test for the
negative impact, you don't have data on it yet, and the behaviour is
very vendor specific. To add on the QCOM implementation, we have several
performance improvements for TLB cache invalidations in HW, like
wait-for-safe (for realtime clients such as camera and display) and a
few others that allow cache lookups/updates while a TLBI is in progress
for the same context bank, so at least we are good here.


I am no camera expert but from what the camera team mentioned, there is
a thread which frees memory (large unused memory buffers) periodically,
which ends up taking around 100+ms and causing some camera test failures
with frame drops. Parallel efforts are already being made to optimize
this usage of the thread but, as I mentioned previously, this is *not
camera specific*: let's say someone else invokes such large unmaps, they
are going to face the same issue.


From the above, it doesn't look like the root cause of the frame drops
is fully understood. Why is a 100+ms delay causing camera frame drops?
Is the same thread submitting the buffers to the camera after the unmap
is complete? If not, how is the unmap latency causing an issue here?


OK, since you are interested in the camera usecase, I have requested
more details from the camera team and will share them once they come
back. However, I don't think it's good to have this unmap latency at
all, and that is being addressed by this patch.


> If unmap is queued and performed on a background thread, would it
> resolve the frame drops?

Not sure I understand what you mean by queuing on a background thread,
but with that or not, we still do the same number of TLBIs and hop
through iommu->io-pgtable->arm-smmu to perform the unmap, so how will
that help?


I mean adding the unmap requests into a queue and processing them from
a different thread. It is not to reduce the TLBIs, but to avoid blocking
subsequent buffer allocation and IOVA map requests, if they are being
requested from the same thread that is performing the unmap. If unmap is
already performed from a different thread, then the issue still needs to
be root-caused to understand it fully; check for any serialization
issues.


This patch is to optimize unmap latency because of the large number of
MMIO writes (TLBIVAs) wasting CPU cycles, and not to fix the camera
issue, which can probably be solved by parallelization. It seems to me
like you are OK with the unmap latency in general, which we are not, and
we want to avoid that latency.

Hi @Robin, from these discussions it seems they are not OK with the
change for all SoC vendor implementations and do not have any data on
such impact. As I mentioned above, on QCOM platforms we do have several
optimizations in HW for TLBIs and would like to make use of them and
reduce the unmap latency.

What do you think, should this be made implementation specific?

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-14 Thread Krishna Reddy
> Right but we won't know until we profile the specific usecases or try
> them in a generic workload to see if they affect the performance. Sure,
> over-invalidation is a concern where multiple buffers can be mapped to
> the same context and the cache is not usable at the time for lookup and
> such, but we don't do it for small buffers, only for large buffers which
> means thousands of TLB entry mappings, in which case TLBIASID is
> preferred (note: I mentioned the HW team recommendation to use it for
> anything greater than 128 TLB entries in my earlier reply). And also
> note that we do this only for the partial walk flush; we are not
> arbitrarily changing all the TLBIs to ASID based.

Most of the heavy BW use cases do involve processing larger buffers.
When the physical memory is allocated discontiguously at page_size
(let's use 4KB here) granularity, each aligned 2MB chunk's IOVA unmap
would involve performing a TLBIASID, as 2MB is not a leaf. Essentially,
it happens all the time during large buffer unmaps and potentially
impacts active traffic on other large buffers. Depending on how much
latency the HW engines can absorb, the overflow/underflow issues for
ISO engines can be sporadic and vendor specific.
Performing TLBIASID as the default for all SoCs is not a safe operation.


> I am no camera expert but from what the camera team mentioned, there is
> a thread which frees memory (large unused memory buffers) periodically,
> which ends up taking around 100+ms and causing some camera test failures
> with frame drops. Parallel efforts are already being made to optimize
> this usage of the thread but, as I mentioned previously, this is *not
> camera specific*: let's say someone else invokes such large unmaps, they
> are going to face the same issue.

From the above, it doesn't look like the root cause of the frame drops
is fully understood. Why is a 100+ms delay causing camera frame drops?
Is the same thread submitting the buffers to the camera after the unmap
is complete? If not, how is the unmap latency causing an issue here?


> > If unmap is queued and performed on a background thread, would it
> > resolve the frame drops?
>
> Not sure I understand what you mean by queuing on a background thread,
> but with that or not, we still do the same number of TLBIs and hop
> through iommu->io-pgtable->arm-smmu to perform the unmap, so how will
> that help?

I mean adding the unmap requests into a queue and processing them from
a different thread. It is not to reduce the TLBIs, but to avoid blocking
subsequent buffer allocation and IOVA map requests, if they are being
requested from the same thread that is performing the unmap. If unmap is
already performed from a different thread, then the issue still needs to
be root-caused to understand it fully; check for any serialization
issues.
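
For what such queuing could look like, a minimal sketch of deferring
iommu_unmap() to a workqueue so the requesting thread isn't blocked;
all names here are illustrative, not part of the patch under
discussion, and the IOVA must not be reused until the work has run:

#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* One deferred unmap request */
struct deferred_unmap {
	struct work_struct work;
	struct iommu_domain *domain;
	unsigned long iova;
	size_t size;
};

static void deferred_unmap_fn(struct work_struct *work)
{
	struct deferred_unmap *du = container_of(work, struct deferred_unmap,
						 work);

	iommu_unmap(du->domain, du->iova, du->size);
	kfree(du);
}

/* Queue the unmap and return immediately; the TLBI cost is unchanged,
 * it just no longer runs in the caller's context. */
static int queue_unmap(struct iommu_domain *domain, unsigned long iova,
		       size_t size)
{
	struct deferred_unmap *du = kzalloc(sizeof(*du), GFP_KERNEL);

	if (!du)
		return -ENOMEM;

	du->domain = domain;
	du->iova = iova;
	du->size = size;
	INIT_WORK(&du->work, deferred_unmap_fn);
	schedule_work(&du->work);
	return 0;
}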


-KR


Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-11 Thread Sai Prakash Ranjan

Hi Krishna,

On 2021-06-11 22:19, Krishna Reddy wrote:

Hi Sai,

> > > No, the unmap latency is not just in some test case written, the
> > > issue is very real and we have workloads where camera is reporting
> > > frame drops because of this unmap latency in the order of 100s of
> > > milliseconds.

Not exactly, this issue is not specific to camera. If you look at the
numbers in the commit text, even for the test device it's the same
observation. It depends on the buffer size we are unmapping, which
affects the number of TLBIs issued. I am not aware of any such HW side
BW issues for camera specifically on QCOM devices.


It is clear that reducing the number of TLBIs reduces the unmap API
latency, but it is at the expense of throwing away valid TLB entries.
Quantifying the impact of arbitrary invalidation of valid TLB entries
at context level is not straightforward and is use case dependent. The
side effects might be rare or won't be known until they are noticed.


Right, but we won't know until we profile the specific usecases or try
them in a generic workload to see if they affect the performance. Sure,
over-invalidation is a concern where multiple buffers can be mapped to
the same context and the cache is not usable at the time for lookup and
such, but we don't do it for small buffers, only for large buffers which
means thousands of TLB entry mappings, in which case TLBIASID is
preferred (note: I mentioned the HW team recommendation to use it for
anything greater than 128 TLB entries in my earlier reply). And also
note that we do this only for the partial walk flush; we are not
arbitrarily changing all the TLBIs to ASID based.


Can you provide more details on how the unmap latency is causing the
camera to drop frames? Is unmap performed in the perf path?


I am no camera expert but from what the camera team mentioned, there is
a thread which frees memory (large unused memory buffers) periodically,
which ends up taking around 100+ms and causing some camera test failures
with frame drops. Parallel efforts are already being made to optimize
this usage of the thread but, as I mentioned previously, this is *not
camera specific*: let's say someone else invokes such large unmaps, they
are going to face the same issue.


If unmap is queued and performed on a background thread, would it
resolve the frame drops?


Not sure I understand what you mean by queuing on a background thread,
but with that or not, we still do the same number of TLBIs and hop
through iommu->io-pgtable->arm-smmu to perform the unmap, so how will
that help?


Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-11 Thread Krishna Reddy
Hi Sai,
> > > > No, the unmap latency is not just in some test case written, the
> > > > issue is very real and we have workloads where camera is reporting
> > > > frame drops because of this unmap latency in the order of 100s of
> > > > milliseconds.

> Not exactly, this issue is not specific to camera. If you look at the
> numbers in the commit text, even for the test device it's the same
> observation. It depends on the buffer size we are unmapping, which
> affects the number of TLBIs issued. I am not aware of any such HW side
> BW issues for camera specifically on QCOM devices.

It is clear that reducing the number of TLBIs reduces the unmap API
latency, but it is at the expense of throwing away valid TLB entries.
Quantifying the impact of arbitrary invalidation of valid TLB entries
at context level is not straightforward and is use case dependent. The
side effects might be rare or won't be known until they are noticed.
Can you provide more details on how the unmap latency is causing the
camera to drop frames? Is unmap performed in the perf path? If unmap is
queued and performed on a background thread, would it resolve the frame
drops?

-KR



Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Sai Prakash Ranjan

Hi Krishna,

On 2021-06-11 06:07, Krishna Reddy wrote:

> No, the unmap latency is not just in some test case written, the issue
> is very real and we have workloads where camera is reporting frame
> drops because of this unmap latency in the order of 100s of milliseconds.
> And the hardware team recommends using ASID based invalidations for
> anything larger than 128 TLB entries. So yes, we have taken note of the
> impacts here before going this way and hence feel more inclined to
> make this qcom specific if required.

Seems like the real issue here is not the unmap API latency. It should
be the high number of back-to-back SMMU TLB invalidate register writes
that is resulting in lower ISO BW to the camera and overflow. Isn't it?
Even the Tegra186 SoC has a similar issue, and the HW team recommended
rate limiting the number of back-to-back SMMU TLB invalidate register
writes. The subsequent Tegra194 SoC has a dedicated SMMU for ISO clients
to avoid the impact of TLB invalidates from NISO clients on ISO BW.


Not exactly, this issue is not specific to camera. If you look at the
numbers in the commit text, even for the test device it's the same
observation. It depends on the buffer size we are unmapping, which
affects the number of TLBIs issued. I am not aware of any such HW side
BW issues for camera specifically on QCOM devices.

Thanks,
Sai


Thinking some more, I wonder if the Tegra folks might have an opinion
to add here, given that their multiple-SMMU solution was seemingly
about trying to get enough TLB and pagetable walk bandwidth in the
first place?


While it is good to reduce the number of TLB register writes, flushing
all TLB entries at context granularity arbitrarily can have a negative
impact on active traffic and BW. I don't have much data on the possible
impact at this point. Can the flushing at context granularity be made a
quirk rather than performing it as default?

-KR


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


RE: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Krishna Reddy
> > No, the unmap latency is not just in some test case written, the issue
> > is very real and we have workloads where camera is reporting frame
> > drops because of this unmap latency in the order of 100s of milliseconds.
> > And the hardware team recommends using ASID based invalidations for
> > anything larger than 128 TLB entries. So yes, we have taken note of the
> > impacts here before going this way and hence feel more inclined to
> > make this qcom specific if required.

Seems like the real issue here is not the unmap API latency. It should
be the high number of back-to-back SMMU TLB invalidate register writes
that is resulting in lower ISO BW to the camera and overflow. Isn't it?
Even the Tegra186 SoC has a similar issue, and the HW team recommended
rate limiting the number of back-to-back SMMU TLB invalidate register
writes. The subsequent Tegra194 SoC has a dedicated SMMU for ISO clients
to avoid the impact of TLB invalidates from NISO clients on ISO BW.

>> Thinking some more, I wonder if the Tegra folks might have an opinion
>> to add here, given that their multiple-SMMU solution was seemingly
>> about trying to get enough TLB and pagetable walk bandwidth in the
>> first place?

While it is good to reduce the number of TLB register writes, flushing
all TLB entries at context granularity arbitrarily can have a negative
impact on active traffic and BW. I don't have much data on the possible
impact at this point. Can the flushing at context granularity be made a
quirk rather than performing it as default?

-KR



Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Sai Prakash Ranjan

Hi Robin,

On 2021-06-10 20:59, Robin Murphy wrote:

On 2021-06-10 12:54, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 17:03, Robin Murphy wrote:

On 2021-06-10 10:36, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 14:38, Robin Murphy wrote:

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
Currently for iommu_unmap() of a large scatter-gather list with page
size elements, the majority of time is spent in flushing of partial
walks in __arm_lpae_unmap(), which is a VA based TLB invalidation
(TLBIVA for arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size
elements (8192 entries), there are 16 2MB buffer unmaps based on the
pgsize (2MB for 4K granule), and each 2MB will further result in 512
TLBIVAs (2MB/4K), resulting in a total of 8192 TLBIVAs (512*16) for
the 16 2MB chunks, causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, that
results in just 16 ASID based TLB invalidations, or tlb_flush_all()
callbacks (TLBIASID in case of arm-smmu), as opposed to 8192 TLBIVAs,
thereby improving the performance of unmaps drastically.

The condition (size > granule size) is chosen for
io_pgtable_tlb_flush_all() because for any granule with supported
pgsizes, we will have at least 512 TLB invalidations, for which
tlb_flush_all() is already recommended. For example, take a 4K granule
with a 2MB pgsize: this will result in 512 TLBIVAs in a partial walk
flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap
(average over 10 iterations):

Before this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        2.067 us         1.854 us
  64K        9.598 us         8.802 us
   1M      148.890 us       130.718 us
   2M      305.864 us        67.291 us
  12M     1793.604 us       390.838 us
  16M     2386.848 us       518.187 us
  24M     3563.296 us       775.989 us
  32M     4747.171 us      1033.364 us

After this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        1.723 us         1.765 us
  64K        9.880 us         8.869 us
   1M      155.364 us       135.223 us
   2M      303.906 us         5.385 us
  12M     1786.557 us        21.250 us
  16M     2391.890 us        27.437 us
  24M     3570.895 us        39.937 us
  32M     4755.234 us        51.797 us

This is further reduced once the map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all() calls.

Signed-off-by: Sai Prakash Ranjan
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else
+				io_pgtable_tlb_flush_walk(iop, iova, size,
+							  ARM_LPAE_GRANULE(data));


Erm, when will the above condition ever not be true? ;)


Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for the flush walk.


In particular I'm thinking of SMMUv3.2 where the whole range can be
invalidated by VA in a single command anyway, so the additional
penalties of TLBIALL are undesirable.


Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and have something like
below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);
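
As a concrete illustration of where such a quirk could be set, a sketch
against arm-smmu-v3 domain finalise; IO_PGTABLE_QUIRK_RANGE_INV is
hypothetical, while ARM_SMMU_FEAT_RANGE_INV is the existing SMMUv3.2
feature bit:

/* Sketch: advertise range invalidation capability to io-pgtable when
 * the hardware supports it, so range based flushes are kept there. */
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
	pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_RANGE_INV;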


The design here has always been that io-pgtable says *what* needs
invalidating, and we left it up to the drivers to decide exactly
*how*. Even though things have evolved a bit I don't think that has
fundamentally changed - tlb_flush_walk is now only used in this one
place (technically I suppose it could be renamed tlb_flush_table but
it's not worth the churn), so drivers can implement their own
preferred

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Robin Murphy

On 2021-06-10 12:54, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 17:03, Robin Murphy wrote:

On 2021-06-10 10:36, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 14:38, Robin Murphy wrote:

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
Currently for iommu_unmap() of a large scatter-gather list with page
size elements, the majority of time is spent in flushing of partial
walks in __arm_lpae_unmap(), which is a VA based TLB invalidation
(TLBIVA for arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size
elements (8192 entries), there are 16 2MB buffer unmaps based on the
pgsize (2MB for 4K granule), and each 2MB will further result in 512
TLBIVAs (2MB/4K), resulting in a total of 8192 TLBIVAs (512*16) for
the 16 2MB chunks, causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, that
results in just 16 ASID based TLB invalidations, or tlb_flush_all()
callbacks (TLBIASID in case of arm-smmu), as opposed to 8192 TLBIVAs,
thereby improving the performance of unmaps drastically.

The condition (size > granule size) is chosen for
io_pgtable_tlb_flush_all() because for any granule with supported
pgsizes, we will have at least 512 TLB invalidations, for which
tlb_flush_all() is already recommended. For example, take a 4K granule
with a 2MB pgsize: this will result in 512 TLBIVAs in a partial walk
flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap
(average over 10 iterations):

Before this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        2.067 us         1.854 us
  64K        9.598 us         8.802 us
   1M      148.890 us       130.718 us
   2M      305.864 us        67.291 us
  12M     1793.604 us       390.838 us
  16M     2386.848 us       518.187 us
  24M     3563.296 us       775.989 us
  32M     4747.171 us      1033.364 us

After this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        1.723 us         1.765 us
  64K        9.880 us         8.869 us
   1M      155.364 us       135.223 us
   2M      303.906 us         5.385 us
  12M     1786.557 us        21.250 us
  16M     2391.890 us        27.437 us
  24M     3570.895 us        39.937 us
  32M     4755.234 us        51.797 us

This is further reduced once the map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all() calls.

Signed-off-by: Sai Prakash Ranjan
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else
+				io_pgtable_tlb_flush_walk(iop, iova, size,
+							  ARM_LPAE_GRANULE(data));


Erm, when will the above condition ever not be true? ;)


Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for the flush walk.


In particular I'm thinking of SMMUv3.2 where the whole range can be
invalidated by VA in a single command anyway, so the additional
penalties of TLBIALL are undesirable.


Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and have something like
below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);


The design here has always been that io-pgtable says *what* needs
invalidating, and we left it up to the drivers to decide exactly
*how*. Even though things have evolved a bit I don't think that has
fundamentally changed - tlb_flush_walk is now only used in this one
place (technically I suppose it could be renamed tlb_flush_table but
it's not worth the churn), so drivers can implement their own
preferred table-invalidating behaviour even more easily than choosing

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Thierry Reding
On Thu, Jun 10, 2021 at 12:33:56PM +0100, Robin Murphy wrote:
> On 2021-06-10 10:36, Sai Prakash Ranjan wrote:
> > Hi Robin,
> > 
> > On 2021-06-10 14:38, Robin Murphy wrote:
> > > On 2021-06-10 06:24, Sai Prakash Ranjan wrote:
> > > > Hi Robin,
> > > > 
> > > > On 2021-06-10 00:14, Robin Murphy wrote:
> > > > > On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
> > > > > > Currently for iommu_unmap() of large scatter-gather list
> > > > > > with page size
> > > > > > elements, the majority of time is spent in flushing of
> > > > > > partial walks in
> > > > > > __arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for
> > > > > > arm-smmu).
> > > > > > 
> > > > > > For example: to unmap a 32MB scatter-gather list with
> > > > > > page size elements
> > > > > > (8192 entries), there are 16->2MB buffer unmaps based on
> > > > > > the pgsize (2MB
> > > > > > for 4K granule) and each of 2MB will further result in
> > > > > > 512 TLBIVAs (2MB/4K)
> > > > > > resulting in a total of 8192 TLBIVAs (512*16) for
> > > > > > 16->2MB causing a huge
> > > > > > overhead.
> > > > > > 
> > > > > > So instead use io_pgtable_tlb_flush_all() to invalidate
> > > > > > the entire context
> > > > > > if size (pgsize) is greater than the granule size (4K,
> > > > > > 16K, 64K). For this
> > > > > > example of 32MB scatter-gather list unmap, this results
> > > > > > in just 16 ASID
> > > > > > based TLB invalidations or tlb_flush_all() callback
> > > > > > (TLBIASID in case of
> > > > > > arm-smmu) as opposed to 8192 TLBIVAs thereby increasing
> > > > > > the performance of
> > > > > > unmaps drastically.
> > > > > > 
> > > > > > Condition (size > granule size) is chosen for
> > > > > > io_pgtable_tlb_flush_all()
> > > > > > because for any granule with supported pgsizes, we will
> > > > > > have at least 512
> > > > > > TLB invalidations for which tlb_flush_all() is already
> > > > > > recommended. For
> > > > > > example, take 4K granule with 2MB pgsize, this will
> > > > > > result in 512 TLBIVA
> > > > > > in partial walk flush.
> > > > > > 
> > > > > > Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap:
> > > > > > (average over 10 iterations)
> > > > > > 
> > > > > > Before this optimization:
> > > > > > 
> > > > > >  size    iommu_map_sg      iommu_unmap
> > > > > >    4K        2.067 us         1.854 us
> > > > > >   64K        9.598 us         8.802 us
> > > > > >    1M      148.890 us       130.718 us
> > > > > >    2M      305.864 us        67.291 us
> > > > > >   12M     1793.604 us       390.838 us
> > > > > >   16M     2386.848 us       518.187 us
> > > > > >   24M     3563.296 us       775.989 us
> > > > > >   32M     4747.171 us      1033.364 us
> > > > > > 
> > > > > > After this optimization:
> > > > > > 
> > > > > >  size    iommu_map_sg      iommu_unmap
> > > > > >    4K        1.723 us         1.765 us
> > > > > >   64K        9.880 us         8.869 us
> > > > > >    1M      155.364 us       135.223 us
> > > > > >    2M      303.906 us         5.385 us
> > > > > >   12M     1786.557 us        21.250 us
> > > > > >   16M     2391.890 us        27.437 us
> > > > > >   24M     3570.895 us        39.937 us
> > > > > >   32M     4755.234 us        51.797 us
> > > > > > 
> > > > > > This is further reduced once the map/unmap_pages()
> > > > > > support gets in which
> > > > > > will result in just 1 tlb_flush_all() as opposed to 16
> > > > > > tlb_flush_all().
> > > > > > 
> > > > > > Signed-off-by: Sai Prakash Ranjan 
> > > > > > ---
> > > > > >   drivers/iommu/io-pgtable-arm.c | 7 +--
> > > > > >   1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/iommu/io-pgtable-arm.c
> > > > > > b/drivers/iommu/io-pgtable-arm.c
> > > > > > index 87def58e79b5..c3cb9add3179 100644
> > > > > > --- a/drivers/iommu/io-pgtable-arm.c
> > > > > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > > > > @@ -589,8 +589,11 @@ static size_t
> > > > > > __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > > > > >     if (!iopte_leaf(pte, lvl, iop->fmt)) {
> > > > > >   /* Also flush any partial walks */
> > > > > > -    io_pgtable_tlb_flush_walk(iop, iova, size,
> > > > > > -  ARM_LPAE_GRANULE(data));
> > > > > > +    if (size > ARM_LPAE_GRANULE(data))
> > > > > > +    io_pgtable_tlb_flush_all(iop);
> > > > > > +    else
> > > > > 
> > > > > Erm, when will the above condition ever not be true? ;)
> > > > > 
> > > > 
> > > > Ah right, silly me :)
> > > > 
> > > > > Taking a step back, though, what about the impact to drivers other
> > > > > than SMMUv2?
> > > > 
> > > > Other drivers would be msm_iommu.c, qcom_iommu.c which does the same
> > > > thing as arm-smmu-v2 (page based invalidations), then there is
> > > > ipmmu-vmsa.c
> > > > which does 

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Sai Prakash Ranjan

Hi Robin,

On 2021-06-10 17:03, Robin Murphy wrote:

On 2021-06-10 10:36, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 14:38, Robin Murphy wrote:

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
Currently for iommu_unmap() of a large scatter-gather list with page
size elements, the majority of time is spent in flushing of partial
walks in __arm_lpae_unmap(), which is a VA based TLB invalidation
(TLBIVA for arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size
elements (8192 entries), there are 16 2MB buffer unmaps based on the
pgsize (2MB for 4K granule), and each 2MB will further result in 512
TLBIVAs (2MB/4K), resulting in a total of 8192 TLBIVAs (512*16) for
the 16 2MB chunks, causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, that
results in just 16 ASID based TLB invalidations, or tlb_flush_all()
callbacks (TLBIASID in case of arm-smmu), as opposed to 8192 TLBIVAs,
thereby improving the performance of unmaps drastically.

The condition (size > granule size) is chosen for
io_pgtable_tlb_flush_all() because for any granule with supported
pgsizes, we will have at least 512 TLB invalidations, for which
tlb_flush_all() is already recommended. For example, take a 4K granule
with a 2MB pgsize: this will result in 512 TLBIVAs in a partial walk
flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap
(average over 10 iterations):

Before this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        2.067 us         1.854 us
  64K        9.598 us         8.802 us
   1M      148.890 us       130.718 us
   2M      305.864 us        67.291 us
  12M     1793.604 us       390.838 us
  16M     2386.848 us       518.187 us
  24M     3563.296 us       775.989 us
  32M     4747.171 us      1033.364 us

After this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        1.723 us         1.765 us
  64K        9.880 us         8.869 us
   1M      155.364 us       135.223 us
   2M      303.906 us         5.385 us
  12M     1786.557 us        21.250 us
  16M     2391.890 us        27.437 us
  24M     3570.895 us        39.937 us
  32M     4755.234 us        51.797 us

This is further reduced once the map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all() calls.

Signed-off-by: Sai Prakash Ranjan
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else
+				io_pgtable_tlb_flush_walk(iop, iova, size,
+							  ARM_LPAE_GRANULE(data));


Erm, when will the above condition ever not be true? ;)


Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for the flush walk.


In particular I'm thinking of SMMUv3.2 where the whole range can be
invalidated by VA in a single command anyway, so the additional
penalties of TLBIALL are undesirable.


Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and have something like
below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);


The design here has always been that io-pgtable says *what* needs
invalidating, and we left it up to the drivers to decide exactly
*how*. Even though things have evolved a bit I don't think that has
fundamentally changed - tlb_flush_walk is now only used in this one
place (technically I suppose it could be renamed tlb_flush_table but
it's not worth the churn), so drivers can implement their own
preferred table-invalidating behaviour even more easily than choosing
whether to bounce a quirk through the common

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Robin Murphy

On 2021-06-10 10:36, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 14:38, Robin Murphy wrote:

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
Currently for iommu_unmap() of a large scatter-gather list with page
size elements, the majority of time is spent in flushing of partial
walks in __arm_lpae_unmap(), which is a VA based TLB invalidation
(TLBIVA for arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size
elements (8192 entries), there are 16 2MB buffer unmaps based on the
pgsize (2MB for 4K granule), and each 2MB will further result in 512
TLBIVAs (2MB/4K), resulting in a total of 8192 TLBIVAs (512*16) for
the 16 2MB chunks, causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, that
results in just 16 ASID based TLB invalidations, or tlb_flush_all()
callbacks (TLBIASID in case of arm-smmu), as opposed to 8192 TLBIVAs,
thereby improving the performance of unmaps drastically.

The condition (size > granule size) is chosen for
io_pgtable_tlb_flush_all() because for any granule with supported
pgsizes, we will have at least 512 TLB invalidations, for which
tlb_flush_all() is already recommended. For example, take a 4K granule
with a 2MB pgsize: this will result in 512 TLBIVAs in a partial walk
flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap
(average over 10 iterations):

Before this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        2.067 us         1.854 us
  64K        9.598 us         8.802 us
   1M      148.890 us       130.718 us
   2M      305.864 us        67.291 us
  12M     1793.604 us       390.838 us
  16M     2386.848 us       518.187 us
  24M     3563.296 us       775.989 us
  32M     4747.171 us      1033.364 us

After this optimization:

 size    iommu_map_sg      iommu_unmap
   4K        1.723 us         1.765 us
  64K        9.880 us         8.869 us
   1M      155.364 us       135.223 us
   2M      303.906 us         5.385 us
  12M     1786.557 us        21.250 us
  16M     2391.890 us        27.437 us
  24M     3570.895 us        39.937 us
  32M     4755.234 us        51.797 us

This is further reduced once the map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all() calls.

Signed-off-by: Sai Prakash Ranjan
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else
+				io_pgtable_tlb_flush_walk(iop, iova, size,
+							  ARM_LPAE_GRANULE(data));


Erm, when will the above condition ever not be true? ;)


Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for the flush walk.


In particular I'm thinking of SMMUv3.2 where the whole range can be
invalidated by VA in a single command anyway, so the additional
penalties of TLBIALL are undesirable.


Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie ARM_SMMU_FEAT_RANGE_INV to this quirk and have something like
below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);


The design here has always been that io-pgtable says *what* needs
invalidating, and we left it up to the drivers to decide exactly
*how*. Even though things have evolved a bit I don't think that has
fundamentally changed - tlb_flush_walk is now only used in this one
place (technically I suppose it could be renamed tlb_flush_table but
it's not worth the churn), so drivers can implement their own
preferred table-invalidating behaviour even more easily than choosing
whether to bounce a quirk through the common code or not. Consider
what you've already seen for the

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Sai Prakash Ranjan

Hi Robin,

On 2021-06-10 14:38, Robin Murphy wrote:

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
Currently for iommu_unmap() of large scatter-gather list with page 
size
elements, the majority of time is spent in flushing of partial walks 
in

__arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for
arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size 
elements
(8192 entries), there are 16->2MB buffer unmaps based on the pgsize 
(2MB
for 4K granule) and each of 2MB will further result in 512 TLBIVAs 
(2MB/4K)
resulting in a total of 8192 TLBIVAs (512*16) for 16->2MB causing a 
huge

overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, this results
in just 16 ASID based TLB invalidations or tlb_flush_all() callbacks
(TLBIASID in case of arm-smmu) as opposed to 8192 TLBIVAs, thereby
improving the performance of unmaps drastically.

Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all()
because for any granule with supported pgsizes, we will have at least
512 TLB invalidations for which tlb_flush_all() is already recommended.
For example, take 4K granule with 2MB pgsize: this will result in 512
TLBIVAs in partial walk flush.
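
For reference, the smallest region that can take this path for each
granule is the corresponding block size, so 512 is the minimum number of
per-granule invalidations across configurations (block sizes per the
LPAE format; the figures below are illustrative):

	granule   block size   invalidations (block size / granule)
	   4K        2MB            512
	  16K       32MB           2048
	  64K      512MB           8192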

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap:
(average over 10 iterations)

Before this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       2.067 us       1.854 us
  64K       9.598 us       8.802 us
   1M     148.890 us     130.718 us
   2M     305.864 us      67.291 us
  12M    1793.604 us     390.838 us
  16M    2386.848 us     518.187 us
  24M    3563.296 us     775.989 us
  32M    4747.171 us    1033.364 us

After this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       1.723 us       1.765 us
  64K       9.880 us       8.869 us
   1M     155.364 us     135.223 us
   2M     303.906 us       5.385 us
  12M    1786.557 us      21.250 us
  16M    2391.890 us      27.437 us
  24M    3570.895 us      39.937 us
  32M    4755.234 us      51.797 us

This is further reduced once the map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all().


Signed-off-by: Sai Prakash Ranjan 
---
  drivers/iommu/io-pgtable-arm.c | 7 +++++--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,

 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else


Erm, when will the above condition ever not be true? ;)



Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for flush walk.


In particular I'm thinking of SMMUv3.2 where the whole
range can be invalidated by VA in a single command anyway, so the
additional penalties of TLBIALL are undesirable.



Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie up ARM_SMMU_FEAT_RANGE_INV with this quirk and have something
like below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);


The design here has always been that io-pgtable says *what* needs
invalidating, and we left it up to the drivers to decide exactly
*how*. Even though things have evolved a bit I don't think that has
fundamentally changed - tlb_flush_walk is now only used in this one
place (technically I suppose it could be renamed tlb_flush_table but
it's not worth the churn), so drivers can implement their own
preferred table-invalidating behaviour even more easily than choosing
whether to bounce a quirk through the common code or not. Consider
what you've already seen for the Renesas IPMMU, or SMMUv1 stage 2...



Thanks for 

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-10 Thread Robin Murphy

On 2021-06-10 06:24, Sai Prakash Ranjan wrote:

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:

Currently for iommu_unmap() of large scatter-gather list with page size
elements, the majority of time is spent in flushing of partial walks in
__arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for
arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size elements
(8192 entries), there are 16 2MB buffer unmaps based on the pgsize (2MB
for 4K granule) and each 2MB will further result in 512 TLBIVAs (2MB/4K),
resulting in a total of 8192 TLBIVAs (512*16) for the 16 2MB chunks,
causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, this results
in just 16 ASID based TLB invalidations or tlb_flush_all() callbacks
(TLBIASID in case of arm-smmu) as opposed to 8192 TLBIVAs, thereby
improving the performance of unmaps drastically.

Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all()
because for any granule with supported pgsizes, we will have at least
512 TLB invalidations for which tlb_flush_all() is already recommended.
For example, take 4K granule with 2MB pgsize: this will result in 512
TLBIVAs in partial walk flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap:
(average over 10 iterations)

Before this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       2.067 us       1.854 us
  64K       9.598 us       8.802 us
   1M     148.890 us     130.718 us
   2M     305.864 us      67.291 us
  12M    1793.604 us     390.838 us
  16M    2386.848 us     518.187 us
  24M    3563.296 us     775.989 us
  32M    4747.171 us    1033.364 us

After this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       1.723 us       1.765 us
  64K       9.880 us       8.869 us
   1M     155.364 us     135.223 us
   2M     303.906 us       5.385 us
  12M    1786.557 us      21.250 us
  16M    2391.890 us      27.437 us
  24M    3570.895 us      39.937 us
  32M    4755.234 us      51.797 us

This is further reduced once the map/unmap_pages() support gets in which
will result in just 1 tlb_flush_all() as opposed to 16 tlb_flush_all().

Signed-off-by: Sai Prakash Ranjan 
---
  drivers/iommu/io-pgtable-arm.c | 7 +++++--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,

 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else


Erm, when will the above condition ever not be true? ;)



Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for flush walk.


In particular I'm thinking of SMMUv3.2 where the whole
range can be invalidated by VA in a single command anyway, so the
additional penalties of TLBIALL are undesirable.



Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie up ARM_SMMU_FEAT_RANGE_INV with this quirk and have something
like below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);


The design here has always been that io-pgtable says *what* needs 
invalidating, and we left it up to the drivers to decide exactly *how*. 
Even though things have evolved a bit I don't think that has 
fundamentally changed - tlb_flush_walk is now only used in this one 
place (technically I suppose it could be renamed tlb_flush_table but 
it's not worth the churn), so drivers can implement their own preferred 
table-invalidating behaviour even more easily than choosing whether to 
bounce a quirk through the common code or not. Consider what you've 
already seen for the Renesas IPMMU, or SMMUv1 stage 2...
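
To make that concrete, a driver that prefers a full context invalidation
for table flushes can simply route its tlb_flush_walk callback to its
flush-all path, much as ipmmu-vmsa effectively does. A minimal sketch:
struct iommu_flush_ops and the callback signatures are the real
io-pgtable interface, while everything prefixed my_ is a made-up
placeholder.

	struct my_iommu_domain;					/* placeholder */
	void my_hw_flush_context(struct my_iommu_domain *dom);	/* placeholder */

	static void my_tlb_flush_all(void *cookie)
	{
		struct my_iommu_domain *dom = cookie;

		/* e.g. issue TLBIASID for this context on SMMUv2-class hardware */
		my_hw_flush_context(dom);
	}

	static void my_tlb_flush_walk(unsigned long iova, size_t size,
				      size_t granule, void *cookie)
	{
		/* A table flush covers many PTEs: invalidate the whole
		 * context rather than issuing one command per granule.
		 */
		my_tlb_flush_all(cookie);
	}

	static const struct iommu_flush_ops my_flush_ops = {
		.tlb_flush_all	= my_tlb_flush_all,
		.tlb_flush_walk	= my_tlb_flush_walk,
	};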


I'm instinctively a little twitchy about making this a blanket 

Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-09 Thread Sai Prakash Ranjan

Hi Robin,

On 2021-06-10 00:14, Robin Murphy wrote:

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:
Currently for iommu_unmap() of large scatter-gather list with page size
elements, the majority of time is spent in flushing of partial walks in
__arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for
arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size elements
(8192 entries), there are 16 2MB buffer unmaps based on the pgsize (2MB
for 4K granule) and each 2MB will further result in 512 TLBIVAs (2MB/4K),
resulting in a total of 8192 TLBIVAs (512*16) for the 16 2MB chunks,
causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire
context if size (pgsize) is greater than the granule size (4K, 16K,
64K). For this example of a 32MB scatter-gather list unmap, this results
in just 16 ASID based TLB invalidations or tlb_flush_all() callbacks
(TLBIASID in case of arm-smmu) as opposed to 8192 TLBIVAs, thereby
improving the performance of unmaps drastically.

Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all()
because for any granule with supported pgsizes, we will have at least
512 TLB invalidations for which tlb_flush_all() is already recommended.
For example, take 4K granule with 2MB pgsize: this will result in 512
TLBIVAs in partial walk flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap:
(average over 10 iterations)

Before this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       2.067 us       1.854 us
  64K       9.598 us       8.802 us
   1M     148.890 us     130.718 us
   2M     305.864 us      67.291 us
  12M    1793.604 us     390.838 us
  16M    2386.848 us     518.187 us
  24M    3563.296 us     775.989 us
  32M    4747.171 us    1033.364 us

After this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       1.723 us       1.765 us
  64K       9.880 us       8.869 us
   1M     155.364 us     135.223 us
   2M     303.906 us       5.385 us
  12M    1786.557 us      21.250 us
  16M    2391.890 us      27.437 us
  24M    3570.895 us      39.937 us
  32M    4755.234 us      51.797 us

This is further reduced once the map/unmap_pages() support gets in,
which will result in just 1 tlb_flush_all() as opposed to 16
tlb_flush_all().


Signed-off-by: Sai Prakash Ranjan 
---
  drivers/iommu/io-pgtable-arm.c | 7 +++++--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,

 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else


Erm, when will the above condition ever not be true? ;)



Ah right, silly me :)


Taking a step back, though, what about the impact to drivers other
than SMMUv2?


Other drivers would be msm_iommu.c and qcom_iommu.c, which do the same
thing as arm-smmu-v2 (page based invalidations); then there is
ipmmu-vmsa.c, which does tlb_flush_all() for flush walk.


In particular I'm thinking of SMMUv3.2 where the whole
range can be invalidated by VA in a single command anyway, so the
additional penalties of TLBIALL are undesirable.



Right, so I am thinking we can have a new generic quirk
IO_PGTABLE_QUIRK_RANGE_INV to choose between range based invalidations
(tlb_flush_walk) and tlb_flush_all(). In the case of arm-smmu-v3.2, we
can tie up ARM_SMMU_FEAT_RANGE_INV with this quirk and have something
like below, thoughts?

if (iop->cfg.quirks & IO_PGTABLE_QUIRK_RANGE_INV)
	io_pgtable_tlb_flush_walk(iop, iova, size,
				  ARM_LPAE_GRANULE(data));
else
	io_pgtable_tlb_flush_all(iop);

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-09 Thread Robin Murphy

On 2021-06-09 15:53, Sai Prakash Ranjan wrote:

Currently for iommu_unmap() of large scatter-gather list with page size
elements, the majority of time is spent in flushing of partial walks in
__arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for
arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size elements
(8192 entries), there are 16 2MB buffer unmaps based on the pgsize (2MB
for 4K granule) and each 2MB will further result in 512 TLBIVAs (2MB/4K),
resulting in a total of 8192 TLBIVAs (512*16) for the 16 2MB chunks,
causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire context
if size (pgsize) is greater than the granule size (4K, 16K, 64K). For this
example of 32MB scatter-gather list unmap, this results in just 16 ASID
based TLB invalidations or tlb_flush_all() callback (TLBIASID in case of
arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the performance of
unmaps drastically.

Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all()
because for any granule with supported pgsizes, we will have at least 512
TLB invalidations for which tlb_flush_all() is already recommended. For
example, take 4K granule with 2MB pgsize: this will result in 512 TLBIVAs
in partial walk flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap:
(average over 10 iterations)

Before this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       2.067 us       1.854 us
  64K       9.598 us       8.802 us
   1M     148.890 us     130.718 us
   2M     305.864 us      67.291 us
  12M    1793.604 us     390.838 us
  16M    2386.848 us     518.187 us
  24M    3563.296 us     775.989 us
  32M    4747.171 us    1033.364 us

After this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       1.723 us       1.765 us
  64K       9.880 us       8.869 us
   1M     155.364 us     135.223 us
   2M     303.906 us       5.385 us
  12M    1786.557 us      21.250 us
  16M    2391.890 us      27.437 us
  24M    3570.895 us      39.937 us
  32M    4755.234 us      51.797 us

This is further reduced once the map/unmap_pages() support gets in which
will result in just 1 tlb_flush_all() as opposed to 16 tlb_flush_all().

Signed-off-by: Sai Prakash Ranjan 
---
  drivers/iommu/io-pgtable-arm.c | 7 +++++--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,

 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else


Erm, when will the above condition ever not be true? ;)

Taking a step back, though, what about the impact to drivers other than 
SMMUv2? In particular I'm thinking of SMMUv3.2 where the whole range can 
be invalidated by VA in a single command anyway, so the additional 
penalties of TLBIALL are undesirable.


Robin.


+			io_pgtable_tlb_flush_walk(iop, iova, size,
+						  ARM_LPAE_GRANULE(data));
 			ptep = iopte_deref(pte, data);
 			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
 		} else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {




[PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

2021-06-09 Thread Sai Prakash Ranjan
Currently for iommu_unmap() of large scatter-gather list with page size
elements, the majority of time is spent in flushing of partial walks in
__arm_lpae_unmap() which is a VA based TLB invalidation (TLBIVA for
arm-smmu).

For example: to unmap a 32MB scatter-gather list with page size elements
(8192 entries), there are 16 2MB buffer unmaps based on the pgsize (2MB
for 4K granule) and each 2MB will further result in 512 TLBIVAs (2MB/4K),
resulting in a total of 8192 TLBIVAs (512*16) for the 16 2MB chunks,
causing a huge overhead.

So instead use io_pgtable_tlb_flush_all() to invalidate the entire context
if size (pgsize) is greater than the granule size (4K, 16K, 64K). For this
example of 32MB scatter-gather list unmap, this results in just 16 ASID
based TLB invalidations or tlb_flush_all() callback (TLBIASID in case of
arm-smmu) as opposed to 8192 TLBIVAs thereby increasing the performance of
unmaps drastically.

Condition (size > granule size) is chosen for io_pgtable_tlb_flush_all()
because for any granule with supported pgsizes, we will have at least 512
TLB invalidations for which tlb_flush_all() is already recommended. For
example, take 4K granule with 2MB pgsize: this will result in 512 TLBIVAs
in partial walk flush.

Test on QTI SM8150 SoC for 10 iterations of iommu_{map_sg}/unmap:
(average over 10 iterations)

Before this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       2.067 us       1.854 us
  64K       9.598 us       8.802 us
   1M     148.890 us     130.718 us
   2M     305.864 us      67.291 us
  12M    1793.604 us     390.838 us
  16M    2386.848 us     518.187 us
  24M    3563.296 us     775.989 us
  32M    4747.171 us    1033.364 us

After this optimization:

 size    iommu_map_sg    iommu_unmap
   4K       1.723 us       1.765 us
  64K       9.880 us       8.869 us
   1M     155.364 us     135.223 us
   2M     303.906 us       5.385 us
  12M    1786.557 us      21.250 us
  16M    2391.890 us      27.437 us
  24M    3570.895 us      39.937 us
  32M    4755.234 us      51.797 us

This is further reduced once the map/unmap_pages() support gets in which
will result in just 1 tlb_flush_all() as opposed to 16 tlb_flush_all().

Signed-off-by: Sai Prakash Ranjan 
---
 drivers/iommu/io-pgtable-arm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..c3cb9add3179 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -589,8 +589,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,

 		if (!iopte_leaf(pte, lvl, iop->fmt)) {
 			/* Also flush any partial walks */
-			io_pgtable_tlb_flush_walk(iop, iova, size,
-						  ARM_LPAE_GRANULE(data));
+			if (size > ARM_LPAE_GRANULE(data))
+				io_pgtable_tlb_flush_all(iop);
+			else
+				io_pgtable_tlb_flush_walk(iop, iova, size,
+							  ARM_LPAE_GRANULE(data));
 			ptep = iopte_deref(pte, data);
 			__arm_lpae_free_pgtable(data, lvl + 1, ptep);
 		} else if (iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT) {
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
