Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-05 Thread Yicong Yang
On 2023/7/5 16:43, Barry Song wrote:
> On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang  wrote:
>>
>> On 2023/6/30 1:26, Catalin Marinas wrote:
>>> On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
 On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> From: Barry Song 
>
> on x86, batched and deferred tlb shootdown has lead to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
 [...]
>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/include/asm/tlbbatch.h | 12 
>  arch/arm64/include/asm/tlbflush.h | 33 -
>  arch/arm64/mm/flush.c | 69 +++
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 12 ++--

 First of all, this patch needs to be split in some preparatory patches
 introducing/renaming functions with no functional change for x86. Once
 done, you can add the arm64-only changes.

>>
>> got it. will try to split this patch as suggested.
>>
 Now, on the implementation, I had some comments on v7 but we didn't get
 to a conclusion and the thread eventually died:

 https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/

 I know I said a command line argument is better than Kconfig or some
 random number of CPUs heuristics but it would be even better if we don't
 bother with any, just make this always on.
>>
>> ok, will make this always on.
>>
 Barry had some comments
 around mprotect() being racy and that's why we have
 flush_tlb_batched_pending() but I don't think it's needed (or, for
 arm64, it can be a DSB since this patch issues the TLBIs but without the
 DVM Sync). So we need to clarify this (see Barry's last email on the
 above thread) and before attempting new versions of this patchset. With
 flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
 implementation would be faster on any SoC irrespective of the number of
 CPUs.
>>>
>>> I think I got the need for flush_tlb_batched_pending(). If
>>> try_to_unmap() marks the pte !present and we have a pending TLBI,
>>> change_pte_range() will skip the TLB maintenance altogether since it did
>>> not change the pte. So we could be left with stale TLB entries after
>>> mprotect() before TTU does the batch flushing.
>>>
> 
> Good catch.
> This could be also true for MADV_DONTNEED. after try_to_unmap, we run
> MADV_DONTNEED on this area, as pte is not present, we don't do anything
> on this PTE in zap_pte_range afterwards.
> 
>>> We can have an arch-specific flush_tlb_batched_pending() that can be a
>>> DSB only on arm64 and a full mm flush on x86.
>>>
>>
>> We need to do a flush/dsb in flush_tlb_batched_pending() only in a race
>> condition so we first check whether there's a pended batched flush and
>> if so do the tlb flush. The pending checking is common and the differences
>> among the archs is how to flush the TLB here within the 
>> flush_tlb_batched_pending(),
>> on arm64 it should only be a dsb.
>>
>> As we only needs to maintain the TLBs already pended in batched flush,
>> does it make sense to only handle those TLBs in flush_tlb_batched_pending()?
>> Then we can use the arch_tlbbatch_flush() rather than flush_tlb_mm() in
>> flush_tlb_batched_pending() and no arch specific function needed.
> 
> as we have issued no-sync tlbi on those pending addresses , that means
> our hardware
> has already "recorded" what should be flushed in the specific mm. so
> DSB only will flush
> them correctly. right?
> 

yes it's right. I was just thought something like below. arch_tlbbatch_flush()
will only be a dsb on arm64 so this will match what Catalin wants. But as you
told that this maybe incorrect on x86 so we'd better have arch specific
implementation for flush_tlb_batched_pending() as suggested.

diff --git a/mm/rmap.c b/mm/rmap.c
index 9699c6011b0e..afa3571503a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -717,7 +717,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

if (pending != flushed) {
-   flush_tlb_mm(mm);
+   arch_tlbbatch_flush(>tlb_ubc.arch);
/*
 * If the new TLB flushing is pending during flushing, leave
 * mm->tlb_flush_batched as is, to avoid losing flushing.


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-05 Thread Barry Song
On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang  wrote:
>
> On 2023/6/30 1:26, Catalin Marinas wrote:
> > On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
> >> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> >>> From: Barry Song 
> >>>
> >>> on x86, batched and deferred tlb shootdown has lead to 90%
> >>> performance increase on tlb shootdown. on arm64, HW can do
> >>> tlb shootdown without software IPI. But sync tlbi is still
> >>> quite expensive.
> >> [...]
> >>>  .../features/vm/TLB/arch-support.txt  |  2 +-
> >>>  arch/arm64/Kconfig|  1 +
> >>>  arch/arm64/include/asm/tlbbatch.h | 12 
> >>>  arch/arm64/include/asm/tlbflush.h | 33 -
> >>>  arch/arm64/mm/flush.c | 69 +++
> >>>  arch/x86/include/asm/tlbflush.h   |  5 +-
> >>>  include/linux/mm_types_task.h |  4 +-
> >>>  mm/rmap.c | 12 ++--
> >>
> >> First of all, this patch needs to be split in some preparatory patches
> >> introducing/renaming functions with no functional change for x86. Once
> >> done, you can add the arm64-only changes.
> >>
>
> got it. will try to split this patch as suggested.
>
> >> Now, on the implementation, I had some comments on v7 but we didn't get
> >> to a conclusion and the thread eventually died:
> >>
> >> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
> >>
> >> I know I said a command line argument is better than Kconfig or some
> >> random number of CPUs heuristics but it would be even better if we don't
> >> bother with any, just make this always on.
>
> ok, will make this always on.
>
> >> Barry had some comments
> >> around mprotect() being racy and that's why we have
> >> flush_tlb_batched_pending() but I don't think it's needed (or, for
> >> arm64, it can be a DSB since this patch issues the TLBIs but without the
> >> DVM Sync). So we need to clarify this (see Barry's last email on the
> >> above thread) and before attempting new versions of this patchset. With
> >> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
> >> implementation would be faster on any SoC irrespective of the number of
> >> CPUs.
> >
> > I think I got the need for flush_tlb_batched_pending(). If
> > try_to_unmap() marks the pte !present and we have a pending TLBI,
> > change_pte_range() will skip the TLB maintenance altogether since it did
> > not change the pte. So we could be left with stale TLB entries after
> > mprotect() before TTU does the batch flushing.
> >

Good catch.
This could be also true for MADV_DONTNEED. after try_to_unmap, we run
MADV_DONTNEED on this area, as pte is not present, we don't do anything
on this PTE in zap_pte_range afterwards.

> > We can have an arch-specific flush_tlb_batched_pending() that can be a
> > DSB only on arm64 and a full mm flush on x86.
> >
>
> We need to do a flush/dsb in flush_tlb_batched_pending() only in a race
> condition so we first check whether there's a pended batched flush and
> if so do the tlb flush. The pending checking is common and the differences
> among the archs is how to flush the TLB here within the 
> flush_tlb_batched_pending(),
> on arm64 it should only be a dsb.
>
> As we only needs to maintain the TLBs already pended in batched flush,
> does it make sense to only handle those TLBs in flush_tlb_batched_pending()?
> Then we can use the arch_tlbbatch_flush() rather than flush_tlb_mm() in
> flush_tlb_batched_pending() and no arch specific function needed.

as we have issued no-sync tlbi on those pending addresses , that means
our hardware
has already "recorded" what should be flushed in the specific mm. so
DSB only will flush
them correctly. right?

>
> Thanks.
>

Barry


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-04 Thread Yicong Yang
On 2023/6/30 1:26, Catalin Marinas wrote:
> On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
>> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
>>> From: Barry Song 
>>>
>>> on x86, batched and deferred tlb shootdown has lead to 90%
>>> performance increase on tlb shootdown. on arm64, HW can do
>>> tlb shootdown without software IPI. But sync tlbi is still
>>> quite expensive.
>> [...]
>>>  .../features/vm/TLB/arch-support.txt  |  2 +-
>>>  arch/arm64/Kconfig|  1 +
>>>  arch/arm64/include/asm/tlbbatch.h | 12 
>>>  arch/arm64/include/asm/tlbflush.h | 33 -
>>>  arch/arm64/mm/flush.c | 69 +++
>>>  arch/x86/include/asm/tlbflush.h   |  5 +-
>>>  include/linux/mm_types_task.h |  4 +-
>>>  mm/rmap.c | 12 ++--
>>
>> First of all, this patch needs to be split in some preparatory patches
>> introducing/renaming functions with no functional change for x86. Once
>> done, you can add the arm64-only changes.
>>

got it. will try to split this patch as suggested.

>> Now, on the implementation, I had some comments on v7 but we didn't get
>> to a conclusion and the thread eventually died:
>>
>> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
>>
>> I know I said a command line argument is better than Kconfig or some
>> random number of CPUs heuristics but it would be even better if we don't
>> bother with any, just make this always on.

ok, will make this always on.

>> Barry had some comments
>> around mprotect() being racy and that's why we have
>> flush_tlb_batched_pending() but I don't think it's needed (or, for
>> arm64, it can be a DSB since this patch issues the TLBIs but without the
>> DVM Sync). So we need to clarify this (see Barry's last email on the
>> above thread) and before attempting new versions of this patchset. With
>> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
>> implementation would be faster on any SoC irrespective of the number of
>> CPUs.
> 
> I think I got the need for flush_tlb_batched_pending(). If
> try_to_unmap() marks the pte !present and we have a pending TLBI,
> change_pte_range() will skip the TLB maintenance altogether since it did
> not change the pte. So we could be left with stale TLB entries after
> mprotect() before TTU does the batch flushing.
> 
> We can have an arch-specific flush_tlb_batched_pending() that can be a
> DSB only on arm64 and a full mm flush on x86.
> 

We need to do a flush/dsb in flush_tlb_batched_pending() only in a race
condition so we first check whether there's a pended batched flush and
if so do the tlb flush. The pending checking is common and the differences
among the archs is how to flush the TLB here within the 
flush_tlb_batched_pending(),
on arm64 it should only be a dsb.

As we only needs to maintain the TLBs already pended in batched flush,
does it make sense to only handle those TLBs in flush_tlb_batched_pending()?
Then we can use the arch_tlbbatch_flush() rather than flush_tlb_mm() in
flush_tlb_batched_pending() and no arch specific function needed.

Thanks.



Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> > From: Barry Song 
> > 
> > on x86, batched and deferred tlb shootdown has lead to 90%
> > performance increase on tlb shootdown. on arm64, HW can do
> > tlb shootdown without software IPI. But sync tlbi is still
> > quite expensive.
> [...]
> >  .../features/vm/TLB/arch-support.txt  |  2 +-
> >  arch/arm64/Kconfig|  1 +
> >  arch/arm64/include/asm/tlbbatch.h | 12 
> >  arch/arm64/include/asm/tlbflush.h | 33 -
> >  arch/arm64/mm/flush.c | 69 +++
> >  arch/x86/include/asm/tlbflush.h   |  5 +-
> >  include/linux/mm_types_task.h |  4 +-
> >  mm/rmap.c | 12 ++--
> 
> First of all, this patch needs to be split in some preparatory patches
> introducing/renaming functions with no functional change for x86. Once
> done, you can add the arm64-only changes.
> 
> Now, on the implementation, I had some comments on v7 but we didn't get
> to a conclusion and the thread eventually died:
> 
> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
> 
> I know I said a command line argument is better than Kconfig or some
> random number of CPUs heuristics but it would be even better if we don't
> bother with any, just make this always on. Barry had some comments
> around mprotect() being racy and that's why we have
> flush_tlb_batched_pending() but I don't think it's needed (or, for
> arm64, it can be a DSB since this patch issues the TLBIs but without the
> DVM Sync). So we need to clarify this (see Barry's last email on the
> above thread) and before attempting new versions of this patchset. With
> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
> implementation would be faster on any SoC irrespective of the number of
> CPUs.

I think I got the need for flush_tlb_batched_pending(). If
try_to_unmap() marks the pte !present and we have a pending TLBI,
change_pte_range() will skip the TLB maintenance altogether since it did
not change the pte. So we could be left with stale TLB entries after
mprotect() before TTU does the batch flushing.

We can have an arch-specific flush_tlb_batched_pending() that can be a
DSB only on arm64 and a full mm flush on x86.

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> on x86, batched and deferred tlb shootdown has lead to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
[...]
>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/include/asm/tlbbatch.h | 12 
>  arch/arm64/include/asm/tlbflush.h | 33 -
>  arch/arm64/mm/flush.c | 69 +++
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 12 ++--

First of all, this patch needs to be split in some preparatory patches
introducing/renaming functions with no functional change for x86. Once
done, you can add the arm64-only changes.

Now, on the implementation, I had some comments on v7 but we didn't get
to a conclusion and the thread eventually died:

https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/

I know I said a command line argument is better than Kconfig or some
random number of CPUs heuristics but it would be even better if we don't
bother with any, just make this always on. Barry had some comments
around mprotect() being racy and that's why we have
flush_tlb_batched_pending() but I don't think it's needed (or, for
arm64, it can be a DSB since this patch issues the TLBIs but without the
DVM Sync). So we need to clarify this (see Barry's last email on the
above thread) and before attempting new versions of this patchset. With
flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
implementation would be faster on any SoC irrespective of the number of
CPUs.

-- 
Catalin


[RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-05-18 Thread Yicong Yang
From: Barry Song 

on x86, batched and deferred tlb shootdown has lead to 90%
performance increase on tlb shootdown. on arm64, HW can do
tlb shootdown without software IPI. But sync tlbi is still
quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include 
 #include 
 #include 
 #include 

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
 volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

 memset(p, 0x88, SIZE);

 for (int k = 0; k < 1; k++) {
 /* swap in */
 for (int i = 0; i < SIZE; i += 4096) {
 (void)p[i];
 }

 /* swap out */
 madvise(p, SIZE, MADV_PAGEOUT);
 }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only 
options.
 # To display the perf.data header info, please use --header/--header-only 
options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object  Symbol
 #   ...  .  
.
 #
21.07%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irq
 8.23%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 6.67%  a.out[kernel.kallsyms]  [k] filemap_map_pages
 6.16%  a.out[kernel.kallsyms]  [k] __zram_bvec_write
 5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush
 3.71%  a.out[kernel.kallsyms]  [k] _raw_spin_lock
 3.49%  a.out[kernel.kallsyms]  [k] memset64
 1.63%  a.out[kernel.kallsyms]  [k] clear_page
 1.42%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock
 1.26%  a.out[kernel.kallsyms]  [k] 
mod_zone_state.llvm.8525150236079521930
 1.23%  a.out[kernel.kallsyms]  [k] xas_load
 1.15%  a.out[kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically, like more
than 100 on a phone, the overhead would be much higher as
we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher
overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush
is actually used by the final dsb() to wait for the completion
of tlb flush. This provides us a very good chance to leverage
the existing batched tlb in kernel. The minimum modification
is that we only send async tlbi in the first stage and we send
dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to
finish the program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
Tested with benchmark in the commit on Kunpeng920 arm64 server,
observed an improvement around 12.5% with command
`time ./swap_bench`.
w/o w/
real0m13.460s   0m11.771s
user0m0.248s0m0.279s
sys 0m12.039s   0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
on CONFIG_EXPERT for this stage and add a runtime tunable to allow
to disable it according to the scenario.

Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
20 times, this patch decrease the time used more than 50% and saved the
time used by ptep_clear_flush().

This patch extends arch_tlbbatch_add_mm() to take an address of the
target page to support the feature on arm64. Also rename it to
arch_tlbbatch_add_pending() to better match its function since we
don't need to handle the mm on arm64 and add_mm is not proper.
add_pending will make sense to both as on x86 we're pending the
TLB flush operations while on arm64