Re: [PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-03-30 Thread Punit Agrawal
> threshold according to
> their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.

The commit log and the patch disagree on the name of the config option
(CONFIG_NR_CPUS_FOR_BATCHED_TLB vs CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB).

But more importantly, I was wondering why this posting doesn't address
Catalin's feedback [a] about using a runtime tunable. Maybe I missed the
follow-up discussion.

Thanks,
Punit

[a] https://lore.kernel.org/linux-mm/y7xmhptawcut4...@arm.com/
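For reference, the compile-time threshold under discussion amounts to
something like the following (a sketch only, not the actual patch; the
Kconfig symbol name follows the commit log above and the default is an
assumption, while Catalin's suggestion would replace the constant with a
runtime tunable):

    /* Illustrative only: gate deferral on a configurable CPU count. */
    static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
    {
            /* On small systems a synchronous TLBI broadcast is cheap. */
            if (num_online_cpus() <= CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
                    return false;

            return true;
    }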

> This patch also improves the performance of page migration. Using pmbench
> to migrate the pages of pmbench between node 0 and node 1 twenty times,
> this patch decreases the time taken by more than 50%, saving the time
> spent in ptep_clear_flush().
>
> This patch extends arch_tlbbatch_add_mm() to take the address of the
> target page to support the feature on arm64. It is also renamed to
> arch_tlbbatch_add_pending() to better match its function, since arm64
> doesn't need to handle the mm and 'add_mm' is no longer an accurate
> name. 'add_pending' makes sense for both architectures: on x86 the
> pending work is the TLB flush operation, while on arm64 it is the
> synchronisation operation.
>
> Cc: Anshuman Khandual 
> Cc: Jonathan Corbet 
> Cc: Nadav Amit 
> Cc: Mel Gorman 
> Tested-by: Yicong Yang 
> Tested-by: Xin Hao 
> Tested-by: Punit Agrawal 
> Signed-off-by: Barry Song 
> Signed-off-by: Yicong Yang 
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Xin Hao 
> Reviewed-by: Anshuman Khandual 
> ---
>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  6 +++
>  arch/arm64/include/asm/tlbbatch.h | 12 +
>  arch/arm64/include/asm/tlbflush.h | 52 ++-
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 12 +++--
>  7 files changed, 81 insertions(+), 12 deletions(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
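For readers following along, the renamed hook described in the commit
message works out to roughly the following on arm64 (a sketch based on
the description above, not the actual patch; the helper name is an
assumption):

    /* Stage 1: issue the TLBI for this page but do not wait for it. */
    static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                 struct mm_struct *mm,
                                                 unsigned long uaddr)
    {
            __flush_tlb_page_nosync(mm, uaddr);     /* broadcast TLBI, no DSB */
    }

    /* Stage 2: one barrier waits for all pending invalidations. */
    static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
    {
            dsb(ish);
    }

On x86, by contrast, the same hook records the mm's CPU mask so that the
actual flush can be sent from arch_tlbbatch_flush() later.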


[...]



Re: [PATCH v2 00/33] Per-VMA locks

2023-02-28 Thread Punit Agrawal
Punit Agrawal  writes:

> Suren Baghdasaryan  writes:
>
>> Previous version:
>> v1: https://lore.kernel.org/all/20230109205336.3665937-1-sur...@google.com/
>> RFC: https://lore.kernel.org/all/20220901173516.702122-1-sur...@google.com/
>>
>> LWN article describing the feature:
>> https://lwn.net/Articles/906852/
>>
>> Per-vma locks idea that was discussed during SPF [1] discussion at LSF/MM
>> last year [2], which concluded with suggestion that “a reader/writer
>> semaphore could be put into the VMA itself; that would have the effect of
>> using the VMA as a sort of range lock. There would still be contention at
>> the VMA level, but it would be an improvement.” This patchset implements
>> this suggested approach.
>
> I took the patches for a spin on a 2-socket 32 core (64 threads) system
> with Intel 8336C (Ice Lake) and 512GB of RAM.
>
> For the initial testing, "pft-threads" from the mm-tests suite[0] was
> used. The test mmaps a memory region (~100GB on the test system) and
> triggers access by a number of threads executing in parallel. For each
> degree of parallelism, the test is repeated 10 times to get a better
> feel for the behaviour. Below is an excerpt of the harmonic mean
> reported by the 'compare-kernels.sh' script[1] included with mm-tests.
>
> The first column is results for mm-unstable as of 2023-02-10, the second
> column is the patches posted here while the third column includes
> optimizations to reclaim some of the observed regression.
>
> From the results, there is a drop in page faults/second for low numbers of
> CPUs but a good improvement at higher CPU counts.
>
>                            6.2.0-rc4              6.2.0-rc4              6.2.0-rc4
>                 mm-unstable-20230210                 pvl-v2             pvl-v2+opt
>
> Hmean     faults/cpu-1    898792.9338 (   0.00%)   894597.0474 *  -0.47%*   895933.2782 *  -0.32%*
> Hmean     faults/cpu-4    751903.9803 (   0.00%)   677764.2975 *  -9.86%*   688643.8163 *  -8.41%*
> Hmean     faults/cpu-7    612275.5663 (   0.00%)   565363.4137 *  -7.66%*   597538.9396 *  -2.41%*
> Hmean     faults/cpu-12   434460.9074 (   0.00%)   410974.2708 *  -5.41%*   452501.4290 *   4.15%*
> Hmean     faults/cpu-21   291475.5165 (   0.00%)   293936.8460 (   0.84%)   308712.2434 *   5.91%*
> Hmean     faults/cpu-30   218021.3980 (   0.00%)   228265.0559 *   4.70%*   241897.5225 *  10.95%*
> Hmean     faults/cpu-48   141798.5030 (   0.00%)   162322.5972 *  14.47%*   166081.9459 *  17.13%*
> Hmean     faults/cpu-79    90060.9577 (   0.00%)   107028.7779 *  18.84%*   109810.4488 *  21.93%*
> Hmean     faults/cpu-110   64729.3561 (   0.00%)    80597.7246 *  24.51%*    83134.0679 *  28.43%*
> Hmean     faults/cpu-128   55740.1334 (   0.00%)    68395.4426 *  22.70%*    69248.2836 *  24.23%*
>
> Hmean     faults/sec-1    898781.7694 (   0.00%)   894247.3174 *  -0.50%*   894440.3118 *  -0.48%*
> Hmean     faults/sec-4   2965588.9697 (   0.00%)  2683651.5664 *  -9.51%*  2726450.9710 *  -8.06%*
> Hmean     faults/sec-7   4144512.3996 (   0.00%)  3891644.2128 *  -6.10%*  4099918.8601 (  -1.08%)
> Hmean     faults/sec-12  4969513.6934 (   0.00%)  4829731.4355 *  -2.81%*  5264682.7371 *   5.94%*
> Hmean     faults/sec-21  5814379.4789 (   0.00%)  5941405.3116 *   2.18%*  6263716.3903 *   7.73%*
> Hmean     faults/sec-30  6153685.3709 (   0.00%)  6489311.6634 *   5.45%*  6910843.5858 *  12.30%*
> Hmean     faults/sec-48  6197953.1327 (   0.00%)  7216320.7727 *  16.43%*  7412782.2927 *  19.60%*
> Hmean     faults/sec-79  6167135.3738 (   0.00%)  7425927.1022 *  20.41%*  7637042.2198 *  23.83%*
> Hmean     faults/sec-110 6264768.2247 (   0.00%)  7813329.3863 *  24.72%*  7984344.4005 *  27.45%*
> Hmean     faults/sec-128 6460727.8216 (   0.00%)  7875664.8999 *  21.90%*  8049910.3601 *  24.60%*


The above workload represents the worst case with regard to per-VMA
locks as it creates a single large VMA. As a follow-up, I modified
pft[2] to create a VMA per thread to understand the behaviour in
scenarios where per-VMA locks should show the most benefit.
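The modification is not included here; it was along the lines of the
following hypothetical sketch (pft's real code is structured differently
and the per-thread size below is an assumption):

    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define NR_THREADS        8
    #define PER_THREAD_BYTES  (1UL << 30)

    /* Each worker maps and faults in its own region, i.e. its own VMA. */
    static void *worker(void *arg)
    {
            char *buf = mmap(NULL, PER_THREAD_BYTES, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return NULL;

            for (size_t off = 0; off < PER_THREAD_BYTES; off += 4096)
                    buf[off] = 1;                   /* touch every page */

            munmap(buf, PER_THREAD_BYTES);
            return NULL;
    }

    int main(void)
    {
            pthread_t threads[NR_THREADS];

            for (int i = 0; i < NR_THREADS; i++)
                    pthread_create(&threads[i], NULL, worker, NULL);
            for (int i = 0; i < NR_THREADS; i++)
                    pthread_join(threads[i], NULL);

            return 0;
    }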

                           6.2.0-rc4              6.2.0-rc4              6.2.0-rc4
                mm-unstable-20230210                 pvl-v2             pvl-v2+opt

Hmean     faults/cpu-1    905497.4354 (   0.00%)   888736.5570 *  -1.85%*   888695.2675 *  -1.86%*
Hmean     faults/cpu-4    758519.2719 (   0.00%)   812103.1991 *   7.06%*   825077.9277 *   8.77%*
Hmean     faults/cpu-7    617153.8038 (   0.00%)   729943.4518 *  18.28%*   770872.3161 *  24.91%*
Hmean fa

Re: [External] [PATCH v2 00/33] Per-VMA locks

2023-02-15 Thread Punit Agrawal
Suren Baghdasaryan  writes:

> Previous version:
> v1: https://lore.kernel.org/all/20230109205336.3665937-1-sur...@google.com/
> RFC: https://lore.kernel.org/all/20220901173516.702122-1-sur...@google.com/
>
> LWN article describing the feature:
> https://lwn.net/Articles/906852/
>
> The per-VMA locks idea was discussed during the SPF [1] discussion at LSF/MM
> last year [2], which concluded with the suggestion that “a reader/writer
> semaphore could be put into the VMA itself; that would have the effect of
> using the VMA as a sort of range lock. There would still be contention at
> the VMA level, but it would be an improvement.” This patchset implements
> the suggested approach.

I took the patches for a spin on a 2-socket 32 core (64 threads) system
with Intel 8336C (Ice Lake) and 512GB of RAM.

For the initial testing, "pft-threads" from the mm-tests suite[0] was
used. The test mmaps a memory region (~100GB on the test system) and
triggers access by a number of threads executing in parallel. For each
degree of parallelism, the test is repeated 10 times to get a better
feel for the behaviour. Below is an excerpt of the harmonic mean
reported by the 'compare-kernels.sh' script[1] included with mm-tests.

The first column is results for mm-unstable as of 2023-02-10, the second
column is the patches posted here while the third column includes
optimizations to reclaim some of the observed regression.

From the results, there is a drop in page faults/second for low numbers of
CPUs but a good improvement at higher CPU counts.

                           6.2.0-rc4              6.2.0-rc4              6.2.0-rc4
                mm-unstable-20230210                 pvl-v2             pvl-v2+opt

Hmean     faults/cpu-1    898792.9338 (   0.00%)   894597.0474 *  -0.47%*   895933.2782 *  -0.32%*
Hmean     faults/cpu-4    751903.9803 (   0.00%)   677764.2975 *  -9.86%*   688643.8163 *  -8.41%*
Hmean     faults/cpu-7    612275.5663 (   0.00%)   565363.4137 *  -7.66%*   597538.9396 *  -2.41%*
Hmean     faults/cpu-12   434460.9074 (   0.00%)   410974.2708 *  -5.41%*   452501.4290 *   4.15%*
Hmean     faults/cpu-21   291475.5165 (   0.00%)   293936.8460 (   0.84%)   308712.2434 *   5.91%*
Hmean     faults/cpu-30   218021.3980 (   0.00%)   228265.0559 *   4.70%*   241897.5225 *  10.95%*
Hmean     faults/cpu-48   141798.5030 (   0.00%)   162322.5972 *  14.47%*   166081.9459 *  17.13%*
Hmean     faults/cpu-79    90060.9577 (   0.00%)   107028.7779 *  18.84%*   109810.4488 *  21.93%*
Hmean     faults/cpu-110   64729.3561 (   0.00%)    80597.7246 *  24.51%*    83134.0679 *  28.43%*
Hmean     faults/cpu-128   55740.1334 (   0.00%)    68395.4426 *  22.70%*    69248.2836 *  24.23%*

Hmean     faults/sec-1    898781.7694 (   0.00%)   894247.3174 *  -0.50%*   894440.3118 *  -0.48%*
Hmean     faults/sec-4   2965588.9697 (   0.00%)  2683651.5664 *  -9.51%*  2726450.9710 *  -8.06%*
Hmean     faults/sec-7   4144512.3996 (   0.00%)  3891644.2128 *  -6.10%*  4099918.8601 (  -1.08%)
Hmean     faults/sec-12  4969513.6934 (   0.00%)  4829731.4355 *  -2.81%*  5264682.7371 *   5.94%*
Hmean     faults/sec-21  5814379.4789 (   0.00%)  5941405.3116 *   2.18%*  6263716.3903 *   7.73%*
Hmean     faults/sec-30  6153685.3709 (   0.00%)  6489311.6634 *   5.45%*  6910843.5858 *  12.30%*
Hmean     faults/sec-48  6197953.1327 (   0.00%)  7216320.7727 *  16.43%*  7412782.2927 *  19.60%*
Hmean     faults/sec-79  6167135.3738 (   0.00%)  7425927.1022 *  20.41%*  7637042.2198 *  23.83%*
Hmean     faults/sec-110 6264768.2247 (   0.00%)  7813329.3863 *  24.72%*  7984344.4005 *  27.45%*
Hmean     faults/sec-128 6460727.8216 (   0.00%)  7875664.8999 *  21.90%*  8049910.3601 *  24.60%*

[0] https://github.com/gormanm/mmtests
[1] https://github.com/gormanm/mmtests/blob/master/compare-kernels.sh


Re: [External] [PATCH v5 0/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-11-11 Thread Punit Agrawal
Yicong Yang  writes:

> From: Yicong Yang 
>
> Though ARM64 has the hardware to do TLB shootdown, the hardware
> broadcasting is not free.
> A simple micro-benchmark shows that even on a Snapdragon 888 with only
> 8 cores, the overhead of ptep_clear_flush() is huge even for paging
> out one page mapped by only one process:
> 5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush
>
> When pages are mapped by multiple processes or the HW has more CPUs,
> the cost becomes even higher due to the poor scalability of TLB
> shootdown.
>
> The same benchmark results in 16.99% CPU consumption on an ARM64
> server with around 100 cores, according to Yicong's test on patch
> 4/4.
>
> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by:
> 1. only sending tlbi instructions in the first stage -
>   arch_tlbbatch_add_mm()
> 2. waiting for the completion of the tlbi by dsb while doing the
>   tlbbatch sync in arch_tlbbatch_flush()
> Testing on Snapdragon shows the overhead of ptep_clear_flush()
> is removed by the patchset. The micro-benchmark becomes 5% faster
> even for one page mapped by a single process on a Snapdragon 888.
>
> With this support it becomes possible to do further optimization of
> memory reclamation and migration[*].

I applied the patches on v6.1-rc4 and was able to see the drop in
ptep_clear_flush() in the perf report when running the test program from
patch 2. The tests were done on an RK3399-based system, with benefits
visible when running on either of the clusters.

So, for the series,

Tested-by: Punit Agrawal 

Thanks,
Punit

[...]



Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-10-31 Thread Punit Agrawal
Barry Song <21cn...@gmail.com> writes:

> On Sat, Oct 29, 2022 at 2:11 AM Punit Agrawal
>  wrote:
>>
>> Yicong Yang  writes:
>>
>> > On 2022/10/27 22:19, Punit Agrawal wrote:
>> >>
>> >> [ Apologies for chiming in late in the conversation ]
>> >>
>> >> Anshuman Khandual  writes:
>> >>
>> >>> On 9/28/22 05:53, Barry Song wrote:
>> >>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang  
>> >>>> wrote:
>> >>>>>
>> >>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>> >>>>>> [...]
>> >>>>>>
>> >>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>> >>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> >>>>>>> +{
>> >>>>>>> +/* for small systems with small number of CPUs, TLB shootdown 
>> >>>>>>> is cheap */
>> >>>>>>> +if (num_online_cpus() <= 4)
>> >>>>>>
>> >>>>>> It would be great to have some more inputs from others, whether 4 
>> >>>>>> (which should
>> >>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or 
>> >>>>>> something similar)
>> >>>>>> is optimal for an wide range of arm64 platforms.
>> >>>>>>
>> >>>>
>> >>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>> >>>> with 5,6,7
>> >>>> cores.
>> >>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't 
>> >>>> need
>> >>>> this patch.
>> >>>>
>> >>>> so it seems safe to have
>> >>>> if (num_online_cpus()  < 8)
>> >>>>
>> >>>>>
>> >>>>> Do you prefer this macro to be static or make it configurable through 
>> >>>>> kconfig then
>> >>>>> different platforms can make choice based on their own situations? It 
>> >>>>> maybe hard to
>> >>>>> test on all the arm64 platforms.
>> >>>>
>> >>>> Maybe we can have this default enabled on machines with 8 and more cpus 
>> >>>> and
>> >>>> provide a tlbflush_batched = on or off to allow users enable or
>> >>>> disable it according
>> >>>> to their hardware and products. Similar example: rodata=on or off.
>> >>>
>> >>> No, sounds bit excessive. Kernel command line options should not be added
>> >>> for every possible run time switch options.
>> >>>
>> >>>>
>> >>>> Hi Anshuman, Will,  Catalin, Andrew,
>> >>>> what do you think about this approach?
>> >>>>
>> >>>> BTW, haoxin mentioned another important user scenarios for tlb bach on 
>> >>>> arm64:
>> >>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf...@linux.alibaba.com/
>> >>>>
>> >>>> I do believe we need it based on the expensive cost of tlb shootdown in 
>> >>>> arm64
>> >>>> even by hardware broadcast.
>> >>>
>> >>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 
>> >>> selectively
>> >>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>> >>
>> >> When running the test program in the commit in a VM, I saw benefits from
>> >> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> >> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> >> up.
>> >>
>> >
>> > Maybe you're booting VM on a server with more than 32 cores and Barry 
>> > tested
>> > on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully 
>> > equivalent to
>> > a 4 CPU real machine as the tlbi and dsb in the VM may influence the host
>> > as well.
>>
>> Yeah, I also wondered about this.
>>
>> I was able to test on a 6-core RK3399 based system - there the
>> ptep_clear_flush() was only 0.10% of the overall execution time. The
>> hardware seems to do a pretty good job of keeping the TLB flushing
>> overhead low.

I found a problem with my measurements (missing volatile). Correcting
that increased the overhead somewhat - more below.

> RK3399 has Dual-core ARM Cortex-A72 MPCore processor and
> Quad-core ARM Cortex-A53 MPCore processor. you are probably
> going to see different overhead of ptep_clear_flush() when you
> bind the micro-benchmark on different cores.

Indeed - binding the code on the A53 shows half the overhead from
ptep_clear_flush() compared to the A72.

On the A53 -

$ perf report --stdio -i perf.vanilla.a53.data | grep ptep_clear_flush
 0.63%  pageout  [kernel.kallsyms]  [k] ptep_clear_flush

On the A72

$ perf report --stdio -i perf.vanilla.a72.data | grep ptep_clear_flush
 1.34%  pageout  [kernel.kallsyms]  [k] ptep_clear_flush


[...]



Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-10-28 Thread Punit Agrawal
Anshuman Khandual  writes:

> On 10/28/22 03:25, Barry Song wrote:
>> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
>>  wrote:
>>>
>>> [ Apologies for chiming in late in the conversation ]
>>>
>>> Anshuman Khandual  writes:
>>>
>>>> On 9/28/22 05:53, Barry Song wrote:
>>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang  
>>>>> wrote:
>>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>>> +{
>>>>>>>> +/* for small systems with small number of CPUs, TLB shootdown is 
>>>>>>>> cheap */
>>>>>>>> +if (num_online_cpus() <= 4)
>>>>>>> It would be great to have some more inputs from others, whether 4 
>>>>>>> (which should
>>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something 
>>>>>>> similar)
>>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>>
>>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>>> with 5,6,7
>>>>> cores.
>>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't 
>>>>> need
>>>>> this patch.
>>>>>
>>>>> so it seems safe to have
>>>>> if (num_online_cpus()  < 8)
>>>>>
>>>>>> Do you prefer this macro to be static or make it configurable through 
>>>>>> kconfig then
>>>>>> different platforms can make choice based on their own situations? It 
>>>>>> maybe hard to
>>>>>> test on all the arm64 platforms.
>>>>> Maybe we can have this default enabled on machines with 8 and more cpus 
>>>>> and
>>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>>> disable it according
>>>>> to their hardware and products. Similar example: rodata=on or off.
>>>> No, sounds bit excessive. Kernel command line options should not be added
>>>> for every possible run time switch options.
>>>>
>>>>> Hi Anshuman, Will,  Catalin, Andrew,
>>>>> what do you think about this approach?
>>>>>
>>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on 
>>>>> arm64:
>>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf...@linux.alibaba.com/
>>>>>
>>>>> I do believe we need it based on the expensive cost of tlb shootdown in 
>>>>> arm64
>>>>> even by hardware broadcast.
>>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 
>>>> selectively
>>>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>>> When running the test program in the commit in a VM, I saw benefits from
>>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>>> up.
>>>
>>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
>>> there any overhead? I am wondering what are the downsides of enabling
>>> the config by default.
>> As we are deferring the TLB flush, when we modify a VMA whose flushes are
>> deferred we need to do a sync via flush_tlb_batched_pending() in
>> mprotect(), madvise() etc. to make sure they see the flushed result. If
>> nobody is doing mprotect(), madvise() etc. in the deferred period, the
>> overhead is zero.
>
> Right, it is difficult to justify this overhead for smaller systems,
> which for sure would not benefit from this batched TLB framework.

Thank you for the pointers to the overhead.

Having looked at this more closely, I also see that
flush_tlb_batched_pending() flushes the entire mm rather than just the
page being unmapped (as is done with ptep_clear_flush()).
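For context, the generic helper at the time looked roughly like this (a
sketch of the mm/rmap.c code, not part of this patch), which is where
the whole-mm flush comes from:

    void flush_tlb_batched_pending(struct mm_struct *mm)
    {
            if (data_race(mm->tlb_flush_batched)) {
                    /* We no longer know which pages were deferred. */
                    flush_tlb_mm(mm);

                    /* Keep the clearing ordered after the flush. */
                    barrier();
                    mm->tlb_flush_batched = false;
            }
    }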


Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-10-28 Thread Punit Agrawal
Yicong Yang  writes:

> On 2022/10/27 22:19, Punit Agrawal wrote:
>> 
>> [ Apologies for chiming in late in the conversation ]
>> 
>> Anshuman Khandual  writes:
>> 
>>> On 9/28/22 05:53, Barry Song wrote:
>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang  wrote:
>>>>>
>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>> [...]
>>>>>>
>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +/* for small systems with small number of CPUs, TLB shootdown is 
>>>>>>> cheap */
>>>>>>> +if (num_online_cpus() <= 4)
>>>>>>
>>>>>> It would be great to have some more inputs from others, whether 4 (which 
>>>>>> should
>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something 
>>>>>> similar)
>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>
>>>>
>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>> with 5,6,7
>>>> cores.
>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>> this patch.
>>>>
>>>> so it seems safe to have
>>>> if (num_online_cpus()  < 8)
>>>>
>>>>>
>>>>> Do you prefer this macro to be static or make it configurable through 
>>>>> kconfig then
>>>>> different platforms can make choice based on their own situations? It 
>>>>> maybe hard to
>>>>> test on all the arm64 platforms.
>>>>
>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>> disable it according
>>>> to their hardware and products. Similar example: rodata=on or off.
>>>
>>> No, sounds bit excessive. Kernel command line options should not be added
>>> for every possible run time switch options.
>>>
>>>>
>>>> Hi Anshuman, Will,  Catalin, Andrew,
>>>> what do you think about this approach?
>>>>
>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on 
>>>> arm64:
>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf...@linux.alibaba.com/
>>>>
>>>> I do believe we need it based on the expensive cost of tlb shootdown in 
>>>> arm64
>>>> even by hardware broadcast.
>>>
>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 
>>> selectively
>>> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?
>> 
>> When running the test program in the commit in a VM, I saw benefits from
>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> up.
>> 
>
> Maybe you're booting VM on a server with more than 32 cores and Barry tested
> on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully equivalent to
> a 4 CPU real machine as the tlbi and dsb in the VM may influence the host
> as well.

Yeah, I also wondered about this.

I was able to test on a 6-core RK3399 based system - there the
ptep_clear_flush() was only 0.10% of the overall execution time. The
hardware seems to do a pretty good job of keeping the TLB flushing
overhead low.

[...]



Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-10-27 Thread Punit Agrawal


[ Apologies for chiming in late in the conversation ]

Anshuman Khandual  writes:

> On 9/28/22 05:53, Barry Song wrote:
>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang  wrote:
>>>
>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
 [...]

 On 9/21/22 14:13, Yicong Yang wrote:
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +/* for small systems with small number of CPUs, TLB shootdown is 
> cheap */
> +if (num_online_cpus() <= 4)

 It would be great to have some more inputs from others, whether 4 (which
 should be codified into a macro, e.g. ARM64_NR_CPU_DEFERRED_TLB, or
 something similar) is optimal for a wide range of arm64 platforms.

>> 
>> I have tested it on a 4-CPU and an 8-CPU machine, but I have no machine
>> with 5, 6 or 7 cores.
>> I saw an improvement on the 8-CPU machine and found that 4-CPU machines
>> don't need this patch.
>> 
>> So it seems safe to have
>> if (num_online_cpus() < 8)
>> 
>>>
>>> Do you prefer this macro to be static, or should it be made configurable
>>> through Kconfig so that different platforms can make a choice based on
>>> their own situation? It may be hard to test on all the arm64 platforms.
>> 
>> Maybe we can have this enabled by default on machines with 8 or more CPUs
>> and provide a tlbflush_batched=on/off option to allow users to enable or
>> disable it according to their hardware and products. A similar example:
>> rodata=on or off.
>
> No, that sounds a bit excessive. Kernel command line options should not be
> added for every possible runtime switch.
>
>> 
>> Hi Anshuman, Will,  Catalin, Andrew,
>> what do you think about this approach?
>> 
>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf...@linux.alibaba.com/
>> 
>> I do believe we need it, given the expensive cost of TLB shootdown on arm64
>> even with hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> with CONFIG_EXPERT and for num_online_cpus()  > 8 ?

When running the test program from the commit in a VM, I saw benefits from
the patches at all sizes: 2, 4, 8 and 32 vCPUs. On the test machine,
ptep_clear_flush() went from ~1% in the unpatched version to not showing
up at all.

Yicong mentioned that he didn't see any benefit for <= 4 CPUs, but is
there any overhead? I am wondering what the downsides of enabling the
config by default would be.

Thanks,
Punit


Re: [PATCH v10 02/25] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2018-05-14 Thread Punit Agrawal
Laurent Dufour  writes:

> On 08/05/2018 13:04, Punit Agrawal wrote:
>> Hi Laurent,
>> 
>> Laurent Dufour  writes:
>> 
>>> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT which turns on the
>>> Speculative Page Fault handler when building for 64bit.
>>>
>>> Cc: Thomas Gleixner 
>>> Signed-off-by: Laurent Dufour 
>>> ---
>>>  arch/x86/Kconfig | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index d8983df5a2bc..ebdeb48e4a4a 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -30,6 +30,7 @@ config X86_64
>>> select MODULES_USE_ELF_RELA
>>> select X86_DEV_DMA_OPS
>>> select ARCH_HAS_SYSCALL_WRAPPER
>>> +   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> 
>> I'd suggest merging this patch with the one making changes to the
>> architectural fault handler towards the end of the series.
>> 
>> The Kconfig change is closely tied to the architectural support for SPF
>> and makes sense to be in a single patch.
>> 
>> If there's a good reason to keep them as separate patches, please move
>> the architecture Kconfig changes after the patch adding fault handler
>> changes.
>> 
>> It's better to enable the feature once the core infrastructure is merged
>> rather than at the beginning of the series to avoid potential bad
>> fallout from incomplete functionality during bisection.
>
> Indeed bisection was the reason why Andrew asked me to push the configuration
> enablement on top of the series (https://lkml.org/lkml/2017/10/10/1229).

The config options have gone through another round of splitting (between
core and architecture) since that comment. I agree that it still makes
sense to define the core config - CONFIG_SPECULATIVE_PAGE_FAULT early
on.

Just to clarify, my suggestion was to only move the architecture configs
further down.

>
> I also think it would be better to have the architecture enablement in one
> patch, but that would mean that the code will not be built when bisecting
> without the last patch adding the per-architecture code.

I don't see that as a problem. But if I'm in the minority, I am OK with
leaving things as they are as well.

Thanks,
Punit


Re: [PATCH v10 02/25] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2018-05-08 Thread Punit Agrawal
Hi Laurent,

Laurent Dufour  writes:

> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT which turns on the
> Speculative Page Fault handler when building for 64bit.
>
> Cc: Thomas Gleixner 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/x86/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d8983df5a2bc..ebdeb48e4a4a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -30,6 +30,7 @@ config X86_64
>   select MODULES_USE_ELF_RELA
>   select X86_DEV_DMA_OPS
>   select ARCH_HAS_SYSCALL_WRAPPER
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

I'd suggest merging this patch with the one making changes to the
architectural fault handler towards the end of the series.

The Kconfig change is closely tied to the architectural support for SPF
and makes sense to be in a single patch.

If there's a good reason to keep them as separate patches, please move
the architecture Kconfig changes after the patch adding fault handler
changes.

It's better to enable the feature once the core infrastructure is merged
rather than at the beginning of the series to avoid potential bad
fallout from incomplete functionality during bisection.

All the comments here definitely hold for the arm64 patches that you
plan to include with the next update.

Thanks,
Punit

>  
>  #
>  # Arch settings


Re: [PATCH v10 24/25] x86/mm: add speculative pagefault handling

2018-05-04 Thread Punit Agrawal
Laurent Dufour  writes:

> On 30/04/2018 20:43, Punit Agrawal wrote:
>> Hi Laurent,
>> 
>> I am looking to add support for speculative page fault handling to
>> arm64 (effectively porting this patch) and had a few questions.
>> Apologies if I've missed an obvious explanation for my queries. I'm
>> jumping in bit late to the discussion.
>
> Hi Punit,
>
> Thanks for giving this series a review.
> I don't have arm64 hardware to play with, but I'll be happy to add arm64
> patches to my series and to try to maintain them.

I'll be happy to try them on arm64 platforms I have access to and
provide feedback.

>
>> 
>> On Tue, Apr 17, 2018 at 3:33 PM, Laurent Dufour
>>  wrote:
>>> From: Peter Zijlstra 
>>>

[...]

>>>
>>> -   vma = find_vma(mm, address);
>>> +   if (!vma || !can_reuse_spf_vma(vma, address))
>>> +   vma = find_vma(mm, address);
>> 
>> Is there a measurable benefit from reusing the vma?
>> 
>> Dropping the vma reference unconditionally after speculative page
>> fault handling gets rid of the implicit state when "vma != NULL"
>> (increased ref-count). I found it a bit confusing to follow.
>
> I do agree, this is quite confusing. My initial goal was to be able to reuse
> the VMA in the case where a protection key error was detected, but it's not
> really necessary on x86 since we know at the beginning of the fault operation
> that protection keys are involved. This is not the case on ppc64, but I
> couldn't find a way to easily rely on the speculatively fetched VMA either,
> so for protection keys this didn't help.
>
> Regarding the measurable benefit of reusing the fetched vma, I did further
> tests using the will-it-scale/page_fault2_threads test, and I'm no longer
> really convinced that this is worth the added complexity. I think I'll drop
> the patch "mm: speculative page fault handler return VMA" from the series,
> and thus remove the call to can_reuse_spf_vma().

Makes sense. Thanks for giving this a go.

Punit

[...]



Re: [PATCH v10 00/25] Speculative page faults

2018-05-02 Thread Punit Agrawal
Hi Laurent,

Thanks for your reply.

Laurent Dufour  writes:

> On 02/05/2018 16:17, Punit Agrawal wrote:
>> Hi Laurent,
>> 
>> One query below -
>> 
>> Laurent Dufour  writes:
>> 
>> [...]
>> 
>>>
>>> Ebizzy:
>>> ---
>>> The test is counting the number of records per second it can manage, the
>>> higher is the best. I run it like this 'ebizzy -mTRp'. To get consistent
>>> result I repeated the test 100 times and measure the average result. The
>>> number is the record processes per second, the higher is the best.
>>>
>>>                     BASE         SPF       delta
>>> 16 CPUs x86 VM  12405.52    91104.52     634.39%
>>> 80 CPUs P8 node 37880.01    76201.05     101.16%
>> 
>> How do you measure the number of records processed? Is there a specific
>> version of ebizzy that reports this? I couldn't find a way to get this
>> information with the ebizzy that's included in ltp.
>
> I'm using the original one : http://ebizzy.sourceforge.net/

Turns out I missed the records processed in the verbose output enabled
by "-vvv". Sorry for the noise.

[...]

>> 
>> A trial run showed increased fault handling when SPF is enabled on an
>> 8-core ARM64 system running 4.17-rc3. I am using a port of your x86
>> patch to enable spf on arm64.
>> 
>> SPF
>> ---
>> 
>> Performance counter stats for './ebizzy -vvvmTRp':
>> 
>>  1,322,736  faults   
>>
>>  1,299,241  software/config=11/  
>>
>> 
>>   10.005348034 seconds time elapsed
>> 
>> No SPF
>> -
>> 
>>  Performance counter stats for './ebizzy -vvvmTRp':
>> 
>>708,916  faults
>>  0  software/config=11/
>> 
>>   10.005807432 seconds time elapsed
>
> Thanks for sharing these good numbers !


A quick run showed 71041 (no-spf) vs 122306 (spf) records/s (~72%
improvement).

I'd like to do some runs on a slightly larger system (if I can get my
hands on one) to see how the patches behave. I'll also have a closer
look at your series - the previous comments were just some things I
observed as part of trying the functionality on arm64.

Thanks,
Punit



Re: [PATCH v10 00/25] Speculative page faults

2018-05-02 Thread Punit Agrawal
Hi Laurent,

One query below -

Laurent Dufour  writes:

[...]

>
> Ebizzy:
> ---
> The test counts the number of records per second it can manage; the
> higher the better. I run it like this: 'ebizzy -mTRp'. To get consistent
> results I repeated the test 100 times and measured the average. The
> number is the records processed per second; higher is better.
>
>                     BASE         SPF       delta
> 16 CPUs x86 VM  12405.52    91104.52     634.39%
> 80 CPUs P8 node 37880.01    76201.05     101.16%

How do you measure the number of records processed? Is there a specific
version of ebizzy that reports this? I couldn't find a way to get this
information with the ebizzy that's included in ltp.

>
> Here are the performance counter read during a run on a 16 CPUs x86 VM:
>  Performance counter stats for './ebizzy -mRTp':
> 860074  faults
> 856866  spf
>285  pagefault:spf_pte_lock
>   1506  pagefault:spf_vma_changed
>  0  pagefault:spf_vma_noanon
> 73  pagefault:spf_vma_notsup
>  0  pagefault:spf_vma_access
>  0  pagefault:spf_pmd_changed
>
> And the ones captured during a run on a 80 CPUs Power node:
>  Performance counter stats for './ebizzy -mRTp':
> 722695  faults
> 699402  spf
>  16048  pagefault:spf_pte_lock
>   6838  pagefault:spf_vma_changed
>  0  pagefault:spf_vma_noanon
>277  pagefault:spf_vma_notsup
>  0  pagefault:spf_vma_access
>  0  pagefault:spf_pmd_changed
>
> In ebizzy's case most of the page faults were handled in a speculative way,
> leading to the ebizzy performance boost.

A trial run showed increased fault handling when SPF is enabled on an
8-core ARM64 system running 4.17-rc3. I am using a port of your x86
patch to enable spf on arm64.

SPF
---

Performance counter stats for './ebizzy -vvvmTRp':

 1,322,736  faults  

 1,299,241  software/config=11/ 


  10.005348034 seconds time elapsed

No SPF
-

 Performance counter stats for './ebizzy -vvvmTRp':

   708,916  faults
 0  software/config=11/

  10.005807432 seconds time elapsed

Thanks,
Punit

[...]



Re: [PATCH v10 17/25] mm: protect mm_rb tree with a rwlock

2018-04-30 Thread Punit Agrawal
Hi Laurent,

One nitpick below.

On Tue, Apr 17, 2018 at 3:33 PM, Laurent Dufour
 wrote:
> This change is inspired by the Peter's proposal patch [1] which was
> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
> that particular case, and it is introducing major performance degradation
> due to excessive scheduling operations.
>
> To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
> is protecting its access using a rwlock.  As the mm_rb tree is an O(log n)
> search it is safe to protect it using such a lock.  The VMA cache is not
> protected by the new rwlock and it should not be used without holding the
> mmap_sem.
>
> To allow the picked VMA structure to be used once the rwlock is released, a
> use count is added to the VMA structure. When the VMA is allocated it is
> set to 1.  Each time the VMA is picked with the rwlock held its use count
> is incremented. Each time the VMA is released it is decremented. When the
> use count hits zero, this means that the VMA is no more used and should be
> freed.
>
> This patch is preparing for 2 kinds of VMA access:
>  - as usual, under the control of the mmap_sem,
>  - without holding the mmap_sem for the speculative page fault handler.
>
> Access done under the control of the mmap_sem doesn't require grabbing the
> rwlock to protect read access to the mm_rb tree, but write access must
> be done under the protection of the rwlock too. This affects inserting and
> removing of elements in the RB tree.
>
> The patch is introducing 2 new functions:
>  - vma_get() to find a VMA based on an address by holding the new rwlock.
>  - vma_put() to release the VMA when it is no longer used.
> These services are designed to be used when accesses are made to the RB tree
> without holding the mmap_sem.
>
> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
> we rely on the WMB done when releasing the rwlock to serialize the write
> with the RMB done in a later patch to check for the VMA's validity.
>
> When free_vma is called, the file associated with the VMA is closed
> immediately, but the policy and the file structure remain in use until
> the VMA's use count reaches 0, which may happen later when exiting an
> in-progress speculative page fault.
>
> [1] https://patchwork.kernel.org/patch/5108281/
>
> Cc: Peter Zijlstra (Intel) 
> Cc: Matthew Wilcox 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h   |   1 +
>  include/linux/mm_types.h |   4 ++
>  kernel/fork.c|   3 ++
>  mm/init-mm.c |   3 ++
>  mm/internal.h|   6 +++
>  mm/mmap.c| 115 
> +++
>  6 files changed, 104 insertions(+), 28 deletions(-)
>

[...]
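[ The lookup side described in the commit message (vma_get()) is elided
  above; it amounts to something like the following sketch, where the
  lock and helper names are assumptions based on the description: ]

    struct vm_area_struct *vma_get(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            read_lock(&mm->mm_rb_lock);
            vma = __find_vma(mm, addr);             /* RB-tree walk, no mmap_sem */
            if (vma)
                    atomic_inc(&vma->vm_ref_count); /* pin for use after unlock */
            read_unlock(&mm->mm_rb_lock);

            return vma;
    }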

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5601f1ef8bb9..a82950960f2e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -160,6 +160,27 @@ void unlink_file_vma(struct vm_area_struct *vma)
> }
>  }
>
> +static void __free_vma(struct vm_area_struct *vma)
> +{
> +   if (vma->vm_file)
> +   fput(vma->vm_file);
> +   mpol_put(vma_policy(vma));
> +   kmem_cache_free(vm_area_cachep, vma);
> +}
> +
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +void put_vma(struct vm_area_struct *vma)
> +{
> +   if (atomic_dec_and_test(&vma->vm_ref_count))
> +   __free_vma(vma);
> +}
> +#else
> +static inline void put_vma(struct vm_area_struct *vma)
> +{
> +   return __free_vma(vma);

Please drop the "return".
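i.e. something like:

    static inline void put_vma(struct vm_area_struct *vma)
    {
            __free_vma(vma);
    }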

Thanks,
Punit

[...]


Re: [PATCH v10 24/25] x86/mm: add speculative pagefault handling

2018-04-30 Thread Punit Agrawal
Hi Laurent,

I am looking to add support for speculative page fault handling to
arm64 (effectively porting this patch) and had a few questions.
Apologies if I've missed an obvious explanation for my queries. I'm
jumping in a bit late to the discussion.

On Tue, Apr 17, 2018 at 3:33 PM, Laurent Dufour
 wrote:
> From: Peter Zijlstra 
>
> Try a speculative fault before acquiring mmap_sem, if it returns with
> VM_FAULT_RETRY continue with the mmap_sem acquisition and do the
> traditional fault.
>
> Signed-off-by: Peter Zijlstra (Intel) 
>
> [Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
>  handle_speculative_fault()]
> [Retry with usual fault path in the case VM_ERROR is returned by
>  handle_speculative_fault(). This allows signal to be delivered]
> [Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
> [Try speculative fault path only for multi threaded processes]
> [Try reuse to the VMA fetch during the speculative path in case of retry]
> [Call reuse_spf_or_find_vma()]
> [Handle memory protection key fault]
> Signed-off-by: Laurent Dufour 
> ---
>  arch/x86/mm/fault.c | 42 ++
>  1 file changed, 38 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 73bd8c95ac71..59f778386df5 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1220,7 +1220,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
> struct mm_struct *mm;
> int fault, major = 0;
> unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
> -   u32 pkey;
> +   u32 pkey, *pt_pkey = &pkey;
>
> tsk = current;
> mm = tsk->mm;
> @@ -1310,6 +1310,30 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
> flags |= FAULT_FLAG_INSTRUCTION;
>
> /*
> +* Do not try speculative page fault for kernel's pages and if
> +* the fault was due to protection keys since it can't be resolved.
> +*/
> +   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
> +   !(error_code & X86_PF_PK)) {

You can simplify this condition by dropping the IS_ENABLED() check as
you already provide an alternate implementation of
handle_speculative_fault() when CONFIG_SPECULATIVE_PAGE_FAULT is not
defined.

> +   fault = handle_speculative_fault(mm, address, flags, &vma);
> +   if (fault != VM_FAULT_RETRY) {
> +   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
> +   /*
> +* Do not advertise for the pkey value since we don't
> +* know it.
> +* This is not a matter as we checked for X86_PF_PK
> +* earlier, so we should not handle pkey fault here,
> +* but to be sure that mm_fault_error() callees will
> +* not try to use it, we invalidate the pointer.
> +*/
> +   pt_pkey = NULL;
> +   goto done;
> +   }
> +   } else {
> +   vma = NULL;
> +   }

The else part can be dropped if vma is initialised to NULL when it is
declared at the top of the function.
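Taken together with the earlier comment about IS_ENABLED(), the call
site could then shrink to roughly the following (a sketch only):

    /* vma is initialised to NULL where it is declared */
    if (!(error_code & X86_PF_PK)) {
            fault = handle_speculative_fault(mm, address, flags, &vma);
            if (fault != VM_FAULT_RETRY) {
                    perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
                    /* pkey is unknown on the speculative path */
                    pt_pkey = NULL;
                    goto done;
            }
    }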

> +
> +   /*
>  * When running in the kernel we expect faults to occur only to
>  * addresses in user space.  All other faults represent errors in
>  * the kernel and should generate an OOPS.  Unfortunately, in the
> @@ -1342,7 +1366,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
> might_sleep();
> }
>
> -   vma = find_vma(mm, address);
> +   if (!vma || !can_reuse_spf_vma(vma, address))
> +   vma = find_vma(mm, address);

Is there a measurable benefit from reusing the vma?

Dropping the vma reference unconditionally after speculative page
fault handling gets rid of the implicit state when "vma != NULL"
(increased ref-count). I found it a bit confusing to follow.

> if (unlikely(!vma)) {
> bad_area(regs, error_code, address);
> return;
> @@ -1409,8 +1434,15 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> flags &= ~FAULT_FLAG_ALLOW_RETRY;
> flags |= FAULT_FLAG_TRIED;
> -   if (!fatal_signal_pending(tsk))
> +   if (!fatal_signal_pending(tsk)) {
> +   /*
> +* Do not try to reuse this vma and fetch it
> +* again since we will release the mmap_sem.
> +*/
> +   if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> +   vma = NULL;

Regardless of the above comment, can the vma be reset here unconditionally?

Thanks,
Punit