From: Barry Song
On x86, batched and deferred TLB shootdown has led to a 90%
performance increase in TLB shootdown. On arm64, the hardware can
do TLB shootdown without a software IPI, but the synchronous TLBI
is still quite expensive.
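
Because the hardware broadcasts the invalidation, an arm64
deferred-flush batch does not need to record a cpumask of CPUs to
IPI the way x86's does. A sketch of the per-arch batch state this
patch adds (the new arch/arm64/include/asm/tlbbatch.h in the
diffstat below), which can be essentially empty:

#ifndef _ARCH_ARM64_TLBBATCH_H
#define _ARCH_ARM64_TLBBATCH_H

struct arch_tlbflush_unmap_batch {
	/*
	 * For arm64, HW can do TLB shootdown, so we don't
	 * need to record a cpumask of CPUs for sending IPIs.
	 */
};

#endif /* _ARCH_ARM64_TLBBATCH_H */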

Even running the simplest program that requires swapout can
demonstrate this:
#include <sys/mman.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

int main()
{
#define SIZE (1 * 1024 * 1024)
	volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	memset((void *)p, 0x88, SIZE);

	for (int k = 0; k < 1; k++) {
		/* swap in */
		for (int i = 0; i < SIZE; i += 4096) {
			(void)p[i];
		}

		/* swap out */
		madvise((void *)p, SIZE, MADV_PAGEOUT);
	}
}

Perf result on a Snapdragon 888 with 8 cores, using zRAM as the
swap block device:
~ # perf record taskset -c 4 ./a.out
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
~ # perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 60K of event 'cycles'
# Event count (approx.): 35706225414
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..........................
#
   21.07%   a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
    8.23%   a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
    6.67%   a.out    [kernel.kallsyms]  [k] filemap_map_pages
    6.16%   a.out    [kernel.kallsyms]  [k] __zram_bvec_write
    5.36%   a.out    [kernel.kallsyms]  [k] ptep_clear_flush
    3.71%   a.out    [kernel.kallsyms]  [k] _raw_spin_lock
    3.49%   a.out    [kernel.kallsyms]  [k] memset64
    1.63%   a.out    [kernel.kallsyms]  [k] clear_page
    1.42%   a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
    1.26%   a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
    1.23%   a.out    [kernel.kallsyms]  [k] xas_load
    1.15%   a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps in/out a page mapped by only one process. If the page
is mapped by multiple processes (typically more than 100 on a
phone), the overhead would be much higher, as we have to run the
TLB flush 100 times for one single page. On top of that, TLB flush
overhead increases with the number of CPU cores due to the poor
scalability of TLB shootdown in hardware, so ARM64 servers should
expect much higher overhead.
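
For context, the generic reclaim path already knows how to defer
these flushes. Roughly, try_to_unmap_one() in mm/rmap.c does the
following (a simplified sketch, with the per-address argument this
series threads through to the architecture hook):

	if (should_defer_flush(mm, flags)) {
		/*
		 * Clear the PTE without flushing; record a pending
		 * flush so one batched flush can cover all mappings.
		 */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
	} else {
		/* Unbatched: one synchronous flush per mapping */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}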

Further perf annotate shows that 95% of the CPU time in
ptep_clear_flush() is actually spent in the final dsb() waiting
for the completion of the TLB flush. This gives us a very good
opportunity to leverage the existing batched TLB facility in the
kernel. The minimal modification is to issue only an asynchronous
TLBI in the first stage, and issue the dsb() in the second stage,
when we actually have to synchronize.
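
A sketch of the two stages on arm64 as added to
arch/arm64/include/asm/tlbflush.h (the hook names follow the
generic batched-flush interface; see the diff for the exact
implementation):

static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
					   unsigned long uaddr)
{
	unsigned long addr;

	/* Ensure prior page-table updates are visible to the walker */
	dsb(ishst);
	addr = __TLBI_VADDR(uaddr, ASID(mm));
	/* Broadcast the invalidation, but do NOT wait for completion */
	__tlbi(vale1is, addr);
	__tlbi_user(vale1is, addr);
}

static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
					     struct mm_struct *mm,
					     unsigned long uaddr)
{
	/* First stage: queue the async TLBI, no sync yet */
	__flush_tlb_page_nosync(mm, uaddr);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* Second stage: one dsb(ish) waits for all pending TLBIs */
	dsb(ish);
}

A single dsb(ish) is sufficient here because it waits for the
completion of all TLBIs previously issued by this CPU, so one wait
covers the whole batch.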

With the above micro-benchmark, the elapsed time to finish the
program decreases by around 5%.

Typical elapsed time w/o patch:
~ # time taskset -c 4 ./a.out
0.21user 14.34system 0:14.69elapsed

w/ patch:
~ # time taskset -c 4 ./a.out
0.22user 13.45system 0:13.80elapsed

Also tested with the benchmark in this commit on a Kunpeng920
arm64 server; an improvement of around 12.5% was observed with the
command `time ./swap_bench`.
        w/o           w/
real    0m13.460s     0m11.771s
user    0m0.248s      0m0.279s
sys     0m12.039s     0m11.458s

Originally, a 16.99% overhead of ptep_clear_flush() was observed,
which has been eliminated by this patch:
[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush

This has been tested on 4-, 8-, and 128-CPU platforms and is
beneficial on large systems, but may show no improvement on small
systems such as a 4-CPU platform.

This patch also improves the performance of page migration. Using
pmbench and migrating pmbench's pages between node 0 and node 1
100 times for 1GB of memory, this patch decreases the time used by
around 20% (from 18.338318910 sec to 13.981866350 sec) by saving
the time spent in ptep_clear_flush().

Cc: Anshuman Khandual
Cc: Jonathan Corbet
Cc: Nadav Amit
Cc: Mel Gorman
Tested-by: Yicong Yang
Tested-by: Xin Hao
Tested-by: Punit Agrawal
Signed-off-by: Barry Song
Signed-off-by: Yicong Yang
Reviewed-by: Kefeng Wang
Reviewed-by: Xin Hao
Reviewed-by: Anshuman Khandual
---
 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/tlbbatch.h    | 12 +
 arch/arm64/include/asm/tlbflush.h    | 44 +--
 4 files changed, 55 insertions(+), 4 deletions(-)
create mode 100644 arch/arm64/include/asm/tlbbatch.h
diff --git