[PATCH 10/10] watchdog/hardlockup: Rename HAVE_HARDLOCKUP_DETECTOR_NON_ARCH to ..._PERF_OR_BUDDY
HAVE_HARDLOCKUP_DETECTOR_NON_ARCH is a mouthful and confusing. HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY is even more of a mouthful, but probably less confusing. Rename the Kconfig names. Signed-off-by: Douglas Anderson --- lib/Kconfig.debug | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index eb1edd5905bc..b9e162698a82 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1058,7 +1058,7 @@ config HARDLOCKUP_DETECTOR_BUDDY # needs SMP). In either case, using the "non-arch" code conflicts with # the NMI watchdog code (which is sometimes used directly and sometimes used # by the arch-provided hardlockup detector). -config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH +config HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY bool depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG default y @@ -1077,10 +1077,10 @@ config HARDLOCKUP_DETECTOR_PREFER_BUDDY an arch-specific hardlockup detector or if resources needed for the hardlockup detector are better used for other things. -# This will select the appropriate non-arch hardlockdup detector -config HARDLOCKUP_DETECTOR_NON_ARCH +# This will select the appropriate non-arch hardlockup detector +config HARDLOCKUP_DETECTOR_PERF_OR_BUDDY bool - depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH + depends on HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY @@ -1098,9 +1098,9 @@ config HARDLOCKUP_CHECK_TIMESTAMP config HARDLOCKUP_DETECTOR bool "Detect Hard Lockups" depends on DEBUG_KERNEL && !S390 - depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || HAVE_HARDLOCKUP_DETECTOR_ARCH + depends on HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH select LOCKUP_DETECTOR - select HARDLOCKUP_DETECTOR_NON_ARCH if HAVE_HARDLOCKUP_DETECTOR_NON_ARCH + select HARDLOCKUP_DETECTOR_PERF_OR_BUDDY if HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY help Say Y here to enable the kernel to act as a watchdog to detect -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 09/10] watchdog/hardlockup: Move SMP barriers from common code to buddy code
It's been suggested that since the SMP barriers are only potentially useful for the buddy hardlockup detector, not the perf hardlockup detector, that the barriers belong in the buddy code. Let's move them and add clearer comments about why they're needed. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- kernel/watchdog.c | 6 -- kernel/watchdog_buddy.c | 21 + 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 6cc46b8e3d07..a351ab0c35eb 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -109,9 +109,6 @@ EXPORT_SYMBOL(arch_touch_nmi_watchdog); void watchdog_hardlockup_touch_cpu(unsigned int cpu) { per_cpu(watchdog_hardlockup_touched, cpu) = true; - - /* Match with smp_rmb() in watchdog_hardlockup_check() */ - smp_wmb(); } static bool is_hardlockup(unsigned int cpu) @@ -141,9 +138,6 @@ static void watchdog_hardlockup_kick(void) void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs) { - /* Match with smp_wmb() in watchdog_hardlockup_touch_cpu() */ - smp_rmb(); - if (per_cpu(watchdog_hardlockup_touched, cpu)) { per_cpu(watchdog_hardlockup_touched, cpu) = false; return; diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c index 2ef88722c5e7..34dbfe091f4b 100644 --- a/kernel/watchdog_buddy.c +++ b/kernel/watchdog_buddy.c @@ -51,6 +51,13 @@ void watchdog_hardlockup_enable(unsigned int cpu) if (next_cpu < nr_cpu_ids) watchdog_hardlockup_touch_cpu(next_cpu); + /* +* Makes sure that watchdog is touched on this CPU before +* other CPUs could see it in watchdog_cpus. The counter +* part is in watchdog_buddy_check_hardlockup(). +*/ + smp_wmb(); + cpumask_set_cpu(cpu, _cpus); } @@ -68,6 +75,13 @@ void watchdog_hardlockup_disable(unsigned int cpu) if (next_cpu < nr_cpu_ids) watchdog_hardlockup_touch_cpu(next_cpu); + /* +* Makes sure that watchdog is touched on the next CPU before +* this CPU disappear in watchdog_cpus. The counter part is in +* watchdog_buddy_check_hardlockup(). +*/ + smp_wmb(); + cpumask_clear_cpu(cpu, _cpus); } @@ -88,5 +102,12 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts) if (next_cpu >= nr_cpu_ids) return; + /* +* Make sure that the watchdog was touched on next CPU when +* watchdog_next_cpu() returned another one because of +* a change in watchdog_hardlockup_enable()/disable(). +*/ + smp_rmb(); + watchdog_hardlockup_check(next_cpu, NULL); } -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 08/10] watchdog/buddy: Simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
The dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY was more complicated than it needed to be. If the "perf" detector is available and we have SMP then we have a choice, so enable the config based on just those two config items. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index c9df93402237..eb1edd5905bc 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1065,7 +1065,7 @@ config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH config HARDLOCKUP_DETECTOR_PREFER_BUDDY bool "Prefer the buddy CPU hardlockup detector" - depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH && HAVE_HARDLOCKUP_DETECTOR_PERF && SMP + depends on HAVE_HARDLOCKUP_DETECTOR_PERF && SMP help Say Y here to prefer the buddy hardlockup detector over the perf one. -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 07/10] watchdog/buddy: Don't copy the cpumask in watchdog_next_cpu()
There's no reason to make a copy of the "watchdog_cpus" locally in watchdog_next_cpu(). Making a copy wouldn't make things any more race free and we're just reading the value so there's no need for a copy. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- kernel/watchdog_buddy.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c index 3ffc5f2ede5a..2ef88722c5e7 100644 --- a/kernel/watchdog_buddy.c +++ b/kernel/watchdog_buddy.c @@ -10,12 +10,11 @@ static cpumask_t __read_mostly watchdog_cpus; static unsigned int watchdog_next_cpu(unsigned int cpu) { - cpumask_t cpus = watchdog_cpus; unsigned int next_cpu; - next_cpu = cpumask_next(cpu, ); + next_cpu = cpumask_next(cpu, _cpus); if (next_cpu >= nr_cpu_ids) - next_cpu = cpumask_first(); + next_cpu = cpumask_first(_cpus); if (next_cpu == cpu) return nr_cpu_ids; -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 06/10] watchdog/buddy: Cleanup how watchdog_buddy_check_hardlockup() is called
In the patch ("watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs"), we added a call from the common watchdog.c file into the buddy. That call could be done more cleanly. Specifically: 1. If we move the call into watchdog_hardlockup_kick() then it keeps watchdog_timer_fn() simpler. 2. We don't need to pass an "unsigned long" to the buddy for the timer count. In the patch ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") the count was changed to "atomic_t" which is backed by an int, so we should match types. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- include/linux/nmi.h | 4 ++-- kernel/watchdog.c | 15 +++ kernel/watchdog_buddy.c | 2 +- 3 files changed, 10 insertions(+), 11 deletions(-) diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 6c6a5ce77c66..43893e5f858a 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -114,9 +114,9 @@ void watchdog_hardlockup_disable(unsigned int cpu); void lockup_detector_reconfigure(void); #ifdef CONFIG_HARDLOCKUP_DETECTOR_BUDDY -void watchdog_buddy_check_hardlockup(unsigned long hrtimer_interrupts); +void watchdog_buddy_check_hardlockup(int hrtimer_interrupts); #else -static inline void watchdog_buddy_check_hardlockup(unsigned long hrtimer_interrupts) {} +static inline void watchdog_buddy_check_hardlockup(int hrtimer_interrupts) {} #endif /** diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 85f4839b6faf..6cc46b8e3d07 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -131,9 +131,12 @@ static bool is_hardlockup(unsigned int cpu) return false; } -static unsigned long watchdog_hardlockup_kick(void) +static void watchdog_hardlockup_kick(void) { - return atomic_inc_return(this_cpu_ptr(_interrupts)); + int new_interrupts; + + new_interrupts = atomic_inc_return(this_cpu_ptr(_interrupts)); + watchdog_buddy_check_hardlockup(new_interrupts); } void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs) @@ -195,7 +198,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs) #else /* CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */ -static inline unsigned long watchdog_hardlockup_kick(void) { return 0; } +static inline void watchdog_hardlockup_kick(void) { } #endif /* !CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */ @@ -449,15 +452,11 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) struct pt_regs *regs = get_irq_regs(); int duration; int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace; - unsigned long hrtimer_interrupts; if (!watchdog_enabled) return HRTIMER_NORESTART; - hrtimer_interrupts = watchdog_hardlockup_kick(); - - /* test for hardlockups */ - watchdog_buddy_check_hardlockup(hrtimer_interrupts); + watchdog_hardlockup_kick(); /* kick the softlockup detector */ if (completion_done(this_cpu_ptr(_completion))) { diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c index fee45af2e5bd..3ffc5f2ede5a 100644 --- a/kernel/watchdog_buddy.c +++ b/kernel/watchdog_buddy.c @@ -72,7 +72,7 @@ void watchdog_hardlockup_disable(unsigned int cpu) cpumask_clear_cpu(cpu, _cpus); } -void watchdog_buddy_check_hardlockup(unsigned long hrtimer_interrupts) +void watchdog_buddy_check_hardlockup(int hrtimer_interrupts) { unsigned int next_cpu; -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 05/10] watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
In the patch ("watchdog/hardlockup: add comments to touch_nmi_watchdog()") we adjusted some comments for touch_nmi_watchdog(). The comment about the softlockup had a typo and were also felt to be too obvious. Remove it. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- include/linux/nmi.h | 4 1 file changed, 4 deletions(-) diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 3a27169ec383..6c6a5ce77c66 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -140,10 +140,6 @@ static inline void touch_nmi_watchdog(void) */ arch_touch_nmi_watchdog(); - /* -* Touching the hardlock detector implcitily pets the -* softlockup detector too -*/ touch_softlockup_watchdog(); } -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 04/10] watchdog/hardlockup: In watchdog_hardlockup_check() use cpumask_copy()
In the patch ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") we started using a cpumask to keep track of which CPUs to backtrace. When setting up this cpumask, it's better to use cpumask_copy() than to just copy the structure directly. Fix this. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- kernel/watchdog.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 32dac8028753..85f4839b6faf 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -154,7 +154,9 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs) */ if (is_hardlockup(cpu)) { unsigned int this_cpu = smp_processor_id(); - struct cpumask backtrace_mask = *cpu_online_mask; + struct cpumask backtrace_mask; + + cpumask_copy(_mask, cpu_online_mask); /* Only print hardlockups once. */ if (per_cpu(watchdog_hardlockup_warned, cpu)) -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 03/10] watchdog/hardlockup: Don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
In the patch ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") there was no reason to use raw_cpu_ptr(). Using this_cpu_ptr() works fine. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- kernel/watchdog.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 62230f5b8878..32dac8028753 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -133,7 +133,7 @@ static bool is_hardlockup(unsigned int cpu) static unsigned long watchdog_hardlockup_kick(void) { - return atomic_inc_return(raw_cpu_ptr(_interrupts)); + return atomic_inc_return(this_cpu_ptr(_interrupts)); } void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs) -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 00/10] watchdog: Cleanup / fixes after buddy series v5 reviews
This patch series attempts to finish resolving the feedback received from Petr Mladek on the v5 series I posted. Andrew has already landed v5 so I'm posting this as additional patches. Probably the only thing that wasn't fully as clean as Petr requested was the Kconfig stuff. I couldn't find a better way to express it without a more major overhaul. In the very least, I renamed "NON_ARCH" to "PERF_OR_BUDDY" in the hopes that will make it marginally better. Nothing in this series is terribly critical and even the bugfixes are small. However, it does cleanup a few things that were pointed out in review. Douglas Anderson (10): watchdog/hardlockup: Keep kernel.nmi_watchdog sysctl as 0444 if probe fails watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe() watchdog/hardlockup: Don't use raw_cpu_ptr() in watchdog_hardlockup_kick() watchdog/hardlockup: In watchdog_hardlockup_check() use cpumask_copy() watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog() watchdog/buddy: Cleanup how watchdog_buddy_check_hardlockup() is called watchdog/buddy: Don't copy the cpumask in watchdog_next_cpu() watchdog/buddy: Simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY watchdog/hardlockup: Move SMP barriers from common code to buddy code watchdog/hardlockup: Rename HAVE_HARDLOCKUP_DETECTOR_NON_ARCH to ..._PERF_OR_BUDDY arch/Kconfig| 3 +- arch/sparc/kernel/nmi.c | 5 +++ include/linux/nmi.h | 14 ++--- kernel/watchdog.c | 68 ++--- kernel/watchdog_buddy.c | 28 ++--- lib/Kconfig.debug | 14 - 6 files changed, 70 insertions(+), 62 deletions(-) -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 01/10] watchdog/hardlockup: Keep kernel.nmi_watchdog sysctl as 0444 if probe fails
The permissions for the kernel.nmi_watchdog sysctl have always been set at compile time despite the fact that a watchdog can fail to probe. Let's fix this and set the permissions based on whether the hardlockup detector actually probed. Fixes: a994a3147e4c ("watchdog/hardlockup/perf: Implement init time detection of perf") Reported-by: Petr Mladek Closes: https://lore.kernel.org/r/ZHCn4hNxFpY5-9Ki@alley Signed-off-by: Douglas Anderson --- include/linux/nmi.h | 6 -- kernel/watchdog.c | 30 -- 2 files changed, 20 insertions(+), 16 deletions(-) diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 333465e235e1..3a27169ec383 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -95,12 +95,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs); static inline void arch_touch_nmi_watchdog(void) { } #endif -#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR) -# define NMI_WATCHDOG_SYSCTL_PERM 0644 -#else -# define NMI_WATCHDOG_SYSCTL_PERM 0444 -#endif - #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) extern void hardlockup_detector_perf_stop(void); extern void hardlockup_detector_perf_restart(void); diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 237990e8d345..4b9e31edb47f 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -880,15 +880,6 @@ static struct ctl_table watchdog_sysctls[] = { .extra1 = SYSCTL_ZERO, .extra2 = (void *), }, - { - .procname = "nmi_watchdog", - .data = _hardlockup_user_enabled, - .maxlen = sizeof(int), - .mode = NMI_WATCHDOG_SYSCTL_PERM, - .proc_handler = proc_nmi_watchdog, - .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, - }, { .procname = "watchdog_cpumask", .data = _cpumask_bits, @@ -952,10 +943,28 @@ static struct ctl_table watchdog_sysctls[] = { {} }; +static struct ctl_table watchdog_hardlockup_sysctl[] = { + { + .procname = "nmi_watchdog", + .data = _hardlockup_user_enabled, + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = proc_nmi_watchdog, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, + {} +}; + static void __init watchdog_sysctl_init(void) { register_sysctl_init("kernel", watchdog_sysctls); + + if (watchdog_hardlockup_available) + watchdog_hardlockup_sysctl[0].mode = 0644; + register_sysctl_init("kernel", watchdog_hardlockup_sysctl); } + #else #define watchdog_sysctl_init() do { } while (0) #endif /* CONFIG_SYSCTL */ @@ -1011,6 +1020,8 @@ static int __init lockup_detector_check(void) /* Make sure no work is pending. */ flush_work(_work); + watchdog_sysctl_init(); + return 0; } @@ -1030,5 +1041,4 @@ void __init lockup_detector_init(void) allow_lockup_detector_init_retry = true; lockup_detector_setup(); - watchdog_sysctl_init(); } -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH 02/10] watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
Right now there is one arch (sparc64) that selects HAVE_NMI_WATCHDOG without selecting HAVE_HARDLOCKUP_DETECTOR_ARCH. Because of that one architecture, we have some special case code in the watchdog core to handle the fact that watchdog_hardlockup_probe() isn't implemented. Let's implement watchdog_hardlockup_probe() for sparc64 and get rid of the special case. As a side effect of doing this, code inspection tells us that we could fix a minor bug where the system won't properly realize that NMI watchdogs are disabled. Specifically, on powerpc if CONFIG_PPC_WATCHDOG is turned off the arch might still select CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH which selects CONFIG_HAVE_NMI_WATCHDOG. Since CONFIG_PPC_WATCHDOG was off then nothing will override the "weak" watchdog_hardlockup_probe() and we'll fallback to looking at CONFIG_HAVE_NMI_WATCHDOG. Suggested-by: Petr Mladek Signed-off-by: Douglas Anderson --- Though this does fix a minor bug, I didn't mark this as "Fixes" because it's super minor. One could also argue that this wasn't a bug at all but simply was never an implemented feature. The code that added some amount of dynamicness here was commit a994a3147e4c ("watchdog/hardlockup/perf: Implement init time detection of perf") which, as per the title, was only intending to make "perf" dynamic. The old NMI watchdog presumably has never been handled dynamically. arch/Kconfig| 3 ++- arch/sparc/kernel/nmi.c | 5 + kernel/watchdog.c | 13 - 3 files changed, 7 insertions(+), 14 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 64d771855ecd..b4f6387b12fe 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -428,7 +428,8 @@ config HAVE_NMI_WATCHDOG bool help The arch provides a low level NMI watchdog. It provides - asm/nmi.h, and defines its own arch_touch_nmi_watchdog(). + asm/nmi.h, and defines its own watchdog_hardlockup_probe() and + arch_touch_nmi_watchdog(). config HAVE_HARDLOCKUP_DETECTOR_ARCH bool diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c index 9d9e29b75c43..17cdfdbf1f3b 100644 --- a/arch/sparc/kernel/nmi.c +++ b/arch/sparc/kernel/nmi.c @@ -65,6 +65,11 @@ void arch_touch_nmi_watchdog(void) } EXPORT_SYMBOL(arch_touch_nmi_watchdog); +int __init watchdog_hardlockup_probe(void) +{ + return 0; +} + static void die_nmi(const char *str, struct pt_regs *regs, int do_panic) { int this_cpu = smp_processor_id(); diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4b9e31edb47f..62230f5b8878 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -217,19 +217,6 @@ void __weak watchdog_hardlockup_disable(unsigned int cpu) { } */ int __weak __init watchdog_hardlockup_probe(void) { - /* -* If CONFIG_HAVE_NMI_WATCHDOG is defined then an architecture -* is assumed to have the hard watchdog available and we return 0. -*/ - if (IS_ENABLED(CONFIG_HAVE_NMI_WATCHDOG)) - return 0; - - /* -* Hardlockup detectors other than those using CONFIG_HAVE_NMI_WATCHDOG -* are required to implement a non-weak version of this probe function -* to tell whether they are available. If they don't override then -* we'll return -ENODEV. -*/ return -ENODEV; } -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
Use mmu_notifier_test_clear_young() to handle KVM PTEs in batches when the fast path is supported. This reduces the contention on kvm->mmu_lock when the host is under heavy memory pressure. An existing selftest can quickly demonstrate the effectiveness of this patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM: $ sudo max_guest_memory_test -c 64 -m 250 -s 250 MGLRU run2 -- Before [1]~64s After ~51s kswapd (MGLRU before) 100.00% balance_pgdat 100.00% shrink_node 100.00% shrink_one 99.99% try_to_shrink_lruvec 99.71% evict_folios 97.29% shrink_folio_list ==>> 13.05% folio_referenced 12.83% rmap_walk_file 12.31% folio_referenced_one 7.90% __mmu_notifier_clear_young 7.72% kvm_mmu_notifier_clear_young 7.34% _raw_write_lock kswapd (MGLRU after) 100.00% balance_pgdat 100.00% shrink_node 100.00% shrink_one 99.99% try_to_shrink_lruvec 99.59% evict_folios 80.37% shrink_folio_list ==>> 3.74% folio_referenced 3.59% rmap_walk_file 3.19% folio_referenced_one 2.53% lru_gen_look_around 1.06% __mmu_notifier_test_clear_young [1] "mm: rmap: Don't flush TLB after checking PTE young for page reference" was included so that the comparison is apples to apples. https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/ Signed-off-by: Yu Zhao --- Documentation/admin-guide/mm/multigen_lru.rst | 6 +- include/linux/mmzone.h| 6 +- mm/rmap.c | 8 +- mm/vmscan.c | 139 -- 4 files changed, 138 insertions(+), 21 deletions(-) diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..0ae2a6d4d94c 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -48,6 +48,10 @@ Values Components verified on x86 varieties other than Intel and AMD. If it is disabled, the multi-gen LRU will suffer a negligible performance degradation. +0x0008 Clearing the accessed bit in KVM page table entries in large + batches, when KVM MMU sets it (e.g., on x86_64). This can + improve the performance of guests when the host is under memory + pressure. [yYnN] Apply to all the components above. == === @@ -56,7 +60,7 @@ E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled -0x0007 +0x000f echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5a7ada0413da..1b148f39fabc 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -369,6 +369,7 @@ enum { LRU_GEN_CORE, LRU_GEN_MM_WALK, LRU_GEN_NONLEAF_YOUNG, + LRU_GEN_KVM_MMU_WALK, NR_LRU_GEN_CAPS }; @@ -471,7 +472,7 @@ struct lru_gen_mm_walk { }; void lru_gen_init_lruvec(struct lruvec *lruvec); -void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw); #ifdef CONFIG_MEMCG @@ -559,8 +560,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec) { } -static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { + return false; } #ifdef CONFIG_MEMCG diff --git a/mm/rmap.c b/mm/rmap.c index ae127f60a4fb..3a2c4e6a0887 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -825,12 +825,10 @@ static bool folio_referenced_one(struct folio *folio, return false; /* To break the loop */ } - if (pvmw.pte) { - if (lru_gen_enabled() && pte_young(*pvmw.pte)) { - lru_gen_look_around(); + if (lru_gen_enabled() && pvmw.pte) { + if (lru_gen_look_around()) referenced++; - } - + } else if (pvmw.pte) { if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) referenced++; diff --git a/mm/vmscan.c b/mm/vmscan.c index ef687f9be13c..3f734588b600 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -58,6 +58,7 @@ #include #include #include +#include #include #include @@ -3244,6 +3245,20 @@ static bool should_clear_pmd_young(void) return
[PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()
Implement kvm_arch_test_clear_young() to support the fast path in mmu_notifier_ops->test_clear_young(). It focuses on a simple case, i.e., TDP MMU sets the accessed bit in KVM PTEs and VMs are not nested, where it can rely on RCU and clear_bit() to safely clear the accessed bit without taking kvm->mmu_lock. Complex cases fall back to the existing slow path where kvm->mmu_lock is then taken. Signed-off-by: Yu Zhao --- arch/x86/include/asm/kvm_host.h | 7 +++ arch/x86/kvm/mmu/tdp_mmu.c | 34 + 2 files changed, 41 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 753c67072c47..d6dfdebe3d94 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2223,4 +2223,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages); */ #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1) +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young +static inline bool kvm_arch_has_test_clear_young(void) +{ + return IS_ENABLED(CONFIG_X86_64) && + (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && shadow_accessed_mask)); +} + #endif /* _ASM_X86_KVM_HOST_H */ diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 08340219c35a..6875a819e007 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn); } +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range) +{ + struct kvm_mmu_page *root; + int offset = ffs(shadow_accessed_mask) - 1; + + if (kvm_shadow_root_allocated(kvm)) + return true; + + rcu_read_lock(); + + list_for_each_entry_rcu(root, >arch.tdp_mmu_roots, link) { + struct tdp_iter iter; + + if (kvm_mmu_page_as_id(root) != range->slot->as_id) + continue; + + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) { + u64 *sptep = rcu_dereference(iter.sptep); + + VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep))); + + if (!(iter.old_spte & shadow_accessed_mask)) + continue; + + if (kvm_should_clear_young(range, iter.gfn)) + clear_bit(offset, (unsigned long *)sptep); + } + } + + rcu_read_unlock(); + + return false; +} + static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter, struct kvm_gfn_range *range) { -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
tdp_mmu_enabled and shadow_accessed_mask are needed to implement kvm_arch_has_test_clear_young(). Signed-off-by: Yu Zhao --- arch/x86/include/asm/kvm_host.h | 6 ++ arch/x86/kvm/mmu.h | 6 -- arch/x86/kvm/mmu/spte.h | 1 - 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fb9d1f2d6136..753c67072c47 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf { extern u32 __read_mostly kvm_nr_uret_msrs; extern u64 __read_mostly host_efer; +extern u64 __read_mostly shadow_accessed_mask; extern bool __read_mostly allow_smaller_maxphyaddr; extern bool __read_mostly enable_apicv; extern struct kvm_x86_ops kvm_x86_ops; @@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin, bool mask); extern bool tdp_enabled; +#ifdef CONFIG_X86_64 +extern bool tdp_mmu_enabled; +#else +#define tdp_mmu_enabled false +#endif u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 92d5a1924fc1..84aedb2671ef 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -253,12 +253,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm) return smp_load_acquire(>arch.shadow_root_allocated); } -#ifdef CONFIG_X86_64 -extern bool tdp_mmu_enabled; -#else -#define tdp_mmu_enabled false -#endif - static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) { return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm); diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 1279db2eab44..a82c4fa1c47b 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask; extern u64 __read_mostly shadow_nx_mask; extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */ extern u64 __read_mostly shadow_user_mask; -extern u64 __read_mostly shadow_accessed_mask; extern u64 __read_mostly shadow_dirty_mask; extern u64 __read_mostly shadow_mmio_value; extern u64 __read_mostly shadow_mmio_mask; -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()
Implement kvm_arch_test_clear_young() to support the fast path in mmu_notifier_ops->test_clear_young(). It focuses on a simple case, i.e., radix MMU sets the accessed bit in KVM PTEs and VMs are not nested, where it can rely on RCU and pte_xchg() to safely clear the accessed bit without taking kvm->mmu_lock. Complex cases fall back to the existing slow path where kvm->mmu_lock is then taken. Signed-off-by: Yu Zhao --- arch/powerpc/include/asm/kvm_host.h| 8 arch/powerpc/include/asm/kvm_ppc.h | 1 + arch/powerpc/kvm/book3s.c | 6 +++ arch/powerpc/kvm/book3s.h | 1 + arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++ arch/powerpc/kvm/book3s_hv.c | 5 +++ 6 files changed, 80 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 14ee0dece853..75c260ea8a9e 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young +static inline bool kvm_arch_has_test_clear_young(void) +{ + return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) && + cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) && + radix_enabled(); +} + #endif /* __POWERPC_KVM_HOST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 79a9c0bb8bba..ff1af6a7b44f 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -287,6 +287,7 @@ struct kvmppc_ops { bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); + bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); void (*free_memslot)(struct kvm_memory_slot *slot); int (*init_vm)(struct kvm *kvm); diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 686d8d9eda3e..37bf40b0c4ff 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) return kvm->arch.kvm_ops->test_age_gfn(kvm, range); } +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range) +{ + return !kvm->arch.kvm_ops->test_clear_young || + kvm->arch.kvm_ops->test_clear_young(kvm, range); +} + bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { return kvm->arch.kvm_ops->set_spte_gfn(kvm, range); diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h index 58391b4b32ed..fa2659e21ccc 100644 --- a/arch/powerpc/kvm/book3s.h +++ b/arch/powerpc/kvm/book3s.h @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm, extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu); diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 3b65b3b11041..0a392e9a100a 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, return ref; } +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range) +{ + bool err; + gfn_t gfn = range->start; + + rcu_read_lock(); + + err = !kvm_is_radix(kvm); + if (err) + goto unlock; + + /* +* Case 1: This function kvmppc_switch_mmu_to_hpt() +* +* rcu_read_lock() +* Test kvm_is_radix()kvm->arch.radix = 0 +* Use kvm->arch.pgtable synchronize_rcu() +* rcu_read_unlock() +* kvmppc_free_radix() +* +* +* Case 2: This function kvmppc_switch_mmu_to_radix() +* +* kvmppc_init_vm_radix() +* smp_wmb() +* Test kvm_is_radix()kvm->arch.radix = 1 +* smp_rmb() +* Use kvm->arch.pgtable +*/ +
[PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe
KVM page tables are currently not RCU safe against remapping, i.e., kvmppc_unmap_free_pmd_entry_table() et al. The previous mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with that operation. However, the new mmu_notifier_ops member test_clear_young() provides a fast path that does not take kvm->mmu_lock. To implement kvm_arch_test_clear_young() for that path, orphan page tables need to be freed by RCU. Unmapping, specifically kvm_unmap_radix(), does not free page tables, hence not a concern. Signed-off-by: Yu Zhao --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 461307b89c3a..3b65b3b11041 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void) { unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE; - kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor); + kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, + SLAB_TYPESAFE_BY_RCU, pte_ctor); if (!kvm_pte_cache) return -ENOMEM; size = sizeof(void *) << RADIX_PMD_INDEX_SIZE; - kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor); + kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, + SLAB_TYPESAFE_BY_RCU, pmd_ctor); if (!kvm_pmd_cache) { kmem_cache_destroy(kvm_pte_cache); return -ENOMEM; -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()
Implement kvm_arch_test_clear_young() to support the fast path in mmu_notifier_ops->test_clear_young(). It focuses on a simple case, i.e., hardware sets the accessed bit in KVM PTEs and VMs are not protected, where it can rely on RCU and cmpxchg to safely clear the accessed bit without taking kvm->mmu_lock. Complex cases fall back to the existing slow path where kvm->mmu_lock is then taken. Signed-off-by: Yu Zhao --- arch/arm64/include/asm/kvm_host.h | 6 ++ arch/arm64/kvm/mmu.c | 36 +++ 2 files changed, 42 insertions(+) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 7e7e19ef6993..da32b0890716 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { } void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu); bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu); +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young +static inline bool kvm_arch_has_test_clear_young(void) +{ + return cpu_has_hw_af() && !is_protected_kvm_enabled(); +} + #endif /* __ARM64_KVM_HOST_H__ */ diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index c3b3e2afe26f..26a8d955b49c 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) range->start << PAGE_SHIFT); } +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx, + enum kvm_pgtable_walk_flags flags) +{ + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF; + + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep))); + + if (!kvm_pte_valid(new)) + return 0; + + if (new == ctx->old) + return 0; + + if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE)) + stage2_try_set_pte(ctx, new); + + return 0; +} + +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range) +{ + u64 start = range->start * PAGE_SIZE; + u64 end = range->end * PAGE_SIZE; + struct kvm_pgtable_walker walker = { + .cb = stage2_test_clear_young, + .arg= range, + .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED, + }; + + BUILD_BUG_ON(is_hyp_code()); + + kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, ); + + return false; +} + phys_addr_t kvm_mmu_get_httbr(void) { return __pa(hyp_pgtable->pgd); -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit
TLDR This patchset adds a fast path to clear the accessed bit without taking kvm->mmu_lock. It can significantly improve the performance of guests when the host is under heavy memory pressure. ChromeOS has been using a similar approach [1] since mid 2021 and it was proven successful on tens of millions devices. This v2 addressed previous requests [2] on refactoring code, removing inaccurate/redundant texts, etc. [1] https://crrev.com/c/2987928 [2] https://lore.kernel.org/r/20230217041230.2417228-1-yuz...@google.com/ Overview The goal of this patchset is to optimize the performance of guests when the host memory is overcommitted. It focuses on a simple yet common case where hardware sets the accessed bit in KVM PTEs and VMs are not nested. Complex cases fall back to the existing slow path where kvm->mmu_lock is then taken. The fast path relies on two techniques to safely clear the accessed bit: RCU and CAS. The former protects KVM page tables from being freed while the latter clears the accessed bit atomically against both the hardware and other software page table walkers. A new mmu_notifier_ops member, test_clear_young(), supersedes the existing clear_young() and test_young(). This extended callback can operate on a range of KVM PTEs individually according to a bitmap, if the caller provides it. Evaluation == An existing selftest can quickly demonstrate the effectiveness of this patchset. On a generic workstation equipped with 128 CPUs and 256GB DRAM: $ sudo max_guest_memory_test -c 64 -m 250 -s 250 MGLRU run2 -- Before [1]~64s After ~51s kswapd (MGLRU before) 100.00% balance_pgdat 100.00% shrink_node 100.00% shrink_one 99.99% try_to_shrink_lruvec 99.71% evict_folios 97.29% shrink_folio_list ==>> 13.05% folio_referenced 12.83% rmap_walk_file 12.31% folio_referenced_one 7.90% __mmu_notifier_clear_young 7.72% kvm_mmu_notifier_clear_young 7.34% _raw_write_lock kswapd (MGLRU after) 100.00% balance_pgdat 100.00% shrink_node 100.00% shrink_one 99.99% try_to_shrink_lruvec 99.59% evict_folios 80.37% shrink_folio_list ==>> 3.74% folio_referenced 3.59% rmap_walk_file 3.19% folio_referenced_one 2.53% lru_gen_look_around 1.06% __mmu_notifier_test_clear_young Comprehensive benchmarks are coming soon. [1] "mm: rmap: Don't flush TLB after checking PTE young for page reference" was included so that the comparison is apples to apples. https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/ Yu Zhao (10): mm/kvm: add mmu_notifier_ops->test_clear_young() mm/kvm: use mmu_notifier_ops->test_clear_young() kvm/arm64: export stage2_try_set_pte() and macros kvm/arm64: make stage2 page tables RCU safe kvm/arm64: add kvm_arch_test_clear_young() kvm/powerpc: make radix page tables RCU safe kvm/powerpc: add kvm_arch_test_clear_young() kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask kvm/x86: add kvm_arch_test_clear_young() mm: multi-gen LRU: use mmu_notifier_test_clear_young() Documentation/admin-guide/mm/multigen_lru.rst | 6 +- arch/arm64/include/asm/kvm_host.h | 6 + arch/arm64/include/asm/kvm_pgtable.h | 55 +++ arch/arm64/kvm/arm.c | 1 + arch/arm64/kvm/hyp/pgtable.c | 61 +--- arch/arm64/kvm/mmu.c | 53 ++- arch/powerpc/include/asm/kvm_host.h | 8 + arch/powerpc/include/asm/kvm_ppc.h| 1 + arch/powerpc/kvm/book3s.c | 6 + arch/powerpc/kvm/book3s.h | 1 + arch/powerpc/kvm/book3s_64_mmu_radix.c| 65 +++- arch/powerpc/kvm/book3s_hv.c | 5 + arch/x86/include/asm/kvm_host.h | 13 ++ arch/x86/kvm/mmu.h| 6 - arch/x86/kvm/mmu/spte.h | 1 - arch/x86/kvm/mmu/tdp_mmu.c| 34 + include/linux/kvm_host.h | 22 +++ include/linux/mmu_notifier.h | 79 ++ include/linux/mmzone.h| 6 +- include/trace/events/kvm.h| 15 -- mm/mmu_notifier.c | 48 ++ mm/rmap.c | 8 +- mm/vmscan.c | 139 -- virt/kvm/kvm_main.c | 114 -- 24 files changed, 546 insertions(+), 207 deletions(-) -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
Stage2 page tables are currently not RCU safe against unmapping or VM destruction. The previous mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with those operations. However, the new mmu_notifier_ops member test_clear_young() provides a fast path that does not take kvm->mmu_lock. To implement kvm_arch_test_clear_young() for that path, unmapped page tables need to be freed by RCU and kvm_free_stage2_pgd() needs to be after mmu_notifier_unregister(). Remapping, specifically stage2_free_removed_table(), is already RCU safe. Signed-off-by: Yu Zhao --- arch/arm64/include/asm/kvm_pgtable.h | 2 ++ arch/arm64/kvm/arm.c | 1 + arch/arm64/kvm/hyp/pgtable.c | 8 ++-- arch/arm64/kvm/mmu.c | 17 - 4 files changed, 25 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index ff520598b62c..5cab52e3a35f 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level) * @put_page: Decrement the refcount on a page. When the * refcount reaches 0 the page is automatically * freed. + * @put_page_rcu: RCU variant of the above. * @page_count:Return the refcount of a page. * @phys_to_virt: Convert a physical address into a virtual * address mapped in the current context. @@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops { void(*free_removed_table)(void *addr, u32 level); void(*get_page)(void *addr); void(*put_page)(void *addr); + void(*put_page_rcu)(void *addr); int (*page_count)(void *addr); void* (*phys_to_virt)(phys_addr_t phys); phys_addr_t (*virt_to_phys)(void *addr); diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 14391826241c..ee93271035d9 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) */ void kvm_arch_destroy_vm(struct kvm *kvm) { + kvm_free_stage2_pgd(>arch.mmu); bitmap_free(kvm->arch.pmu_filter); free_cpumask_var(kvm->arch.supported_cpus); diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index 24678ccba76a..dbace4c6a841 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx, mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops), kvm_granule_size(ctx->level)); - if (childp) - mm_ops->put_page(childp); + if (childp) { + if (mm_ops->put_page_rcu) + mm_ops->put_page_rcu(childp); + else + mm_ops->put_page(childp); + } return 0; } diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 3b9d4d24c361..c3b3e2afe26f 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr) return page_count(virt_to_page(addr)); } +static void kvm_s2_rcu_put_page(struct rcu_head *head) +{ + put_page(container_of(head, struct page, rcu_head)); +} + +static void kvm_s2_put_page_rcu(void *addr) +{ + struct page *page = virt_to_page(addr); + + if (kvm_host_page_count(addr) == 1) + kvm_account_pgtable_pages(addr, -1); + + call_rcu(>rcu_head, kvm_s2_rcu_put_page); +} + static phys_addr_t kvm_host_pa(void *addr) { return __pa(addr); @@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = { .free_removed_table = stage2_free_removed_table, .get_page = kvm_host_get_page, .put_page = kvm_s2_put_page, + .put_page_rcu = kvm_s2_put_page_rcu, .page_count = kvm_host_page_count, .phys_to_virt = kvm_host_va, .virt_to_phys = kvm_host_pa, @@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) void kvm_arch_flush_shadow_all(struct kvm *kvm) { - kvm_free_stage2_pgd(>arch.mmu); } void kvm_arch_flush_shadow_memslot(struct kvm *kvm, -- 2.41.0.rc0.172.g3f132b7071-goog
[PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young()
Replace test_young() and clear_young() with test_clear_young(). Signed-off-by: Yu Zhao --- include/linux/mmu_notifier.h | 29 ++- include/trace/events/kvm.h | 15 -- mm/mmu_notifier.c| 42 virt/kvm/kvm_main.c | 54 4 files changed, 2 insertions(+), 138 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index dfdbb370682d..c8f35fc08703 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -104,26 +104,6 @@ struct mmu_notifier_ops { unsigned long start, unsigned long end); - /* -* clear_young is a lightweight version of clear_flush_young. Like the -* latter, it is supposed to test-and-clear the young/accessed bitflag -* in the secondary pte, but it may omit flushing the secondary tlb. -*/ - int (*clear_young)(struct mmu_notifier *subscription, - struct mm_struct *mm, - unsigned long start, - unsigned long end); - - /* -* test_young is called to check the young/accessed bitflag in -* the secondary pte. This is used to know if the page is -* frequently used without actually clearing the flag or tearing -* down the secondary mapping on the page. -*/ - int (*test_young)(struct mmu_notifier *subscription, - struct mm_struct *mm, - unsigned long address); - int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end, bool clear, unsigned long *bitmap); @@ -393,11 +373,6 @@ extern void __mmu_notifier_release(struct mm_struct *mm); extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long start, unsigned long end); -extern int __mmu_notifier_clear_young(struct mm_struct *mm, - unsigned long start, - unsigned long end); -extern int __mmu_notifier_test_young(struct mm_struct *mm, -unsigned long address); extern int __mmu_notifier_test_clear_young(struct mm_struct *mm, unsigned long start, unsigned long end, bool clear, unsigned long *bitmap); @@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long end) { if (mm_has_notifiers(mm)) - return __mmu_notifier_clear_young(mm, start, end); + return __mmu_notifier_test_clear_young(mm, start, end, true, NULL); return 0; } @@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm, unsigned long address) { if (mm_has_notifiers(mm)) - return __mmu_notifier_test_young(mm, address); + return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL); return 0; } diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 3bd31ea23fee..46c347e56e60 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -489,21 +489,6 @@ TRACE_EVENT(kvm_age_hva, __entry->start, __entry->end) ); -TRACE_EVENT(kvm_test_age_hva, - TP_PROTO(unsigned long hva), - TP_ARGS(hva), - - TP_STRUCT__entry( - __field(unsigned long, hva ) - ), - - TP_fast_assign( - __entry->hva= hva; - ), - - TP_printk("mmu notifier test age hva: %#016lx", __entry->hva) -); - #endif /* _TRACE_KVM_MAIN_H */ /* This part must be outside protection */ diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 7e6aba4bddcb..c7e9747c9920 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -382,48 +382,6 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm, return young; } -int __mmu_notifier_clear_young(struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ - struct mmu_notifier *subscription; - int young = 0, id; - - id = srcu_read_lock(); - hlist_for_each_entry_rcu(subscription, ->notifier_subscriptions->list, hlist, -srcu_read_lock_held()) { - if (subscription->ops->clear_young) - young |= subscription->ops->clear_young(subscription, - mm, start, end); - } -
[PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
Add mmu_notifier_ops->test_clear_young() to supersede test_young() and clear_young(). test_clear_young() has a fast path, which if supported, allows its callers to safely clear the accessed bit without taking kvm->mmu_lock. The fast path requires arch-specific code that generally relies on RCU and CAS: the former protects KVM page tables from being freed while the latter clears the accessed bit atomically against both the hardware and other software page table walkers. If the fast path is unsupported, test_clear_young() falls back to the existing slow path where kvm->mmu_lock is then taken. test_clear_young() can also operate on a range of KVM PTEs individually according to a bitmap, if the caller provides it. Signed-off-by: Yu Zhao --- include/linux/kvm_host.h | 22 +++ include/linux/mmu_notifier.h | 52 mm/mmu_notifier.c| 24 virt/kvm/kvm_main.c | 76 +++- 4 files changed, 173 insertions(+), 1 deletion(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0e571e973bc2..374262545f96 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #ifdef KVM_ARCH_WANT_MMU_NOTIFIER struct kvm_gfn_range { struct kvm_memory_slot *slot; + void *args; gfn_t start; gfn_t end; pte_t pte; @@ -267,6 +268,27 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn); +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range); +#endif + +/* + * Architectures that implement kvm_arch_test_clear_young() should override + * kvm_arch_has_test_clear_young(). + * + * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it + * can return true if kvm_arch_test_clear_young() is supported but disabled due + * to some runtime constraint. In this case, kvm_arch_test_clear_young() should + * return true; otherwise, it should return false. + * + * For each young KVM PTE, kvm_arch_test_clear_young() should call + * kvm_should_clear_young() to decide whether to clear the accessed bit. + */ +#ifndef kvm_arch_has_test_clear_young +static inline bool kvm_arch_has_test_clear_young(void) +{ + return false; +} #endif enum { diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 64a3e051c3c4..dfdbb370682d 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -60,6 +60,8 @@ enum mmu_notifier_event { }; #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_RANGE_LOCKLESS(1 << 1) +#define MMU_NOTIFIER_RANGE_YOUNG (1 << 2) struct mmu_notifier_ops { /* @@ -122,6 +124,10 @@ struct mmu_notifier_ops { struct mm_struct *mm, unsigned long address); + int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end, + bool clear, unsigned long *bitmap); + /* * change_pte is called in cases that pte mapping to page is changed: * for example, when ksm remaps pte to point to a new shared page. @@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm, unsigned long end); extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); +extern int __mmu_notifier_test_clear_young(struct mm_struct *mm, + unsigned long start, unsigned long end, + bool clear, unsigned long *bitmap); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); @@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm, return 0; } +/* + * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within + * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS + * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise. + * + * The last parameter to the function is a bitmap and only the fast path + * supports it: if it is NULL, the function falls back to the slow path if the + * fast path was unsuccessful; otherwise, the function bails out. + * + * The bitmap has the following specifications: + * 1. The number of bits should be at least (end-start)/PAGE_SIZE. + * 2. The offset of each bit
[PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros
stage2_try_set_pte() and KVM_PTE_LEAF_ATTR_LO_S2_AF are needed to implement kvm_arch_test_clear_young(). Signed-off-by: Yu Zhao --- arch/arm64/include/asm/kvm_pgtable.h | 53 arch/arm64/kvm/hyp/pgtable.c | 53 2 files changed, 53 insertions(+), 53 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index dc3c072e862f..ff520598b62c 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -44,6 +44,49 @@ typedef u64 kvm_pte_t; #define KVM_PHYS_INVALID (-1ULL) +#define KVM_PTE_TYPE BIT(1) +#define KVM_PTE_TYPE_BLOCK 0 +#define KVM_PTE_TYPE_PAGE 1 +#define KVM_PTE_TYPE_TABLE 1 + +#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2) + +#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2) +#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6) +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3 +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1 +#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8) +#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3 +#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10) + +#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2) +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6) +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7) +#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8) +#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3 +#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10) + +#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51) + +#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55) + +#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54) + +#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54) + +#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \ +KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \ +KVM_PTE_LEAF_ATTR_HI_S2_XN) + +#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) +#define KVM_MAX_OWNER_ID 1 + +/* + * Used to indicate a pte for which a 'break-before-make' sequence is in + * progress. + */ +#define KVM_INVALID_PTE_LOCKED BIT(10) + static inline bool kvm_pte_valid(kvm_pte_t pte) { return pte & KVM_PTE_VALID; @@ -224,6 +267,16 @@ static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *c return ctx->flags & KVM_PGTABLE_WALK_SHARED; } +static inline bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new) +{ + if (!kvm_pgtable_walk_shared(ctx)) { + WRITE_ONCE(*ctx->ptep, new); + return true; + } + + return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old; +} + /** * struct kvm_pgtable_walker - Hook into a page-table walk. * @cb:Callback function to invoke during the walk. diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index 5282cb9ca4cf..24678ccba76a 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -12,49 +12,6 @@ #include -#define KVM_PTE_TYPE BIT(1) -#define KVM_PTE_TYPE_BLOCK 0 -#define KVM_PTE_TYPE_PAGE 1 -#define KVM_PTE_TYPE_TABLE 1 - -#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2) - -#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2) -#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6) -#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3 -#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1 -#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8) -#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3 -#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10) - -#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2) -#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6) -#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7) -#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8) -#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3 -#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10) - -#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51) - -#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55) - -#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54) - -#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54) - -#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \ -KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \ -KVM_PTE_LEAF_ATTR_HI_S2_XN) - -#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) -#define KVM_MAX_OWNER_ID 1 - -/* - * Used to indicate a pte for which a 'break-before-make' sequence is in - * progress. - */ -#define KVM_INVALID_PTE_LOCKED BIT(10) - struct kvm_pgtable_walk_data { struct kvm_pgtable_walker *walker; @@ -702,16 +659,6 @@ static bool stage2_pte_is_locked(kvm_pte_t pte) return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED); } -static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
Patch "spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 5.4-stable tree
This is a note to let you know that I've just added the patch titled spi: fsl-spi: Re-organise transfer bits_per_word adaptation to the 5.4-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch and it can be found in the queue-5.4 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:03 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:16 +0200 Subject: spi: fsl-spi: Re-organise transfer bits_per_word adaptation To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <1e4bfb4850ba849c316f48a0ab0d7123da0e2f54.1684156552.git.christophe.le...@csgroup.eu> From: Christophe Leroy (backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e) For different reasons, fsl-spi driver performs bits_per_word modifications for different reasons: - On CPU mode, to minimise amount of interrupts - On CPM/QE mode to work around controller byte order For CPU mode that's done in fsl_spi_prepare_message() while for CPM mode that's done in fsl_spi_setup_transfer(). Reunify all of it in fsl_spi_prepare_message(), and catch impossible cases early through master's bits_per_word_mask instead of returning EINVAL later. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 50 +- 1 file changed, 23 insertions(+), 27 deletions(-) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -204,26 +204,6 @@ static int mspi_apply_cpu_mode_quirks(st return bits_per_word; } -static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs, - struct spi_device *spi, - int bits_per_word) -{ - /* CPM/QE uses Little Endian for words > 8 -* so transform 16 and 32 bits words into 8 bits -* Unfortnatly that doesn't work for LSB so -* reject these for now */ - /* Note: 32 bits word, LSB works iff -* tfcr/rfcr is set to CPMFCR_GBL */ - if (spi->mode & SPI_LSB_FIRST && - bits_per_word > 8) - return -EINVAL; - if (bits_per_word <= 8) - return bits_per_word; - if (bits_per_word == 16 || bits_per_word == 32) - return 8; /* pretend its 8 bits */ - return -EINVAL; -} - static int fsl_spi_setup_transfer(struct spi_device *spi, struct spi_transfer *t) { @@ -251,9 +231,6 @@ static int fsl_spi_setup_transfer(struct bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi, mpc8xxx_spi, bits_per_word); - else - bits_per_word = mspi_apply_qe_mode_quirks(cs, spi, - bits_per_word); if (bits_per_word < 0) return bits_per_word; @@ -371,14 +348,27 @@ static int fsl_spi_do_one_msg(struct spi * In CPU mode, optimize large byte transfers to use larger * bits_per_word values to reduce number of interrupts taken. */ - if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { - list_for_each_entry(t, >transfers, transfer_list) { + list_for_each_entry(t, >transfers, transfer_list) { + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { if (t->len < 256 || t->bits_per_word != 8) continue; if ((t->len & 3) == 0) t->bits_per_word = 32; else if ((t->len & 1) == 0) t->bits_per_word = 16; + } else { + /* +* CPM/QE uses Little Endian for words > 8 +* so transform 16 and 32 bits words into 8 bits +* Unfortnatly that doesn't work for LSB so +* reject these for now +* Note: 32 bits word, LSB works iff +* tfcr/rfcr is set to CPMFCR_GBL +*/ + if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8) + return -EINVAL; + if (t->bits_per_word == 16 || t->bits_per_word == 32) + t->bits_per_word = 8; /* pretend its 8 bits */ } } @@ -637,8 +627,14
Patch "spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 5.4-stable tree
This is a note to let you know that I've just added the patch titled spi: fsl-cpm: Use 16 bit mode for large transfers with even size to the 5.4-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch and it can be found in the queue-5.4 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:07:59 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:17 +0200 Subject: spi: fsl-cpm: Use 16 bit mode for large transfers with even size To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <3a1b8774ad7004acb594fbf220f98488dbaa2896.1684156552.git.christophe.le...@csgroup.eu> From: Christophe Leroy (cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab) On CPM, the RISC core is a lot more efficiant when doing transfers in 16-bits chunks than in 8-bits chunks, but unfortunately the words need to be byte swapped as seen in a previous commit. So, for large tranfers with an even size, allocate a temporary tx buffer and byte-swap data before and after transfer. This change allows setting higher speed for transfer. For instance on an MPC 8xx (CPM1 comms RISC processor), the documentation tells that transfer in byte mode at 1 kbit/s uses 0.200% of CPM load at 25 MHz while a word transfer at the same speed uses 0.032% of CPM load. This means the speed can be 6 times higher in word mode for the same CPM load. For the time being, only do it on CPM1 as there must be a trade-off between the CPM load reduction and the CPU load required to byte swap the data. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-cpm.c | 23 +++ drivers/spi/spi-fsl-spi.c |3 +++ 2 files changed, 26 insertions(+) --- a/drivers/spi/spi-fsl-cpm.c +++ b/drivers/spi/spi-fsl-cpm.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "spi-fsl-cpm.h" #include "spi-fsl-lib.h" @@ -120,6 +121,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi mspi->rx_dma = mspi->dma_dummy_rx; mspi->map_rx_dma = 0; } + if (t->bits_per_word == 16 && t->tx_buf) { + const u16 *src = t->tx_buf; + u16 *dst; + int i; + + dst = kmalloc(t->len, GFP_KERNEL); + if (!dst) + return -ENOMEM; + + for (i = 0; i < t->len >> 1; i++) + dst[i] = cpu_to_le16p(src + i); + + mspi->tx = dst; + mspi->map_tx_dma = 1; + } if (mspi->map_tx_dma) { void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */ @@ -173,6 +189,13 @@ void fsl_spi_cpm_bufs_complete(struct mp if (mspi->map_rx_dma) dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE); mspi->xfer_in_progress = NULL; + + if (t->bits_per_word == 16 && t->rx_buf) { + int i; + + for (i = 0; i < t->len; i += 2) + le16_to_cpus(t->rx_buf + i); + } } EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete); --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -369,6 +369,9 @@ static int fsl_spi_do_one_msg(struct spi return -EINVAL; if (t->bits_per_word == 16 || t->bits_per_word == 32) t->bits_per_word = 8; /* pretend its 8 bits */ + if (t->bits_per_word == 8 && t->len >= 256 && + (mpc8xxx_spi->flags & SPI_CPM1)) + t->bits_per_word = 16; } } Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-5.4/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-5.4/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 5.15-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation to the 5.15-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch and it can be found in the queue-5.15 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:03 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:16 +0200 Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <1e4bfb4850ba849c316f48a0ab0d7123da0e2f54.1684156552.git.christophe.le...@csgroup.eu> From: Christophe Leroy (backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e) For different reasons, fsl-spi driver performs bits_per_word modifications for different reasons: - On CPU mode, to minimise amount of interrupts - On CPM/QE mode to work around controller byte order For CPU mode that's done in fsl_spi_prepare_message() while for CPM mode that's done in fsl_spi_setup_transfer(). Reunify all of it in fsl_spi_prepare_message(), and catch impossible cases early through master's bits_per_word_mask instead of returning EINVAL later. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 50 +- 1 file changed, 23 insertions(+), 27 deletions(-) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -203,26 +203,6 @@ static int mspi_apply_cpu_mode_quirks(st return bits_per_word; } -static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs, - struct spi_device *spi, - int bits_per_word) -{ - /* CPM/QE uses Little Endian for words > 8 -* so transform 16 and 32 bits words into 8 bits -* Unfortnatly that doesn't work for LSB so -* reject these for now */ - /* Note: 32 bits word, LSB works iff -* tfcr/rfcr is set to CPMFCR_GBL */ - if (spi->mode & SPI_LSB_FIRST && - bits_per_word > 8) - return -EINVAL; - if (bits_per_word <= 8) - return bits_per_word; - if (bits_per_word == 16 || bits_per_word == 32) - return 8; /* pretend its 8 bits */ - return -EINVAL; -} - static int fsl_spi_setup_transfer(struct spi_device *spi, struct spi_transfer *t) { @@ -250,9 +230,6 @@ static int fsl_spi_setup_transfer(struct bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi, mpc8xxx_spi, bits_per_word); - else - bits_per_word = mspi_apply_qe_mode_quirks(cs, spi, - bits_per_word); if (bits_per_word < 0) return bits_per_word; @@ -370,14 +347,27 @@ static int fsl_spi_do_one_msg(struct spi * In CPU mode, optimize large byte transfers to use larger * bits_per_word values to reduce number of interrupts taken. */ - if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { - list_for_each_entry(t, >transfers, transfer_list) { + list_for_each_entry(t, >transfers, transfer_list) { + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { if (t->len < 256 || t->bits_per_word != 8) continue; if ((t->len & 3) == 0) t->bits_per_word = 32; else if ((t->len & 1) == 0) t->bits_per_word = 16; + } else { + /* +* CPM/QE uses Little Endian for words > 8 +* so transform 16 and 32 bits words into 8 bits +* Unfortnatly that doesn't work for LSB so +* reject these for now +* Note: 32 bits word, LSB works iff +* tfcr/rfcr is set to CPMFCR_GBL +*/ + if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8) + return -EINVAL; + if (t->bits_per_word == 16 || t->bits_per_word == 32) + t->bits_per_word = 8; /* pretend its 8 bits
Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 5.15-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size to the 5.15-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch and it can be found in the queue-5.15 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:07:59 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:17 +0200 Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <3a1b8774ad7004acb594fbf220f98488dbaa2896.1684156552.git.christophe.le...@csgroup.eu> From: Christophe Leroy (cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab) On CPM, the RISC core is a lot more efficiant when doing transfers in 16-bits chunks than in 8-bits chunks, but unfortunately the words need to be byte swapped as seen in a previous commit. So, for large tranfers with an even size, allocate a temporary tx buffer and byte-swap data before and after transfer. This change allows setting higher speed for transfer. For instance on an MPC 8xx (CPM1 comms RISC processor), the documentation tells that transfer in byte mode at 1 kbit/s uses 0.200% of CPM load at 25 MHz while a word transfer at the same speed uses 0.032% of CPM load. This means the speed can be 6 times higher in word mode for the same CPM load. For the time being, only do it on CPM1 as there must be a trade-off between the CPM load reduction and the CPU load required to byte swap the data. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-cpm.c | 23 +++ drivers/spi/spi-fsl-spi.c |3 +++ 2 files changed, 26 insertions(+) --- a/drivers/spi/spi-fsl-cpm.c +++ b/drivers/spi/spi-fsl-cpm.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "spi-fsl-cpm.h" #include "spi-fsl-lib.h" @@ -120,6 +121,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi mspi->rx_dma = mspi->dma_dummy_rx; mspi->map_rx_dma = 0; } + if (t->bits_per_word == 16 && t->tx_buf) { + const u16 *src = t->tx_buf; + u16 *dst; + int i; + + dst = kmalloc(t->len, GFP_KERNEL); + if (!dst) + return -ENOMEM; + + for (i = 0; i < t->len >> 1; i++) + dst[i] = cpu_to_le16p(src + i); + + mspi->tx = dst; + mspi->map_tx_dma = 1; + } if (mspi->map_tx_dma) { void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */ @@ -173,6 +189,13 @@ void fsl_spi_cpm_bufs_complete(struct mp if (mspi->map_rx_dma) dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE); mspi->xfer_in_progress = NULL; + + if (t->bits_per_word == 16 && t->rx_buf) { + int i; + + for (i = 0; i < t->len; i += 2) + le16_to_cpus(t->rx_buf + i); + } } EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete); --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -368,6 +368,9 @@ static int fsl_spi_do_one_msg(struct spi return -EINVAL; if (t->bits_per_word == 16 || t->bits_per_word == 32) t->bits_per_word = 8; /* pretend its 8 bits */ + if (t->bits_per_word == 8 && t->len >= 256 && + (mpc8xxx_spi->flags & SPI_CPM1)) + t->bits_per_word = 16; } } Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-5.15/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-5.15/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 5.10-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation to the 5.10-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch and it can be found in the queue-5.10 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:03 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:16 +0200 Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <1e4bfb4850ba849c316f48a0ab0d7123da0e2f54.1684156552.git.christophe.le...@csgroup.eu> From: Christophe Leroy (backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e) For different reasons, fsl-spi driver performs bits_per_word modifications for different reasons: - On CPU mode, to minimise amount of interrupts - On CPM/QE mode to work around controller byte order For CPU mode that's done in fsl_spi_prepare_message() while for CPM mode that's done in fsl_spi_setup_transfer(). Reunify all of it in fsl_spi_prepare_message(), and catch impossible cases early through master's bits_per_word_mask instead of returning EINVAL later. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 50 +- 1 file changed, 23 insertions(+), 27 deletions(-) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -203,26 +203,6 @@ static int mspi_apply_cpu_mode_quirks(st return bits_per_word; } -static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs, - struct spi_device *spi, - int bits_per_word) -{ - /* CPM/QE uses Little Endian for words > 8 -* so transform 16 and 32 bits words into 8 bits -* Unfortnatly that doesn't work for LSB so -* reject these for now */ - /* Note: 32 bits word, LSB works iff -* tfcr/rfcr is set to CPMFCR_GBL */ - if (spi->mode & SPI_LSB_FIRST && - bits_per_word > 8) - return -EINVAL; - if (bits_per_word <= 8) - return bits_per_word; - if (bits_per_word == 16 || bits_per_word == 32) - return 8; /* pretend its 8 bits */ - return -EINVAL; -} - static int fsl_spi_setup_transfer(struct spi_device *spi, struct spi_transfer *t) { @@ -250,9 +230,6 @@ static int fsl_spi_setup_transfer(struct bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi, mpc8xxx_spi, bits_per_word); - else - bits_per_word = mspi_apply_qe_mode_quirks(cs, spi, - bits_per_word); if (bits_per_word < 0) return bits_per_word; @@ -370,14 +347,27 @@ static int fsl_spi_do_one_msg(struct spi * In CPU mode, optimize large byte transfers to use larger * bits_per_word values to reduce number of interrupts taken. */ - if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { - list_for_each_entry(t, >transfers, transfer_list) { + list_for_each_entry(t, >transfers, transfer_list) { + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { if (t->len < 256 || t->bits_per_word != 8) continue; if ((t->len & 3) == 0) t->bits_per_word = 32; else if ((t->len & 1) == 0) t->bits_per_word = 16; + } else { + /* +* CPM/QE uses Little Endian for words > 8 +* so transform 16 and 32 bits words into 8 bits +* Unfortnatly that doesn't work for LSB so +* reject these for now +* Note: 32 bits word, LSB works iff +* tfcr/rfcr is set to CPMFCR_GBL +*/ + if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8) + return -EINVAL; + if (t->bits_per_word == 16 || t->bits_per_word == 32) + t->bits_per_word = 8; /* pretend its 8 bits
Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 5.10-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size to the 5.10-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch and it can be found in the queue-5.10 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:07:59 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:17 +0200 Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <3a1b8774ad7004acb594fbf220f98488dbaa2896.1684156552.git.christophe.le...@csgroup.eu> From: Christophe Leroy (cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab) On CPM, the RISC core is a lot more efficiant when doing transfers in 16-bits chunks than in 8-bits chunks, but unfortunately the words need to be byte swapped as seen in a previous commit. So, for large tranfers with an even size, allocate a temporary tx buffer and byte-swap data before and after transfer. This change allows setting higher speed for transfer. For instance on an MPC 8xx (CPM1 comms RISC processor), the documentation tells that transfer in byte mode at 1 kbit/s uses 0.200% of CPM load at 25 MHz while a word transfer at the same speed uses 0.032% of CPM load. This means the speed can be 6 times higher in word mode for the same CPM load. For the time being, only do it on CPM1 as there must be a trade-off between the CPM load reduction and the CPU load required to byte swap the data. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-cpm.c | 23 +++ drivers/spi/spi-fsl-spi.c |3 +++ 2 files changed, 26 insertions(+) --- a/drivers/spi/spi-fsl-cpm.c +++ b/drivers/spi/spi-fsl-cpm.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "spi-fsl-cpm.h" #include "spi-fsl-lib.h" @@ -120,6 +121,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi mspi->rx_dma = mspi->dma_dummy_rx; mspi->map_rx_dma = 0; } + if (t->bits_per_word == 16 && t->tx_buf) { + const u16 *src = t->tx_buf; + u16 *dst; + int i; + + dst = kmalloc(t->len, GFP_KERNEL); + if (!dst) + return -ENOMEM; + + for (i = 0; i < t->len >> 1; i++) + dst[i] = cpu_to_le16p(src + i); + + mspi->tx = dst; + mspi->map_tx_dma = 1; + } if (mspi->map_tx_dma) { void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */ @@ -173,6 +189,13 @@ void fsl_spi_cpm_bufs_complete(struct mp if (mspi->map_rx_dma) dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE); mspi->xfer_in_progress = NULL; + + if (t->bits_per_word == 16 && t->rx_buf) { + int i; + + for (i = 0; i < t->len; i += 2) + le16_to_cpus(t->rx_buf + i); + } } EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete); --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -368,6 +368,9 @@ static int fsl_spi_do_one_msg(struct spi return -EINVAL; if (t->bits_per_word == 16 || t->bits_per_word == 32) t->bits_per_word = 8; /* pretend its 8 bits */ + if (t->bits_per_word == 8 && t->len >= 256 && + (mpc8xxx_spi->flags & SPI_CPM1)) + t->bits_per_word = 16; } } Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-5.10/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-5.10/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
Patch "Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 4.19-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation to the 4.19-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch and it can be found in the queue-4.19 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:01 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:14 +0200 Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: From: Christophe Leroy (backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e) For different reasons, fsl-spi driver performs bits_per_word modifications for different reasons: - On CPU mode, to minimise amount of interrupts - On CPM/QE mode to work around controller byte order For CPU mode that's done in fsl_spi_prepare_message() while for CPM mode that's done in fsl_spi_setup_transfer(). Reunify all of it in fsl_spi_prepare_message(), and catch impossible cases early through master's bits_per_word_mask instead of returning EINVAL later. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 50 +- 1 file changed, 23 insertions(+), 27 deletions(-) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -201,26 +201,6 @@ static int mspi_apply_cpu_mode_quirks(st return bits_per_word; } -static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs, - struct spi_device *spi, - int bits_per_word) -{ - /* CPM/QE uses Little Endian for words > 8 -* so transform 16 and 32 bits words into 8 bits -* Unfortnatly that doesn't work for LSB so -* reject these for now */ - /* Note: 32 bits word, LSB works iff -* tfcr/rfcr is set to CPMFCR_GBL */ - if (spi->mode & SPI_LSB_FIRST && - bits_per_word > 8) - return -EINVAL; - if (bits_per_word <= 8) - return bits_per_word; - if (bits_per_word == 16 || bits_per_word == 32) - return 8; /* pretend its 8 bits */ - return -EINVAL; -} - static int fsl_spi_setup_transfer(struct spi_device *spi, struct spi_transfer *t) { @@ -248,9 +228,6 @@ static int fsl_spi_setup_transfer(struct bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi, mpc8xxx_spi, bits_per_word); - else - bits_per_word = mspi_apply_qe_mode_quirks(cs, spi, - bits_per_word); if (bits_per_word < 0) return bits_per_word; @@ -368,14 +345,27 @@ static int fsl_spi_do_one_msg(struct spi * In CPU mode, optimize large byte transfers to use larger * bits_per_word values to reduce number of interrupts taken. */ - if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { - list_for_each_entry(t, >transfers, transfer_list) { + list_for_each_entry(t, >transfers, transfer_list) { + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { if (t->len < 256 || t->bits_per_word != 8) continue; if ((t->len & 3) == 0) t->bits_per_word = 32; else if ((t->len & 1) == 0) t->bits_per_word = 16; + } else { + /* +* CPM/QE uses Little Endian for words > 8 +* so transform 16 and 32 bits words into 8 bits +* Unfortnatly that doesn't work for LSB so +* reject these for now +* Note: 32 bits word, LSB works iff +* tfcr/rfcr is set to CPMFCR_GBL +*/ + if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8) + return -EINVAL; + if (t->bits_per_word == 16 || t->bits_per_word == 32) + t->bits_per_word = 8; /* pretend its 8 bits */ } } @@ -658,8 +648,14 @@ static struct spi_master * fsl_spi_probe
Patch "Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode" has been added to the 4.19-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode to the 4.19-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch and it can be found in the queue-4.19 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:06 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:13 +0200 Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Rasmus Villemoes , Mark Brown Message-ID: <674d9af640acf4aa04abd642cc81de926d3271ed.1684158520.git.christophe.le...@csgroup.eu> From: Rasmus Villemoes (cherry picked from upstream af0e6242909c3c4297392ca3e94eff1b4db71a97) Taking one interrupt for every byte is rather slow. Since the controller is perfectly capable of transmitting 32 bits at a time, change t->bits_per-word to 32 when the length is divisible by 4 and large enough that the reduced number of interrupts easily compensates for the one or two extra fsl_spi_setup_transfer() calls this causes. Signed-off-by: Rasmus Villemoes Signed-off-by: Mark Brown Signed-off-by: Christophe Leroy Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 16 1 file changed, 16 insertions(+) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -357,12 +357,28 @@ static int fsl_spi_bufs(struct spi_devic static int fsl_spi_do_one_msg(struct spi_master *master, struct spi_message *m) { + struct mpc8xxx_spi *mpc8xxx_spi = spi_master_get_devdata(master); struct spi_device *spi = m->spi; struct spi_transfer *t, *first; unsigned int cs_change; const int nsecs = 50; int status; + /* +* In CPU mode, optimize large byte transfers to use larger +* bits_per_word values to reduce number of interrupts taken. +*/ + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { + list_for_each_entry(t, >transfers, transfer_list) { + if (t->len < 256 || t->bits_per_word != 8) + continue; + if ((t->len & 3) == 0) + t->bits_per_word = 32; + else if ((t->len & 1) == 0) + t->bits_per_word = 16; + } + } + /* Don't allow changes if CS is active */ first = list_first_entry(>transfers, struct spi_transfer, transfer_list); Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-4.19/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-4.19/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch queue-4.19/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch
Patch "Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 4.19-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size to the 4.19-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch and it can be found in the queue-4.19 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:07:58 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:15 +0200 Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <9363da33f54e9862b4b59c0ed97924ca7265f7a4.1684158520.git.christophe.le...@csgroup.eu> From: Christophe Leroy (cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab) On CPM, the RISC core is a lot more efficiant when doing transfers in 16-bits chunks than in 8-bits chunks, but unfortunately the words need to be byte swapped as seen in a previous commit. So, for large tranfers with an even size, allocate a temporary tx buffer and byte-swap data before and after transfer. This change allows setting higher speed for transfer. For instance on an MPC 8xx (CPM1 comms RISC processor), the documentation tells that transfer in byte mode at 1 kbit/s uses 0.200% of CPM load at 25 MHz while a word transfer at the same speed uses 0.032% of CPM load. This means the speed can be 6 times higher in word mode for the same CPM load. For the time being, only do it on CPM1 as there must be a trade-off between the CPM load reduction and the CPU load required to byte swap the data. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-cpm.c | 23 +++ drivers/spi/spi-fsl-spi.c |3 +++ 2 files changed, 26 insertions(+) --- a/drivers/spi/spi-fsl-cpm.c +++ b/drivers/spi/spi-fsl-cpm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "spi-fsl-cpm.h" #include "spi-fsl-lib.h" @@ -124,6 +125,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi mspi->rx_dma = mspi->dma_dummy_rx; mspi->map_rx_dma = 0; } + if (t->bits_per_word == 16 && t->tx_buf) { + const u16 *src = t->tx_buf; + u16 *dst; + int i; + + dst = kmalloc(t->len, GFP_KERNEL); + if (!dst) + return -ENOMEM; + + for (i = 0; i < t->len >> 1; i++) + dst[i] = cpu_to_le16p(src + i); + + mspi->tx = dst; + mspi->map_tx_dma = 1; + } if (mspi->map_tx_dma) { void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */ @@ -177,6 +193,13 @@ void fsl_spi_cpm_bufs_complete(struct mp if (mspi->map_rx_dma) dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE); mspi->xfer_in_progress = NULL; + + if (t->bits_per_word == 16 && t->rx_buf) { + int i; + + for (i = 0; i < t->len; i += 2) + le16_to_cpus(t->rx_buf + i); + } } EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete); --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -366,6 +366,9 @@ static int fsl_spi_do_one_msg(struct spi return -EINVAL; if (t->bits_per_word == 16 || t->bits_per_word == 32) t->bits_per_word = 8; /* pretend its 8 bits */ + if (t->bits_per_word == 8 && t->len >= 256 && + (mpc8xxx_spi->flags & SPI_CPM1)) + t->bits_per_word = 16; } } Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-4.19/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-4.19/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch queue-4.19/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch
Patch "Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode" has been added to the 4.14-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode to the 4.14-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch and it can be found in the queue-4.14 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:06 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:13 +0200 Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Rasmus Villemoes , Mark Brown Message-ID: <674d9af640acf4aa04abd642cc81de926d3271ed.1684158520.git.christophe.le...@csgroup.eu> From: Rasmus Villemoes (cherry picked from upstream af0e6242909c3c4297392ca3e94eff1b4db71a97) Taking one interrupt for every byte is rather slow. Since the controller is perfectly capable of transmitting 32 bits at a time, change t->bits_per-word to 32 when the length is divisible by 4 and large enough that the reduced number of interrupts easily compensates for the one or two extra fsl_spi_setup_transfer() calls this causes. Signed-off-by: Rasmus Villemoes Signed-off-by: Mark Brown Signed-off-by: Christophe Leroy Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 16 1 file changed, 16 insertions(+) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -357,12 +357,28 @@ static int fsl_spi_bufs(struct spi_devic static int fsl_spi_do_one_msg(struct spi_master *master, struct spi_message *m) { + struct mpc8xxx_spi *mpc8xxx_spi = spi_master_get_devdata(master); struct spi_device *spi = m->spi; struct spi_transfer *t, *first; unsigned int cs_change; const int nsecs = 50; int status; + /* +* In CPU mode, optimize large byte transfers to use larger +* bits_per_word values to reduce number of interrupts taken. +*/ + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { + list_for_each_entry(t, >transfers, transfer_list) { + if (t->len < 256 || t->bits_per_word != 8) + continue; + if ((t->len & 3) == 0) + t->bits_per_word = 32; + else if ((t->len & 1) == 0) + t->bits_per_word = 16; + } + } + /* Don't allow changes if CS is active */ first = list_first_entry(>transfers, struct spi_transfer, transfer_list); Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-4.14/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-4.14/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch queue-4.14/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch
Patch "Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 4.14-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation to the 4.14-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch and it can be found in the queue-4.14 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:08:01 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:14 +0200 Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: From: Christophe Leroy (backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e) For different reasons, fsl-spi driver performs bits_per_word modifications for different reasons: - On CPU mode, to minimise amount of interrupts - On CPM/QE mode to work around controller byte order For CPU mode that's done in fsl_spi_prepare_message() while for CPM mode that's done in fsl_spi_setup_transfer(). Reunify all of it in fsl_spi_prepare_message(), and catch impossible cases early through master's bits_per_word_mask instead of returning EINVAL later. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-spi.c | 50 +- 1 file changed, 23 insertions(+), 27 deletions(-) --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -201,26 +201,6 @@ static int mspi_apply_cpu_mode_quirks(st return bits_per_word; } -static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs, - struct spi_device *spi, - int bits_per_word) -{ - /* CPM/QE uses Little Endian for words > 8 -* so transform 16 and 32 bits words into 8 bits -* Unfortnatly that doesn't work for LSB so -* reject these for now */ - /* Note: 32 bits word, LSB works iff -* tfcr/rfcr is set to CPMFCR_GBL */ - if (spi->mode & SPI_LSB_FIRST && - bits_per_word > 8) - return -EINVAL; - if (bits_per_word <= 8) - return bits_per_word; - if (bits_per_word == 16 || bits_per_word == 32) - return 8; /* pretend its 8 bits */ - return -EINVAL; -} - static int fsl_spi_setup_transfer(struct spi_device *spi, struct spi_transfer *t) { @@ -248,9 +228,6 @@ static int fsl_spi_setup_transfer(struct bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi, mpc8xxx_spi, bits_per_word); - else - bits_per_word = mspi_apply_qe_mode_quirks(cs, spi, - bits_per_word); if (bits_per_word < 0) return bits_per_word; @@ -368,14 +345,27 @@ static int fsl_spi_do_one_msg(struct spi * In CPU mode, optimize large byte transfers to use larger * bits_per_word values to reduce number of interrupts taken. */ - if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { - list_for_each_entry(t, >transfers, transfer_list) { + list_for_each_entry(t, >transfers, transfer_list) { + if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) { if (t->len < 256 || t->bits_per_word != 8) continue; if ((t->len & 3) == 0) t->bits_per_word = 32; else if ((t->len & 1) == 0) t->bits_per_word = 16; + } else { + /* +* CPM/QE uses Little Endian for words > 8 +* so transform 16 and 32 bits words into 8 bits +* Unfortnatly that doesn't work for LSB so +* reject these for now +* Note: 32 bits word, LSB works iff +* tfcr/rfcr is set to CPMFCR_GBL +*/ + if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8) + return -EINVAL; + if (t->bits_per_word == 16 || t->bits_per_word == 32) + t->bits_per_word = 8; /* pretend its 8 bits */ } } @@ -658,8 +648,14 @@ static struct spi_master * fsl_spi_probe
Patch "Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 4.14-stable tree
This is a note to let you know that I've just added the patch titled Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size to the 4.14-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch and it can be found in the queue-4.14 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let know about it. >From christophe.le...@csgroup.eu Mon May 15 15:07:58 2023 From: Christophe Leroy Date: Mon, 15 May 2023 16:07:15 +0200 Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size To: gre...@linuxfoundation.org, sta...@vger.kernel.org Cc: Christophe Leroy , linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown Message-ID: <9363da33f54e9862b4b59c0ed97924ca7265f7a4.1684158520.git.christophe.le...@csgroup.eu> From: Christophe Leroy (cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab) On CPM, the RISC core is a lot more efficiant when doing transfers in 16-bits chunks than in 8-bits chunks, but unfortunately the words need to be byte swapped as seen in a previous commit. So, for large tranfers with an even size, allocate a temporary tx buffer and byte-swap data before and after transfer. This change allows setting higher speed for transfer. For instance on an MPC 8xx (CPM1 comms RISC processor), the documentation tells that transfer in byte mode at 1 kbit/s uses 0.200% of CPM load at 25 MHz while a word transfer at the same speed uses 0.032% of CPM load. This means the speed can be 6 times higher in word mode for the same CPM load. For the time being, only do it on CPM1 as there must be a trade-off between the CPM load reduction and the CPU load required to byte swap the data. Signed-off-by: Christophe Leroy Link: https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman --- drivers/spi/spi-fsl-cpm.c | 23 +++ drivers/spi/spi-fsl-spi.c |3 +++ 2 files changed, 26 insertions(+) --- a/drivers/spi/spi-fsl-cpm.c +++ b/drivers/spi/spi-fsl-cpm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "spi-fsl-cpm.h" #include "spi-fsl-lib.h" @@ -124,6 +125,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi mspi->rx_dma = mspi->dma_dummy_rx; mspi->map_rx_dma = 0; } + if (t->bits_per_word == 16 && t->tx_buf) { + const u16 *src = t->tx_buf; + u16 *dst; + int i; + + dst = kmalloc(t->len, GFP_KERNEL); + if (!dst) + return -ENOMEM; + + for (i = 0; i < t->len >> 1; i++) + dst[i] = cpu_to_le16p(src + i); + + mspi->tx = dst; + mspi->map_tx_dma = 1; + } if (mspi->map_tx_dma) { void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */ @@ -177,6 +193,13 @@ void fsl_spi_cpm_bufs_complete(struct mp if (mspi->map_rx_dma) dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE); mspi->xfer_in_progress = NULL; + + if (t->bits_per_word == 16 && t->rx_buf) { + int i; + + for (i = 0; i < t->len; i += 2) + le16_to_cpus(t->rx_buf + i); + } } EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete); --- a/drivers/spi/spi-fsl-spi.c +++ b/drivers/spi/spi-fsl-spi.c @@ -366,6 +366,9 @@ static int fsl_spi_do_one_msg(struct spi return -EINVAL; if (t->bits_per_word == 16 || t->bits_per_word == 32) t->bits_per_word = 8; /* pretend its 8 bits */ + if (t->bits_per_word == 8 && t->len >= 256 && + (mpc8xxx_spi->flags & SPI_CPM1)) + t->bits_per_word = 16; } } Patches currently in stable-queue which might be from christophe.le...@csgroup.eu are queue-4.14/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch queue-4.14/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch queue-4.14/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch
Re: [PATCH v5 13/18] watchdog/hardlockup: Have the perf hardlockup use __weak functions more cleanly
On Wed 2023-05-24 12:38:49, Doug Anderson wrote: > Hi, > > On Wed, May 24, 2023 at 6:59 AM Petr Mladek wrote: > > > > On Fri 2023-05-19 10:18:37, Douglas Anderson wrote: > > > The fact that there watchdog_hardlockup_enable(), > > > watchdog_hardlockup_disable(), and watchdog_hardlockup_probe() are > > > declared __weak means that the configured hardlockup detector can > > > define non-weak versions of those functions if it needs to. Instead of > > > doing this, the perf hardlockup detector hooked itself into the > > > default __weak implementation, which was a bit awkward. Clean this up. > > > > > > >From comments, it looks as if the original design was done because the > > > __weak function were expected to implemented by the architecture and > > > not by the configured hardlockup detector. This got awkward when we > > > tried to add the buddy lockup detector which was not arch-specific but > > > wanted to hook into those same functions. > > > > > > This is not expected to have any functional impact. > > > > > > @@ -187,27 +187,33 @@ static inline void watchdog_hardlockup_kick(void) { > > > } > > > #endif /* !CONFIG_HARDLOCKUP_DETECTOR_PERF */ > > > > > > /* > > > - * These functions can be overridden if an architecture implements its > > > - * own hardlockup detector. > > > + * These functions can be overridden based on the configured hardlockdup > > > detector. > > > * > > > * watchdog_hardlockup_enable/disable can be implemented to start and > > > stop when > > > - * softlockup watchdog start and stop. The arch must select the > > > + * softlockup watchdog start and stop. The detector must select the > > > * SOFTLOCKUP_DETECTOR Kconfig. > > > */ > > > -void __weak watchdog_hardlockup_enable(unsigned int cpu) > > > -{ > > > - hardlockup_detector_perf_enable(); > > > -} > > > +void __weak watchdog_hardlockup_enable(unsigned int cpu) { } > > > > > > -void __weak watchdog_hardlockup_disable(unsigned int cpu) > > > -{ > > > - hardlockup_detector_perf_disable(); > > > -} > > > +void __weak watchdog_hardlockup_disable(unsigned int cpu) { } > > > > > > /* Return 0, if a hardlockup watchdog is available. Error code otherwise > > > */ > > > int __weak __init watchdog_hardlockup_probe(void) > > > { > > > - return hardlockup_detector_perf_init(); > > > + /* > > > + * If CONFIG_HAVE_NMI_WATCHDOG is defined then an architecture > > > + * is assumed to have the hard watchdog available and we return 0. > > > + */ > > > + if (IS_ENABLED(CONFIG_HAVE_NMI_WATCHDOG)) > > > + return 0; > > > + > > > + /* > > > + * Hardlockup detectors other than those using > > > CONFIG_HAVE_NMI_WATCHDOG > > > + * are required to implement a non-weak version of this probe > > > function > > > + * to tell whether they are available. If they don't override then > > > + * we'll return -ENODEV. > > > + */ > > > + return -ENODEV; > > > } > > > > When thinking more about it. It is weird that we need to handle > > CONFIG_HAVE_NMI_WATCHDOG in this default week function. > > > > It should be handled in watchdog_hardlockup_probe() implemented > > in kernel/watchdog_perf.c. > > > > IMHO, the default __weak function could always return -ENODEV; > > > > Would it make sense, please? > > I don't quite understand. I'd agree that the special case for > CONFIG_HAVE_NMI_WATCHDOG is ugly, but it was also ugly before. IMO > it's actually a little less ugly / easier to understand after my > patch. ...but let me walk through how I think this special case works > and maybe you can tell me where I'm confused. > > The first thing to understand is that CONFIG_HARDLOCKUP_DETECTOR_PERF > and CONFIG_HAVE_NMI_WATCHDOG are mutually exclusive from each other. > This was true before any of my patches and is still true after them. > Specifically, if CONFIG_HAVE_NMI_WATCHDOG is defined then an > architecture implements arch_touch_nmi_watchdog() (as documented in > the Kconfig docs for HAVE_NMI_WATCHDOG). Looking at the tree before my > series you can see that the perf hardlockup detector also implemented > arch_touch_nmi_watchdog(). This would have caused a conflict. The > mutual exclusion was presumably enforced by an architecture not > defining both HAVE_NMI_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_PERF. > > The second thing to understand is that an architecture that defines > CONFIG_HAVE_NMI_WATCHDOG is _not_ required to implement > watchdog_hardlockup_probe() (used to be called watchdog_nmi_probe()). > Maybe this should change, but at the very least it appears that > SPARC64 defines HAVE_NMI_WATCHDOG but doesn't define > watchdog_hardlockup_probe() AKA watchdog_nmi_probe(). Anyone who > defines CONFIG_HAVE_NMI_WATCHDOG and doesn't implement > watchdog_hardlockup_probe() is claiming that their watchdog needs no > probing and is always available. > > So with that context: > > 1. We can't handle any special cases for CONFIG_HAVE_NMI_WATCHDOG in > "kernel/watchdog_perf.c".
Re: [PATCH v5 15/18] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs
On Fri 2023-05-19 10:18:39, Douglas Anderson wrote: > On arm64, NMI support needs to be detected at runtime. Add a weak > function to the perf hardlockup detector so that an architecture can > implement it to detect whether NMIs are available. > > Signed-off-by: Douglas Anderson > --- > While I won't object to this patch landing, I consider it part of the > arm64 perf hardlockup effort. I would be OK with the earlier patches > in the series landing and then not landing ${SUBJECT} patch nor > anything else later. > > I'll also note that, as an alternative to this, it would be nice if we > could figure out how to make perf_event_create_kernel_counter() fail > on arm64 if NMIs aren't available. Maybe we could add a "must_use_nmi" > element to "struct perf_event_attr"? > > --- a/kernel/watchdog_perf.c > +++ b/kernel/watchdog_perf.c > @@ -234,12 +234,22 @@ void __init hardlockup_detector_perf_restart(void) > } > } > > +bool __weak __init arch_perf_nmi_is_available(void) > +{ > + return true; > +} > + > /** > * watchdog_hardlockup_probe - Probe whether NMI event is available at all > */ > int __init watchdog_hardlockup_probe(void) > { > - int ret = hardlockup_detector_event_create(); > + int ret; > + > + if (!arch_perf_nmi_is_available()) > + return -ENODEV; My understanding is that this would block the perf hardlockup detector at runtime. Does it work with the "nmi_watchdog" sysctl. I see that it is made read-only when it is not enabled at build time, see NMI_WATCHDOG_SYSCTL_PERM. > + > + ret = hardlockup_detector_event_create(); > > if (ret) { > pr_info("Perf NMI watchdog permanently disabled\n"); Best Regards, Petr
Re: [PATCH v5 14/18] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs
On Thu 2023-05-25 13:08:04, Doug Anderson wrote: > Hi, > > On Thu, May 25, 2023 at 9:27 AM Petr Mladek wrote: > > > > On Fri 2023-05-19 10:18:38, Douglas Anderson wrote: > > > Implement a hardlockup detector that doesn't doesn't need any extra > > > arch-specific support code to detect lockups. Instead of using > > > something arch-specific we will use the buddy system, where each CPU > > > watches out for another one. Specifically, each CPU will use its > > > softlockup hrtimer to check that the next CPU is processing hrtimer > > > interrupts by verifying that a counter is increasing. > > > > > > --- a/kernel/watchdog.c > > > +++ b/kernel/watchdog.c > > > @@ -85,7 +85,7 @@ __setup("nmi_watchdog=", hardlockup_panic_setup); > > > > > > #endif /* CONFIG_HARDLOCKUP_DETECTOR */ > > > > > > -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) > > > +#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER) > > > > > > static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts); > > > static DEFINE_PER_CPU(int, hrtimer_interrupts_saved); > > > @@ -106,6 +106,14 @@ notrace void arch_touch_nmi_watchdog(void) > > > } > > > EXPORT_SYMBOL(arch_touch_nmi_watchdog); > > > > > > +void watchdog_hardlockup_touch_cpu(unsigned int cpu) > > > +{ > > > + per_cpu(watchdog_hardlockup_touched, cpu) = true; > > > + > > > + /* Match with smp_rmb() in watchdog_hardlockup_check() */ > > > + smp_wmb(); > > > > It is great that you described where the related barrier is. > > > > Another important information is what exactly is synchronized. > > And I am actually not sure what we are synchronizing here. > > > > My understanding is that a write barrier should synchronize > > related writes, for example: > > > > X = ...; > > /* Make sure that X is modified before Y */ > > smp_wmb(); > > Y = ...; > > > > And the related read barrier should synchronize the related reads, > > for example: > > > > if (test(Y)) { > > /* > > * Make sure that we use the updated X when > > * we saw the updated Y. > > */ > > smp_rmb(); > > do_something(X); > > } > > > > IMHO, we do not need any barrier here because we have only > > one variable "watchdog_hardlockup_touched" here. > > watchdog_hardlockup_check() will either see the updated value > > or not. But it does not synchronize it against any other > > variables or values. > > Fair. These barriers were present in the original buddy lockup > detector that we've been carrying in ChromeOS but that doesn't > necessarily mean that they were there for a good reason. > > Reasoning about weakly ordered memory always makes my brain hurt and I > never feel confident at the end that I got the right answer and, of > course, this is coupled by the fact that if I have a logic error in my > reasoning that it might cause a rare / subtle bug. :( Sure. Lockless code is complicated. > When possible I > try to write code that uses full blown locks to make it easier to > reason about (even if less efficient), Makes sense. There should be a good reason to use lockless code because it is complicated to do it right and maintain. > but that's not necessarily > possible here. While we obviously don't just want to sprinkle barriers > all over the code, IMO it's not a terrible sin to put a barrier in a > case where it makes it easier to reason about the order of things. I understand this. Well, it is always important to describe the the reason why the barrier was added there. Even when it is wrong, it gives a clue what was the motivation. Otherwise, it is hard to do any changes on the code later. I guess that it might be more problematic for you because you probably are not the original author. > In any case, I guess in this case I would worry about some sort of > ordering race when enabling / disabling the buddy lockup detector. At > the end of the buddy's watchdog_hardlockup_enable() / > watchdog_hardlockup_disable() we adjust the "watchdog_cpus" which > changes buddy assignments. Without a barrier, I _think_ it would be > possible for a new CPU to notice the change in buddies without > noticing the touch. Does that match your understanding? Now when > reasoning about CPUs going online/offline we need to consider this > extra case and we have to decide if there's any chance it could lead > to a false lockup detection. With the memory barriers here, it's a > little easier to think about. This makes sense. I did not think about the hotplug scenario. Well, I suggest to move the barriers into the buddy code and describe it there. It does not make sense to use the barriers for the perf hardlockup. I mean something like: diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c index fee45af2e5bd..ebe71dcb55e6 100644 --- a/kernel/watchdog_buddy.c +++ b/kernel/watchdog_buddy.c @@ -52,6 +52,13 @@ void watchdog_hardlockup_enable(unsigned int cpu) if (next_cpu <
Bug: Write fault blocked by KUAP! in do_notify_resume()
Can't find how that can happen. I have: CONFIG_PREEMPT_BUILD=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_COUNT=y CONFIG_PREEMPTION=y CONFIG_PREEMPT_RCU=y We are inside an access_begin / access_end block. [ 380.407589] [ cut here ] [ 380.408019] Bug: Write fault blocked by KUAP! [ 380.408497] WARNING: CPU: 0 PID: 422 at arch/powerpc/mm/fault.c:228 do_page_fault+0x510/0x78c [ 380.409353] CPU: 0 PID: 422 Comm: CORSurv Tainted: GW 6.4.0-rc2-s3k-dev-02274-gca69d28ba73c #328 [ 380.409879] Hardware name: MCR3000_2G 8xx 0x50 CMPC885 [ 380.410346] NIP: c00135cc LR: c00135cc CTR: c0065b08 [ 380.410834] REGS: cae53ce0 TRAP: 0700 Tainted: GW (6.4.0-rc2-s3k-dev-02274-gca69d28ba73c) [ 380.411335] MSR: 00021032 CR: 22e822d2 XER: 2000f701 [ 380.414300] [ 380.414300] GPR00: c00135cc cae53da0 c26439c0 0021 c0a75d78 0001 c0a75e88 1032 [ 380.414300] GPR08: 0027 0001 921a5f00 22e822d2 1002c9f4 0001 0005 [ 380.414300] GPR16: 10025760 100254bc 10024eec 0004 [ 380.414300] GPR24: 0006 10025770 100100cc c26c8d80 7fab7b50 0200 0300 cae53de0 [ 380.428316] NIP [c00135cc] do_page_fault+0x510/0x78c [ 380.429005] LR [c00135cc] do_page_fault+0x510/0x78c [ 380.429673] Call Trace: [ 380.430115] [cae53da0] [c00135cc] do_page_fault+0x510/0x78c (unreliable) [ 380.431513] [cae53dd0] [c0003ac4] DataTLBError_virt+0x114/0x118 [ 380.432546] --- interrupt: 300 at __unsafe_save_user_regs.constprop.0+0x24/0x88 [ 380.433231] NIP: c000656c LR: c0006980 CTR: 0006 [ 380.433714] REGS: cae53de0 TRAP: 0300 Tainted: GW (6.4.0-rc2-s3k-dev-02274-gca69d28ba73c) [ 380.434216] MSR: 9032 CR: 22084442 XER: 2000f701 [ 380.437561] DAR: 7fab7b50 DSISR: 8a00 [ 380.437561] GPR00: c0006980 cae53ea0 c26439c0 cae53f40 7fab7b50 7fab7b30 001d 7fab8010 [ 380.437561] GPR08: cae53f38 7fab7b50 cae53f40 918ffc00 22084442 1002c9f4 0001 0005 [ 380.437561] GPR16: 10025760 100254bc 10024eec 0004 [ 380.437561] GPR24: 0006 10025770 100100cc cae53f40 cae53f40 cae53ec8 c26439c0 7fab7b50 [ 380.451991] NIP [c000656c] __unsafe_save_user_regs.constprop.0+0x24/0x88 [ 380.452678] LR [c0006980] handle_signal32+0x9c/0x1a0 [ 380.453329] --- interrupt: 300 [ 380.453803] [cae53ea0] [10025770] 0x10025770 (unreliable) [ 380.455086] [cae53ec0] [c0008844] do_notify_resume+0x128/0x2cc [ 380.456149] [cae53f20] [c000df70] interrupt_exit_user_prepare_main+0x7c/0xd8 [ 380.457205] [cae53f30] [c00120b4] ret_from_syscall+0xc/0x28 [ 380.458249] --- interrupt: c00 at 0xff42e98 [ 380.458809] NIP: 0ff42e98 LR: 0ff42e8c CTR: 0fdff244 [ 380.459298] REGS: cae53f40 TRAP: 0c00 Tainted: GW (6.4.0-rc2-s3k-dev-02274-gca69d28ba73c) [ 380.459796] MSR: d032 CR: 38084448 XER: 2000f701 [ 380.463473] [ 380.463473] GPR00: 001d 7fab8010 77fc34d0 0004 [ 380.463473] GPR08: 2009 7fab7ef0 28084842 1002c9f4 0001 0005 [ 380.463473] GPR16: 10025760 100254bc 10024eec 0004 [ 380.463473] GPR24: 0006 10025770 100100cc 0fc425ec 1001048c 7fab8048 10025558 10024ee8 [ 380.477466] NIP [0ff42e98] 0xff42e98 [ 380.478015] LR [0ff42e8c] 0xff42e8c [ 380.478546] --- interrupt: c00 [ 380.479026] Code: 408201f8 807f0080 48033361 2c03 41a2ffc8 2c1d 41820138 3c80c096 3884d5d0 3c60c096 3863d6bc 4800fa79 <0fe0> 93410018 814203c8 714a0100 [ 380.486263] ---[ end trace ]--- [ 380.492813] [ 380.547950] CORSurv[422]: bad frame in handle_signal32: 3e6b61e5 nip 0ff42e98 lr 0ff42e8c Any idea ? Christophe
Re: linux-next: boot failure after merge of the tip tree
* Peter Zijlstra wrote: > On Wed, May 24, 2023 at 03:44:59PM +1000, Stephen Rothwell wrote: > > Hi all, > > > > After merging the tip tree, today's linux-next build (powerpc > > pseries_le_defconfig) failed to boot like this: > > > > Built 1 zonelists, mobility grouping on. Total pages: 32736 > > Policy zone: Normal > > mem auto-init: stack:all(zero), heap alloc:off, heap free:off > > Memory: 2027392K/2097152K available (17984K kernel code, 3328K rwdata, > > 14784K rodata, 6080K init, 1671K bss, 69760K reserved, 0K cma-reserved) > > > > *crickets* > > > > Bisected to commit > > > > f4ab23558310 ("slub: Replace cmpxchg_double()") > > > > I have reverted that commit (and the following one) for today. > > Sorry about that; turns out I forgot to test the case where cmpxchg128 > wasn't available. > > Please see: > > > https://lkml.kernel.org/r/20230524093246.gp83...@hirez.programming.kicks-ass.net > > As stated there, I'm going to zap tip/locking/core for a few days and > let this fixed version run through the robots again before re-instating > it. Note to -next, this should now be resolved in the tip:auto-latest branch. Thanks, Ingo