[PATCH 10/10] watchdog/hardlockup: Rename HAVE_HARDLOCKUP_DETECTOR_NON_ARCH to ..._PERF_OR_BUDDY

2023-05-26 Thread Douglas Anderson
HAVE_HARDLOCKUP_DETECTOR_NON_ARCH is a mouthful and
confusing. HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY is even more of a
mouthful, but probably less confusing. Rename the Kconfig names.

Signed-off-by: Douglas Anderson 
---

 lib/Kconfig.debug | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index eb1edd5905bc..b9e162698a82 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1058,7 +1058,7 @@ config HARDLOCKUP_DETECTOR_BUDDY
 # needs SMP). In either case, using the "non-arch" code conflicts with
 # the NMI watchdog code (which is sometimes used directly and sometimes used
 # by the arch-provided hardlockup detector).
-config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+config HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
bool
depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG
default y
@@ -1077,10 +1077,10 @@ config HARDLOCKUP_DETECTOR_PREFER_BUDDY
  an arch-specific hardlockup detector or if resources needed
  for the hardlockup detector are better used for other things.
 
-# This will select the appropriate non-arch hardlockdup detector
-config HARDLOCKUP_DETECTOR_NON_ARCH
+# This will select the appropriate non-arch hardlockup detector
+config HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
bool
-   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+   depends on HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY
select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY
 
@@ -1098,9 +1098,9 @@ config HARDLOCKUP_CHECK_TIMESTAMP
 config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
depends on DEBUG_KERNEL && !S390
-   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || HAVE_HARDLOCKUP_DETECTOR_ARCH
+   depends on HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
select LOCKUP_DETECTOR
-   select HARDLOCKUP_DETECTOR_NON_ARCH if HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+   select HARDLOCKUP_DETECTOR_PERF_OR_BUDDY if HAVE_HARDLOCKUP_DETECTOR_PERF_OR_BUDDY
 
help
  Say Y here to enable the kernel to act as a watchdog to detect
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 09/10] watchdog/hardlockup: Move SMP barriers from common code to buddy code

2023-05-26 Thread Douglas Anderson
It's been suggested that since the SMP barriers are only potentially
useful for the buddy hardlockup detector, not the perf hardlockup
detector, the barriers belong in the buddy code. Let's move them
and add clearer comments about why they're needed.
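
For reference, a minimal sketch (not part of the patch itself) of the
pairing the barriers provide once they live in the buddy code: the
enabling/disabling CPU must make its "touch" visible before it updates
watchdog_cpus, and the checking CPU reads the mask first and the touch
flag second.

  /* writer side, e.g. watchdog_hardlockup_enable()/disable() */
  watchdog_hardlockup_touch_cpu(next_cpu);        /* set the touched flag  */
  smp_wmb();                                      /* order the touch ...   */
  cpumask_set_cpu(cpu, &watchdog_cpus);           /* ... before the mask   */

  /* reader side, e.g. watchdog_buddy_check_hardlockup() */
  next_cpu = watchdog_next_cpu(smp_processor_id());  /* reads watchdog_cpus */
  smp_rmb();                                      /* pairs with smp_wmb()  */
  watchdog_hardlockup_check(next_cpu, NULL);      /* reads the touched flag */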

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 kernel/watchdog.c   |  6 --
 kernel/watchdog_buddy.c | 21 +
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6cc46b8e3d07..a351ab0c35eb 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -109,9 +109,6 @@ EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 void watchdog_hardlockup_touch_cpu(unsigned int cpu)
 {
per_cpu(watchdog_hardlockup_touched, cpu) = true;
-
-   /* Match with smp_rmb() in watchdog_hardlockup_check() */
-   smp_wmb();
 }
 
 static bool is_hardlockup(unsigned int cpu)
@@ -141,9 +138,6 @@ static void watchdog_hardlockup_kick(void)
 
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 {
-   /* Match with smp_wmb() in watchdog_hardlockup_touch_cpu() */
-   smp_rmb();
-
if (per_cpu(watchdog_hardlockup_touched, cpu)) {
per_cpu(watchdog_hardlockup_touched, cpu) = false;
return;
diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
index 2ef88722c5e7..34dbfe091f4b 100644
--- a/kernel/watchdog_buddy.c
+++ b/kernel/watchdog_buddy.c
@@ -51,6 +51,13 @@ void watchdog_hardlockup_enable(unsigned int cpu)
if (next_cpu < nr_cpu_ids)
watchdog_hardlockup_touch_cpu(next_cpu);
 
+   /*
+* Makes sure that the watchdog is touched on this CPU before
+* other CPUs can see it in watchdog_cpus. The counterpart
+* is in watchdog_buddy_check_hardlockup().
+*/
+   smp_wmb();
+
cpumask_set_cpu(cpu, &watchdog_cpus);
 }
 
@@ -68,6 +75,13 @@ void watchdog_hardlockup_disable(unsigned int cpu)
if (next_cpu < nr_cpu_ids)
watchdog_hardlockup_touch_cpu(next_cpu);
 
+   /*
+* Makes sure that the watchdog is touched on the next CPU before
+* this CPU disappears from watchdog_cpus. The counterpart is in
+* watchdog_buddy_check_hardlockup().
+*/
+   smp_wmb();
+
cpumask_clear_cpu(cpu, &watchdog_cpus);
 }
 
@@ -88,5 +102,12 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
if (next_cpu >= nr_cpu_ids)
return;
 
+   /*
+* Make sure that the watchdog was touched on the next CPU when
+* watchdog_next_cpu() returned another one because of
+* a change in watchdog_hardlockup_enable()/disable().
+*/
+   smp_rmb();
+
watchdog_hardlockup_check(next_cpu, NULL);
 }
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 08/10] watchdog/buddy: Simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY

2023-05-26 Thread Douglas Anderson
The dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY was more
complicated than it needed to be. If the "perf" detector is available
and we have SMP then we have a choice, so enable the config based on
just those two config items.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 lib/Kconfig.debug | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c9df93402237..eb1edd5905bc 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1065,7 +1065,7 @@ config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
 
 config HARDLOCKUP_DETECTOR_PREFER_BUDDY
bool "Prefer the buddy CPU hardlockup detector"
-   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH && HAVE_HARDLOCKUP_DETECTOR_PERF && SMP
+   depends on HAVE_HARDLOCKUP_DETECTOR_PERF && SMP
help
  Say Y here to prefer the buddy hardlockup detector over the perf one.
 
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 07/10] watchdog/buddy: Don't copy the cpumask in watchdog_next_cpu()

2023-05-26 Thread Douglas Anderson
There's no reason to make a local copy of "watchdog_cpus" in
watchdog_next_cpu(). Making a copy wouldn't make things any more race
free, and since we're only reading the value there's no need for one.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 kernel/watchdog_buddy.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
index 3ffc5f2ede5a..2ef88722c5e7 100644
--- a/kernel/watchdog_buddy.c
+++ b/kernel/watchdog_buddy.c
@@ -10,12 +10,11 @@ static cpumask_t __read_mostly watchdog_cpus;
 
 static unsigned int watchdog_next_cpu(unsigned int cpu)
 {
-   cpumask_t cpus = watchdog_cpus;
unsigned int next_cpu;
 
-   next_cpu = cpumask_next(cpu, &cpus);
+   next_cpu = cpumask_next(cpu, &watchdog_cpus);
if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first(&cpus);
+   next_cpu = cpumask_first(&watchdog_cpus);
 
if (next_cpu == cpu)
return nr_cpu_ids;
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 06/10] watchdog/buddy: Cleanup how watchdog_buddy_check_hardlockup() is called

2023-05-26 Thread Douglas Anderson
In the patch ("watchdog/hardlockup: detect hard lockups using
secondary (buddy) CPUs"), we added a call from the common watchdog.c
file into the buddy. That call could be done more cleanly.
Specifically:
1. If we move the call into watchdog_hardlockup_kick() then it keeps
   watchdog_timer_fn() simpler.
2. We don't need to pass an "unsigned long" to the buddy for the timer
   count. In the patch ("watchdog/hardlockup: add a "cpu" param to
   watchdog_hardlockup_check()") the count was changed to "atomic_t"
   which is backed by an int, so we should match types.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 include/linux/nmi.h |  4 ++--
 kernel/watchdog.c   | 15 +++
 kernel/watchdog_buddy.c |  2 +-
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 6c6a5ce77c66..43893e5f858a 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -114,9 +114,9 @@ void watchdog_hardlockup_disable(unsigned int cpu);
 void lockup_detector_reconfigure(void);
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR_BUDDY
-void watchdog_buddy_check_hardlockup(unsigned long hrtimer_interrupts);
+void watchdog_buddy_check_hardlockup(int hrtimer_interrupts);
 #else
-static inline void watchdog_buddy_check_hardlockup(unsigned long hrtimer_interrupts) {}
+static inline void watchdog_buddy_check_hardlockup(int hrtimer_interrupts) {}
 #endif
 
 /**
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 85f4839b6faf..6cc46b8e3d07 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -131,9 +131,12 @@ static bool is_hardlockup(unsigned int cpu)
return false;
 }
 
-static unsigned long watchdog_hardlockup_kick(void)
+static void watchdog_hardlockup_kick(void)
 {
-   return atomic_inc_return(this_cpu_ptr(&hrtimer_interrupts));
+   int new_interrupts;
+
+   new_interrupts = atomic_inc_return(this_cpu_ptr(&hrtimer_interrupts));
+   watchdog_buddy_check_hardlockup(new_interrupts);
 }
 
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
@@ -195,7 +198,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct 
pt_regs *regs)
 
 #else /* CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */
 
-static inline unsigned long watchdog_hardlockup_kick(void) { return 0; }
+static inline void watchdog_hardlockup_kick(void) { }
 
 #endif /* !CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */
 
@@ -449,15 +452,11 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
struct pt_regs *regs = get_irq_regs();
int duration;
int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;
-   unsigned long hrtimer_interrupts;
 
if (!watchdog_enabled)
return HRTIMER_NORESTART;
 
-   hrtimer_interrupts = watchdog_hardlockup_kick();
-
-   /* test for hardlockups */
-   watchdog_buddy_check_hardlockup(hrtimer_interrupts);
+   watchdog_hardlockup_kick();
 
/* kick the softlockup detector */
if (completion_done(this_cpu_ptr(&softlockup_completion))) {
diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
index fee45af2e5bd..3ffc5f2ede5a 100644
--- a/kernel/watchdog_buddy.c
+++ b/kernel/watchdog_buddy.c
@@ -72,7 +72,7 @@ void watchdog_hardlockup_disable(unsigned int cpu)
cpumask_clear_cpu(cpu, &watchdog_cpus);
 }
 
-void watchdog_buddy_check_hardlockup(unsigned long hrtimer_interrupts)
+void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
 {
unsigned int next_cpu;
 
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 05/10] watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()

2023-05-26 Thread Douglas Anderson
In the patch ("watchdog/hardlockup: add comments to
touch_nmi_watchdog()") we adjusted some comments for
touch_nmi_watchdog(). The comment about the softlockup had a typo and
were also felt to be too obvious. Remove it.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 include/linux/nmi.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 3a27169ec383..6c6a5ce77c66 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -140,10 +140,6 @@ static inline void touch_nmi_watchdog(void)
 */
arch_touch_nmi_watchdog();
 
-   /*
-* Touching the hardlock detector implcitily pets the
-* softlockup detector too
-*/
touch_softlockup_watchdog();
 }
 
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 04/10] watchdog/hardlockup: In watchdog_hardlockup_check() use cpumask_copy()

2023-05-26 Thread Douglas Anderson
In the patch ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") we started using a cpumask to keep track
of which CPUs to backtrace. When setting up this cpumask, it's better
to use cpumask_copy() than to just copy the structure directly. Fix this.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 kernel/watchdog.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 32dac8028753..85f4839b6faf 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -154,7 +154,9 @@ void watchdog_hardlockup_check(unsigned int cpu, struct 
pt_regs *regs)
 */
if (is_hardlockup(cpu)) {
unsigned int this_cpu = smp_processor_id();
-   struct cpumask backtrace_mask = *cpu_online_mask;
+   struct cpumask backtrace_mask;
+
+   cpumask_copy(&backtrace_mask, cpu_online_mask);
 
/* Only print hardlockups once. */
if (per_cpu(watchdog_hardlockup_warned, cpu))
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 03/10] watchdog/hardlockup: Don't use raw_cpu_ptr() in watchdog_hardlockup_kick()

2023-05-26 Thread Douglas Anderson
In the patch ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") there was no reason to use
raw_cpu_ptr(). Using this_cpu_ptr() works fine.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---

 kernel/watchdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 62230f5b8878..32dac8028753 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -133,7 +133,7 @@ static bool is_hardlockup(unsigned int cpu)
 
 static unsigned long watchdog_hardlockup_kick(void)
 {
-   return atomic_inc_return(raw_cpu_ptr(&hrtimer_interrupts));
+   return atomic_inc_return(this_cpu_ptr(&hrtimer_interrupts));
 }
 
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 00/10] watchdog: Cleanup / fixes after buddy series v5 reviews

2023-05-26 Thread Douglas Anderson


This patch series attempts to finish resolving the feedback received
from Petr Mladek on the v5 series I posted. Andrew has already landed
v5 so I'm posting this as additional patches.

Probably the only thing that wasn't fully as clean as Petr requested
was the Kconfig stuff. I couldn't find a better way to express it
without a more major overhaul. At the very least, I renamed "NON_ARCH"
to "PERF_OR_BUDDY" in the hope that it will make things marginally better.

Nothing in this series is terribly critical and even the bugfixes are
small. However, it does clean up a few things that were pointed out in
review.


Douglas Anderson (10):
  watchdog/hardlockup: Keep kernel.nmi_watchdog sysctl as 0444 if probe
fails
  watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement
watchdog_hardlockup_probe()
  watchdog/hardlockup: Don't use raw_cpu_ptr() in
watchdog_hardlockup_kick()
  watchdog/hardlockup: In watchdog_hardlockup_check() use cpumask_copy()
  watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
  watchdog/buddy: Cleanup how watchdog_buddy_check_hardlockup() is
called
  watchdog/buddy: Don't copy the cpumask in watchdog_next_cpu()
  watchdog/buddy: Simplify the dependency for
HARDLOCKUP_DETECTOR_PREFER_BUDDY
  watchdog/hardlockup: Move SMP barriers from common code to buddy code
  watchdog/hardlockup: Rename HAVE_HARDLOCKUP_DETECTOR_NON_ARCH to
..._PERF_OR_BUDDY

 arch/Kconfig|  3 +-
 arch/sparc/kernel/nmi.c |  5 +++
 include/linux/nmi.h | 14 ++---
 kernel/watchdog.c   | 68 ++---
 kernel/watchdog_buddy.c | 28 ++---
 lib/Kconfig.debug   | 14 -
 6 files changed, 70 insertions(+), 62 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 01/10] watchdog/hardlockup: Keep kernel.nmi_watchdog sysctl as 0444 if probe fails

2023-05-26 Thread Douglas Anderson
The permissions for the kernel.nmi_watchdog sysctl have always been
set at compile time despite the fact that a watchdog can fail to
probe. Let's fix this and set the permissions based on whether the
hardlockup detector actually probed.

Fixes: a994a3147e4c ("watchdog/hardlockup/perf: Implement init time detection of perf")
Reported-by: Petr Mladek 
Closes: https://lore.kernel.org/r/ZHCn4hNxFpY5-9Ki@alley
Signed-off-by: Douglas Anderson 
---

 include/linux/nmi.h |  6 --
 kernel/watchdog.c   | 30 --
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 333465e235e1..3a27169ec383 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -95,12 +95,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct 
pt_regs *regs);
 static inline void arch_touch_nmi_watchdog(void) { }
 #endif
 
-#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
-# define NMI_WATCHDOG_SYSCTL_PERM  0644
-#else
-# define NMI_WATCHDOG_SYSCTL_PERM  0444
-#endif
-
 #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 237990e8d345..4b9e31edb47f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -880,15 +880,6 @@ static struct ctl_table watchdog_sysctls[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = (void *)&sixty,
},
-   {
-   .procname   = "nmi_watchdog",
-   .data   = &watchdog_hardlockup_user_enabled,
-   .maxlen = sizeof(int),
-   .mode   = NMI_WATCHDOG_SYSCTL_PERM,
-   .proc_handler   = proc_nmi_watchdog,
-   .extra1 = SYSCTL_ZERO,
-   .extra2 = SYSCTL_ONE,
-   },
{
.procname   = "watchdog_cpumask",
.data   = &watchdog_cpumask_bits,
@@ -952,10 +943,28 @@ static struct ctl_table watchdog_sysctls[] = {
{}
 };
 
+static struct ctl_table watchdog_hardlockup_sysctl[] = {
+   {
+   .procname   = "nmi_watchdog",
+   .data   = &watchdog_hardlockup_user_enabled,
+   .maxlen = sizeof(int),
+   .mode   = 0444,
+   .proc_handler   = proc_nmi_watchdog,
+   .extra1 = SYSCTL_ZERO,
+   .extra2 = SYSCTL_ONE,
+   },
+   {}
+};
+
 static void __init watchdog_sysctl_init(void)
 {
register_sysctl_init("kernel", watchdog_sysctls);
+
+   if (watchdog_hardlockup_available)
+   watchdog_hardlockup_sysctl[0].mode = 0644;
+   register_sysctl_init("kernel", watchdog_hardlockup_sysctl);
 }
+
 #else
 #define watchdog_sysctl_init() do { } while (0)
 #endif /* CONFIG_SYSCTL */
@@ -1011,6 +1020,8 @@ static int __init lockup_detector_check(void)
/* Make sure no work is pending. */
flush_work(&detector_work);
 
+   watchdog_sysctl_init();
+
return 0;
 
 }
@@ -1030,5 +1041,4 @@ void __init lockup_detector_init(void)
allow_lockup_detector_init_retry = true;
 
lockup_detector_setup();
-   watchdog_sysctl_init();
 }
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 02/10] watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()

2023-05-26 Thread Douglas Anderson
Right now there is one arch (sparc64) that selects HAVE_NMI_WATCHDOG
without selecting HAVE_HARDLOCKUP_DETECTOR_ARCH. Because of that one
architecture, we have some special case code in the watchdog core to
handle the fact that watchdog_hardlockup_probe() isn't implemented.

Let's implement watchdog_hardlockup_probe() for sparc64 and get rid of
the special case.

As a side effect of doing this, code inspection tells us that we could
fix a minor bug where the system won't properly realize that NMI
watchdogs are disabled. Specifically, on powerpc if
CONFIG_PPC_WATCHDOG is turned off the arch might still select
CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH which selects
CONFIG_HAVE_NMI_WATCHDOG. Since CONFIG_PPC_WATCHDOG is off, nothing
will override the "weak" watchdog_hardlockup_probe() and we'll fall
back to looking at CONFIG_HAVE_NMI_WATCHDOG.

Suggested-by: Petr Mladek 
Signed-off-by: Douglas Anderson 
---
Though this does fix a minor bug, I didn't mark this as "Fixes"
because it's super minor. One could also argue that this wasn't a bug
at all but simply was never an implemented feature. The code that
added some amount of dynamicness here was commit a994a3147e4c
("watchdog/hardlockup/perf: Implement init time detection of perf")
which, as per the title, was only intending to make "perf"
dynamic. The old NMI watchdog presumably has never been handled
dynamically.
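
As an illustration only (not part of this patch): an architecture that
wanted truly dynamic behavior would override the weak function and
report whether its NMI watchdog is actually present. The helper name
below is hypothetical:

  /* Hypothetical arch override; arch_nmi_watchdog_present() is made up. */
  int __init watchdog_hardlockup_probe(void)
  {
          if (!arch_nmi_watchdog_present())
                  return -ENODEV;   /* not available: detector stays off */

          return 0;                 /* available: the core can enable it */
  }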

 arch/Kconfig|  3 ++-
 arch/sparc/kernel/nmi.c |  5 +
 kernel/watchdog.c   | 13 -
 3 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 64d771855ecd..b4f6387b12fe 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -428,7 +428,8 @@ config HAVE_NMI_WATCHDOG
bool
help
  The arch provides a low level NMI watchdog. It provides
- asm/nmi.h, and defines its own arch_touch_nmi_watchdog().
+ asm/nmi.h, and defines its own watchdog_hardlockup_probe() and
+ arch_touch_nmi_watchdog().
 
 config HAVE_HARDLOCKUP_DETECTOR_ARCH
bool
diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c
index 9d9e29b75c43..17cdfdbf1f3b 100644
--- a/arch/sparc/kernel/nmi.c
+++ b/arch/sparc/kernel/nmi.c
@@ -65,6 +65,11 @@ void arch_touch_nmi_watchdog(void)
 }
 EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
+int __init watchdog_hardlockup_probe(void)
+{
+   return 0;
+}
+
 static void die_nmi(const char *str, struct pt_regs *regs, int do_panic)
 {
int this_cpu = smp_processor_id();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 4b9e31edb47f..62230f5b8878 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -217,19 +217,6 @@ void __weak watchdog_hardlockup_disable(unsigned int cpu) 
{ }
  */
 int __weak __init watchdog_hardlockup_probe(void)
 {
-   /*
-* If CONFIG_HAVE_NMI_WATCHDOG is defined then an architecture
-* is assumed to have the hard watchdog available and we return 0.
-*/
-   if (IS_ENABLED(CONFIG_HAVE_NMI_WATCHDOG))
-   return 0;
-
-   /*
-* Hardlockup detectors other than those using CONFIG_HAVE_NMI_WATCHDOG
-* are required to implement a non-weak version of this probe function
-* to tell whether they are available. If they don't override then
-* we'll return -ENODEV.
-*/
return -ENODEV;
 }
 
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-05-26 Thread Yu Zhao
Use mmu_notifier_test_clear_young() to handle KVM PTEs in batches
when the fast path is supported. This reduces the contention on
kvm->mmu_lock when the host is under heavy memory pressure.

An existing selftest can quickly demonstrate the effectiveness of
this patch. On a generic workstation equipped with 128 CPUs and 256GB
DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU run2
  --
  Before [1]~64s
  After ~51s

  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.71%  evict_folios
  97.29%  shrink_folio_list
  ==>>  13.05%  folio_referenced
  12.83%  rmap_walk_file
12.31%  folio_referenced_one
  7.90%  __mmu_notifier_clear_young
7.72%  kvm_mmu_notifier_clear_young
  7.34%  _raw_write_lock

  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.59%  evict_folios
  80.37%  shrink_folio_list
  ==>>  3.74%  folio_referenced
  3.59%  rmap_walk_file
3.19%  folio_referenced_one
  2.53%  lru_gen_look_around
1.06%  __mmu_notifier_test_clear_young

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/

Signed-off-by: Yu Zhao 
---
 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 include/linux/mmzone.h|   6 +-
 mm/rmap.c |   8 +-
 mm/vmscan.c   | 139 --
 4 files changed, 138 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst 
b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..0ae2a6d4d94c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Clearing the accessed bit in KVM page table entries in large
+   batches, when KVM MMU sets it (e.g., on x86_64). This can
+   improve the performance of guests when the host is under memory
+   pressure.
 [yYnN] Apply to all the components above.
 == ===
 
@@ -56,7 +60,7 @@ E.g.,
 
 echo y >/sys/kernel/mm/lru_gen/enabled
 cat /sys/kernel/mm/lru_gen/enabled
-0x0007
+0x000f
 echo 5 >/sys/kernel/mm/lru_gen/enabled
 cat /sys/kernel/mm/lru_gen/enabled
 0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5a7ada0413da..1b148f39fabc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+   LRU_GEN_KVM_MMU_WALK,
NR_LRU_GEN_CAPS
 };
 
@@ -471,7 +472,7 @@ struct lru_gen_mm_walk {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 
@@ -559,8 +560,9 @@ static inline void lru_gen_init_lruvec(struct lruvec 
*lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+   return false;
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index ae127f60a4fb..3a2c4e6a0887 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -825,12 +825,10 @@ static bool folio_referenced_one(struct folio *folio,
return false; /* To break the loop */
}
 
-   if (pvmw.pte) {
-   if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
-   lru_gen_look_around(&pvmw);
+   if (lru_gen_enabled() && pvmw.pte) {
+   if (lru_gen_look_around(&pvmw))
referenced++;
-   }
-
+   } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef687f9be13c..3f734588b600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -3244,6 +3245,20 @@ static bool should_clear_pmd_young(void)
return 

[PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()

2023-05-26 Thread Yu Zhao
Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., TDP MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
clear_bit() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.
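
As a side note (illustration, not from the patch): the "ffs() - 1" idiom
used in the diff below converts a single-bit mask into the bit number
that clear_bit() expects, since ffs() returns the 1-based position of
the least significant set bit:

  /* e.g. if shadow_accessed_mask were BIT(8), ffs() returns 9, offset is 8 */
  int offset = ffs(shadow_accessed_mask) - 1;

  clear_bit(offset, (unsigned long *)sptep);  /* atomically clear that bit */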

Signed-off-by: Yu Zhao 
---
 arch/x86/include/asm/kvm_host.h |  7 +++
 arch/x86/kvm/mmu/tdp_mmu.c  | 34 +
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 753c67072c47..d6dfdebe3d94 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2223,4 +2223,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
unsigned long npages);
  */
 #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_X86_64) &&
+  (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && shadow_accessed_mask));
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 08340219c35a..6875a819e007 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   struct kvm_mmu_page *root;
+   int offset = ffs(shadow_accessed_mask) - 1;
+
+   if (kvm_shadow_root_allocated(kvm))
+   return true;
+
+   rcu_read_lock();
+
+   list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+   struct tdp_iter iter;
+
+   if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+   continue;
+
+   tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+   u64 *sptep = rcu_dereference(iter.sptep);
+
+   VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
+
+   if (!(iter.old_spte & shadow_accessed_mask))
+   continue;
+
+   if (kvm_should_clear_young(range, iter.gfn))
+   clear_bit(offset, (unsigned long *)sptep);
+   }
+   }
+
+   rcu_read_unlock();
+
+   return false;
+}
+
 static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 struct kvm_gfn_range *range)
 {
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask

2023-05-26 Thread Yu Zhao
tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_has_test_clear_young().

Signed-off-by: Yu Zhao 
---
 arch/x86/include/asm/kvm_host.h | 6 ++
 arch/x86/kvm/mmu.h  | 6 --
 arch/x86/kvm/mmu/spte.h | 1 -
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..753c67072c47 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf {
 
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
 extern struct kvm_x86_ops kvm_x86_ops;
@@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned 
irqchip, unsigned pin,
 bool mask);
 
 extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..84aedb2671ef 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -253,12 +253,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm 
*kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..a82c4fa1c47b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
 extern u64 __read_mostly shadow_mmio_mask;
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

2023-05-26 Thread Yu Zhao
Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., radix MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
pte_xchg() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao 
---
 arch/powerpc/include/asm/kvm_host.h|  8 
 arch/powerpc/include/asm/kvm_ppc.h |  1 +
 arch/powerpc/kvm/book3s.c  |  6 +++
 arch/powerpc/kvm/book3s.h  |  1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++
 arch/powerpc/kvm/book3s_hv.c   |  5 +++
 6 files changed, 80 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..75c260ea8a9e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu 
*vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+  cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
+  radix_enabled();
+}
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 79a9c0bb8bba..ff1af6a7b44f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -287,6 +287,7 @@ struct kvmppc_ops {
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+   bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
void (*free_memslot)(struct kvm_memory_slot *slot);
int (*init_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..37bf40b0c4ff 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   return !kvm->arch.kvm_ops->test_clear_young ||
+  kvm->arch.kvm_ops->test_clear_young(kvm, range);
+}
+
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fa2659e21ccc 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range 
*range);
 extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range 
*range);
 extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 3b65b3b11041..0a392e9a100a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
return ref;
 }
 
+bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   bool err;
+   gfn_t gfn = range->start;
+
+   rcu_read_lock();
+
+   err = !kvm_is_radix(kvm);
+   if (err)
+   goto unlock;
+
+   /*
+* Case 1:  This function  kvmppc_switch_mmu_to_hpt()
+*
+*  rcu_read_lock()
+*  Test kvm_is_radix()kvm->arch.radix = 0
+*  Use kvm->arch.pgtable  synchronize_rcu()
+*  rcu_read_unlock()
+* kvmppc_free_radix()
+*
+*
+* Case 2:  This function  kvmppc_switch_mmu_to_radix()
+*
+* kvmppc_init_vm_radix()
+* smp_wmb()
+*  Test kvm_is_radix()kvm->arch.radix = 1
+*  smp_rmb()
+*  Use kvm->arch.pgtable
+*/
+   

[PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

2023-05-26 Thread Yu Zhao
KVM page tables are currently not RCU safe against remapping, i.e.,
kvmppc_unmap_free_pmd_entry_table() et al. The previous
mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
that operation.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, orphan page tables need to
be freed by RCU.

Unmapping, specifically kvm_unmap_radix(), does not free page tables,
hence not a concern.

Signed-off-by: Yu Zhao 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..3b65b3b11041 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
 {
unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
 
-   kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
+   kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
+ SLAB_TYPESAFE_BY_RCU, pte_ctor);
if (!kvm_pte_cache)
return -ENOMEM;
 
size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
 
-   kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
+   kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
+ SLAB_TYPESAFE_BY_RCU, pmd_ctor);
if (!kvm_pmd_cache) {
kmem_cache_destroy(kvm_pte_cache);
return -ENOMEM;
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

2023-05-26 Thread Yu Zhao
Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., hardware sets the accessed bit in
KVM PTEs and VMs are not protected, where it can rely on RCU and
cmpxchg to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_host.h |  6 ++
 arch/arm64/kvm/mmu.c  | 36 +++
 2 files changed, 42 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..da32b0890716 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c3b3e2afe26f..26a8d955b49c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
   range->start << PAGE_SHIFT);
 }
 
+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+  enum kvm_pgtable_walk_flags flags)
+{
+   kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+   VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+
+   if (!kvm_pte_valid(new))
+   return 0;
+
+   if (new == ctx->old)
+   return 0;
+
+   if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
+   stage2_try_set_pte(ctx, new);
+
+   return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   u64 start = range->start * PAGE_SIZE;
+   u64 end = range->end * PAGE_SIZE;
+   struct kvm_pgtable_walker walker = {
+   .cb = stage2_test_clear_young,
+   .arg= range,
+   .flags  = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
+   };
+
+   BUILD_BUG_ON(is_hyp_code());
+
+   kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+   return false;
+}
+
 phys_addr_t kvm_mmu_get_httbr(void)
 {
return __pa(hyp_pgtable->pgd);
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit

2023-05-26 Thread Yu Zhao
TLDR

This patchset adds a fast path to clear the accessed bit without
taking kvm->mmu_lock. It can significantly improve the performance of
guests when the host is under heavy memory pressure.

ChromeOS has been using a similar approach [1] since mid 2021 and it
has proven successful on tens of millions of devices.

This v2 addresses previous review requests [2]: refactoring code,
removing inaccurate/redundant text, etc.

[1] https://crrev.com/c/2987928
[2] https://lore.kernel.org/r/20230217041230.2417228-1-yuz...@google.com/

Overview

The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on a simple yet
common case where hardware sets the accessed bit in KVM PTEs and VMs
are not nested. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

The fast path relies on two techniques to safely clear the accessed
bit: RCU and CAS. The former protects KVM page tables from being
freed while the latter clears the accessed bit atomically against
both the hardware and other software page table walkers.
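
As a rough sketch of the CAS half of that (not code from this series):
clear the accessed bit with a compare-and-exchange so that a concurrent
update by hardware or another walker is never overwritten; the caller is
assumed to hold rcu_read_lock() so the page table cannot be freed
underneath it.

  static bool test_clear_accessed(u64 *ptep, u64 accessed_mask)
  {
          u64 old = READ_ONCE(*ptep);

          if (!(old & accessed_mask))
                  return false;                   /* not young */

          /*
           * Clear the bit only if the PTE did not change under us; if the
           * cmpxchg loses a race, the accessed bit is simply left set for
           * the next scan instead of clobbering the concurrent update.
           */
          cmpxchg64(ptep, old, old & ~accessed_mask);
          return true;
  }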

A new mmu_notifier_ops member, test_clear_young(), supersedes the
existing clear_young() and test_young(). This extended callback can
operate on a range of KVM PTEs individually according to a bitmap, if
the caller provides it.

Evaluation
==
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 128 CPUs and
256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250
  
  MGLRU run2
  --
  Before [1]~64s
  After ~51s
  
  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.71%  evict_folios
  97.29%  shrink_folio_list
  ==>>  13.05%  folio_referenced
  12.83%  rmap_walk_file
12.31%  folio_referenced_one
  7.90%  __mmu_notifier_clear_young
7.72%  kvm_mmu_notifier_clear_young
  7.34%  _raw_write_lock
  
  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.59%  evict_folios
  80.37%  shrink_folio_list
  ==>>  3.74%  folio_referenced
  3.59%  rmap_walk_file
3.19%  folio_referenced_one
  2.53%  lru_gen_look_around
1.06%  __mmu_notifier_test_clear_young

Comprehensive benchmarks are coming soon.

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
 reference" was included so that the comparison is apples to
 apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/

Yu Zhao (10):
  mm/kvm: add mmu_notifier_ops->test_clear_young()
  mm/kvm: use mmu_notifier_ops->test_clear_young()
  kvm/arm64: export stage2_try_set_pte() and macros
  kvm/arm64: make stage2 page tables RCU safe
  kvm/arm64: add kvm_arch_test_clear_young()
  kvm/powerpc: make radix page tables RCU safe
  kvm/powerpc: add kvm_arch_test_clear_young()
  kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
  kvm/x86: add kvm_arch_test_clear_young()
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/arm64/include/asm/kvm_host.h |   6 +
 arch/arm64/include/asm/kvm_pgtable.h  |  55 +++
 arch/arm64/kvm/arm.c  |   1 +
 arch/arm64/kvm/hyp/pgtable.c  |  61 +---
 arch/arm64/kvm/mmu.c  |  53 ++-
 arch/powerpc/include/asm/kvm_host.h   |   8 +
 arch/powerpc/include/asm/kvm_ppc.h|   1 +
 arch/powerpc/kvm/book3s.c |   6 +
 arch/powerpc/kvm/book3s.h |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  65 +++-
 arch/powerpc/kvm/book3s_hv.c  |   5 +
 arch/x86/include/asm/kvm_host.h   |  13 ++
 arch/x86/kvm/mmu.h|   6 -
 arch/x86/kvm/mmu/spte.h   |   1 -
 arch/x86/kvm/mmu/tdp_mmu.c|  34 +
 include/linux/kvm_host.h  |  22 +++
 include/linux/mmu_notifier.h  |  79 ++
 include/linux/mmzone.h|   6 +-
 include/trace/events/kvm.h|  15 --
 mm/mmu_notifier.c |  48 ++
 mm/rmap.c |   8 +-
 mm/vmscan.c   | 139 --
 virt/kvm/kvm_main.c   | 114 --
 24 files changed, 546 insertions(+), 207 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-26 Thread Yu Zhao
Stage2 page tables are currently not RCU safe against unmapping or VM
destruction. The previous mmu_notifier_ops members rely on
kvm->mmu_lock to synchronize with those operations.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, unmapped page tables need
to be freed by RCU and kvm_free_stage2_pgd() needs to be after
mmu_notifier_unregister().

Remapping, specifically stage2_free_removed_table(), is already RCU
safe.

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_pgtable.h |  2 ++
 arch/arm64/kvm/arm.c |  1 +
 arch/arm64/kvm/hyp/pgtable.c |  8 ++--
 arch/arm64/kvm/mmu.c | 17 -
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index ff520598b62c..5cab52e3a35f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 
level)
  * @put_page:  Decrement the refcount on a page. When the
  * refcount reaches 0 the page is automatically
  * freed.
+ * @put_page_rcu:  RCU variant of the above.
  * @page_count:Return the refcount of a page.
  * @phys_to_virt:  Convert a physical address into a virtual
  * address mapped in the current context.
@@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops {
void(*free_removed_table)(void *addr, u32 level);
void(*get_page)(void *addr);
void(*put_page)(void *addr);
+   void(*put_page_rcu)(void *addr);
int (*page_count)(void *addr);
void*   (*phys_to_virt)(phys_addr_t phys);
phys_addr_t (*virt_to_phys)(void *addr);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..ee93271035d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, 
struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+   kvm_free_stage2_pgd(&kvm->arch.mmu);
bitmap_free(kvm->arch.pmu_filter);
free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 24678ccba76a..dbace4c6a841 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct 
kvm_pgtable_visit_ctx *ctx,
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
   kvm_granule_size(ctx->level));
 
-   if (childp)
-   mm_ops->put_page(childp);
+   if (childp) {
+   if (mm_ops->put_page_rcu)
+   mm_ops->put_page_rcu(childp);
+   else
+   mm_ops->put_page(childp);
+   }
 
return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..c3b3e2afe26f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr)
return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+   put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+   struct page *page = virt_to_page(addr);
+
+   if (kvm_host_page_count(addr) == 1)
+   kvm_account_pgtable_pages(addr, -1);
+
+   call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
return __pa(addr);
@@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
.free_removed_table = stage2_free_removed_table,
.get_page   = kvm_host_get_page,
.put_page   = kvm_s2_put_page,
+   .put_page_rcu   = kvm_s2_put_page_rcu,
.page_count = kvm_host_page_count,
.phys_to_virt   = kvm_host_va,
.virt_to_phys   = kvm_host_pa,
@@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-   kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young()

2023-05-26 Thread Yu Zhao
Replace test_young() and clear_young() with test_clear_young().

Signed-off-by: Yu Zhao 
---
 include/linux/mmu_notifier.h | 29 ++-
 include/trace/events/kvm.h   | 15 --
 mm/mmu_notifier.c| 42 
 virt/kvm/kvm_main.c  | 54 
 4 files changed, 2 insertions(+), 138 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index dfdbb370682d..c8f35fc08703 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -104,26 +104,6 @@ struct mmu_notifier_ops {
 unsigned long start,
 unsigned long end);
 
-   /*
-* clear_young is a lightweight version of clear_flush_young. Like the
-* latter, it is supposed to test-and-clear the young/accessed bitflag
-* in the secondary pte, but it may omit flushing the secondary tlb.
-*/
-   int (*clear_young)(struct mmu_notifier *subscription,
-  struct mm_struct *mm,
-  unsigned long start,
-  unsigned long end);
-
-   /*
-* test_young is called to check the young/accessed bitflag in
-* the secondary pte. This is used to know if the page is
-* frequently used without actually clearing the flag or tearing
-* down the secondary mapping on the page.
-*/
-   int (*test_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long address);
-
int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
unsigned long start, unsigned long end,
bool clear, unsigned long *bitmap);
@@ -393,11 +373,6 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long start,
  unsigned long end);
-extern int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-extern int __mmu_notifier_test_young(struct mm_struct *mm,
-unsigned long address);
 extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
   unsigned long start, unsigned long 
end,
   bool clear, unsigned long *bitmap);
@@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct 
*mm,
   unsigned long end)
 {
if (mm_has_notifiers(mm))
-   return __mmu_notifier_clear_young(mm, start, end);
+   return __mmu_notifier_test_clear_young(mm, start, end, true, NULL);
return 0;
 }
 
@@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
  unsigned long address)
 {
if (mm_has_notifiers(mm))
-   return __mmu_notifier_test_young(mm, address);
+   return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL);
return 0;
 }
 
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 3bd31ea23fee..46c347e56e60 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,21 +489,6 @@ TRACE_EVENT(kvm_age_hva,
  __entry->start, __entry->end)
 );
 
-TRACE_EVENT(kvm_test_age_hva,
-   TP_PROTO(unsigned long hva),
-   TP_ARGS(hva),
-
-   TP_STRUCT__entry(
-   __field(unsigned long,  hva )
-   ),
-
-   TP_fast_assign(
-   __entry->hva= hva;
-   ),
-
-   TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
-);
-
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7e6aba4bddcb..c7e9747c9920 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -382,48 +382,6 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
return young;
 }
 
-int __mmu_notifier_clear_young(struct mm_struct *mm,
-  unsigned long start,
-  unsigned long end)
-{
-   struct mmu_notifier *subscription;
-   int young = 0, id;
-
-   id = srcu_read_lock(&srcu);
-   hlist_for_each_entry_rcu(subscription,
-&mm->notifier_subscriptions->list, hlist,
-srcu_read_lock_held(&srcu)) {
-   if (subscription->ops->clear_young)
-   young |= subscription->ops->clear_young(subscription,
-   mm, start, end);
-   }
-   

[PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

2023-05-26 Thread Yu Zhao
Add mmu_notifier_ops->test_clear_young() to supersede test_young()
and clear_young().

test_clear_young() has a fast path which, if supported, allows its
callers to safely clear the accessed bit without taking
kvm->mmu_lock.

The fast path requires arch-specific code that generally relies on
RCU and CAS: the former protects KVM page tables from being freed
while the latter clears the accessed bit atomically against both the
hardware and other software page table walkers. If the fast path is
unsupported, test_clear_young() falls back to the existing slow path
where kvm->mmu_lock is then taken.

test_clear_young() can also operate on a range of KVM PTEs
individually according to a bitmap, if the caller provides it.
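
A hedged sketch of a hypothetical caller, using only the declaration
added below; the exact bit-to-page mapping is given by rule 2 of the
new comment (cut off in this archive), so it is not spelled out here:

  static void scan_range_example(struct mm_struct *mm,
                                 unsigned long start, unsigned long end)
  {
          unsigned long *bitmap;
          int ret = 0;

          /* Rule 1: at least one bit per page in [start, end). */
          bitmap = bitmap_zalloc((end - start) / PAGE_SIZE, GFP_KERNEL);
          if (!bitmap)
                  return;

          if (mm_has_notifiers(mm))
                  ret = __mmu_notifier_test_clear_young(mm, start, end,
                                                        true, bitmap);

          if (ret & MMU_NOTIFIER_RANGE_LOCKLESS) {
                  /* Fast path ran: the bitmap holds the per-page results. */
          } else {
                  /* No fast path: fall back to, e.g.,
                     mmu_notifier_clear_young(mm, start, end). */
          }

          bitmap_free(bitmap);
  }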

Signed-off-by: Yu Zhao 
---
 include/linux/kvm_host.h | 22 +++
 include/linux/mmu_notifier.h | 52 
 mm/mmu_notifier.c| 24 
 virt/kvm/kvm_main.c  | 76 +++-
 4 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..374262545f96 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm_gfn_range {
struct kvm_memory_slot *slot;
+   void *args;
gfn_t start;
gfn_t end;
pte_t pte;
@@ -267,6 +268,27 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn);
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
+ * can return true if kvm_arch_test_clear_young() is supported but disabled due
+ * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return true; otherwise, it should return false.
+ *
+ * For each young KVM PTE, kvm_arch_test_clear_young() should call
+ * kvm_should_clear_young() to decide whether to clear the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return false;
+}
 #endif
 
 enum {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..dfdbb370682d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -60,6 +60,8 @@ enum mmu_notifier_event {
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
+#define MMU_NOTIFIER_RANGE_LOCKLESS(1 << 1)
+#define MMU_NOTIFIER_RANGE_YOUNG   (1 << 2)
 
 struct mmu_notifier_ops {
/*
@@ -122,6 +124,10 @@ struct mmu_notifier_ops {
  struct mm_struct *mm,
  unsigned long address);
 
+   int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+   unsigned long start, unsigned long end,
+   bool clear, unsigned long *bitmap);
+
/*
 * change_pte is called in cases that pte mapping to page is changed:
 * for example, when ksm remaps pte to point to a new shared page.
@@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
  unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+  unsigned long start, unsigned long end,
+  bool clear, unsigned long *bitmap);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
  unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
return 0;
 }
 
+/*
+ * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within
+ * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS
+ * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise.
+ *
+ * The last parameter to the function is a bitmap and only the fast path
+ * supports it: if it is NULL, the function falls back to the slow path if the
+ * fast path was unsuccessful; otherwise, the function bails out.
+ *
+ * The bitmap has the following specifications:
+ * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
+ * 2. The offset of each bit 

[PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros

2023-05-26 Thread Yu Zhao
stage2_try_set_pte() and KVM_PTE_LEAF_ATTR_LO_S2_AF are needed to
implement kvm_arch_test_clear_young().

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_pgtable.h | 53 
 arch/arm64/kvm/hyp/pgtable.c | 53 
 2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index dc3c072e862f..ff520598b62c 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -44,6 +44,49 @@ typedef u64 kvm_pte_t;
 
 #define KVM_PHYS_INVALID   (-1ULL)
 
+#define KVM_PTE_TYPE   BIT(1)
+#define KVM_PTE_TYPE_BLOCK 0
+#define KVM_PTE_TYPE_PAGE  1
+#define KVM_PTE_TYPE_TABLE 1
+
+#define KVM_PTE_LEAF_ATTR_LO   GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO  3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW  1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS  3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS  3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI   GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID   1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED BIT(10)
+
 static inline bool kvm_pte_valid(kvm_pte_t pte)
 {
return pte & KVM_PTE_VALID;
@@ -224,6 +267,16 @@ static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *c
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
 }
 
+static inline bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
+{
+   if (!kvm_pgtable_walk_shared(ctx)) {
+   WRITE_ONCE(*ctx->ptep, new);
+   return true;
+   }
+
+   return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
+}
+
 /**
  * struct kvm_pgtable_walker - Hook into a page-table walk.
  * @cb:Callback function to invoke during the walk.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5282cb9ca4cf..24678ccba76a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@
 #include 
 
 
-#define KVM_PTE_TYPE   BIT(1)
-#define KVM_PTE_TYPE_BLOCK 0
-#define KVM_PTE_TYPE_PAGE  1
-#define KVM_PTE_TYPE_TABLE 1
-
-#define KVM_PTE_LEAF_ATTR_LO   GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO  3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW  1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS  3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS  3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI   GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
-KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
-KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID   1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED BIT(10)
-
 struct kvm_pgtable_walk_data {
struct kvm_pgtable_walker   *walker;
 
@@ -702,16 +659,6 @@ static bool stage2_pte_is_locked(kvm_pte_t pte)
return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
 }
 
-static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
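As a rough illustration of why the commit message exports stage2_try_set_pte() and KVM_PTE_LEAF_ATTR_LO_S2_AF together, a hypothetical stage-2 visitor that tests and clears the access flag could be sketched as below. It is not the series' actual kvm_arch_test_clear_young() implementation, and the function name is made up.

/*
 * Sketch only: clear the stage-2 access flag (AF) on a valid, young PTE.
 * stage2_try_set_pte() degrades to cmpxchg() for shared (lockless) walks,
 * so losing a race simply leaves the still-young PTE for a later pass.
 */
static int stage2_clear_af_sketch(const struct kvm_pgtable_visit_ctx *ctx,
				  enum kvm_pgtable_walk_flags visit)
{
	kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;

	if (!kvm_pte_valid(ctx->old) || new == ctx->old)
		return 0;	/* not mapped, or AF already clear (not young) */

	stage2_try_set_pte(ctx, new);
	return 0;
}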

Patch "spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 5.4-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

spi: fsl-spi: Re-organise transfer bits_per_word adaptation

to the 5.4-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
and it can be found in the queue-5.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:03 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:16 +0200
Subject: spi: fsl-spi: Re-organise transfer bits_per_word adaptation
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<1e4bfb4850ba849c316f48a0ab0d7123da0e2f54.1684156552.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e)

The fsl-spi driver performs bits_per_word
modifications for different reasons:
- In CPU mode, to minimise the number of interrupts
- In CPM/QE mode, to work around the controller byte order

For CPU mode that's done in fsl_spi_prepare_message() while
for CPM mode that's done in fsl_spi_setup_transfer().

Reunify all of it in fsl_spi_prepare_message(), and catch
impossible cases early through master's bits_per_word_mask
instead of returning EINVAL later.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   50 +-
 1 file changed, 23 insertions(+), 27 deletions(-)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -204,26 +204,6 @@ static int mspi_apply_cpu_mode_quirks(st
return bits_per_word;
 }
 
-static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs,
-   struct spi_device *spi,
-   int bits_per_word)
-{
-   /* CPM/QE uses Little Endian for words > 8
-* so transform 16 and 32 bits words into 8 bits
-* Unfortnatly that doesn't work for LSB so
-* reject these for now */
-   /* Note: 32 bits word, LSB works iff
-* tfcr/rfcr is set to CPMFCR_GBL */
-   if (spi->mode & SPI_LSB_FIRST &&
-   bits_per_word > 8)
-   return -EINVAL;
-   if (bits_per_word <= 8)
-   return bits_per_word;
-   if (bits_per_word == 16 || bits_per_word == 32)
-   return 8; /* pretend its 8 bits */
-   return -EINVAL;
-}
-
 static int fsl_spi_setup_transfer(struct spi_device *spi,
struct spi_transfer *t)
 {
@@ -251,9 +231,6 @@ static int fsl_spi_setup_transfer(struct
bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi,
   mpc8xxx_spi,
   bits_per_word);
-   else
-   bits_per_word = mspi_apply_qe_mode_quirks(cs, spi,
- bits_per_word);
 
if (bits_per_word < 0)
return bits_per_word;
@@ -371,14 +348,27 @@ static int fsl_spi_do_one_msg(struct spi
 * In CPU mode, optimize large byte transfers to use larger
 * bits_per_word values to reduce number of interrupts taken.
 */
-   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
-   list_for_each_entry(t, >transfers, transfer_list) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
if (t->len < 256 || t->bits_per_word != 8)
continue;
if ((t->len & 3) == 0)
t->bits_per_word = 32;
else if ((t->len & 1) == 0)
t->bits_per_word = 16;
+   } else {
+   /*
+* CPM/QE uses Little Endian for words > 8
+* so transform 16 and 32 bits words into 8 bits
+* Unfortnatly that doesn't work for LSB so
+* reject these for now
+* Note: 32 bits word, LSB works iff
+* tfcr/rfcr is set to CPMFCR_GBL
+*/
+   if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8)
+   return -EINVAL;
+   if (t->bits_per_word == 16 || t->bits_per_word == 32)
+   t->bits_per_word = 8; /* pretend its 8 bits */
}
}
 
@@ -637,8 +627,14 

Patch "spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 5.4-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

spi: fsl-cpm: Use 16 bit mode for large transfers with even size

to the 5.4-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
and it can be found in the queue-5.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:07:59 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:17 +0200
Subject: spi: fsl-cpm: Use 16 bit mode for large transfers with even size
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<3a1b8774ad7004acb594fbf220f98488dbaa2896.1684156552.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab)

On CPM, the RISC core is a lot more efficient when doing transfers
in 16-bits chunks than in 8-bits chunks, but unfortunately the
words need to be byte swapped as seen in a previous commit.

So, for large transfers with an even size, allocate a temporary tx
buffer and byte-swap data before and after transfer.

This change allows setting a higher speed for transfers. For instance,
on an MPC 8xx (CPM1 comms RISC processor), the documentation states
that a transfer in byte mode at 1 kbit/s uses 0.200% of CPM load
at 25 MHz while a word transfer at the same speed uses 0.032%
of CPM load. This means the speed can be 6 times higher in
word mode for the same CPM load.

For the time being, only do it on CPM1 as there must be a
trade-off between the CPM load reduction and the CPU load required
to byte swap the data.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-cpm.c |   23 +++
 drivers/spi/spi-fsl-spi.c |3 +++
 2 files changed, 26 insertions(+)

--- a/drivers/spi/spi-fsl-cpm.c
+++ b/drivers/spi/spi-fsl-cpm.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spi-fsl-cpm.h"
 #include "spi-fsl-lib.h"
@@ -120,6 +121,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi
mspi->rx_dma = mspi->dma_dummy_rx;
mspi->map_rx_dma = 0;
}
+   if (t->bits_per_word == 16 && t->tx_buf) {
+   const u16 *src = t->tx_buf;
+   u16 *dst;
+   int i;
+
+   dst = kmalloc(t->len, GFP_KERNEL);
+   if (!dst)
+   return -ENOMEM;
+
+   for (i = 0; i < t->len >> 1; i++)
+   dst[i] = cpu_to_le16p(src + i);
+
+   mspi->tx = dst;
+   mspi->map_tx_dma = 1;
+   }
 
if (mspi->map_tx_dma) {
void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */
@@ -173,6 +189,13 @@ void fsl_spi_cpm_bufs_complete(struct mp
if (mspi->map_rx_dma)
dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE);
mspi->xfer_in_progress = NULL;
+
+   if (t->bits_per_word == 16 && t->rx_buf) {
+   int i;
+
+   for (i = 0; i < t->len; i += 2)
+   le16_to_cpus(t->rx_buf + i);
+   }
 }
 EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete);
 
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -369,6 +369,9 @@ static int fsl_spi_do_one_msg(struct spi
return -EINVAL;
if (t->bits_per_word == 16 || t->bits_per_word == 32)
t->bits_per_word = 8; /* pretend its 8 bits */
+   if (t->bits_per_word == 8 && t->len >= 256 &&
+   (mpc8xxx_spi->flags & SPI_CPM1))
+   t->bits_per_word = 16;
}
}
 


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-5.4/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-5.4/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
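Pulled out of the driver context, the 16-bit byte-swap round trip performed by the fsl-cpm patch above can be sketched with two small helpers. The helper names are invented and buffer handling is simplified; only cpu_to_le16p() and le16_to_cpus() come from the patch itself.

/* Sketch: CPU-order TX words -> little-endian copy for the CPM. */
static u16 *cpm_tx_swab_sketch(const u16 *src, size_t len)
{
	u16 *dst = kmalloc(len, GFP_KERNEL);
	size_t i;

	if (!dst)
		return NULL;
	for (i = 0; i < len / 2; i++)
		dst[i] = cpu_to_le16p(src + i);
	return dst;
}

/* Sketch: little-endian RX words from the CPM -> CPU order, in place. */
static void cpm_rx_swab_sketch(void *rx_buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i += 2)
		le16_to_cpus(rx_buf + i);
}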


Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 5.15-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer 
bits_per_word adaptation

to the 5.15-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:03 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:16 +0200
Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word 
adaptation
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<1e4bfb4850ba849c316f48a0ab0d7123da0e2f54.1684156552.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e)

The fsl-spi driver performs bits_per_word
modifications for different reasons:
- In CPU mode, to minimise the number of interrupts
- In CPM/QE mode, to work around the controller byte order

For CPU mode that's done in fsl_spi_prepare_message() while
for CPM mode that's done in fsl_spi_setup_transfer().

Reunify all of it in fsl_spi_prepare_message(), and catch
impossible cases early through master's bits_per_word_mask
instead of returning EINVAL later.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   50 +-
 1 file changed, 23 insertions(+), 27 deletions(-)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -203,26 +203,6 @@ static int mspi_apply_cpu_mode_quirks(st
return bits_per_word;
 }
 
-static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs,
-   struct spi_device *spi,
-   int bits_per_word)
-{
-   /* CPM/QE uses Little Endian for words > 8
-* so transform 16 and 32 bits words into 8 bits
-* Unfortnatly that doesn't work for LSB so
-* reject these for now */
-   /* Note: 32 bits word, LSB works iff
-* tfcr/rfcr is set to CPMFCR_GBL */
-   if (spi->mode & SPI_LSB_FIRST &&
-   bits_per_word > 8)
-   return -EINVAL;
-   if (bits_per_word <= 8)
-   return bits_per_word;
-   if (bits_per_word == 16 || bits_per_word == 32)
-   return 8; /* pretend its 8 bits */
-   return -EINVAL;
-}
-
 static int fsl_spi_setup_transfer(struct spi_device *spi,
struct spi_transfer *t)
 {
@@ -250,9 +230,6 @@ static int fsl_spi_setup_transfer(struct
bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi,
   mpc8xxx_spi,
   bits_per_word);
-   else
-   bits_per_word = mspi_apply_qe_mode_quirks(cs, spi,
- bits_per_word);
 
if (bits_per_word < 0)
return bits_per_word;
@@ -370,14 +347,27 @@ static int fsl_spi_do_one_msg(struct spi
 * In CPU mode, optimize large byte transfers to use larger
 * bits_per_word values to reduce number of interrupts taken.
 */
-   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
-   list_for_each_entry(t, >transfers, transfer_list) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
if (t->len < 256 || t->bits_per_word != 8)
continue;
if ((t->len & 3) == 0)
t->bits_per_word = 32;
else if ((t->len & 1) == 0)
t->bits_per_word = 16;
+   } else {
+   /*
+* CPM/QE uses Little Endian for words > 8
+* so transform 16 and 32 bits words into 8 bits
+* Unfortnatly that doesn't work for LSB so
+* reject these for now
+* Note: 32 bits word, LSB works iff
+* tfcr/rfcr is set to CPMFCR_GBL
+*/
+   if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8)
+   return -EINVAL;
+   if (t->bits_per_word == 16 || t->bits_per_word == 32)
+   t->bits_per_word = 8; /* pretend its 8 bits 

Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 5.15-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large 
transfers with even size

to the 5.15-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:07:59 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:17 +0200
Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers 
with even size
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<3a1b8774ad7004acb594fbf220f98488dbaa2896.1684156552.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab)

On CPM, the RISC core is a lot more efficient when doing transfers
in 16-bits chunks than in 8-bits chunks, but unfortunately the
words need to be byte swapped as seen in a previous commit.

So, for large transfers with an even size, allocate a temporary tx
buffer and byte-swap data before and after transfer.

This change allows setting a higher speed for transfers. For instance,
on an MPC 8xx (CPM1 comms RISC processor), the documentation states
that a transfer in byte mode at 1 kbit/s uses 0.200% of CPM load
at 25 MHz while a word transfer at the same speed uses 0.032%
of CPM load. This means the speed can be 6 times higher in
word mode for the same CPM load.

For the time being, only do it on CPM1 as there must be a
trade-off between the CPM load reduction and the CPU load required
to byte swap the data.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-cpm.c |   23 +++
 drivers/spi/spi-fsl-spi.c |3 +++
 2 files changed, 26 insertions(+)

--- a/drivers/spi/spi-fsl-cpm.c
+++ b/drivers/spi/spi-fsl-cpm.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spi-fsl-cpm.h"
 #include "spi-fsl-lib.h"
@@ -120,6 +121,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi
mspi->rx_dma = mspi->dma_dummy_rx;
mspi->map_rx_dma = 0;
}
+   if (t->bits_per_word == 16 && t->tx_buf) {
+   const u16 *src = t->tx_buf;
+   u16 *dst;
+   int i;
+
+   dst = kmalloc(t->len, GFP_KERNEL);
+   if (!dst)
+   return -ENOMEM;
+
+   for (i = 0; i < t->len >> 1; i++)
+   dst[i] = cpu_to_le16p(src + i);
+
+   mspi->tx = dst;
+   mspi->map_tx_dma = 1;
+   }
 
if (mspi->map_tx_dma) {
void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */
@@ -173,6 +189,13 @@ void fsl_spi_cpm_bufs_complete(struct mp
if (mspi->map_rx_dma)
dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE);
mspi->xfer_in_progress = NULL;
+
+   if (t->bits_per_word == 16 && t->rx_buf) {
+   int i;
+
+   for (i = 0; i < t->len; i += 2)
+   le16_to_cpus(t->rx_buf + i);
+   }
 }
 EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete);
 
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -368,6 +368,9 @@ static int fsl_spi_do_one_msg(struct spi
return -EINVAL;
if (t->bits_per_word == 16 || t->bits_per_word == 32)
t->bits_per_word = 8; /* pretend its 8 bits */
+   if (t->bits_per_word == 8 && t->len >= 256 &&
+   (mpc8xxx_spi->flags & SPI_CPM1))
+   t->bits_per_word = 16;
}
}
 


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-5.15/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-5.15/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch


Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 5.10-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer 
bits_per_word adaptation

to the 5.10-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
and it can be found in the queue-5.10 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:03 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:16 +0200
Subject:[For 5.15/5.10/5.4] spi: fsl-spi: Re-organise transfer bits_per_word 
adaptation
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<1e4bfb4850ba849c316f48a0ab0d7123da0e2f54.1684156552.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e)

The fsl-spi driver performs bits_per_word
modifications for different reasons:
- In CPU mode, to minimise the number of interrupts
- In CPM/QE mode, to work around the controller byte order

For CPU mode that's done in fsl_spi_prepare_message() while
for CPM mode that's done in fsl_spi_setup_transfer().

Reunify all of it in fsl_spi_prepare_message(), and catch
impossible cases early through master's bits_per_word_mask
instead of returning EINVAL later.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   50 +-
 1 file changed, 23 insertions(+), 27 deletions(-)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -203,26 +203,6 @@ static int mspi_apply_cpu_mode_quirks(st
return bits_per_word;
 }
 
-static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs,
-   struct spi_device *spi,
-   int bits_per_word)
-{
-   /* CPM/QE uses Little Endian for words > 8
-* so transform 16 and 32 bits words into 8 bits
-* Unfortnatly that doesn't work for LSB so
-* reject these for now */
-   /* Note: 32 bits word, LSB works iff
-* tfcr/rfcr is set to CPMFCR_GBL */
-   if (spi->mode & SPI_LSB_FIRST &&
-   bits_per_word > 8)
-   return -EINVAL;
-   if (bits_per_word <= 8)
-   return bits_per_word;
-   if (bits_per_word == 16 || bits_per_word == 32)
-   return 8; /* pretend its 8 bits */
-   return -EINVAL;
-}
-
 static int fsl_spi_setup_transfer(struct spi_device *spi,
struct spi_transfer *t)
 {
@@ -250,9 +230,6 @@ static int fsl_spi_setup_transfer(struct
bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi,
   mpc8xxx_spi,
   bits_per_word);
-   else
-   bits_per_word = mspi_apply_qe_mode_quirks(cs, spi,
- bits_per_word);
 
if (bits_per_word < 0)
return bits_per_word;
@@ -370,14 +347,27 @@ static int fsl_spi_do_one_msg(struct spi
 * In CPU mode, optimize large byte transfers to use larger
 * bits_per_word values to reduce number of interrupts taken.
 */
-   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
-   list_for_each_entry(t, >transfers, transfer_list) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
if (t->len < 256 || t->bits_per_word != 8)
continue;
if ((t->len & 3) == 0)
t->bits_per_word = 32;
else if ((t->len & 1) == 0)
t->bits_per_word = 16;
+   } else {
+   /*
+* CPM/QE uses Little Endian for words > 8
+* so transform 16 and 32 bits words into 8 bits
+* Unfortnatly that doesn't work for LSB so
+* reject these for now
+* Note: 32 bits word, LSB works iff
+* tfcr/rfcr is set to CPMFCR_GBL
+*/
+   if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8)
+   return -EINVAL;
+   if (t->bits_per_word == 16 || t->bits_per_word == 32)
+   t->bits_per_word = 8; /* pretend its 8 bits 

Patch "Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 5.10-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large 
transfers with even size

to the 5.10-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
and it can be found in the queue-5.10 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:07:59 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:17 +0200
Subject:[For 5.15/5.10/5.4] spi: fsl-cpm: Use 16 bit mode for large transfers 
with even size
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<3a1b8774ad7004acb594fbf220f98488dbaa2896.1684156552.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab)

On CPM, the RISC core is a lot more efficient when doing transfers
in 16-bits chunks than in 8-bits chunks, but unfortunately the
words need to be byte swapped as seen in a previous commit.

So, for large transfers with an even size, allocate a temporary tx
buffer and byte-swap data before and after transfer.

This change allows setting a higher speed for transfers. For instance,
on an MPC 8xx (CPM1 comms RISC processor), the documentation states
that a transfer in byte mode at 1 kbit/s uses 0.200% of CPM load
at 25 MHz while a word transfer at the same speed uses 0.032%
of CPM load. This means the speed can be 6 times higher in
word mode for the same CPM load.

For the time being, only do it on CPM1 as there must be a
trade-off between the CPM load reduction and the CPU load required
to byte swap the data.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-cpm.c |   23 +++
 drivers/spi/spi-fsl-spi.c |3 +++
 2 files changed, 26 insertions(+)

--- a/drivers/spi/spi-fsl-cpm.c
+++ b/drivers/spi/spi-fsl-cpm.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spi-fsl-cpm.h"
 #include "spi-fsl-lib.h"
@@ -120,6 +121,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi
mspi->rx_dma = mspi->dma_dummy_rx;
mspi->map_rx_dma = 0;
}
+   if (t->bits_per_word == 16 && t->tx_buf) {
+   const u16 *src = t->tx_buf;
+   u16 *dst;
+   int i;
+
+   dst = kmalloc(t->len, GFP_KERNEL);
+   if (!dst)
+   return -ENOMEM;
+
+   for (i = 0; i < t->len >> 1; i++)
+   dst[i] = cpu_to_le16p(src + i);
+
+   mspi->tx = dst;
+   mspi->map_tx_dma = 1;
+   }
 
if (mspi->map_tx_dma) {
void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */
@@ -173,6 +189,13 @@ void fsl_spi_cpm_bufs_complete(struct mp
if (mspi->map_rx_dma)
dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE);
mspi->xfer_in_progress = NULL;
+
+   if (t->bits_per_word == 16 && t->rx_buf) {
+   int i;
+
+   for (i = 0; i < t->len; i += 2)
+   le16_to_cpus(t->rx_buf + i);
+   }
 }
 EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete);
 
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -368,6 +368,9 @@ static int fsl_spi_do_one_msg(struct spi
return -EINVAL;
if (t->bits_per_word == 16 || t->bits_per_word == 32)
t->bits_per_word = 8; /* pretend its 8 bits */
+   if (t->bits_per_word == 8 && t->len >= 256 &&
+   (mpc8xxx_spi->flags & SPI_CPM1))
+   t->bits_per_word = 16;
}
}
 


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-5.10/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-5.10/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch


Patch "Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 4.19-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word 
adaptation

to the 4.19-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:01 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:14 +0200
Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word 
adaptation
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 


From: Christophe Leroy 

(backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e)

The fsl-spi driver performs bits_per_word
modifications for different reasons:
- In CPU mode, to minimise the number of interrupts
- In CPM/QE mode, to work around the controller byte order

For CPU mode that's done in fsl_spi_prepare_message() while
for CPM mode that's done in fsl_spi_setup_transfer().

Reunify all of it in fsl_spi_prepare_message(), and catch
impossible cases early through master's bits_per_word_mask
instead of returning EINVAL later.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   50 +-
 1 file changed, 23 insertions(+), 27 deletions(-)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -201,26 +201,6 @@ static int mspi_apply_cpu_mode_quirks(st
return bits_per_word;
 }
 
-static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs,
-   struct spi_device *spi,
-   int bits_per_word)
-{
-   /* CPM/QE uses Little Endian for words > 8
-* so transform 16 and 32 bits words into 8 bits
-* Unfortnatly that doesn't work for LSB so
-* reject these for now */
-   /* Note: 32 bits word, LSB works iff
-* tfcr/rfcr is set to CPMFCR_GBL */
-   if (spi->mode & SPI_LSB_FIRST &&
-   bits_per_word > 8)
-   return -EINVAL;
-   if (bits_per_word <= 8)
-   return bits_per_word;
-   if (bits_per_word == 16 || bits_per_word == 32)
-   return 8; /* pretend its 8 bits */
-   return -EINVAL;
-}
-
 static int fsl_spi_setup_transfer(struct spi_device *spi,
struct spi_transfer *t)
 {
@@ -248,9 +228,6 @@ static int fsl_spi_setup_transfer(struct
bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi,
   mpc8xxx_spi,
   bits_per_word);
-   else
-   bits_per_word = mspi_apply_qe_mode_quirks(cs, spi,
- bits_per_word);
 
if (bits_per_word < 0)
return bits_per_word;
@@ -368,14 +345,27 @@ static int fsl_spi_do_one_msg(struct spi
 * In CPU mode, optimize large byte transfers to use larger
 * bits_per_word values to reduce number of interrupts taken.
 */
-   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
-   list_for_each_entry(t, >transfers, transfer_list) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
if (t->len < 256 || t->bits_per_word != 8)
continue;
if ((t->len & 3) == 0)
t->bits_per_word = 32;
else if ((t->len & 1) == 0)
t->bits_per_word = 16;
+   } else {
+   /*
+* CPM/QE uses Little Endian for words > 8
+* so transform 16 and 32 bits words into 8 bits
+* Unfortnatly that doesn't work for LSB so
+* reject these for now
+* Note: 32 bits word, LSB works iff
+* tfcr/rfcr is set to CPMFCR_GBL
+*/
+   if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8)
+   return -EINVAL;
+   if (t->bits_per_word == 16 || t->bits_per_word == 32)
+   t->bits_per_word = 8; /* pretend its 8 bits */
}
}
 
@@ -658,8 +648,14 @@ static struct spi_master * fsl_spi_probe

Patch "Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode" has been added to the 4.19-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word 
in cpu mode

to the 4.19-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:06 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:13 +0200
Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in 
cpu mode
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Rasmus Villemoes 
, Mark Brown 
Message-ID: 
<674d9af640acf4aa04abd642cc81de926d3271ed.1684158520.git.christophe.le...@csgroup.eu>

From: Rasmus Villemoes 

(cherry picked from upstream af0e6242909c3c4297392ca3e94eff1b4db71a97)

Taking one interrupt for every byte is rather slow. Since the
controller is perfectly capable of transmitting 32 bits at a time,
change t->bits_per_word to 32 when the length is divisible by 4 and
large enough that the reduced number of interrupts easily compensates
for the one or two extra fsl_spi_setup_transfer() calls this causes.

Signed-off-by: Rasmus Villemoes 
Signed-off-by: Mark Brown 
Signed-off-by: Christophe Leroy 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   16 
 1 file changed, 16 insertions(+)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -357,12 +357,28 @@ static int fsl_spi_bufs(struct spi_devic
 static int fsl_spi_do_one_msg(struct spi_master *master,
  struct spi_message *m)
 {
+   struct mpc8xxx_spi *mpc8xxx_spi = spi_master_get_devdata(master);
struct spi_device *spi = m->spi;
struct spi_transfer *t, *first;
unsigned int cs_change;
const int nsecs = 50;
int status;
 
+   /*
+* In CPU mode, optimize large byte transfers to use larger
+* bits_per_word values to reduce number of interrupts taken.
+*/
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (t->len < 256 || t->bits_per_word != 8)
+   continue;
+   if ((t->len & 3) == 0)
+   t->bits_per_word = 32;
+   else if ((t->len & 1) == 0)
+   t->bits_per_word = 16;
+   }
+   }
+
/* Don't allow changes if CS is active */
first = list_first_entry(>transfers, struct spi_transfer,
transfer_list);


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-4.19/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-4.19/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
queue-4.19/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch


Patch "Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 4.19-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers 
with even size

to the 4.19-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:07:58 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:15 +0200
Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with 
even size
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<9363da33f54e9862b4b59c0ed97924ca7265f7a4.1684158520.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab)

On CPM, the RISC core is a lot more efficient when doing transfers
in 16-bits chunks than in 8-bits chunks, but unfortunately the
words need to be byte swapped as seen in a previous commit.

So, for large transfers with an even size, allocate a temporary tx
buffer and byte-swap data before and after transfer.

This change allows setting a higher speed for transfers. For instance,
on an MPC 8xx (CPM1 comms RISC processor), the documentation states
that a transfer in byte mode at 1 kbit/s uses 0.200% of CPM load
at 25 MHz while a word transfer at the same speed uses 0.032%
of CPM load. This means the speed can be 6 times higher in
word mode for the same CPM load.

For the time being, only do it on CPM1 as there must be a
trade-off between the CPM load reduction and the CPU load required
to byte swap the data.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-cpm.c |   23 +++
 drivers/spi/spi-fsl-spi.c |3 +++
 2 files changed, 26 insertions(+)

--- a/drivers/spi/spi-fsl-cpm.c
+++ b/drivers/spi/spi-fsl-cpm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spi-fsl-cpm.h"
 #include "spi-fsl-lib.h"
@@ -124,6 +125,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi
mspi->rx_dma = mspi->dma_dummy_rx;
mspi->map_rx_dma = 0;
}
+   if (t->bits_per_word == 16 && t->tx_buf) {
+   const u16 *src = t->tx_buf;
+   u16 *dst;
+   int i;
+
+   dst = kmalloc(t->len, GFP_KERNEL);
+   if (!dst)
+   return -ENOMEM;
+
+   for (i = 0; i < t->len >> 1; i++)
+   dst[i] = cpu_to_le16p(src + i);
+
+   mspi->tx = dst;
+   mspi->map_tx_dma = 1;
+   }
 
if (mspi->map_tx_dma) {
void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */
@@ -177,6 +193,13 @@ void fsl_spi_cpm_bufs_complete(struct mp
if (mspi->map_rx_dma)
dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE);
mspi->xfer_in_progress = NULL;
+
+   if (t->bits_per_word == 16 && t->rx_buf) {
+   int i;
+
+   for (i = 0; i < t->len; i += 2)
+   le16_to_cpus(t->rx_buf + i);
+   }
 }
 EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete);
 
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -366,6 +366,9 @@ static int fsl_spi_do_one_msg(struct spi
return -EINVAL;
if (t->bits_per_word == 16 || t->bits_per_word == 32)
t->bits_per_word = 8; /* pretend its 8 bits */
+   if (t->bits_per_word == 8 && t->len >= 256 &&
+   (mpc8xxx_spi->flags & SPI_CPM1))
+   t->bits_per_word = 16;
}
}
 


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-4.19/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-4.19/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
queue-4.19/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch


Patch "Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode" has been added to the 4.14-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word 
in cpu mode

to the 4.14-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch
and it can be found in the queue-4.14 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:06 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:13 +0200
Subject:[For 4.19/4.14] spi: spi-fsl-spi: automatically adapt bits-per-word in 
cpu mode
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Rasmus Villemoes 
, Mark Brown 
Message-ID: 
<674d9af640acf4aa04abd642cc81de926d3271ed.1684158520.git.christophe.le...@csgroup.eu>

From: Rasmus Villemoes 

(cherry picked from upstream af0e6242909c3c4297392ca3e94eff1b4db71a97)

Taking one interrupt for every byte is rather slow. Since the
controller is perfectly capable of transmitting 32 bits at a time,
change t->bits_per_word to 32 when the length is divisible by 4 and
large enough that the reduced number of interrupts easily compensates
for the one or two extra fsl_spi_setup_transfer() calls this causes.

Signed-off-by: Rasmus Villemoes 
Signed-off-by: Mark Brown 
Signed-off-by: Christophe Leroy 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   16 
 1 file changed, 16 insertions(+)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -357,12 +357,28 @@ static int fsl_spi_bufs(struct spi_devic
 static int fsl_spi_do_one_msg(struct spi_master *master,
  struct spi_message *m)
 {
+   struct mpc8xxx_spi *mpc8xxx_spi = spi_master_get_devdata(master);
struct spi_device *spi = m->spi;
struct spi_transfer *t, *first;
unsigned int cs_change;
const int nsecs = 50;
int status;
 
+   /*
+* In CPU mode, optimize large byte transfers to use larger
+* bits_per_word values to reduce number of interrupts taken.
+*/
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (t->len < 256 || t->bits_per_word != 8)
+   continue;
+   if ((t->len & 3) == 0)
+   t->bits_per_word = 32;
+   else if ((t->len & 1) == 0)
+   t->bits_per_word = 16;
+   }
+   }
+
/* Don't allow changes if CS is active */
first = list_first_entry(>transfers, struct spi_transfer,
transfer_list);


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-4.14/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-4.14/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
queue-4.14/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch


Patch "Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word adaptation" has been added to the 4.14-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word 
adaptation

to the 4.14-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
and it can be found in the queue-4.14 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:08:01 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:14 +0200
Subject:[For 4.19/4.14] spi: fsl-spi: Re-organise transfer bits_per_word 
adaptation
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 


From: Christophe Leroy 

(backported from upstream 8a5299a1278eadf1e08a598a5345c376206f171e)

The fsl-spi driver performs bits_per_word
modifications for different reasons:
- In CPU mode, to minimise the number of interrupts
- In CPM/QE mode, to work around the controller byte order

For CPU mode that's done in fsl_spi_prepare_message() while
for CPM mode that's done in fsl_spi_setup_transfer().

Reunify all of it in fsl_spi_prepare_message(), and catch
impossible cases early through master's bits_per_word_mask
instead of returning EINVAL later.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/0ce96fe96e8b07cba0613e4097cfd94d09b8919a.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-spi.c |   50 +-
 1 file changed, 23 insertions(+), 27 deletions(-)

--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -201,26 +201,6 @@ static int mspi_apply_cpu_mode_quirks(st
return bits_per_word;
 }
 
-static int mspi_apply_qe_mode_quirks(struct spi_mpc8xxx_cs *cs,
-   struct spi_device *spi,
-   int bits_per_word)
-{
-   /* CPM/QE uses Little Endian for words > 8
-* so transform 16 and 32 bits words into 8 bits
-* Unfortnatly that doesn't work for LSB so
-* reject these for now */
-   /* Note: 32 bits word, LSB works iff
-* tfcr/rfcr is set to CPMFCR_GBL */
-   if (spi->mode & SPI_LSB_FIRST &&
-   bits_per_word > 8)
-   return -EINVAL;
-   if (bits_per_word <= 8)
-   return bits_per_word;
-   if (bits_per_word == 16 || bits_per_word == 32)
-   return 8; /* pretend its 8 bits */
-   return -EINVAL;
-}
-
 static int fsl_spi_setup_transfer(struct spi_device *spi,
struct spi_transfer *t)
 {
@@ -248,9 +228,6 @@ static int fsl_spi_setup_transfer(struct
bits_per_word = mspi_apply_cpu_mode_quirks(cs, spi,
   mpc8xxx_spi,
   bits_per_word);
-   else
-   bits_per_word = mspi_apply_qe_mode_quirks(cs, spi,
- bits_per_word);
 
if (bits_per_word < 0)
return bits_per_word;
@@ -368,14 +345,27 @@ static int fsl_spi_do_one_msg(struct spi
 * In CPU mode, optimize large byte transfers to use larger
 * bits_per_word values to reduce number of interrupts taken.
 */
-   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
-   list_for_each_entry(t, >transfers, transfer_list) {
+   list_for_each_entry(t, >transfers, transfer_list) {
+   if (!(mpc8xxx_spi->flags & SPI_CPM_MODE)) {
if (t->len < 256 || t->bits_per_word != 8)
continue;
if ((t->len & 3) == 0)
t->bits_per_word = 32;
else if ((t->len & 1) == 0)
t->bits_per_word = 16;
+   } else {
+   /*
+* CPM/QE uses Little Endian for words > 8
+* so transform 16 and 32 bits words into 8 bits
+* Unfortnatly that doesn't work for LSB so
+* reject these for now
+* Note: 32 bits word, LSB works iff
+* tfcr/rfcr is set to CPMFCR_GBL
+*/
+   if (m->spi->mode & SPI_LSB_FIRST && t->bits_per_word > 8)
+   return -EINVAL;
+   if (t->bits_per_word == 16 || t->bits_per_word == 32)
+   t->bits_per_word = 8; /* pretend its 8 bits */
}
}
 
@@ -658,8 +648,14 @@ static struct spi_master * fsl_spi_probe

Patch "Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with even size" has been added to the 4.14-stable tree

2023-05-26 Thread gregkh


This is a note to let you know that I've just added the patch titled

Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers 
with even size

to the 4.14-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
and it can be found in the queue-4.14 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From christophe.le...@csgroup.eu Mon May 15 15:07:58 2023
From: Christophe Leroy 
Date: Mon, 15 May 2023 16:07:15 +0200
Subject:[For 4.19/4.14] spi: fsl-cpm: Use 16 bit mode for large transfers with 
even size
To: gre...@linuxfoundation.org, sta...@vger.kernel.org
Cc: Christophe Leroy , 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Mark Brown 

Message-ID: 
<9363da33f54e9862b4b59c0ed97924ca7265f7a4.1684158520.git.christophe.le...@csgroup.eu>

From: Christophe Leroy 

(cherry picked from upstream fc96ec826bced75cc6b9c07a4ac44bbf651337ab)

On CPM, the RISC core is a lot more efficient when doing transfers
in 16-bits chunks than in 8-bits chunks, but unfortunately the
words need to be byte swapped as seen in a previous commit.

So, for large transfers with an even size, allocate a temporary tx
buffer and byte-swap data before and after transfer.

This change allows setting a higher speed for transfers. For instance,
on an MPC 8xx (CPM1 comms RISC processor), the documentation states
that a transfer in byte mode at 1 kbit/s uses 0.200% of CPM load
at 25 MHz while a word transfer at the same speed uses 0.032%
of CPM load. This means the speed can be 6 times higher in
word mode for the same CPM load.

For the time being, only do it on CPM1 as there must be a
trade-off between the CPM load reduction and the CPU load required
to byte swap the data.

Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/f2e981f20f92dd28983c3949702a09248c23845c.1680371809.git.christophe.le...@csgroup.eu
Signed-off-by: Mark Brown 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/spi/spi-fsl-cpm.c |   23 +++
 drivers/spi/spi-fsl-spi.c |3 +++
 2 files changed, 26 insertions(+)

--- a/drivers/spi/spi-fsl-cpm.c
+++ b/drivers/spi/spi-fsl-cpm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spi-fsl-cpm.h"
 #include "spi-fsl-lib.h"
@@ -124,6 +125,21 @@ int fsl_spi_cpm_bufs(struct mpc8xxx_spi
mspi->rx_dma = mspi->dma_dummy_rx;
mspi->map_rx_dma = 0;
}
+   if (t->bits_per_word == 16 && t->tx_buf) {
+   const u16 *src = t->tx_buf;
+   u16 *dst;
+   int i;
+
+   dst = kmalloc(t->len, GFP_KERNEL);
+   if (!dst)
+   return -ENOMEM;
+
+   for (i = 0; i < t->len >> 1; i++)
+   dst[i] = cpu_to_le16p(src + i);
+
+   mspi->tx = dst;
+   mspi->map_tx_dma = 1;
+   }
 
if (mspi->map_tx_dma) {
void *nonconst_tx = (void *)mspi->tx; /* shut up gcc */
@@ -177,6 +193,13 @@ void fsl_spi_cpm_bufs_complete(struct mp
if (mspi->map_rx_dma)
dma_unmap_single(dev, mspi->rx_dma, t->len, DMA_FROM_DEVICE);
mspi->xfer_in_progress = NULL;
+
+   if (t->bits_per_word == 16 && t->rx_buf) {
+   int i;
+
+   for (i = 0; i < t->len; i += 2)
+   le16_to_cpus(t->rx_buf + i);
+   }
 }
 EXPORT_SYMBOL_GPL(fsl_spi_cpm_bufs_complete);
 
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -366,6 +366,9 @@ static int fsl_spi_do_one_msg(struct spi
return -EINVAL;
if (t->bits_per_word == 16 || t->bits_per_word == 32)
t->bits_per_word = 8; /* pretend its 8 bits */
+   if (t->bits_per_word == 8 && t->len >= 256 &&
+   (mpc8xxx_spi->flags & SPI_CPM1))
+   t->bits_per_word = 16;
}
}
 


Patches currently in stable-queue which might be from 
christophe.le...@csgroup.eu are

queue-4.14/spi-fsl-cpm-use-16-bit-mode-for-large-transfers-with-even-size.patch
queue-4.14/spi-fsl-spi-re-organise-transfer-bits_per_word-adaptation.patch
queue-4.14/spi-spi-fsl-spi-automatically-adapt-bits-per-word-in-cpu-mode.patch


Re: [PATCH v5 13/18] watchdog/hardlockup: Have the perf hardlockup use __weak functions more cleanly

2023-05-26 Thread Petr Mladek
On Wed 2023-05-24 12:38:49, Doug Anderson wrote:
> Hi,
> 
> On Wed, May 24, 2023 at 6:59 AM Petr Mladek  wrote:
> >
> > On Fri 2023-05-19 10:18:37, Douglas Anderson wrote:
> > > The fact that watchdog_hardlockup_enable(),
> > > watchdog_hardlockup_disable(), and watchdog_hardlockup_probe() are
> > > declared __weak means that the configured hardlockup detector can
> > > define non-weak versions of those functions if it needs to. Instead of
> > > doing this, the perf hardlockup detector hooked itself into the
> > > default __weak implementation, which was a bit awkward. Clean this up.
> > >
> > > >From comments, it looks as if the original design was done because the
> > > __weak functions were expected to be implemented by the architecture and
> > > not by the configured hardlockup detector. This got awkward when we
> > > tried to add the buddy lockup detector which was not arch-specific but
> > > wanted to hook into those same functions.
> > >
> > > This is not expected to have any functional impact.
> > >
> > > @@ -187,27 +187,33 @@ static inline void watchdog_hardlockup_kick(void) { 
> > > }
> > >  #endif /* !CONFIG_HARDLOCKUP_DETECTOR_PERF */
> > >
> > >  /*
> > > - * These functions can be overridden if an architecture implements its
> > > - * own hardlockup detector.
> > > + * These functions can be overridden based on the configured hardlockdup 
> > > detector.
> > >   *
> > >   * watchdog_hardlockup_enable/disable can be implemented to start and 
> > > stop when
> > > - * softlockup watchdog start and stop. The arch must select the
> > > + * softlockup watchdog start and stop. The detector must select the
> > >   * SOFTLOCKUP_DETECTOR Kconfig.
> > >   */
> > > -void __weak watchdog_hardlockup_enable(unsigned int cpu)
> > > -{
> > > - hardlockup_detector_perf_enable();
> > > -}
> > > +void __weak watchdog_hardlockup_enable(unsigned int cpu) { }
> > >
> > > -void __weak watchdog_hardlockup_disable(unsigned int cpu)
> > > -{
> > > - hardlockup_detector_perf_disable();
> > > -}
> > > +void __weak watchdog_hardlockup_disable(unsigned int cpu) { }
> > >
> > >  /* Return 0, if a hardlockup watchdog is available. Error code otherwise 
> > > */
> > >  int __weak __init watchdog_hardlockup_probe(void)
> > >  {
> > > - return hardlockup_detector_perf_init();
> > > + /*
> > > +  * If CONFIG_HAVE_NMI_WATCHDOG is defined then an architecture
> > > +  * is assumed to have the hard watchdog available and we return 0.
> > > +  */
> > > + if (IS_ENABLED(CONFIG_HAVE_NMI_WATCHDOG))
> > > + return 0;
> > > +
> > > + /*
> > > +  * Hardlockup detectors other than those using 
> > > CONFIG_HAVE_NMI_WATCHDOG
> > > +  * are required to implement a non-weak version of this probe 
> > > function
> > > +  * to tell whether they are available. If they don't override then
> > > +  * we'll return -ENODEV.
> > > +  */
> > > + return -ENODEV;
> > >  }
> >
> > When thinking more about it. It is weird that we need to handle
> > CONFIG_HAVE_NMI_WATCHDOG in this default week function.
> >
> > It should be handled in watchdog_hardlockup_probe() implemented
> > in kernel/watchdog_perf.c.
> >
> > IMHO, the default __weak function could always return -ENODEV;
> >
> > Would it make sense, please?
> 
> I don't quite understand. I'd agree that the special case for
> CONFIG_HAVE_NMI_WATCHDOG is ugly, but it was also ugly before. IMO
> it's actually a little less ugly / easier to understand after my
> patch. ...but let me walk through how I think this special case works
> and maybe you can tell me where I'm confused.
> 
> The first thing to understand is that CONFIG_HARDLOCKUP_DETECTOR_PERF
> and CONFIG_HAVE_NMI_WATCHDOG are mutually exclusive from each other.
> This was true before any of my patches and is still true after them.
> Specifically, if CONFIG_HAVE_NMI_WATCHDOG is defined then an
> architecture implements arch_touch_nmi_watchdog() (as documented in
> the Kconfig docs for HAVE_NMI_WATCHDOG). Looking at the tree before my
> series you can see that the perf hardlockup detector also implemented
> arch_touch_nmi_watchdog(). This would have caused a conflict. The
> mutual exclusion was presumably enforced by an architecture not
> defining both HAVE_NMI_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_PERF.
> 
> The second thing to understand is that an architecture that defines
> CONFIG_HAVE_NMI_WATCHDOG is _not_ required to implement
> watchdog_hardlockup_probe() (used to be called watchdog_nmi_probe()).
> Maybe this should change, but at the very least it appears that
> SPARC64 defines HAVE_NMI_WATCHDOG but doesn't define
> watchdog_hardlockup_probe() AKA watchdog_nmi_probe(). Anyone who
> defines CONFIG_HAVE_NMI_WATCHDOG and doesn't implement
> watchdog_hardlockup_probe() is claiming that their watchdog needs no
> probing and is always available.
> 
> So with that context:
> 
> 1. We can't handle any special cases for CONFIG_HAVE_NMI_WATCHDOG in
> "kernel/watchdog_perf.c". 

Re: [PATCH v5 15/18] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs

2023-05-26 Thread Petr Mladek
On Fri 2023-05-19 10:18:39, Douglas Anderson wrote:
> On arm64, NMI support needs to be detected at runtime. Add a weak
> function to the perf hardlockup detector so that an architecture can
> implement it to detect whether NMIs are available.
> 
> Signed-off-by: Douglas Anderson 
> ---
> While I won't object to this patch landing, I consider it part of the
> arm64 perf hardlockup effort. I would be OK with the earlier patches
> in the series landing and then not landing ${SUBJECT} patch nor
> anything else later.
> 
> I'll also note that, as an alternative to this, it would be nice if we
> could figure out how to make perf_event_create_kernel_counter() fail
> on arm64 if NMIs aren't available. Maybe we could add a "must_use_nmi"
> element to "struct perf_event_attr"?
> 
> --- a/kernel/watchdog_perf.c
> +++ b/kernel/watchdog_perf.c
> @@ -234,12 +234,22 @@ void __init hardlockup_detector_perf_restart(void)
>   }
>  }
>  
> +bool __weak __init arch_perf_nmi_is_available(void)
> +{
> + return true;
> +}
> +
>  /**
>   * watchdog_hardlockup_probe - Probe whether NMI event is available at all
>   */
>  int __init watchdog_hardlockup_probe(void)
>  {
> - int ret = hardlockup_detector_event_create();
> + int ret;
> +
> + if (!arch_perf_nmi_is_available())
> + return -ENODEV;

My understanding is that this would block the perf hardlockup detector
at runtime. Does it work with the "nmi_watchdog" sysctl? I see
that it is made read-only when it is not enabled at build time,
see NMI_WATCHDOG_SYSCTL_PERM.

> +
> + ret = hardlockup_detector_event_create();
>  
>   if (ret) {
>   pr_info("Perf NMI watchdog permanently disabled\n");

Best Regards,
Petr
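
(Editorial sketch, not part of the patch: the __weak helper above is intended to be overridden by an architecture that can only tell at runtime whether its perf interrupt is a real NMI. A hypothetical override might look like the following; foo_pmu_irq_is_nmi() is a placeholder, not an existing API.)

	/* arch/foo/kernel/perf_event.c (hypothetical) */
	bool __init arch_perf_nmi_is_available(void)
	{
		/*
		 * Placeholder for whatever runtime check the architecture has.
		 * If this returns false, watchdog_hardlockup_probe() above bails
		 * out with -ENODEV and the perf hardlockup detector stays off.
		 */
		return foo_pmu_irq_is_nmi();
	}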


Re: [PATCH v5 14/18] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-26 Thread Petr Mladek
On Thu 2023-05-25 13:08:04, Doug Anderson wrote:
> Hi,
> 
> On Thu, May 25, 2023 at 9:27 AM Petr Mladek  wrote:
> >
> > On Fri 2023-05-19 10:18:38, Douglas Anderson wrote:
> > > Implement a hardlockup detector that doesn't need any extra
> > > arch-specific support code to detect lockups. Instead of using
> > > something arch-specific we will use the buddy system, where each CPU
> > > watches out for another one. Specifically, each CPU will use its
> > > softlockup hrtimer to check that the next CPU is processing hrtimer
> > > interrupts by verifying that a counter is increasing.
> > >
> > > --- a/kernel/watchdog.c
> > > +++ b/kernel/watchdog.c
> > > @@ -85,7 +85,7 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
> > >
> > >  #endif /* CONFIG_HARDLOCKUP_DETECTOR */
> > >
> > > -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> > > +#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
> > >
> > >  static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
> > >  static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
> > > @@ -106,6 +106,14 @@ notrace void arch_touch_nmi_watchdog(void)
> > >  }
> > >  EXPORT_SYMBOL(arch_touch_nmi_watchdog);
> > >
> > > +void watchdog_hardlockup_touch_cpu(unsigned int cpu)
> > > +{
> > > + per_cpu(watchdog_hardlockup_touched, cpu) = true;
> > > +
> > > + /* Match with smp_rmb() in watchdog_hardlockup_check() */
> > > + smp_wmb();
> >
> > It is great that you described where the related barrier is.
> >
> > Another important piece of information is what exactly is synchronized.
> > And I am actually not sure what we are synchronizing here.
> >
> > My understanding is that a write barrier should synchronize
> > related writes, for example:
> >
> > X = ...;
> > /* Make sure that X is modified before Y */
> > smp_wmb();
> > Y = ...;
> >
> > And the related read barrier should synchronize the related reads,
> > for example:
> >
> > if (test(Y)) {
> > /*
> >  * Make sure that we use the updated X when
> >  * we saw the updated Y.
> >  */
> >  smp_rmb();
> >  do_something(X);
> >  }
> >
> > IMHO, we do not need any barrier here because there is only
> > one variable, "watchdog_hardlockup_touched", involved.
> > watchdog_hardlockup_check() will either see the updated value
> > or not. But it does not synchronize it against any other
> > variables or values.
> 
> Fair. These barriers were present in the original buddy lockup
> detector that we've been carrying in ChromeOS but that doesn't
> necessarily mean that they were there for a good reason.
> 
> Reasoning about weakly ordered memory always makes my brain hurt, and I
> never feel confident at the end that I got the right answer. Of course,
> this is compounded by the fact that a logic error in my reasoning could
> cause a rare / subtle bug. :(

Sure. Lockless code is complicated.

> When possible I
> try to write code that uses full blown locks to make it easier to
> reason about (even if less efficient),

Makes sense. There should be a good reason to use lockless code
because it is hard to get right and to maintain.

> but that's not necessarily
> possible here. While we obviously don't just want to sprinkle barriers
> all over the code, IMO it's not a terrible sin to put a barrier in a
> case where it makes it easier to reason about the order of things.

I understand this. Well, it is always important to describe the
reason why a barrier was added. Even when the reasoning is wrong,
it gives a clue about the motivation. Otherwise, it is hard
to make any changes to the code later.

I guess that it might be more problematic for you because
you probably are not the original author.

> In any case, I guess in this case I would worry about some sort of
> ordering race when enabling / disabling the buddy lockup detector. At
> the end of the buddy's watchdog_hardlockup_enable() /
> watchdog_hardlockup_disable() we adjust the "watchdog_cpus" which
> changes buddy assignments. Without a barrier, I _think_ it would be
> possible for a new CPU to notice the change in buddies without
> noticing the touch. Does that match your understanding? Now when
> reasoning about CPUs going online/offline we need to consider this
> extra case and we have to decide if there's any chance it could lead
> to a false lockup detection. With the memory barriers here, it's a
> little easier to think about.

This makes sense. I did not think about the hotplug scenario.

Well, I suggest moving the barriers into the buddy code and describing
them there. It does not make sense to use the barriers for the perf
hardlockup detector.

I mean something like:

diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
index fee45af2e5bd..ebe71dcb55e6 100644
--- a/kernel/watchdog_buddy.c
+++ b/kernel/watchdog_buddy.c
@@ -52,6 +52,13 @@ void watchdog_hardlockup_enable(unsigned int cpu)
if (next_cpu < 
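
(Petr's suggested diff is truncated in the archive. As a purely editorial sketch of the ordering Doug describes above, with simplified/assumed helper names rather than the actual buddy-watchdog code: the CPU changing the buddy assignment first touches the affected CPU and only then publishes the new assignment, and the checking CPU pairs that with a read barrier so that seeing the new assignment implies seeing the touch.)

	/* Writer side: CPU coming online / going offline. */
	per_cpu(watchdog_hardlockup_touched, next_cpu) = true;
	/* Make the touch visible before the new buddy assignment is published;
	 * pairs with the smp_rmb() on the checking side. */
	smp_wmb();
	cpumask_set_cpu(cpu, &watchdog_cpus);		/* publish new assignment */

	/* Reader side: the checking CPU, from its hrtimer callback. */
	next_cpu = watchdog_next_cpu(smp_processor_id());	/* consults watchdog_cpus */
	/* If we saw the updated watchdog_cpus we must also see the touch,
	 * otherwise we could falsely report a lockup on a CPU we only just
	 * started watching. */
	smp_rmb();
	if (per_cpu(watchdog_hardlockup_touched, next_cpu)) {
		per_cpu(watchdog_hardlockup_touched, next_cpu) = false;
		return;		/* skip this round of checking */
	}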

Bug: Write fault blocked by KUAP! in do_notify_resume()

2023-05-26 Thread Christophe Leroy
I can't figure out how that can happen.

I have:

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y

We are inside an access_begin / access_end block.
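
(For readers who don't have the powerpc signal code in their head: the general shape of such a block is roughly the following generic sketch, with a made-up frame layout, not the actual __unsafe_save_user_regs() from the trace. KUAP is supposed to be opened by user_access_begin() and closed by user_access_end(), so a write fault blocked by KUAP in between is unexpected.)

	static long save_frame_sketch(struct pt_regs *regs,
				      struct fake_sigframe __user *frame)
	{
		if (!user_access_begin(frame, sizeof(*frame)))
			return -EFAULT;
		/* KUAP should be open here: the unsafe_* accessors write user
		 * memory without further per-access checking. */
		unsafe_put_user(regs->nip, &frame->nip, failed);
		unsafe_put_user(regs->link, &frame->link, failed);
		user_access_end();
		return 0;

	failed:
		user_access_end();
		return -EFAULT;
	}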


[  380.407589] [ cut here ]
[  380.408019] Bug: Write fault blocked by KUAP!
[  380.408497] WARNING: CPU: 0 PID: 422 at arch/powerpc/mm/fault.c:228 
do_page_fault+0x510/0x78c
[  380.409353] CPU: 0 PID: 422 Comm: CORSurv Tainted: GW 
  6.4.0-rc2-s3k-dev-02274-gca69d28ba73c #328
[  380.409879] Hardware name: MCR3000_2G 8xx 0x50 CMPC885
[  380.410346] NIP:  c00135cc LR: c00135cc CTR: c0065b08
[  380.410834] REGS: cae53ce0 TRAP: 0700   Tainted: GW 
(6.4.0-rc2-s3k-dev-02274-gca69d28ba73c)
[  380.411335] MSR:  00021032   CR: 22e822d2  XER: 2000f701
[  380.414300]
[  380.414300] GPR00: c00135cc cae53da0 c26439c0 0021 c0a75d78 
0001 c0a75e88 1032
[  380.414300] GPR08: 0027  0001 921a5f00 22e822d2 
1002c9f4 0001 0005
[  380.414300] GPR16: 10025760   100254bc 10024eec 
  0004
[  380.414300] GPR24: 0006 10025770 100100cc c26c8d80 7fab7b50 
0200 0300 cae53de0
[  380.428316] NIP [c00135cc] do_page_fault+0x510/0x78c
[  380.429005] LR [c00135cc] do_page_fault+0x510/0x78c
[  380.429673] Call Trace:
[  380.430115] [cae53da0] [c00135cc] do_page_fault+0x510/0x78c (unreliable)
[  380.431513] [cae53dd0] [c0003ac4] DataTLBError_virt+0x114/0x118
[  380.432546] --- interrupt: 300 at 
__unsafe_save_user_regs.constprop.0+0x24/0x88
[  380.433231] NIP:  c000656c LR: c0006980 CTR: 0006
[  380.433714] REGS: cae53de0 TRAP: 0300   Tainted: GW 
(6.4.0-rc2-s3k-dev-02274-gca69d28ba73c)
[  380.434216] MSR:  9032   CR: 22084442  XER: 2000f701
[  380.437561] DAR: 7fab7b50 DSISR: 8a00
[  380.437561] GPR00: c0006980 cae53ea0 c26439c0 cae53f40 7fab7b50 
7fab7b30 001d 7fab8010
[  380.437561] GPR08: cae53f38 7fab7b50 cae53f40 918ffc00 22084442 
1002c9f4 0001 0005
[  380.437561] GPR16: 10025760   100254bc 10024eec 
  0004
[  380.437561] GPR24: 0006 10025770 100100cc cae53f40 cae53f40 
cae53ec8 c26439c0 7fab7b50
[  380.451991] NIP [c000656c] __unsafe_save_user_regs.constprop.0+0x24/0x88
[  380.452678] LR [c0006980] handle_signal32+0x9c/0x1a0
[  380.453329] --- interrupt: 300
[  380.453803] [cae53ea0] [10025770] 0x10025770 (unreliable)
[  380.455086] [cae53ec0] [c0008844] do_notify_resume+0x128/0x2cc
[  380.456149] [cae53f20] [c000df70] 
interrupt_exit_user_prepare_main+0x7c/0xd8
[  380.457205] [cae53f30] [c00120b4] ret_from_syscall+0xc/0x28
[  380.458249] --- interrupt: c00 at 0xff42e98
[  380.458809] NIP:  0ff42e98 LR: 0ff42e8c CTR: 0fdff244
[  380.459298] REGS: cae53f40 TRAP: 0c00   Tainted: GW 
(6.4.0-rc2-s3k-dev-02274-gca69d28ba73c)
[  380.459796] MSR:  d032   CR: 38084448  XER: 
2000f701
[  380.463473]
[  380.463473] GPR00: 001d 7fab8010 77fc34d0 0004  
  
[  380.463473] GPR08:  2009  7fab7ef0 28084842 
1002c9f4 0001 0005
[  380.463473] GPR16: 10025760   100254bc 10024eec 
  0004
[  380.463473] GPR24: 0006 10025770 100100cc 0fc425ec 1001048c 
7fab8048 10025558 10024ee8
[  380.477466] NIP [0ff42e98] 0xff42e98
[  380.478015] LR [0ff42e8c] 0xff42e8c
[  380.478546] --- interrupt: c00
[  380.479026] Code: 408201f8 807f0080 48033361 2c03 41a2ffc8 
2c1d 41820138 3c80c096 3884d5d0 3c60c096 3863d6bc 4800fa79 
<0fe0> 93410018 814203c8 714a0100
[  380.486263] ---[ end trace  ]---
[  380.492813]
[  380.547950] CORSurv[422]: bad frame in handle_signal32: 3e6b61e5 nip 
0ff42e98 lr 0ff42e8c


Any idea?

Christophe


Re: linux-next: boot failure after merge of the tip tree

2023-05-26 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Wed, May 24, 2023 at 03:44:59PM +1000, Stephen Rothwell wrote:
> > Hi all,
> > 
> > After merging the tip tree, today's linux-next build (powerpc
> > pseries_le_defconfig) failed to boot like this:
> > 
> >  Built 1 zonelists, mobility grouping on.  Total pages: 32736
> >  Policy zone: Normal
> >  mem auto-init: stack:all(zero), heap alloc:off, heap free:off
> >  Memory: 2027392K/2097152K available (17984K kernel code, 3328K rwdata, 
> > 14784K rodata, 6080K init, 1671K bss, 69760K reserved, 0K cma-reserved)
> > 
> > *crickets*
> > 
> > Bisected to commit
> > 
> >   f4ab23558310 ("slub: Replace cmpxchg_double()")
> > 
> > I have reverted that commit (and the following one) for today.
> 
> Sorry about that; turns out I forgot to test the case where cmpxchg128
> wasn't available.
> 
> Please see:
> 
>   
> https://lkml.kernel.org/r/20230524093246.gp83...@hirez.programming.kicks-ass.net
> 
> As stated there, I'm going to zap tip/locking/core for a few days and
> let this fixed version run through the robots again before reinstating
> it.

Note to -next, this should now be resolved in the tip:auto-latest branch.

Thanks,

Ingo