Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-07 Thread Hugh Dickins
On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Thu, 1 Jun 2023 15:57:51 +0200
> > Gerald Schaefer  wrote:
> > > 
> > > Yes, we have 2 pagetables in one 4K page, which could result in reuse
> > > of the same rcu_head. It might be possible to use the cleverness from our
> > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > the case where both 2K pagetable fragments become unused, similar to
> > > how we decide when to actually call __free_page().
> > > 
> > > However, it might be much worse, and page->rcu_head from a pagetable
> > > page cannot be used at all for s390, because we also use page->lru
> > > to keep our list of free 2K pagetable fragments. I always get confused
> > > by struct page unions, so not completely sure, but it seems to me that
> > > page->rcu_head would overlay with page->lru, right?  
> > 
> > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > I'm wrong) I think that s390 could use exactly the same technique for
> > its list of free 2K pagetable fragments as it uses for its list of THP
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> Nice idea, I think that could actually work, since we only need the empty
> 2K halves on the list. So it should be possible to store the list_head
> inside those.

Jason quickly pointed out the flaw in my thinking there.

> 
> > 
> > And while it could use third and fourth longs instead, I don't see any
> > need for that: a deposited pagetable has been allocated, so would not
> > be on the list of free fragments.
> 
> Correct, that should not interfere.
> 
> > 
> > Below is one of the grossest patches I've ever posted: gross because
> > it's a rushed attempt to see whether that is viable, while it would take
> > me longer to understand all the s390 cleverness there (even though the
> > PP AA commentary above page_table_alloc() is excellent).
> 
> Sounds fair, this is also some of the grossest code we have, which is also
> why Alexander added the comment. I guess we could use even more comments
> inside the code, as it still confuses me more than it should.
> 
> Considering that, you did remarkably well. Your patch seems to work fine,
> at least it survived some LTP mm tests. I will also add it to our CI runs,
> to give it some more testing. Will report back tomorrow if it breaks anything.
> See also below for some patch comments.

Many thanks for your effort on this patch.  I don't expect the testing
of it to catch Jason's point, that I'm corrupting the page table while
it's on its way through RCU to being freed, but he's right nonetheless.

I'll integrate your fixes below into what I have here, but will probably
just archive it as something to refer back to later in case it can play
a part; most likely it will not - sorry for wasting your time.

> 
> > 
> > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.
> 
> cmma_init_nodat() should be disjoint, not only because it is __init,
> but also because it explicitly skips pagetable pages, so it should
> never touch page->lru of those.
> 
> Not very familiar with the gmap code, it does look disjoint, and we should
> also use complete 4K pages for pagetables instead of 2K fragments there,
> but Christian or Claudio should also have a look.
> 
> > 
> > Gerald, s390 folk: would it be possible for you to give this
> > a try, suggest corrections and improvements, and then I can make it
> > a separate patch of the series; and work on avoiding concurrent use
> > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > the scheme already there, maybe DD bits to go along with the PP AA).
> 
> It feels like it could be possible to not only avoid the double
> rcu_head, but also avoid passing over the mm via page->pt_mm.
> I.e. have pte_free_defer(), which has the mm, do all the checks and
> list updates that page_table_free() does, for which we need the mm.
> Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
> and do call_rcu(pte_free_now) instead. The pte_free_now() could then
> just do _dtor/__free_page similar to the generic version.
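Gerald's suggested flow, roughly sketched below as kernel-style pseudocode. This is hypothetical - the names and the division of work are assumptions drawn from the paragraph above, not actual s390 code, and it sidesteps the rcu_head/lru overlap discussed earlier in the thread:

```c
/* Hypothetical sketch of the suggestion above - NOT actual kernel code. */
static void pte_free_now(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	/* Like the generic version: destructor, then free the page. */
	pgtable_pte_page_dtor(page);
	__free_page(page);
}

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
	struct page *page = pgtable;

	/*
	 * Do the checks and list updates that page_table_free() does
	 * (the 2K-fragment accounting needs the mm), but skip the
	 * pgtable_pte_page_dtor() + __free_page() at the end...
	 */

	/* ...and defer the actual free through RCU instead: */
	call_rcu(&page->rcu_head, pte_free_now);
}
```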

I'm not sure: I missed your suggestion there when I first skimmed
through, and today have spent more time getting deeper into how it's
done at present.  I am now feeling more confident of a way forward,
a nicely integrated way forward, than I was yesterday.
Though getting it right may not be so easy.

When Jason pointed out the existing RCU, I initially hoped that it might
already provide the necessary framework: but sadly not, because the
unbatched case (used when additional memory is not available) does not
use RCU at all, but instead the tlb_remove_table_sync_one() IRQ hack.
If I used that, it would cripple the s390 implementation unacceptably.

> 
> I must admit 

[PATCH v3 6/6] KVM: PPC: selftests: Add interrupt performance tester

2023-06-07 Thread Nicholas Piggin
Add a little perf tester for interrupts that go to guest, host, and
userspace.

Signed-off-by: Nicholas Piggin 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../selftests/kvm/powerpc/interrupt_perf.c| 199 ++
 2 files changed, 200 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/powerpc/interrupt_perf.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index aa3a8ca676c2..834f98971b0c 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -184,6 +184,7 @@ TEST_GEN_PROGS_riscv += kvm_page_table_test
 TEST_GEN_PROGS_riscv += set_memory_region_test
 TEST_GEN_PROGS_riscv += kvm_binary_stats_test
 
+TEST_GEN_PROGS_powerpc += powerpc/interrupt_perf
 TEST_GEN_PROGS_powerpc += powerpc/null_test
 TEST_GEN_PROGS_powerpc += powerpc/rtas_hcall
 TEST_GEN_PROGS_powerpc += powerpc/tlbiel_test
diff --git a/tools/testing/selftests/kvm/powerpc/interrupt_perf.c 
b/tools/testing/selftests/kvm/powerpc/interrupt_perf.c
new file mode 100644
index ..50d078899e22
--- /dev/null
+++ b/tools/testing/selftests/kvm/powerpc/interrupt_perf.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test basic guest interrupt/exit performance.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "kselftest.h"
+#include "processor.h"
+#include "helpers.h"
+#include "hcall.h"
+
+static bool timeout;
+static unsigned long count;
+static struct kvm_vm *kvm_vm;
+
+static void set_timer(int sec)
+{
+   struct itimerval timer;
+
+   timeout = false;
+
+   timer.it_value.tv_sec  = sec;
+   timer.it_value.tv_usec = 0;
+   timer.it_interval = timer.it_value;
+   TEST_ASSERT(setitimer(ITIMER_REAL, &timer, NULL) == 0,
+   "setitimer failed %s", strerror(errno));
+}
+
+static void sigalrm_handler(int sig)
+{
+   timeout = true;
+   sync_global_to_guest(kvm_vm, timeout);
+}
+
+static void init_timers(void)
+{
+   TEST_ASSERT(signal(SIGALRM, sigalrm_handler) != SIG_ERR,
+   "Failed to register SIGALRM handler, errno = %d (%s)",
+   errno, strerror(errno));
+}
+
+static void program_interrupt_handler(struct ex_regs *regs)
+{
+   regs->nia += 4;
+}
+
+static void program_interrupt_guest_code(void)
+{
+   unsigned long nr = 0;
+
+   while (!timeout) {
+   asm volatile("trap");
+   nr++;
+   barrier();
+   }
+   count = nr;
+
+   GUEST_DONE();
+}
+
+static void program_interrupt_test(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   /* Create VM */
+   vm = vm_create_with_one_vcpu(&vcpu, program_interrupt_guest_code);
+   kvm_vm = vm;
+   vm_install_exception_handler(vm, 0x700, program_interrupt_handler);
+
+   set_timer(1);
+
+   while (!timeout) {
+   vcpu_run(vcpu);
+   barrier();
+   }
+
+   sync_global_from_guest(vm, count);
+
+   kvm_vm = NULL;
+   vm_install_exception_handler(vm, 0x700, NULL);
+
+   kvm_vm_free(vm);
+
+   printf("%lu guest interrupts per second\n", count);
+   count = 0;
+}
+
+static void heai_guest_code(void)
+{
+   unsigned long nr = 0;
+
+   while (!timeout) {
+   asm volatile(".long 0");
+   nr++;
+   barrier();
+   }
+   count = nr;
+
+   GUEST_DONE();
+}
+
+static void heai_test(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   /* Create VM */
+   vm = vm_create_with_one_vcpu(&vcpu, heai_guest_code);
+   kvm_vm = vm;
+   vm_install_exception_handler(vm, 0x700, program_interrupt_handler);
+
+   set_timer(1);
+
+   while (!timeout) {
+   vcpu_run(vcpu);
+   barrier();
+   }
+
+   sync_global_from_guest(vm, count);
+
+   kvm_vm = NULL;
+   vm_install_exception_handler(vm, 0x700, NULL);
+
+   kvm_vm_free(vm);
+
+   printf("%lu guest exits per second\n", count);
+   count = 0;
+}
+
+static void hcall_guest_code(void)
+{
+   for (;;)
+   hcall0(H_RTAS);
+}
+
+static void hcall_test(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   /* Create VM */
+   vm = vm_create_with_one_vcpu(&vcpu, hcall_guest_code);
+   kvm_vm = vm;
+
+   set_timer(1);
+
+   while (!timeout) {
+   vcpu_run(vcpu);
+   count++;
+   barrier();
+   }
+
+   kvm_vm = NULL;
+
+   kvm_vm_free(vm);
+
+   printf("%lu KVM exits per second\n", count);
+   count = 0;
+}
+
+struct testdef {
+   const char *name;
+   void (*test)(void);
+} testlist[] = {
+   { "guest interrupt test", program_interrupt_test},
+   { "guest exit test", heai_test},
+   { 

[PATCH v3 5/6] KVM: PPC: selftests: Add a TLBIEL virtualisation tester

2023-06-07 Thread Nicholas Piggin
TLBIEL virtualisation has been a source of difficulty. The TLBIEL
instruction operates on the TLB of the hardware thread which
executes it, but the behaviour expected by the guest environment

Signed-off-by: Nicholas Piggin 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../selftests/kvm/include/powerpc/processor.h |   7 +
 .../selftests/kvm/lib/powerpc/processor.c | 108 +++-
 .../selftests/kvm/powerpc/tlbiel_test.c   | 508 ++
 4 files changed, 621 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/powerpc/tlbiel_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index efb8700b9752..aa3a8ca676c2 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -186,6 +186,7 @@ TEST_GEN_PROGS_riscv += kvm_binary_stats_test
 
 TEST_GEN_PROGS_powerpc += powerpc/null_test
 TEST_GEN_PROGS_powerpc += powerpc/rtas_hcall
+TEST_GEN_PROGS_powerpc += powerpc/tlbiel_test
 TEST_GEN_PROGS_powerpc += access_tracking_perf_test
 TEST_GEN_PROGS_powerpc += demand_paging_test
 TEST_GEN_PROGS_powerpc += dirty_log_test
diff --git a/tools/testing/selftests/kvm/include/powerpc/processor.h 
b/tools/testing/selftests/kvm/include/powerpc/processor.h
index ce5a23525dbd..92ef6476a9ef 100644
--- a/tools/testing/selftests/kvm/include/powerpc/processor.h
+++ b/tools/testing/selftests/kvm/include/powerpc/processor.h
@@ -7,6 +7,7 @@
 
 #include 
 #include "ppc_asm.h"
+#include "kvm_util_base.h"
 
 extern unsigned char __interrupts_start[];
 extern unsigned char __interrupts_end[];
@@ -31,6 +32,12 @@ struct ex_regs {
 void vm_install_exception_handler(struct kvm_vm *vm, int vector,
void (*handler)(struct ex_regs *));
 
+vm_paddr_t virt_pt_duplicate(struct kvm_vm *vm);
+void set_radix_proc_table(struct kvm_vm *vm, int pid, vm_paddr_t pgd);
+bool virt_wrprotect_pte(struct kvm_vm *vm, uint64_t gva);
+bool virt_wrenable_pte(struct kvm_vm *vm, uint64_t gva);
+bool virt_remap_pte(struct kvm_vm *vm, uint64_t gva, vm_paddr_t gpa);
+
 static inline void cpu_relax(void)
 {
asm volatile("" ::: "memory");
diff --git a/tools/testing/selftests/kvm/lib/powerpc/processor.c 
b/tools/testing/selftests/kvm/lib/powerpc/processor.c
index 02db2ff86da8..17ea440f9026 100644
--- a/tools/testing/selftests/kvm/lib/powerpc/processor.c
+++ b/tools/testing/selftests/kvm/lib/powerpc/processor.c
@@ -23,7 +23,7 @@ static void set_proc_table(struct kvm_vm *vm, int pid, uint64_t dw0, uint64_t dw
proc_table[pid * 2 + 1] = cpu_to_be64(dw1);
 }
 
-static void set_radix_proc_table(struct kvm_vm *vm, int pid, vm_paddr_t pgd)
+void set_radix_proc_table(struct kvm_vm *vm, int pid, vm_paddr_t pgd)
 {
	set_proc_table(vm, pid, pgd | RADIX_TREE_SIZE | RADIX_PGD_INDEX_SIZE, 0);
 }
@@ -146,9 +146,69 @@ static uint64_t *virt_get_pte(struct kvm_vm *vm, vm_paddr_t pt,
 #define PDE_NLS0x0011ull
 #define PDE_PT_MASK0x0f00ull
 
-void virt_arch_pg_map(struct kvm_vm *vm, uint64_t gva, uint64_t gpa)
+static uint64_t *virt_lookup_pte(struct kvm_vm *vm, uint64_t gva)
 {
vm_paddr_t pt = vm->pgd;
+   uint64_t *ptep;
+   int level;
+
+   for (level = 1; level <= 3; level++) {
+   uint64_t nls;
+   uint64_t *pdep = virt_get_pte(vm, pt, gva, level, );
+   uint64_t pde = be64_to_cpu(*pdep);
+
+   if (pde) {
+   TEST_ASSERT((pde & PDE_VALID) && !(pde & PTE_LEAF),
+   "Invalid PDE at level: %u gva: 0x%lx pde:0x%lx\n",
+   level, gva, pde);
+   pt = pde & PDE_PT_MASK;
+   continue;
+   }
+
+   return NULL;
+   }
+
+   ptep = virt_get_pte(vm, pt, gva, level, NULL);
+
+   return ptep;
+}
+
+static bool virt_modify_pte(struct kvm_vm *vm, uint64_t gva, uint64_t clr, uint64_t set)
+{
+   uint64_t *ptep, pte;
+
+   ptep = virt_lookup_pte(vm, gva);
+   if (!ptep)
+   return false;
+
+   pte = be64_to_cpu(*ptep);
+   if (!(pte & PTE_VALID))
+   return false;
+
+   pte = (pte & ~clr) | set;
+   *ptep = cpu_to_be64(pte);
+
+   return true;
+}
+
+bool virt_remap_pte(struct kvm_vm *vm, uint64_t gva, vm_paddr_t gpa)
+{
+   return virt_modify_pte(vm, gva, PTE_PAGE_MASK, (gpa & PTE_PAGE_MASK));
+}
+
+bool virt_wrprotect_pte(struct kvm_vm *vm, uint64_t gva)
+{
+   return virt_modify_pte(vm, gva, PTE_RW, 0);
+}
+
+bool virt_wrenable_pte(struct kvm_vm *vm, uint64_t gva)
+{
+   return virt_modify_pte(vm, gva, 0, PTE_RW);
+}
+
+static void __virt_arch_pg_map(struct kvm_vm *vm, vm_paddr_t pgd, uint64_t gva, uint64_t gpa)
+{
+   vm_paddr_t pt = pgd;
uint64_t *ptep, pte;
int level;
 
@@ -187,6 +247,49 @@ void virt_arch_pg_map(struct kvm_vm *vm, uint64_t gva, uint64_t gpa)
*ptep = 

[PATCH v3 4/6] KVM: PPC: selftests: add selftests sanity tests

2023-06-07 Thread Nicholas Piggin
Add tests that exercise very basic functions of the kvm selftests
framework: guest creation, ucalls, hcalls, copying data between guest
and host, interrupts, and page faults.

These don't stress KVM so much as they are useful when developing
selftests support for powerpc.

Acked-by: Michael Ellerman  (powerpc)
Signed-off-by: Nicholas Piggin 
---
 tools/testing/selftests/kvm/Makefile  |   2 +
 .../selftests/kvm/include/powerpc/hcall.h |   2 +
 .../testing/selftests/kvm/powerpc/null_test.c | 166 ++
 .../selftests/kvm/powerpc/rtas_hcall.c| 136 ++
 4 files changed, 306 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/powerpc/null_test.c
 create mode 100644 tools/testing/selftests/kvm/powerpc/rtas_hcall.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index 53cd3ce63dec..efb8700b9752 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -184,6 +184,8 @@ TEST_GEN_PROGS_riscv += kvm_page_table_test
 TEST_GEN_PROGS_riscv += set_memory_region_test
 TEST_GEN_PROGS_riscv += kvm_binary_stats_test
 
+TEST_GEN_PROGS_powerpc += powerpc/null_test
+TEST_GEN_PROGS_powerpc += powerpc/rtas_hcall
 TEST_GEN_PROGS_powerpc += access_tracking_perf_test
 TEST_GEN_PROGS_powerpc += demand_paging_test
 TEST_GEN_PROGS_powerpc += dirty_log_test
diff --git a/tools/testing/selftests/kvm/include/powerpc/hcall.h 
b/tools/testing/selftests/kvm/include/powerpc/hcall.h
index ba119f5a3fef..04c7d2d13020 100644
--- a/tools/testing/selftests/kvm/include/powerpc/hcall.h
+++ b/tools/testing/selftests/kvm/include/powerpc/hcall.h
@@ -12,6 +12,8 @@
 #define UCALL_R4_UCALL 0x5715 // regular ucall, r5 contains ucall pointer
 #define UCALL_R4_SIMPLE0x // simple exit usable by asm with no ucall data
 
+#define H_RTAS 0xf000
+
 int64_t hcall0(uint64_t token);
 int64_t hcall1(uint64_t token, uint64_t arg1);
 int64_t hcall2(uint64_t token, uint64_t arg1, uint64_t arg2);
diff --git a/tools/testing/selftests/kvm/powerpc/null_test.c 
b/tools/testing/selftests/kvm/powerpc/null_test.c
new file mode 100644
index ..31db0b6becd6
--- /dev/null
+++ b/tools/testing/selftests/kvm/powerpc/null_test.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Tests for guest creation, run, ucall, interrupt, and vm dumping.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "kselftest.h"
+#include "processor.h"
+#include "helpers.h"
+
+extern void guest_code_asm(void);
+asm(".global guest_code_asm");
+asm(".balign 4");
+asm("guest_code_asm:");
+asm("li 3,0"); // H_UCALL
+asm("li 4,0"); // UCALL_R4_SIMPLE
+asm("sc 1");
+
+static void test_asm(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   vm = vm_create_with_one_vcpu(&vcpu, guest_code_asm);
+
+   vcpu_run(vcpu);
+   handle_ucall(vcpu, UCALL_NONE);
+
+   kvm_vm_free(vm);
+}
+
+static void guest_code_ucall(void)
+{
+   GUEST_DONE();
+}
+
+static void test_ucall(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   vm = vm_create_with_one_vcpu(&vcpu, guest_code_ucall);
+
+   vcpu_run(vcpu);
+   handle_ucall(vcpu, UCALL_DONE);
+
+   kvm_vm_free(vm);
+}
+
+static void trap_handler(struct ex_regs *regs)
+{
+   GUEST_SYNC(1);
+   regs->nia += 4;
+}
+
+static void guest_code_trap(void)
+{
+   GUEST_SYNC(0);
+   asm volatile("trap");
+   GUEST_DONE();
+}
+
+static void test_trap(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   vm = vm_create_with_one_vcpu(&vcpu, guest_code_trap);
+   vm_install_exception_handler(vm, 0x700, trap_handler);
+
+   vcpu_run(vcpu);
+   host_sync(vcpu, 0);
+   vcpu_run(vcpu);
+   host_sync(vcpu, 1);
+   vcpu_run(vcpu);
+   handle_ucall(vcpu, UCALL_DONE);
+
+   vm_install_exception_handler(vm, 0x700, NULL);
+
+   kvm_vm_free(vm);
+}
+
+static void dsi_handler(struct ex_regs *regs)
+{
+   GUEST_SYNC(1);
+   regs->nia += 4;
+}
+
+static void guest_code_dsi(void)
+{
+   GUEST_SYNC(0);
+   asm volatile("stb %r0,0(0)");
+   GUEST_DONE();
+}
+
+static void test_dsi(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   vm = vm_create_with_one_vcpu(&vcpu, guest_code_dsi);
+   vm_install_exception_handler(vm, 0x300, dsi_handler);
+
+   vcpu_run(vcpu);
+   host_sync(vcpu, 0);
+   vcpu_run(vcpu);
+   host_sync(vcpu, 1);
+   vcpu_run(vcpu);
+   handle_ucall(vcpu, UCALL_DONE);
+
+   vm_install_exception_handler(vm, 0x300, NULL);
+
+   kvm_vm_free(vm);
+}
+
+static void test_dump(void)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm_vm *vm;
+
+   vm = vm_create_with_one_vcpu(&vcpu, guest_code_ucall);
+
+   vcpu_run(vcpu);
+   handle_ucall(vcpu, UCALL_DONE);
+
+   printf("Testing 

[PATCH v3 3/6] KVM: PPC: selftests: add support for powerpc

2023-06-07 Thread Nicholas Piggin
Implement KVM selftests support for powerpc (Book3S-64).

ucalls are implemented with an unsupported PAPR hcall number, which will
always cause KVM to exit to userspace.

Virtual memory is implemented for the radix MMU, and only the base page
size is supported (either 4K or 64K).

Guest interrupts are taken in real-mode, so they require a page allocated at
gRA 0x0. Interrupt entry is complicated because gVA:gRA is not 1:1 mapped
(the way the kernel is), so the MMU cannot simply be switched on and
off.

Acked-by: Michael Ellerman  (powerpc)
Signed-off-by: Nicholas Piggin 
---
 MAINTAINERS   |   2 +
 tools/testing/selftests/kvm/Makefile  |  19 +
 .../selftests/kvm/include/kvm_util_base.h |  22 +
 .../selftests/kvm/include/powerpc/hcall.h |  19 +
 .../selftests/kvm/include/powerpc/ppc_asm.h   |  32 ++
 .../selftests/kvm/include/powerpc/processor.h |  39 ++
 tools/testing/selftests/kvm/lib/guest_modes.c |  27 +-
 tools/testing/selftests/kvm/lib/kvm_util.c|  12 +
 .../selftests/kvm/lib/powerpc/handlers.S  |  93 
 .../testing/selftests/kvm/lib/powerpc/hcall.c |  45 ++
 .../selftests/kvm/lib/powerpc/processor.c | 439 ++
 .../testing/selftests/kvm/lib/powerpc/ucall.c |  30 ++
 tools/testing/selftests/kvm/powerpc/helpers.h |  46 ++
 13 files changed, 821 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/powerpc/hcall.h
 create mode 100644 tools/testing/selftests/kvm/include/powerpc/ppc_asm.h
 create mode 100644 tools/testing/selftests/kvm/include/powerpc/processor.h
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/handlers.S
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/hcall.c
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/processor.c
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/ucall.c
 create mode 100644 tools/testing/selftests/kvm/powerpc/helpers.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 44417acd2936..39afb356369e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11391,6 +11391,8 @@ F:  arch/powerpc/include/asm/kvm*
 F: arch/powerpc/include/uapi/asm/kvm*
 F: arch/powerpc/kernel/kvm*
 F: arch/powerpc/kvm/
+F: tools/testing/selftests/kvm/*/powerpc/
+F: tools/testing/selftests/kvm/powerpc/
 
 KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv)
 M: Anup Patel 
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index 4761b768b773..53cd3ce63dec 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -55,6 +55,11 @@ LIBKVM_s390x += lib/s390x/ucall.c
 LIBKVM_riscv += lib/riscv/processor.c
 LIBKVM_riscv += lib/riscv/ucall.c
 
+LIBKVM_powerpc += lib/powerpc/handlers.S
+LIBKVM_powerpc += lib/powerpc/processor.c
+LIBKVM_powerpc += lib/powerpc/ucall.c
+LIBKVM_powerpc += lib/powerpc/hcall.c
+
 # Non-compiled test targets
 TEST_PROGS_x86_64 += x86_64/nx_huge_pages_test.sh
 
@@ -179,6 +184,20 @@ TEST_GEN_PROGS_riscv += kvm_page_table_test
 TEST_GEN_PROGS_riscv += set_memory_region_test
 TEST_GEN_PROGS_riscv += kvm_binary_stats_test
 
+TEST_GEN_PROGS_powerpc += access_tracking_perf_test
+TEST_GEN_PROGS_powerpc += demand_paging_test
+TEST_GEN_PROGS_powerpc += dirty_log_test
+TEST_GEN_PROGS_powerpc += dirty_log_perf_test
+TEST_GEN_PROGS_powerpc += hardware_disable_test
+TEST_GEN_PROGS_powerpc += kvm_create_max_vcpus
+TEST_GEN_PROGS_powerpc += kvm_page_table_test
+TEST_GEN_PROGS_powerpc += max_guest_memory_test
+TEST_GEN_PROGS_powerpc += memslot_modification_stress_test
+TEST_GEN_PROGS_powerpc += memslot_perf_test
+TEST_GEN_PROGS_powerpc += rseq_test
+TEST_GEN_PROGS_powerpc += set_memory_region_test
+TEST_GEN_PROGS_powerpc += kvm_binary_stats_test
+
 TEST_PROGS += $(TEST_PROGS_$(ARCH_DIR))
 TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(ARCH_DIR))
 TEST_GEN_PROGS_EXTENDED += $(TEST_GEN_PROGS_EXTENDED_$(ARCH_DIR))
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 42d03ae08ecb..17b80709b894 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -105,6 +105,7 @@ struct kvm_vm {
bool pgd_created;
vm_paddr_t ucall_mmio_addr;
vm_paddr_t pgd;
+   vm_paddr_t prtb; // powerpc process table
vm_vaddr_t gdt;
vm_vaddr_t tss;
vm_vaddr_t idt;
@@ -160,6 +161,8 @@ enum vm_guest_mode {
VM_MODE_PXXV48_4K,  /* For 48bits VA but ANY bits PA */
VM_MODE_P47V64_4K,
VM_MODE_P44V64_4K,
+   VM_MODE_P52V52_4K,
+   VM_MODE_P52V52_64K,
VM_MODE_P36V48_4K,
VM_MODE_P36V48_16K,
VM_MODE_P36V48_64K,
@@ -197,6 +200,25 @@ extern enum vm_guest_mode vm_mode_default;
 #define MIN_PAGE_SHIFT 12U
 #define ptes_per_page(page_size)   ((page_size) / 8)
 
+#elif defined(__powerpc64__)
+
+extern enum vm_guest_mode vm_mode_default;
+
+#define 

[PATCH v3 2/6] KVM: selftests: Add aligned guest physical page allocator

2023-06-07 Thread Nicholas Piggin
powerpc will require this to allocate MMU tables in guest memory that
are larger than the guest base page size.

Signed-off-by: Nicholas Piggin 
---
 .../selftests/kvm/include/kvm_util_base.h |  2 +
 tools/testing/selftests/kvm/lib/kvm_util.c| 44 ---
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index d630a0a1877c..42d03ae08ecb 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -680,6 +680,8 @@ const char *exit_reason_str(unsigned int exit_reason);
 
 vm_paddr_t vm_phy_page_alloc(struct kvm_vm *vm, vm_paddr_t paddr_min,
 uint32_t memslot);
+vm_paddr_t vm_phy_pages_alloc_align(struct kvm_vm *vm, size_t num, size_t align,
+ vm_paddr_t paddr_min, uint32_t memslot);
 vm_paddr_t vm_phy_pages_alloc(struct kvm_vm *vm, size_t num,
  vm_paddr_t paddr_min, uint32_t memslot);
 vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 298c4372fb1a..68558d60f949 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1903,6 +1903,7 @@ const char *exit_reason_str(unsigned int exit_reason)
  * Input Args:
  *   vm - Virtual Machine
  *   num - number of pages
+ *   align - pages alignment
  *   paddr_min - Physical address minimum
  *   memslot - Memory region to allocate page from
  *
@@ -1916,7 +1917,7 @@ const char *exit_reason_str(unsigned int exit_reason)
  * and their base address is returned. A TEST_ASSERT failure occurs if
  * not enough pages are available at or above paddr_min.
  */
-vm_paddr_t vm_phy_pages_alloc(struct kvm_vm *vm, size_t num,
+vm_paddr_t vm_phy_pages_alloc_align(struct kvm_vm *vm, size_t num, size_t align,
  vm_paddr_t paddr_min, uint32_t memslot)
 {
struct userspace_mem_region *region;
@@ -1930,24 +1931,27 @@ vm_paddr_t vm_phy_pages_alloc(struct kvm_vm *vm, size_t num,
paddr_min, vm->page_size);
 
region = memslot2region(vm, memslot);
-   base = pg = paddr_min >> vm->page_shift;
-
-   do {
-   for (; pg < base + num; ++pg) {
-   if (!sparsebit_is_set(region->unused_phy_pages, pg)) {
-   base = pg = sparsebit_next_set(region->unused_phy_pages, pg);
-   break;
+   base = paddr_min >> vm->page_shift;
+
+again:
+   base = (base + align - 1) & ~(align - 1);
+   for (pg = base; pg < base + num; ++pg) {
+   if (!sparsebit_is_set(region->unused_phy_pages, pg)) {
+   base = sparsebit_next_set(region->unused_phy_pages, pg);
+   if (!base) {
+   fprintf(stderr, "No guest physical pages "
+   "available, paddr_min: 0x%lx "
+   "page_size: 0x%x memslot: %u "
+   "num_pages: %lu align: %lu\n",
+   paddr_min, vm->page_size, memslot,
+   num, align);
+   fputs(" vm dump \n", stderr);
+   vm_dump(stderr, vm, 2);
+   TEST_ASSERT(false, "false");
+   abort();
}
+   goto again;
}
-   } while (pg && pg != base + num);
-
-   if (pg == 0) {
-   fprintf(stderr, "No guest physical page available, "
-   "paddr_min: 0x%lx page_size: 0x%x memslot: %u\n",
-   paddr_min, vm->page_size, memslot);
-   fputs(" vm dump \n", stderr);
-   vm_dump(stderr, vm, 2);
-   abort();
}
 
for (pg = base; pg < base + num; ++pg)
@@ -1956,6 +1960,12 @@ vm_paddr_t vm_phy_pages_alloc(struct kvm_vm *vm, size_t num,
return base * vm->page_size;
 }
 
+vm_paddr_t vm_phy_pages_alloc(struct kvm_vm *vm, size_t num,
+ vm_paddr_t paddr_min, uint32_t memslot)
+{
+   return vm_phy_pages_alloc_align(vm, num, 1, paddr_min, memslot);
+}
+
 vm_paddr_t vm_phy_page_alloc(struct kvm_vm *vm, vm_paddr_t paddr_min,
 uint32_t memslot)
 {
-- 
2.40.1



[PATCH v3 0/6] KVM: selftests: add powerpc support

2023-06-07 Thread Nicholas Piggin
This series adds initial KVM selftests support for powerpc
(64-bit, BookS, radix MMU).

Since v2:
- Added a couple of new tests (patch 5,6)
- Make default page size match host page size.
- Check for radix MMU capability.
- Build a few more of the generic tests.

Since v1:
- Update MAINTAINERS KVM PPC entry to include kvm selftests.
- Fixes and cleanups from Sean's review including new patch 1.
- Add 4K guest page support requiring new patch 2.

Thanks,
Nick

Nicholas Piggin (6):
  KVM: selftests: Move pgd_created check into virt_pgd_alloc
  KVM: selftests: Add aligned guest physical page allocator
  KVM: PPC: selftests: add support for powerpc
  KVM: PPC: selftests: add selftests sanity tests
  KVM: PPC: selftests: Add a TLBIEL virtualisation tester
  KVM: PPC: selftests: Add interrupt performance tester

 MAINTAINERS   |   2 +
 tools/testing/selftests/kvm/Makefile  |  23 +
 .../selftests/kvm/include/kvm_util_base.h |  29 +
 .../selftests/kvm/include/powerpc/hcall.h |  21 +
 .../selftests/kvm/include/powerpc/ppc_asm.h   |  32 ++
 .../selftests/kvm/include/powerpc/processor.h |  46 ++
 .../selftests/kvm/lib/aarch64/processor.c |   4 -
 tools/testing/selftests/kvm/lib/guest_modes.c |  27 +-
 tools/testing/selftests/kvm/lib/kvm_util.c|  56 +-
 .../selftests/kvm/lib/powerpc/handlers.S  |  93 +++
 .../testing/selftests/kvm/lib/powerpc/hcall.c |  45 ++
 .../selftests/kvm/lib/powerpc/processor.c | 541 ++
 .../testing/selftests/kvm/lib/powerpc/ucall.c |  30 +
 .../selftests/kvm/lib/riscv/processor.c   |   4 -
 .../selftests/kvm/lib/s390x/processor.c   |   4 -
 .../selftests/kvm/lib/x86_64/processor.c  |   7 +-
 tools/testing/selftests/kvm/powerpc/helpers.h |  46 ++
 .../selftests/kvm/powerpc/interrupt_perf.c| 199 +++
 .../testing/selftests/kvm/powerpc/null_test.c | 166 ++
 .../selftests/kvm/powerpc/rtas_hcall.c| 136 +
 .../selftests/kvm/powerpc/tlbiel_test.c   | 508 
 21 files changed, 1981 insertions(+), 38 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/powerpc/hcall.h
 create mode 100644 tools/testing/selftests/kvm/include/powerpc/ppc_asm.h
 create mode 100644 tools/testing/selftests/kvm/include/powerpc/processor.h
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/handlers.S
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/hcall.c
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/processor.c
 create mode 100644 tools/testing/selftests/kvm/lib/powerpc/ucall.c
 create mode 100644 tools/testing/selftests/kvm/powerpc/helpers.h
 create mode 100644 tools/testing/selftests/kvm/powerpc/interrupt_perf.c
 create mode 100644 tools/testing/selftests/kvm/powerpc/null_test.c
 create mode 100644 tools/testing/selftests/kvm/powerpc/rtas_hcall.c
 create mode 100644 tools/testing/selftests/kvm/powerpc/tlbiel_test.c

-- 
2.40.1



[PATCH v3 1/6] KVM: selftests: Move pgd_created check into virt_pgd_alloc

2023-06-07 Thread Nicholas Piggin
All virt_arch_pgd_alloc() implementations do the same test and set of
pgd_created. Move this into common code.

Signed-off-by: Nicholas Piggin 
---
 tools/testing/selftests/kvm/include/kvm_util_base.h | 5 +
 tools/testing/selftests/kvm/lib/aarch64/processor.c | 4 
 tools/testing/selftests/kvm/lib/riscv/processor.c   | 4 
 tools/testing/selftests/kvm/lib/s390x/processor.c   | 4 
 tools/testing/selftests/kvm/lib/x86_64/processor.c  | 7 ++-
 5 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a089c356f354..d630a0a1877c 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -822,7 +822,12 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm);
 
 static inline void virt_pgd_alloc(struct kvm_vm *vm)
 {
+   if (vm->pgd_created)
+   return;
+
virt_arch_pgd_alloc(vm);
+
+   vm->pgd_created = true;
 }
 
 /*
diff --git a/tools/testing/selftests/kvm/lib/aarch64/processor.c 
b/tools/testing/selftests/kvm/lib/aarch64/processor.c
index 3a0259e25335..3da3ec7c5b23 100644
--- a/tools/testing/selftests/kvm/lib/aarch64/processor.c
+++ b/tools/testing/selftests/kvm/lib/aarch64/processor.c
@@ -96,13 +96,9 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm)
 {
size_t nr_pages = page_align(vm, ptrs_per_pgd(vm) * 8) / vm->page_size;
 
-   if (vm->pgd_created)
-   return;
-
vm->pgd = vm_phy_pages_alloc(vm, nr_pages,
 KVM_GUEST_PAGE_TABLE_MIN_PADDR,
 vm->memslots[MEM_REGION_PT]);
-   vm->pgd_created = true;
 }
 
 static void _virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
diff --git a/tools/testing/selftests/kvm/lib/riscv/processor.c 
b/tools/testing/selftests/kvm/lib/riscv/processor.c
index d146ca71e0c0..7695ba2cd369 100644
--- a/tools/testing/selftests/kvm/lib/riscv/processor.c
+++ b/tools/testing/selftests/kvm/lib/riscv/processor.c
@@ -57,13 +57,9 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm)
 {
size_t nr_pages = page_align(vm, ptrs_per_pte(vm) * 8) / vm->page_size;
 
-   if (vm->pgd_created)
-   return;
-
vm->pgd = vm_phy_pages_alloc(vm, nr_pages,
 KVM_GUEST_PAGE_TABLE_MIN_PADDR,
 vm->memslots[MEM_REGION_PT]);
-   vm->pgd_created = true;
 }
 
 void virt_arch_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
diff --git a/tools/testing/selftests/kvm/lib/s390x/processor.c 
b/tools/testing/selftests/kvm/lib/s390x/processor.c
index 15945121daf1..358e03f09c7a 100644
--- a/tools/testing/selftests/kvm/lib/s390x/processor.c
+++ b/tools/testing/selftests/kvm/lib/s390x/processor.c
@@ -17,16 +17,12 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm)
TEST_ASSERT(vm->page_size == 4096, "Unsupported page size: 0x%x",
vm->page_size);
 
-   if (vm->pgd_created)
-   return;
-
paddr = vm_phy_pages_alloc(vm, PAGES_PER_REGION,
   KVM_GUEST_PAGE_TABLE_MIN_PADDR,
   vm->memslots[MEM_REGION_PT]);
memset(addr_gpa2hva(vm, paddr), 0xff, PAGES_PER_REGION * vm->page_size);
 
vm->pgd = paddr;
-   vm->pgd_created = true;
 }
 
 /*
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c 
b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index d4a0b504b1e0..d4deb2718e86 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -127,11 +127,8 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm)
TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use "
"unknown or unsupported guest mode, mode: 0x%x", vm->mode);
 
-   /* If needed, create page map l4 table. */
-   if (!vm->pgd_created) {
-   vm->pgd = vm_alloc_page_table(vm);
-   vm->pgd_created = true;
-   }
+   /* Create page map l4 table. */
+   vm->pgd = vm_alloc_page_table(vm);
 }
 
 static void *virt_get_pte(struct kvm_vm *vm, uint64_t *parent_pte,
-- 
2.40.1



Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-07 Thread Hugh Dickins
On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 10:11:52PM -0700, Hugh Dickins wrote:
> 
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> It is not RCU anymore if it writes to the page table itself before the
> grace period, so this change seems to break the RCU behavior of
> page_table_free_rcu().. The rcu sync is inside tlb_remove_table()
> called after the stores.

Yes indeed, thanks for pointing that out.

> 
> Maybe something like an xarray on the mm to hold the frags?

I think we can manage without that:
I'll say slightly more in reply to Gerald.

Hugh


[PATCH] KVM: PPC: Update MAINTAINERS

2023-06-07 Thread Nicholas Piggin
Michael is merging KVM PPC patches via the powerpc tree and KVM topic
branches. He doesn't necessarily have time to be across all of KVM so
is reluctant to call himself maintainer, but for the mechanics of how
patches flow upstream, it is maintained and does make sense to have
some contact people in MAINTAINERS.

So add Michael Ellerman as KVM PPC maintainer and myself as reviewer.
Split out the subarchs that don't get so much attention.

Signed-off-by: Nicholas Piggin 
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0dab9737ec16..44417acd2936 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11379,7 +11379,13 @@ F: arch/mips/include/uapi/asm/kvm*
 F: arch/mips/kvm/
 
 KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc)
+M: Michael Ellerman 
+R: Nicholas Piggin 
 L: linuxppc-dev@lists.ozlabs.org
+L: k...@vger.kernel.org
+S: Maintained (Book3S 64-bit HV)
+S: Odd fixes (Book3S 64-bit PR)
+S: Orphan (Book3E and 32-bit)
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
topic/ppc-kvm
 F: arch/powerpc/include/asm/kvm*
 F: arch/powerpc/include/uapi/asm/kvm*
-- 
2.40.1



Re: [PATCH 4/7] watchdog/hardlockup: Enable HAVE_NMI_WATCHDOG only on sparc64

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 13c6e596cf9e..57f15babe188 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -404,10 +404,9 @@ config HAVE_NMI_WATCHDOG
> depends on HAVE_NMI
> bool
> help
> - The arch provides its own hardlockup detector implementation instead
> + Sparc64 provides its own hardlockup detector implementation instead
>   of the generic perf one.

It's a little weird to document generic things with the specifics of
the user. The exception, IMO, is when something is deprecated.
Personally, it would sound less weird to me to say something like:

The arch provides its own hardlockup detector implementation instead
of the generic perf one. This is a deprecated thing to do and kept
around until sparc64 provides a full hardlockup implementation or
moves to generic code.


> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d201f5d3876b..4b4aa0f941f9 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1050,15 +1050,13 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
>  #  sparc64: has a custom implementation which is not using the common
>  #  hardlockup command line options and sysctl interface.
>  #
> -# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
> -# implementaion. It is automatically enabled also for other arch-specific
> -# variants which set HAVE_HARDLOCKUP_DETECTOR_ARCH. It makes the check
> -# of avaialable and supported variants quite tricky.
> +# Note that HAVE_NMI_WATCHDOG is set when the sparc64 specific implementation
> +# is used.
>  #
>  config HARDLOCKUP_DETECTOR
> bool "Detect Hard Lockups"
> -   depends on DEBUG_KERNEL && !S390
> -   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH
> +   depends on DEBUG_KERNEL && !S390 && !HAVE_NMI_WATCHDOG
> +   depends on HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH

If you add the "!HAVE_NMI_WATCHDOG" as a dependency to
HAVE_HARDLOCKUP_DETECTOR_BUDDY, as discussed in a previous patch, you
can skip adding it here.


> imply HARDLOCKUP_DETECTOR_PERF
> imply HARDLOCKUP_DETECTOR_BUDDY
> select LOCKUP_DETECTOR
> @@ -1079,7 +1077,7 @@ config HARDLOCKUP_DETECTOR_PREFER_BUDDY
> bool "Prefer the buddy CPU hardlockup detector"
> depends on HARDLOCKUP_DETECTOR
> depends on HAVE_HARDLOCKUP_DETECTOR_PERF && 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY
> -   depends on !HAVE_NMI_WATCHDOG
> +   depends on !HAVE_HARLOCKUP_DETECTOR_ARCH

Don't need this. Architectures never are allowed to define
HAVE_HARDLOCKUP_DETECTOR_PERF and HAVE_HARLOCKUP_DETECTOR_ARCH


> default n
> help
>   Say Y here to prefer the buddy hardlockup detector over the perf 
> one.
> @@ -1096,7 +1094,7 @@ config HARDLOCKUP_DETECTOR_PERF
> bool
> depends on HARDLOCKUP_DETECTOR
> depends on HAVE_HARDLOCKUP_DETECTOR_PERF && 
> !HARDLOCKUP_DETECTOR_PREFER_BUDDY
> -   depends on !HAVE_NMI_WATCHDOG
> +   depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH

Similarly, don't need this.


> select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
>
>  config HARDLOCKUP_DETECTOR_BUDDY
> @@ -1104,7 +1102,7 @@ config HARDLOCKUP_DETECTOR_BUDDY
> depends on HARDLOCKUP_DETECTOR
> depends on HAVE_HARDLOCKUP_DETECTOR_BUDDY
> depends on !HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HARDLOCKUP_DETECTOR_PREFER_BUDDY
> -   depends on !HAVE_NMI_WATCHDOG
> +   depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH

Similarly, don't need this.


In general I don't object to splitting out HAVE_NMI_WATCHDOG from
HAVE_HARDLOCKUP_DETECTOR_ARCH.


Re: [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix

2023-06-07 Thread kernel test robot
Hi Aneesh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on powerpc/next]
[also build test WARNING on powerpc/fixes akpm-mm/mm-everything linus/master 
v6.4-rc5]
[cannot apply to nvdimm/libnvdimm-for-next tip/x86/core nvdimm/dax-misc 
next-20230607]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/powerpc-mm-book3s64-Use-pmdp_ptep-helper-instead-of-typecasting/20230606-125913
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:
https://lore.kernel.org/r/20230606045608.55127-16-aneesh.kumar%40linux.ibm.com
patch subject: [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap 
optimization for radix
reproduce:
git remote add powerpc 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
git fetch powerpc next
git checkout powerpc/next
b4 shazam 
https://lore.kernel.org/r/20230606045608.55127-16-aneesh.ku...@linux.ibm.com
make menuconfig
# enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, 
CONFIG_WARN_ABI_ERRORS
make htmldocs

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202306080711.eecyypfk-...@intel.com/

All warnings (new ones prefixed by >>):

>> Documentation/powerpc/vmemmap_dedup.rst: WARNING: document isn't included in 
>> any toctree

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Re: [PATCH 6/7] watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:26 AM Petr Mladek  wrote:
>
> The HAVE_ prefix means that the code could be enabled. Add another
> variable for HAVE_HARDLOCKUP_DETECTOR_SPARC64 without this prefix.
> It will be set when it should be built. It will make it compatible
> with the other hardlockup detectors.
>
> Before, it is far from obvious that the SPARC64 variant is actually used:
>
> $> make ARCH=sparc64 defconfig
> $> grep HARDLOCKUP_DETECTOR .config
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
>
> After, it is more clear:
>
> $> make ARCH=sparc64 defconfig
> $> grep HARDLOCKUP_DETECTOR .config
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
> CONFIG_HARDLOCKUP_DETECTOR_SPARC64=y
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/sparc/Kconfig.debug | 10 +-
>  include/linux/nmi.h  |  4 ++--
>  kernel/watchdog.c|  2 +-
>  lib/Kconfig.debug|  2 +-
>  4 files changed, 13 insertions(+), 5 deletions(-)

Reviewed-by: Douglas Anderson 


Re: [PATCH 3/7] watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> arch_touch_nmi_watchdog() needs a different implementation for various
> hardlockup detector implementations. And it does nothing when
> any hardlockup detector is not build at all.

s/build/built/


> arch_touch_nmi_watchdog() has to be declared in linux/nmi.h. It is done
> directly in this header file for the perf and buddy detectors. And it
> is done in the included asm/linux.h for arch specific detectors.
>
> The reason probably is that the arch specific variants build the code
> using another conditions. For example, powerpc64/sparc64 builds the code
> when CONFIG_PPC_WATCHDOG is enabled.
>
> Another reason might be that these architectures define more functions
> in asm/nmi.h anyway.
>
> However the generic code actually knows the information. The config
> variables HAVE_NMI_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_ARCH are used
> to decide whether to build the buddy detector.
>
> In particular, CONFIG_HARDLOCKUP_DETECTOR is set only when a generic
> or arch-specific hardlockup detector is built. The only exception
> is sparc64 which ignores the global HARDLOCKUP_DETECTOR switch.
>
> The information about sparc64 is a bit complicated. The hardlockup
> detector is built there when CONFIG_HAVE_NMI_WATCHDOG is set and
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> People might wonder whether this change really makes things easier.
> The motivation is:
>
>   + The current logic in linux/nmi.h is far from obvious.
> For example, arch_touch_nmi_watchdog() is defined as {} when
> neither CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER nor
> CONFIG_HAVE_NMI_WATCHDOG is defined.
>
>   + The change synchronizes the checks in lib/Kconfig.debug and
> in the generic code.
>
>   + It is a step that will help cleaning HAVE_NMI_WATCHDOG related
> checks.
>
> The change should not change the existing behavior.
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/powerpc/include/asm/nmi.h |  2 --
>  arch/sparc/include/asm/nmi.h   |  1 -
>  include/linux/nmi.h| 13 ++---
>  3 files changed, 10 insertions(+), 6 deletions(-)

This looks right and is a nice cleanup.

Reviewed-by: Douglas Anderson 


Re: [PATCH 0/7] watchdog/hardlockup: Cleanup configuration of hardlockup detectors

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> Hi,
>
> this patchset is supposed to replace the last patch in the patchset cleaning
> up after introducing the buddy detector, see
> https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid

I will let Andrew chime in with his preference, but so far I haven't
seen him dropping and/or modifying any patches that he's picked up in
this series. I see that he's already picked up the patch that you're
"replacing". I wonder if it would be easier for him if you just built
atop that?

-Doug


Re: [PATCH 7/7] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:26 AM Petr Mladek  wrote:
>
> @@ -1102,6 +1103,14 @@ config HARDLOCKUP_DETECTOR_BUDDY
> depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
> select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
>
> +config HARDLOCKUP_DETECTOR_ARCH
> +   bool
> +   depends on HARDLOCKUP_DETECTOR
> +   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
> +   help
> + The arch-specific implementation of the hardlockup detector is
> + available.

nit: "is available" makes it sound a bit too much like a "have"
version. Maybe "The arch-specific implementation of the hardlockup
detector will be used" or something like that?

Otherwise:

Reviewed-by: Douglas Anderson 


Re: [PATCH 5/7] watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> The configuration variable HAVE_NMI_WATCHDOG has a generic name but
> it is selected only for SPARC64.
>
> It should _not_ be used in general because it is not integrated with
> the other hardlockup detectors. Namely, it does not support the hardlockup
> specific command line parameters and systcl interface. Instead, it is
> enabled/disabled together with the softlockup detector by the global
> "watchdog" sysctl.
>
> Rename it to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 to make the special
> behavior more clear.
>
> Also the variable is set only on sparc64. Move the definition
> from arch/Kconfig to arch/sparc/Kconfig.debug.
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/Kconfig | 12 
>  arch/sparc/Kconfig   |  2 +-
>  arch/sparc/Kconfig.debug | 12 
>  include/linux/nmi.h  |  4 ++--
>  kernel/watchdog.c|  2 +-
>  lib/Kconfig.debug|  5 +
>  6 files changed, 17 insertions(+), 20 deletions(-)

Reviewed-by: Douglas Anderson 


Re: [PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 422f0ffa269e..13c6e596cf9e 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -404,17 +404,27 @@ config HAVE_NMI_WATCHDOG
> depends on HAVE_NMI
> bool
> help
> - The arch provides a low level NMI watchdog. It provides
> - asm/nmi.h, and defines its own watchdog_hardlockup_probe() and
> - arch_touch_nmi_watchdog().
> + The arch provides its own hardlockup detector implementation instead
> + of the generic perf one.

nit: did you mean to have different wording here compared to
HAVE_HARDLOCKUP_DETECTOR_ARCH? Here you say "the generic perf one" and
there you say "the generic ones", though it seems like you mean them
to be the same.

> +
> + Sparc64 defines this variable without HAVE_HARDLOCKUP_DETECTOR_ARCH.
> + It does _not_ use the command line parameters and sysctl interface
> + used by generic hardlockup detectors. Instead it is enabled/disabled
> + by the top-level watchdog interface that is common for both 
> softlockup
> + and hardlockup detectors.
>
>  config HAVE_HARDLOCKUP_DETECTOR_ARCH
> bool
> select HAVE_NMI_WATCHDOG
> help
> - The arch chooses to provide its own hardlockup detector, which is
> - a superset of the HAVE_NMI_WATCHDOG. It also conforms to config
> - interfaces and parameters provided by hardlockup detector subsystem.
> + The arch provides its own hardlockup detector implementation instead
> + of the generic ones.
> +
> + It uses the same command line parameters, and sysctl interface,
> + as the generic hardlockup detectors.
> +
> + HAVE_NMI_WATCHDOG is selected to build the code shared with
> + the sparc64 specific implementation.
>
>  config HAVE_PERF_REGS
> bool
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 3e91fa33c7a0..d201f5d3876b 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1035,16 +1035,33 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
>
>   Say N if unsure.
>
> +config HAVE_HARDLOCKUP_DETECTOR_BUDDY
> +   bool
> +   depends on SMP
> +   default y

I think you simplify your life if you also add:

  depends on !HAVE_NMI_WATCHDOG

The existing config system has always assumed that no architecture
defines both HAVE_HARDLOCKUP_DETECTOR_PERF and HAVE_NMI_WATCHDOG
(symbols would have clashed and you would get a link error as two
watchdogs try to implement the same weak symbol). If you add the extra
dependency to "buddy" as per above, then a few things below fall out.
I'll try to point them out below.


> +
>  #
> -# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
> -# lockup detector rather than the perf based detector.
> +# Global switch whether to build a hardlockup detector at all. It is 
> available
> +# only when the architecture supports at least one implementation. There are
> +# two exceptions. The hardlockup detector is newer enabled on:

s/newer/never/


> +#
> +#  s390: it reported many false positives there
> +#
> +#  sparc64: has a custom implementation which is not using the common
> +#  hardlockup command line options and sysctl interface.
> +#
> +# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
> +# implementaion. It is automatically enabled also for other arch-specific
> +# variants which set HAVE_HARDLOCKUP_DETECTOR_ARCH. It makes the check
> +# of avaialable and supported variants quite tricky.
>  #
>  config HARDLOCKUP_DETECTOR
> bool "Detect Hard Lockups"
> depends on DEBUG_KERNEL && !S390
> -   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH
> +   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH

Adding the dependency to buddy (see ablove) would simplify the above
to just this:

depends on HAVE_HARDLOCKUP_DETECTOR_PERF ||
HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH

As per above, it's simply a responsibility of architectures not to
define that they have both "perf" if they have the NMI watchdog, so
it's just buddy to worry about.


> +   imply HARDLOCKUP_DETECTOR_PERF
> +   imply HARDLOCKUP_DETECTOR_BUDDY
> select LOCKUP_DETECTOR
> -   select HARDLOCKUP_DETECTOR_NON_ARCH if 
> HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
>
> help
>   Say Y here to enable the kernel to act as a watchdog to detect
> @@ -1055,9 +1072,15 @@ config HARDLOCKUP_DETECTOR
>   chance to run.  The current stack trace is displayed upon detection
>   and the system will stay locked up.
>
> +#
> +# Note that arch-specific variants are always preferred.
> +#
>  config HARDLOCKUP_DETECTOR_PREFER_BUDDY
> bool "Prefer the buddy CPU hardlockup detector"
> 

Re: [PATCH 1/7] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> There are four possible variants of hardlockup detectors:
>
>   + buddy: available when SMP is set.
>
>   + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.
>
>   + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
>
>   + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
> and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> Only one hardlockup detector can be compiled in. The selection is done
> using quite complex dependencies between several CONFIG variables.
> The following patches will try to make it more straightforward.
>
> As a first step, reorder the definitions of the various CONFIG variables.
> The logical order is:
>
>1. HAVE_* variables define available variants. They are typically
>   defined in the arch/ config files.
>
>2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
>   detector is enabled at all.
>
>3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
>   the buddy detector should be preferred over the perf one.
>   Note that the arch specific variants are always preferred when
>   available.
>
>4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
>   detector is enabled in the end.
>
>5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
>   are temporary variables that are going to be removed in
>   a followup patch.
>
> The patch just shuffles the definitions. It should not change the existing
> behavior.
>
> Signed-off-by: Petr Mladek 
> ---
>  lib/Kconfig.debug | 112 +++---
>  1 file changed, 56 insertions(+), 56 deletions(-)

I don't really have any strong opinions, so I'm fine with this. In
general I think the ordering I picked tried to match the existing
"style", which generally lists configs first and selects them
below. To me the existing style makes more sense when thinking about
writing C code without having to put a pile of prototypes at the top
of your file: you define things higher in the file and reference them
below. For instance, the old style (before any of my patches) had:

config SOFTLOCKUP_DETECTOR:
  ... blah blah blah ...

config HARDLOCKUP_DETECTOR_PERF:
  select SOFTLOCKUP_DETECTOR

config HARDLOCKUP_DETECTOR:
  ... blah blah blah ...
  select LOCKUP_DETECTOR
  select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF

Your new style seems to be the opposite of that.

-Doug


[PATCH v5 25/26] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-06-07 Thread Terry Bowman
From: Robert Richter 

In Restricted CXL Device (RCD) mode a CXL device is exposed as an
RCiEP, but CXL downstream and upstream ports are not enumerated and
not visible in the PCIe hierarchy. [1] Protocol and link errors from
these non-enumerated ports are signaled as internal AER errors, either
Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE)
via an RCEC.

Restricted CXL host (RCH) downstream port-detected errors have the
Requester ID of the RCEC set in the RCEC's AER Error Source ID
register. A CXL handler must then inspect the error status in various
CXL registers residing in the dport's component register space (CXL
RAS capability) or the dport's RCRB (PCIe AER extended
capability). [2]

Errors showing up in the RCEC's error handler must be handled and
connected to the CXL subsystem. Implement this by forwarding the error
to all CXL devices below the RCEC. Since the entire CXL device is
controlled only using PCIe Configuration Space of device 0, function
0, only pass it there [3]. The error handling is limited to currently
supported devices with the Memory Device class code set (CXL Type 3
Device, PCI_CLASS_MEMORY_CXL, 502h); downstream port errors are handled
in the device's cxl_pci driver. Support for other CXL Device Types
(e.g. a CXL.cache Device) can be added later.

To handle downstream port errors in addition to errors directed to the
CXL endpoint device, a handler must also inspect the CXL RAS and PCIe
AER capabilities of the CXL downstream port the device is connected
to.

Since CXL downstream port errors are signaled using internal errors,
the handler requires those errors to be unmasked. This is the subject
of a follow-on patch.

The reason for choosing this implementation is that the AER service
driver claims the RCEC device, but does not allow it to register a
custom CXL-specific handler. Hard-wiring the RCEC to a CXL handler
does not work either, as the CXL subsystem might not be present all
the time. The alternative of extending the portdrv to allow the
registration of a custom RCEC error handler isn't worth doing, as CXL
would be its only user. Instead, just check for a CXL RCEC and pass
the error down to the connected CXL device's error
handler. With this approach the code can entirely be implemented in
the PCIe AER driver and is independent of the CXL subsystem. The CXL
driver only provides the handler.

[1] CXL 3.0 spec: 9.11.8 CXL Devices Attached to an RCH
[2] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
[3] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices

Co-developed-by: Terry Bowman 
Signed-off-by: Terry Bowman 
Signed-off-by: Robert Richter 
Cc: "Oliver O'Halloran" 
Cc: Bjorn Helgaas 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Acked-by: Bjorn Helgaas 
Reviewed-by: Jonathan Cameron 
---
 drivers/pci/pcie/Kconfig | 12 +
 drivers/pci/pcie/aer.c   | 96 +++-
 2 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 228652a59f27..4f0e70fafe2d 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -49,6 +49,18 @@ config PCIEAER_INJECT
  gotten from:
 
https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
 
+config PCIEAER_CXL
+   bool "PCI Express CXL RAS support for Restricted Hosts (RCH)"
+   default y
+   depends on PCIEAER && CXL_PCI
+   help
+ Enables error handling of downstream ports of a CXL host
+ that is operating in RCD mode (Restricted CXL Host, RCH).
+ The downstream port reports AER errors to a given RCEC.
+ Errors are handled by the CXL memory device driver.
+
+ If unsure, say Y.
+
 #
 # PCI Express ECRC
 #
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3344fcf1f79..c354ca5e8f2b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -946,14 +946,100 @@ static bool find_source_device(struct pci_dev *parent,
return true;
 }
 
+#ifdef CONFIG_PCIEAER_CXL
+
+static bool is_cxl_mem_dev(struct pci_dev *dev)
+{
+   /*
+* The capability, status, and control fields in Device 0,
+* Function 0 DVSEC control the CXL functionality of the
+* entire device (CXL 3.0, 8.1.3).
+*/
+   if (dev->devfn != PCI_DEVFN(0, 0))
+   return false;
+
+   /*
+* CXL Memory Devices must have the 502h class code set (CXL
+* 3.0, 8.1.12.1).
+*/
+   if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
+   return false;
+
+   return true;
+}
+
+static bool cxl_error_is_native(struct pci_dev *dev)
+{
+   struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+
+   if (pcie_ports_native)
+   return true;
+
+   return host->native_aer && host->native_cxl_error;
+}
+
+static bool is_internal_error(struct aer_err_info *info)
+{
+   if (info->severity == 

Re: [RFC PATCH] powerpc/ftrace: Refactoring and support for -fpatchable-function-entry

2023-06-07 Thread Naveen N Rao

Christophe Leroy wrote:



Le 23/05/2023 à 11:31, Naveen N Rao a écrit :

Christophe Leroy wrote:


Ok, I simplified this further, and this is as close to the previous 
fast path as we can get (applies atop the original RFC). The only 
difference left is the ftrace_rec iterator.


That's not better, that's even slightly worse (less than 1%).

I will try to investigate why.


I have posted a follow-on RFC with the changes split up:
http://lore.kernel.org/cover.1686151854.git.nav...@kernel.org

Hopefully, that will help narrow down the source of the regression.

- Naveen


[RFC PATCH 01/15] powerpc/module: Remove unused .ftrace.tramp section

2023-06-07 Thread Naveen N Rao
.ftrace.tramp section is not used for any purpose. This code was added
all the way back in the original commit introducing support for dynamic
ftrace on ppc64 modules. Remove it.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/include/asm/module.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/include/asm/module.h 
b/arch/powerpc/include/asm/module.h
index ac53606c259430..a8e2e8339fb7f4 100644
--- a/arch/powerpc/include/asm/module.h
+++ b/arch/powerpc/include/asm/module.h
@@ -75,10 +75,6 @@ struct mod_arch_specific {
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-#ifdef MODULE
-   asm(".section .ftrace.tramp,\"ax\",@nobits; .align 3; .previous");
-#endif /* MODULE */
-
 int module_trampoline_target(struct module *mod, unsigned long trampoline,
 unsigned long *target);
 int module_finalize_ftrace(struct module *mod, const Elf_Shdr *sechdrs);
-- 
2.40.1



[RFC PATCH 10/15] powerpc/ftrace: Simplify ftrace_make_nop()

2023-06-07 Thread Naveen N Rao
Now that we validate the ftrace location during initialization in
ftrace_init_nop(), we can simplify ftrace_make_nop() to patch-in the nop
without worrying about the instructions surrounding the ftrace location.
Note that we continue to ensure that we have a bl to
ftrace_[regs_]caller at the ftrace location before nop-ing it out.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 220 +
 1 file changed, 32 insertions(+), 188 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index c0d185742c23ca..67773cd14da71a 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -116,112 +116,6 @@ static unsigned long find_bl_target(unsigned long ip, 
ppc_inst_t op)
return ip + (long)offset;
 }
 
-#ifdef CONFIG_MODULES
-static int
-__ftrace_make_nop(struct module *mod,
- struct dyn_ftrace *rec, unsigned long addr)
-{
-   unsigned long entry, ptr, tramp;
-   unsigned long ip = rec->ip;
-   ppc_inst_t op, pop;
-
-   /* read where this goes */
-   if (copy_inst_from_kernel_nofault(, (void *)ip)) {
-   pr_err("Fetching opcode failed.\n");
-   return -EFAULT;
-   }
-
-   /* Make sure that this is still a 24bit jump */
-   if (!is_bl_op(op)) {
-   pr_err("Not expected bl: opcode is %08lx\n", 
ppc_inst_as_ulong(op));
-   return -EINVAL;
-   }
-
-   /* lets find where the pointer goes */
-   tramp = find_bl_target(ip, op);
-
-   pr_devel("ip:%lx jumps to %lx", ip, tramp);
-
-   if (module_trampoline_target(mod, tramp, )) {
-   pr_err("Failed to get trampoline target\n");
-   return -EFAULT;
-   }
-
-   pr_devel("trampoline target %lx", ptr);
-
-   entry = ppc_global_function_entry((void *)addr);
-   /* This should match what was called */
-   if (ptr != entry) {
-   pr_err("addr %lx does not match expected %lx\n", ptr, entry);
-   return -EINVAL;
-   }
-
-   if (IS_ENABLED(CONFIG_MPROFILE_KERNEL)) {
-   if (copy_inst_from_kernel_nofault(&op, (void *)(ip - 4))) {
-   pr_err("Fetching instruction at %lx failed.\n", ip - 4);
-   return -EFAULT;
-   }
-
-   /* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
-   if (!ppc_inst_equal(op, ppc_inst(PPC_RAW_MFLR(_R0))) &&
-   !ppc_inst_equal(op, ppc_inst(PPC_INST_STD_LR))) {
-   pr_err("Unexpected instruction %08lx around bl _mcount\n",
-  ppc_inst_as_ulong(op));
-   return -EINVAL;
-   }
-   } else if (IS_ENABLED(CONFIG_PPC64)) {
-   /*
-* Check what is in the next instruction. We can see ld r2,40(r1), but
-* on first pass after boot we will see mflr r0.
-*/
-   if (copy_inst_from_kernel_nofault(&op, (void *)(ip + 4))) {
-   pr_err("Fetching op failed.\n");
-   return -EFAULT;
-   }
-
-   if (!ppc_inst_equal(op,  ppc_inst(PPC_INST_LD_TOC))) {
-   pr_err("Expected %08lx found %08lx\n", PPC_INST_LD_TOC,
-  ppc_inst_as_ulong(op));
-   return -EINVAL;
-   }
-   }
-
-   /*
-* When using -mprofile-kernel or PPC32 there is no load to jump over.
-*
-* Otherwise our original call site looks like:
-*
-* bl 
-* ld r2,XX(r1)
-*
-* Milton Miller pointed out that we can not simply nop the branch.
-* If a task was preempted when calling a trace function, the nops
-* will remove the way to restore the TOC in r2 and the r2 TOC will
-* get corrupted.
-*
-* Use a b +8 to jump over the load.
-* XXX: could make PCREL depend on MPROFILE_KERNEL
-* XXX: check PCREL && MPROFILE_KERNEL calling sequence
-*/
-   if (IS_ENABLED(CONFIG_MPROFILE_KERNEL) || IS_ENABLED(CONFIG_PPC32))
-   pop = ppc_inst(PPC_RAW_NOP());
-   else
-   pop = ppc_inst(PPC_RAW_BRANCH(8));  /* b +8 */
-
-   if (patch_instruction((u32 *)ip, pop)) {
-   pr_err("Patching NOP failed.\n");
-   return -EPERM;
-   }
-
-   return 0;
-}
-#else
-static int __ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr)
-{
-   return 0;
-}
-#endif /* CONFIG_MODULES */
-
 static unsigned long find_ftrace_tramp(unsigned long ip)
 {
int i;
@@ -235,88 +129,6 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
return 0;
 }
 
-static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, unsigned long addr)
-{
-   unsigned long tramp, ip = rec->ip;
-   ppc_inst_t op;
-
-   

[RFC PATCH 09/15] powerpc/ftrace: Add separate ftrace_init_nop() with additional validation

2023-06-07 Thread Naveen N Rao
Currently, we validate instructions around the ftrace location every
time we have to enable/disable ftrace. Introduce ftrace_init_nop() to
instead perform all the validation during ftrace initialization. This
allows us to simply patch the necessary instructions during
enabling/disabling ftrace.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/include/asm/ftrace.h  |  6 +++
 arch/powerpc/kernel/trace/ftrace.c | 71 ++
 2 files changed, 77 insertions(+)

diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
index 1a5d365523e160..89fbae3caa1fc2 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -29,11 +29,17 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
 unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
unsigned long sp);
 
+struct module;
+struct dyn_ftrace;
 struct dyn_arch_ftrace {
struct module *mod;
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
+#define ftrace_need_init_nop() (true)
+int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
+#define ftrace_init_nop ftrace_init_nop
+
 struct ftrace_regs {
struct pt_regs regs;
 };
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 278bf8e52b6e89..c0d185742c23ca 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -31,6 +31,16 @@
 #define NUM_FTRACE_TRAMPS   2
 static unsigned long ftrace_tramps[NUM_FTRACE_TRAMPS];
 
+static ppc_inst_t ftrace_create_branch_inst(unsigned long ip, unsigned long addr, int link)
+{
+   ppc_inst_t op;
+
+   WARN_ON(!is_offset_in_branch_range(addr - ip));
+   create_branch(&op, (u32 *)ip, addr, link ? BRANCH_SET_LINK : 0);
+
+   return op;
+}
+
 static ppc_inst_t
 ftrace_call_replace(unsigned long ip, unsigned long addr, int link)
 {
@@ -597,6 +607,67 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 }
 #endif
 
+int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
+{
+   unsigned long addr, ip = rec->ip;
+   ppc_inst_t old, new;
+   int ret = 0;
+
+   /* Verify instructions surrounding the ftrace location */
+   if (IS_ENABLED(CONFIG_PPC32)) {
+   /* Expected sequence: 'mflr r0', 'stw r0,4(r1)', 'bl _mcount' */
+   ret = ftrace_validate_inst(ip - 8, ppc_inst(PPC_RAW_MFLR(_R0)));
+   if (!ret)
+   ret = ftrace_validate_inst(ip - 4, ppc_inst(PPC_RAW_STW(_R0, _R1, 4)));
+   } else if (IS_ENABLED(CONFIG_MPROFILE_KERNEL)) {
+   /* Expected sequence: 'mflr r0', ['std r0,16(r1)'], 'bl _mcount' */
+   ret = ftrace_validate_inst(ip - 4, ppc_inst(PPC_RAW_MFLR(_R0)));
+   if (ret) {
+   ret = ftrace_validate_inst(ip - 4, ppc_inst(PPC_RAW_STD(_R0, _R1, 16)));
+   ret |= ftrace_validate_inst(ip - 8, ppc_inst(PPC_RAW_MFLR(_R0)));
+   }
+   } else {
+   return -EINVAL;
+   }
+
+   if (ret)
+   return ret;
+
+   if (!core_kernel_text(ip)) {
+   if (!mod) {
+   pr_err("0x%lx: No module provided for non-kernel address\n", ip);
+   return -EFAULT;
+   }
+   rec->arch.mod = mod;
+   }
+
+   /* Nop-out the ftrace location */
+   new = ppc_inst(PPC_RAW_NOP());
+   addr = MCOUNT_ADDR;
+   if (is_offset_in_branch_range(addr - ip)) {
+   /* Within range */
+   old = ftrace_create_branch_inst(ip, addr, 1);
+   ret = ftrace_modify_code(ip, old, new);
+   } else if (core_kernel_text(ip) || (IS_ENABLED(CONFIG_MODULES) && mod)) {
+   /*
+* We would be branching to a linker-generated stub, or to the module _mcount
+* stub. Let's just confirm we have a 'bl' here.
+*/
+   ret = ftrace_read_inst(ip, &old);
+   if (ret)
+   return ret;
+   if (!is_bl_op(old)) {
+   pr_err("0x%lx: expected (bl) != found (%08lx)\n", ip, ppc_inst_as_ulong(old));
+   return -EINVAL;
+   }
+   ret = patch_instruction((u32 *)ip, new);
+   } else {
+   return -EINVAL;
+   }
+
+   return ret;
+}
+
 int ftrace_update_ftrace_func(ftrace_func_t func)
 {
unsigned long ip = (unsigned long)(&ftrace_call);
-- 
2.40.1



[RFC PATCH 08/15] powerpc/ftrace: Stop re-purposing linker generated long branches for ftrace

2023-06-07 Thread Naveen N Rao
Commit 67361cf8071286 ("powerpc/ftrace: Handle large kernel configs")
added ftrace support for ppc64 kernel images with a text section larger
than 32MB. The patch did two things:
1. Add stubs at the end of .text to branch into ftrace_[regs_]caller for
   functions that were out of branch range.
2. Re-purpose linker-generated long branches to _mcount to instead branch
   to ftrace_[regs_]caller.

Before that, we only supported kernel .text up to ~32MB. With the above,
we now support up to ~96MB:
- The first 32MB of kernel text can branch directly into
  ftrace_[regs_]caller since that symbol is usually at the beginning.
- The modified long_branch from (2) above is used by the next 32MB of
  kernel text.
- The next 32MB of kernel text can use the stub at the end of text to
  branch back to ftrace_[regs_]caller.

While re-purposing the long branch works in practice, it still restricts
ftrace to kernel text up to ~96MB. The stub at the end of kernel text
from (1) already enables us to extend ftrace support for kernel text
up to 64MB, which fulfils the original requirement. Further, once we
switch to -fpatchable-function-entry, there will not be a long branch
that we can use.

Stop re-purposing the linker-generated long branches for ftrace to
simplify the code. If there are good reasons to support ftrace on
kernels beyond 64MB, we can consider adding support by using
-fpatchable-function-entry.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 110 +
 1 file changed, 17 insertions(+), 93 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index ef4e49c2c37781..278bf8e52b6e89 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -28,13 +28,7 @@
 #include 
 #include 
 
-/*
- * We generally only have a single long_branch tramp and at most 2 or 3 plt
- * tramps generated. But, we don't use the plt tramps currently. We also allot
- * 2 tramps after .text and .init.text. So, we only end up with around 3 usable
- * tramps in total. Set aside 8 just to be sure.
- */
-#define NUM_FTRACE_TRAMPS   8
+#define NUM_FTRACE_TRAMPS   2
 static unsigned long ftrace_tramps[NUM_FTRACE_TRAMPS];
 
 static ppc_inst_t
@@ -100,11 +94,6 @@ static int is_bl_op(ppc_inst_t op)
return (ppc_inst_val(op) & ~PPC_LI_MASK) == PPC_RAW_BL(0);
 }
 
-static int is_b_op(ppc_inst_t op)
-{
-   return (ppc_inst_val(op) & ~PPC_LI_MASK) == PPC_RAW_BRANCH(0);
-}
-
 static unsigned long find_bl_target(unsigned long ip, ppc_inst_t op)
 {
int offset;
@@ -227,11 +216,7 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
 {
int i;
 
-   /*
-* We have the compiler generated long_branch tramps at the end
-* and we prefer those
-*/
-   for (i = NUM_FTRACE_TRAMPS - 1; i >= 0; i--)
+   for (i = 0; i < NUM_FTRACE_TRAMPS; i++)
if (!ftrace_tramps[i])
continue;
else if (is_offset_in_branch_range(ftrace_tramps[i] - ip))
@@ -240,75 +225,6 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
return 0;
 }
 
-static int add_ftrace_tramp(unsigned long tramp)
-{
-   int i;
-
-   for (i = 0; i < NUM_FTRACE_TRAMPS; i++)
-   if (!ftrace_tramps[i]) {
-   ftrace_tramps[i] = tramp;
-   return 0;
-   }
-
-   return -1;
-}
-
-/*
- * If this is a compiler generated long_branch trampoline (essentially, a
- * trampoline that has a branch to _mcount()), we re-write the branch to
- * instead go to ftrace_[regs_]caller() and note down the location of this
- * trampoline.
- */
-static int setup_mcount_compiler_tramp(unsigned long tramp)
-{
-   int i;
-   ppc_inst_t op;
-   unsigned long ptr;
-
-   /* Is this a known long jump tramp? */
-   for (i = 0; i < NUM_FTRACE_TRAMPS; i++)
-   if (ftrace_tramps[i] == tramp)
-   return 0;
-
-   /* New trampoline -- read where this goes */
-   if (copy_inst_from_kernel_nofault(&op, (void *)tramp)) {
-   pr_debug("Fetching opcode failed.\n");
-   return -1;
-   }
-
-   /* Is this a 24 bit branch? */
-   if (!is_b_op(op)) {
-   pr_debug("Trampoline is not a long branch tramp.\n");
-   return -1;
-   }
-
-   /* lets find where the pointer goes */
-   ptr = find_bl_target(tramp, op);
-
-   if (ptr != ppc_global_function_entry((void *)_mcount)) {
-   pr_debug("Trampoline target %p is not _mcount\n", (void *)ptr);
-   return -1;
-   }
-
-   /* Let's re-write the tramp to go to ftrace_[regs_]caller */
-   if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
-   ptr = ppc_global_function_entry((void *)ftrace_regs_caller);
-   else
-   ptr = ppc_global_function_entry((void *)ftrace_caller);
-
-   if 

[RFC PATCH 07/15] powerpc/ftrace: Refactor ftrace_modify_code()

2023-06-07 Thread Naveen N Rao
Split up ftrace_modify_code() into a few helpers for future use. Also
update error messages accordingly.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 51 +-
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 913c7aa63d3fa3..ef4e49c2c37781 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -50,32 +50,39 @@ ftrace_call_replace(unsigned long ip, unsigned long addr, int link)
return op;
 }
 
-static inline int
-ftrace_modify_code(unsigned long ip, ppc_inst_t old, ppc_inst_t new)
+static inline int ftrace_read_inst(unsigned long ip, ppc_inst_t *op)
 {
-   ppc_inst_t replaced;
-
-   /*
-* Note:
-* We are paranoid about modifying text, as if a bug was to happen, it
-* could cause us to read or write to someplace that could cause harm.
-* Carefully read and modify the code with probe_kernel_*(), and make
-* sure what we read is what we expected it to be before modifying it.
-*/
-
-   /* read the text we want to modify */
-   if (copy_inst_from_kernel_nofault(&replaced, (void *)ip))
+   if (copy_inst_from_kernel_nofault(op, (void *)ip)) {
+   pr_err("0x%lx: fetching instruction failed\n", ip);
return -EFAULT;
-
-   /* Make sure it is what we expect it to be */
-   if (!ppc_inst_equal(replaced, old)) {
-   pr_err("%p: replaced (%08lx) != old (%08lx)", (void *)ip,
-  ppc_inst_as_ulong(replaced), ppc_inst_as_ulong(old));
-   return -EINVAL;
}
 
-   /* replace the text with the new text */
-   return patch_instruction((u32 *)ip, new);
+   return 0;
+}
+
+static inline int ftrace_validate_inst(unsigned long ip, ppc_inst_t inst)
+{
+   ppc_inst_t op;
+   int ret;
+
+   ret = ftrace_read_inst(ip, &op);
+   if (!ret && !ppc_inst_equal(op, inst)) {
+   pr_err("0x%lx: expected (%08lx) != found (%08lx)\n",
+  ip, ppc_inst_as_ulong(inst), ppc_inst_as_ulong(op));
+   ret = -EINVAL;
+   }
+
+   return ret;
+}
+
+static inline int ftrace_modify_code(unsigned long ip, ppc_inst_t old, ppc_inst_t new)
+{
+   int ret = ftrace_validate_inst(ip, old);
+
+   if (!ret)
+   ret = patch_instruction((u32 *)ip, new);
+
+   return ret;
 }
 
 /*
-- 
2.40.1



[RFC PATCH 06/15] powerpc/ftrace: Consolidate ftrace support into fewer files

2023-06-07 Thread Naveen N Rao
ftrace_low.S has just the _mcount stub and return_to_handler(). Merge
this back into ftrace_mprofile.S and ftrace_64_pg.S to keep all ftrace
code together, and to allow those to evolve independently.

ftrace_mprofile.S is also not an entirely accurate name, since it holds
ppc32 code as well. This will be all the more incorrect once support for
-fpatchable-function-entry is added. Rename files here to more
accurately describe the code:
- ftrace_mprofile.S is renamed to ftrace_entry.S
- ftrace_pg.c is renamed to ftrace_64_pg.c
- ftrace_64_pg.S is renamed to ftrace_64_pg_entry.S

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/Makefile| 17 +++--
 arch/powerpc/kernel/trace/ftrace_64_pg.S  | 67 ---
 .../trace/{ftrace_pg.c => ftrace_64_pg.c} |  0
 .../{ftrace_low.S => ftrace_64_pg_entry.S}| 58 +++-
 .../{ftrace_mprofile.S => ftrace_entry.S} | 65 ++
 5 files changed, 130 insertions(+), 77 deletions(-)
 delete mode 100644 arch/powerpc/kernel/trace/ftrace_64_pg.S
 rename arch/powerpc/kernel/trace/{ftrace_pg.c => ftrace_64_pg.c} (100%)
 rename arch/powerpc/kernel/trace/{ftrace_low.S => ftrace_64_pg_entry.S} (55%)
 rename arch/powerpc/kernel/trace/{ftrace_mprofile.S => ftrace_entry.S} (83%)

diff --git a/arch/powerpc/kernel/trace/Makefile b/arch/powerpc/kernel/trace/Makefile
index 342a2d1ae86cd0..125f4ca588b98a 100644
--- a/arch/powerpc/kernel/trace/Makefile
+++ b/arch/powerpc/kernel/trace/Makefile
@@ -6,16 +6,15 @@
 ifdef CONFIG_FUNCTION_TRACER
 # do not trace tracer code
 CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE)
-CFLAGS_REMOVE_ftrace_pg.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_ftrace_64_pg.o = $(CC_FLAGS_FTRACE)
 endif
 
-obj32-$(CONFIG_FUNCTION_TRACER)+= ftrace_mprofile.o ftrace.o
+obj32-$(CONFIG_FUNCTION_TRACER)+= ftrace.o ftrace_entry.o
 ifdef CONFIG_MPROFILE_KERNEL
-obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_mprofile.o ftrace.o
+obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace.o ftrace_entry.o
 else
-obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64_pg.o ftrace_pg.o
+obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64_pg.o ftrace_64_pg_entry.o
 endif
-obj-$(CONFIG_FUNCTION_TRACER)  += ftrace_low.o
 obj-$(CONFIG_TRACING)  += trace_clock.o
 
 obj-$(CONFIG_PPC64)+= $(obj64-y)
@@ -26,7 +25,7 @@ GCOV_PROFILE_ftrace.o := n
 KCOV_INSTRUMENT_ftrace.o := n
 KCSAN_SANITIZE_ftrace.o := n
 UBSAN_SANITIZE_ftrace.o := n
-GCOV_PROFILE_ftrace_pg.o := n
-KCOV_INSTRUMENT_ftrace_pg.o := n
-KCSAN_SANITIZE_ftrace_pg.o := n
-UBSAN_SANITIZE_ftrace_pg.o := n
+GCOV_PROFILE_ftrace_64_pg.o := n
+KCOV_INSTRUMENT_ftrace_64_pg.o := n
+KCSAN_SANITIZE_ftrace_64_pg.o := n
+UBSAN_SANITIZE_ftrace_64_pg.o := n
diff --git a/arch/powerpc/kernel/trace/ftrace_64_pg.S b/arch/powerpc/kernel/trace/ftrace_64_pg.S
deleted file mode 100644
index 6708e24db0aba8..00
--- a/arch/powerpc/kernel/trace/ftrace_64_pg.S
+++ /dev/null
@@ -1,67 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-/*
- * Split from ftrace_64.S
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-_GLOBAL_TOC(ftrace_caller)
-   lbz r3, PACA_FTRACE_ENABLED(r13)
-   cmpdi   r3, 0
-   beqlr
-
-   /* Taken from output of objdump from lib64/glibc */
-   mflrr3
-   ld  r11, 0(r1)
-   stdur1, -112(r1)
-   std r3, 128(r1)
-   ld  r4, 16(r11)
-   subir3, r3, MCOUNT_INSN_SIZE
-.globl ftrace_call
-ftrace_call:
-   bl  ftrace_stub
-   nop
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-.globl ftrace_graph_call
-ftrace_graph_call:
-   b   ftrace_graph_stub
-_GLOBAL(ftrace_graph_stub)
-#endif
-   ld  r0, 128(r1)
-   mtlrr0
-   addir1, r1, 112
-
-_GLOBAL(ftrace_stub)
-   blr
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-_GLOBAL(ftrace_graph_caller)
-   addir5, r1, 112
-   /* load r4 with local address */
-   ld  r4, 128(r1)
-   subir4, r4, MCOUNT_INSN_SIZE
-
-   /* Grab the LR out of the caller stack frame */
-   ld  r11, 112(r1)
-   ld  r3, 16(r11)
-
-   bl  prepare_ftrace_return
-   nop
-
-   /*
-* prepare_ftrace_return gives us the address we divert to.
-* Change the LR in the callers stack frame to this.
-*/
-   ld  r11, 112(r1)
-   std r3, 16(r11)
-
-   ld  r0, 128(r1)
-   mtlrr0
-   addir1, r1, 112
-   blr
-#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
diff --git a/arch/powerpc/kernel/trace/ftrace_pg.c b/arch/powerpc/kernel/trace/ftrace_64_pg.c
similarity index 100%
rename from arch/powerpc/kernel/trace/ftrace_pg.c
rename to arch/powerpc/kernel/trace/ftrace_64_pg.c
diff --git a/arch/powerpc/kernel/trace/ftrace_low.S b/arch/powerpc/kernel/trace/ftrace_64_pg_entry.S
similarity index 55%
rename from arch/powerpc/kernel/trace/ftrace_low.S

[RFC PATCH 05/15] powerpc/ftrace: Extend ftrace support for large kernels to ppc32

2023-06-07 Thread Naveen N Rao
Commit 67361cf8071286 ("powerpc/ftrace: Handle large kernel configs")
added ftrace support for ppc64 kernel images with a text section larger
than 32MB. The approach itself isn't specific to ppc64, so extend the
same to also work on ppc32.

While at it, reduce the space reserved for the stub from 64 bytes to 32
bytes since the different stub variants are all less than 8
instructions.

To reduce use of #ifdef, a stub implementation is provided for
kernel_toc_addr() and -SZ_2G is cast to 'long long' to prevent
errors on ppc32.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/include/asm/ftrace.h  | 10 +--
 arch/powerpc/include/asm/sections.h|  2 ++
 arch/powerpc/kernel/trace/ftrace.c | 39 ++
 arch/powerpc/kernel/trace/ftrace_low.S |  6 ++--
 arch/powerpc/kernel/vmlinux.lds.S  |  4 ---
 5 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
index 91c049d51d0e10..1a5d365523e160 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -124,15 +124,19 @@ static inline u8 this_cpu_get_ftrace_enabled(void)
 {
return get_paca()->ftrace_enabled;
 }
-
-void ftrace_free_init_tramp(void);
 #else /* CONFIG_PPC64 */
 static inline void this_cpu_disable_ftrace(void) { }
 static inline void this_cpu_enable_ftrace(void) { }
 static inline void this_cpu_set_ftrace_enabled(u8 ftrace_enabled) { }
 static inline u8 this_cpu_get_ftrace_enabled(void) { return 1; }
-static inline void ftrace_free_init_tramp(void) { }
 #endif /* CONFIG_PPC64 */
+
+#ifdef CONFIG_FUNCTION_TRACER
+extern unsigned int ftrace_tramp_text[], ftrace_tramp_init[];
+void ftrace_free_init_tramp(void);
+#else
+static inline void ftrace_free_init_tramp(void) { }
+#endif
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_FTRACE */
diff --git a/arch/powerpc/include/asm/sections.h b/arch/powerpc/include/asm/sections.h
index 4e1f548c8d373d..ea26665f82cfc8 100644
--- a/arch/powerpc/include/asm/sections.h
+++ b/arch/powerpc/include/asm/sections.h
@@ -74,6 +74,8 @@ static inline int overlaps_kernel_text(unsigned long start, unsigned long end)
(unsigned long)_stext < end;
 }
 
+#else
+static inline unsigned long kernel_toc_addr(void) { BUILD_BUG(); return -1UL; }
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 5aa36272617a03..913c7aa63d3fa3 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -707,11 +707,6 @@ void arch_ftrace_update_code(int command)
ftrace_modify_all_code(command);
 }
 
-#ifdef CONFIG_PPC64
-#define PACATOC offsetof(struct paca_struct, kernel_toc)
-
-extern unsigned int ftrace_tramp_text[], ftrace_tramp_init[];
-
 void ftrace_free_init_tramp(void)
 {
int i;
@@ -725,28 +720,30 @@ void ftrace_free_init_tramp(void)
 
 int __init ftrace_dyn_arch_init(void)
 {
-   int i;
unsigned int *tramp[] = { ftrace_tramp_text, ftrace_tramp_init };
-#ifdef CONFIG_PPC_KERNEL_PCREL
+   unsigned long addr = FTRACE_REGS_ADDR;
+   long reladdr;
+   int i;
u32 stub_insns[] = {
+#ifdef CONFIG_PPC_KERNEL_PCREL
/* pla r12,addr */
PPC_PREFIX_MLS | __PPC_PRFX_R(1),
PPC_INST_PADDI | ___PPC_RT(_R12),
PPC_RAW_MTCTR(_R12),
PPC_RAW_BCTR()
-   };
-#else
-   u32 stub_insns[] = {
-   PPC_RAW_LD(_R12, _R13, PACATOC),
+#elif defined(CONFIG_PPC64)
+   PPC_RAW_LD(_R12, _R13, offsetof(struct paca_struct, kernel_toc)),
PPC_RAW_ADDIS(_R12, _R12, 0),
PPC_RAW_ADDI(_R12, _R12, 0),
PPC_RAW_MTCTR(_R12),
PPC_RAW_BCTR()
-   };
+#else
+   PPC_RAW_LIS(_R12, 0),
+   PPC_RAW_ADDI(_R12, _R12, 0),
+   PPC_RAW_MTCTR(_R12),
+   PPC_RAW_BCTR()
 #endif
-
-   unsigned long addr = FTRACE_REGS_ADDR;
-   long reladdr;
+   };
 
if (IS_ENABLED(CONFIG_PPC_KERNEL_PCREL)) {
for (i = 0; i < 2; i++) {
@@ -763,10 +760,10 @@ int __init ftrace_dyn_arch_init(void)
tramp[i][1] |= IMM_L(reladdr);
add_ftrace_tramp((unsigned long)tramp[i]);
}
-   } else {
+   } else if (IS_ENABLED(CONFIG_PPC64)) {
reladdr = addr - kernel_toc_addr();
 
-   if (reladdr >= (long)SZ_2G || reladdr < -(long)SZ_2G) {
+   if (reladdr >= (long)SZ_2G || reladdr < -(long long)SZ_2G) {
pr_err("Address of %ps out of range of kernel_toc.\n",
(void *)addr);
return -1;
@@ -778,11 +775,17 @@ int __init ftrace_dyn_arch_init(void)
tramp[i][2] |= PPC_LO(reladdr);
add_ftrace_tramp((unsigned long)tramp[i]);

[RFC PATCH 04/15] powerpc/ftrace: Use FTRACE_REGS_ADDR to identify the correct ftrace trampoline

2023-06-07 Thread Naveen N Rao
Instead of keying off DYNAMIC_FTRACE_WITH_REGS, use FTRACE_REGS_ADDR to
identify the proper ftrace trampoline address to use.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index f117124c30325f..5aa36272617a03 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -745,14 +745,9 @@ int __init ftrace_dyn_arch_init(void)
};
 #endif
 
-   unsigned long addr;
+   unsigned long addr = FTRACE_REGS_ADDR;
long reladdr;
 
-   if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
-   addr = ppc_global_function_entry((void *)ftrace_regs_caller);
-   else
-   addr = ppc_global_function_entry((void *)ftrace_caller);
-
if (IS_ENABLED(CONFIG_PPC_KERNEL_PCREL)) {
for (i = 0; i < 2; i++) {
reladdr = addr - (unsigned long)tramp[i];
-- 
2.40.1



[RFC PATCH 03/15] powerpc/ftrace: Simplify function_graph support in ftrace.c

2023-06-07 Thread Naveen N Rao
Since we now support DYNAMIC_FTRACE_WITH_ARGS across ppc32 and ppc64
ELFv2, we can simplify function_graph tracer support code in ftrace.c

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 64 --
 1 file changed, 7 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 81a121b56c4d7f..f117124c30325f 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -790,44 +790,10 @@ int __init ftrace_dyn_arch_init(void)
 #endif
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-
-extern void ftrace_graph_call(void);
-extern void ftrace_graph_stub(void);
-
-static int ftrace_modify_ftrace_graph_caller(bool enable)
+void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
-   unsigned long ip = (unsigned long)(&ftrace_graph_call);
-   unsigned long addr = (unsigned long)(&ftrace_graph_caller);
-   unsigned long stub = (unsigned long)(&ftrace_graph_stub);
-   ppc_inst_t old, new;
-
-   if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
-   return 0;
-
-   old = ftrace_call_replace(ip, enable ? stub : addr, 0);
-   new = ftrace_call_replace(ip, enable ? addr : stub, 0);
-
-   return ftrace_modify_code(ip, old, new);
-}
-
-int ftrace_enable_ftrace_graph_caller(void)
-{
-   return ftrace_modify_ftrace_graph_caller(true);
-}
-
-int ftrace_disable_ftrace_graph_caller(void)
-{
-   return ftrace_modify_ftrace_graph_caller(false);
-}
-
-/*
- * Hook the return address and push it in the stack of return addrs
- * in current thread info. Return the address we want to divert to.
- */
-static unsigned long
-__prepare_ftrace_return(unsigned long parent, unsigned long ip, unsigned long 
sp)
-{
-   unsigned long return_hooker;
+   unsigned long sp = fregs->regs.gpr[1];
int bit;
 
if (unlikely(ftrace_graph_is_dead()))
@@ -836,31 +802,15 @@ __prepare_ftrace_return(unsigned long parent, unsigned long ip, unsigned long sp
if (unlikely(atomic_read(&current->tracing_graph_pause)))
goto out;
 
-   bit = ftrace_test_recursion_trylock(ip, parent);
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
goto out;
 
-   return_hooker = ppc_function_entry(return_to_handler);
-
-   if (!function_graph_enter(parent, ip, 0, (unsigned long *)sp))
-   parent = return_hooker;
+   if (!function_graph_enter(parent_ip, ip, 0, (unsigned long *)sp))
+   parent_ip = ppc_function_entry(return_to_handler);
 
ftrace_test_recursion_unlock(bit);
 out:
-   return parent;
+   fregs->regs.link = parent_ip;
 }
-
-#ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
-void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
-  struct ftrace_ops *op, struct ftrace_regs *fregs)
-{
-   fregs->regs.link = __prepare_ftrace_return(parent_ip, ip, 
fregs->regs.gpr[1]);
-}
-#else
-unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
-   unsigned long sp)
-{
-   return __prepare_ftrace_return(parent, ip, sp);
-}
-#endif
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
-- 
2.40.1



[RFC PATCH 02/15] powerpc64/ftrace: Move ELFv1 and -pg support code into a separate file

2023-06-07 Thread Naveen N Rao
ELFv1 support is deprecated and on the way out. Pre -mprofile-kernel
ftrace support (-pg only) is very limited and is retained primarily for
clang builds. It won't be necessary once clang lands support for
-fpatchable-function-entry.

Copy the existing ftrace code supporting these into ftrace_pg.c.
ftrace.c can then be refactored and enhanced with a focus on ppc32 and
ppc64 ELFv2.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/Makefile|  13 +-
 arch/powerpc/kernel/trace/ftrace.c|  10 -
 arch/powerpc/kernel/trace/ftrace_pg.c | 846 ++
 3 files changed, 855 insertions(+), 14 deletions(-)
 create mode 100644 arch/powerpc/kernel/trace/ftrace_pg.c

diff --git a/arch/powerpc/kernel/trace/Makefile b/arch/powerpc/kernel/trace/Makefile
index b16a9f9c0b35f2..342a2d1ae86cd0 100644
--- a/arch/powerpc/kernel/trace/Makefile
+++ b/arch/powerpc/kernel/trace/Makefile
@@ -6,15 +6,16 @@
 ifdef CONFIG_FUNCTION_TRACER
 # do not trace tracer code
 CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_ftrace_pg.o = $(CC_FLAGS_FTRACE)
 endif
 
-obj32-$(CONFIG_FUNCTION_TRACER)+= ftrace_mprofile.o
+obj32-$(CONFIG_FUNCTION_TRACER)+= ftrace_mprofile.o ftrace.o
 ifdef CONFIG_MPROFILE_KERNEL
-obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_mprofile.o
+obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_mprofile.o ftrace.o
 else
-obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64_pg.o
+obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64_pg.o ftrace_pg.o
 endif
-obj-$(CONFIG_FUNCTION_TRACER)  += ftrace_low.o ftrace.o
+obj-$(CONFIG_FUNCTION_TRACER)  += ftrace_low.o
 obj-$(CONFIG_TRACING)  += trace_clock.o
 
 obj-$(CONFIG_PPC64)+= $(obj64-y)
@@ -25,3 +26,7 @@ GCOV_PROFILE_ftrace.o := n
 KCOV_INSTRUMENT_ftrace.o := n
 KCSAN_SANITIZE_ftrace.o := n
 UBSAN_SANITIZE_ftrace.o := n
+GCOV_PROFILE_ftrace_pg.o := n
+KCOV_INSTRUMENT_ftrace_pg.o := n
+KCSAN_SANITIZE_ftrace_pg.o := n
+UBSAN_SANITIZE_ftrace_pg.o := n
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index a47f303734233b..81a121b56c4d7f 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -864,13 +864,3 @@ unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
 }
 #endif
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
-
-#ifdef CONFIG_PPC64_ELF_ABI_V1
-char *arch_ftrace_match_adjust(char *str, const char *search)
-{
-   if (str[0] == '.' && search[0] != '.')
-   return str + 1;
-   else
-   return str;
-}
-#endif /* CONFIG_PPC64_ELF_ABI_V1 */
diff --git a/arch/powerpc/kernel/trace/ftrace_pg.c b/arch/powerpc/kernel/trace/ftrace_pg.c
new file mode 100644
index 00..7b85c3b460a3c0
--- /dev/null
+++ b/arch/powerpc/kernel/trace/ftrace_pg.c
@@ -0,0 +1,846 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Code for replacing ftrace calls with jumps.
+ *
+ * Copyright (C) 2007-2008 Steven Rostedt 
+ *
+ * Thanks goes out to P.A. Semi, Inc for supplying me with a PPC64 box.
+ *
+ * Added function graph tracer code, taken from x86 that was written
+ * by Frederic Weisbecker, and ported to PPC by Steven Rostedt.
+ *
+ */
+
+#define pr_fmt(fmt) "ftrace-powerpc: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * We generally only have a single long_branch tramp and at most 2 or 3 plt
+ * tramps generated. But, we don't use the plt tramps currently. We also allot
+ * 2 tramps after .text and .init.text. So, we only end up with around 3 usable
+ * tramps in total. Set aside 8 just to be sure.
+ */
+#define NUM_FTRACE_TRAMPS   8
+static unsigned long ftrace_tramps[NUM_FTRACE_TRAMPS];
+
+static ppc_inst_t
+ftrace_call_replace(unsigned long ip, unsigned long addr, int link)
+{
+   ppc_inst_t op;
+
+   addr = ppc_function_entry((void *)addr);
+
+   /* if (link) set op to 'bl' else 'b' */
+   create_branch(&op, (u32 *)ip, addr, link ? BRANCH_SET_LINK : 0);
+
+   return op;
+}
+
+static inline int
+ftrace_modify_code(unsigned long ip, ppc_inst_t old, ppc_inst_t new)
+{
+   ppc_inst_t replaced;
+
+   /*
+* Note:
+* We are paranoid about modifying text, as if a bug was to happen, it
+* could cause us to read or write to someplace that could cause harm.
+* Carefully read and modify the code with probe_kernel_*(), and make
+* sure what we read is what we expected it to be before modifying it.
+*/
+
+   /* read the text we want to modify */
+   if (copy_inst_from_kernel_nofault(&replaced, (void *)ip))
+   return -EFAULT;
+
+   /* Make sure it is what we expect it to be */
+   if (!ppc_inst_equal(replaced, old)) {
+   pr_err("%p: replaced (%08lx) != old (%08lx)", (void *)ip,
+  

[RFC PATCH 15/15] powerpc/ftrace: Add support for -fpatchable-function-entry

2023-06-07 Thread Naveen N Rao
GCC v13.1 updated support for -fpatchable-function-entry on ppc64le to
emit nops after the local entry point, rather than before it. This
allows us to use this in the kernel for ftrace purposes. A new script is
added under arch/powerpc/tools/ to help detect if nops are emitted after
the function local entry point, or before the global entry point.

With -fpatchable-function-entry, we no longer have the profiling
instructions generated at function entry, so we only need to validate
the presence of two nops at the ftrace location in ftrace_init_nop(). We
patch the preceding instruction with 'mflr r0' to match the
-mprofile-kernel ABI for subsequent ftrace use.

This changes the profiling instructions used on ppc32. The default -pg
option emits an additional 'stw' instruction after 'mflr r0' and before
the branch to _mcount 'bl _mcount'. This is very similar to the original
-mprofile-kernel implementation on ppc64le, where an additional 'std'
instruction was used to save LR to its save location in the caller's
stackframe. Subsequently, this additional store was removed in later
compiler versions for performance reasons. The same reasons apply for
ppc32 so we only patch in a 'mflr r0'.
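
For illustration only (the sequences below are a sketch, not actual compiler
output), the entry-code difference described above is roughly:

```asm
# -pg on ppc32 (default): compiler-generated profiling call
func:
        mflr    r0
        stw     r0, 4(r1)       # extra store; newer compilers elide the
        bl      _mcount         # equivalent 'std' on ppc64le for speed

# -fpatchable-function-entry=2 on ppc64le (GCC >= 13.1): two nops after
# the local entry point; ftrace_init_nop() later patches the first nop
# to 'mflr r0' to match the -mprofile-kernel ABI
func:
        ...                     # global entry point / TOC setup
        .localentry func
        nop                     # patched to 'mflr r0'
        nop                     # patched to a branch when tracing is enabled
```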

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/Kconfig  | 14 +++---
 arch/powerpc/Makefile |  5 
 arch/powerpc/include/asm/ftrace.h |  6 +++--
 arch/powerpc/include/asm/vermagic.h   |  4 ++-
 arch/powerpc/kernel/module_64.c   |  2 +-
 arch/powerpc/kernel/trace/ftrace.c| 14 --
 arch/powerpc/kernel/trace/ftrace_entry.S  |  2 ++
 .../gcc-check-fpatchable-function-entry.sh| 26 +++
 8 files changed, 64 insertions(+), 9 deletions(-)
 create mode 100755 arch/powerpc/tools/gcc-check-fpatchable-function-entry.sh

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index bff5820b7cda14..9352d8e68152e1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -187,6 +187,7 @@ config PPC
select DYNAMIC_FTRACE   if FUNCTION_TRACER
select EDAC_ATOMIC_SCRUB
select EDAC_SUPPORT
+   select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY if ARCH_USING_PATCHABLE_FUNCTION_ENTRY
select GENERIC_ATOMIC64 if PPC32
select GENERIC_CLOCKEVENTS_BROADCASTif SMP
select GENERIC_CMOS_UPDATE
@@ -227,8 +228,8 @@ config PPC
select HAVE_DEBUG_KMEMLEAK
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_DYNAMIC_FTRACE
-   select HAVE_DYNAMIC_FTRACE_WITH_ARGS    if MPROFILE_KERNEL || PPC32
-   select HAVE_DYNAMIC_FTRACE_WITH_REGS    if MPROFILE_KERNEL || PPC32
+   select HAVE_DYNAMIC_FTRACE_WITH_ARGS    if ARCH_USING_PATCHABLE_FUNCTION_ENTRY || MPROFILE_KERNEL || PPC32
+   select HAVE_DYNAMIC_FTRACE_WITH_REGS    if ARCH_USING_PATCHABLE_FUNCTION_ENTRY || MPROFILE_KERNEL || PPC32
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_FAST_GUP
@@ -256,7 +257,7 @@ config PPC
select HAVE_MOD_ARCH_SPECIFIC
	select HAVE_NMI				if PERF_EVENTS || (PPC64 && PPC_BOOK3S)
select HAVE_OPTPROBES
-   select HAVE_OBJTOOL if PPC32 || MPROFILE_KERNEL
+   select HAVE_OBJTOOL                     if ARCH_USING_PATCHABLE_FUNCTION_ENTRY || MPROFILE_KERNEL || PPC32
select HAVE_OBJTOOL_MCOUNT  if HAVE_OBJTOOL
select HAVE_PERF_EVENTS
select HAVE_PERF_EVENTS_NMI if PPC64
@@ -550,6 +551,13 @@ config MPROFILE_KERNEL
depends on PPC64 && CPU_LITTLE_ENDIAN && FUNCTION_TRACER
	def_bool $(success,$(srctree)/arch/powerpc/tools/gcc-check-mprofile-kernel.sh $(CC) -I$(srctree)/include -D__KERNEL__)
 
+config ARCH_USING_PATCHABLE_FUNCTION_ENTRY
+   depends on FUNCTION_TRACER && (PPC32 || PPC64_ELF_ABI_V2)
+   depends on $(cc-option,-fpatchable-function-entry=2)
+   def_bool y if PPC32
+   def_bool $(success,$(srctree)/arch/powerpc/tools/gcc-check-fpatchable-function-entry.sh $(CC) -mlittle-endian) if PPC64 && CPU_LITTLE_ENDIAN
+   def_bool $(success,$(srctree)/arch/powerpc/tools/gcc-check-fpatchable-function-entry.sh $(CC) -mbig-endian) if PPC64 && CPU_BIG_ENDIAN
+
 config HOTPLUG_CPU
bool "Support for enabling/disabling CPUs"
depends on SMP && (PPC_PSERIES || \
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index dca73f673d7046..de39478b1c9e9f 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -148,11 +148,16 @@ CFLAGS-$(CONFIG_PPC32) += $(call cc-option, $(MULTIPLEWORD))
 CFLAGS-$(CONFIG_PPC32) += $(call cc-option,-mno-readonly-in-sdata)
 
 ifdef CONFIG_FUNCTION_TRACER
+ifdef CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY
+KBUILD_CPPFLAGS+= -DCC_USING_PATCHABLE_FUNCTION_ENTRY
+CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+else
 CC_FLAGS_FTRACE := -pg
 ifdef CONFIG_MPROFILE_KERNEL
 

[RFC PATCH 14/15] powerpc/ftrace: Implement ftrace_replace_code()

2023-06-07 Thread Naveen N Rao
Implement ftrace_replace_code() to consolidate logic from the different
ftrace patching routines: ftrace_make_nop(), ftrace_make_call() and
ftrace_modify_call(). Note that ftrace_make_call() is still required
primarily to handle patching modules during their load time. The other
two routines should no longer be called.

This lays the groundwork to enable better control in patching ftrace
locations, including the ability to nop-out preceding profiling
instructions when ftrace is disabled.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 173 -
 1 file changed, 96 insertions(+), 77 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 4375ef71609c0a..fd1ef350c62c42 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -94,104 +94,123 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
return 0;
 }
 
+static int ftrace_get_call_inst(struct dyn_ftrace *rec, unsigned long addr, ppc_inst_t *call_inst)
+{
+   unsigned long ip = rec->ip;
+   unsigned long stub;
+
+   if (is_offset_in_branch_range(addr - ip)) {
+   /* Within range */
+   stub = addr;
+#ifdef CONFIG_MODULES
+   } else if (rec->arch.mod) {
+   /* Module code would be going to one of the module stubs */
+           stub = (addr == (unsigned long)ftrace_caller ? rec->arch.mod->arch.tramp :
+                                                          rec->arch.mod->arch.tramp_regs);
+#endif
+   } else if (core_kernel_text(ip)) {
+   /* We would be branching to one of our ftrace stubs */
+   stub = find_ftrace_tramp(ip);
+   if (!stub) {
+   pr_err("0x%lx: No ftrace stubs reachable\n", ip);
+   return -EINVAL;
+   }
+   } else {
+   return -EINVAL;
+   }
+
+   *call_inst = ftrace_create_branch_inst(ip, stub, 1);
+   return 0;
+}
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, unsigned long addr)
 {
-   unsigned long tramp, tramp_old, ip = rec->ip;
-   ppc_inst_t old, new;
-   struct module *mod;
-
-   if (is_offset_in_branch_range(old_addr - ip) && is_offset_in_branch_range(addr - ip)) {
-   /* Within range */
-   old = ftrace_create_branch_inst(ip, old_addr, 1);
-   new = ftrace_create_branch_inst(ip, addr, 1);
-   return ftrace_modify_code(ip, old, new);
-   } else if (core_kernel_text(ip)) {
-   /*
-* We always patch out of range locations to go to the regs
-* variant, so there is nothing to do here
-*/
-   return 0;
-   } else if (IS_ENABLED(CONFIG_MODULES)) {
-   /* Module code would be going to one of the module stubs */
-   mod = rec->arch.mod;
-   if (addr == (unsigned long)ftrace_caller) {
-   tramp_old = mod->arch.tramp_regs;
-   tramp = mod->arch.tramp;
-   } else {
-   tramp_old = mod->arch.tramp;
-   tramp = mod->arch.tramp_regs;
-   }
-   old = ftrace_create_branch_inst(ip, tramp_old, 1);
-   new = ftrace_create_branch_inst(ip, tramp, 1);
-   return ftrace_modify_code(ip, old, new);
-   }
-
+   /* This should never be called since we override ftrace_replace_code() */
+   WARN_ON(1);
return -EINVAL;
 }
 #endif
 
 int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 {
-   unsigned long tramp, ip = rec->ip;
ppc_inst_t old, new;
-   struct module *mod;
+   int ret;
+
+   /* This can only ever be called during module load */
+   if (WARN_ON(!IS_ENABLED(CONFIG_MODULES) || core_kernel_text(rec->ip)))
+   return -EINVAL;
 
old = ppc_inst(PPC_RAW_NOP());
-   if (is_offset_in_branch_range(addr - ip)) {
-   /* Within range */
-   new = ftrace_create_branch_inst(ip, addr, 1);
-   return ftrace_modify_code(ip, old, new);
-   } else if (core_kernel_text(ip)) {
-   /* We would be branching to one of our ftrace tramps */
-   tramp = find_ftrace_tramp(ip);
-   if (!tramp) {
-   pr_err("0x%lx: No ftrace trampolines reachable\n", ip);
-   return -EINVAL;
-   }
-   new = ftrace_create_branch_inst(ip, tramp, 1);
-   return ftrace_modify_code(ip, old, new);
-   } else if (IS_ENABLED(CONFIG_MODULES)) {
-   /* Module code would be going to one of the module stubs */
-   mod = rec->arch.mod;
-   tramp = (addr == (unsigned long)ftrace_caller ? mod->arch.tramp 
: 

[RFC PATCH 13/15] powerpc/ftrace: Replace use of ftrace_call_replace() with ftrace_create_branch_inst()

2023-06-07 Thread Naveen N Rao
ftrace_create_branch_inst() is clearer about its intent than
ftrace_call_replace().

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 17 ++---
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index c810343d1a6cd6..4375ef71609c0a 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -41,19 +41,6 @@ static ppc_inst_t ftrace_create_branch_inst(unsigned long 
ip, unsigned long addr
return op;
 }
 
-static ppc_inst_t
-ftrace_call_replace(unsigned long ip, unsigned long addr, int link)
-{
-   ppc_inst_t op;
-
-   addr = ppc_function_entry((void *)addr);
-
-   /* if (link) set op to 'bl' else 'b' */
-   create_branch(&op, (u32 *)ip, addr, link ? BRANCH_SET_LINK : 0);
-
-   return op;
-}
-
 static inline int ftrace_read_inst(unsigned long ip, ppc_inst_t *op)
 {
if (copy_inst_from_kernel_nofault(op, (void *)ip)) {
@@ -275,14 +262,14 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
int ret;
 
	old = ppc_inst_read((u32 *)&ftrace_call);
-   new = ftrace_call_replace(ip, (unsigned long)func, 1);
+   new = ftrace_create_branch_inst(ip, ppc_function_entry(func), 1);
ret = ftrace_modify_code(ip, old, new);
 
/* Also update the regs callback function */
if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS) && !ret) {
		ip = (unsigned long)(&ftrace_regs_call);
		old = ppc_inst_read((u32 *)&ftrace_regs_call);
-   new = ftrace_call_replace(ip, (unsigned long)func, 1);
+           new = ftrace_create_branch_inst(ip, ppc_function_entry(func), 1);
ret = ftrace_modify_code(ip, old, new);
}
 
-- 
2.40.1



[RFC PATCH 12/15] powerpc/ftrace: Simplify ftrace_modify_call()

2023-06-07 Thread Naveen N Rao
Now that we validate the ftrace location during initialization in
ftrace_init_nop(), we can simplify ftrace_modify_call() to patch-in the
updated branch instruction without worrying about the instructions
surrounding the ftrace location. Note that we continue to ensure we
have the expected branch instruction at the ftrace location before
patching it with the updated branch destination.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 161 -
 1 file changed, 21 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 8d5d91b8ae85a0..c810343d1a6cd6 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -89,33 +89,11 @@ static inline int ftrace_modify_code(unsigned long ip, ppc_inst_t old, ppc_inst_
return ret;
 }
 
-/*
- * Helper functions that are the same for both PPC64 and PPC32.
- */
-static int test_24bit_addr(unsigned long ip, unsigned long addr)
-{
-   addr = ppc_function_entry((void *)addr);
-
-   return is_offset_in_branch_range(addr - ip);
-}
-
 static int is_bl_op(ppc_inst_t op)
 {
return (ppc_inst_val(op) & ~PPC_LI_MASK) == PPC_RAW_BL(0);
 }
 
-static unsigned long find_bl_target(unsigned long ip, ppc_inst_t op)
-{
-   int offset;
-
-   offset = PPC_LI(ppc_inst_val(op));
-   /* make it signed */
-   if (offset & 0x0200)
-   offset |= 0xfe00;
-
-   return ip + (long)offset;
-}
-
 static unsigned long find_ftrace_tramp(unsigned long ip)
 {
int i;
@@ -130,115 +108,16 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
 }
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
-#ifdef CONFIG_MODULES
-static int
-__ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
-   unsigned long addr)
+int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, unsigned long addr)
 {
-   ppc_inst_t op;
-   unsigned long ip = rec->ip;
-   unsigned long entry, ptr, tramp;
-   struct module *mod = rec->arch.mod;
-
-   /* If we never set up ftrace trampolines, then bail */
-   if (!mod->arch.tramp || !mod->arch.tramp_regs) {
-   pr_err("No ftrace trampoline\n");
-   return -EINVAL;
-   }
-
-   /* read where this goes */
-   if (copy_inst_from_kernel_nofault(&op, (void *)ip)) {
-   pr_err("Fetching opcode failed.\n");
-   return -EFAULT;
-   }
-
-   /* Make sure that this is still a 24bit jump */
-   if (!is_bl_op(op)) {
-           pr_err("Not expected bl: opcode is %08lx\n", ppc_inst_as_ulong(op));
-   return -EINVAL;
-   }
-
-   /* lets find where the pointer goes */
-   tramp = find_bl_target(ip, op);
-   entry = ppc_global_function_entry((void *)old_addr);
-
-   pr_devel("ip:%lx jumps to %lx", ip, tramp);
-
-   if (tramp != entry) {
-           /* old_addr is not within range, so we must have used a trampoline */
-           if (module_trampoline_target(mod, tramp, &ptr)) {
-   pr_err("Failed to get trampoline target\n");
-   return -EFAULT;
-   }
-
-   pr_devel("trampoline target %lx", ptr);
-
-   /* This should match what was called */
-   if (ptr != entry) {
-                   pr_err("addr %lx does not match expected %lx\n", ptr, entry);
-   return -EINVAL;
-   }
-   }
-
-   /* The new target may be within range */
-   if (test_24bit_addr(ip, addr)) {
-   /* within range */
-   if (patch_branch((u32 *)ip, addr, BRANCH_SET_LINK)) {
-   pr_err("REL24 out of range!\n");
-   return -EINVAL;
-   }
-
-   return 0;
-   }
-
-   if (rec->flags & FTRACE_FL_REGS)
-   tramp = mod->arch.tramp_regs;
-   else
-   tramp = mod->arch.tramp;
-
-   if (module_trampoline_target(mod, tramp, &ptr)) {
-   pr_err("Failed to get trampoline target\n");
-   return -EFAULT;
-   }
-
-   pr_devel("trampoline target %lx", ptr);
-
-   entry = ppc_global_function_entry((void *)addr);
-   /* This should match what was called */
-   if (ptr != entry) {
-   pr_err("addr %lx does not match expected %lx\n", ptr, entry);
-   return -EINVAL;
-   }
-
-   if (patch_branch((u32 *)ip, tramp, BRANCH_SET_LINK)) {
-   pr_err("REL24 out of range!\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-#else
-static int __ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr, unsigned long addr)
-{
-   return 0;
-}
-#endif
-
-int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
-   unsigned long addr)
-{
-   unsigned long ip = 

[RFC PATCH 11/15] powerpc/ftrace: Simplify ftrace_make_call()

2023-06-07 Thread Naveen N Rao
Now that we validate the ftrace location during initialization in
ftrace_init_nop(), we can simplify ftrace_make_call() to replace the nop
without worrying about the instructions surrounding the ftrace location.
Note that we continue to ensure that we have a nop at the ftrace
location before patching it.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c | 187 +
 1 file changed, 31 insertions(+), 156 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 67773cd14da71a..8d5d91b8ae85a0 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -129,162 +129,6 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
return 0;
 }
 
-#ifdef CONFIG_MODULES
-/*
- * Examine the existing instructions for __ftrace_make_call.
- * They should effectively be a NOP, and follow formal constraints,
- * depending on the ABI. Return false if they don't.
- */
-static bool expected_nop_sequence(void *ip, ppc_inst_t op0, ppc_inst_t op1)
-{
-   if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
-   return ppc_inst_equal(op0, ppc_inst(PPC_RAW_NOP()));
-   else
-   return ppc_inst_equal(op0, ppc_inst(PPC_RAW_BRANCH(8))) &&
-  ppc_inst_equal(op1, ppc_inst(PPC_INST_LD_TOC));
-}
-
-static int
-__ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
-{
-   ppc_inst_t op[2];
-   void *ip = (void *)rec->ip;
-   unsigned long entry, ptr, tramp;
-   struct module *mod = rec->arch.mod;
-
-   /* read where this goes */
-   if (copy_inst_from_kernel_nofault(op, ip))
-   return -EFAULT;
-
-   if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS) &&
-   copy_inst_from_kernel_nofault(op + 1, ip + 4))
-   return -EFAULT;
-
-   if (!expected_nop_sequence(ip, op[0], op[1])) {
-   pr_err("Unexpected call sequence at %p: %08lx %08lx\n", ip,
-  ppc_inst_as_ulong(op[0]), ppc_inst_as_ulong(op[1]));
-   return -EINVAL;
-   }
-
-   /* If we never set up ftrace trampoline(s), then bail */
-   if (!mod->arch.tramp ||
-       (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS) && !mod->arch.tramp_regs)) {
-   pr_err("No ftrace trampoline\n");
-   return -EINVAL;
-   }
-
-   if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS) && rec->flags & FTRACE_FL_REGS)
-   tramp = mod->arch.tramp_regs;
-   else
-   tramp = mod->arch.tramp;
-
-   if (module_trampoline_target(mod, tramp, &ptr)) {
-   pr_err("Failed to get trampoline target\n");
-   return -EFAULT;
-   }
-
-   pr_devel("trampoline target %lx", ptr);
-
-   entry = ppc_global_function_entry((void *)addr);
-   /* This should match what was called */
-   if (ptr != entry) {
-   pr_err("addr %lx does not match expected %lx\n", ptr, entry);
-   return -EINVAL;
-   }
-
-   if (patch_branch(ip, tramp, BRANCH_SET_LINK)) {
-   pr_err("REL24 out of range!\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-#else
-static int __ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
-{
-   return 0;
-}
-#endif /* CONFIG_MODULES */
-
-static int __ftrace_make_call_kernel(struct dyn_ftrace *rec, unsigned long addr)
-{
-   ppc_inst_t op;
-   void *ip = (void *)rec->ip;
-   unsigned long tramp, entry, ptr;
-
-   /* Make sure we're being asked to patch branch to a known ftrace addr */
-   entry = ppc_global_function_entry((void *)ftrace_caller);
-   ptr = ppc_global_function_entry((void *)addr);
-
-   if (ptr != entry && IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
-   entry = ppc_global_function_entry((void *)ftrace_regs_caller);
-
-   if (ptr != entry) {
-   pr_err("Unknown ftrace addr to patch: %ps\n", (void *)ptr);
-   return -EINVAL;
-   }
-
-   /* Make sure we have a nop */
-   if (copy_inst_from_kernel_nofault(&op, ip)) {
-   pr_err("Unable to read ftrace location %p\n", ip);
-   return -EFAULT;
-   }
-
-   if (!ppc_inst_equal(op, ppc_inst(PPC_RAW_NOP()))) {
-   pr_err("Unexpected call sequence at %p: %08lx\n",
-  ip, ppc_inst_as_ulong(op));
-   return -EINVAL;
-   }
-
-   tramp = find_ftrace_tramp((unsigned long)ip);
-   if (!tramp) {
-   pr_err("No ftrace trampolines reachable from %ps\n", ip);
-   return -EINVAL;
-   }
-
-   if (patch_branch(ip, tramp, BRANCH_SET_LINK)) {
-   pr_err("Error patching branch to ftrace tramp!\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
-int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
-{
-   unsigned long ip = rec->ip;
-   ppc_inst_t old, new;
-
-   

[RFC PATCH 00/15] powerpc/ftrace: refactoring and support for -fpatchable-function-entry

2023-06-07 Thread Naveen N Rao
This is a follow-on RFC to the patch I previously posted here:
http://lore.kernel.org/20230519192600.2593506-1-nav...@kernel.org

Since then, I have split up the patches, picked up a few more changes 
and given this more testing. More details in the individual patches.


- Naveen



Naveen N Rao (15):
  powerpc/module: Remove unused .ftrace.tramp section
  powerpc64/ftrace: Move ELFv1 and -pg support code into a separate file
  powerpc/ftrace: Simplify function_graph support in ftrace.c
  powerpc/ftrace: Use FTRACE_REGS_ADDR to identify the correct ftrace
trampoline
  powerpc/ftrace: Extend ftrace support for large kernels to ppc32
  powerpc/ftrace: Consolidate ftrace support into fewer files
  powerpc/ftrace: Refactor ftrace_modify_code()
  powerpc/ftrace: Stop re-purposing linker generated long branches for
ftrace
  powerpc/ftrace: Add separate ftrace_init_nop() with additional
validation
  powerpc/ftrace: Simplify ftrace_make_nop()
  powerpc/ftrace: Simplify ftrace_make_call()
  powerpc/ftrace: Simplify ftrace_modify_call()
  powerpc/ftrace: Replace use of ftrace_call_replace() with
ftrace_create_branch_inst()
  powerpc/ftrace: Implement ftrace_replace_code()
  powerpc/ftrace: Add support for -fpatchable-function-entry

 arch/powerpc/Kconfig  |  14 +-
 arch/powerpc/Makefile |   5 +
 arch/powerpc/include/asm/ftrace.h |  22 +-
 arch/powerpc/include/asm/module.h |   4 -
 arch/powerpc/include/asm/sections.h   |   2 +
 arch/powerpc/include/asm/vermagic.h   |   4 +-
 arch/powerpc/kernel/module_64.c   |   2 +-
 arch/powerpc/kernel/trace/Makefile|  12 +-
 arch/powerpc/kernel/trace/ftrace.c| 908 +-
 arch/powerpc/kernel/trace/ftrace_64_pg.S  |  67 --
 arch/powerpc/kernel/trace/ftrace_64_pg.c  | 846 
 .../{ftrace_low.S => ftrace_64_pg_entry.S}|  64 +-
 .../{ftrace_mprofile.S => ftrace_entry.S} |  67 ++
 arch/powerpc/kernel/vmlinux.lds.S |   4 -
 .../gcc-check-fpatchable-function-entry.sh|  26 +
 15 files changed, 1277 insertions(+), 770 deletions(-)
 delete mode 100644 arch/powerpc/kernel/trace/ftrace_64_pg.S
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64_pg.c
 rename arch/powerpc/kernel/trace/{ftrace_low.S => ftrace_64_pg_entry.S} (54%)
 rename arch/powerpc/kernel/trace/{ftrace_mprofile.S => ftrace_entry.S} (83%)
 create mode 100755 arch/powerpc/tools/gcc-check-fpatchable-function-entry.sh


base-commit: bd517a8442b6c6646a136421cd4c1b95bf4ce32b
-- 
2.40.1



Re: [PATCH] security/integrity: fix pointer to ESL data and its size on pseries

2023-06-07 Thread Jarkko Sakkinen
On Wed Jun 7, 2023 at 3:28 PM EEST, Nayna wrote:
>
> On 6/6/23 16:51, Jarkko Sakkinen wrote:
> > On Tue Jun 6, 2023 at 8:26 PM EEST, Nayna Jain wrote:
> >> On PowerVM guest, variable data is prefixed with 8 bytes of timestamp.
> >> Extract ESL by stripping off the timestamp before passing to ESL parser.
> >>
> > Cc: sta...@vger.kenrnel.org # v6.3
> >
> > ?
>
> Aah yes. Missed that.. Thanks..
>
>
> >
> >> Fixes: 4b3e71e9a34c ("integrity/powerpc: Support loading keys from PLPKS")
> >> Signed-off-by: Nayna Jain 
> >> ---
> >>   .../integrity/platform_certs/load_powerpc.c   | 39 ---
> >>   1 file changed, 26 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/security/integrity/platform_certs/load_powerpc.c b/security/integrity/platform_certs/load_powerpc.c
> >> index b9de70b90826..57768cbf1fd3 100644
> >> --- a/security/integrity/platform_certs/load_powerpc.c
> >> +++ b/security/integrity/platform_certs/load_powerpc.c
> >> @@ -15,6 +15,9 @@
> >>   #include "keyring_handler.h"
> >>   #include "../integrity.h"
> >>   
> >> +#define extract_data(db, data, size, offset)  \
> >> +  do { db = data + offset; size = size - offset; } while (0)
> >> +
> >>   /*
> >>* Get a certificate list blob from the named secure variable.
> >>*
> >> @@ -55,8 +58,10 @@ static __init void *get_cert_list(u8 *key, unsigned long keylen, u64 *size)
> >>*/
> >>   static int __init load_powerpc_certs(void)
> >>   {
> >> +  void *data = NULL;
> >> +  u64 dsize = 0;
> >> +  u64 offset = 0;
> >>void *db = NULL, *dbx = NULL;
> > So... what do you need db still for?
> >
> > If you meant to rename 'db' to 'data', then you should not do it, since 
> > this is
> > a bug fix. It is zero gain, and a factor harder backport.
>
> In case of PowerVM guest, data points to timestamp + ESL.  And then with 
> offset of 8 bytes, db points to ESL.
>
> While db is used for parsing ESL, data is then later used to free 
> (timestamp + ESL) memory.
>
> Hope it answers the question.

OK, cool. Only thing I have to add that it would be more consistent if
data was declared in the same line as db and dbx, given that they are
declared in the same line.

BR, Jarkko


[PATCH 7/7] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-06-07 Thread Petr Mladek
The HAVE_ prefix means only that the code could be enabled. Add another
variable, HARDLOCKUP_DETECTOR_ARCH, without this prefix. It will be set
when the detector should actually be built, making it consistent with
the other hardlockup detectors.

The change allows cleaning up the dependencies of the PPC_WATCHDOG
and HAVE_HARDLOCKUP_DETECTOR_PERF definitions for powerpc.

As a result HAVE_HARDLOCKUP_DETECTOR_PERF has the same dependencies
on arm, x86, powerpc architectures.

Signed-off-by: Petr Mladek 
---
 arch/powerpc/Kconfig | 5 ++---
 include/linux/nmi.h  | 2 +-
 lib/Kconfig.debug| 9 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 539d1f03ff42..987e730740d7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -90,8 +90,7 @@ config NMI_IPI
 
 config PPC_WATCHDOG
bool
-   depends on HARDLOCKUP_DETECTOR
-   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
+   depends on HARDLOCKUP_DETECTOR_ARCH
default y
help
  This is a placeholder when the powerpc hardlockup detector
@@ -240,7 +239,7 @@ config PPC
	select HAVE_GCC_PLUGINS			if GCC_VERSION >= 50200   # plugin support on gcc <= 5.1 is buggy on PPC
select HAVE_GENERIC_VDSO
select HAVE_HARDLOCKUP_DETECTOR_ARCHif PPC_BOOK3S_64 && SMP
-   select HAVE_HARDLOCKUP_DETECTOR_PERF    if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
+   select HAVE_HARDLOCKUP_DETECTOR_PERF    if PERF_EVENTS && HAVE_PERF_EVENTS_NMI
	select HAVE_HW_BREAKPOINT		if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx)
select HAVE_IOREMAP_PROT
select HAVE_IRQ_TIME_ACCOUNTING
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 515d6724f469..ec808ebd36ba 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -9,7 +9,7 @@
 #include 
 
 /* Arch specific watchdogs might need to share extra watchdog-related APIs. */
-#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 #include 
 #endif
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 116904e65d9f..97853ca54dc7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1056,6 +1056,7 @@ config HARDLOCKUP_DETECTOR
	depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
imply HARDLOCKUP_DETECTOR_PERF
imply HARDLOCKUP_DETECTOR_BUDDY
+   imply HARDLOCKUP_DETECTOR_ARCH
select LOCKUP_DETECTOR
 
help
@@ -1102,6 +1103,14 @@ config HARDLOCKUP_DETECTOR_BUDDY
depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
 
+config HARDLOCKUP_DETECTOR_ARCH
+   bool
+   depends on HARDLOCKUP_DETECTOR
+   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
+   help
+ The arch-specific implementation of the hardlockup detector is
+ available.
+
 #
 # Both the "perf" and "buddy" hardlockup detectors count hrtimer
 # interrupts. This config enables functions managing this common code.
-- 
2.35.3



[PATCH 6/7] watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64

2023-06-07 Thread Petr Mladek
The HAVE_ prefix means only that the code could be enabled. Add another
variable, HARDLOCKUP_DETECTOR_SPARC64, without this prefix. It will be
set when the detector should actually be built, making it consistent
with the other hardlockup detectors.

Before, it is far from obvious that the SPARC64 variant is actually used:

$> make ARCH=sparc64 defconfig
$> grep HARDLOCKUP_DETECTOR .config
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y

After, it is more clear:

$> make ARCH=sparc64 defconfig
$> grep HARDLOCKUP_DETECTOR .config
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
CONFIG_HARDLOCKUP_DETECTOR_SPARC64=y

Signed-off-by: Petr Mladek 
---
 arch/sparc/Kconfig.debug | 10 +-
 include/linux/nmi.h  |  4 ++--
 kernel/watchdog.c|  2 +-
 lib/Kconfig.debug|  2 +-
 4 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/Kconfig.debug b/arch/sparc/Kconfig.debug
index b6695303b8d4..0bb95b0aacf4 100644
--- a/arch/sparc/Kconfig.debug
+++ b/arch/sparc/Kconfig.debug
@@ -16,8 +16,9 @@ config FRAME_POINTER
default y
 
 config HAVE_HARDLOCKUP_DETECTOR_SPARC64
-   depends on HAVE_NMI
bool
+   depends on HAVE_NMI
+   select HARDLOCKUP_DETECTOR_SPARC64
help
  Sparc64 provides its own hardlockup detector implementation instead
  of the generic perf one.
@@ -26,3 +27,10 @@ config HAVE_HARDLOCKUP_DETECTOR_SPARC64
  used by generic hardlockup detectors. Instead it is enabled/disabled
  by the top-level watchdog interface that is common for both softlockup
  and hardlockup detectors.
+
+config HARDLOCKUP_DETECTOR_SPARC64
+   bool
+   depends on HAVE_HARDLOCKUP_DETECTOR_SPARC64
+
+   help
+ The custom hardlockup detector is always built when possible.
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 7ee6c35d1f05..515d6724f469 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -9,7 +9,7 @@
 #include 
 
 /* Arch specific watchdogs might need to share extra watchdog-related APIs. */
-#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 #include 
 #endif
 
@@ -92,7 +92,7 @@ static inline void hardlockup_detector_disable(void) {}
 #endif
 
 /* Sparc64 has special implemetantion that is always enabled. */
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 void arch_touch_nmi_watchdog(void);
 #else
 static inline void arch_touch_nmi_watchdog(void) { }
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index babd2f3c8b72..a2154e753cb4 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -29,7 +29,7 @@
 
 static DEFINE_MUTEX(watchdog_mutex);
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 # define WATCHDOG_HARDLOCKUP_DEFAULT   1
 #else
 # define WATCHDOG_HARDLOCKUP_DEFAULT   0
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2d8d8ce7c2d7..116904e65d9f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1052,7 +1052,7 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
 #
 config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
-   depends on DEBUG_KERNEL && !S390 && !HAVE_HARDLOCKUP_DETECTOR_SPARC64
+   depends on DEBUG_KERNEL && !S390 && !HARDLOCKUP_DETECTOR_SPARC64
	depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
imply HARDLOCKUP_DETECTOR_PERF
imply HARDLOCKUP_DETECTOR_BUDDY
-- 
2.35.3



[PATCH 5/7] watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64

2023-06-07 Thread Petr Mladek
The configuration variable HAVE_NMI_WATCHDOG has a generic name but
it is selected only for SPARC64.

It should _not_ be used in general because it is not integrated with
the other hardlockup detectors. Namely, it does not support the hardlockup
specific command line parameters and sysctl interface. Instead, it is
enabled/disabled together with the softlockup detector by the global
"watchdog" sysctl.

Rename it to HAVE_HARDLOCKUP_DETECTOR_SPARC64 to make the special
behavior clearer.

Also the variable is set only on sparc64. Move the definition
from arch/Kconfig to arch/sparc/Kconfig.debug.

Signed-off-by: Petr Mladek 
---
 arch/Kconfig | 12 
 arch/sparc/Kconfig   |  2 +-
 arch/sparc/Kconfig.debug | 12 
 include/linux/nmi.h  |  4 ++--
 kernel/watchdog.c|  2 +-
 lib/Kconfig.debug|  5 +
 6 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 57f15babe188..6517e5477459 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -400,18 +400,6 @@ config HAVE_HARDLOCKUP_DETECTOR_PERF
  The arch chooses to use the generic perf-NMI-based hardlockup
  detector. Must define HAVE_PERF_EVENTS_NMI.
 
-config HAVE_NMI_WATCHDOG
-   depends on HAVE_NMI
-   bool
-   help
- Sparc64 provides its own hardlockup detector implementation instead
- of the generic perf one.
-
- It does _not_ use the command line parameters and sysctl interface
- used by generic hardlockup detectors. Instead it is enabled/disabled
- by the top-level watchdog interface that is common for both softlockup
- and hardlockup detectors.
-
 config HAVE_HARDLOCKUP_DETECTOR_ARCH
bool
help
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8535e19062f6..7297f69635cb 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -33,7 +33,7 @@ config SPARC
select ARCH_WANT_IPC_PARSE_VERSION
select GENERIC_PCI_IOMAP
select HAS_IOPORT
-   select HAVE_NMI_WATCHDOG if SPARC64
+   select HAVE_HARDLOCKUP_DETECTOR_SPARC64 if SPARC64
select HAVE_CBPF_JIT if SPARC32
select HAVE_EBPF_JIT if SPARC64
select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/sparc/Kconfig.debug b/arch/sparc/Kconfig.debug
index 6b2bec1888b3..b6695303b8d4 100644
--- a/arch/sparc/Kconfig.debug
+++ b/arch/sparc/Kconfig.debug
@@ -14,3 +14,15 @@ config FRAME_POINTER
bool
depends on MCOUNT
default y
+
+config HAVE_HARDLOCKUP_DETECTOR_SPARC64
+   depends on HAVE_NMI
+   bool
+   help
+ Sparc64 provides its own hardlockup detector implementation instead
+ of the generic perf one.
+
+ It does _not_ use the command line parameters and sysctl interface
+ used by generic hardlockup detectors. Instead it is enabled/disabled
+ by the top-level watchdog interface that is common for both softlockup
+ and hardlockup detectors.
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 800196c78f65..7ee6c35d1f05 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -9,7 +9,7 @@
 #include 
 
 /* Arch specific watchdogs might need to share extra watchdog-related APIs. */
-#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HAVE_NMI_WATCHDOG)
+#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
 #include 
 #endif
 
@@ -92,7 +92,7 @@ static inline void hardlockup_detector_disable(void) {}
 #endif
 
 /* Sparc64 has a special implementation that is always enabled. */
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_NMI_WATCHDOG)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
 void arch_touch_nmi_watchdog(void);
 #else
 static inline void arch_touch_nmi_watchdog(void) { }
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 437c198933cf..babd2f3c8b72 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -29,7 +29,7 @@
 
 static DEFINE_MUTEX(watchdog_mutex);
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_NMI_WATCHDOG)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
 # define WATCHDOG_HARDLOCKUP_DEFAULT   1
 #else
 # define WATCHDOG_HARDLOCKUP_DEFAULT   0
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4b4aa0f941f9..2d8d8ce7c2d7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1050,12 +1050,9 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
 #  sparc64: has a custom implementation which is not using the common
 #  hardlockup command line options and sysctl interface.
 #
-# Note that HAVE_NMI_WATCHDOG is set when the sparc64 specific implementation
-# is used.
-#
 config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
-   depends on DEBUG_KERNEL && !S390 && !HAVE_NMI_WATCHDOG
+   depends on DEBUG_KERNEL && !S390 && !HAVE_HARDLOCKUP_DETECTOR_SPARC64

[PATCH 4/7] watchdog/hardlockup: Enable HAVE_NMI_WATCHDOG only on sparc64

2023-06-07 Thread Petr Mladek
HAVE_NMI_WATCHDOG is always enabled when SPARC64 is enabled. The sparc64
implementation is special. It does not support the generic hardlockup
related Kconfig values, command line parameters, and sysctl interface.
Instead it is enabled/disabled by the top-level watchdog interface
that is common for both softlockup and hardlockup detectors.

As a result, sparc64 needs special treatment in Kconfig and source
files. The checks are a bit complicated because HAVE_NMI_WATCHDOG is
automatically set when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
But HAVE_HARDLOCKUP_DETECTOR_ARCH is set when the arch specific
implementation uses the generic hardlockup detector related
Kconfig variables, command line parameters, and sysctl interface.

The motivation probably was to avoid changes in the code when
the powerpc64-specific watchdog introduced HAVE_HARDLOCKUP_DETECTOR_ARCH.
It probably allowed re-using some existing checks for HAVE_NMI_WATCHDOG.

But it actually made things pretty complicated. For example,
the following check was needed in HARDLOCKUP_DETECTOR config variable:

   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH

The inverted logic makes things easier:

  + HAVE_NMI_WATCHDOG is used only on sparc64. It makes it clear
when the sparc64-specific watchdog is used.

  + HAVE_HARDLOCKUP_DETECTOR_ARCH is basically compatible with
the generic watchdogs. As a result, the common code
is marked by ifdef CONFIG_HARDLOCKUP_DETECTOR.

As a result:

  + Some conditions are easier.

  + Some conditions use HAVE_HARDLOCKUP_DETECTOR_ARCH instead of
HAVE_NMI_WATCHDOG.

Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF,
and HARDLOCKUP_DETECTOR_BUDDY might depend only on
HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR,
which is no longer enabled when HAVE_NMI_WATCHDOG is set.

Note that asm/nmi.h still has to be included for all arch-specific
watchdogs. It declares more functions that are used in other
arch-specific code that includes only linux/nmi.h.

Signed-off-by: Petr Mladek 
---
 arch/Kconfig|  7 +--
 include/linux/nmi.h |  5 ++---
 lib/Kconfig.debug   | 16 +++-
 3 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 13c6e596cf9e..57f15babe188 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -404,10 +404,9 @@ config HAVE_NMI_WATCHDOG
depends on HAVE_NMI
bool
help
- The arch provides its own hardlockup detector implementation instead
+ Sparc64 provides its own hardlockup detector implementation instead
  of the generic perf one.
 
- Sparc64 defines this variable without HAVE_HARDLOCKUP_DETECTOR_ARCH.
  It does _not_ use the command line parameters and sysctl interface
  used by generic hardlockup detectors. Instead it is enabled/disabled
  by the top-level watchdog interface that is common for both softlockup
@@ -415,7 +414,6 @@ config HAVE_NMI_WATCHDOG
 
 config HAVE_HARDLOCKUP_DETECTOR_ARCH
bool
-   select HAVE_NMI_WATCHDOG
help
  The arch provides its own hardlockup detector implementation instead
  of the generic ones.
@@ -423,9 +421,6 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
  It uses the same command line parameters, and sysctl interface,
  as the generic hardlockup detectors.
 
- HAVE_NMI_WATCHDOG is selected to build the code shared with
- the sparc64 specific implementation.
-
 config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index b9e816bde14a..800196c78f65 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -9,7 +9,7 @@
 #include 
 
 /* Arch specific watchdogs might need to share extra watchdog-related APIs. */
-#if defined(CONFIG_HAVE_NMI_WATCHDOG)
+#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HAVE_NMI_WATCHDOG)
 #include 
 #endif
 
@@ -92,8 +92,7 @@ static inline void hardlockup_detector_disable(void) {}
 #endif
 
 /* Sparc64 has a special implementation that is always enabled. */
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || \
-(defined(CONFIG_HAVE_NMI_WATCHDOG) && !defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH))
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_NMI_WATCHDOG)
 void arch_touch_nmi_watchdog(void);
 #else
 static inline void arch_touch_nmi_watchdog(void) { }
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d201f5d3876b..4b4aa0f941f9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1050,15 +1050,13 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
 #  sparc64: has a custom implementation which is not using the common
 #  hardlockup command line options and sysctl interface.
 #
-# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
-# implementaion. It is automatically enabled also for other arch-specific
-# 

[PATCH 3/7] watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h

2023-06-07 Thread Petr Mladek
arch_touch_nmi_watchdog() needs a different implementation for various
hardlockup detector implementations. And it does nothing when no
hardlockup detector is built at all.

arch_touch_nmi_watchdog() has to be declared in linux/nmi.h. It is done
directly in this header file for the perf and buddy detectors. And it
is done in the included asm/nmi.h for arch-specific detectors.

The reason probably is that the arch-specific variants build the code
under different conditions. For example, powerpc64 builds the code only
when CONFIG_PPC_WATCHDOG is enabled.

Another reason might be that these architectures define more functions
in asm/nmi.h anyway.

However the generic code actually knows the information. The config
variables HAVE_NMI_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_ARCH are used
to decide whether to build the buddy detector.

In particular, CONFIG_HARDLOCKUP_DETECTOR is set only when a generic
or arch-specific hardlockup detector is built. The only exception
is sparc64 which ignores the global HARDLOCKUP_DETECTOR switch.

The information about sparc64 is a bit complicated. The hardlockup
detector is built there when CONFIG_HAVE_NMI_WATCHDOG is set and
CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

People might wonder whether this change really makes things easier.
The motivation is:

  + The current logic in linux/nmi.h is far from obvious.
For example, arch_touch_nmi_watchdog() is defined as {} when
neither CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER nor
CONFIG_HAVE_NMI_WATCHDOG is defined.

  + The change synchronizes the checks in lib/Kconfig.debug and
in the generic code.

  + It is a step that will help cleaning HAVE_NMI_WATCHDOG related
checks.

The change should not change the existing behavior.

Signed-off-by: Petr Mladek 
---
 arch/powerpc/include/asm/nmi.h |  2 --
 arch/sparc/include/asm/nmi.h   |  1 -
 include/linux/nmi.h| 13 ++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h
index 43bfd4de868f..ce25318c3902 100644
--- a/arch/powerpc/include/asm/nmi.h
+++ b/arch/powerpc/include/asm/nmi.h
@@ -3,11 +3,9 @@
 #define _ASM_NMI_H
 
 #ifdef CONFIG_PPC_WATCHDOG
-extern void arch_touch_nmi_watchdog(void);
 long soft_nmi_interrupt(struct pt_regs *regs);
 void watchdog_hardlockup_set_timeout_pct(u64 pct);
 #else
-static inline void arch_touch_nmi_watchdog(void) {}
 static inline void watchdog_hardlockup_set_timeout_pct(u64 pct) {}
 #endif
 
diff --git a/arch/sparc/include/asm/nmi.h b/arch/sparc/include/asm/nmi.h
index 90ee7863d9fe..920dc23f443f 100644
--- a/arch/sparc/include/asm/nmi.h
+++ b/arch/sparc/include/asm/nmi.h
@@ -8,7 +8,6 @@ void nmi_adjust_hz(unsigned int new_hz);
 
 extern atomic_t nmi_active;
 
-void arch_touch_nmi_watchdog(void);
 void start_nmi_watchdog(void *unused);
 void stop_nmi_watchdog(void *unused);
 
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index b5d0b7ab52fb..b9e816bde14a 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -7,6 +7,8 @@
 
 #include 
 #include 
+
+/* Arch specific watchdogs might need to share extra watchdog-related APIs. */
 #if defined(CONFIG_HAVE_NMI_WATCHDOG)
 #include 
 #endif
@@ -89,12 +91,17 @@ extern unsigned int hardlockup_panic;
 static inline void hardlockup_detector_disable(void) {}
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
+/* Sparc64 has a special implementation that is always enabled. */
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || \
+(defined(CONFIG_HAVE_NMI_WATCHDOG) && !defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH))
 void arch_touch_nmi_watchdog(void);
+#else
+static inline void arch_touch_nmi_watchdog(void) { }
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
 void watchdog_hardlockup_touch_cpu(unsigned int cpu);
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs);
-#elif !defined(CONFIG_HAVE_NMI_WATCHDOG)
-static inline void arch_touch_nmi_watchdog(void) { }
 #endif
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
-- 
2.35.3



[PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-07 Thread Petr Mladek
There are four possible variants of hardlockup detectors:

  + buddy: available when SMP is set.

  + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.

  + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.

  + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

The check for the sparc64 variant is more complicated because
HAVE_NMI_WATCHDOG is used to #ifdef code shared by the arch-specific
and sparc64-specific variants. Therefore it is automatically
selected with HAVE_HARDLOCKUP_DETECTOR_ARCH.

This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH.
It reduces the size of some checks but it makes them harder to follow.

Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH
is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global
HARDLOCKUP_DETECTOR switch is enabled/disabled.

Make the logic more straightforward by the following changes:

  + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and
HAVE_NMI_WATCHDOG in comments.

  + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate
HAVE_* for all four hardlockup detector variants.

Use it in the other conditions instead of SMP. It makes it
clear that it is about the buddy detector.

  + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR
and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand
the conditions between the four hardlockup detector variants.

  + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY
can be enabled. It explains the dependency on the other
hardlockup detector variants.

Also, it allows removing HARDLOCKUP_DETECTOR_NON_ARCH by using "imply".
It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when
the global HARDLOCKUP_DETECTOR switch is changed.

  + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables
disappear when the hardlockup detectors are disabled.

Another nice side effect is that the HARDLOCKUP_DETECTOR_PREFER_BUDDY
value is not preserved when the global switch is disabled.
The user has to make the decision again when it is re-enabled.

Signed-off-by: Petr Mladek 
---
 arch/Kconfig  | 22 -
 lib/Kconfig.debug | 63 ---
 2 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 422f0ffa269e..13c6e596cf9e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -404,17 +404,27 @@ config HAVE_NMI_WATCHDOG
depends on HAVE_NMI
bool
help
- The arch provides a low level NMI watchdog. It provides
- asm/nmi.h, and defines its own watchdog_hardlockup_probe() and
- arch_touch_nmi_watchdog().
+ The arch provides its own hardlockup detector implementation instead
+ of the generic perf one.
+
+ Sparc64 defines this variable without HAVE_HARDLOCKUP_DETECTOR_ARCH.
+ It does _not_ use the command line parameters and sysctl interface
+ used by generic hardlockup detectors. Instead it is enabled/disabled
+ by the top-level watchdog interface that is common for both softlockup
+ and hardlockup detectors.
 
 config HAVE_HARDLOCKUP_DETECTOR_ARCH
bool
select HAVE_NMI_WATCHDOG
help
- The arch chooses to provide its own hardlockup detector, which is
- a superset of the HAVE_NMI_WATCHDOG. It also conforms to config
- interfaces and parameters provided by hardlockup detector subsystem.
+ The arch provides its own hardlockup detector implementation instead
+ of the generic ones.
+
+ It uses the same command line parameters, and sysctl interface,
+ as the generic hardlockup detectors.
+
+ HAVE_NMI_WATCHDOG is selected to build the code shared with
+ the sparc64 specific implementation.
 
 config HAVE_PERF_REGS
bool
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3e91fa33c7a0..d201f5d3876b 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1035,16 +1035,33 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 
  Say N if unsure.
 
+config HAVE_HARDLOCKUP_DETECTOR_BUDDY
+   bool
+   depends on SMP
+   default y
+
 #
-# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
-# lockup detector rather than the perf based detector.
+# Global switch whether to build a hardlockup detector at all. It is available
+# only when the architecture supports at least one implementation. There are
+# two exceptions. The hardlockup detector is never enabled on:
+#
+#  s390: it reported many false positives there
+#
+#  sparc64: has a custom implementation which is not using the common
+#  hardlockup command line options and sysctl interface.
+#
+# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
+# implementation. It is automatically 

[PATCH 1/7] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-07 Thread Petr Mladek
There are four possible variants of hardlockup detectors:

  + buddy: available when SMP is set.

  + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.

  + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.

  + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

Only one hardlockup detector can be compiled in. The selection is done
using quite complex dependencies between several CONFIG variables.
The following patches will try to make it more straightforward.

As a first step, reorder the definitions of the various CONFIG variables.
The logical order is:

   1. HAVE_* variables define available variants. They are typically
  defined in the arch/ config files.

   2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
  detector is enabled at all.

   3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
  the buddy detector should be preferred over the perf one.
  Note that the arch specific variants are always preferred when
  available.

   4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
  detector is enabled in the end.

   5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
  are temporary variables that are going to be removed in
  a followup patch.

The patch just shuffles the definitions. It should not change the existing
behavior.

Signed-off-by: Petr Mladek 
---
 lib/Kconfig.debug | 112 +++---
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ed7b01c4bd41..3e91fa33c7a0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1035,62 +1035,6 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 
  Say N if unsure.
 
-# Both the "perf" and "buddy" hardlockup detectors count hrtimer
-# interrupts. This config enables functions managing this common code.
-config HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
-   bool
-   select SOFTLOCKUP_DETECTOR
-
-config HARDLOCKUP_DETECTOR_PERF
-   bool
-   depends on HAVE_HARDLOCKUP_DETECTOR_PERF
-   select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
-
-config HARDLOCKUP_DETECTOR_BUDDY
-   bool
-   depends on SMP
-   select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
-
-# For hardlockup detectors you can have one directly provided by the arch
-# or use a "non-arch" one. If you're using a "non-arch" one that is
-# further divided the perf hardlockup detector (which, confusingly, needs
-# arch-provided perf support) and the buddy hardlockup detector (which just
-# needs SMP). In either case, using the "non-arch" code conflicts with
-# the NMI watchdog code (which is sometimes used directly and sometimes used
-# by the arch-provided hardlockup detector).
-config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
-   bool
-   depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG
-   default y
-
-config HARDLOCKUP_DETECTOR_PREFER_BUDDY
-   bool "Prefer the buddy CPU hardlockup detector"
-   depends on HAVE_HARDLOCKUP_DETECTOR_PERF && SMP
-   help
- Say Y here to prefer the buddy hardlockup detector over the perf one.
-
- With the buddy detector, each CPU uses its softlockup hrtimer
- to check that the next CPU is processing hrtimer interrupts by
- verifying that a counter is increasing.
-
- This hardlockup detector is useful on systems that don't have
- an arch-specific hardlockup detector or if resources needed
- for the hardlockup detector are better used for other things.
-
-# This will select the appropriate non-arch hardlockdup detector
-config HARDLOCKUP_DETECTOR_NON_ARCH
-   bool
-   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
-   select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY
-   select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY
-
-#
-# Enables a timestamp based low pass filter to compensate for perf based
-# hard lockup detection which runs too fast due to turbo modes.
-#
-config HARDLOCKUP_CHECK_TIMESTAMP
-   bool
-
 #
 # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
 # lockup detector rather than the perf based detector.
@@ -,6 +1055,62 @@ config HARDLOCKUP_DETECTOR
  chance to run.  The current stack trace is displayed upon detection
  and the system will stay locked up.
 
+config HARDLOCKUP_DETECTOR_PREFER_BUDDY
+   bool "Prefer the buddy CPU hardlockup detector"
+   depends on HAVE_HARDLOCKUP_DETECTOR_PERF && SMP
+   help
+ Say Y here to prefer the buddy hardlockup detector over the perf one.
+
+ With the buddy detector, each CPU uses its softlockup hrtimer
+ to check that the next CPU is processing hrtimer interrupts by
+ verifying that a counter is increasing.
+
+ 

[PATCH 0/7] watchdog/hardlockup: Cleanup configuration of hardlockup detectors

2023-06-07 Thread Petr Mladek
Hi,

this patchset is supposed to replace the last patch in the patchset cleaning
up after introducing the buddy detector, see
https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid

There are four possible variants of hardlockup detectors:

  + buddy: available when SMP is set.

  + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.

  + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.

  + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

Only one hardlockup detector can be compiled in. The selection is done
using quite complex dependencies between several CONFIG variables.
The following patches will try to make it more straightforward.

Before, the decision was done using the following variables:

+ HAVE_HARDLOCKUP_DETECTOR_PERF
+ HAVE_HARDLOCKUP_DETECTOR_BUDDY
+ HAVE_HARDLOCKUP_DETECTOR_ARCH
+ HAVE_NMI_WATCHDOG
 
+ HARDLOCKUP_DETECTOR
+ HARDLOCKUP_DETECTOR_PREFER_BUDDY

+ HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+ HARDLOCKUP_DETECTOR_NON_ARCH

+ HARDLOCKUP_DETECTOR_PERF
+ HARDLOCKUP_DETECTOR_BUDDY

   and the particular watchdog was used when the following variables were set:

+ perf:  HARDLOCKUP_DETECTOR_PERF
+ buddy: HARDLOCKUP_DETECTOR_BUDDY
+ arch-specific: HAVE_HARDLOCKUP_DETECTOR_ARCH
+ sparc64:   HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH


After, the decision is done using the following variables:

+ HAVE_HARDLOCKUP_DETECTOR_PERF
+ HAVE_HARDLOCKUP_DETECTOR_BUDDY
+ HAVE_HARDLOCKUP_DETECTOR_ARCH
+ HAVE_HARDLOCKUP_DETECTOR_SPARC64
 
+ HARDLOCKUP_DETECTOR
+ HARDLOCKUP_DETECTOR_PREFER_BUDDY

+ HARDLOCKUP_DETECTOR_PERF
+ HARDLOCKUP_DETECTOR_BUDDY
+ HARDLOCKUP_DETECTOR_ARCH
+ HARDLOCKUP_DETECTOR_SPARC64

   and the particular watchdog is used when one of these variables is set:

+ perf:  HARDLOCKUP_DETECTOR_PERF
+ buddy: HARDLOCKUP_DETECTOR_BUDDY
+ arch-specific: HARDLOCKUP_DETECTOR_ARCH
+ sparc64:   HARDLOCKUP_DETECTOR_SPARC64


Plus, many checks are more straightforward and even self-explanatory.

I build- and run-tested it on x86_64. I only checked the generated
.config after using sparc_defconfig, sparc64_defconfig, ppc64_defconfig,
and ppc40x_defconfig.

Best Regards,
Petr

Petr Mladek (7):
  watchdog/hardlockup: Sort hardlockup detector related config values a
logical way
  watchdog/hardlockup: Make the config checks more straightforward
  watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in
linux/nmi.h
  watchdog/hardlockup: Enable HAVE_NMI_WATCHDOG only on sparc64
  watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to
HAVE_HARDLOCKUP_DETECTOR_SPARC64
  watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64
  watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

 arch/Kconfig   |  17 ++---
 arch/powerpc/Kconfig   |   5 +-
 arch/powerpc/include/asm/nmi.h |   2 -
 arch/sparc/Kconfig |   2 +-
 arch/sparc/Kconfig.debug   |  20 ++
 arch/sparc/include/asm/nmi.h   |   1 -
 include/linux/nmi.h|  14 ++--
 kernel/watchdog.c  |   2 +-
 lib/Kconfig.debug  | 115 +++--
 9 files changed, 104 insertions(+), 74 deletions(-)

-- 
2.35.3



[PATCH 1/2] powerpc/book3s64/mm: Remove radix_mem_block_size

2023-06-07 Thread Aneesh Kumar K.V
Simplify the powernv memory_block_size callback by directly using the 1G
value instead of radix_mem_block_size. Also, remove the variable and use
memory_block_size_bytes() in the kernel mapping function. This gets called
in radix__early_init_mmu(), which runs after probe_machine() has been
called and the correct machine-specific callback has been assigned.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  5 --
 arch/powerpc/mm/book3s64/radix_pgtable.c | 64 +---
 arch/powerpc/platforms/powernv/setup.c   |  4 +-
 3 files changed, 3 insertions(+), 70 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 570a4960cf17..7b7643f32e0e 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -71,11 +71,6 @@ extern unsigned int mmu_pid_bits;
 /* Base PID to allocate from */
 extern unsigned int mmu_base_pid;
 
-/*
- * memory block size used with radix translation.
- */
-extern unsigned long __ro_after_init radix_mem_block_size;
-
 #define PRTB_SIZE_SHIFT(mmu_pid_bits + 4)
 #define PRTB_ENTRIES   (1ul << mmu_pid_bits)
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 2297aa764ecd..905cfa0af1ae 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -37,7 +37,6 @@
 #include 
 
 unsigned int mmu_base_pid;
-unsigned long radix_mem_block_size __ro_after_init;
 
 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
unsigned long region_start, unsigned long region_end)
@@ -300,7 +299,7 @@ static int __meminit create_physical_mapping(unsigned long start,
bool prev_exec, exec = false;
pgprot_t prot;
int psize;
-   unsigned long max_mapping_size = radix_mem_block_size;
+   unsigned long max_mapping_size = memory_block_size_bytes();
 
if (debug_pagealloc_enabled_or_kfence())
max_mapping_size = PAGE_SIZE;
@@ -502,58 +501,6 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
return 1;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __init probe_memory_block_size(unsigned long node, const char *uname, int
- depth, void *data)
-{
-   unsigned long *mem_block_size = (unsigned long *)data;
-   const __be32 *prop;
-   int len;
-
-   if (depth != 1)
-   return 0;
-
-   if (strcmp(uname, "ibm,dynamic-reconfiguration-memory"))
-   return 0;
-
-   prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
-
-   if (!prop || len < dt_root_size_cells * sizeof(__be32))
-   /*
-* Nothing in the device tree
-*/
-   *mem_block_size = MIN_MEMORY_BLOCK_SIZE;
-   else
-   *mem_block_size = of_read_number(prop, dt_root_size_cells);
-   return 1;
-}
-
-static unsigned long __init radix_memory_block_size(void)
-{
-   unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
-
-   /*
-* OPAL firmware feature is set by now. Hence we are ok
-* to test OPAL feature.
-*/
-   if (firmware_has_feature(FW_FEATURE_OPAL))
-   mem_block_size = 1UL * 1024 * 1024 * 1024;
-   else
-   of_scan_flat_dt(probe_memory_block_size, &mem_block_size);
-
-   return mem_block_size;
-}
-
-#else   /* CONFIG_MEMORY_HOTPLUG */
-
-static unsigned long __init radix_memory_block_size(void)
-{
-   return 1UL * 1024 * 1024 * 1024;
-}
-
-#endif /* CONFIG_MEMORY_HOTPLUG */
-
-
 void __init radix__early_init_devtree(void)
 {
int rc;
@@ -578,15 +525,6 @@ void __init radix__early_init_devtree(void)
psize_to_rpti_pgsize(MMU_PAGE_64K);
}
 
-   /*
-* Max mapping size used when mapping pages. We don't use
-* ppc_md.memory_block_size() here because this get called
-* early and we don't have machine probe called yet. Also
-* the pseries implementation only check for ibm,lmb-size.
-* All hypervisor supporting radix do expose that device
-* tree node.
-*/
-   radix_mem_block_size = radix_memory_block_size();
return;
 }
 
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 5e9c6b55809f..29f855c66ad3 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -488,9 +488,9 @@ static unsigned long pnv_memory_block_size(void)
 * this size.
 */
if (radix_enabled())
-   return radix_mem_block_size;
+   return 1UL << 30;
else
-   return 256UL * 1024 * 1024;
+   return 256UL << 20;
 }
 #endif
 
-- 
2.40.1



[PATCH 2/2] powerpc/mm: Add memory_block_size as a kernel parameter

2023-06-07 Thread Aneesh Kumar K.V
Certain devices can possess non-standard memory capacities, not constrained
to multiples of 1GB. Provide a kernel parameter so that we can map the
device memory completely on memory hotplug.

Restrict the memory_block_size value to a power of 2, similar to the LMB
size. The memory block size must also be larger than the section size.

Signed-off-by: Aneesh Kumar K.V 
---
 .../admin-guide/kernel-parameters.txt |  3 +++
 arch/powerpc/kernel/setup_64.c| 27 +++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..833b8c5b4b4c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,9 @@
Note that even when enabled, there are a few cases where
the feature is not effective.
 
+   memory_block_size=size [PPC]
+Use this parameter to configure the memory block size value.
+
memtest=[KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
Format: 
default : 0 
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 246201d0d879..54ce748a340d 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -885,13 +885,40 @@ void __init setup_per_cpu_areas(void)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static unsigned long set_mem_block_size;
 unsigned long memory_block_size_bytes(void)
 {
+   if (set_mem_block_size)
+   return set_mem_block_size;
+
if (ppc_md.memory_block_size)
return ppc_md.memory_block_size();
 
return MIN_MEMORY_BLOCK_SIZE;
 }
+
+/*
+ * Restrict to a power of 2 value for memblock which is larger than
+ * section size
+ */
+static int __init parse_mem_block_size(char *ptr)
+{
+   unsigned int order;
+   unsigned long size = memparse(ptr, NULL);
+
+   order = fls64(size);
+   if (!order)
+   return 0;
+
+   order--;
+   if (order < SECTION_SIZE_BITS)
+   return 0;
+
+   set_mem_block_size = 1UL << order;
+
+   return 0;
+}
+early_param("memory_block_size", parse_mem_block_size);
 #endif
 
 #if defined(CONFIG_PPC_INDIRECT_PIO) || defined(CONFIG_PPC_INDIRECT_MMIO)
-- 
2.40.1



Re: [PATCH net-next v2 5/8] net: fs_enet: Convert to platform remove callback returning void

2023-06-07 Thread Simon Horman
On Tue, Jun 06, 2023 at 06:28:26PM +0200, Uwe Kleine-König wrote:
> The .remove() callback for a platform driver returns an int which makes
> many driver authors wrongly assume it's possible to do error handling by
> returning an error code. However the value returned is (mostly) ignored
> and this typically results in resource leaks. To improve here there is a
> quest to make the remove callback return void. In the first step of this
> quest all drivers are converted to .remove_new() which already returns
> void.
> 
> Trivially convert this driver from always returning zero in the remove
> callback to the void returning variant.
> 
> Reviewed-by: Michal Kubiak 
> Signed-off-by: Uwe Kleine-König 

Reviewed-by: Simon Horman 



Re: [PATCH net-next v2 8/8] net: ucc_geth: Convert to platform remove callback returning void

2023-06-07 Thread Simon Horman
On Tue, Jun 06, 2023 at 06:28:29PM +0200, Uwe Kleine-König wrote:
> The .remove() callback for a platform driver returns an int which makes
> many driver authors wrongly assume it's possible to do error handling by
> returning an error code. However the value returned is (mostly) ignored
> and this typically results in resource leaks. To improve here there is a
> quest to make the remove callback return void. In the first step of this
> quest all drivers are converted to .remove_new() which already returns
> void.
> 
> Trivially convert this driver from always returning zero in the remove
> callback to the void returning variant.
> 
> Reviewed-by: Michal Kubiak 
> Signed-off-by: Uwe Kleine-König 

Reviewed-by: Simon Horman 



Re: Passing the complex args in the GPR's

2023-06-07 Thread Michael Matz
Hey,

On Tue, 6 Jun 2023, Umesh Kalappa via Gcc wrote:

> Question is : Why does GCC choose to use GPR's here and have any
> reference to support this decision  ?

You explicitly used -m32 ppc, so 
https://www.polyomino.org.uk/publications/2011/Power-Arch-32-bit-ABI-supp-1.0-Unified.pdf
 
applies.  It explicitly states in "B.1 ATR-Linux Inclusion and 
Conformance" that it is "ATR-PASS-COMPLEX-IN-GPRS", and other sections 
detail what that means (namely passing complex args in r3 .. r10, whatever 
fits).  GCC adheres to that, and has to.

The history how that came to be was explained in the thread.


Ciao,
Michael.

 > 
> Thank you
> ~Umesh
> 
> 
> 
> On Tue, Jun 6, 2023 at 10:16 PM Segher Boessenkool
>  wrote:
> >
> > Hi!
> >
> > On Tue, Jun 06, 2023 at 08:35:22PM +0530, Umesh Kalappa wrote:
> > > Hi Adnrew,
> > > Thank you for the quick response and for PPC64 too ,we do have
> > > mismatches in ABI b/w complex operations like
> > > https://godbolt.org/z/bjsYovx4c .
> > >
> > > Any reason why GCC chose to use GPR 's here ?
> >
> > What did you expect, what happened instead?  Why did you expect that,
> > and why then is it an error what did happen?
> >
> > You used -O0.  As long as the code works, all is fine.  But unoptimised
> > code frequently is hard to read, please use -O2 instead?
> >
> > As Andrew says, why did you use -m32 for GCC but -m64 for LLVM?  It is
> > hard to compare those at all!  32-bit PowerPC Linux ABI (based on 32-bit
> > PowerPC ELF ABI from 1995, BE version) vs. 64-bit ELFv2 ABI from 2015
> > (LE version).
> >
> >
> > Segher
> 
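The ABI question in this thread is purely about call lowering; the C source is identical either way. A minimal illustrative function (not code from the thread) whose complex argument travels in GPRs r3..r6 under the 32-bit PowerPC Linux ABI (per "ATR-PASS-COMPLEX-IN-GPRS"), but in floating-point registers under the 64-bit ELFv2 ABI:

```c
#include <assert.h>
#include <complex.h>

/* The lowering of this call differs between -m32 (complex parts passed
 * in r3..r10, whatever fits) and ELFv2 -m64 (FPRs), but the semantics
 * do not change. */
static double _Complex scale(double _Complex z, double k)
{
        return z * k;
}
```

Comparing the generated assembly for this one function at -O2 with -m32 vs -m64 shows exactly the register-class difference discussed above.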

Re: [PATCH] security/integrity: fix pointer to ESL data and its size on pseries

2023-06-07 Thread Nayna



On 6/6/23 16:51, Jarkko Sakkinen wrote:

On Tue Jun 6, 2023 at 8:26 PM EEST, Nayna Jain wrote:

On PowerVM guest, variable data is prefixed with 8 bytes of timestamp.
Extract ESL by stripping off the timestamp before passing to ESL parser.


Cc: sta...@vger.kenrnel.org # v6.3

?


Aah yes. Missed that.. Thanks..





Fixes: 4b3e71e9a34c ("integrity/powerpc: Support loading keys from PLPKS")
Signed-off-by: Nayna Jain 
---
  .../integrity/platform_certs/load_powerpc.c   | 39 ---
  1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/security/integrity/platform_certs/load_powerpc.c 
b/security/integrity/platform_certs/load_powerpc.c
index b9de70b90826..57768cbf1fd3 100644
--- a/security/integrity/platform_certs/load_powerpc.c
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -15,6 +15,9 @@
  #include "keyring_handler.h"
  #include "../integrity.h"
  
+#define extract_data(db, data, size, offset)	\
+   do { db = data + offset; size = size - offset; } while (0)
+
  /*
   * Get a certificate list blob from the named secure variable.
   *
@@ -55,8 +58,10 @@ static __init void *get_cert_list(u8 *key, unsigned long 
keylen, u64 *size)
   */
  static int __init load_powerpc_certs(void)
  {
+   void *data = NULL;
+   u64 dsize = 0;
+   u64 offset = 0;
void *db = NULL, *dbx = NULL;

So... what do you need db still for?

If you meant to rename 'db' to 'data', then you should not do it, since this is
a bug fix. It is zero gain, and a factor harder backport.


In case of PowerVM guest, data points to timestamp + ESL.  And then with 
offset of 8 bytes, db points to ESL.


While db is used for parsing ESL, data is then later used to free 
(timestamp + ESL) memory.


Hope it answers the question.

Thanks & Regards,

    - Nayna





-   u64 dbsize = 0, dbxsize = 0;
int rc = 0;
ssize_t len;
char buf[32];
@@ -74,38 +79,46 @@ static int __init load_powerpc_certs(void)
return -ENODEV;
}
  
+	if (strcmp("ibm,plpks-sb-v1", buf) == 0)
+   /* PLPKS authenticated variables ESL data is prefixed with 8 bytes of timestamp */
+   offset = 8;
+
/*
 * Get db, and dbx. They might not exist, so it isn't an error if we
 * can't get them.
 */
-   db = get_cert_list("db", 3, &dbsize);
-   if (!db) {
+   data = get_cert_list("db", 3, &dsize);
+   if (!data) {
pr_info("Couldn't get db list from firmware\n");
-   } else if (IS_ERR(db)) {
-   rc = PTR_ERR(db);
+   } else if (IS_ERR(data)) {
+   rc = PTR_ERR(data);
pr_err("Error reading db from firmware: %d\n", rc);
return rc;
} else {
-   rc = parse_efi_signature_list("powerpc:db", db, dbsize,
+   extract_data(db, data, dsize, offset);
+
+   rc = parse_efi_signature_list("powerpc:db", db, dsize,
  get_handler_for_db);
if (rc)
pr_err("Couldn't parse db signatures: %d\n", rc);
-   kfree(db);
+   kfree(data);
}
  
-	dbx = get_cert_list("dbx", 4, &dbxsize);
-   if (!dbx) {
+   data = get_cert_list("dbx", 4, &dsize);
+   if (!data) {
pr_info("Couldn't get dbx list from firmware\n");
-   } else if (IS_ERR(dbx)) {
-   rc = PTR_ERR(dbx);
+   } else if (IS_ERR(data)) {
+   rc = PTR_ERR(data);
pr_err("Error reading dbx from firmware: %d\n", rc);
return rc;
} else {
-   rc = parse_efi_signature_list("powerpc:dbx", dbx, dbxsize,
+   extract_data(dbx, data, dsize, offset);
+
+   rc = parse_efi_signature_list("powerpc:dbx", dbx, dsize,
  get_handler_for_dbx);
if (rc)
pr_err("Couldn't parse dbx signatures: %d\n", rc);
-   kfree(dbx);
+   kfree(data);
}
  
  	return rc;

--
2.31.1

BR, Jarkko
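The fix above boils down to simple pointer arithmetic: on PLPKS hosts the blob returned by get_cert_list() is an 8-byte timestamp followed by the ESL, so the parser must see data + 8 with the size reduced accordingly, while kfree() still takes the original pointer. A stand-alone sketch of that split — names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the patch's extract_data() macro: 'esl' points past the
 * timestamp prefix inside 'data', which remains the pointer to free. */
static void extract_data_model(const uint8_t **esl, size_t *esl_size,
                               const uint8_t *data, size_t dsize,
                               size_t offset)
{
        *esl = data + offset;
        *esl_size = dsize - offset;
}
```

With offset 0 (the non-PLPKS case) the ESL view and the allocation coincide, which is why the same code path serves both firmware variants.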


Re: [PATCH] powerpc/64s: Fix VAS mm use after free

2023-06-07 Thread Sachin Sant



> On 07-Jun-2023, at 3:40 PM, Nicholas Piggin  wrote:
> 
> The refcount on mm is dropped before the coprocessor is detached.
> 
> Reported-by: Sachin Sant 
> Fixes: 7bc6f71bdff5f ("powerpc/vas: Define and use common vas_window struct")
> Fixes: b22f2d88e435c ("powerpc/pseries/vas: Integrate API with open/close 
> windows")
> Signed-off-by: Nicholas Piggin 
> ---
> How's this for fixing your vas_deallocate_window warning at
> radix_tlb.c:991 ?
> 
> I added a few new warnings in the TLB flush code recently which is
> why these new warns are showing up.
> 

Thanks Nick. This fixes the reported warning.
Nx-gzip as well as mce error inject tests completed successfully.

Tested-by: Sachin Sant 

- Sachin


[PATCH] powerpc/64s: Fix VAS mm use after free

2023-06-07 Thread Nicholas Piggin
The refcount on mm is dropped before the coprocessor is detached.

Reported-by: Sachin Sant 
Fixes: 7bc6f71bdff5f ("powerpc/vas: Define and use common vas_window struct")
Fixes: b22f2d88e435c ("powerpc/pseries/vas: Integrate API with open/close 
windows")
Signed-off-by: Nicholas Piggin 
---
How's this for fixing your vas_deallocate_window warning at
radix_tlb.c:991 ?

I added a few new warnings in the TLB flush code recently which is
why these new warns are showing up.

Thanks,
Nick

 arch/powerpc/platforms/powernv/vas-window.c | 2 +-
 arch/powerpc/platforms/pseries/vas.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 0072682531d8..b664838008c1 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1310,8 +1310,8 @@ int vas_win_close(struct vas_window *vwin)
/* if send window, drop reference to matching receive window */
if (window->tx_win) {
if (window->user_win) {
-   put_vas_user_win_ref(&vwin->task_ref);
mm_context_remove_vas_window(vwin->task_ref.mm);
+   put_vas_user_win_ref(&vwin->task_ref);
}
put_rx_win(window->rxwin);
}
diff --git a/arch/powerpc/platforms/pseries/vas.c 
b/arch/powerpc/platforms/pseries/vas.c
index 513180467562..9a44a98ba342 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -507,8 +507,8 @@ static int vas_deallocate_window(struct vas_window *vwin)
vascaps[win->win_type].nr_open_windows--;
mutex_unlock(&vas_pseries_mutex);
 
-   put_vas_user_win_ref(&vwin->task_ref);
mm_context_remove_vas_window(vwin->task_ref.mm);
+   put_vas_user_win_ref(&vwin->task_ref);
 
kfree(win);
return 0;
-- 
2.40.1
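The bug pattern fixed here is generic: an object kept alive by a refcount must not be touched after its reference is dropped, so the teardown step that still dereferences the mm has to run before the put. A minimal model of the two orderings — this is illustrative code, not the kernel's mm API:

```c
#include <assert.h>

/* A refcounted object; dropping the last reference "frees" it. */
struct obj {
        int refs;
        int alive;
};

static int obj_put(struct obj *o)       /* like put_vas_user_win_ref() */
{
        if (--o->refs == 0) {
                o->alive = 0;           /* object is freed here */
                return 1;
        }
        return 0;
}

static int obj_use(struct obj *o)       /* like mm_context_remove_vas_window() */
{
        return o->alive;                /* 0 means we touched freed memory */
}
```

The patch simply swaps the two calls so the "use" happens while the reference still pins the object, exactly as in the model's safe ordering.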



WARNING at arch/powerpc/mm/book3s64/radix_tlb.c:991 (flush_all_mm)

2023-06-07 Thread Sachin Sant
While running powerpc selftests on a Power10 LPAR, following warning
is seen:

[   43.878929] MCE: CPU25: PID: 6025 Comm: inject-ra-err NIP: [1e48]
[   43.878933] MCE: CPU25: Initiator CPU
[   43.878935] MCE: CPU25: Unknown
[   43.906866] [ cut here ]
[   43.906870] WARNING: CPU: 25 PID: 6025 at 
arch/powerpc/mm/book3s64/radix_tlb.c:991 __flush_all_mm+0x1c0/0x2c0
[   43.906878] Modules linked in: dm_mod(E) nft_fib_inet(E) nft_fib_ipv4(E) 
nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) 
nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) 
nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bonding(E) tls(E) rfkill(E) 
ip_set(E) sunrpc(E) nf_tables(E) nfnetlink(E) pseries_rng(E) 
aes_gcm_p10_crypto(E) drm(E) drm_panel_orientation_quirks(E) xfs(E) 
libcrc32c(E) sd_mod(E) t10_pi(E) sr_mod(E) crc64_rocksoft_generic(E) cdrom(E) 
crc64_rocksoft(E) crc64(E) sg(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) 
vmx_crypto(E) fuse(E)
[   43.906917] CPU: 25 PID: 6025 Comm: inject-ra-err Kdump: loaded Tainted: G   
ME  6.4.0-rc5-00016-ga4d7d7011219 #2
[   43.906922] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries
[   43.906925] NIP:  c00a94a0 LR: c011b7d4 CTR: 007b
[   43.906929] REGS: c000ded9b8c0 TRAP: 0700   Tainted: G   ME  
 (6.4.0-rc5-00016-ga4d7d7011219)
[   43.906933] MSR:  8282b033   CR: 
24002484  XER: 2004
[   43.906942] CFAR: c00a930c IRQMASK: 0  [   43.906942] GPR00: 
c011b7d4 c000ded9bb60 c14a1500 c000aaf15a00  [   
43.906942] GPR04:  802a000c c000aaf15a00 
802a000b  [   43.906942] GPR08: 00380201  
00380200 0001  [   43.906942] GPR12: c0159cc0 
c0135faaf300    [   43.906942] GPR16: 
     [   
43.906942] GPR20:    
  [   43.906942] GPR24:   
 c4208580  [   43.906942] GPR28: c25f9930 
c2b9c038  c000a0a58600  [   43.906979] NIP 
[c00a94a0] __flush_all_mm+0x1c0/0x2c0
[   43.906982] LR [c011b7d4] vas_deallocate_window+0x174/0x300
[   43.906987] Call Trace:
[   43.906988] [c000ded9bb60] [c000ded9bbb0] 0xc000ded9bbb0 
(unreliable)
[   43.906994] [c000ded9bbb0] [c011b918] 
vas_deallocate_window+0x2b8/0x300
[   43.906998] [c000ded9bc30] [c011cec0] coproc_release+0x60/0xb0
[   43.907002] [c000ded9bc60] [c05a86f8] __fput+0xc8/0x350
[   43.907008] [c000ded9bcb0] [c0196c14] task_work_run+0xe4/0x170
[   43.907014] [c000ded9bd00] [c016940c] do_exit+0x2fc/0x590
[   43.907019] [c000ded9bdb0] [c01698ec] do_group_exit+0x4c/0xc0
[   43.907023] [c000ded9bdf0] [c0169988] sys_exit_group+0x28/0x30
[   43.907027] [c000ded9be10] [c0034adc] 
system_call_exception+0x13c/0x340
[   43.907033] [c000ded9be50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
[   43.907038] --- interrupt: 3000 at 0x7fffa0b0c2c4
[   43.907041] NIP:  7fffa0b0c2c4 LR:  CTR: 
[   43.907043] REGS: c000ded9be80 TRAP: 3000   Tainted: G   ME  
 (6.4.0-rc5-00016-ga4d7d7011219)
[   43.907047] MSR:  8280f033   CR: 
42002822  XER: 
[   43.907056] IRQMASK: 0  [   43.907056] GPR00: 00ea 
7fffc8ab87f0 7fffa0e87e00   [   43.907056] GPR04: 
 3da602a0 7fffa0e81200   [   
43.907056] GPR08:    
  [   43.907056] GPR12:  7fffa0e8a5a0 
   [   43.907056] GPR16:  
    [   43.907056] GPR20: 
  0001 0001  [   
43.907056] GPR24: 7fffa0c423c0  7fffa0c40a38 
  [   43.907056] GPR28: 0001 7fffa0e835b0 
f000   [   43.907090] NIP [7fffa0b0c2c4] 
0x7fffa0b0c2c4
[   43.907092] LR [] 0x0
[   43.907094] --- interrupt: 3000
[   43.907096] Code: 3902 38c00400 38a0 3880 3860 4bf6c0b5 
6000 4b10 7fc3f378 3882 4bffe371 4bfffe9c <0fe0> 7c0802a6 
fbe10048 f8010060  [   43.907108] ---[ end trace  ]---


- Sachin



Re: [RFC PATCH v2 5/6] KVM: PPC: Add support for nested PAPR guests

2023-06-07 Thread Nicholas Piggin
On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote:
> A series of hcalls have been added to the PAPR which allow a regular
> guest partition to create and manage guest partitions of its own. Add
> support to KVM to utilize these hcalls to enable running nested guests.
>
> Overview of the new hcall usage:
>
> - L1 and L0 negotiate capabilities with
>   H_GUEST_{G,S}ET_CAPABILITIES()
>
> - L1 requests the L0 create a L2 with
>   H_GUEST_CREATE() and receives a handle to use in future hcalls
>
> - L1 requests the L0 create a L2 vCPU with
>   H_GUEST_CREATE_VCPU()
>
> - L1 sets up the L2 using H_GUEST_SET and the
>   H_GUEST_VCPU_RUN input buffer
>
> - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()
>
> - L2 returns to L1 with an exit reason and L1 reads the
>   H_GUEST_VCPU_RUN output buffer populated by the L0
>
> - L1 handles the exit using H_GET_STATE if necessary
>
> - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN
>
> - L1 frees the L2 in the L0 with H_GUEST_DELETE()
>
> Support for the new API is determined by trying
> H_GUEST_GET_CAPABILITIES. On a successful return, the new API will then
> be used.
>
> Use the vcpu register state setters for tracking modified guest state
> elements and copy the thread wide values into the H_GUEST_VCPU_RUN input
> buffer immediately before running a L2. The guest wide
> elements can not be added to the input buffer so send them with a
> separate H_GUEST_SET call if necessary.
>
> Make the vcpu register getter load the corresponding value from the real
> host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
> which values have already been loaded between H_GUEST_VCPU_RUN calls. If
> an element is present in the H_GUEST_VCPU_RUN output buffer it also does
> not need to be loaded again.
>
> There is existing support for running nested guests on KVM
> with powernv. However the interface used for this is not supported by
> other PAPR hosts. This existing API is still supported.
>
> Signed-off-by: Jordan Niethe 
> ---
> v2:
>   - Declare op structs as static
>   - Use expressions in switch case with local variables
>   - Do not use the PVR for the LOGICAL PVR ID
>   - Handle emul_inst as now a double word
>   - Use new GPR(), etc macros
>   - Determine PAPR nested capabilities from cpu features
> ---
>  arch/powerpc/include/asm/guest-state-buffer.h | 105 +-
>  arch/powerpc/include/asm/hvcall.h |  30 +
>  arch/powerpc/include/asm/kvm_book3s.h | 122 ++-
>  arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
>  arch/powerpc/include/asm/kvm_host.h   |  21 +
>  arch/powerpc/include/asm/kvm_ppc.h|  64 +-
>  arch/powerpc/include/asm/plpar_wrappers.h | 198 
>  arch/powerpc/kvm/Makefile |   1 +
>  arch/powerpc/kvm/book3s_hv.c  | 126 ++-
>  arch/powerpc/kvm/book3s_hv.h  |  74 +-
>  arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
>  arch/powerpc/kvm/book3s_hv_papr.c | 940 ++
>  arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
>  arch/powerpc/kvm/guest-state-buffer.c |  49 +
>  14 files changed, 1684 insertions(+), 94 deletions(-)
>  create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c
>
> diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
> b/arch/powerpc/include/asm/guest-state-buffer.h
> index 65a840abf1bb..116126edd8e2 100644
> --- a/arch/powerpc/include/asm/guest-state-buffer.h
> +++ b/arch/powerpc/include/asm/guest-state-buffer.h
> @@ -5,6 +5,7 @@
>  #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
>  #define _ASM_POWERPC_GUEST_STATE_BUFFER_H
>  
> +#include "asm/hvcall.h"
>  #include 
>  #include 
>  #include 
> @@ -14,16 +15,16 @@
>   **/
>  #define GSID_BLANK   0x
>  
> -#define GSID_HOST_STATE_SIZE 0x0001 /* Size of Hypervisor Internal 
> Format VCPU state */
> -#define GSID_RUN_OUTPUT_MIN_SIZE 0x0002 /* Minimum size of the Run VCPU 
> output buffer */
> -#define GSID_LOGICAL_PVR 0x0003 /* Logical PVR */
> -#define GSID_TB_OFFSET   0x0004 /* Timebase Offset */
> -#define GSID_PARTITION_TABLE 0x0005 /* Partition Scoped Page Table */
> -#define GSID_PROCESS_TABLE   0x0006 /* Process Table */
> +#define GSID_HOST_STATE_SIZE 0x0001
> +#define GSID_RUN_OUTPUT_MIN_SIZE 0x0002
> +#define GSID_LOGICAL_PVR 0x0003
> +#define GSID_TB_OFFSET   0x0004
> +#define GSID_PARTITION_TABLE 0x0005
> +#define GSID_PROCESS_TABLE   0x0006

You lost your comments.

> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 0ca2d8b37b42..c5c57552b447 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct kvmppc_bat {
>   u64 raw;
> @@ 

Re: [RFC PATCH v2 4/6] KVM: PPC: Add helper library for Guest State Buffers

2023-06-07 Thread Nicholas Piggin
On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote:
> The new PAPR nested guest API introduces the concept of a Guest State
> Buffer for communication about L2 guests between L1 and L0 hosts.
>
> In the new API, the L0 manages the L2 on behalf of the L1. This means
> that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
> table...), it must request the L0 perform the modification. If the
> nested host needs to read L2 state likewise this request must
> go through the L0.
>
> The Guest State Buffer is a Type-Length-Value style data format defined
> in the PAPR which assigns all relevant partition state a unique
> identity. Unlike a typical TLV format the length is redundant as the
> length of each identity is fixed but is included for checking
> correctness.
>
> A guest state buffer consists of an element count followed by a stream
> of elements, where elements are composed of an ID number, data length,
> then the data:
>
>   Header:
>
><---4 bytes--->
>   ++-
>   | Element Count  | Elements...
>   ++-
>
>   Element:
>
><2 bytes---> <-2 bytes-> <-Length bytes->
>   ++---++
>   | Guest State ID |  Length   |  Data  |
>   ++---++
>
> Guest State IDs have other attributes defined in the PAPR such as
> whether they are per thread or per guest, or read-only.
>
> Introduce a library for using guest state buffers. This includes support
> for actions such as creating buffers, adding elements to buffers,
> reading the value of elements and parsing buffers. This will be used
> later by the PAPR nested guest support.

This is a tour de force in one of these things, so I hate to be
the "me smash with club" guy, but what if you allocated buffers
with enough room for all the state (or 99% of cases, in which
case an overflow would make an hcall)?

What's actually a fast-path that we don't get from the interrupt
return buffer? Getting and setting a few regs for MMIO emulation?


> Signed-off-by: Jordan Niethe 
> ---
> v2:
>   - Add missing #ifdef CONFIG_VSXs
>   - Move files from lib/ to kvm/
>   - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
>   - Use kunit for guest state buffer tests
>   - Add configuration option for the tests
>   - Use macros for contiguous id ranges like GPRs
>   - Add some missing EXPORTs to functions
>   - HEIR element is a double word not a word
> ---
>  arch/powerpc/Kconfig.debug|  12 +
>  arch/powerpc/include/asm/guest-state-buffer.h | 901 ++
>  arch/powerpc/include/asm/kvm_book3s.h |   2 +
>  arch/powerpc/kvm/Makefile |   3 +
>  arch/powerpc/kvm/guest-state-buffer.c | 563 +++
>  arch/powerpc/kvm/test-guest-state-buffer.c| 321 +++
>  6 files changed, 1802 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
>  create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
>  create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c
>
> diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> index 6aaf8dc60610..ed830a714720 100644
> --- a/arch/powerpc/Kconfig.debug
> +++ b/arch/powerpc/Kconfig.debug
> @@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
>   bool "Run self-tests of the MSI bitmap code"
>   depends on DEBUG_KERNEL
>  
> +config GUEST_STATE_BUFFER_TEST
> + def_tristate n
> + prompt "Enable Guest State Buffer unit tests"
> + depends on KUNIT
> + depends on KVM_BOOK3S_HV_POSSIBLE
> + default KUNIT_ALL_TESTS
> + help
> +   The Guest State Buffer is a data format specified in the PAPR.
> +   It is used by hcalls to communicate the state of L2 guests between
> +   the L1 and L0 hypervisors. Enable unit tests for the library
> +   used to create and use guest state buffers.
> +
>  config PPC_IRQ_SOFT_MASK_DEBUG
>   bool "Include extra checks for powerpc irq soft masking"
>   depends on PPC64
> diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
> b/arch/powerpc/include/asm/guest-state-buffer.h
> new file mode 100644
> index ..65a840abf1bb
> --- /dev/null
> +++ b/arch/powerpc/include/asm/guest-state-buffer.h
> @@ -0,0 +1,901 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Interface based on include/net/netlink.h
> + */
> +#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
> +#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
> +
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * Guest State Buffer Constants
> + **/
> +#define GSID_BLANK   0x

The namespaces are a little abbreviated. KVM_PAPR_ might be nice if
you're calling the API that.

> +
> +#define GSID_HOST_STATE_SIZE 0x0001 /* Size of Hypervisor Internal 
> Format VCPU state */
> +#define GSID_RUN_OUTPUT_MIN_SIZE   
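The TLV layout quoted in the commit message is simple enough to sketch: a big-endian u32 element count, then per element a u16 ID, a u16 length and the raw data, so an 8-byte register value costs 12 bytes on the wire. A minimal encoder model (helper names are illustrative, not from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* PAPR data is big-endian; encode explicitly rather than via casts. */
static void put_be16(uint8_t *p, uint16_t v)
{
        p[0] = v >> 8;
        p[1] = v & 0xff;
}

static void put_be32(uint8_t *p, uint32_t v)
{
        p[0] = v >> 24;
        p[1] = v >> 16;
        p[2] = v >> 8;
        p[3] = v;
}

/* Append one element at byte offset 'off'; returns the new end offset.
 * The 4-byte element-count header occupies offsets 0..3. */
static size_t gsb_put_elem(uint8_t *buf, size_t off, uint16_t id,
                           const void *data, uint16_t len)
{
        put_be16(buf + off, id);        /* Guest State ID */
        put_be16(buf + off + 2, len);   /* redundant but checked length */
        memcpy(buf + off + 4, data, len);
        return off + 4 + len;
}
```

This also shows why the fixed per-ID lengths make the length field redundant, as the commit message notes: the receiver already knows an ID like GSID_TB_OFFSET carries 8 bytes and can use the field purely for validation.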

Re: [PATCH] powerpc/64s/radix: Fix exit lazy tlb mm switch with irqs enabled

2023-06-07 Thread Sachin Sant



> Reported-by: Sachin Sant 
> Link: 
> https://lore.kernel.org/linuxppc-dev/87a5xcgopc.fsf@mail.lhotse/T/#m105488939d0cd9f980978ed2fdeeb89bf731e673
> Fixes: a665eec0a22e1 ("powerpc/64s/radix: Fix mm_cpumask trimming race vs 
> kthread_use_mm")
> Signed-off-by: Nicholas Piggin 
> ---
> This sounds worse than it probably is, radix can likely tolerate an
> interrupt hitting in mm switch, and the active_mm update may not be racy
> in practice either. Still be good to backport it because I'm not 100%
> sure of that.
> 
> This path can be stressed by reducing tlb_mm_cpumask_trim_timer (e.g.,
> to 3).
> 
> Thanks,
> Nick
> 
> arch/powerpc/mm/book3s64/radix_tlb.c | 10 +-
> 1 file changed, 9 insertions(+), 1 deletion(-)
> 
This patch fixes the reported warning.

Ran powerpc selftests (with default value for tlb_mm_cpumask_trim_timer as
well as tlb_mm_cpumask_trim_timer=3 ). No new errors were observed.

Tested-by: Sachin Sant 

- Sachin

Re: [RFC PATCH v2 2/6] KVM: PPC: Add fpr getters and setters

2023-06-07 Thread Nicholas Piggin
On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote:
> Add wrappers for fpr registers to prepare for supporting PAPR nested
> guests.
>
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/include/asm/kvm_book3s.h | 31 +++
>  arch/powerpc/include/asm/kvm_booke.h  | 10 +
>  arch/powerpc/kvm/book3s.c | 16 +++---
>  arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
>  arch/powerpc/kvm/powerpc.c| 22 +--
>  5 files changed, 61 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 4e91f54a3f9f..a632e79639f0 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -413,6 +413,37 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
> *vcpu)
>   return vcpu->arch.fault_dar;
>  }
>  
> +static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
> +{
> + return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
> +}
> +
> +static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
> +{
> + vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
> +}
> +
> +static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.fp.fpscr;
> +}
> +
> +static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
> +{
> + vcpu->arch.fp.fpscr = val;
> +}
> +
> +
> +static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
> +{
> + return vcpu->arch.fp.fpr[i][j];
> +}
> +
> +static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j, 
> u64 val)
> +{
> + vcpu->arch.fp.fpr[i][j] = val;
> +}
> +
>  #define BOOK3S_WRAPPER_SET(reg, size)
> \
>  static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)  
> \
>  {\
> diff --git a/arch/powerpc/include/asm/kvm_booke.h 
> b/arch/powerpc/include/asm/kvm_booke.h
> index 0c3401b2e19e..7c3291aa8922 100644
> --- a/arch/powerpc/include/asm/kvm_booke.h
> +++ b/arch/powerpc/include/asm/kvm_booke.h
> @@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
>   return vcpu->arch.regs.nip;
>  }
>  
> +static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
> +{
> + vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
> +}
> +
> +static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
> +{
> + return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
> +}
> +
>  #ifdef CONFIG_BOOKE
>  static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 2fe31b518886..6cd20ab9e94e 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -636,17 +636,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
>   break;
>   case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
>   i = id - KVM_REG_PPC_FPR0;
> - *val = get_reg_val(id, VCPU_FPR(vcpu, i));
> + *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
>   break;
>   case KVM_REG_PPC_FPSCR:
> - *val = get_reg_val(id, vcpu->arch.fp.fpscr);
> + *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
>   break;
>  #ifdef CONFIG_VSX
>   case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
>   if (cpu_has_feature(CPU_FTR_VSX)) {
>   i = id - KVM_REG_PPC_VSR0;
> - val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
> - val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
> + val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
> + val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
>   } else {
>   r = -ENXIO;
>   }
> @@ -724,7 +724,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>   break;
>   case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
>   i = id - KVM_REG_PPC_FPR0;
> - VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
> + kvmppc_set_fpr(vcpu, i, set_reg_val(id, *val));
>   break;
>   case KVM_REG_PPC_FPSCR:
>   vcpu->arch.fp.fpscr = set_reg_val(id, *val);
> @@ -733,8 +733,8 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>   case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
>   if (cpu_has_feature(CPU_FTR_VSX)) {
>   i = id - KVM_REG_PPC_VSR0;
> - vcpu->arch.fp.fpr[i][0] = val->vsxval[0];
> - vcpu->arch.fp.fpr[i][1] = val->vsxval[1];
> + 

Re: [RFC PATCH v2 1/6] KVM: PPC: Use getters and setters for vcpu register state

2023-06-07 Thread Nicholas Piggin
On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote:
> There are already some getter and setter functions used for accessing
> vcpu register state, e.g. kvmppc_get_pc(). There are also more
> complicated examples that are generated by macros like
> kvmppc_get_sprg0() which are generated by the SHARED_SPRNG_WRAPPER()
> macro.
>
> In the new PAPR API for nested guest partitions the L1 is required to
> communicate with the L0 to modify and read nested guest state.
>
> Prepare to support this by replacing direct accesses to vcpu register
> state with wrapper functions. Follow the existing pattern of using
> macros to generate individual wrappers. These wrappers will
> be augmented for supporting PAPR nested guests later.
>
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/include/asm/kvm_book3s.h  |  68 +++-
>  arch/powerpc/include/asm/kvm_ppc.h |  48 --
>  arch/powerpc/kvm/book3s.c  |  22 +--
>  arch/powerpc/kvm/book3s_64_mmu_hv.c|   4 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |   9 +-
>  arch/powerpc/kvm/book3s_64_vio.c   |   4 +-
>  arch/powerpc/kvm/book3s_hv.c   | 222 +
>  arch/powerpc/kvm/book3s_hv.h   |  59 +++
>  arch/powerpc/kvm/book3s_hv_builtin.c   |  10 +-
>  arch/powerpc/kvm/book3s_hv_p9_entry.c  |   4 +-
>  arch/powerpc/kvm/book3s_hv_ras.c   |   5 +-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c|   8 +-
>  arch/powerpc/kvm/book3s_hv_rm_xics.c   |   4 +-
>  arch/powerpc/kvm/book3s_xive.c |   9 +-
>  arch/powerpc/kvm/powerpc.c |   4 +-
>  15 files changed, 322 insertions(+), 158 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index bbf5e2c5fe09..4e91f54a3f9f 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -392,6 +392,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
>   return vcpu->arch.regs.nip;
>  }
>  
> +static inline void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 val)
> +{
> + vcpu->arch.pid = val;
> +}
> +
> +static inline u32 kvmppc_get_pid(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.pid;
> +}
> +
>  static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu);
>  static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu)
>  {
> @@ -403,10 +413,66 @@ static inline ulong kvmppc_get_fault_dar(struct 
> kvm_vcpu *vcpu)
>   return vcpu->arch.fault_dar;
>  }
>  
> +#define BOOK3S_WRAPPER_SET(reg, size)
> \
> +static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)  
> \
> +{\
> + \
> + vcpu->arch.reg = val;   \
> +}
> +
> +#define BOOK3S_WRAPPER_GET(reg, size)
> \
> +static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)
> \
> +{\
> + return vcpu->arch.reg;  \
> +}
> +
> +#define BOOK3S_WRAPPER(reg, size)\
> + BOOK3S_WRAPPER_SET(reg, size)   \
> + BOOK3S_WRAPPER_GET(reg, size)   \
> +
> +BOOK3S_WRAPPER(tar, 64)
> +BOOK3S_WRAPPER(ebbhr, 64)
> +BOOK3S_WRAPPER(ebbrr, 64)
> +BOOK3S_WRAPPER(bescr, 64)
> +BOOK3S_WRAPPER(ic, 64)
> +BOOK3S_WRAPPER(vrsave, 64)
> +
> +
> +#define VCORE_WRAPPER_SET(reg, size) \
> +static inline void kvmppc_set_##reg ##_hv(struct kvm_vcpu *vcpu, u##size 
> val)\
> +{\
> + vcpu->arch.vcore->reg = val;\
> +}
> +
> +#define VCORE_WRAPPER_GET(reg, size) \
> +static inline u##size kvmppc_get_##reg ##_hv(struct kvm_vcpu *vcpu)  \
> +{\
> + return vcpu->arch.vcore->reg;   \
> +}
> +
> +#define VCORE_WRAPPER(reg, size) \
> + VCORE_WRAPPER_SET(reg, size)\
> + VCORE_WRAPPER_GET(reg, size)\
> +
> +
> +VCORE_WRAPPER(vtb, 64)
> +VCORE_WRAPPER(tb_offset, 64)
> +VCORE_WRAPPER(lpcr, 64)

The general idea is fine, some of the names could use a bit of
improvement. What's a BOOK3S_WRAPPER for example, is it not a
VCPU_WRAPPER, or alternatively why isn't a VCORE_WRAPPER Book3S
as well?

> +
> +static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.dec_expires;
> +}
> +
> +static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
> +{
> + vcpu->arch.dec_expires = val;
> +}