Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-07 Thread Mike Rapoport
Hi Yajun,

On Tue, Nov 08, 2022 at 02:27:53AM +, Yajun Deng wrote:
> Hi Sachin,
> I didn't have a powerpc architecture machine. I don't know why this happened.
> 
> Hi Mike,
> Do you have any suggestions?

You can try reproducing the bug in qemu or work with Sachin to debug the
issue.
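
For example, something along these lines boots a pseries guest for
testing (a sketch only; exact flags depend on your qemu version and
kernel config, and you may need -cpu power10 on a newer qemu to match
the LPAR):

  # qemu-system-ppc64 -machine pseries -cpu power9 -smp 4 -m 4G \
        -kernel vmlinux -initrd initrd.img -nographic -append "console=hvc0"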

> I tested in tools/testing/memblock, and it was successful.

Memblock tests still provide limited coverage and don't deal with all
possible cases.
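
(For illustration, a test in the style of tools/testing/memblock that
exercises the merge path which crashed here might look like the sketch
below; reset_memblock_regions() and ASSERT_EQ() are assumed to be the
existing test-suite helpers:)

	static int memblock_reserve_adjacent_check(void)
	{
		struct memblock_region *rgn = &memblock.reserved.regions[0];

		reset_memblock_regions();

		/* two adjacent ranges that memblock_merge_regions() must merge */
		memblock_reserve(0x1000, 0x1000);
		memblock_reserve(0x2000, 0x1000);

		ASSERT_EQ(memblock.reserved.cnt, 1);
		ASSERT_EQ(rgn->base, 0x1000);
		ASSERT_EQ(rgn->size, 0x2000);

		return 0;
	}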

For now I'm dropping this patch from the memblock tree until the issue is
fixed.
 
> November 6, 2022 8:07 PM, "Sachin Sant"  wrote:
> 
> > While booting recent linux-next on a IBM Power10 Server LPAR
> > following crash is observed:
> > 
> > [ 0.00] numa: Partition configured for 32 NUMA nodes.
> > [ 0.00] [ cut here ]
> > [ 0.00] kernel BUG at mm/memblock.c:519!
> > [ 0.00] Oops: Exception in kernel mode, sig: 5 [#1]
> > [ 0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > [ 0.00] Modules linked in:
> > [ 0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3-next-20221104 #1
> > [ 0.00] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1030.00 (NH1030_026) hv:phyp pSeries
> > [ 0.00] NIP: c04ba240 LR: c04bb240 CTR: c04ba210
> > [ 0.00] REGS: c2a8b7b0 TRAP: 0700 Not tainted (6.1.0-rc3-next-20221104)
> > [ 0.00] MSR: 80021033  CR: 24042424 XER: 0001
> > [ 0.00] CFAR: c04ba290 IRQMASK: 1
> > [ 0.00] GPR00: c04bb240 c2a8ba50 c136ee00 c010f3ac00a8
> > [ 0.00] GPR04:  c010f3ac0090 0010f3ac 0d00
> > [ 0.00] GPR08: 0001 0007 0001 0081
> > [ 0.00] GPR12: c04ba210 c2e1  000d
> > [ 0.00] GPR16: 0f6be620 0f6be8e8 0f6be788 0f6bed58
> > [ 0.00] GPR20: 0f6f6d58 c29a8de8 0010f3ad8800 0080
> > [ 0.00] GPR24: 0010f3ad7b00  0100 0d00
> > [ 0.00] GPR28: 0010f3ad7b00 c29a8de8 c29a8e00 0006
> > [ 0.00] NIP [c04ba240] memblock_merge_regions.isra.12+0x40/0x130
> > [ 0.00] LR [c04bb240] memblock_add_range+0x190/0x300
> > [ 0.00] Call Trace:
> > [ 0.00] [c2a8ba50] [0100] 0x100 (unreliable)
> > [ 0.00] [c2a8ba90] [c04bb240] memblock_add_range+0x190/0x300
> > [ 0.00] [c2a8bb10] [c04bb5e0] memblock_reserve+0x70/0xd0
> > [ 0.00] [c2a8bba0] [c2045234] memblock_alloc_range_nid+0x11c/0x1e8
> > [ 0.00] [c2a8bc60] [c20453a4] memblock_alloc_internal+0xa4/0x110
> > [ 0.00] [c2a8bcb0] [c20456cc] memblock_alloc_try_nid+0x94/0xcc
> > [ 0.00] [c2a8bd40] [c200b570] alloc_paca_data+0x7c/0xcc
> > [ 0.00] [c2a8bdb0] [c200b770] allocate_paca+0x8c/0x28c
> > [ 0.00] [c2a8be50] [c200a26c] setup_arch+0x1c4/0x4d8
> > [ 0.00] [c2a8bed0] [c2004378] start_kernel+0xb4/0xa84
> > [ 0.00] [c2a8bf90] [c000da90] start_here_common+0x1c/0x20
> > [ 0.00] Instruction dump:
> > [ 0.00] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 3be0 f8010010
> > [ 0.00] f821ffc1 e923 3969 480c <0b0a> 7d3f4b78 393f0001 7fbf5840
> > [ 0.00] ---[ end trace  ]---
> > [ 0.00]
> > [ 0.00] Kernel panic - not syncing: Fatal exception
> > [ 0.00] Rebooting in 180 seconds..
> > 
> > This problem was introduced with next-20221101. Git bisect points to
> > following patch
> > 
> > commit 3f82c9c4ac377082e1230f5299e0ccce07b15e12
> > Date: Tue Oct 25 15:09:43 2022 +0800
> > memblock: don't run loop in memblock_add_range() twice
> > 
> > Reverting this patch helps boot the kernel to login prompt.
> > 
> > Have attached .config
> > 
> > - Sachin

-- 
Sincerely yours,
Mike.


Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-07 Thread Isaku Yamahata
On Tue, Nov 08, 2022 at 01:09:27AM +,
"Huang, Kai"  wrote:

> On Mon, 2022-11-07 at 13:46 -0800, Isaku Yamahata wrote:
> > > On Fri, Nov 04, 2022, Isaku Yamahata wrote:
> > > > Thanks for the patch series. I rebased the TDX KVM patch series and it
> > > > worked. Since cpu offline needs to be rejected in some cases (to keep
> > > > at least one cpu on a package), an arch hook for cpu offline is needed.
> > > 
> > > I hate to bring this up because I doubt there's a real use case for
> > > SUSPEND with TDX, but the CPU offline path isn't just for true
> > > offlining of CPUs.  When the system enters SUSPEND, only the initiating
> > > CPU goes through kvm_suspend()+kvm_resume(), all responding CPUs go
> > > through CPU offline+online.  I.e. disallowing all CPUs from going
> > > "offline" will prevent suspending the system.
> > 
> > The current TDX KVM implementation disallows a CPU package from going
> > offline only when TDs are running.  If no TD is running, CPU offline is
> > allowed.  So before SUSPEND, TDs need to be killed via systemd or
> > something.  After killing the TDs, the system can enter the SUSPEND state.
> 
> This seems not correct.  You need at least one cpu in each package to be
> online in order to create a TD as well, as TDH.MNG.KEY.CONFIG needs to be
> called on all packages, correct?

That's correct. In that case, the creation of the TD fails.  TD creation
checks that at least one cpu is online in every CPU package; if not, it
returns an error.
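
(Roughly, such a check can be built on the cpu topology API; this is a
sketch, not code from the TDX series:)

	/* sketch: true if every present package has at least one online cpu */
	static bool all_packages_have_online_cpu(void)
	{
		int cpu, other;

		for_each_present_cpu(cpu) {
			bool found = false;

			for_each_online_cpu(other) {
				if (topology_physical_package_id(other) ==
				    topology_physical_package_id(cpu)) {
					found = true;
					break;
				}
			}
			if (!found)
				return false;
		}
		return true;
	}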
-- 
Isaku Yamahata 


Re: [PATCH] ibmvfc: Avoid path failures during live migration

2022-11-07 Thread Martin K. Petersen
On Wed, 26 Oct 2022 13:13:56 -0500, Brian King wrote:

> This patch fixes an issue reported when performing a live
> migration when multipath is configured with a short fast
> fail timeout of 5 seconds and also to have no_path_retry
> set to fail. In this scenario, all paths would go into the
> devloss state while the ibmvfc driver went through discovery
> to log back in. On a loaded system, the discovery might
> take longer than 5 seconds, which was resulting in all paths
> being marked failed, which then resulted in a read only filesystem.
> 
> [...]

Applied to 6.1/scsi-fixes, thanks!

[1/1] ibmvfc: Avoid path failures during live migration
  https://git.kernel.org/mkp/scsi/c/62fa3ce05d5d

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v9 6/7] powerpc/code-patching: Use temporary mm for Radix MMU

2022-11-07 Thread Benjamin Gray
On Thu, 2022-11-03 at 14:10 +1100, Benjamin Gray wrote:
> On Wed, 2022-11-02 at 10:11 +, Christophe Leroy wrote:
> > Le 25/10/2022 à 06:44, Benjamin Gray a écrit :
> > > +   /*
> > > +    * PTE allocation uses GFP_KERNEL which means we need to
> > > +    * pre-allocate the PTE here because we cannot do the
> > > +    * allocation during patching when IRQs are disabled.
> > > +    */
> > > +   pgdp = pgd_offset(mm, addr);
> > > +
> > > +   p4dp = p4d_alloc(mm, pgdp, addr);
> > > +   if (WARN_ON(!p4dp))
> > > +   goto fail_no_p4d;
> > > +
> > > +   pudp = pud_alloc(mm, p4dp, addr);
> > > +   if (WARN_ON(!pudp))
> > > +   goto fail_no_pud;
> > > +
> > > +   pmdp = pmd_alloc(mm, pudp, addr);
> > > +   if (WARN_ON(!pmdp))
> > > +   goto fail_no_pmd;
> > > +
> > > +   ptep = pte_alloc_map(mm, pmdp, addr);
> > > +   if (WARN_ON(!ptep))
> > > +   goto fail_no_pte;
> > 
> > Isn't there a standard generic function to do that?
> > 
> > For instance, __get_locked_pte() seems to do more or less the same.
> 
> __get_locked_pte invokes walk_to_pmd, which leaks memory if the
> allocation fails. This may not necessarily be a concern at boot (though
> I still don't like it), but startup is run every time a CPU comes
> online, so the leak is theoretically unbounded.
> 
> There's no need to leak it in this context, because we know that each
> page is exclusively used by the corresponding patching mm.

I found tlb_gather_mmu() to initialise a struct mmu_gather, so I've
removed all the open coding (it should free any partial page tables if
get_locked_pte fails). Currently running it through CI before posting,
will probably get the v10 out tomorrow.
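
(For reference, the cleanup pattern looks roughly like the sketch below;
teardown_patching_mm() is an assumed name, not the actual v10 code:)

	static void teardown_patching_mm(struct mm_struct *mm, unsigned long addr)
	{
		struct mmu_gather tlb;

		tlb_gather_mmu(&tlb, mm);
		/* frees any partially-built page table levels for this range */
		free_pgd_range(&tlb, addr, addr + PAGE_SIZE, 0, 0);
		tlb_finish_mmu(&tlb);
	}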


[PATCH] soc: fsl: qe: request pins non-exclusively

2022-11-07 Thread Dmitry Torokhov
Commit 84582f9ed090 ("soc: fsl: qe: Avoid using gpio_to_desc()") changed
qe_pin_request() to request and hold the GPIO corresponding to a given pin.
Unfortunately this does not work, as fhci-hcd requests these GPIOs
first, before calling qe_pin_request() (see
drivers/usb/host/fhci-hcd.c::of_fhci_probe()).
To fix it, change qe_pin_request() to request GPIOs non-exclusively, and
free them once the code determines the GPIO controller and offset for each
GPIO/pin.

Also, reaching deep into the gpiolib implementation is not the best idea.
We should either export gpio_chip_hwgpio() or keep converting to the
global gpio number space until we fix the driver to implement proper pin
control.
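
(The global-number route amounts to deriving the hardware offset without
the gpiolib-private helper, e.g. something like this sketch:)

	/* sketch: hardware offset via the global gpio number space instead
	 * of the private gpio_chip_hwgpio() helper */
	gc = gpiod_to_chip(gpiod);
	qe_pin->num = desc_to_gpio(gpiod) - gc->base;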

Fixes: 84582f9ed090 ("soc: fsl: qe: Avoid using gpio_to_desc()")
Signed-off-by: Dmitry Torokhov 
---

Compiled only, not tested on hardware.

 drivers/soc/fsl/qe/gpio.c   | 71 ++---
 drivers/usb/host/fhci-hcd.c |  2 +-
 include/soc/fsl/qe/qe.h |  5 +--
 3 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/drivers/soc/fsl/qe/gpio.c b/drivers/soc/fsl/qe/gpio.c
index 0ee887f89deb..5bb71a2b5b7a 100644
--- a/drivers/soc/fsl/qe/gpio.c
+++ b/drivers/soc/fsl/qe/gpio.c
@@ -13,7 +13,7 @@
 #include 
 #include 
 #include 
-#include 
+#include  /* for of_mm_gpio_chip */
 #include 
 #include 
 #include 
@@ -21,13 +21,6 @@
 #include 
 
 #include 
-/*
- * FIXME: this is legacy code that is accessing gpiolib internals in order
- * to implement a custom pin controller. The proper solution is to create
- * a real combined pin control and GPIO driver in drivers/pinctrl. However
- * this hack is here for legacy code reasons.
- */
-#include "../../../gpio/gpiolib.h"
 
 struct qe_gpio_chip {
struct of_mm_gpio_chip mm_gc;
@@ -149,20 +142,19 @@ struct qe_pin {
 * something like qe_pio_controller. Someday.
 */
struct qe_gpio_chip *controller;
-   struct gpio_desc *gpiod;
int num;
 };
 
 /**
  * qe_pin_request - Request a QE pin
- * @np:device node to get a pin from
- * @index: index of a pin in the device tree
+ * @dev:   device to get the pin from
+ * @index: index of the pin in the device tree
  * Context:non-atomic
  *
  * This function return qe_pin so that you could use it with the rest of
  * the QE Pin Multiplexing API.
  */
-struct qe_pin *qe_pin_request(struct device_node *np, int index)
+struct qe_pin *qe_pin_request(struct device *dev, int index)
 {
struct qe_pin *qe_pin;
struct gpio_chip *gc;
@@ -171,40 +163,46 @@ struct qe_pin *qe_pin_request(struct device_node *np, int 
index)
 
qe_pin = kzalloc(sizeof(*qe_pin), GFP_KERNEL);
if (!qe_pin) {
-   pr_debug("%s: can't allocate memory\n", __func__);
+   dev_dbg(dev, "%s: can't allocate memory\n", __func__);
return ERR_PTR(-ENOMEM);
}
 
-   gpiod = fwnode_gpiod_get_index(of_fwnode_handle(np), NULL, index, 
GPIOD_ASIS, "qe");
-   if (IS_ERR(gpiod)) {
-   err = PTR_ERR(gpiod);
-   goto err0;
-   }
-   if (!gpiod) {
-   err = -EINVAL;
+   /*
+* Request gpio as nonexclusive as it was likely already reserved
+* by the caller, and we are not planning on controlling it; we
+* only need the descriptor to get to the gpio chip structure.
+*/
+   gpiod = gpiod_get_index(dev, NULL, index,
+   GPIOD_ASIS | GPIOD_FLAGS_BIT_NONEXCLUSIVE);
+   err = PTR_ERR_OR_ZERO(gpiod);
+   if (err)
goto err0;
-   }
+
gc = gpiod_to_chip(gpiod);
if (WARN_ON(!gc)) {
err = -ENODEV;
goto err0;
-   }
-   qe_pin->gpiod = gpiod;
-   qe_pin->controller = gpiochip_get_data(gc);
-   /*
-* FIXME: this gets the local offset on the gpio_chip so that the driver
-* can manipulate pin control settings through its custom API. The real
-* solution is to create a real pin control driver for this.
-*/
-   qe_pin->num = gpio_chip_hwgpio(gpiod);
-
-   if (!fwnode_device_is_compatible(gc->fwnode, 
"fsl,mpc8323-qe-pario-bank")) {
-   pr_debug("%s: tried to get a non-qe pin\n", __func__);
-   gpiod_put(gpiod);
+   } else if (!fwnode_device_is_compatible(gc->fwnode,
+   "fsl,mpc8323-qe-pario-bank")) {
+   dev_dbg(dev, "%s: tried to get a non-qe pin\n", __func__);
err = -EINVAL;
-   goto err0;
+   } else {
+   qe_pin->controller = gpiochip_get_data(gc);
+   /*
+* FIXME: this gets the local offset on the gpio_chip so that
+* the driver can manipulate pin control settings through its
+* custom API. The real solution is to create a real pin control
+* driver for this.
+*/
+   qe_pin->num = 

Re: [PATCH 00/44] KVM: Rework kvm_init() and hardware enabling

2022-11-07 Thread Isaku Yamahata
On Fri, Nov 04, 2022 at 08:27:14PM +,
Sean Christopherson  wrote:

> On Fri, Nov 04, 2022, Isaku Yamahata wrote:
> > Thanks for the patch series. I rebased the TDX KVM patch series and it
> > worked. Since cpu offline needs to be rejected in some cases (to keep
> > at least one cpu on a package), an arch hook for cpu offline is needed.
> 
> I hate to bring this up because I doubt there's a real use case for
> SUSPEND with TDX, but the CPU offline path isn't just for true offlining
> of CPUs.  When the system enters SUSPEND, only the initiating CPU goes
> through kvm_suspend()+kvm_resume(), all responding CPUs go through CPU
> offline+online.  I.e. disallowing all CPUs from going "offline" will
> prevent suspending the system.

The current TDX KVM implementation disallows a CPU package from going
offline only when TDs are running.  If no TD is running, CPU offline is
allowed.  So before SUSPEND, TDs need to be killed via systemd or
something.  After killing the TDs, the system can enter the SUSPEND state.


> I don't see anything in the TDX series or the specs that suggests
> suspend+resume is disallowed when TDX is enabled, so blocking that seems
> just as wrong as preventing software from soft-offlining CPUs.

When it comes to SUSPEND, it means suspend-to-idle, ACPI S1, S3, or S4.
Suspend-to-idle doesn't require CPU offline.

Although the CPU-related specs don't mention S3, the ACPI spec says

  7.4.2.2 System _S1 State (Sleeping with Processor Context Maintained)
  The processor-complex context is maintained.

  7.4.2.4 System _S3 State or 7.4.2.5 System _S4 State
  The processor-complex context is not maintained.

It's safe to say that the processor context related to TDX is part of the
processor-complex context, I think.
Let me summarize the situation. What do you think?

- While no TD is running:
  No additional limitation on CPU offline.

- On TD creation:
  If any whole cpu package is software-offlined, TD creation fails.
  Alternative: forcibly online the necessary CPUs, create the TD, and
  offline the CPUs again.

- TD running:
  Although it's not required to keep all CPU packages online, keep CPU
  packages from going offline so that TD destruction remains possible.

- TD destruction:
  If any whole cpu package is software-offlined, TD destruction fails.
  The current implementation prevents any cpu package from going offline
  while a TD is running.
  Alternatives:
  - forcibly online the necessary CPUs, destruct the TD, offline the CPUs
    again, and allow the CPU package to go offline
  - Stash the TDX resources somewhere. When the cpu packages are onlined
    again, free those resources.

- On SUSPEND:
  TODO: Allow CPU offline if S1 is requested.
  - suspend-to-idle: nothing to do because cpu offline isn't required
  - ACPI S1: Need to allow offlining CPUs.  This can be implemented by
    checking whether suspend_state_t pm_suspend_target_state is
    PM_SUSPEND_STANDBY.
  - ACPI S3/S4: refuse cpu offline.  The system needs to kill all TDs
    before starting the SUSPEND process. This is what is currently
    implemented.
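
(A sketch of what the S1 allowance could look like in the offline hook;
tdx_num_tds and cpu_is_last_online_in_package() are placeholder names,
not taken from the series:)

	static int tdx_offline_cpu(unsigned int cpu)
	{
#ifdef CONFIG_SUSPEND
		/* suspend-to-idle and S1/standby keep processor context */
		if (pm_suspend_target_state == PM_SUSPEND_TO_IDLE ||
		    pm_suspend_target_state == PM_SUSPEND_STANDBY)
			return 0;
#endif
		/* refuse to take the last online cpu out of a package
		 * while any TD exists (placeholder helpers) */
		if (atomic_read(&tdx_num_tds) &&
		    cpu_is_last_online_in_package(cpu))
			return -EBUSY;

		return 0;
	}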

Thanks,
-- 
Isaku Yamahata 


Re: [PATCH 30/44] KVM: Drop kvm_arch_check_processor_compat() hook

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Drop kvm_arch_check_processor_compat() and its support code now that
> all
> architecture implementations are nops.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/arm64/kvm/arm.c   |  7 +--
>  arch/mips/kvm/mips.c   |  7 +--
>  arch/powerpc/kvm/book3s.c  |  2 +-
>  arch/powerpc/kvm/e500.c    |  2 +-
>  arch/powerpc/kvm/e500mc.c  |  2 +-
>  arch/powerpc/kvm/powerpc.c |  5 -
>  arch/riscv/kvm/main.c  |  7 +--
>  arch/s390/kvm/kvm-s390.c   |  7 +--
>  arch/x86/kvm/svm/svm.c |  4 ++--
>  arch/x86/kvm/vmx/vmx.c |  4 ++--
>  arch/x86/kvm/x86.c |  5 -
>  include/linux/kvm_host.h   |  4 +---
>  virt/kvm/kvm_main.c    | 24 +---
>  13 files changed, 13 insertions(+), 67 deletions(-)

Reviewed-by: Eric Farman  # s390


Re: [PATCH 27/44] KVM: Drop kvm_arch_{init,exit}() hooks

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
> are nops.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/arm64/kvm/arm.c    | 11 ---
>  arch/mips/kvm/mips.c    | 10 --
>  arch/powerpc/include/asm/kvm_host.h |  1 -
>  arch/powerpc/kvm/powerpc.c  |  5 -
>  arch/riscv/kvm/main.c   |  9 -
>  arch/s390/kvm/kvm-s390.c    | 10 --
>  arch/x86/kvm/x86.c  | 10 --
>  include/linux/kvm_host.h    |  3 ---
>  virt/kvm/kvm_main.c | 19 ++-
>  9 files changed, 2 insertions(+), 76 deletions(-)

Reviewed-by: Eric Farman  # s390


Re: [PATCH 25/44] KVM: s390: Do s390 specific init without bouncing through kvm_init()

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Move the guts of kvm_arch_init() into a new helper,
> __kvm_s390_init(),
> and invoke the new helper directly from kvm_s390_init() instead of
> bouncing through kvm_init().  Invoking kvm_arch_init() is the very
> first action performed by kvm_init(), i.e. this is a glorified nop.
> 
> Moving setup to __kvm_s390_init() will allow tagging more functions
> as
> __init, and emptying kvm_arch_init() will allow dropping the hook
> entirely once all architecture implementations are nops.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/s390/kvm/kvm-s390.c | 29 +
>  1 file changed, 25 insertions(+), 4 deletions(-)

Reviewed-by: Eric Farman 



Re: [PATCH 26/44] KVM: s390: Mark __kvm_s390_init() and its descendants as __init

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Tag __kvm_s390_init() and its unique helpers as __init.  These
> functions
> are only ever called during module_init(), but could not be tagged
> accordingly while they were invoked from the common kvm_arch_init(),
> which is not __init because of x86.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/s390/kvm/interrupt.c | 2 +-
>  arch/s390/kvm/kvm-s390.c  | 4 ++--
>  arch/s390/kvm/kvm-s390.h  | 2 +-
>  arch/s390/kvm/pci.c   | 2 +-
>  arch/s390/kvm/pci.h   | 2 +-
>  5 files changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Eric Farman 


Re: [PATCH 09/44] KVM: Drop arch hardware (un)setup hooks

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now
> that
> all implementations are nops.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/arm64/include/asm/kvm_host.h   |  1 -
>  arch/arm64/kvm/arm.c    |  5 -
>  arch/mips/include/asm/kvm_host.h    |  1 -
>  arch/mips/kvm/mips.c    |  5 -
>  arch/powerpc/include/asm/kvm_host.h |  1 -
>  arch/powerpc/kvm/powerpc.c  |  5 -
>  arch/riscv/include/asm/kvm_host.h   |  1 -
>  arch/riscv/kvm/main.c   |  5 -
>  arch/s390/kvm/kvm-s390.c    | 10 --
>  arch/x86/kvm/x86.c  | 10 --
>  include/linux/kvm_host.h    |  2 --
>  virt/kvm/kvm_main.c |  7 ---
>  12 files changed, 53 deletions(-)

Reviewed-by: Eric Farman  # s390


Re: [PATCH 06/44] KVM: s390: Move hardware setup/unsetup to init/exit

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Now that kvm_arch_hardware_setup() is called immediately after
> kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into
> kvm_arch_{init,exit}() as a step towards dropping one of the hooks.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/s390/kvm/kvm-s390.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)

Reviewed-by: Eric Farman 


Re: [PATCH 05/44] KVM: s390: Unwind kvm_arch_init() piece-by-piece() if a step fails

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> In preparation for folding kvm_arch_hardware_setup() into
> kvm_arch_init(),
> unwind initialization one step at a time instead of simply calling
> kvm_arch_exit().  Using kvm_arch_exit() regardless of which
> initialization
> step failed relies on all affected state playing nice with being
> undone
> even if said state wasn't first setup.  That holds true for state
> that is
> currently configured by kvm_arch_init(), but not for state that's
> handled
> by kvm_arch_hardware_setup(), e.g. calling
> gmap_unregister_pte_notifier()
> without first registering a notifier would result in list corruption
> due
> to attempting to delete an entry that was never added to the list.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/s390/kvm/kvm-s390.c | 21 ++---
>  1 file changed, 14 insertions(+), 7 deletions(-)

Reviewed-by: Eric Farman 


Re: [PATCH 04/44] KVM: Teardown VFIO ops earlier in kvm_exit()

2022-11-07 Thread Eric Farman
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Move the call to kvm_vfio_ops_exit() further up kvm_exit() to try and
> bring some amount of symmetry to the setup order in kvm_init(), and
> more
> importantly so that the arch hooks are invoked dead last by
> kvm_exit().
> This will allow arch code to move away from the arch hooks without
> any
> change in ordering between arch code and common code in kvm_exit().
> 
> That kvm_vfio_ops_exit() is called last appears to be 100%
> arbitrary.  It
> was bolted on after the fact by commit 571ee1b68598 ("kvm: vfio: fix
> unregister kvm_device_ops of vfio").  The nullified
> kvm_device_ops_table
> is also local to kvm_main.c and is used only when there are active
> VMs,
> so unless arch code is doing something truly bizarre, nullifying the
> table earlier in kvm_exit() is little more than a nop.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  virt/kvm/kvm_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Eric Farman 


Re: [PATCH v2 1/4] powerpc/64: Add INTERRUPT_SANITIZE_REGISTERS Kconfig

2022-11-07 Thread Segher Boessenkool
Hi!

On Mon, Nov 07, 2022 at 02:31:59PM +1100, Rohan McLure wrote:
> Add Kconfig option for enabling clearing of registers on arrival in an
> interrupt handler. This reduces the speculation influence of registers
> on kernel internals.

Assuming you are talking about existing PowerPC CPUs from the last 30
years:

There is no data speculation.  At all.  Ever.

There is branch prediction, but that is not influenced by register
contents, either (for any current CPUs at least).  (Except when you get
a flush because of a mispredict, but if this zeroing changes anything,
we will have used wild (but user controlled) values in the old
non-zeroing situation, and that is a much bigger problem itself already,
also for security!  This can be an unlikely kernel bug, or a very
unlikely compiler bug.)

All GPRs are renamed, always.  If you zero all GPRs on interrupt entry
(which is context synchronising, importantly), this will guarantee there
can be no timing influence from the GPRs, because all of the physical
registers depend on nothing that happened before.  So that is good, at
least it can give some peace of mind.  Except that this makes 30 new
registers in just a few cycles, which *itself* can cause stalls, if the
renaming things are still busy.  Context synchronising does not
necessarily help there, the renaming machinery can do stuff *after* an
insn completes.

I don't see how this helps anything.  If it does, "reduces speculation
influence" is not a good description of what it does, afaics?


Segher


Re: [PATCH 1/2] powerpc: export the CPU node count

2022-11-07 Thread Laurent Dufour
On 07/11/2022 13:11:17, Nicholas Piggin wrote:
> On Sat Oct 29, 2022 at 2:00 AM AEST, Laurent Dufour wrote:
>> At boot time, the FDT is parsed to compute the number of CPUs.
>> In addition, count the number of CPU nodes and export it.
>>
>> This is useful when building the FDT for a kexeced kernel since we need
>> to take into account the CPU nodes added since boot time during CPU
>> hotplug operations.
> 
> It would be nice if it just realloced memory in this case, but that
> looks like a bigger change.

I agree, and I think the best option in the long term would be the series
Sourabh Jain sent in June, updating the crash kernel FDT without reloading
it
(https://lore.kernel.org/linuxppc-dev/20220620070106.93141-1-sourabhj...@linux.ibm.com/)

In the meantime, this solves the issue.

> 
> But these patches look okay to me, if you can solve the compile bug.

Indeed, the compile bugs are raised because I added the declaration of the
new variable 'boot_cpu_node_count' in kexec_ranges.h, and added the
inclusion of that file in prom.c.

I was not confident about putting this new variable declaration in that
header file, but I didn't find a better option.

Do you have a better idea of which header file to use?

Could I just declare this variable "extern" in
arch/powerpc/kexec/file_load_64.c? That looks ugly to me.

Thanks,
Laurent.


> Thanks,
> Nick
> 
>>
>> Signed-off-by: Laurent Dufour 
>> ---
>>  arch/powerpc/include/asm/kexec_ranges.h | 2 ++
>>  arch/powerpc/kernel/prom.c  | 4 
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/kexec_ranges.h 
>> b/arch/powerpc/include/asm/kexec_ranges.h
>> index f83866a19e87..bf35d00ddd09 100644
>> --- a/arch/powerpc/include/asm/kexec_ranges.h
>> +++ b/arch/powerpc/include/asm/kexec_ranges.h
>> @@ -22,4 +22,6 @@ int add_rtas_mem_range(struct crash_mem **mem_ranges);
>>  int add_opal_mem_range(struct crash_mem **mem_ranges);
>>  int add_reserved_mem_ranges(struct crash_mem **mem_ranges);
>>  
>> +extern unsigned int boot_cpu_node_count;
>> +
>>  #endif /* _ASM_POWERPC_KEXEC_RANGES_H */
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index 1eed87d954ba..d326148fd5a4 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -56,6 +56,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  
>> @@ -72,6 +73,7 @@ int __initdata iommu_is_off;
>>  int __initdata iommu_force_on;
>>  unsigned long tce_alloc_start, tce_alloc_end;
>>  u64 ppc64_rma_size;
>> +unsigned int boot_cpu_node_count __ro_after_init;
>>  #endif
>>  static phys_addr_t first_memblock_size;
>>  static int __initdata boot_cpu_count;
>> @@ -335,6 +337,8 @@ static int __init early_init_dt_scan_cpus(unsigned long 
>> node,
>>  if (type == NULL || strcmp(type, "cpu") != 0)
>>  return 0;
>>  
>> +boot_cpu_node_count++;
>> +
>>  /* Get physical cpuid */
>>  intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", &len);
>>  if (!intserv)
>> -- 
>> 2.38.1
> 



Re: [PATCH v2 1/4] powerpc/64: Add INTERRUPT_SANITIZE_REGISTERS Kconfig

2022-11-07 Thread Christophe Leroy


Le 07/11/2022 à 04:31, Rohan McLure a écrit :
> Add Kconfig option for enabling clearing of registers on arrival in an
> interrupt handler. This reduces the speculation influence of registers
> on kernel internals. The option will be consumed by 64-bit systems that
> feature speculation and wish to implement this mitigation.
> 
> This patch only introduces the Kconfig option, no actual mitigations.

If that has to do with speculation, do we need a new Kconfig option ? 
Can't we use CONFIG_PPC_BARRIER_NOSPEC for that ?

> 
> The primary overhead of this mitigation lies in an increased number of
> registers that must be saved and restored by interrupt handlers on
> Book3S systems. Enable by default on Book3E systems, which prior to
> this patch eagerly save and restore register state, meaning that the
> mitigation when implemented will have minimal overhead.
> 
> Acked-by: Nicholas Piggin 
> Signed-off-by: Rohan McLure 
> ---
> Resubmitting patches as their own series after v6 partially merged:
> Link: 
> https://lore.kernel.org/all/166488988686.779920.13794870102696416283.b4...@ellerman.id.au/t/
> ---
>   arch/powerpc/Kconfig | 9 +
>   1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2ca5418457ed..9d3d20c6f365 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -529,6 +529,15 @@ config HOTPLUG_CPU
>   
> Say N if you are unsure.
>   
> +config INTERRUPT_SANITIZE_REGISTERS
> + bool "Clear gprs on interrupt arrival"
> + depends on PPC64 && ARCH_HAS_SYSCALL_WRAPPER
> + default PPC_BOOK3E_64
> + help
> +   Reduce the influence of user register state on interrupt handlers and
> +   syscalls through clearing user state from registers before handling
> +   the exception.
> +
>   config PPC_QUEUED_SPINLOCKS
>   bool "Queued spinlocks" if EXPERT
>   depends on SMP

[PATCH printk v3 00/40] reduce console_lock scope

2022-11-07 Thread John Ogness
This is v3 of a series to prepare for threaded/atomic
printing. v2 is here [0]. This series focuses on reducing the
scope of the BKL console_lock. It achieves this by switching to
SRCU and a dedicated mutex for console list iteration and
modification, respectively. The console_lock will no longer
offer this protection and is completely removed from
(un)register_console() and console_stop/start() code.
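
For readers, list iteration then nests inside an SRCU read-side section,
roughly like the sketch below (the pattern this series introduces):

	int cookie;
	struct console *con;

	cookie = console_srcu_read_lock();
	for_each_console_srcu(con) {
		/* flags of registered consoles can be read race-free */
		short flags = console_srcu_read_flags(con);

		if (flags & CON_ENABLED) {
			/* ... use the console ... */
		}
	}
	console_srcu_read_unlock(cookie);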

Also, during the review of v2 it came to our attention that
many console drivers are checking CON_ENABLED to see if they
are registered. Because this flag can change without
unregistering and because this flag does not represent an
atomic point when an (un)registration process is complete,
a new console_is_registered() function is introduced. This
function uses the console_list_lock to synchronize with the
(un)registration process to provide a reliable status.

All users of the console_lock for list iteration have been
modified. For the call sites where the console_lock is still
needed (because of other reasons), comments are added to
explain exactly why the console_lock was needed.

All users of CON_ENABLED for registration status have been
modified to use console_is_registered(). Note that there are
still users of CON_ENABLED, but this is for legitimate purposes
about a registered console being able to print.

The base commit for this series is from Paul McKenney's RCU tree
and provides an NMI-safe SRCU implementation [1]. Without the
NMI-safe SRCU implementation, this series is not less safe than
mainline. But we will need the NMI-safe SRCU implementation for
atomic consoles anyway, so we might as well get it in
now. Especially since it _does_ increase the reliability for
mainline in the panic path.

Changes since v2:

general:

- introduce a synchronized console_is_registered() to query if
  a console is registered, meant to replace CON_ENABLED
  (mis)use for this purpose

- directly read console->flags for registered consoles if it is
  race-free (and document that it is so)

- replace uart_console_enabled() with a new
  uart_console_registered() based on console_is_registered()

- change comments about why console_lock is used to synchronize
  console->device() by providing an example

registration check fixups:

- the following drivers were modified to use the new
  console_is_registered() instead of CON_ENABLED checks

   - arch/m68k/emu/nfcon.c
   - drivers/firmware/efi/earlycon.c
   - drivers/net/netconsole.c
   - drivers/tty/hvc/hvc_console.c
   - drivers/tty/serial/8250/8250_core.c
   - drivers/tty/serial/earlycon.c
   - drivers/tty/serial/pic32_uart.c
   - drivers/tty/serial/samsung_tty.c
   - drivers/tty/serial/serial_core.c
   - drivers/tty/serial/xilinx_uartps.c
   - drivers/usb/early/xhci-dbc.c

um: kmsg_dumper:

- change stdout dump criteria to match original intention

kgdb/kdb:

- in configure_kgdboc(), take console_list_lock to synchronize
  tty_find_polling_driver() against register_console()

- add comments explaining why calling console->write() without
  locking might work

tty: sh-sci:

- use a setup() callback to setup the early console

fbdev: xen:

- implement a cleaner approach for
  console_force_preferred_locked()

rcu:

- implement debug_lockdep_rcu_enabled() for
  !CONFIG_DEBUG_LOCK_ALLOC

printk:

- check CONFIG_DEBUG_LOCK_ALLOC for srcu_read_lock_held()
  availability

- for console_lock/_trylock/_unlock, replace "lock the console
  system" language with "block the console subsystem from
  printing"

- use WRITE_ONCE() for updating console->flags of registered
  consoles

- expand comments of synchronize_srcu() calls to explain why
  they are needed, and also expand comments to explain when it
  is not needed

- change CON_BOOT consoles to always begin at earliest message

- for non-BOOT/non-PRINTBUFFER consoles, initialize @seq to the
  minimal @seq of any of the enabled boot consoles

- add comments and lockdep assertion to
  unregister_console_locked() because it is not clear from the
  name which lock is implied

- dropped patches that caused unnecessary churn in the series

John Ogness

[0] 
https://lore.kernel.org/lkml/20221019145600.1282823-1-john.ogn...@linutronix.de
[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/log/?h=srcunmisafe.2022.10.21a

John Ogness (38):
  rcu: implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC
  printk: Prepare for SRCU console list protection
  printk: fix setting first seq for consoles
  um: kmsg_dump: only dump when no output console available
  console: introduce console_is_enabled() wrapper
  printk: use console_is_enabled()
  um: kmsg_dump: use console_is_enabled()
  kdb: kdb_io: use console_is_enabled()
  um: kmsg_dumper: use srcu console list iterator
  tty: serial: kgdboc: document console_lock usage
  tty: tty_io: document console_lock usage
  proc: consoles: document console_lock usage
  kdb: use srcu console list iterator
  printk: console_flush_all: use srcu console list iterator
  printk: 

[PATCH printk v3 26/40] tty: hvc: use console_is_registered()

2022-11-07 Thread John Ogness
It is not reliable to check for CON_ENABLED in order to identify if a
console is registered. Use console_is_registered() instead.

Signed-off-by: John Ogness 
---
 drivers/tty/hvc/hvc_console.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 4802cfaa107f..a683e21df19c 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -264,8 +264,8 @@ static void hvc_port_destruct(struct tty_port *port)
 
 static void hvc_check_console(int index)
 {
-   /* Already enabled, bail out */
-   if (hvc_console.flags & CON_ENABLED)
+   /* Already registered, bail out */
+   if (console_is_registered(&hvc_console))
return;
 
/* If this index is what the user requested, then register
-- 
2.30.2



Re: [RFC PATCH 05/19] powerpc/32: Use load and store multiple in GPR save/restore macros

2022-11-07 Thread Christophe Leroy


Le 07/11/2022 à 13:45, Nicholas Piggin a écrit :
> On Thu Nov 3, 2022 at 6:26 PM AEST, Christophe Leroy wrote:
>> It contradicts commit a85c728cb5e1 ("powerpc/32: Don't use lmw/stmw for
>> saving/restoring non volatile regs")
> 
> Hmm okay. Do you want to keep the #ifdef case in ppc_save_regs for ppc32
> to save ~124 bytes of text? I don't mind either way.

I think you can make ppc_save_regs() generic.

I don't think the lmw/stmw is worth it in ppc_save_regs(), just drop it.

Christophe

> 
> Thanks,
> Nick
> 
>>
>> Le 31/10/2022 à 06:54, Nicholas Piggin a écrit :
>>> ---
>>>arch/powerpc/include/asm/ppc_asm.h | 18 --
>>>1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/ppc_asm.h 
>>> b/arch/powerpc/include/asm/ppc_asm.h
>>> index 753a2757bcd4..ac44383d350a 100644
>>> --- a/arch/powerpc/include/asm/ppc_asm.h
>>> +++ b/arch/powerpc/include/asm/ppc_asm.h
>>> @@ -57,8 +57,22 @@
>>>#define SAVE_NVGPRS(base)SAVE_GPRS(14, 31, base)
>>>#define REST_NVGPRS(base)REST_GPRS(14, 31, base)
>>>#else
>>> -#define SAVE_GPRS(start, end, base)	OP_REGS stw, 4, start, end, base, GPR0
>>> -#define REST_GPRS(start, end, base)	OP_REGS lwz, 4, start, end, base, GPR0
>>> +.macro __SAVE_GPRS start, end, base, offset
>>> +   .if \end == 31
>>> +   stmw\start,\offset(\base)
>>> +   .else
>>> +   OP_REGS stw, 4, \start, \end, \base, \offset
>>> +   .endif
>>> +.endm
>>> +.macro __REST_GPRS start, end, base, offset
>>> +   .if \end == 31
>>> +   lmw \start,\offset(\base)
>>> +   .else
>>> +   OP_REGS lwz, 4, \start, \end, \base, \offset
>>> +   .endif
>>> +.endm
>>> +#define SAVE_GPRS(start, end, base)	__SAVE_GPRS start, end, base, GPR0
>>> +#define REST_GPRS(start, end, base)	__REST_GPRS start, end, base, GPR0
>>>#define SAVE_NVGPRS(base)SAVE_GPRS(13, 31, base)
>>>#define REST_NVGPRS(base)REST_GPRS(13, 31, base)
>>>#endif
> 

Re: [RFC PATCH 05/19] powerpc/32: Use load and store multiple in GPR save/restore macros

2022-11-07 Thread Nicholas Piggin
On Thu Nov 3, 2022 at 6:26 PM AEST, Christophe Leroy wrote:
> It contradicts commit a85c728cb5e1 ("powerpc/32: Don't use lmw/stmw for 
> saving/restoring non volatile regs")

Hmm okay. Do you want to keep the #ifdef case in ppc_save_regs for ppc32
to save ~124 bytes of text? I don't mind either way.

Thanks,
Nick

>
> Le 31/10/2022 à 06:54, Nicholas Piggin a écrit :
> > ---
> >   arch/powerpc/include/asm/ppc_asm.h | 18 --
> >   1 file changed, 16 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/ppc_asm.h 
> > b/arch/powerpc/include/asm/ppc_asm.h
> > index 753a2757bcd4..ac44383d350a 100644
> > --- a/arch/powerpc/include/asm/ppc_asm.h
> > +++ b/arch/powerpc/include/asm/ppc_asm.h
> > @@ -57,8 +57,22 @@
> >   #define SAVE_NVGPRS(base) SAVE_GPRS(14, 31, base)
> >   #define REST_NVGPRS(base) REST_GPRS(14, 31, base)
> >   #else
> > -#define SAVE_GPRS(start, end, base)	OP_REGS stw, 4, start, end, base, GPR0
> > -#define REST_GPRS(start, end, base)	OP_REGS lwz, 4, start, end, base, GPR0
> > +.macro __SAVE_GPRS start, end, base, offset
> > +   .if \end == 31
> > +   stmw\start,\offset(\base)
> > +   .else
> > +   OP_REGS stw, 4, \start, \end, \base, \offset
> > +   .endif
> > +.endm
> > +.macro __REST_GPRS start, end, base, offset
> > +   .if \end == 31
> > +   lmw \start,\offset(\base)
> > +   .else
> > +   OP_REGS lwz, 4, \start, \end, \base, \offset
> > +   .endif
> > +.endm
> > +#define SAVE_GPRS(start, end, base)	__SAVE_GPRS start, end, base, GPR0
> > +#define REST_GPRS(start, end, base)	__REST_GPRS start, end, base, GPR0
> >   #define SAVE_NVGPRS(base) SAVE_GPRS(13, 31, base)
> >   #define REST_NVGPRS(base) REST_GPRS(13, 31, base)
> >   #endif



Re: [RFC PATCH 2/6] powerpc/64s: Helpers to switch between linear and vmapped stack pointers

2022-11-07 Thread Nicholas Piggin
On Sat Nov 5, 2022 at 6:00 PM AEST, Christophe Leroy wrote:
>
>
> Le 04/11/2022 à 18:27, Andrew Donnellan a écrit :
> > powerpc unfortunately has too many places where we run stuff in real mode.
> > 
> > With CONFIG_VMAP_STACK enabled, this means we need to be able to swap the
> > stack pointer to use the linear mapping when we enter a real mode section,
> > and back afterwards.
> > 
> > Store the top bits of the stack pointer in both the linear map and the
> > vmalloc space in the PACA, and add some helper macros/functions to swap
> > between them.
>
> That may work when the page size is 64k because the stack is on a single
> page, but I doubt it works with 4k pages, because vmalloc may allocate
> non-contiguous pages.

Yeah. This could be a first-stage though, and depend on 64k page size
and stack size, or !KVM or whatever. When the real-mode code is solved,
that could be relaxed.
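
(The 64k constraint falls out of the address translation: a single lookup
covers the whole stack only if the stack fits in one page. A sketch, not
from the series:)

	/* valid only while sp stays within this one page - hence the
	 * single-page-stack restriction discussed above */
	static unsigned long vmapped_sp_to_linear(unsigned long sp)
	{
		struct page *page = vmalloc_to_page((void *)sp);

		return (unsigned long)page_address(page) + offset_in_page(sp);
	}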

Thanks,
Nick


Re: [RFC PATCH 14/19] powerpc: split validate_sp into two functions

2022-11-07 Thread Nicholas Piggin
On Mon Nov 7, 2022 at 10:58 AM AEST, Russell Currey wrote:
> On Mon, 2022-10-31 at 15:54 +1000, Nicholas Piggin wrote:
> > Most callers just want to validate an arbitrary kernel stack pointer,
> > some need a particular size. Make the size case the exceptional one
> > with an extra function.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/processor.h | 15 ---
> >  arch/powerpc/kernel/process.c    | 23 ++-
> >  arch/powerpc/kernel/stacktrace.c |  2 +-
> >  arch/powerpc/perf/callchain.c    |  6 +++---
> >  4 files changed, 30 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/processor.h
> > b/arch/powerpc/include/asm/processor.h
> > index 631802999d59..e96c9b8c2a60 100644
> > --- a/arch/powerpc/include/asm/processor.h
> > +++ b/arch/powerpc/include/asm/processor.h
> > @@ -374,9 +374,18 @@ static inline unsigned long __pack_fe01(unsigned
> > int fpmode)
> >  
> >  #endif
> >  
> > -/* Check that a certain kernel stack pointer is valid in task_struct
> > p */
> > -int validate_sp(unsigned long sp, struct task_struct *p,
> > -   unsigned long nbytes);
> > +/*
> > + * Check that a certain kernel stack pointer is a valid (minimum
> > sized)
> > + * stack frame in task_struct p.
> > + */
> > +int validate_sp(unsigned long sp, struct task_struct *p);
> > +
> > +/*
> > + * validate the stack frame of a particular minimum size, used for
> > when we are
> > + * looking at a certain object in the stack beyond the minimum.
> > + */
> > +int validate_sp_size(unsigned long sp, struct task_struct *p,
> > +    unsigned long nbytes);
> >  
> >  /*
> >   * Prefetch macros.
> > diff --git a/arch/powerpc/kernel/process.c
> > b/arch/powerpc/kernel/process.c
> > index 6cb3982a11ef..b5defea32e75 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -2128,9 +2128,12 @@ static inline int
> > valid_emergency_stack(unsigned long sp, struct task_struct *p,
> > return 0;
> >  }
> >  
> > -
> > -int validate_sp(unsigned long sp, struct task_struct *p,
> > -  unsigned long nbytes)
> > +/*
> > + * validate the stack frame of a particular minimum size, used for
> > when we are
> > + * looking at a certain object in the stack beyond the minimum.
> > + */
> > +int validate_sp_size(unsigned long sp, struct task_struct *p,
> > +    unsigned long nbytes)
> >  {
> > unsigned long stack_page = (unsigned long)task_stack_page(p);
> >  
> > @@ -2146,7 +2149,10 @@ int validate_sp(unsigned long sp, struct
> > task_struct *p,
> > return valid_emergency_stack(sp, p, nbytes);
> >  }
> >  
> > -EXPORT_SYMBOL(validate_sp);
> > +int validate_sp(unsigned long sp, struct task_struct *p)
> > +{
> > +   return validate_sp(sp, p, STACK_FRAME_OVERHEAD);
>
> Hi Nick, I assume this is supposed to be validate_sp_size()?  Did you get
> this to compile?

Oops, yeah I think I sent a slightly stale version of the series. I
did fix that one.

Thanks,
Nick


Re: [PATCH v9 5/7] powerpc/tlb: Add local flush for page given mm_struct and psize

2022-11-07 Thread Nicholas Piggin
On Mon Nov 7, 2022 at 4:58 PM AEST, Benjamin Gray wrote:
> On Thu, 2022-11-03 at 11:39 +1100, Benjamin Gray wrote:
> > On Wed, 2022-11-02 at 09:56 +, Christophe Leroy wrote:
> > > By the way, the 'extern' keyword is pointless and deprecated for
> > > function prototypes, so please don't add new ones, even if other
> > > historical prototypes have one.
> > 
> > This and the above commented parts match the style of the surrounding
> > implementations. For example,
> > 
> > static inline void local_flush_tlb_mm(struct mm_struct *mm)
> > {
> > if (radix_enabled())
> > return radix__local_flush_tlb_mm(mm);
> > return hash__local_flush_tlb_mm(mm);
> > }
> > 
> > I am not going to add code that is inconsistent with the surrounding
> > code. That just causes confusion later down the line when readers
> > wonder why this function is special compared to the others. If it
> > needs
> > to use modern style, then I would be happy to include a patch that
> > modernises the surrounding code first.
> > 
> > Though why the hash__* functions are empty I'm not sure... If they were
> > not implemented I would have expected a BUILD_BUG(). If they are
> > unnecessary due to hash itself, it's odd that they exist (as you point
> > out for the new one).
>
> From what I can see in the history, the empty hash functions were
> originally introduced as 64-bit compatibility definitions for the hash
> MMU (which I guess doesn't require any action).

Yeah, the hash MMU does the hash PTE update and TLB flushing in the Linux
pte update APIs (which end up at hash__pte_update()). By the time Linux
calls flush_tlb_xxx(), the powerpc code has already done the necessary
TLB flushing.

> These empty functions
> were shuffled around, then eventually prefixed with hash__* to make way
> for the radix variants, which are hidden behind a generic
> 'local_flush_tlb_mm' etc. implementation as we see today. So basically,
> the empty hash__* functions no longer (never, really) served a purpose
> once the generic wrapper was added. I'll add a patch to delete them and
> clean up the return voids.

Yeah I think you got it - the functions had to be there pre-radix
because they were required by core code, and when radix was added
it probably just followed a template. Removing empty hash__ functions
should be fine I think.

Thanks,
Nick


Re: [PATCH v5 2/2] powerpc/64: Add module check for ELF ABI version

2022-11-07 Thread Nicholas Piggin
On Thu Nov 3, 2022 at 6:35 PM AEST, Christophe Leroy wrote:
>
>
> Le 31/10/2022 à 13:07, Nicholas Piggin a écrit :
> > Override the generic module ELF check to provide a check for the ELF ABI
> > version. This becomes important if we allow big-endian ELF ABI V2 builds
> > but it doesn't hurt to check now.
> > 
> > Cc: Jessica Yu 
> > Signed-off-by: Michael Ellerman 
> > [np: split patch, added changelog, adjust to Jessica's proposal]
> > Signed-off-by: Nicholas Piggin 
> > ---
> >   arch/powerpc/kernel/module.c | 17 +
> >   1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
> > index f6d6ae0a1692..d46bf9bfda26 100644
> > --- a/arch/powerpc/kernel/module.c
> > +++ b/arch/powerpc/kernel/module.c
> > @@ -19,6 +19,23 @@
> >   
> >   static LIST_HEAD(module_bug_list);
> >   
> > +#ifdef CONFIG_PPC64
>
> Can it go in arch/powerpc/kernel/module_64.c instead ?
>
> > +bool module_elf_check_arch(Elf_Ehdr *hdr)
> > +{
> > +   unsigned long abi_level = hdr->e_flags & 0x3;
> > +
> > +   if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2)) {
> > +   if (abi_level != 2)
> > +   return false;
> > +   } else {
> > +   if (abi_level >= 2)
> > +   return false;
> > +   }
> > +
> > +   return true;
>
> Can be simpler:
>
>   if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2))
>   return abi_level == 2;
>   else
>   return abi_level < 2;

Yes I think both of those can be done. Good suggestions.

Thanks,
Nick


Re: [PATCH 1/2] powerpc: export the CPU node count

2022-11-07 Thread Nicholas Piggin
On Sat Oct 29, 2022 at 2:00 AM AEST, Laurent Dufour wrote:
> At boot time, the FDT is parsed to compute the number of CPUs.
> In addition, count the number of CPU nodes and export it.
>
> This is useful when building the FDT for a kexeced kernel since we need to
> take in account the CPU node added since the boot time during CPU hotplug
> operations.

It would be nice if it just realloced memory in this case, but that
looks like a bigger change.

But these patches look okay to me, if you can solve the compile bug.

Thanks,
Nick

>
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/include/asm/kexec_ranges.h | 2 ++
>  arch/powerpc/kernel/prom.c  | 4 
>  2 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/kexec_ranges.h 
> b/arch/powerpc/include/asm/kexec_ranges.h
> index f83866a19e87..bf35d00ddd09 100644
> --- a/arch/powerpc/include/asm/kexec_ranges.h
> +++ b/arch/powerpc/include/asm/kexec_ranges.h
> @@ -22,4 +22,6 @@ int add_rtas_mem_range(struct crash_mem **mem_ranges);
>  int add_opal_mem_range(struct crash_mem **mem_ranges);
>  int add_reserved_mem_ranges(struct crash_mem **mem_ranges);
>  
> +extern unsigned int boot_cpu_node_count;
> +
>  #endif /* _ASM_POWERPC_KEXEC_RANGES_H */
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 1eed87d954ba..d326148fd5a4 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -56,6 +56,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -72,6 +73,7 @@ int __initdata iommu_is_off;
>  int __initdata iommu_force_on;
>  unsigned long tce_alloc_start, tce_alloc_end;
>  u64 ppc64_rma_size;
> +unsigned int boot_cpu_node_count __ro_after_init;
>  #endif
>  static phys_addr_t first_memblock_size;
>  static int __initdata boot_cpu_count;
> @@ -335,6 +337,8 @@ static int __init early_init_dt_scan_cpus(unsigned long 
> node,
>   if (type == NULL || strcmp(type, "cpu") != 0)
>   return 0;
>  
> + boot_cpu_node_count++;
> +
>   /* Get physical cpuid */
>   intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", &len);
>   if (!intserv)
> -- 
> 2.38.1



Re: [PATCH] powerpc: Interrupt handler stack randomisation

2022-11-07 Thread Nicholas Piggin
On Tue Oct 25, 2022 at 1:38 PM AEST, Rohan McLure wrote:
> Stack frames used by syscall handlers support random offsets as of
> commit f4a0318f278d (powerpc: add support for syscall stack randomization).
> Implement the same for general interrupt handlers, by applying the
> random stack offset and then updating this offset from within the
> DEFINE_INTERRUPT_HANDLER macros.
>
> Applying this offset perturbs the layout of interrupt handler stack
> frames, rendering the kernel stack more difficult to control by means
> of user-invoked interrupts.

Can you randomise the irq/softirq stacks as well with a patch 2?

_Probably_ don't have to do another mftb for those AFAICS. Hard irq
should be driven by one of these handlers (including in the irq replay
case). Softirq fast path is driven from irq_exit() from the async
irq handlers, so that should be pretty well randomized. Softirq
slowpath is done by ksoftirqd using its own stack, so nothing to
be done there.

>
> Link: https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-May/243238.html
>
> Signed-off-by: Rohan McLure 
> ---
>  arch/powerpc/include/asm/interrupt.h | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index 4745bb9998bd..b7f7beff4e13 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -68,6 +68,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -448,9 +449,12 @@ interrupt_handler long func(struct pt_regs *regs)
> \
>   long ret;   \
>   \
>   __hard_RI_enable(); \
> + add_random_kstack_offset(); \
>   \
>   ret = ____##func (regs);\
>   \
> + choose_random_kstack_offset(mftb());\
> + \
>   return ret; \
>  }\
>  NOKPROBE_SYMBOL(func);   
> \
> @@ -480,9 +484,11 @@ static __always_inline void ____##func(struct pt_regs
> *regs);\
>  interrupt_handler void func(struct pt_regs *regs)\
>  {\
>   interrupt_enter_prepare(regs);  \
> + add_random_kstack_offset(); \
>   \
>   ____##func (regs);  \
>   \
> + choose_random_kstack_offset(mftb());\
>   interrupt_exit_prepare(regs);   \
>  }\
>  NOKPROBE_SYMBOL(func);   
> \

Hmm. It uses per-cpu data, so it actually can't be used for all
interrupt types: not HMI and MCE because they're real-mode, not SLB
because that might take an SLB fault, and maybe not hash faults either
because they might fault again, I think?

You'd have to change the core code to let us put the offset in the paca.
Not sure if 32-bit has any such restrictions (e.g., does 32s need
MSR[RI] enabled before accessing per-cpu data?)

If you can do that that then I'd also put it as the first thing in
the func(), because the enter_prepare code can call non-trivial code
(e.g., irq_enter / nmi_enter).

Thanks,
Nick



Re: [PATCH 1/2] tools/perf: Fix printing os->prefix in CSV metrics output

2022-11-07 Thread kajoljain



On 11/4/22 13:07, Disha Goel wrote:
> On 11/4/22 12:59 PM, Athira Rajeev wrote:
>>> On 03-Nov-2022, at 9:45 PM, Ian Rogers  wrote:
>>>
>>> On Wed, Nov 2, 2022 at 1:36 AM Athira Rajeev
>>>  wrote:
> On 18-Oct-2022, at 2:26 PM, Athira Rajeev  
> wrote:
>
> Perf stat with CSV output option prints an extra empty
> string as first field in metrics output line.
> Sample output below:
>
>  # ./perf stat -x, --per-socket -a -C 1 ls
>  S0,1,1.78,msec,cpu-clock,1785146,100.00,0.973,CPUs utilized
>  S0,1,26,,context-switches,1781750,100.00,0.015,M/sec
>  S0,1,1,,cpu-migrations,1780526,100.00,0.561,K/sec
>  S0,1,1,,page-faults,1779060,100.00,0.561,K/sec
>  S0,1,875807,,cycles,1769826,100.00,0.491,GHz
>  S0,1,85281,,stalled-cycles-frontend,1767512,100.00,9.74,frontend 
> cycles idle
>  S0,1,576839,,stalled-cycles-backend,1766260,100.00,65.86,backend 
> cycles idle
>  S0,1,288430,,instructions,1762246,100.00,0.33,insn per cycle
> > ,S0,1,,,2.00,stalled cycles per insn
>
> The above command line uses field separator as ","
> via "-x," option and per-socket option displays
> socket value as first field. But here the last line
> for "stalled cycles per insn" has "," in the
> beginning.
>
> Sample output using interval mode:
>  # ./perf stat -I 1000 -x, --per-socket -a -C 1 ls
>  0.001813453,S0,1,1.87,msec,cpu-clock,1872052,100.00,0.002,CPUs 
> utilized
>  0.001813453,S0,1,2,,context-switches,1868028,100.00,1.070,K/sec
>  --
>  0.001813453,S0,1,85379,,instructions,1856754,100.00,0.32,insn per 
> cycle
> > 0.001813453,,S0,1,,,1.34,stalled cycles per insn
>
> Above result also has an extra csv separator after
> the timestamp. Patch addresses extra field separator
> in the beginning of the metric output line.
>
> The counter stats are displayed by function
> "perf_stat__print_shadow_stats" in code
> "util/stat-shadow.c". While printing the stats info
> for "stalled cycles per insn", function "new_line_csv"
> is used as new_line callback.
>
> The new_line_csv function has a check for "os->prefix"
> and if prefix is not null, it will be printed along
> with the csv separator.
> Snippet from "new_line_csv":
>  if (os->prefix)
>  fprintf(os->fh, "%s%s", os->prefix, config->csv_sep);
>
> Here os->prefix gets printed followed by ",",
> which is the csv separator. The os->prefix is
> used in the interval mode option ( -I ), to print
> the time stamp on every new line. But the prefix is
> already set to contain the csv separator when used
> in interval mode for the csv option.
>
> Reference: Function "static void print_interval"
> Snippet:
>  sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, 
> config->csv_sep);
>
> Also if prefix is not assigned (if not used with
> -I option), it gets set to empty string.
> Reference: function printout() in util/stat-display.c
> Snippet:
>  .prefix = prefix ? prefix : "",
>
> Since the prefix is already set to contain csv_sep in the interval
> option, the patch removes printing config->csv_sep in the
> new_line_csv function to avoid printing an extra field.
>
> After the patch:
>
>  # ./perf stat -x, --per-socket -a -C 1 ls
>  S0,1,2.04,msec,cpu-clock,2045202,100.00,1.013,CPUs utilized
>  S0,1,2,,context-switches,2041444,100.00,979.289,/sec
>  S0,1,0,,cpu-migrations,2040820,100.00,0.000,/sec
>  S0,1,2,,page-faults,2040288,100.00,979.289,/sec
>  S0,1,254589,,cycles,2036066,100.00,0.125,GHz
>  S0,1,82481,,stalled-cycles-frontend,2032420,100.00,32.40,frontend 
> cycles idle
>  S0,1,113170,,stalled-cycles-backend,2031722,100.00,44.45,backend 
> cycles idle
>  S0,1,88766,,instructions,2030942,100.00,0.35,insn per cycle
>  S0,1,,,1.27,stalled cycles per insn
>
> Fixes: 92a61f6412d3 ("perf stat: Implement CSV metrics output")
> Reported-by: Disha Goel 
> Signed-off-by: Athira Rajeev 
> 
> perf stat csv test passed after applying the patch
> Tested-by: Disha Goel 


Hi,
  Patch looks fine to me.

Reviewed-By: Kajol Jain 

Thanks,
Kajol Jain

> 
 Hi All,

 Looking for review comments for this change.
>>> Hi,
>>>
>>> Thanks for addressing issues in this code. What is the status of the
>>> CSV output test following these changes?
>>>
>>> I think going forward we need to move away from CSV and columns, to
>>> something with 

Re: [PATCH 0/5] powerpc/kprobes: preempt related changes and cleanups

2022-11-07 Thread Nicholas Piggin
+linux-arch

FYI most preempt_enable_no_resched() calls outside of core scheduler
code come from kprobes copy-and-paste, and can possibly be fixed in
similar ways to this series. It's a bit of a footgun API that should
be banished from outside kernel/sched/, at least without an explicit
comment for each case that explains the need and how the missed
resched is resolved or not applicable.
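
(To illustrate the hazard in a generic, hedged form:)

	preempt_disable();
	do_work();
	preempt_enable_no_resched();
	/* If an interrupt set TIF_NEED_RESCHED while preemption was
	 * disabled, the resched is now silently deferred until the next
	 * scheduling point; plain preempt_enable() would honour it
	 * immediately. */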

Thanks,
Nick

On Fri Oct 21, 2022 at 3:28 AM AEST, Naveen N. Rao wrote:
> This series attempts to address some of the concerns raised in 
> https://github.com/linuxppc/issues/issues/440
>
> The last two patches are minor cleanups in related kprobes code.
>
> - Naveen
>
>
> Naveen N. Rao (5):
>   powerpc/kprobes: Remove preempt disable around call to get_kprobe() in
> arch_prepare_kprobe()
>   powerpc/kprobes: Have optimized_callback() use preempt_enable()
>   powerpc/kprobes: Use preempt_enable() rather than the no_resched
> variant
>   powerpc/kprobes: Setup consistent pt_regs across kprobes, optprobes
> and KPROBES_ON_FTRACE
>   powerpc/kprobes: Remove unnecessary headers from kprobes
>
>  arch/powerpc/kernel/kprobes-ftrace.c|  4 
>  arch/powerpc/kernel/kprobes.c   | 16 ++--
>  arch/powerpc/kernel/optprobes.c |  2 +-
>  arch/powerpc/kernel/optprobes_head.S|  5 +
>  arch/powerpc/kernel/trace/ftrace_mprofile.S |  6 ++
>  5 files changed, 14 insertions(+), 19 deletions(-)
>
>
> base-commit: 7dc2a00fdd44a4d0c3bac9fd10558b3933586a0c
> -- 
> 2.38.0



Re: [PATCH 4/5] powerpc/kprobes: Setup consistent pt_regs across kprobes, optprobes and KPROBES_ON_FTRACE

2022-11-07 Thread Nicholas Piggin
On Fri Oct 21, 2022 at 3:29 AM AEST, Naveen N. Rao wrote:
> Ensure a more consistent pt_regs across kprobes, optprobes and
> KPROBES_ON_FTRACE:
> - Drop setting trap to 0x700 under optprobes. This is not accurate and
>   is unnecessary. Instead, zero it out for both optprobes and
>   KPROBES_ON_FTRACE.

Okay I think.

> - Save irq soft mask in the ftrace handler, similar to what we do in
>   optprobes and trap-based kprobes.

This advertises the irqs status of regs correctly, whereas previously
it was uninitialised.

> - Drop setting orig_gpr3 and result to zero in optprobes. These are not
>   relevant under kprobes and should not be used by the handlers.

This is for CFAR, which we can't get anyway because we just branched
here. I would rather zero it explicitly though.

Otherwise,

Reviewed-by: Nicholas Piggin 

>
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/optprobes_head.S| 5 +
>  arch/powerpc/kernel/trace/ftrace_mprofile.S | 6 ++
>  2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/optprobes_head.S b/arch/powerpc/kernel/optprobes_head.S
> index cd4e7bc32609d3..06df09b4e8b155 100644
> --- a/arch/powerpc/kernel/optprobes_head.S
> +++ b/arch/powerpc/kernel/optprobes_head.S
> @@ -49,11 +49,8 @@ optprobe_template_entry:
>   /* Save SPRS */
>   mfmsr   r5
>   PPC_STL r5,_MSR(r1)
> - li  r5,0x700
> - PPC_STL r5,_TRAP(r1)
>   li  r5,0
> - PPC_STL r5,ORIG_GPR3(r1)
> - PPC_STL r5,RESULT(r1)
> + PPC_STL r5,_TRAP(r1)
>   mfctr   r5
>   PPC_STL r5,_CTR(r1)
> 	mflr	r5
> diff --git a/arch/powerpc/kernel/trace/ftrace_mprofile.S b/arch/powerpc/kernel/trace/ftrace_mprofile.S
> index d031093bc43671..f82004089426e6 100644
> --- a/arch/powerpc/kernel/trace/ftrace_mprofile.S
> +++ b/arch/powerpc/kernel/trace/ftrace_mprofile.S
> @@ -107,6 +107,12 @@
>   PPC_STL r9, _CTR(r1)
>   PPC_STL r10, _XER(r1)
>   PPC_STL r11, _CCR(r1)
> +#ifdef CONFIG_PPC64
> + lbz r7, PACAIRQSOFTMASK(r13)
> + std r7, SOFTE(r1)
> +#endif
> + li  r8, 0
> + PPC_STL r8, _TRAP(r1)
>   .endif
>  
>   /* Load _regs in r6 for call below */
> -- 
> 2.38.0



Re: [PATCH 1/4] powerpc: mm: Replace p{u,m,4}d_is_leaf with p{u,m,4}d_leaf

2022-11-07 Thread Christophe Leroy


Le 06/11/2022 à 23:49, Rohan McLure a écrit :
> Replace occurrences of p{u,m,4}d_is_leaf with p{u,m,4}d_leaf, as the
> latter is the name given to checking that a higher-level entry in
> multi-level paging contains a page translation entry (pte). This commit
> allows for p{u,m,4}d_leaf to be used on all powerpc platforms.
> 
> Prior to this commit, both names have been present in the
> kernel, with, as far as I can tell, exactly the same purpose. While the
> 'is' in the name may better indicate that the macro/function is a
> boolean-returning check, the shorter naming scheme is the standard
> across all other architectures.

Would be easier to understand and review if you split it in two patches:
1/ Replace all uses of p{u,m,4}d_is_leaf by p{u,m,4}d_leaf
2/ Properly implement p{u,m,4}d_leaf and remove p{u,m,4}d_is_leaf

> 
> 32-bit systems import pgtable-nop4d.h, which defines a default pud_leaf.
> Define a pud_leaf preprocessor macro on both Book3E/S 32-bit to avoid
> including the default definition in asm/pgtable.h.

I think you should do it the other way round: move it away from
asm/pgtable.h.

For pud_leaf(), you only have to add a stub in asm/nohash/64/pgtable.h.

For pmd_leaf(), you have to add a stub in asm/nohash/pgtable.h and in
asm/book3s/32/pgtable.h.

I think doing it like that would be cleaner.
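
For instance, a stub along these lines (a hypothetical sketch of the
suggestion, e.g. for asm/nohash/64/pgtable.h, where there are no huge
puds):

	#define pud_leaf pud_leaf
	static inline bool pud_leaf(pud_t pud)
	{
		return false;
	}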



Christophe

> 
> Signed-off-by: Rohan McLure 
> ---
> V4: new patch.
> ---
>   arch/powerpc/include/asm/book3s/32/pgtable.h |  2 +-
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 10 --
>   arch/powerpc/include/asm/nohash/32/pgtable.h |  1 +
>   arch/powerpc/include/asm/pgtable.h   | 18 +-
>   arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 ++--
>   arch/powerpc/mm/book3s64/radix_pgtable.c | 14 +++---
>   arch/powerpc/mm/pgtable.c|  6 +++---
>   arch/powerpc/mm/pgtable_64.c |  6 +++---
>   arch/powerpc/xmon/xmon.c |  6 +++---
>   9 files changed, 37 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
> index 75823f39e042..f1b91ad8f3a5 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
> @@ -234,6 +234,7 @@ void unmap_kernel_page(unsigned long va);
>   #define pte_clear(mm, addr, ptep) \
>   do { pte_update(mm, addr, ptep, ~_PAGE_HASHPTE, 0, 0); } while (0)
>   
> +#define pud_leaf pud_leaf
>   #define pmd_none(pmd)   (!pmd_val(pmd))
>   #define pmd_bad(pmd)(pmd_val(pmd) & _PMD_BAD)
>   #define pmd_present(pmd)(pmd_val(pmd) & _PMD_PRESENT_MASK)
> @@ -242,7 +243,6 @@ static inline void pmd_clear(pmd_t *pmdp)
>   *pmdp = __pmd(0);
>   }
>   
> -
>   /*
>* When flushing the tlb entry for a page, we also need to flush the hash
>* table entry.  flush_hash_pages is assembler (for speed) in hashtable.S.
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index c436d8422654..3f51de24e4fc 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1426,16 +1426,14 @@ static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long new_va
>   /*
>* Like pmd_huge() and pmd_large(), but works regardless of config options
>*/
> -#define pmd_is_leaf pmd_is_leaf
> -#define pmd_leaf pmd_is_leaf
> -static inline bool pmd_is_leaf(pmd_t pmd)
> +#define pmd_leaf pmd_leaf
> +static inline bool pmd_leaf(pmd_t pmd)
>   {
>   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
>   }
>   
> -#define pud_is_leaf pud_is_leaf
> -#define pud_leaf pud_is_leaf
> -static inline bool pud_is_leaf(pud_t pud)
> +#define pud_leaf pud_leaf
> +static inline bool pud_leaf(pud_t pud)
>   {
>   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
>   }
> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
> index 0d40b33184eb..04a3b0b128eb 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
> @@ -201,6 +201,7 @@ static inline pte_t pte_mkexec(pte_t pte)
>   }
>   #endif
>   
> +#define pud_leaf pud_leaf
>   #define pmd_none(pmd)   (!pmd_val(pmd))
>   #define pmd_bad(pmd)(pmd_val(pmd) & _PMD_BAD)
>   #define pmd_present(pmd)(pmd_val(pmd) & _PMD_PRESENT_MASK)
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 283f40d05a4d..8e7625a89922 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -134,25 +134,25 @@ static inline void pte_frag_set(mm_context_t *ctx, void *p)
>   }
>   #endif
>   
> -#ifndef pmd_is_leaf
> -#define pmd_is_leaf pmd_is_leaf
> -static inline bool pmd_is_leaf(pmd_t pmd)
> +#ifndef pmd_leaf
> +#define pmd_leaf 

Re: [PATCH 3/5] powerpc/kprobes: Use preempt_enable() rather than the no_resched variant

2022-11-07 Thread Nicholas Piggin
On Fri Oct 21, 2022 at 3:28 AM AEST, Naveen N. Rao wrote:
> preempt_enable_no_resched() is just the same as preempt_enable() when we
> are in an irqs-disabled context. kprobe_handler() and the post/fault
> handlers are all called with irqs disabled. As such, convert those to
> just use preempt_enable().
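
Roughly why the two are equivalent with irqs off (a simplified sketch of
the generic preempt logic, not the exact kernel source):

	/*
	 * preempt_enable() is approximately:
	 *
	 *	decrement preempt count;
	 *	if (count == 0 && need_resched())
	 *		preempt_schedule();
	 *
	 * and preempt_schedule() returns immediately when irqs are
	 * disabled (preemptible() is false), so with irqs off it reduces
	 * to the bare count decrement that preempt_enable_no_resched()
	 * performs.
	 */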

Reviewed-by: Nicholas Piggin 

>
> Reported-by: Nicholas Piggin 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 88f42de681e1f8..86ca5a61ea9afb 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -369,7 +369,7 @@ int kprobe_handler(struct pt_regs *regs)
>  
>   if (ret > 0) {
>   restore_previous_kprobe(kcb);
> - preempt_enable_no_resched();
> + preempt_enable();
>   return 1;
>   }
>   }
> @@ -382,7 +382,7 @@ int kprobe_handler(struct pt_regs *regs)
>   if (p->pre_handler && p->pre_handler(p, regs)) {
>   /* handler changed execution path, so skip ss setup */
>   reset_current_kprobe();
> - preempt_enable_no_resched();
> + preempt_enable();
>   return 1;
>   }
>  
> @@ -395,7 +395,7 @@ int kprobe_handler(struct pt_regs *regs)
>  
>   kcb->kprobe_status = KPROBE_HIT_SSDONE;
>   reset_current_kprobe();
> - preempt_enable_no_resched();
> + preempt_enable();
>   return 1;
>   }
>   }
> @@ -404,7 +404,7 @@ int kprobe_handler(struct pt_regs *regs)
>   return 1;
>  
>  no_kprobe:
> - preempt_enable_no_resched();
> + preempt_enable();
>   return ret;
>  }
>  NOKPROBE_SYMBOL(kprobe_handler);
> @@ -490,7 +490,7 @@ int kprobe_post_handler(struct pt_regs *regs)
>   }
>   reset_current_kprobe();
>  out:
> - preempt_enable_no_resched();
> + preempt_enable();
>  
>   /*
>* if somebody else is singlestepping across a probe point, msr
> @@ -529,7 +529,7 @@ int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
>   restore_previous_kprobe(kcb);
>   else
>   reset_current_kprobe();
> - preempt_enable_no_resched();
> + preempt_enable();
>   break;
>   case KPROBE_HIT_ACTIVE:
>   case KPROBE_HIT_SSDONE:
> -- 
> 2.38.0



Re: [PATCH 2/5] powerpc/kprobes: Have optimized_callback() use preempt_enable()

2022-11-07 Thread Nicholas Piggin
On Fri Oct 21, 2022 at 3:28 AM AEST, Naveen N. Rao wrote:
> Similar to x86 commit 2e62024c265aa6 ("kprobes/x86: Use preempt_enable()
> in optimized_callback()"), change powerpc optprobes to use
> preempt_enable() rather than preempt_enable_no_resched() since powerpc
> also removed irq disabling for optprobes in commit f72180cc93a2c6
> ("powerpc/kprobes: Do not disable interrupts for optprobes and
> kprobes_on_ftrace").

Looks okay. Even if we did have irqs disabled here, we should just use
preempt_enable(), which nests properly inside or outside local irqs.
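
As an illustration of the nesting property (hypothetical snippet, not
from this patch):

	/*
	 * Either ordering is safe:
	 *
	 *	local_irq_save(flags);		preempt_disable();
	 *	preempt_disable();		local_irq_save(flags);
	 *	...				...
	 *	preempt_enable();		local_irq_restore(flags);
	 *	local_irq_restore(flags);	preempt_enable();
	 *
	 * With irqs disabled, preempt_enable() will not reschedule
	 * immediately; a pending reschedule is picked up at the next
	 * preemption point (e.g. interrupt exit) once irqs are on.
	 */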

Reviewed-by: Nicholas Piggin 

>
> Reported-by: Nicholas Piggin 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/optprobes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
> index 3b1c2236cbee57..004fae2044a3e0 100644
> --- a/arch/powerpc/kernel/optprobes.c
> +++ b/arch/powerpc/kernel/optprobes.c
> @@ -112,7 +112,7 @@ static void optimized_callback(struct optimized_kprobe *op,
>   __this_cpu_write(current_kprobe, NULL);
>   }
>  
> - preempt_enable_no_resched();
> + preempt_enable();
>  }
>  NOKPROBE_SYMBOL(optimized_callback);
>  
> -- 
> 2.38.0



Re: [PATCH 1/5] powerpc/kprobes: Remove preempt disable around call to get_kprobe() in arch_prepare_kprobe()

2022-11-07 Thread Nicholas Piggin
On Fri Oct 21, 2022 at 3:28 AM AEST, Naveen N. Rao wrote:
> arch_prepare_kprobe() is called from register_kprobe() via
> prepare_kprobe(), or through register_aggr_kprobe(), both with the
> kprobe_mutex held. Per the comment for get_kprobe():
>   /*
>* This routine is called either:
>*  - under the 'kprobe_mutex' - during kprobe_[un]register().
>*  OR
>*  - with preemption disabled - from architecture specific code.
>*/

That comment should read [un]register_kprobe(), right?

>
> As such, there is no need to disable preemption around the call to
> get_kprobe(). Drop the same.

And prepare_kprobe() and register_aggr_kprobe() are both called with
kprobe_mutex held, so you rely on the same protection. This seems to
fix a lost-resched bug with preempt kernels too.
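
For reference, the call chains being relied upon (simplified, per the
description above):

	/*
	 * register_kprobe()			// takes kprobe_mutex
	 *   -> prepare_kprobe()
	 *        -> arch_prepare_kprobe()	// get_kprobe() safe here
	 *
	 * register_aggr_kprobe()		// also under kprobe_mutex
	 *   -> ... -> arch_prepare_kprobe()
	 */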

Reviewed-by: Nicholas Piggin 

>
> Reported-by: Nicholas Piggin 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index bd7b1a03545948..88f42de681e1f8 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -158,9 +158,7 @@ int arch_prepare_kprobe(struct kprobe *p)
> 	printk("Cannot register a kprobe on the second word of prefixed instruction\n");
>   ret = -EINVAL;
>   }
> - preempt_disable();
>   prev = get_kprobe(p->addr - 1);
> - preempt_enable_no_resched();
>  
>   /*
>* When prev is a ftrace-based kprobe, we don't have an insn, and it
> -- 
> 2.38.0