Re: [PATCH v4 1/2] powerpc/eeh: fix pseries_eeh_configure_bridge()
Sam Bobroff writes: > If a device is hot unplugged during EEH recovery, it's possible for the > RTAS call to ibm,configure-pe in pseries_eeh_configure() to return > parameter error (-3), however negative return values are not checked > for and this leads to an infinite loop. > > Fix this by correctly bailing out on negative values. Reviewed-by: Nathan Lynch Thanks.
Re: [PATCH v2 3/4] powerpc/memhotplug: Make lmb size 64bit
"Aneesh Kumar K.V" writes: > @@ -322,12 +322,16 @@ static int pseries_remove_mem_node(struct device_node > *np) > /* >* Find the base address and size of the memblock >*/ > - regs = of_get_property(np, "reg", NULL); > - if (!regs) > + prop = of_get_property(np, "reg", NULL); > + if (!prop) > return ret; > > - base = be64_to_cpu(*(unsigned long *)regs); > - lmb_size = be32_to_cpu(regs[3]); > + /* > + * "reg" property represents (addr,size) tuple. > + */ > + base = of_read_number(prop, mem_addr_cells); > + prop += mem_addr_cells; > + lmb_size = of_read_number(prop, mem_size_cells); Would of_n_size_cells() and of_n_addr_cells() work here?
Re: [PATCH v2 1/4] powerpc/drmem: Make lmb_size 64 bit
"Aneesh Kumar K.V" writes: > Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block > panic") > make sure different variables tracking lmb_size are updated to be 64 bit. > > This was found by code audit. > > Cc: sta...@vger.kernel.org > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/drmem.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/drmem.h > b/arch/powerpc/include/asm/drmem.h > index 17ccc6474ab6..d719cbac34b2 100644 > --- a/arch/powerpc/include/asm/drmem.h > +++ b/arch/powerpc/include/asm/drmem.h > @@ -21,7 +21,7 @@ struct drmem_lmb { > struct drmem_lmb_info { > struct drmem_lmb*lmbs; > int n_lmbs; > - u32 lmb_size; > + u64 lmb_size; > }; > > extern struct drmem_lmb_info *drmem_info; > @@ -67,7 +67,7 @@ struct of_drconf_cell_v2 { > #define DRCONF_MEM_RESERVED 0x0080 > #define DRCONF_MEM_HOTREMOVABLE 0x0100 > > -static inline u32 drmem_lmb_size(void) > +static inline u64 drmem_lmb_size(void) > { > return drmem_info->lmb_size; > } Looks fine. Acked-by: Nathan Lynch
Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
Scott Cheloha writes: > At memory hot-remove time we can retrieve an LMB's nid from its > corresponding memory_block. There is no need to store the nid > in multiple locations. > > Note that lmb_to_memblock() uses find_memory_block() to get the > corresponding memory_block. As find_memory_block() runs in sub-linear > time this approach is negligibly slower than what we do at present. > > In exchange for this lookup at hot-remove time we no longer need to > call memory_add_physaddr_to_nid() during drmem_init() for each LMB. > On powerpc, memory_add_physaddr_to_nid() is a linear search, so this > spares us an O(n^2) initialization during boot. > > On systems with many LMBs that initialization overhead is palpable and > disruptive. For example, on a box with 249854 LMBs we're seeing > drmem_init() take upwards of 30 seconds to complete: > > [ 53.721639] drmem: initializing drmem v2 > [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! > [swapper/0:1] > [ 80.604377] Modules linked in: > [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 > [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: > > [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) > [ 80.604412] MSR: 82009033 CR: > 44000248 XER: 000d > [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 > [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 > c0003cfede30 > [ 80.604431] GPR04: c0f4095a 002f > 1000 > [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 > c00c0002fdfb2001 > [ 80.604431] GPR12: c0001e8ec200 > [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 > [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 > [ 80.604492] Call Trace: > [ 80.604498] [c0002dbff8493ac0] [c00a4940] > hot_add_scn_to_nid+0x60/0x3e0 (unreliable) > [ 80.604509] [c0002dbff8493b20] [c0087c10] > memory_add_physaddr_to_nid+0x20/0x60 > [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 > [ 80.604530] [c0002dbff8493c10] [c0010154] > do_one_initcall+0x64/0x2c0 > [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] > kernel_init_freeable+0x2d8/0x3a0 > [ 80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 > [ 80.604560] [c0002dbff8493e20] [c000b648] > ret_from_kernel_thread+0x5c/0x74 > [ 80.604567] Instruction dump: > [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 > 7d094214 > [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac > e949 7fbe5040 > [ 89.047390] drmem: 249854 LMB(s) > > With a patched kernel on the same machine we're no longer seeing the > soft lockup. drmem_init() now completes in negligible time, even when > the LMB count is large. This has been a real annoyance, and it's great to see it fixed in a way that both simplifies the code and reduces data structure size. > > Signed-off-by: Scott Cheloha I think this also should have: Fixes: b2d3b5ee66f2 ("powerpc/pseries: Track LMB nid instead of using device tree") since you're essentially reverting it. The problem that commit was trying to address arose from the fact that the removal path was determining the Linux node for a block from the firmware description of the block: memory_add_physaddr_to_nid hot_add_scn_to_nid hot_add_drconf_scn_to_nid of_drconf_to_nid_single [parses the DT for the nid] However, this can become out of sync with Linux's NUMA assignment for the memory due to VM migration or other events. I'm satisfied that this change does not reintroduce that problem. The removal path still derives the node without going to the device tree, but now performs a more efficient lookup. 
Reviewed-by: Nathan Lynch
Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
Nathan Lynch writes: > I'm satisfied that this change does not reintroduce that problem. The > removal path still derives the node without going to the device tree, > but now performs a more efficient lookup. Er, actually it's slightly less efficient in the remove path, as you note in your commit message. This doesn't alter my opinion of the change. The higher-level algorithmic inefficiencies (e.g. all the for_each_drmem_lmb loops) in this code are more of a concern.
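For context, a rough sketch of the kind of lookup being discussed follows; the helper name and details are illustrative assumptions, not the patch itself:

/* Derive the node id from the LMB's memory_block rather than a cached field. */
static int lmb_nid_from_memblock(const struct drmem_lmb *lmb)
{
	struct memory_block *mem;
	int nid;

	/* find_memory_block() takes a reference on the block's device. */
	mem = find_memory_block(__pfn_to_section(PFN_DOWN(lmb->base_addr)));
	if (!mem)
		return NUMA_NO_NODE;

	/* Node assignment as recorded by Linux, not re-read from the DT. */
	nid = mem->nid;
	put_device(&mem->dev);

	return nid;
}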
[PATCH 0/2] pseries extended cede cpu offline mode removal
I propose removing the "extended cede" offline mode for CPUs as well as the partition suspend code which accommodates it by temporarily onlining all CPUs prior to suspending the LPAR. Detailed justifications are within the individual commit messages. I'm hoping to move the pseries partition suspend implementation to the Linux suspend framework, and I expect that having only one offline mode for CPUs will make that task quite a bit less complex. Nathan Lynch (2): powerpc/pseries: remove cede offline state for CPUs powerpc/rtas: don't online CPUs for partition suspend Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/include/asm/rtas.h | 2 - arch/powerpc/kernel/rtas.c| 122 + arch/powerpc/platforms/pseries/hotplug-cpu.c | 170 ++ .../platforms/pseries/offline_states.h| 38 arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +-- arch/powerpc/platforms/pseries/suspend.c | 22 +-- 8 files changed, 18 insertions(+), 372 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h -- 2.25.4
[PATCH 2/2] powerpc/rtas: don't online CPUs for partition suspend
Partition suspension, used for hibernation and migration, requires that the OS place all but one of the LPAR's processor threads into one of two states prior to calling the ibm,suspend-me RTAS function: * the architected offline state (via RTAS stop-self); or * the H_JOIN hcall, which does not return until the partition resumes execution Using H_CEDE as the offline mode, introduced by commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), means that any threads which are offline from Linux's point of view must be moved to one of those two states before a partition suspension can proceed. This was eventually addressed in commit 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation"), which added code to temporarily bring up any offline processor threads so they can call H_JOIN. Conceptually this is fine, but the implementation has had multiple races with cpu hotplug operations initiated from user space[1][2][3], the error handling is fragile, and it generates user-visible cpu hotplug events which is a lot of noise for a platform feature that's supposed to minimize disruption to workloads. With commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") reverted, this code becomes unnecessary, so remove it. Since any offline CPUs now are truly offline from the platform's point of view, it is no longer necessary to bring up CPUs only to have them call H_JOIN and then go offline again upon resuming. Only active threads are required to call H_JOIN; stopped threads can be left alone. [1] commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") [2] commit 9fb603050ffd ("powerpc/rtas: retry when cpu offline races with suspend/migration") [3] commit dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration") Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 2 - arch/powerpc/kernel/rtas.c | 122 +-- arch/powerpc/platforms/pseries/suspend.c | 22 +--- 3 files changed, 3 insertions(+), 143 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 3c1887351c71..bd227e0eab07 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -368,8 +368,6 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -extern int rtas_online_cpus_mask(cpumask_var_t cpus); -extern int rtas_offline_cpus_mask(cpumask_var_t cpus); extern int rtas_ibm_suspend_me(u64 handle); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index c5fa251b8950..01210593d60c 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -842,96 +842,6 @@ static void rtas_percpu_suspend_me(void *info) __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); } -enum rtas_cpu_state { - DOWN, - UP, -}; - -#ifndef CONFIG_SMP -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - if (!cpumask_empty(cpus)) { - cpumask_clear(cpus); - return -EINVAL; - } else - return 0; -} -#else -/* On return cpumask will be altered to indicate CPUs changed. 
- * CPUs with states changed will be set in the mask, - * CPUs with status unchanged will be unset in the mask. */ -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - int cpu; - int cpuret = 0; - int ret = 0; - - if (cpumask_empty(cpus)) - return 0; - - for_each_cpu(cpu, cpus) { - struct device *dev = get_cpu_device(cpu); - - switch (state) { - case DOWN: - cpuret = device_offline(dev); - break; - case UP: - cpuret = device_online(dev); - break; - } - if (cpuret < 0) { - pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", - __func__, - ((state == UP) ? "up" : "down"), - cpu, cpuret); - if (!ret) - ret = cpuret; - if (state == UP) { - /* clear bits for unchanged cpus, return */ -
[PATCH 1/2] powerpc/pseries: remove cede offline state for CPUs
This effectively reverts commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), which added an offline mode for CPUs which uses the H_CEDE hcall instead of the architected stop-self RTAS function in order to facilitate "folding" of dedicated mode processors on PowerVM platforms to achieve energy savings. This has been the default offline mode since its introduction. There's nothing about stop-self that would prevent the hypervisor from achieving the energy savings available via H_CEDE, so the original premise of this change appears to be flawed. I also have encountered the claim that the transition to and from ceded state is much faster than stop-self/start-cpu. Certainly we would not want to use stop-self as an *idle* mode. That is what H_CEDE is for. However, this difference is insignificant in the context of Linux CPU hotplug, where the latency of an offline or online operation on current systems is on the order of 100ms, mainly attributable to all the various subsystems' cpuhp callbacks. The cede offline mode also prevents accurate accounting, as discussed before: https://lore.kernel.org/linuxppc-dev/1571740391-3251-1-git-send-email-...@linux.vnet.ibm.com/ Unconditionally use stop-self to offline processor threads. This is the architected method for offlining CPUs on PAPR systems. The "cede_offline" boot parameter is rendered obsolete. Removing this code enables the removal of the partition suspend code which temporarily onlines all present CPUs. Fixes: 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") Signed-off-by: Nathan Lynch --- Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 170 ++ .../platforms/pseries/offline_states.h| 38 arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +-- 5 files changed, 15 insertions(+), 229 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7817f7..b1ae1ac159cf 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -50,13 +50,6 @@ Command Line Switches This option is limited to the X86 and S390 architecture. -``cede_offline={"off","on"}`` - Use this option to disable/enable putting offlined processors to an extended - ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". - - This option is limited to the PowerPC architecture. - ``cpu0_hotplug`` Allow to shutdown CPU0. diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 3e8cbfe7a80f..d4b346355bb9 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -35,54 +35,10 @@ #include #include "pseries.h" -#include "offline_states.h" /* This version can't take the spinlock, because it never returns */ static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE; -static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = - CPU_STATE_OFFLINE; -static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE; - -static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; - -static bool cede_offline_enabled __read_mostly = true; - -/* - * Enable/disable cede_offline when available. 
- */ -static int __init setup_cede_offline(char *str) -{ - return (kstrtobool(str, _offline_enabled) == 0); -} - -__setup("cede_offline=", setup_cede_offline); - -enum cpu_state_vals get_cpu_current_state(int cpu) -{ - return per_cpu(current_state, cpu); -} - -void set_cpu_current_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(current_state, cpu) = state; -} - -enum cpu_state_vals get_preferred_offline_state(int cpu) -{ - return per_cpu(preferred_offline_state, cpu); -} - -void set_preferred_offline_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(preferred_offline_state, cpu) = state; -} - -void set_default_offline_state(int cpu) -{ - per_cpu(preferred_offline_state, cpu) = default_offline_state; -} - static void rtas_stop_self(void) { static struct rtas_args args; @@ -101,9 +57,7 @@ static void rtas_stop_self(void) static void pseries_mach_cpu_die(void) { - unsigned int cpu = smp_processor_id(); unsigned int hwcpu = hard_smp_processor_id(); - u8 cede_latency_hint = 0; local_irq_disable(); idle_task_exit(); @@ -112,49 +66,6 @@ static void pseries_mach_cpu_die(void) else xics_teardown_cpu(); - if (get_preferred_offli
Re: Build regressions/improvements in v5.10-rc1
Geert Uytterhoeven writes: > On Mon, Oct 26, 2020 at 10:46 AM Geert Uytterhoeven > wrote: >> Below is the list of build error/warning regressions/improvements in >> v5.10-rc1[1] compared to v5.9[2]. >> >> Summarized: >> - build errors: +3/-7 >> - build warnings: +26/-28 >> >> Happy fixing! ;-) >> >> Thanks to the linux-next team for providing the build service. >> >> [1] >> http://kisskb.ellerman.id.au/kisskb/branch/linus/head/3650b228f83adda7e5ee532e2b90429c03f7b9ec/ >> (all 192 configs) >> [2] >> http://kisskb.ellerman.id.au/kisskb/branch/linus/head/bbf5c979011a099af5dc76498918ed7df445635b/ >> (all 192 configs) >> >> >> *** ERRORS *** >> >> 3 error regressions: >> + /kisskb/src/arch/um/kernel/skas/clone.c: error: expected declaration >> specifiers or '...' before string constant: => 24:16 > > um-all{mod,yes}config > >> + error: hotplug-memory.c: undefined reference to >> `of_drconf_to_nid_single': => .text+0x5e0) > > powerpc-gcc5/pseries_le_defconfig+NO_NUMA Probably introduced by: commit 72cdd117c449896c707fc6cfe5b90978160697d0 Author: Scott Cheloha Date: Wed Sep 16 09:51:22 2020 -0500 pseries/hotplug-memory: hot-add: skip redundant LMB lookup Scott?
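For what it's worth, undefined references like this in !CONFIG_NUMA configs are usually resolved with a stub next to the real declaration, roughly along these lines (an illustrative sketch, not necessarily the fix that ends up being applied):

#ifdef CONFIG_NUMA
int of_drconf_to_nid_single(struct drmem_lmb *lmb);
#else
static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
{
	return first_online_node;
}
#endif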
Re: [PATCH 2/2] powerpc/numa: Remove a redundant variable
Srikar Dronamraju writes: > In of_drconf_to_nid_single() default_nid always refers to NUMA_NO_NODE. > Hence replace it with NUMA_NO_NODE. > > No functional changes. > ... > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index a2c5fe0d0cad..b066d89c2975 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -432,16 +432,15 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa) > static int of_drconf_to_nid_single(struct drmem_lmb *lmb) > { > struct assoc_arrays aa = { .arrays = NULL }; > - int default_nid = NUMA_NO_NODE; > - int nid = default_nid; > + int nid = NUMA_NO_NODE; > int rc, index; > > if ((min_common_depth < 0) || !numa_enabled) > - return default_nid; > + return NUMA_NO_NODE; > > rc = of_get_assoc_arrays(&aa); > if (rc) > - return default_nid; > + return NUMA_NO_NODE; > > if (min_common_depth <= aa.array_sz && > !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < > aa.n_arrays) { > @@ -449,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb) > nid = of_read_number(&aa.arrays[index], 1); > > if (nid == 0xffff || nid >= num_possible_nodes()) > - nid = default_nid; > + nid = NUMA_NO_NODE; > > if (nid > 0) { > index = lmb->aa_index * aa.array_sz; Doesn't seem like an improvement to me, sorry. Admittedly it's a small point, but a local variable named "default_nid" communicates that it's a default or fallback value. That information is lost with this change.
Re: [PATCH 1/2] powerpc/numa: Limit possible nodes to within num_possible_nodes
Srikar Dronamraju writes: > MAX_NUMNODES is a theoretical maximum number of nodes that is supported > by the kernel. Device tree properties expose the number of possible > nodes on the current platform. The kernel would detect this and would > use it for most of its resource allocations. If the platform now > increases the nodes to over what was already exposed, then it may lead > to inconsistencies. Hence limit it to the already exposed nodes. > > Suggested-by: Nathan Lynch > Cc: linuxppc-dev > Cc: Michael Ellerman > Cc: Nathan Lynch > Cc: Tyrel Datwyler > Signed-off-by: Srikar Dronamraju > --- > arch/powerpc/mm/numa.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 8ec7ff05ae47..a2c5fe0d0cad 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -221,7 +221,7 @@ static void initialize_distance_lookup_table(int nid, > } > } > > -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa > +/* Returns nid in the range [0..num_possible_nodes()-1], or -1 if no useful > numa > * info is found. > */ > static int associativity_to_nid(const __be32 *associativity) > @@ -235,7 +235,7 @@ static int associativity_to_nid(const __be32 > *associativity) > nid = of_read_number(&associativity[min_common_depth], 1); > > /* POWER4 LPAR uses 0xffff as invalid node */ > - if (nid == 0xffff || nid >= MAX_NUMNODES) > + if (nid == 0xffff || nid >= num_possible_nodes()) > nid = NUMA_NO_NODE; > > if (nid > 0 && > @@ -448,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb) > index = lmb->aa_index * aa.array_sz + min_common_depth - 1; > nid = of_read_number(&aa.arrays[index], 1); > > - if (nid == 0xffff || nid >= MAX_NUMNODES) > + if (nid == 0xffff || nid >= num_possible_nodes()) > nid = default_nid; > > if (nid > 0) { Yes, looks fine to me. Reviewed-by: Nathan Lynch
Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
"Aneesh Kumar K.V" writes: > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index e437a9ac4956..6c659aada55b 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid, > } > } > > +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = > NUMA_NO_NODE}; It's odd to me to use MAX_NUMNODES for this array when it's going to be indexed not by Linux's logical node IDs but by the platform-provided domain number, which has no relation to MAX_NUMNODES. > + > +int firmware_group_id_to_nid(int firmware_gid) > +{ > + static int last_nid = 0; > + > + /* > + * For PowerNV we don't change the node id. This helps to avoid > + * confusion w.r.t the expected node ids. On pseries, node numbers > + * are virtualized. Hence do logical node id for pseries. > + */ > + if (!firmware_has_feature(FW_FEATURE_LPAR)) > + return firmware_gid; > + > + if (firmware_gid == -1) > + return NUMA_NO_NODE; > + > + if (nid_map[firmware_gid] == NUMA_NO_NODE) > + nid_map[firmware_gid] = last_nid++; This should at least be bounds-checked in case of domain numbering in excess of MAX_NUMNODES. Or a different data structure should be used? Not sure. I'd prefer Linux's logical node type not be easily interchangeable with the firmware node/group id type. The firmware type could be something like: struct affinity_domain { u32 val; }; typedef struct affinity_domain affinity_domain_t; with appropriate accessors/APIs. This can prevent a whole class of errors that is currently possible with CPUs, since the logical and "hardware" ID types are both simple integers.
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Hi everyone, Michael Ellerman writes: > Greg Kurz writes: >> On Tue, 04 Aug 2020 23:35:10 +1000 >> Michael Ellerman wrote: >>> Spinning forever seems like a bad idea, but as has been demonstrated at >>> least twice now, continuing when we don't know the state of the other >>> CPU can lead to straight up crashes. >>> >>> So I think I'm persuaded that it's preferable to have the kernel stuck >>> spinning rather than oopsing. >>> >> >> +1 >> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My >>> first instinct is no, if we're stuck here for 20s a stack trace would be >>> good. But then we will probably hit that on some big and/or heavily >>> loaded machine. >>> >>> So possibly we should call cond_resched() but have some custom logic in >>> the loop to print a warning if we are stuck for more than some >>> sufficiently long amount of time. >> >> How long should that be ? > > Yeah good question. > > I guess step one would be seeing how long it can take on the 384 vcpu > machine. And we can probably test on some other big machines. > > Hopefully Nathan can give us some idea of how long he's seen it take on > large systems? I know he was concerned about the 20s timeout of the > softlockup detector. Maybe I'm not quite clear what this is referring to, but I don't think stop-self/query-stopped-state latency increases with processor count, at least not on PowerVM. And IIRC I was concerned with the earlier patch's potential for causing the softlockup watchdog to rightly complain by polling the stopped state without ever scheduling away. The fact that smp_query_cpu_stopped() kind of collapses the two distinct results from the query-cpu-stopped-state RTAS call into one return value may make it harder than necessary to reason about the questions around cond_resched() and whether to warn. Sorry to pull this stunt but I have had some code sitting in a neglected branch that I think gets the logic around this right. What we should have is a simple C wrapper for the RTAS call that reflects the architected inputs and outputs: (-- rtas.c --) /** * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state. * @hwcpu: Identifies the processor thread to be queried. * @status: Pointer to status, valid only on success. * * Determine whether the given processor thread is in the stopped * state. If successful and @status is non-NULL, the thread's status * is stored to @status. * * Return: * * 0 - Success * * -1 - Hardware error * * -2 - Busy, try again later */ int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status) { unsigned int cpu_status; int token; int fwrc; token = rtas_token("query-cpu-stopped-state"); fwrc = rtas_call(token, 1, 2, _status, hwcpu); if (fwrc != 0) goto out; if (status != NULL) *status = cpu_status; out: return fwrc; } And then a utility function that waits for the remote thread to enter stopped state, with higher-level logic for rescheduling and warning. The fact that smp_query_cpu_stopped() currently does not handle a -2/busy status is a bug, fixed below by using rtas_busy_delay(). Note the justification for the explicit cond_resched() in the outer loop: (-- rtas.h --) /* query-cpu-stopped-state CPU_status */ #define RTAS_QCSS_STATUS_STOPPED 0 #define RTAS_QCSS_STATUS_IN_PROGRESS 1 #define RTAS_QCSS_STATUS_NOT_STOPPED 2 (-- pseries/hotplug-cpu.c --) /** * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state. 
*/ static void wait_for_cpu_stopped(unsigned int cpu) { unsigned int status; unsigned int hwcpu; hwcpu = get_hard_smp_processor_id(cpu); do { int fwrc; /* * rtas_busy_delay() will yield only if RTAS returns a * busy status. Since query-cpu-stopped-state can * yield RTAS_QCSS_STATUS_IN_PROGRESS or * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded * period before the target thread stops, we must take * care to explicitly reschedule while polling. */ cond_resched(); do { fwrc = rtas_query_cpu_stopped_state(hwcpu, &status); } while (rtas_busy_delay(fwrc)); if (fwrc == 0) continue; pr_err_ratelimited("query-cpu-stopped-state for " "thread 0x%x returned %d\n", hwcpu, fwrc); goto out; } while (status == RTAS_QCSS_STATUS_NOT_STOPPED || status == RTAS_QCSS_STATUS_IN_PROGRESS); if (status != RTAS_QCSS_STATUS_STOPPED) {
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Nathan Lynch writes: > Michael Ellerman writes: >> One thought, which I possibly should not put in writing, is that we >> could use the alignment of the pointer as a poor man's substitute for a >> counter, eg: >> >> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) >> +{ >> +if (lmb % PAGE_SIZE == 0) >> +cond_resched(); >> + >> +return ++lmb; >> +} >> >> I think the lmbs are allocated in a block, so I think that will work. >> Maybe PAGE_SIZE is not the right size to use, but you get the idea. >> >> Gross I know, but might be OK as short term solution? > > OK, looking into this. To follow up: I wasn't able to measure more than ~1% difference in DLPAR memory performance with my original version of this, but that was on a relatively small configuration - hundreds of elements in the array as opposed to thousands. I took an educated guess at an appropriate interval and posted v2: https://lore.kernel.org/linuxppc-dev/20200812012005.1919255-1-nath...@linux.ibm.com/
[PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield. Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it. Call cond_resched() on every 20th element. Each iteration of the loop in DLPAR code paths can involve around ten RTAS calls which can each take up to 250us, so this ensures the check is performed at worst every few milliseconds. Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/drmem.h | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) Changes since v1: * Add bounds assertions in drmem_lmb_next(). * Call cond_resched() in the iterator on only every 20th element instead of on every iteration, to reduce overhead in tight loops. diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 17ccc6474ab6..583277e30dd2 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -8,6 +8,9 @@ #ifndef _ASM_POWERPC_LMB_H #define _ASM_POWERPC_LMB_H +#include +#include + struct drmem_lmb { u64 base_addr; u32 drc_index; @@ -26,8 +29,21 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) +{ + const unsigned int resched_interval = 20; + + BUG_ON(lmb < drmem_info->lmbs); + BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs); + + if ((lmb - drmem_info->lmbs) % resched_interval == 0) + cond_resched(); + + return ++lmb; +} + #define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) < (end); (lmb)++) + for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb)) #define for_each_drmem_lmb(lmb)\ for_each_drmem_lmb_in_range((lmb), \ -- 2.25.4
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > One thought, which I possibly should not put in writing, is that we > could use the alignment of the pointer as a poor man's substitute for a > counter, eg: > > +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) > +{ > + if (lmb % PAGE_SIZE == 0) > + cond_resched(); > + > + return ++lmb; > +} > > I think the lmbs are allocated in a block, so I think that will work. > Maybe PAGE_SIZE is not the right size to use, but you get the idea. > > Gross I know, but might be OK as short term solution? OK, looking into this.
Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Hi Christophe, Christophe Leroy writes: >> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) >> +{ >> +const unsigned int resched_interval = 20; >> + >> +BUG_ON(lmb < drmem_info->lmbs); >> +BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs); > > BUG_ON() shall be avoided unless absolutely necessary. > Wouldn't WARN_ON() together with an early return be enough ? Not sure what a sensible early return behavior would be. If the iterator doesn't advance the pointer the behavior will be a hang. BUG_ON a bounds-check failure is appropriate here; many users of this API will corrupt memory otherwise. >> + >> +if ((lmb - drmem_info->lmbs) % resched_interval == 0) >> +cond_resched(); > > Do you need something that precise ? Can't you use 16 or 32 and use a > logical AND instead of a MODULO ? Eh if you're executing in this code you've already lost with respect to performance considerations at this level, see the discussion on v1. I'll use 16 since I'm going to reroll the patch though. > And what garanties that lmb is always an element of a table based at > drmem_info->lmbs ? Well, that's its only intended use right now. There should not be any other arrays of drmem_lmb objects, and I hope we don't gain any. > What about: > > static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb, > struct drmem_lmb *start) > { > const unsigned int resched_interval = 16; > > if ((++lmb - start) & resched_interval == 0) ^^^ Did you mean '%' here? The bitwise AND doesn't do what I want. Otherwise, making drmem_lmb_next() more general by adding a 'start' argument could ease refactoring to come, so I'll do that.
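To spell out the difference between the two expressions being discussed, a standalone illustration (not part of the patch):

static inline bool want_resched(unsigned long idx)
{
	const unsigned long interval = 16;

	/*
	 * The two tests below are equivalent when interval is a power
	 * of two. Note the mask form uses (interval - 1) and needs its
	 * own parentheses: "idx & interval == 0" would parse as
	 * "idx & (interval == 0)" because == binds tighter than &.
	 */
	return (idx % interval) == 0 &&
	       (idx & (interval - 1)) == 0;
}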
Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
"Aneesh Kumar K.V" writes: > On 8/7/20 9:54 AM, Nathan Lynch wrote: >> "Aneesh Kumar K.V" writes: >>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >>> index e437a9ac4956..6c659aada55b 100644 >>> --- a/arch/powerpc/mm/numa.c >>> +++ b/arch/powerpc/mm/numa.c >>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid, >>> } >>> } >>> >>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = >>> NUMA_NO_NODE}; >> >> It's odd to me to use MAX_NUMNODES for this array when it's going to be >> indexed not by Linux's logical node IDs but by the platform-provided >> domain number, which has no relation to MAX_NUMNODES. > > > I didn't want to dynamically allocate this. We could fetch > "ibm,max-associativity-domains" to find the size for that. The current > code do assume firmware group id to not exceed MAX_NUMNODES. Hence kept > the array size to be MAX_NUMNODEs. I do agree that it is confusing. May > be we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES? Well, consider: - ibm,max-associativity-domains can change at runtime with LPM. This doesn't happen in practice yet, but we should probably start thinking about how to support that. - The domain numbering isn't clearly specified to have any particular properties such as beginning at zero or a contiguous range. While the current code likely contains assumptions contrary to these points, a change such as this is an opportunity to think about whether those assumptions can be reduced or removed. In particular I think it would be good to gracefully degrade when the number of NUMA affinity domains can exceed MAX_NUMNODES. Using the platform-supplied domain numbers to directly index Linux data structures will make that impossible. So, maybe genradix or even xarray wouldn't actually be overengineering here.
Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
Michael Ellerman writes: > Tyrel Datwyler writes: >> On 7/27/20 11:46 AM, Scott Cheloha wrote: >>> The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall, >>> Affinity_Domain_Info_By_Partition, which returns, among other things, >>> a "partition affinity score" for a given LPAR. This score, a value on >>> [0-100], represents the processor-memory affinity for the LPAR in >>> question. A score of 0 indicates the worst possible affinity while a >>> score of 100 indicates perfect affinity. The score can be used to >>> reason about performance. >>> >>> This patch adds the score for the local LPAR to the lparcfg procfile >>> under a new 'partition_affinity_score' key. >>> >>> Signed-off-by: Scott Cheloha >> >> I was hoping Michael would chime in the first time around on this patch >> series >> about adding another key/value pair to lparcfg. > > That guy is so unreliable. > > I don't love adding new stuff in lparcfg, but given the file already > exists and there's no prospect of removing it, it's probably not worth > the effort to put the new field anywhere else. > > My other query with this was how on earth anyone is meant to interpret > the metric. ie. if my metric is 50, what does that mean? If it's 90 > should I worry? Here's some more background. This interface is just passing up what the platform provides, and it's identical to the partition affinity score described in the documentation for the management console's lsmemopt command: https://www.ibm.com/support/knowledgecenter/POWER9/p9edm/lsmemopt.html The score is 0-100, higher values are better. To illustrate: I believe a partition's score will be 100 (or very close to it) if all of its CPUs and memory reside within one node. It will be lower than that when a partition has some memory without local CPUs, and lower still when there is no CPU-memory affinity within the partition. Beyond that I don't have more specific information and the algorithm and scale are set by the platform. The intent is for this to be a metric to gather during problem determination e.g. via sosreport or similar, but as far as Linux is concerned this should be treated as an opaque value.
Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
Scott Cheloha writes: > The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall, > Affinity_Domain_Info_By_Partition, which returns, among other things, > a "partition affinity score" for a given LPAR. This score, a value on > [0-100], represents the processor-memory affinity for the LPAR in > question. A score of 0 indicates the worst possible affinity while a > score of 100 indicates perfect affinity. The score can be used to > reason about performance. > > This patch adds the score for the local LPAR to the lparcfg procfile > under a new 'partition_affinity_score' key. > > Signed-off-by: Scott Cheloha > --- > arch/powerpc/platforms/pseries/lparcfg.c | 35 > 1 file changed, 35 insertions(+) > > diff --git a/arch/powerpc/platforms/pseries/lparcfg.c > b/arch/powerpc/platforms/pseries/lparcfg.c > index b8d28ab88178..e278390ab28d 100644 > --- a/arch/powerpc/platforms/pseries/lparcfg.c > +++ b/arch/powerpc/platforms/pseries/lparcfg.c > @@ -136,6 +136,39 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data > *ppp_data) > return rc; > } > > +static void show_gpci_data(struct seq_file *m) > +{ > + struct hv_gpci_request_buffer *buf; > + unsigned int affinity_score; > + long ret; > + > + buf = kmalloc(sizeof(*buf), GFP_KERNEL); > + if (buf == NULL) > + return; > + > + /* > + * Show the local LPAR's affinity score. > + * > + * 0xB1 selects the Affinity_Domain_Info_By_Partition subcall. > + * The score is at byte 0xB in the output buffer. > + */ > + memset(>params, 0, sizeof(buf->params)); > + buf->params.counter_request = cpu_to_be32(0xB1); > + buf->params.starting_index = cpu_to_be32(-1); /* local LPAR */ > + buf->params.counter_info_version_in = 0x5; /* v5+ for score */ > + ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf), > + sizeof(*buf)); > + if (ret != H_SUCCESS) { > + pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n", > + ret, be32_to_cpu(buf->params.detail_rc)); > + goto out; > + } > + affinity_score = buf->bytes[0xB]; > + seq_printf(m, "partition_affinity_score=%u\n", affinity_score); > +out: > + kfree(buf); > +} > + > static unsigned h_pic(unsigned long *pool_idle_time, > unsigned long *num_procs) > { > @@ -487,6 +520,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void > *v) > partition_active_processors * 100); > } > > + show_gpci_data(m); > + > seq_printf(m, "partition_active_processors=%d\n", > partition_active_processors); Acked-by: Nathan Lynch
Re: [PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
Scott Cheloha writes: > The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are > useful to modules outside of perf/, so move them into asm/hvcall.h to live > alongside the other powerpc hypercall structs. > > Leave the perf-specific GPCI stuff in perf/hv-gpci.h. > > Signed-off-by: Scott Cheloha Acked-by: Nathan Lynch
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > Nathan Lynch writes: >> Michael Ellerman writes: >>> Nathan Lynch writes: >>>> Laurent Dufour writes: >>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >>>>>> The drmem lmb list can have hundreds of thousands of entries, and >>>>>> unfortunately lookups take the form of linear searches. As long as >>>>>> this is the case, traversals have the potential to monopolize the CPU >>>>>> and provoke lockup reports, workqueue stalls, and the like unless >>>>>> they explicitly yield. >>>>>> >>>>>> Rather than placing cond_resched() calls within various >>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >>>>>> expression of the loop macro itself so users can't omit it. >>>>> >>>>> Is that not too much to call cond_resched() on every LMB? >>>>> >>>>> Could that be less frequent, every 10, or 100, I don't really know ? >>>> >>>> Everything done within for_each_drmem_lmb is relatively heavyweight >>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens >>>> of milliseconds. I don't think cond_resched() is an expensive check in >>>> this context. >>> >>> Hmm, mostly. >>> >>> But there are quite a few cases like drmem_update_dt_v1(): >>> >>> for_each_drmem_lmb(lmb) { >>> dr_cell->base_addr = cpu_to_be64(lmb->base_addr); >>> dr_cell->drc_index = cpu_to_be32(lmb->drc_index); >>> dr_cell->aa_index = cpu_to_be32(lmb->aa_index); >>> dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb)); >>> >>> dr_cell++; >>> } >>> >>> Which will compile to a pretty tight loop at the moment. >>> >>> Or drmem_update_dt_v2() which has two loops over all lmbs. >>> >>> And although the actual TIF check is cheap the function call to do it is >>> not free. >>> >>> So I worry this is going to make some of those long loops take even >>> longer. >> >> That's fair, and I was wrong - some of the loop bodies are relatively >> simple, not doing allocations or taking locks, etc. >> >> One way to deal is to keep for_each_drmem_lmb() as-is and add a new >> iterator that can reschedule, e.g. for_each_drmem_lmb_slow(). > > If we did that, how many call-sites would need converting? > Is it ~2 or ~20 or ~200? At a glance I would convert 15-20 out of the 24 users in the tree I'm looking at. Let me know if I should do a v2 with that approach.
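For reference, the alternative iterator mentioned above might look something like this (name and placement are assumptions; this variant was not pursued):

/* Rescheduling variant for the heavyweight DLPAR paths. */
#define for_each_drmem_lmb_slow(lmb)					\
	for ((lmb) = &drmem_info->lmbs[0];				\
	     (lmb) < &drmem_info->lmbs[drmem_info->n_lmbs];		\
	     (lmb)++, cond_resched())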
Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > Tyrel Datwyler writes: >> On 8/11/20 6:20 PM, Nathan Lynch wrote: >>> >>> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) >>> +{ >>> + const unsigned int resched_interval = 20; >>> + >>> + BUG_ON(lmb < drmem_info->lmbs); >>> + BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs); >> >> I think BUG_ON is a pretty big no-no these days unless there is no other >> option. > > It's complicated, but yes we would like to avoid adding them if we can. > > In a case like this there is no other option, *if* the check has to be > in drmem_lmb_next(). > > But I don't think we really need to check that there. > > If for some reason this was called with an *lmb pointing outside of the > lmbs array it would confuse the cond_resched() logic, but it's not worth > crashing the box for that IMHO. The BUG_ONs are pretty much orthogonal to the cond_resched(). It's not apparent from the context of the change, but some users of the for_each_drmem_lmb* macros modify elements of the drmem_info->lmbs array. If the lmb passed to drmem_lmb_next() violates the bounds check (say, if the callsite inappropriately modifies it within the loop), such users are guaranteed to corrupt other objects in memory. This was my thinking in adding the BUG_ONs, and I don't see another place to do it.
[PATCH v3] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield. Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it. Introduce a drmem_lmb_next() iteration helper function which calls cond_resched() at a regular interval during array traversal. Each iteration of the loop in DLPAR code paths can involve around ten RTAS calls which can each take up to 250us, so this ensures the check is performed at worst every few milliseconds. Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format") Signed-off-by: Nathan Lynch --- Notes: Changes since v2: * Make drmem_lmb_next() more general. * Adjust reschedule interval for better code generation. * Add commentary to drmem_lmb_next() to explain the cond_resched() call. * Remove bounds assertions. Changes since v1: * Add bounds assertions in drmem_lmb_next(). * Call cond_resched() in the iterator on only every 20th element instead of on every iteration, to reduce overhead in tight loops. arch/powerpc/include/asm/drmem.h | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 17ccc6474ab6..6fb928605ed1 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -8,6 +8,8 @@ #ifndef _ASM_POWERPC_LMB_H #define _ASM_POWERPC_LMB_H +#include + struct drmem_lmb { u64 base_addr; u32 drc_index; @@ -26,8 +28,22 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb, + const struct drmem_lmb *start) +{ + /* +* DLPAR code paths can take several milliseconds per element +* when interacting with firmware. Ensure that we don't +* unfairly monopolize the CPU. +*/ + if (((++lmb - start) % 16) == 0) + cond_resched(); + + return lmb; +} + #define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) < (end); (lmb)++) + for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb, start)) #define for_each_drmem_lmb(lmb)\ for_each_drmem_lmb_in_range((lmb), \ -- 2.25.4
Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Hi Aneesh, "Aneesh Kumar K.V" writes: > "Aneesh Kumar K.V" writes: >> On 8/8/20 2:15 AM, Nathan Lynch wrote: >>> "Aneesh Kumar K.V" writes: >>>> On 8/7/20 9:54 AM, Nathan Lynch wrote: >>>>> "Aneesh Kumar K.V" writes: >>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >>>>>> index e437a9ac4956..6c659aada55b 100644 >>>>>> --- a/arch/powerpc/mm/numa.c >>>>>> +++ b/arch/powerpc/mm/numa.c >>>>>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int >>>>>> nid, >>>>>> } >>>>>>} >>>>>> >>>>>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = >>>>>> NUMA_NO_NODE}; >>>>> >>>>> It's odd to me to use MAX_NUMNODES for this array when it's going to be >>>>> indexed not by Linux's logical node IDs but by the platform-provided >>>>> domain number, which has no relation to MAX_NUMNODES. >>>> >>>> >>>> I didn't want to dynamically allocate this. We could fetch >>>> "ibm,max-associativity-domains" to find the size for that. The current >>>> code do assume firmware group id to not exceed MAX_NUMNODES. Hence kept >>>> the array size to be MAX_NUMNODEs. I do agree that it is confusing. May >>>> be we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES? >>> >>> Well, consider: >>> >>> - ibm,max-associativity-domains can change at runtime with LPM. This >>>doesn't happen in practice yet, but we should probably start thinking >>>about how to support that. >>> - The domain numbering isn't clearly specified to have any particular >>>properties such as beginning at zero or a contiguous range. >>> >>> While the current code likely contains assumptions contrary to these >>> points, a change such as this is an opportunity to think about whether >>> those assumptions can be reduced or removed. In particular I think it >>> would be good to gracefully degrade when the number of NUMA affinity >>> domains can exceed MAX_NUMNODES. Using the platform-supplied domain >>> numbers to directly index Linux data structures will make that >>> impossible. >>> >>> So, maybe genradix or even xarray wouldn't actually be overengineering >>> here. >>> >> >> One of the challenges with such a data structure is that we initialize >> the nid_map before the slab is available. This means a memblock based >> allocation and we would end up implementing such a sparse data structure >> ourselves here. Yes, good point. >> As you mentioned above, since we know that hypervisor as of now limits >> the max affinity domain id below ibm,max-associativity-domains we are >> good with an array-like nid_map we have here. This keeps the code simpler. >> >> This will also allow us to switch to a more sparse data structure as you >> requested here in the future because the main change that is pushed in >> this series is the usage of firmare_group_id_to_nid(). The details of >> the data structure we use to keep track of that mapping are pretty much >> internal to that function. > > How about this? This makes it not a direct index. But it do limit the > search to max numa node on the system. > > static int domain_id_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = -1 }; > > static int __affinity_domain_to_nid(int domain_id, int max_nid) > { > int i; > > for (i = 0; i < max_nid; i++) { > if (domain_id_map[i] == domain_id) > return i; > } > return NUMA_NO_NODE; > } OK, this indexes the array by Linux's node id, good. 
I was wondering if I could persuade you to flip it around like this :-) Walking through the code below: > int affinity_domain_to_nid(struct affinity_domain *domain) > { > int nid, domain_id; > static int last_nid = 0; > static DEFINE_SPINLOCK(node_id_lock); > > domain_id = domain->id; > /* >* For PowerNV we don't change the node id. This helps to avoid >* confusion w.r.t the expected node ids. On pseries, node numbers >* are virtualized. Hence do logical node id for pseries. >*/ > if (!firmware_has_feature(FW_FEATURE_LPAR)) > return domain_id; > > if (domain_id == -1 || last_nid == MAX_NUMNODES) > return NUMA_NO_
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Michael Ellerman writes: > Michael Roth writes: >> Quoting Nathan Lynch (2020-08-07 02:05:09) > ... >>> wait_for_cpu_stopped() should be able to accommodate a time-based >>> warning if necessary, but speaking as a likely recipient of any bug >>> reports that would arise here, I'm not convinced of the need and I >>> don't know what a good value would be. It's relatively easy to sample >>> the stack of a task that's apparently failing to make progress, plus I >>> probably would use 'perf probe' or similar to report the inputs and >>> outputs for the RTAS call. >> >> I think if we make the timeout sufficiently high like 2 minutes or so >> it wouldn't hurt and if we did seem them it would probably point to an >> actual bug. But I don't have a strong feeling either way. > > I think we should print a warning after 2 minutes. > > It's true that there are fairly easy mechanisms to work out where the > thread is stuck, but customers are unlikely to use them. They're just > going to report that it's stuck with no further info, and probably > reboot the machine before we get a chance to get any further info. > > Whereas if the kernel prints a warning with a stack trace we at least > have that to go on in an initial bug report. > >>> I'm happy to make this a proper submission after I can clean it up and >>> retest it, or Michael R. is welcome to appropriate it, assuming it's >>> acceptable. >>> >> >> I've given it a shot with this patch and it seems to be holding up in >> testing. If we don't think the ~2 minutes warning message is needed I >> can clean it up to post: >> >> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e >> >> I'd likely break the refactoring patches out to a separate patch under >> Nathan's name since it fixes a separate bug potentially. > > While I like Nathan's refactoring, we probably want to do the minimal > fix first to ease backporting. > > Then do the refactoring on top of that. Fair enough, thanks.
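As a sketch of the kind of warning being discussed (the 2-minute interval and the stopped-state predicate here are assumptions for illustration):

static void wait_for_cpu_stopped_with_warning(unsigned int cpu)
{
	unsigned long warn_at = jiffies + 120 * HZ;

	/*
	 * cpu_thread_is_stopped() is a hypothetical predicate standing
	 * in for the query-cpu-stopped-state polling shown earlier.
	 */
	while (!cpu_thread_is_stopped(cpu)) {
		cond_resched();
		if (time_after(jiffies, warn_at)) {
			pr_warn("CPU %u has not stopped after 2 minutes\n", cpu);
			dump_stack();
			warn_at = jiffies + 120 * HZ;
		}
	}
}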
Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
Srikar Dronamraju writes: > * Michael Ellerman [2020-07-07 15:02:17]: > >> Srikar Dronamraju writes: >> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains >> > /proc/device-tree/rtas/ibm,current-associativity-domains >> > 0005 0001 0002 0002 0002 0010 >> > /proc/device-tree/rtas/ibm,max-associativity-domains >> > 0005 0001 0008 0020 0020 0100 >> > >> > $ cat /sys/devices/system/node/possible ##Before patch >> > 0-31 >> > >> > $ cat /sys/devices/system/node/possible ##After patch >> > 0-1 >> > >> > Note the maximum nodes this platform can support is only 2 but the >> > possible nodes is set to 32. >> >> But what about LPM to a system with more nodes? >> > > I have very less info on LPM, so I checked with Nathan Lynch before posting > and as per Nathan in the current design of LPM, Linux wouldn't use the new > node numbers. (I see a v2 has been posted already but I needed a little time to check with our hypervisor people on this point.) It's less of a design and more of a least-bad option in the absence of a more flexible NUMA architecture in Linux. For now, the "rule" with LPM and NUMA has to be that Linux uses the NUMA information from the device tree that it was booted with, and it must disregard ibm,associativity and similar information after LPM or certain other platform events. Historically there has been code that tried to honor changes in NUMA information but it caused much worse problems than degraded performance. That code has been disabled by default since last year and is now subject to removal: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897 Most NUMA-aware code happens to follow that rule because the device tree associativity information tends to get cached on first access in Linux's own data structures. It all feels a little fragile to me though, especially since we can DLPAR add processors and memory after LPM with "new" associativity properties which don't relate to the logical topology Linux has already built. However, on currently available hw, as long as we're using ibm,max-associativity-domains to limit the set of possible nodes, I believe such resources will always receive valid (but possibly suboptimal) NUMA assignments. That's because as of this writing ibm,max-associativity-domains has the same contents on all currently available PowerVM systems. Now if we change to using ibm,current-associativity-domains, which we *can* expect to differ between differently configured systems, post-LPM DLPAR additions can yield resources with node assignments that fall outside of the possible range, especially when we've migrated from a smaller system to a larger one. Is the current code robust against that possibility? I don't think so: it looks like of_node_to_nid_single(), of_drconf_to_nid_single() and possibly more code need to guard against this in order to prevent NODE_DATA() null dereferences and the like. Probably those functions should be made to clamp the nid assignment at num_possible_nodes() instead of MAX_NUMNODES, which strikes me as more correct regardless of your patch.
Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
Srikar Dronamraju writes: > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 9fcf2d195830..3d55cef1a2dc 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void) > return; > > if (of_property_read_u32_index(rtas, > - "ibm,max-associativity-domains", > + "ibm,current-associativity-domains", > min_common_depth, )) Looks good if ibm,current-associativity-domains[min_common_depth] actually denotes the range of possible values, i.e. a value of 2 implies node numbers 0 and 1. PAPR+ says it's the "number of unique values", which isn't how I would specify the property if it's supposed to express a range. But it's probably OK...
Re: [PATCH] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
Hi Tyrel, Tyrel Datwyler writes: > On 6/19/20 8:34 AM, Scott Cheloha wrote: >> The H_GetPerformanceCounterInfo PHYP hypercall has a subcall, >> Affinity_Domain_Info_By_Partition, which returns, among other things, >> a "partition affinity score" for a given LPAR. This score, a value on >> [0-100], represents the processor-memory affinity for the LPAR in >> question. A score of 0 indicates the worst possible affinity while a >> score of 100 indicates perfect affinity. The score can be used to >> reason about performance. >> >> This patch adds the score for the local LPAR to the lparcfg procfile >> under a new 'partition_affinity_score' key. > > I expect that you will probably get a NACK from Michael on this. The overall > desire is to move away from these dated /proc interfaces. While its true that > I > did add a new value recently it was strictly to facilitate and correct the > calculation of a derived value that was already dependent on a couple other > existing values in lparcfg. > > With that said I would expect that you would likely be advised to expose this > as > a sysfs attribute. The question is where? We probably should put some thought > in > to this as I would like to port each lparcfg value over to sysfs so that we > can > move to deprecating lparcfg. Putting everything under something like > /sys/kernel/lparcfg/* maybe. Michael may have a better suggestion. I think this score fits pretty naturally in lparcfg: it's a simple metric that is specific to the pseries/papr platform, like everything else in there. A few dozen key=value pairs contained in a single file is simple and efficient, unlike sysfs with its rather inconsistently applied one-value-per-file convention. Surely it's OK if lparcfg gains a line every few years? >> The H_GetPerformanceCounterInfo hypercall is already used elsewhere in >> the kernel, in powerpc/perf/hv-gpci.c. Refactoring that code and this >> code into a more general API might be worthwhile if additional modules >> require the hypercall in the future. > > If you are duplicating code its likely you should already be doing this. See > the > rest of my comments about below. > >> >> Signed-off-by: Scott Cheloha >> --- >> arch/powerpc/platforms/pseries/lparcfg.c | 53 >> 1 file changed, 53 insertions(+) >> >> diff --git a/arch/powerpc/platforms/pseries/lparcfg.c >> b/arch/powerpc/platforms/pseries/lparcfg.c >> index b8d28ab88178..b75151eee0f0 100644 >> --- a/arch/powerpc/platforms/pseries/lparcfg.c >> +++ b/arch/powerpc/platforms/pseries/lparcfg.c >> @@ -136,6 +136,57 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data >> *ppp_data) >> return rc; >> } >> >> +/* >> + * Based on H_GetPerformanceCounterInfo v1.10. >> + */ >> +static void show_gpci_data(struct seq_file *m) >> +{ >> +struct perf_counter_info_params { >> +__be32 counter_request; >> +__be32 starting_index; >> +__be16 secondary_index; >> +__be16 returned_values; >> +__be32 detail_rc; >> +__be16 counter_value_element_size; >> +u8 counter_info_version_in; >> +u8 counter_info_version_out; >> +u8 reserved[0xC]; >> +} __packed; > > This looks to duplicate the hv_get_perf_counter_info_params struct from > arch/powerpc/perf/hv-gpci.h. Maybe this include file needs to move to > arch/powerpc/asm/inlcude so you don't have to redefine this struct. 
> >> +struct hv_gpci_request_buffer { >> +struct perf_counter_info_params params; >> +u8 output[4096 - sizeof(struct perf_counter_info_params)]; >> +} __packed; > > This struct is code duplication of the one defined in > arch/powerpc/perf/hv-gpci.c and could be moved into hv-gpci.h along with > HGPCI_MAX_DATA_BYTES so that you can use those versions here. I tend to agree with these comments. >> +struct hv_gpci_request_buffer *buf; >> +long ret; >> +unsigned int affinity_score; >> + >> +buf = kmalloc(sizeof(*buf), GFP_KERNEL); >> +if (buf == NULL) >> +return; >> + >> +/* >> + * Show the local LPAR's affinity score. >> + * >> + * 0xB1 selects the Affinity_Domain_Info_By_Partition subcall. >> + * The score is at byte 0xB in the output buffer. >> + */ >> +memset(>params, 0, sizeof(buf->params)); >> +buf->params.counter_request = cpu_to_be32(0xB1); >> +buf->params.starting_index = cpu_to_be32(-1); /* local LPAR */ >> +buf->params.counter_info_version_in = 0x5; /* v5+ for score */ >> +ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf), >> + sizeof(*buf)); >> +if (ret != H_SUCCESS) { >> +pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n", >> + ret, be32_to_cpu(buf->params.detail_rc)); >> +goto out; >> +} >> +affinity_score = buf->output[0xB]; >> +seq_printf(m,
[PATCH 03/18] powerpc/numa: remove ability to enable topology updates
Remove the /proc/powerpc/topology_updates interface and the topology_updates=on/off command line argument. The internal topology_updates_enabled flag remains for now, but always false. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 71 +- 1 file changed, 1 insertion(+), 70 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 9fcf2d195830..34d95de77bdd 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -984,27 +984,7 @@ static int __init early_numa(char *p) } early_param("numa", early_numa); -/* - * The platform can inform us through one of several mechanisms - * (post-migration device tree updates, PRRN or VPHN) that the NUMA - * assignment of a resource has changed. This controls whether we act - * on that. Disabled by default. - */ -static bool topology_updates_enabled; - -static int __init early_topology_updates(char *p) -{ - if (!p) - return 0; - - if (!strcmp(p, "on")) { - pr_warn("Caution: enabling topology updates\n"); - topology_updates_enabled = true; - } - - return 0; -} -early_param("topology_updates", early_topology_updates); +static const bool topology_updates_enabled; #ifdef CONFIG_MEMORY_HOTPLUG /* @@ -1632,52 +1612,6 @@ int prrn_is_enabled(void) return prrn_enabled; } -static int topology_read(struct seq_file *file, void *v) -{ - if (vphn_enabled || prrn_enabled) - seq_puts(file, "on\n"); - else - seq_puts(file, "off\n"); - - return 0; -} - -static int topology_open(struct inode *inode, struct file *file) -{ - return single_open(file, topology_read, NULL); -} - -static ssize_t topology_write(struct file *file, const char __user *buf, - size_t count, loff_t *off) -{ - char kbuf[4]; /* "on" or "off" plus null. */ - int read_len; - - read_len = count < 3 ? count : 3; - if (copy_from_user(kbuf, buf, read_len)) - return -EINVAL; - - kbuf[read_len] = '\0'; - - if (!strncmp(kbuf, "on", 2)) { - topology_updates_enabled = true; - start_topology_update(); - } else if (!strncmp(kbuf, "off", 3)) { - stop_topology_update(); - topology_updates_enabled = false; - } else - return -EINVAL; - - return count; -} - -static const struct proc_ops topology_proc_ops = { - .proc_read = seq_read, - .proc_write = topology_write, - .proc_open = topology_open, - .proc_release = single_release, -}; - static int topology_update_init(void) { start_topology_update(); @@ -1685,9 +1619,6 @@ static int topology_update_init(void) if (vphn_enabled) topology_schedule_update(); - if (!proc_create("powerpc/topology_updates", 0644, NULL, _proc_ops)) - return -ENOMEM; - topology_inited = 1; return 0; } -- 2.25.4
[PATCH 12/18] powerpc/rtasd: simplify handle_rtas_event(), emit message on events
prrn_is_enabled() always returns false/0, so handle_rtas_event() can be simplified and some dead code can be removed. Use machine_is() instead of #ifdef to run this code only on pseries, and add an informational ratelimited message that we are ignoring the events. PRRN events are relatively rare in normal operation and usually arise from operator-initiated actions such as a DPO (Dynamic Platform Optimizer) run. Eventually we do want to consume these events and update the device tree, but that needs more care to be safe vs LPM and DLPAR. Signed-off-by: Nathan Lynch --- arch/powerpc/kernel/rtasd.c | 28 +++- 1 file changed, 3 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c index 89b798f8f656..8561dfb33f24 100644 --- a/arch/powerpc/kernel/rtasd.c +++ b/arch/powerpc/kernel/rtasd.c @@ -273,37 +273,15 @@ void pSeries_log_error(char *buf, unsigned int err_type, int fatal) } } -#ifdef CONFIG_PPC_PSERIES -static void handle_prrn_event(s32 scope) -{ - /* -* For PRRN, we must pass the negative of the scope value in -* the RTAS event. -*/ - pseries_devicetree_update(-scope); - numa_update_cpu_topology(false); -} - static void handle_rtas_event(const struct rtas_error_log *log) { - if (rtas_error_type(log) != RTAS_TYPE_PRRN || !prrn_is_enabled()) + if (!machine_is(pseries)) return; - /* For PRRN Events the extended log length is used to denote -* the scope for calling rtas update-nodes. -*/ - handle_prrn_event(rtas_error_extended_log_length(log)); + if (rtas_error_type(log) == RTAS_TYPE_PRRN) + pr_info_ratelimited("Platform resource reassignment ignored.\n"); } -#else - -static void handle_rtas_event(const struct rtas_error_log *log) -{ - return; -} - -#endif - static int rtas_log_open(struct inode * inode, struct file * file) { return 0; -- 2.25.4
[PATCH 05/18] powerpc/numa: make vphn_enabled, prrn_enabled flags const
Previous changes have made it so these flags are never changed; enforce this by making them const. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 9e20f12e6caf..1b89bacb8975 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1132,8 +1132,8 @@ struct topology_update_data { #define TOPOLOGY_DEF_TIMER_SECS60 static cpumask_t cpu_associativity_changes_mask; -static int vphn_enabled; -static int prrn_enabled; +static const int vphn_enabled; +static const int prrn_enabled; static void reset_topology_timer(void); static int topology_timer_secs = 1; static int topology_inited; -- 2.25.4
[PATCH 06/18] powerpc/numa: remove unreachable topology timer code
Since vphn_enabled is always 0, we can stub out timed_topology_update() and remove the code which becomes unreachable. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 21 - 1 file changed, 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 1b89bacb8975..6207297490a8 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1129,13 +1129,9 @@ struct topology_update_data { int new_nid; }; -#define TOPOLOGY_DEF_TIMER_SECS60 - static cpumask_t cpu_associativity_changes_mask; static const int vphn_enabled; static const int prrn_enabled; -static void reset_topology_timer(void); -static int topology_timer_secs = 1; static int topology_inited; /* @@ -1143,15 +1139,6 @@ static int topology_inited; */ int timed_topology_update(int nsecs) { - if (vphn_enabled) { - if (nsecs > 0) - topology_timer_secs = nsecs; - else - topology_timer_secs = TOPOLOGY_DEF_TIMER_SECS; - - reset_topology_timer(); - } - return 0; } @@ -1438,14 +1425,6 @@ static void topology_schedule_update(void) schedule_work(_work); } -static struct timer_list topology_timer; - -static void reset_topology_timer(void) -{ - if (vphn_enabled) - mod_timer(_timer, jiffies + topology_timer_secs * HZ); -} - /* * Start polling for associativity changes. */ -- 2.25.4
[PATCH 16/18] powerpc/pseries: remove memory "re-add" implementation
dlpar_memory() no longer has any callers which pass PSERIES_HP_ELOG_ACTION_READD. Remove this case and the corresponding unreachable code. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - .../platforms/pseries/hotplug-memory.c| 42 --- 2 files changed, 43 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 0107d724e9da..55f9a154c95d 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -215,7 +215,6 @@ inline uint16_t pseries_errorlog_length(struct pseries_errorlog *sect) #define PSERIES_HP_ELOG_ACTION_ADD 1 #define PSERIES_HP_ELOG_ACTION_REMOVE 2 -#define PSERIES_HP_ELOG_ACTION_READD 3 #define PSERIES_HP_ELOG_ID_DRC_NAME1 #define PSERIES_HP_ELOG_ID_DRC_INDEX 2 diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 5ace2f9a277e..67ece3ac9ac2 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -487,40 +487,6 @@ static int dlpar_memory_remove_by_index(u32 drc_index) return rc; } -static int dlpar_memory_readd_by_index(u32 drc_index) -{ - struct drmem_lmb *lmb; - int lmb_found; - int rc; - - pr_info("Attempting to update LMB, drc index %x\n", drc_index); - - lmb_found = 0; - for_each_drmem_lmb(lmb) { - if (lmb->drc_index == drc_index) { - lmb_found = 1; - rc = dlpar_remove_lmb(lmb); - if (!rc) { - rc = dlpar_add_lmb(lmb); - if (rc) - dlpar_release_drc(lmb->drc_index); - } - break; - } - } - - if (!lmb_found) - rc = -EINVAL; - - if (rc) - pr_info("Failed to update memory at %llx\n", - lmb->base_addr); - else - pr_info("Memory at %llx was updated\n", lmb->base_addr); - - return rc; -} - static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) { struct drmem_lmb *lmb, *start_lmb, *end_lmb; @@ -617,10 +583,6 @@ static int dlpar_memory_remove_by_index(u32 drc_index) { return -EOPNOTSUPP; } -static int dlpar_memory_readd_by_index(u32 drc_index) -{ - return -EOPNOTSUPP; -} static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) { @@ -902,10 +864,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) break; } - break; - case PSERIES_HP_ELOG_ACTION_READD: - drc_index = hp_elog->_drc_u.drc_index; - rc = dlpar_memory_readd_by_index(drc_index); break; default: pr_err("Invalid action (%d) specified\n", hp_elog->action); -- 2.25.4
[PATCH 01/18] powerpc/pseries: remove cede offline state for CPUs
This effectively reverts commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), which added an offline mode for CPUs which uses the H_CEDE hcall instead of the architected stop-self RTAS function in order to facilitate "folding" of dedicated mode processors on PowerVM platforms to achieve energy savings. This has been the default offline mode since its introduction. There's nothing about stop-self that would prevent the hypervisor from achieving the energy savings available via H_CEDE, so the original premise of this change appears to be flawed. I also have encountered the claim that the transition to and from ceded state is much faster than stop-self/start-cpu. Certainly we would not want to use stop-self as an *idle* mode. That is what H_CEDE is for. However, this difference is insignificant in the context of Linux CPU hotplug, where the latency of an offline or online operation on current systems is on the order of 100ms, mainly attributable to all the various subsystems' cpuhp callbacks. The cede offline mode also prevents accurate accounting, as discussed before: https://lore.kernel.org/linuxppc-dev/1571740391-3251-1-git-send-email-...@linux.vnet.ibm.com/ Unconditionally use stop-self to offline processor threads. This is the architected method for offlining CPUs on PAPR systems. The "cede_offline" boot parameter is rendered obsolete. Removing this code enables the removal of the partition suspend code which temporarily onlines all present CPUs. Fixes: 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy --- Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 170 ++ .../platforms/pseries/offline_states.h| 38 arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +-- 5 files changed, 15 insertions(+), 229 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7817f7..b1ae1ac159cf 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -50,13 +50,6 @@ Command Line Switches This option is limited to the X86 and S390 architecture. -``cede_offline={"off","on"}`` - Use this option to disable/enable putting offlined processors to an extended - ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". - - This option is limited to the PowerPC architecture. - ``cpu0_hotplug`` Allow to shutdown CPU0. diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 3e8cbfe7a80f..d4b346355bb9 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -35,54 +35,10 @@ #include #include "pseries.h" -#include "offline_states.h" /* This version can't take the spinlock, because it never returns */ static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE; -static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = - CPU_STATE_OFFLINE; -static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE; - -static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; - -static bool cede_offline_enabled __read_mostly = true; - -/* - * Enable/disable cede_offline when available. 
- */ -static int __init setup_cede_offline(char *str) -{ - return (kstrtobool(str, _offline_enabled) == 0); -} - -__setup("cede_offline=", setup_cede_offline); - -enum cpu_state_vals get_cpu_current_state(int cpu) -{ - return per_cpu(current_state, cpu); -} - -void set_cpu_current_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(current_state, cpu) = state; -} - -enum cpu_state_vals get_preferred_offline_state(int cpu) -{ - return per_cpu(preferred_offline_state, cpu); -} - -void set_preferred_offline_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(preferred_offline_state, cpu) = state; -} - -void set_default_offline_state(int cpu) -{ - per_cpu(preferred_offline_state, cpu) = default_offline_state; -} - static void rtas_stop_self(void) { static struct rtas_args args; @@ -101,9 +57,7 @@ static void rtas_stop_self(void) static void pseries_mach_cpu_die(void) { - unsigned int cpu = smp_processor_id(); unsigned int hwcpu = hard_smp_processor_id(); - u8 cede_latency_hint = 0; local_irq_disable(); idle_task_exit(); @@ -112,49 +66,6 @@ static void pseries_mach_cpu_die(void) else xics_teardown_cpu(); - if
[PATCH 00/18] remove extended cede offline mode and bogus topology update code
Two major parts to this series: 1. Removal of the extended cede offline mode for CPUs as well as the partition suspend code which accommodates it by temporarily onlining all CPUs prior to suspending the LPAR. This solves some accounting problems, simplifies the pseries CPU hotplug code, and greatly uncomplicates the existing partition suspend code, easing a much-needed transition to the Linux suspend framework. The two patches which make up this part have been posted before: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=180718 and they are simply incorporated unchanged into the larger series here, with Gautham's Reviewed-by added to patch #1. 2. Removal of the long-disabled "topology update" code, most of which resides in mm/numa.c, but there are pieces in pseries and rtasd to excise as well. This code was an attempt to honor changes in a partition's NUMA properties arising from resource reassignments which occur as part of a migration, VPHN change, or a Dynamic Platform Optimizer operation. Its main technique is to remove and re-add affected processors and LMBs and hope in vain that the changes in cpu-node and physaddr-node relationships aren't disruptive. We want to provide user space with some indication that Linux's logical NUMA representation has become out of sync with the platform's assignments, but we need to get this unusable stuff out of the way before this code can sustain new features. Nathan Lynch (18): powerpc/pseries: remove cede offline state for CPUs powerpc/rtas: don't online CPUs for partition suspend powerpc/numa: remove ability to enable topology updates powerpc/numa: remove unreachable topology update code powerpc/numa: make vphn_enabled, prrn_enabled flags const powerpc/numa: remove unreachable topology timer code powerpc/numa: remove unreachable topology workqueue code powerpc/numa: remove vphn_enabled and prrn_enabled internal flags powerpc/numa: stub out numa_update_cpu_topology() powerpc/numa: remove timed_topology_update() powerpc/numa: remove start/stop_topology_update() powerpc/rtasd: simplify handle_rtas_event(), emit message on events powerpc/numa: remove prrn_is_enabled() powerpc/numa: remove arch_update_cpu_topology powerpc/pseries: remove prrn special case from DT update path powerpc/pseries: remove memory "re-add" implementation powerpc/pseries: remove dlpar_cpu_readd() powerpc/pseries: remove obsolete memory hotplug DT notifier code Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/include/asm/rtas.h | 3 - arch/powerpc/include/asm/topology.h | 27 - arch/powerpc/kernel/rtas.c| 122 + arch/powerpc/kernel/rtasd.c | 28 +- arch/powerpc/mm/numa.c| 486 -- arch/powerpc/platforms/pseries/hotplug-cpu.c | 189 +-- .../platforms/pseries/hotplug-memory.c| 107 +--- arch/powerpc/platforms/pseries/mobility.c | 31 -- .../platforms/pseries/offline_states.h| 38 -- arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +- arch/powerpc/platforms/pseries/suspend.c | 27 +- 13 files changed, 22 insertions(+), 1072 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h -- 2.25.4
[PATCH 11/18] powerpc/numa: remove start/stop_topology_update()
These APIs have become no-ops, so remove them and all call sites. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 10 -- arch/powerpc/mm/numa.c| 20 arch/powerpc/platforms/pseries/mobility.c | 4 arch/powerpc/platforms/pseries/suspend.c | 5 + 4 files changed, 1 insertion(+), 38 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 379e2cc3789f..537c638582eb 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -93,19 +93,9 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) -extern int start_topology_update(void); -extern int stop_topology_update(void); extern int prrn_is_enabled(void); extern int find_and_online_cpu_nid(int cpu); #else -static inline int start_topology_update(void) -{ - return 0; -} -static inline int stop_topology_update(void) -{ - return 0; -} static inline int prrn_is_enabled(void) { return 0; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 6c579ac3e679..dec7ce3b5e67 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1157,8 +1157,6 @@ static long vphn_get_associativity(unsigned long cpu, , rc); break; } - - stop_topology_update(); out: return rc; } @@ -1212,22 +1210,6 @@ int arch_update_cpu_topology(void) return numa_update_cpu_topology(true); } -/* - * Start polling for associativity changes. - */ -int start_topology_update(void) -{ - return 0; -} - -/* - * Disable polling for VPHN associativity changes. - */ -int stop_topology_update(void) -{ - return 0; -} - int prrn_is_enabled(void) { return 0; @@ -1235,8 +1217,6 @@ int prrn_is_enabled(void) static int topology_update_init(void) { - start_topology_update(); - topology_inited = 1; return 0; } diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 10d982997736..c0b09f6f0ae3 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -388,8 +388,6 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; - stop_topology_update(); - do { rc = rtas_ibm_suspend_me(streamid); if (rc == -EAGAIN) @@ -401,8 +399,6 @@ static ssize_t migration_store(struct class *class, post_mobility_fixup(); - start_topology_update(); - return count; } diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index f789693f61f4..81e0ac58d620 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -145,11 +145,8 @@ static ssize_t store_hibernate(struct device *dev, ssleep(1); } while (rc == -EAGAIN); - if (!rc) { - stop_topology_update(); + if (!rc) rc = pm_suspend(PM_SUSPEND_MEM); - start_topology_update(); - } stream_id = 0; -- 2.25.4
[PATCH 07/18] powerpc/numa: remove unreachable topology workqueue code
Since vphn_enabled is always 0, we can remove the call to topology_schedule_update() and remove the code which becomes unreachable as a result. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 14 -- 1 file changed, 14 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 6207297490a8..8415481a7f13 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1414,17 +1414,6 @@ int arch_update_cpu_topology(void) return numa_update_cpu_topology(true); } -static void topology_work_fn(struct work_struct *work) -{ - rebuild_sched_domains(); -} -static DECLARE_WORK(topology_work, topology_work_fn); - -static void topology_schedule_update(void) -{ - schedule_work(_work); -} - /* * Start polling for associativity changes. */ @@ -1450,9 +1439,6 @@ static int topology_update_init(void) { start_topology_update(); - if (vphn_enabled) - topology_schedule_update(); - topology_inited = 1; return 0; } -- 2.25.4
[PATCH 18/18] powerpc/pseries: remove obsolete memory hotplug DT notifier code
pseries_update_drconf_memory() runs from a DT notifier in response to an update to the ibm,dynamic-memory property of the /ibm,dynamic-reconfiguration-memory node. This property is an older less compact format than the ibm,dynamic-memory-v2 property used in most currently supported firmwares. There has never been an equivalent function for the v2 property. pseries_update_drconf_memory() compares the 'assigned' flag for each LMB in the old vs new properties and adds or removes the block accordingly. However it appears to be of no actual utility: * Partition suspension and PRRNs are specified only to change LMBs' NUMA affinity information. This notifier should be a no-op for those scenarios since the assigned flags should not change. * The memory hotplug/DLPAR path has a hack which short-circuits execution of the notifier: dlpar_memory() ... rtas_hp_event = true; drmem_update_dt() of_update_property() pseries_memory_notifier() pseries_update_drconf_memory() if (rtas_hp_event) return; So this code only makes sense as a relic of the time when more of the DLPAR workflow took place in user space. I don't see a purpose for it now. Signed-off-by: Nathan Lynch --- .../platforms/pseries/hotplug-memory.c| 65 +-- 1 file changed, 1 insertion(+), 64 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 67ece3ac9ac2..73a5dcd977e1 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -22,8 +22,6 @@ #include #include "pseries.h" -static bool rtas_hp_event; - unsigned long pseries_memory_block_size(void) { struct device_node *np; @@ -871,11 +869,8 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) break; } - if (!rc) { - rtas_hp_event = true; + if (!rc) rc = drmem_update_dt(); - rtas_hp_event = false; - } unlock_device_hotplug(); return rc; @@ -911,60 +906,6 @@ static int pseries_add_mem_node(struct device_node *np) return (ret < 0) ? -EINVAL : 0; } -static int pseries_update_drconf_memory(struct of_reconfig_data *pr) -{ - struct of_drconf_cell_v1 *new_drmem, *old_drmem; - unsigned long memblock_size; - u32 entries; - __be32 *p; - int i, rc = -EINVAL; - - if (rtas_hp_event) - return 0; - - memblock_size = pseries_memory_block_size(); - if (!memblock_size) - return -EINVAL; - - if (!pr->old_prop) - return 0; - - p = (__be32 *) pr->old_prop->value; - if (!p) - return -EINVAL; - - /* The first int of the property is the number of lmb's described -* by the property. This is followed by an array of of_drconf_cell -* entries. Get the number of entries and skip to the array of -* of_drconf_cell's. -*/ - entries = be32_to_cpu(*p++); - old_drmem = (struct of_drconf_cell_v1 *)p; - - p = (__be32 *)pr->prop->value; - p++; - new_drmem = (struct of_drconf_cell_v1 *)p; - - for (i = 0; i < entries; i++) { - if ((be32_to_cpu(old_drmem[i].flags) & DRCONF_MEM_ASSIGNED) && - (!(be32_to_cpu(new_drmem[i].flags) & DRCONF_MEM_ASSIGNED))) { - rc = pseries_remove_memblock( - be64_to_cpu(old_drmem[i].base_addr), -memblock_size); - break; - } else if ((!(be32_to_cpu(old_drmem[i].flags) & - DRCONF_MEM_ASSIGNED)) && - (be32_to_cpu(new_drmem[i].flags) & - DRCONF_MEM_ASSIGNED)) { - rc = memblock_add(be64_to_cpu(old_drmem[i].base_addr), - memblock_size); - rc = (rc < 0) ? 
-EINVAL : 0; - break; - } - } - return rc; -} - static int pseries_memory_notifier(struct notifier_block *nb, unsigned long action, void *data) { @@ -978,10 +919,6 @@ static int pseries_memory_notifier(struct notifier_block *nb, case OF_RECONFIG_DETACH_NODE: err = pseries_remove_mem_node(rd->dn); break; - case OF_RECONFIG_UPDATE_PROPERTY: - if (!strcmp(rd->prop->name, "ibm,dynamic-memory")) - err = pseries_update_drconf_memory(rd); - break; } return notifier_from_errno(err); } -- 2.25.4
[PATCH 04/18] powerpc/numa: remove unreachable topology update code
Since the topology_updates_enabled flag is now always false, remove it and the code which has become unreachable. This is the minimum change that prevents 'defined but unused' warnings emitted by the compiler after stubbing out the start/stop_topology_updates() functions. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 149 + 1 file changed, 2 insertions(+), 147 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 34d95de77bdd..9e20f12e6caf 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -984,8 +984,6 @@ static int __init early_numa(char *p) } early_param("numa", early_numa); -static const bool topology_updates_enabled; - #ifdef CONFIG_MEMORY_HOTPLUG /* * Find the node associated with a hot added memory section for @@ -1133,7 +1131,6 @@ struct topology_update_data { #define TOPOLOGY_DEF_TIMER_SECS60 -static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS]; static cpumask_t cpu_associativity_changes_mask; static int vphn_enabled; static int prrn_enabled; @@ -1158,63 +1155,6 @@ int timed_topology_update(int nsecs) return 0; } -/* - * Store the current values of the associativity change counters in the - * hypervisor. - */ -static void setup_cpu_associativity_change_counters(void) -{ - int cpu; - - /* The VPHN feature supports a maximum of 8 reference points */ - BUILD_BUG_ON(MAX_DISTANCE_REF_POINTS > 8); - - for_each_possible_cpu(cpu) { - int i; - u8 *counts = vphn_cpu_change_counts[cpu]; - volatile u8 *hypervisor_counts = lppaca_of(cpu).vphn_assoc_counts; - - for (i = 0; i < distance_ref_points_depth; i++) - counts[i] = hypervisor_counts[i]; - } -} - -/* - * The hypervisor maintains a set of 8 associativity change counters in - * the VPA of each cpu that correspond to the associativity levels in the - * ibm,associativity-reference-points property. When an associativity - * level changes, the corresponding counter is incremented. - * - * Set a bit in cpu_associativity_changes_mask for each cpu whose home - * node associativity levels have changed. - * - * Returns the number of cpus with unhandled associativity changes. - */ -static int update_cpu_associativity_changes_mask(void) -{ - int cpu; - cpumask_t *changes = _associativity_changes_mask; - - for_each_possible_cpu(cpu) { - int i, changed = 0; - u8 *counts = vphn_cpu_change_counts[cpu]; - volatile u8 *hypervisor_counts = lppaca_of(cpu).vphn_assoc_counts; - - for (i = 0; i < distance_ref_points_depth; i++) { - if (hypervisor_counts[i] != counts[i]) { - counts[i] = hypervisor_counts[i]; - changed = 1; - } - } - if (changed) { - cpumask_or(changes, changes, cpu_sibling_mask(cpu)); - cpu = cpu_last_thread_sibling(cpu); - } - } - - return cpumask_weight(changes); -} - /* * Retrieve the new associativity information for a virtual processor's * home node. 
@@ -1498,16 +1438,6 @@ static void topology_schedule_update(void) schedule_work(_work); } -static void topology_timer_fn(struct timer_list *unused) -{ - if (prrn_enabled && cpumask_weight(_associativity_changes_mask)) - topology_schedule_update(); - else if (vphn_enabled) { - if (update_cpu_associativity_changes_mask() > 0) - topology_schedule_update(); - reset_topology_timer(); - } -} static struct timer_list topology_timer; static void reset_topology_timer(void) @@ -1516,69 +1446,12 @@ static void reset_topology_timer(void) mod_timer(_timer, jiffies + topology_timer_secs * HZ); } -#ifdef CONFIG_SMP - -static int dt_update_callback(struct notifier_block *nb, - unsigned long action, void *data) -{ - struct of_reconfig_data *update = data; - int rc = NOTIFY_DONE; - - switch (action) { - case OF_RECONFIG_UPDATE_PROPERTY: - if (of_node_is_type(update->dn, "cpu") && - !of_prop_cmp(update->prop->name, "ibm,associativity")) { - u32 core_id; - of_property_read_u32(update->dn, "reg", _id); - rc = dlpar_cpu_readd(core_id); - rc = NOTIFY_OK; - } - break; - } - - return rc; -} - -static struct notifier_block dt_update_nb = { - .notifier_call = dt_update_callback, -}; - -#endif - /* * Start polling for associativity changes. */ int start_topology_update(void) { -
[PATCH 09/18] powerpc/numa: stub out numa_update_cpu_topology()
Previous changes have removed the code which sets bits in cpu_associativity_changes_mask and thus it is never modifed at runtime. From this we can reason that numa_update_cpu_topology() always returns 0 without doing anything. Remove the body of numa_update_cpu_topology() and remove all code which becomes unreachable as a result. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 193 + 1 file changed, 1 insertion(+), 192 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8749d7f2b1a6..b220e5b60140 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1122,14 +1122,6 @@ u64 memory_hotplug_max(void) /* Virtual Processor Home Node (VPHN) support */ #ifdef CONFIG_PPC_SPLPAR -struct topology_update_data { - struct topology_update_data *next; - unsigned int cpu; - int old_nid; - int new_nid; -}; - -static cpumask_t cpu_associativity_changes_mask; static int topology_inited; /* @@ -1219,192 +1211,9 @@ int find_and_online_cpu_nid(int cpu) return new_nid; } -/* - * Update the CPU maps and sysfs entries for a single CPU when its NUMA - * characteristics change. This function doesn't perform any locking and is - * only safe to call from stop_machine(). - */ -static int update_cpu_topology(void *data) -{ - struct topology_update_data *update; - unsigned long cpu; - - if (!data) - return -EINVAL; - - cpu = smp_processor_id(); - - for (update = data; update; update = update->next) { - int new_nid = update->new_nid; - if (cpu != update->cpu) - continue; - - unmap_cpu_from_node(cpu); - map_cpu_to_node(cpu, new_nid); - set_cpu_numa_node(cpu, new_nid); - set_cpu_numa_mem(cpu, local_memory_node(new_nid)); - vdso_getcpu_init(); - } - - return 0; -} - -static int update_lookup_table(void *data) -{ - struct topology_update_data *update; - - if (!data) - return -EINVAL; - - /* -* Upon topology update, the numa-cpu lookup table needs to be updated -* for all threads in the core, including offline CPUs, to ensure that -* future hotplug operations respect the cpu-to-node associativity -* properly. -*/ - for (update = data; update; update = update->next) { - int nid, base, j; - - nid = update->new_nid; - base = cpu_first_thread_sibling(update->cpu); - - for (j = 0; j < threads_per_core; j++) { - update_numa_cpu_lookup_table(base + j, nid); - } - } - - return 0; -} - -/* - * Update the node maps and sysfs entries for each cpu whose home node - * has changed. Returns 1 when the topology has changed, and 0 otherwise. - * - * cpus_locked says whether we already hold cpu_hotplug_lock. - */ int numa_update_cpu_topology(bool cpus_locked) { - unsigned int cpu, sibling, changed = 0; - struct topology_update_data *updates, *ud; - cpumask_t updated_cpus; - struct device *dev; - int weight, new_nid, i = 0; - - if (topology_inited) - return 0; - - weight = cpumask_weight(_associativity_changes_mask); - if (!weight) - return 0; - - updates = kcalloc(weight, sizeof(*updates), GFP_KERNEL); - if (!updates) - return 0; - - cpumask_clear(_cpus); - - for_each_cpu(cpu, _associativity_changes_mask) { - /* -* If siblings aren't flagged for changes, updates list -* will be too short. Skip on this update and set for next -* update. 
-*/ - if (!cpumask_subset(cpu_sibling_mask(cpu), - _associativity_changes_mask)) { - pr_info("Sibling bits not set for associativity " - "change, cpu%d\n", cpu); - cpumask_or(_associativity_changes_mask, - _associativity_changes_mask, - cpu_sibling_mask(cpu)); - cpu = cpu_last_thread_sibling(cpu); - continue; - } - - new_nid = find_and_online_cpu_nid(cpu); - - if (new_nid == numa_cpu_lookup_table[cpu]) { - cpumask_andnot(_associativity_changes_mask, - _associativity_changes_mask, - cpu_sibling_mask(cpu)); - dbg("Assoc chg gives same node %d for cpu%d\n", - new_nid, cpu); - cpu = cpu_last_th
[PATCH 08/18] powerpc/numa: remove vphn_enabled and prrn_enabled internal flags
These flags are always zero now; remove them and suitably adjust the remaining references to them. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8415481a7f13..8749d7f2b1a6 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1130,8 +1130,6 @@ struct topology_update_data { }; static cpumask_t cpu_associativity_changes_mask; -static const int vphn_enabled; -static const int prrn_enabled; static int topology_inited; /* @@ -1292,7 +1290,7 @@ int numa_update_cpu_topology(bool cpus_locked) struct device *dev; int weight, new_nid, i = 0; - if (!prrn_enabled && !vphn_enabled && topology_inited) + if (topology_inited) return 0; weight = cpumask_weight(_associativity_changes_mask); @@ -1432,7 +1430,7 @@ int stop_topology_update(void) int prrn_is_enabled(void) { - return prrn_enabled; + return 0; } static int topology_update_init(void) -- 2.25.4
[PATCH 15/18] powerpc/pseries: remove prrn special case from DT update path
pseries_devicetree_update() is no longer called with PRRN_SCOPE. The purpose of prrn_update_node() was to remove and then add back a LMB whose NUMA assignment had changed. This has never been reliable, and this codepath has been default-disabled for several releases. Remove prrn_update_node(). Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 27 --- 1 file changed, 27 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index c0b09f6f0ae3..78cd772a579b 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -244,29 +244,6 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index) return rc; } -static void prrn_update_node(__be32 phandle) -{ - struct pseries_hp_errorlog hp_elog; - struct device_node *dn; - - /* -* If a node is found from a the given phandle, the phandle does not -* represent the drc index of an LMB and we can ignore. -*/ - dn = of_find_node_by_phandle(be32_to_cpu(phandle)); - if (dn) { - of_node_put(dn); - return; - } - - hp_elog.resource = PSERIES_HP_ELOG_RESOURCE_MEM; - hp_elog.action = PSERIES_HP_ELOG_ACTION_READD; - hp_elog.id_type = PSERIES_HP_ELOG_ID_DRC_INDEX; - hp_elog._drc_u.drc_index = phandle; - - handle_dlpar_errorlog(_elog); -} - int pseries_devicetree_update(s32 scope) { char *rtas_buf; @@ -305,10 +282,6 @@ int pseries_devicetree_update(s32 scope) break; case UPDATE_DT_NODE: update_dt_node(phandle, scope); - - if (scope == PRRN_SCOPE) - prrn_update_node(phandle); - break; case ADD_DT_NODE: drc_index = *data++; -- 2.25.4
[PATCH 14/18] powerpc/numa: remove arch_update_cpu_topology
Since arch_update_cpu_topology() doesn't do anything on powerpc now, remove it and associated dead code. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 6 -- arch/powerpc/mm/numa.c | 10 -- 2 files changed, 16 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 658aad65912b..b2c346c5e16f 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -43,7 +43,6 @@ extern void __init dump_numa_cpu_topology(void); extern int sysfs_add_device_to_node(struct device *dev, int nid); extern void sysfs_remove_device_from_node(struct device *dev, int nid); -extern int numa_update_cpu_topology(bool cpus_locked); static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { @@ -78,11 +77,6 @@ static inline void sysfs_remove_device_from_node(struct device *dev, { } -static inline int numa_update_cpu_topology(bool cpus_locked) -{ - return 0; -} - static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {} static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 26fcc947dd2d..e437a9ac4956 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1200,16 +1200,6 @@ int find_and_online_cpu_nid(int cpu) return new_nid; } -int numa_update_cpu_topology(bool cpus_locked) -{ - return 0; -} - -int arch_update_cpu_topology(void) -{ - return numa_update_cpu_topology(true); -} - static int topology_update_init(void) { topology_inited = 1; -- 2.25.4
[PATCH 10/18] powerpc/numa: remove timed_topology_update()
timed_topology_update is a no-op now, so remove it and all call sites. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 5 - arch/powerpc/mm/numa.c | 9 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 -- 3 files changed, 16 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 2db7ba789720..379e2cc3789f 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -97,7 +97,6 @@ extern int start_topology_update(void); extern int stop_topology_update(void); extern int prrn_is_enabled(void); extern int find_and_online_cpu_nid(int cpu); -extern int timed_topology_update(int nsecs); #else static inline int start_topology_update(void) { @@ -115,10 +114,6 @@ static inline int find_and_online_cpu_nid(int cpu) { return 0; } -static inline int timed_topology_update(int nsecs) -{ - return 0; -} #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */ diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index b220e5b60140..6c579ac3e679 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1124,14 +1124,6 @@ u64 memory_hotplug_max(void) #ifdef CONFIG_PPC_SPLPAR static int topology_inited; -/* - * Change polling interval for associativity changes. - */ -int timed_topology_update(int nsecs) -{ - return 0; -} - /* * Retrieve the new associativity information for a virtual processor's * home node. @@ -1147,7 +1139,6 @@ static long vphn_get_associativity(unsigned long cpu, switch (rc) { case H_SUCCESS: dbg("VPHN hcall succeeded. Reset polling...\n"); - timed_topology_update(0); goto out; case H_FUNCTION: diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index d4b346355bb9..dbfabb185eb5 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -263,7 +263,6 @@ static int dlpar_offline_cpu(struct device_node *dn) break; cpu_maps_update_done(); - timed_topology_update(1); rc = device_offline(get_cpu_device(cpu)); if (rc) goto out; @@ -302,7 +301,6 @@ static int dlpar_online_cpu(struct device_node *dn) if (get_hard_smp_processor_id(cpu) != thread) continue; cpu_maps_update_done(); - timed_topology_update(1); find_and_online_cpu_nid(cpu); rc = device_online(get_cpu_device(cpu)); if (rc) { -- 2.25.4
[PATCH 13/18] powerpc/numa: remove prrn_is_enabled()
All users of this prrn_is_enabled() are gone; remove it. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 5 - arch/powerpc/mm/numa.c | 5 - 2 files changed, 10 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 537c638582eb..658aad65912b 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -93,13 +93,8 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) -extern int prrn_is_enabled(void); extern int find_and_online_cpu_nid(int cpu); #else -static inline int prrn_is_enabled(void) -{ - return 0; -} static inline int find_and_online_cpu_nid(int cpu) { return 0; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dec7ce3b5e67..26fcc947dd2d 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1210,11 +1210,6 @@ int arch_update_cpu_topology(void) return numa_update_cpu_topology(true); } -int prrn_is_enabled(void) -{ - return 0; -} - static int topology_update_init(void) { topology_inited = 1; -- 2.25.4
[PATCH 02/18] powerpc/rtas: don't online CPUs for partition suspend
Partition suspension, used for hibernation and migration, requires that the OS place all but one of the LPAR's processor threads into one of two states prior to calling the ibm,suspend-me RTAS function: * the architected offline state (via RTAS stop-self); or * the H_JOIN hcall, which does not return until the partition resumes execution Using H_CEDE as the offline mode, introduced by commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), means that any threads which are offline from Linux's point of view must be moved to one of those two states before a partition suspension can proceed. This was eventually addressed in commit 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation"), which added code to temporarily bring up any offline processor threads so they can call H_JOIN. Conceptually this is fine, but the implementation has had multiple races with cpu hotplug operations initiated from user space[1][2][3], the error handling is fragile, and it generates user-visible cpu hotplug events which is a lot of noise for a platform feature that's supposed to minimize disruption to workloads. With commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") reverted, this code becomes unnecessary, so remove it. Since any offline CPUs now are truly offline from the platform's point of view, it is no longer necessary to bring up CPUs only to have them call H_JOIN and then go offline again upon resuming. Only active threads are required to call H_JOIN; stopped threads can be left alone. [1] commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") [2] commit 9fb603050ffd ("powerpc/rtas: retry when cpu offline races with suspend/migration") [3] commit dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration") Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 2 - arch/powerpc/kernel/rtas.c | 122 +-- arch/powerpc/platforms/pseries/suspend.c | 22 +--- 3 files changed, 3 insertions(+), 143 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 014968f25f7e..0107d724e9da 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -253,8 +253,6 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -extern int rtas_online_cpus_mask(cpumask_var_t cpus); -extern int rtas_offline_cpus_mask(cpumask_var_t cpus); extern int rtas_ibm_suspend_me(u64 handle); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index a09eba03f180..806d554ce357 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -843,96 +843,6 @@ static void rtas_percpu_suspend_me(void *info) __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); } -enum rtas_cpu_state { - DOWN, - UP, -}; - -#ifndef CONFIG_SMP -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - if (!cpumask_empty(cpus)) { - cpumask_clear(cpus); - return -EINVAL; - } else - return 0; -} -#else -/* On return cpumask will be altered to indicate CPUs changed. 
- * CPUs with states changed will be set in the mask, - * CPUs with status unchanged will be unset in the mask. */ -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - int cpu; - int cpuret = 0; - int ret = 0; - - if (cpumask_empty(cpus)) - return 0; - - for_each_cpu(cpu, cpus) { - struct device *dev = get_cpu_device(cpu); - - switch (state) { - case DOWN: - cpuret = device_offline(dev); - break; - case UP: - cpuret = device_online(dev); - break; - } - if (cpuret < 0) { - pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", - __func__, - ((state == UP) ? "up" : "down"), - cpu, cpuret); - if (!ret) - ret = cpuret; - if (state == UP) { - /* clear bits for unchanged cpus, return */ -
[PATCH 17/18] powerpc/pseries: remove dlpar_cpu_readd()
dlpar_cpu_readd() is unused now. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 1 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 19 --- 2 files changed, 20 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index b2c346c5e16f..f0b6300e7dd3 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -115,7 +115,6 @@ int get_physical_package_id(int cpu); #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_core_id(cpu) (cpu_to_core_id(cpu)) -int dlpar_cpu_readd(int cpu); #endif #endif diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index dbfabb185eb5..4bad7a83addc 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -779,25 +779,6 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add) return rc; } -int dlpar_cpu_readd(int cpu) -{ - struct device_node *dn; - struct device *dev; - u32 drc_index; - int rc; - - dev = get_cpu_device(cpu); - dn = dev->of_node; - - rc = of_property_read_u32(dn, "ibm,my-drc-index", _index); - - rc = dlpar_cpu_remove_by_index(drc_index); - if (!rc) - rc = dlpar_cpu_add(drc_index); - - return rc; -} - int dlpar_cpu(struct pseries_hp_errorlog *hp_elog) { u32 count, drc_index; -- 2.25.4
Re: [PATCH 1/2] powerpc/pseries: remove cede offline state for CPUs
Hi Gautham, Gautham R Shenoy writes: > On Mon, Jun 01, 2020 at 11:31:39PM -0500, Nathan Lynch wrote: >> This effectively reverts commit 3aa565f53c39 ("powerpc/pseries: Add >> hooks to put the CPU into an appropriate offline state"), which added >> an offline mode for CPUs which uses the H_CEDE hcall instead of the >> architected stop-self RTAS function in order to facilitate "folding" >> of dedicated mode processors on PowerVM platforms to achieve energy >> savings. This has been the default offline mode since its >> introduction. >> >> There's nothing about stop-self that would prevent the hypervisor from >> achieving the energy savings available via H_CEDE, so the original >> premise of this change appears to be flawed. > > > IIRC, back in 2009, when the Extended-CEDE was introduced, it couldn't > be exposed via the cpuidle subsystem since this state needs an > explicit H_PROD as opposed to the H_IPI which wakes up the regular > CEDE call. So, the alterative was to use the CPU-Hotplug way by having > a userspace daemon fold the cores which weren't needed currently and > bring them back online when they were needed. Back then, Long Term > CEDE was definitely faster compared to stop-self call (It is a pity > that I didn't post the numbers when I wrote the patch) and the > time-taken to unfold a core was definitely one of the concerns. > (https://lkml.org/lkml/2009/9/23/522). Thanks for the background. Ideally, we would understand what has changed since then... certainly the offline path would have been slower when pseries_cpu_die() slept for 200ms between issuing query-cpu-stopped-state calls, but that was changed in 2008 in b906cfa397fd ("powerpc/pseries: Fix cpu hotplug") so perhaps it wasn't a factor in your measurements. Even less sure about what could have made the online path more expensive. (For the benefit of anybody else referring to the 2009 discussion that you linked: that exchange is a bit strange to me because there seems to be a conception that stop-self releases the processor to the platform in a way that could prevent the OS from restarting it on demand. I don't believe this has ever been the case. Processor deallocation is a related but separate workflow involving different RTAS functions.) > The patch looks good to me. > Reviewed-by: Gautham R. Shenoy I appreciate the review, thanks!
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
Pingfan Liu writes: > On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch wrote: >> Pingfan Liu writes: >> > This will introduce extra dt updating payload for each involved lmb when >> > hotplug. >> > But it should be fine since drmem_update_dt() is memory based operation and >> > hotplug is not a hot path. >> >> This is great analysis but the performance implications of the change >> are grave. The add/remove paths here are already O(n) where n is the >> quantity of memory assigned to the LP, this change would make it O(n^2): >> >> dlpar_memory_add_by_count >> for_each_drmem_lmb <-- >> dlpar_add_lmb >> drmem_update_dt(_v1|_v2) >> for_each_drmem_lmb <-- >> >> Memory add/remove isn't a hot path but quadratic runtime complexity >> isn't acceptable. Its current performance is bad enough that I have > Yes, the quadratic runtime complexity sounds terrible. > And I am curious about the bug. Does the system have thousands of lmb? Yes. >> Not to mention we leak memory every time drmem_update_dt is called >> because we can't safely free device tree properties :-( > Do you know what block us to free it? It's a longstanding problem. References to device tree properties aren't counted or tracked so there's no way to safely free them unless the node itself is released. But the ibm,dynamic-reconfiguration-memory node does not ever go away and its properties are only subject to updates. Maybe there's a way to address the specific case of ibm,dynamic-reconfiguration-memory and the ibm,dynamic-memory(-v2) properties, instead of tackling the general problem. Regardless of all that, the drmem code needs better data structures and lookup functions.
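To make that last point concrete, here is the kind of lookup a better-organized drmem array could support. This is purely illustrative, not proposed kernel code: it assumes the LMB array were kept sorted by drc_index (it is not today), and the function name is invented.

/*
 * Sketch only: binary search instead of the linear
 * for_each_drmem_lmb() scan, assuming lmbs[] is sorted by drc_index.
 */
static struct drmem_lmb *drmem_lookup_by_drc_index(u32 drc_index)
{
	struct drmem_lmb *lmbs = drmem_info->lmbs;
	int lo = 0, hi = drmem_info->n_lmbs - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (lmbs[mid].drc_index == drc_index)
			return &lmbs[mid];
		if (lmbs[mid].drc_index < drc_index)
			lo = mid + 1;
		else
			hi = mid - 1;
	}

	return NULL;
}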
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
Pingfan Liu writes: > A bug is observed on pseries by taking the following steps on rhel: > -1. drmgr -c mem -r -q 5 > -2. echo c > /proc/sysrq-trigger > > And then, the failure looks like: > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/ > kdump: saving vmcore-dmesg.txt > kdump: saving vmcore-dmesg.txt complete > kdump: saving vmcore > Checking for memory holes : [ 0.0 %] / > Checking for memory holes : [100.0 %] | > Excluding unnecessary pages : [100.0 %] \ > Copying data : [ 0.3 %] - > eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! > EA=0x7fffba40 access=0x8004 current=makedumpfile > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > psize 2 pte=0xc0005504 > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 > access=0x8004 current=makedumpfile > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > psize 2 pte=0xc0005504 > [ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip > 7fffbbc4d7fc lr 00011356ca3c code 2 > [ 44.338548] Core dump to |/bin/false pipe failed > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error > $CORE_COLLECTOR /proc/vmcore > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete > kdump: saving vmcore failed > > * Root cause * > After analyzing, it turns out that in the current implementation, > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as > the code __remove_memory() comes before drmem_update_dt(). > So in kdump kernel, when read_from_oldmem() resorts to > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it > can be observed "Bus error" > > From a viewpoint of listener and publisher, the publisher notifies the > listener before data is ready. This introduces a problem where udev > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before > updating. And in capture kernel, makedumpfile will access the memory based > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb. > > * Fix * > In order to fix this issue, update dt before __remove_memory(), and > accordingly the same rule in hot-add path. > > This will introduce extra dt updating payload for each involved lmb when > hotplug. > But it should be fine since drmem_update_dt() is memory based operation and > hotplug is not a hot path. This is great analysis but the performance implications of the change are grave. The add/remove paths here are already O(n) where n is the quantity of memory assigned to the LP, this change would make it O(n^2): dlpar_memory_add_by_count for_each_drmem_lmb <-- dlpar_add_lmb drmem_update_dt(_v1|_v2) for_each_drmem_lmb <-- Memory add/remove isn't a hot path but quadratic runtime complexity isn't acceptable. Its current performance is bad enough that I have internal bugs open on it. Not to mention we leak memory every time drmem_update_dt is called because we can't safely free device tree properties :-( Also note that this sort of reverts (fixes?) 063b8b1251fd ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR request").
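For reference, the batching that 063b8b1251fd introduced looks roughly like the sketch below (simplified; do_request_lmb_ops() is a stand-in name for the existing add/remove-by-count/index helpers). The per-LMB work runs once over the affected LMBs, and the device tree property is regenerated a single time per request, which is what keeps the whole operation O(n).

/* Simplified sketch, not actual kernel code. */
static int dlpar_memory_batched(struct pseries_hp_errorlog *hp_elog)
{
	int rc;

	lock_device_hotplug();

	/* do_request_lmb_ops() stands in for the existing helpers;
	 * it walks the affected LMBs once: O(n) worst case. */
	rc = do_request_lmb_ops(hp_elog);

	/* Regenerate ibm,dynamic-memory(-v2) once per request, not per LMB. */
	if (!rc)
		rc = drmem_update_dt();

	unlock_device_hotplug();

	return rc;
}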
Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
"Aneesh Kumar K.V" writes: > This is the next version of the fixes for memory unplug on radix. > The issues and the fix are described in the actual patches. I guess this isn't actually causing problems at runtime right now, but I notice calls to resize_hpt_for_hotplug() from arch_add_memory() and arch_remove_memory(), which ought to be mmu-agnostic: int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; int rc; resize_hpt_for_hotplug(memblock_phys_mem_size()); start = (unsigned long)__va(start); rc = create_section_mapping(start, start + size, nid, params->pgprot); ...
[PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield. Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it. Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/drmem.h | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 414d209f45bb..36d0ed04bda8 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -8,6 +8,8 @@ #ifndef _ASM_POWERPC_LMB_H #define _ASM_POWERPC_LMB_H +#include + struct drmem_lmb { u64 base_addr; u32 drc_index; @@ -26,8 +28,14 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) +{ + cond_resched(); + return ++lmb; +} + #define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) < (end); (lmb)++) + for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb)) #define for_each_drmem_lmb(lmb)\ for_each_drmem_lmb_in_range((lmb), \ -- 2.25.4
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Hi Laurent, Laurent Dufour writes: > Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >> The drmem lmb list can have hundreds of thousands of entries, and >> unfortunately lookups take the form of linear searches. As long as >> this is the case, traversals have the potential to monopolize the CPU >> and provoke lockup reports, workqueue stalls, and the like unless >> they explicitly yield. >> >> Rather than placing cond_resched() calls within various >> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >> expression of the loop macro itself so users can't omit it. > > Is that not too much to call cond_resched() on every LMB? > > Could that be less frequent, every 10, or 100, I don't really know ? Everything done within for_each_drmem_lmb is relatively heavyweight already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens of milliseconds. I don't think cond_resched() is an expensive check in this context.
Re: [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's
Pingfan Liu writes: > This patch prepares for the incoming patch which swaps the order of KOBJ_ > uevent and dt's updating. > > It has no functional effect, just groups lmb operation and memblock's in > order to insert dt updating operation easily, and makes it easier to > review. ... > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c > b/arch/powerpc/platforms/pseries/hotplug-memory.c > index 5d545b7..1a3ac3b 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c > @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *); > static int dlpar_remove_lmb(struct drmem_lmb *lmb) > { > unsigned long block_sz; > - int rc; > + phys_addr_t base_addr; > + int rc, nid; > > if (!lmb_is_removable(lmb)) > return -EINVAL; > @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) > if (rc) > return rc; > > + base_addr = lmb->base_addr; > + nid = lmb->nid; > block_sz = pseries_memory_block_size(); > > - __remove_memory(lmb->nid, lmb->base_addr, block_sz); > - > - /* Update memory regions for memory remove */ > - memblock_remove(lmb->base_addr, block_sz); > - > invalidate_lmb_associativity_index(lmb); > lmb_clear_nid(lmb); > lmb->flags &= ~DRCONF_MEM_ASSIGNED; > > + __remove_memory(nid, base_addr, block_sz); > + > + /* Update memory regions for memory remove */ > + memblock_remove(base_addr, block_sz); > + > return 0; > } I don't understand; the commit message should not claim this has no functional effect when it changes the order of operations like this. Maybe this is an improvement over the current behavior, but it's not explained why it would be.
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > Nathan Lynch writes: >> Laurent Dufour writes: >>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >>>> The drmem lmb list can have hundreds of thousands of entries, and >>>> unfortunately lookups take the form of linear searches. As long as >>>> this is the case, traversals have the potential to monopolize the CPU >>>> and provoke lockup reports, workqueue stalls, and the like unless >>>> they explicitly yield. >>>> >>>> Rather than placing cond_resched() calls within various >>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >>>> expression of the loop macro itself so users can't omit it. >>> >>> Is that not too much to call cond_resched() on every LMB? >>> >>> Could that be less frequent, every 10, or 100, I don't really know ? >> >> Everything done within for_each_drmem_lmb is relatively heavyweight >> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens >> of milliseconds. I don't think cond_resched() is an expensive check in >> this context. > > Hmm, mostly. > > But there are quite a few cases like drmem_update_dt_v1(): > > for_each_drmem_lmb(lmb) { > dr_cell->base_addr = cpu_to_be64(lmb->base_addr); > dr_cell->drc_index = cpu_to_be32(lmb->drc_index); > dr_cell->aa_index = cpu_to_be32(lmb->aa_index); > dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb)); > > dr_cell++; > } > > Which will compile to a pretty tight loop at the moment. > > Or drmem_update_dt_v2() which has two loops over all lmbs. > > And although the actual TIF check is cheap the function call to do it is > not free. > > So I worry this is going to make some of those long loops take even > longer. That's fair, and I was wrong - some of the loop bodies are relatively simple, not doing allocations or taking locks, etc. One way to deal is to keep for_each_drmem_lmb() as-is and add a new iterator that can reschedule, e.g. for_each_drmem_lmb_slow(). On the other hand... it's probably not too strong to say that the drmem/hotplug code is in crisis with respect to correctness and algorithmic complexity, so those are my overriding concerns right now. Yes, this change will pessimize loops that are reinitializing the entire drmem_lmb array on every DLPAR operation, but: 1. it doesn't make any user of for_each_drmem_lmb() less correct; 2. why is this code doing that in the first place, other than to accommodate a poor data structure choice? The duration of the system calls where this code runs are measured in minutes or hours on large configurations because of all the behaviors that are at best O(n) with the amount of memory assigned to the partition. For simplicity's sake I'd rather defer lower-level performance considerations like this until the drmem data structures' awful lookup properties are fixed -- hopefully in the 5.10 timeframe. Thoughts?
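If the per-iteration call ever does show up in profiles, one compromise (a hypothetical sketch, not what the posted patch does) is to keep the macro but only yield every N entries:

static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb,
                                               const struct drmem_lmb *start)
{
        ++lmb;
        if ((lmb - start) % 128 == 0)   /* 128 is an arbitrary stride */
                cond_resched();
        return lmb;
}

#define for_each_drmem_lmb_in_range(lmb, start, end)                    \
        for ((lmb) = (start); (lmb) < (end); (lmb) = drmem_lmb_next((lmb), (start)))

That keeps the tight loops close to their current cost while still bounding the time between reschedule points.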
Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
Tyrel Datwyler writes: >> --- a/arch/powerpc/mm/numa.c >> +++ b/arch/powerpc/mm/numa.c >> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void) >> return; >> >> if (of_property_read_u32_index(rtas, >> -"ibm,max-associativity-domains", >> +"ibm,current-associativity-domains", > > I'm not sure ibm,current-associativity-domains is guaranteed to exist on older > firmware. You may need check that it exists and fall back to > ibm,max-associativity-domains in the event it doesn't Yes. Looks like it's a PowerVM-specific property.
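A hedged sketch of that fallback (read_assoc_domains() is a made-up helper name; the surrounding find_possible_nodes() changes are left out):

static int read_assoc_domains(struct device_node *rtas, u32 index, u32 *out)
{
        int rc;

        /* preferred where available (PowerVM): the current domain counts */
        rc = of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
                                        index, out);
        if (rc)
                /* older firmware: fall back to the original property */
                rc = of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
                                                index, out);
        return rc;
}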
[PATCH] powerpc/pseries/dlpar: handle ibm,configure-connector delay status
dlpar_configure_connector() has two problems in its handling of ibm,configure-connector's return status: 1. When the status is -2 (busy, call again), we call ibm,configure-connector again immediately without checking whether to schedule, which can result in monopolizing the CPU. 2. Extended delay status (9900..9905) goes completely unhandled, causing the configuration to unnecessarily terminate. Fix both of these issues by using rtas_busy_delay(). Fixes: ab519a011caa ("powerpc/pseries: Kernel DLPAR Infrastructure") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/dlpar.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 16e86ba8aa20..f6b7749d6ada 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -127,7 +127,6 @@ void dlpar_free_cc_nodes(struct device_node *dn) #define NEXT_PROPERTY 3 #define PREV_PARENT 4 #define MORE_MEMORY 5 -#define CALL_AGAIN -2 #define ERR_CFG_USE -9003 struct device_node *dlpar_configure_connector(__be32 drc_index, @@ -168,6 +167,9 @@ struct device_node *dlpar_configure_connector(__be32 drc_index, spin_unlock(_data_buf_lock); + if (rtas_busy_delay(rc)) + continue; + switch (rc) { case COMPLETE: break; @@ -216,9 +218,6 @@ struct device_node *dlpar_configure_connector(__be32 drc_index, last_dn = last_dn->parent; break; - case CALL_AGAIN: - break; - case MORE_MEMORY: case ERR_CFG_USE: default: -- 2.29.2
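For reference, rtas_busy_delay() sleeps for the hinted interval on an extended-delay status (9900..9905), handles the plain busy status (-2), and returns nonzero whenever the call should be retried, so the usual idiom looks like this (a sketch; "example-service" is a placeholder name and token validity checking is elided):

        int rc;

        do {
                rc = rtas_call(rtas_token("example-service"), 0, 1, NULL);
        } while (rtas_busy_delay(rc));

dlpar_configure_connector() can't use the plain do/while because it has real work to do between calls, hence the "continue" form in the patch below.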
Re: [PATCH 01/29] powerpc/rtas: move rtas_call_reentrant() out of pseries guards
Nathan Lynch writes: > rtas_call_reentrant() isn't platform-dependent; move it out of > CONFIG_PPC_PSERIES-guarded code. As reported by the 0-day CI, this is mistaken, and it breaks non-pseries builds: >> arch/powerpc/kernel/rtas.c:938:21: error: no member named >> 'rtas_args_reentrant' in 'struct paca_struct' args = local_paca->rtas_args_reentrant; ~~ ^ 1 error generated. https://lore.kernel.org/linuxppc-dev/202012050432.sfbbjwmw-...@intel.com/ I'll drop this and reroll the series.
[PATCH v2 01/28] powerpc/rtas: prevent suspend-related sys_rtas use on LE
While drmgr has had work in some areas to make its RTAS syscall interactions endian-neutral, its code for performing partition migration via the syscall has never worked on LE. While it is able to complete ibm,suspend-me successfully, it crashes when attempting the subsequent ibm,update-nodes call. drmgr is the only known (or plausible) user of ibm,suspend-me, ibm,update-nodes, and ibm,update-properties, so allow them only in big-endian configurations. Signed-off-by: Nathan Lynch --- arch/powerpc/kernel/rtas.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 954f41676f69..4ed64aba37d6 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1050,9 +1050,11 @@ static struct rtas_filter rtas_filters[] __ro_after_init = { { "set-time-for-power-on", -1, -1, -1, -1, -1 }, { "ibm,set-system-parameter", -1, 1, -1, -1, -1 }, { "set-time-of-day", -1, -1, -1, -1, -1 }, +#ifdef CONFIG_CPU_BIG_ENDIAN { "ibm,suspend-me", -1, -1, -1, -1, -1 }, { "ibm,update-nodes", -1, 0, -1, -1, -1, 4096 }, { "ibm,update-properties", -1, 0, -1, -1, -1, 4096 }, +#endif { "ibm,physical-attestation", -1, 0, 1, -1, -1 }, }; -- 2.28.0
[PATCH v2 05/28] powerpc/rtas: add rtas_activate_firmware()
Provide a documented wrapper function for the ibm,activate-firmware service, which must be called after a partition migration or hibernation. If the function is absent or the call fails, the OS will continue to run normally with the current firmware, so there is no need to perform any recovery. Just log it and continue. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 + arch/powerpc/kernel/rtas.c | 30 ++ 2 files changed, 31 insertions(+) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index b43165fc6c2a..fdefe6a974eb 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -247,6 +247,7 @@ extern void __noreturn rtas_restart(char *cmd); extern void rtas_power_off(void); extern void __noreturn rtas_halt(void); extern void rtas_os_term(char *str); +void rtas_activate_firmware(void); extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_sensor_fast(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 8a618a3c4beb..3a740ae933f8 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -798,6 +798,36 @@ void rtas_os_term(char *str) printk(KERN_EMERG "ibm,os-term call failed %d\n", status); } +/** + * rtas_activate_firmware() - Activate a new version of firmware. + * + * Activate a new version of partition firmware. The OS must call this + * after resuming from a partition hibernation or migration in order + * to maintain the ability to perform live firmware updates. It's not + * catastrophic for this method to be absent or to fail; just log the + * condition in that case. + * + * Context: This function may sleep. + */ +void rtas_activate_firmware(void) +{ + int token; + int fwrc; + + token = rtas_token("ibm,activate-firmware"); + if (token == RTAS_UNKNOWN_SERVICE) { + pr_notice("ibm,activate-firmware method unavailable\n"); + return; + } + + do { + fwrc = rtas_call(token, 0, 1, NULL); + } while (rtas_busy_delay(fwrc)); + + if (fwrc) + pr_err("ibm,activate-firmware failed (%i)\n", fwrc); +} + static int ibm_suspend_me_token = RTAS_UNKNOWN_SERVICE; #ifdef CONFIG_PPC_PSERIES static int __rtas_suspend_last_cpu(struct rtas_suspend_me_data *data, int wake_when_done) -- 2.28.0
[PATCH v2 09/28] powerpc/pseries/mobility: error message improvements
- Convert printk(KERN_ERR) to pr_err(). - Include errno in property update failure message. - Remove reference to "Post-mobility" from device tree update message: with pr_err() it will have a "mobility:" prefix. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 527a64e2d89f..31d81b7da961 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -208,8 +208,8 @@ static int update_dt_node(__be32 phandle, s32 scope) rc = update_dt_property(dn, , prop_name, vd, prop_data); if (rc) { - printk(KERN_ERR "Could not update %s" - " property\n", prop_name); + pr_err("updating %s property failed: %d\n", + prop_name, rc); } prop_data += vd; @@ -343,8 +343,7 @@ void post_mobility_fixup(void) rc = pseries_devicetree_update(MIGRATION_SCOPE); if (rc) - printk(KERN_ERR "Post-mobility device tree update " - "failed: %d\n", rc); + pr_err("device tree update failed: %d\n", rc); cacheinfo_rebuild(); -- 2.28.0
[PATCH v2 15/28] powerpc/rtas: dispatch partition migration requests to pseries
sys_rtas() cannot call ibm,suspend-me directly in the same way it handles other inputs. Instead it must dispatch the request to code that can first perform the H_JOIN sequence before any call to ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair amount of platform-specific code to implement this. Since a different, more robust implementation of the suspend sequence is now in the pseries platform code, we want to dispatch the request there. Note that invoking ibm,suspend-me via the RTAS syscall is all but deprecated; this change preserves ABI compatibility for old programs while providing to them the benefit of the new partition suspend implementation. This is a behavior change in that the kernel performs the device tree update and firmware activation before returning, but experimentation indicates this is tolerated fine by legacy user space. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 5 + arch/powerpc/kernel/rtas.c| 2 +- arch/powerpc/platforms/pseries/mobility.c | 5 + 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index fdefe6a974eb..3b52d8574fcc 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -279,8 +279,13 @@ extern time64_t last_rtas_event; extern int clobbering_unread_rtas_event(void); extern int pseries_devicetree_update(s32 scope); extern void post_mobility_fixup(void); +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); #else static inline int clobbering_unread_rtas_event(void) { return 0; } +static inline int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return -EINVAL; +} #endif #ifdef CONFIG_PPC_RTAS_DAEMON diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 3a740ae933f8..d4b048571728 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1272,7 +1272,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me_unsafe(handle); + rc = rtas_syscall_dispatch_ibm_suspend_me(handle); if (rc == -EAGAIN) args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); else if (rc == -EIO) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index fe7e35cdc9d5..e670180f311d 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -615,6 +615,11 @@ static int pseries_migrate_partition(u64 handle) return ret; } +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return pseries_migrate_partition(handle); +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) -- 2.28.0
[PATCH v2 20/28] powerpc/machdep: remove suspend_disable_cpu()
There are no users left of the suspend_disable_cpu() callback, remove it. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/machdep.h | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 475687f24f4a..cf6ebbc16cb4 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -207,7 +207,6 @@ struct machdep_calls { void (*suspend_disable_irqs)(void); void (*suspend_enable_irqs)(void); #endif - int (*suspend_disable_cpu)(void); #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE ssize_t (*cpu_probe)(const char *, size_t); -- 2.28.0
[PATCH v2 23/28] powerpc/rtas: remove unused rtas_suspend_last_cpu()
rtas_suspend_last_cpu() is now unused, remove it and __rtas_suspend_last_cpu() which also becomes unused. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - arch/powerpc/kernel/rtas.c | 43 - 2 files changed, 44 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 97ccb40fb09f..332e1000ca0f 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -256,7 +256,6 @@ extern bool rtas_indicator_present(int token, int *maxindex); extern int rtas_set_indicator(int indicator, int index, int new_value); extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); -extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me(int *fw_status); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index aedd46967b99..9a7d1bba3ef7 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -830,49 +830,6 @@ void rtas_activate_firmware(void) static int ibm_suspend_me_token = RTAS_UNKNOWN_SERVICE; #ifdef CONFIG_PPC_PSERIES -static int __rtas_suspend_last_cpu(struct rtas_suspend_me_data *data, int wake_when_done) -{ - u16 slb_size = mmu_slb_size; - int rc = H_MULTI_THREADS_ACTIVE; - int cpu; - - slb_set_size(SLB_MIN_SIZE); - printk(KERN_DEBUG "calling ibm,suspend-me on cpu %i\n", smp_processor_id()); - - while (rc == H_MULTI_THREADS_ACTIVE && !atomic_read(>done) && - !atomic_read(>error)) - rc = rtas_call(data->token, 0, 1, NULL); - - if (rc || atomic_read(>error)) { - printk(KERN_DEBUG "ibm,suspend-me returned %d\n", rc); - slb_set_size(slb_size); - } - - if (atomic_read(>error)) - rc = atomic_read(>error); - - atomic_set(>error, rc); - pSeries_coalesce_init(); - - if (wake_when_done) { - atomic_set(>done, 1); - - for_each_online_cpu(cpu) - plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); - } - - if (atomic_dec_return(>working) == 0) - complete(data->complete); - - return rc; -} - -int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data) -{ - atomic_inc(>working); - return __rtas_suspend_last_cpu(data, 0); -} - /** * rtas_call_reentrant() - Used for reentrant rtas calls * @token: Token for desired reentrant RTAS call -- 2.28.0
[PATCH v2 11/28] powerpc/pseries/mobility: extract VASI session polling logic
The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to the caller until the specified VASI suspend session state makes the transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest of separating concerns to prepare for a new implementation of the join/suspend sequence, extract VASI session polling logic into a couple of local functions. Waiting for the session state to reach H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures that we will never get an EAGAIN result necessitating a retry. No user-visible change in behavior is intended. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 69 +-- 1 file changed, 64 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 01ac7c03558e..573ed48b43d8 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -345,6 +345,66 @@ void post_mobility_fixup(void) return; } +static int poll_vasi_state(u64 handle, unsigned long *res) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + long hvrc; + int ret; + + hvrc = plpar_hcall(H_VASI_STATE, retbuf, handle); + switch (hvrc) { + case H_SUCCESS: + ret = 0; + *res = retbuf[0]; + break; + case H_PARAMETER: + ret = -EINVAL; + break; + case H_FUNCTION: + ret = -EOPNOTSUPP; + break; + case H_HARDWARE: + default: + pr_err("unexpected H_VASI_STATE result %ld\n", hvrc); + ret = -EIO; + break; + } + return ret; +} + +static int wait_for_vasi_session_suspending(u64 handle) +{ + unsigned long state; + int ret; + + /* +* Wait for transition from H_VASI_ENABLED to +* H_VASI_SUSPENDING. Treat anything else as an error. +*/ + while (true) { + ret = poll_vasi_state(handle, ); + + if (ret != 0 || state == H_VASI_SUSPENDING) { + break; + } else if (state == H_VASI_ENABLED) { + ssleep(1); + } else { + pr_err("unexpected H_VASI_STATE result %lu\n", state); + ret = -EIO; + break; + } + } + + /* +* Proceed even if H_VASI_STATE is unavailable. If H_JOIN or +* ibm,suspend-me are also unimplemented, we'll recover then. +*/ + if (ret == -EOPNOTSUPP) + ret = 0; + + return ret; +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) @@ -356,12 +416,11 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; - do { - rc = rtas_ibm_suspend_me_unsafe(streamid); - if (rc == -EAGAIN) - ssleep(1); - } while (rc == -EAGAIN); + rc = wait_for_vasi_session_suspending(streamid); + if (rc) + return rc; + rc = rtas_ibm_suspend_me_unsafe(streamid); if (rc) return rc; -- 2.28.0
[PATCH v2 04/28] powerpc/rtas: add rtas_ibm_suspend_me()
Now that the name is available, provide a simple wrapper for ibm,suspend-me which returns both a Linux errno and optionally the actual RTAS status to the caller. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 + arch/powerpc/kernel/rtas.c | 57 + 2 files changed, 58 insertions(+) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 8436ed01567b..b43165fc6c2a 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -258,6 +258,7 @@ extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me_unsafe(u64 handle); +int rtas_ibm_suspend_me(int *fw_status); struct rtc_time; extern time64_t rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 0a8e5dc2c108..8a618a3c4beb 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -684,6 +684,63 @@ int rtas_set_indicator_fast(int indicator, int index, int new_value) return rc; } +/** + * rtas_ibm_suspend_me() - Call ibm,suspend-me to suspend the LPAR. + * + * @fw_status: RTAS call status will be placed here if not NULL. + * + * rtas_ibm_suspend_me() should be called only on a CPU which has + * received H_CONTINUE from the H_JOIN hcall. All other active CPUs + * should be waiting to return from H_JOIN. + * + * rtas_ibm_suspend_me() may suspend execution of the OS + * indefinitely. Callers should take appropriate measures upon return, such as + * resetting watchdog facilities. + * + * Callers may choose to retry this call if @fw_status is + * %RTAS_THREADS_ACTIVE. + * + * Return: + * 0 - The partition has resumed from suspend, possibly after + * migration to a different host. + * -ECANCELED - The operation was aborted. + * -EAGAIN- There were other CPUs not in H_JOIN at the time of the call. + * -EBUSY - Some other condition prevented the suspend from succeeding. + * -EIO - Hardware/platform error. + */ +int rtas_ibm_suspend_me(int *fw_status) +{ + int fwrc; + int ret; + + fwrc = rtas_call(rtas_token("ibm,suspend-me"), 0, 1, NULL); + + switch (fwrc) { + case 0: + ret = 0; + break; + case RTAS_SUSPEND_ABORTED: + ret = -ECANCELED; + break; + case RTAS_THREADS_ACTIVE: + ret = -EAGAIN; + break; + case RTAS_NOT_SUSPENDABLE: + case RTAS_OUTSTANDING_COPROC: + ret = -EBUSY; + break; + case -1: + default: + ret = -EIO; + break; + } + + if (fw_status) + *fw_status = fwrc; + + return ret; +} + void __noreturn rtas_restart(char *cmd) { if (rtas_flash_term_hook) -- 2.28.0
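A minimal caller sketch (do_suspend_sketch() is a made-up name; this is essentially the shape the later stop_machine patch's do_suspend() takes):

static int do_suspend_sketch(void)
{
        int status;
        int ret;

        ret = rtas_ibm_suspend_me(&status);
        if (ret)
                pr_err("ibm,suspend-me: ret %d, fw status %d\n", ret, status);

        return ret;
}

Having both the errno and the raw RTAS status available lets callers decide their own retry policy on -EAGAIN / RTAS_THREADS_ACTIVE without re-parsing firmware codes.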
[PATCH v2 12/28] powerpc/pseries/mobility: use stop_machine for join/suspend
The partition suspend sequence as specified in the platform architecture requires that all active processor threads call H_JOIN, which: - suspends the calling thread until it is the target of an H_PROD; or - immediately returns H_CONTINUE, if the calling thread is the last to call H_JOIN. This thread is expected to call ibm,suspend-me to completely suspend the partition. Upon returning from ibm,suspend-me the calling thread must wake all others using H_PROD. rtas_ibm_suspend_me_unsafe() uses on_each_cpu() to implement this protocol, but because of its synchronizing nature this is susceptible to deadlock versus users of stop_machine() or other callers of on_each_cpu(). Not only is stop_machine() intended for use cases like this, it handles error propagation and allows us to keep the data shared between CPUs minimal: a single atomic counter which ensures exactly one CPU will wake the others from their joined states. Switch the migration code to use stop_machine() and a less complex local implementation of the H_JOIN/ibm,suspend-me logic, which carries additional benefits: - more informative error reporting, appropriately ratelimited - resets the lockup detector / watchdog on resume to prevent lockup warnings when the OS has been suspended for a time exceeding the threshold. Fixes: 91dc182ca6e2 ("[PATCH] powerpc: special-case ibm,suspend-me RTAS call") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 132 -- 1 file changed, 125 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 573ed48b43d8..5a3951626a96 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -12,9 +12,11 @@ #include #include #include +#include #include #include #include +#include #include #include #include @@ -405,6 +407,128 @@ static int wait_for_vasi_session_suspending(u64 handle) return ret; } +static void prod_single(unsigned int target_cpu) +{ + long hvrc; + int hwid; + + hwid = get_hard_smp_processor_id(target_cpu); + hvrc = plpar_hcall_norets(H_PROD, hwid); + if (hvrc == H_SUCCESS) + return; + pr_err_ratelimited("H_PROD of CPU %u (hwid %d) error: %ld\n", + target_cpu, hwid, hvrc); +} + +static void prod_others(void) +{ + unsigned int cpu; + + for_each_online_cpu(cpu) { + if (cpu != smp_processor_id()) + prod_single(cpu); + } +} + +static u16 clamp_slb_size(void) +{ + u16 prev = mmu_slb_size; + + slb_set_size(SLB_MIN_SIZE); + + return prev; +} + +static int do_suspend(void) +{ + u16 saved_slb_size; + int status; + int ret; + + pr_info("calling ibm,suspend-me on CPU %i\n", smp_processor_id()); + + /* +* The destination processor model may have fewer SLB entries +* than the source. We reduce mmu_slb_size to a safe minimum +* before suspending in order to minimize the possibility of +* programming non-existent entries on the destination. If +* suspend fails, we restore it before returning. On success +* the OF reconfig path will update it from the new device +* tree after resuming on the destination. +*/ + saved_slb_size = clamp_slb_size(); + + ret = rtas_ibm_suspend_me(); + if (ret != 0) { + pr_err("ibm,suspend-me error: %d\n", status); + slb_set_size(saved_slb_size); + } + + return ret; +} + +static int do_join(void *arg) +{ + atomic_t *counter = arg; + long hvrc; + int ret; + + /* Must ensure MSR.EE off for H_JOIN. */ + hard_irq_disable(); + hvrc = plpar_hcall_norets(H_JOIN); + + switch (hvrc) { + case H_CONTINUE: + /* +* All other CPUs are offline or in H_JOIN. 
This CPU +* attempts the suspend. +*/ + ret = do_suspend(); + break; + case H_SUCCESS: + /* +* The suspend is complete and this cpu has received a +* prod. +*/ + ret = 0; + break; + case H_BAD_MODE: + case H_HARDWARE: + default: + ret = -EIO; + pr_err_ratelimited("H_JOIN error %ld on CPU %i\n", + hvrc, smp_processor_id()); + break; + } + + if (atomic_inc_return(counter) == 1) { + pr_info("CPU %u waking all threads\n", smp_processor_id()); + prod_others(); + } + /* +* Execution may have been suspended for several seconds, so +* reset the watchdog. +*/ + touch_nmi_watchdog(); + retur
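The tail of this patch is truncated above; as a sketch of the dispatch site, consistent with the diff context visible in the later retry patch (migrate_sketch() is a placeholder name):

static int migrate_sketch(u64 handle)
{
        atomic_t counter = ATOMIC_INIT(0);
        int ret;

        ret = wait_for_vasi_session_suspending(handle);
        if (ret)
                return ret;

        /* run do_join() on every online CPU with the machine stopped */
        ret = stop_machine(do_join, &counter, cpu_online_mask);
        if (ret == 0)
                post_mobility_fixup();

        return ret;
}

The single shared atomic makes the "exactly one waker" guarantee cheap: whichever CPU increments it to 1 first, whether it ran the suspend or was prodded awake, performs the H_PROD sweep.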
[PATCH v2 24/28] powerpc/pseries/hibernation: remove redundant cacheinfo update
Partitions with cache nodes in the device tree can encounter the following warning on resume: CPU 0 already accounted in PowerPC,POWER9@0(Data) WARNING: CPU: 0 PID: 3177 at arch/powerpc/kernel/cacheinfo.c:197 cacheinfo_cpu_online+0x640/0x820 These calls to cacheinfo_cpu_offline/online have been redundant since commit e610a466d16a ("powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration"). Fixes: e610a466d16a ("powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 703728cb95ec..6a94cc0deb88 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -13,7 +13,6 @@ #include #include #include -#include "../../kernel/cacheinfo.h" static struct device suspend_dev; static DECLARE_COMPLETION(suspend_work); @@ -63,9 +62,7 @@ static void pseries_suspend_enable_irqs(void) * Update configuration which can be modified based on device tree * changes during resume. */ - cacheinfo_cpu_offline(smp_processor_id()); post_mobility_fixup(); - cacheinfo_cpu_online(smp_processor_id()); } /** -- 2.28.0
[PATCH v2 06/28] powerpc/hvcall: add token and codes for H_VASI_SIGNAL
H_VASI_SIGNAL can be used by a partition to request cancellation of its migration. To be used in future changes. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/hvcall.h | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index c1fbccb04390..c98f5141e3fc 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -155,6 +155,14 @@ #define H_VASI_RESUMED 5 #define H_VASI_COMPLETED6 +/* VASI signal codes. Only the Cancel code is valid for H_VASI_SIGNAL. */ +#define H_VASI_SIGNAL_CANCEL1 +#define H_VASI_SIGNAL_ABORT 2 +#define H_VASI_SIGNAL_SUSPEND 3 +#define H_VASI_SIGNAL_COMPLETE 4 +#define H_VASI_SIGNAL_ENABLE5 +#define H_VASI_SIGNAL_FAILOVER 6 + /* Each control block has to be on a 4K boundary */ #define H_CB_ALIGNMENT 4096 @@ -261,6 +269,7 @@ #define H_ADD_CONN 0x284 #define H_DEL_CONN 0x288 #define H_JOIN 0x298 +#define H_VASI_SIGNAL 0x2A0 #define H_VASI_STATE0x2A4 #define H_VIOCTL 0x2A8 #define H_ENABLE_CRQ 0x2B0 -- 2.28.0
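A usage sketch for the cancel path, with handle and err standing in for the eventual caller's stream handle and error code (the reason-code layout, aborting entity in the top byte and detail in the low byte, matches how a later patch in this series builds it):

        u32 detail = abs(err) & 0xff;           /* e.g. errno from the failed suspend */
        u32 reason_code = (6u << 24) | detail;  /* 6 = migrating partition */
        long hvrc;

        hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle,
                                  H_VASI_SIGNAL_CANCEL, reason_code);
        if (hvrc != H_SUCCESS)
                pr_err("H_VASI_SIGNAL error: %ld\n", hvrc);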
[PATCH v2 17/28] powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
There are three ways pseries_suspend_begin() can be reached: 1. When "mem" is written to /sys/power/state: kobj_attr_store() -> state_store() -> pm_suspend() -> suspend_devices_and_enter() -> pseries_suspend_begin() This never works because there is no way to supply a valid stream id using this interface, and H_VASI_STATE is called with a stream id of zero. So this call path is useless at best. 2. When a stream id is written to /sys/devices/system/power/hibernate. pseries_suspend_begin() is polled directly from store_hibernate() until the stream is in the "Suspending" state (i.e. the platform is ready for the OS to suspend execution): dev_attr_store() -> store_hibernate() -> pseries_suspend_begin() 3. When a stream id is written to /sys/devices/system/power/hibernate (continued). After #2, pseries_suspend_begin() is called once again from the pm core: dev_attr_store() -> store_hibernate() -> pm_suspend() -> suspend_devices_and_enter() -> pseries_suspend_begin() This is redundant because the VASI suspend state is already known to be Suspending. The begin() callback of platform_suspend_ops is optional, so we can simply remove that assignment with no loss of function. Fixes: 32d8ad4e621d ("powerpc/pseries: Partition hibernation support") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 81e0ac58d620..3eaa9d59dc7a 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -187,7 +187,6 @@ static struct bus_type suspend_subsys = { static const struct platform_suspend_ops pseries_suspend_ops = { .valid = suspend_valid_only_mem, - .begin = pseries_suspend_begin, .prepare_late = pseries_prepare_late, .enter = pseries_suspend_enter, }; -- 2.28.0
[PATCH v2 27/28] powerpc/rtas: remove unused rtas_suspend_me_data
All code which used this type has been removed. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas-types.h | 8 1 file changed, 8 deletions(-) diff --git a/arch/powerpc/include/asm/rtas-types.h b/arch/powerpc/include/asm/rtas-types.h index aa420561bc10..8df6235d64d1 100644 --- a/arch/powerpc/include/asm/rtas-types.h +++ b/arch/powerpc/include/asm/rtas-types.h @@ -23,14 +23,6 @@ struct rtas_t { struct device_node *dev;/* virtual address pointer */ }; -struct rtas_suspend_me_data { - atomic_t working; /* number of cpus accessing this struct */ - atomic_t done; - int token; /* ibm,suspend-me */ - atomic_t error; - struct completion *complete; /* wait on this until working == 0 */ -}; - struct rtas_error_log { /* Byte 0 */ u8 byte0; /* Architectural version */ -- 2.28.0
[PATCH v2 18/28] powerpc/pseries/hibernation: pass stream id via function arguments
There is no need for the stream id to be a file-global variable; pass it from hibernate_store() to pseries_suspend_begin() for the H_VASI_STATE call. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 3eaa9d59dc7a..232621f33510 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -15,7 +15,6 @@ #include #include "../../kernel/cacheinfo.h" -static u64 stream_id; static struct device suspend_dev; static DECLARE_COMPLETION(suspend_work); static struct rtas_suspend_me_data suspend_data; @@ -29,7 +28,7 @@ static atomic_t suspending; * Return value: * 0 on success / other on failure **/ -static int pseries_suspend_begin(suspend_state_t state) +static int pseries_suspend_begin(u64 stream_id) { long vasi_state, rc; unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; @@ -132,6 +131,7 @@ static ssize_t store_hibernate(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { + u64 stream_id; int rc; if (!capable(CAP_SYS_ADMIN)) @@ -140,7 +140,7 @@ static ssize_t store_hibernate(struct device *dev, stream_id = simple_strtoul(buf, NULL, 16); do { - rc = pseries_suspend_begin(PM_SUSPEND_MEM); + rc = pseries_suspend_begin(stream_id); if (rc == -EAGAIN) ssleep(1); } while (rc == -EAGAIN); @@ -148,8 +148,6 @@ static ssize_t store_hibernate(struct device *dev, if (!rc) rc = pm_suspend(PM_SUSPEND_MEM); - stream_id = 0; - if (!rc) rc = count; -- 2.28.0
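Tangentially, since the stream id is now parsed locally in store_hibernate(), a checked parse would be easy to adopt later if wanted (a sketch only; the posted patch keeps simple_strtoul()):

        u64 stream_id;
        int rc;

        rc = kstrtou64(buf, 16, &stream_id);
        if (rc)
                return rc;

kstrtou64() rejects malformed input and overflow, which simple_strtoul() with a NULL endp silently ignores.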
[PATCH v2 26/28] powerpc/pseries/hibernation: remove prepare_late() callback
The pseries hibernate code no longer calls into the original join/suspend code in kernel/rtas.c, so pseries_prepare_late() and related code don't accomplish anything now. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 25 1 file changed, 25 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 589a91730db8..1b902cbf85c5 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -15,9 +15,6 @@ #include static struct device suspend_dev; -static DECLARE_COMPLETION(suspend_work); -static struct rtas_suspend_me_data suspend_data; -static atomic_t suspending; /** * pseries_suspend_begin - First phase of hibernation @@ -61,23 +58,6 @@ static int pseries_suspend_enter(suspend_state_t state) return rtas_ibm_suspend_me(NULL); } -/** - * pseries_prepare_late - Prepare to suspend all other CPUs - * - * Return value: - * 0 on success / other on failure - **/ -static int pseries_prepare_late(void) -{ - atomic_set(, 1); - atomic_set(_data.working, 0); - atomic_set(_data.done, 0); - atomic_set(_data.error, 0); - suspend_data.complete = _work; - reinit_completion(_work); - return 0; -} - /** * store_hibernate - Initiate partition hibernation * @dev: subsys root device @@ -152,7 +132,6 @@ static struct bus_type suspend_subsys = { static const struct platform_suspend_ops pseries_suspend_ops = { .valid = suspend_valid_only_mem, - .prepare_late = pseries_prepare_late, .enter = pseries_suspend_enter, }; @@ -195,10 +174,6 @@ static int __init pseries_suspend_init(void) if (!firmware_has_feature(FW_FEATURE_LPAR)) return 0; - suspend_data.token = rtas_token("ibm,suspend-me"); - if (suspend_data.token == RTAS_UNKNOWN_SERVICE) - return 0; - if ((rc = pseries_suspend_sysfs_register(_dev))) return rc; -- 2.28.0
[PATCH v2 14/28] powerpc/pseries/mobility: retry partition suspend after error
This is a mitigation for the relatively rare occurrence where a virtual IOA can be in a transient state that prevents the suspend/migration from succeeding, resulting in an error from ibm,suspend-me. If the join/suspend sequence returns an error, it is acceptable to retry as long as the VASI suspend session state is still "Suspending" (i.e. the platform is still waiting for the OS to suspend). Retry a few times on suspend failure while this condition holds, progressively increasing the delay between attempts. We don't want to retry indefinitey because firmware emits an error log event on each unsuccessful attempt. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 59 ++- 1 file changed, 57 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index f234a7ed87aa..fe7e35cdc9d5 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -542,16 +542,71 @@ static void pseries_cancel_migration(u64 handle, int err) pr_err("H_VASI_SIGNAL error: %ld\n", hvrc); } +static int pseries_suspend(u64 handle) +{ + const unsigned int max_attempts = 5; + unsigned int retry_interval_ms = 1; + unsigned int attempt = 1; + int ret; + + while (true) { + atomic_t counter = ATOMIC_INIT(0); + unsigned long vasi_state; + int vasi_err; + + ret = stop_machine(do_join, , cpu_online_mask); + if (ret == 0) + break; + /* +* Encountered an error. If the VASI stream is still +* in Suspending state, it's likely a transient +* condition related to some device in the partition +* and we can retry in the hope that the cause has +* cleared after some delay. +* +* A better design would allow drivers etc to prepare +* for the suspend and avoid conditions which prevent +* the suspend from succeeding. For now, we have this +* mitigation. +*/ + pr_notice("Partition suspend attempt %u of %u error: %d\n", + attempt, max_attempts, ret); + + if (attempt == max_attempts) + break; + + vasi_err = poll_vasi_state(handle, _state); + if (vasi_err == 0) { + if (vasi_state != H_VASI_SUSPENDING) { + pr_notice("VASI state %lu after failed suspend\n", + vasi_state); + break; + } + } else if (vasi_err != -EOPNOTSUPP) { + pr_err("VASI state poll error: %d", vasi_err); + break; + } + + pr_notice("Will retry partition suspend after %u ms\n", + retry_interval_ms); + + msleep(retry_interval_ms); + retry_interval_ms *= 10; + attempt++; + } + + return ret; +} + static int pseries_migrate_partition(u64 handle) { - atomic_t counter = ATOMIC_INIT(0); int ret; ret = wait_for_vasi_session_suspending(handle); if (ret) return ret; - ret = stop_machine(do_join, , cpu_online_mask); + ret = pseries_suspend(handle); if (ret == 0) post_mobility_fixup(); else -- 2.28.0
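For reference, the backoff this gives: failed attempts 1 through 4 are each followed by a sleep of 1, 10, 100 and 1000 ms, so a persistent failure costs five ibm,suspend-me attempts plus roughly 1.1 s of sleeping. A trivial user-space check of that arithmetic:

#include <stdio.h>

int main(void)
{
        unsigned int interval_ms = 1, attempt, total_ms = 0;

        for (attempt = 1; attempt < 5; attempt++) {     /* no sleep after the 5th attempt */
                total_ms += interval_ms;
                interval_ms *= 10;
        }
        printf("worst-case added sleep: %u ms\n", total_ms);    /* prints 1111 */
        return 0;
}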
[PATCH v2 28/28] powerpc/pseries/mobility: refactor node lookup during DT update
In pseries_devicetree_update(), with each call to ibm,update-nodes the partition firmware communicates the node to be deleted or updated by placing its phandle in the work buffer. Each of delete_dt_node(), update_dt_node(), and add_dt_node() have duplicate lookups using the phandle value and corresponding refcount management. Move the lookup and of_node_put() into pseries_devicetree_update(), and emit a warning on any failed lookups. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 49 --- 1 file changed, 17 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index e670180f311d..ea4d6a660e0d 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -61,18 +61,10 @@ static int mobility_rtas_call(int token, char *buf, s32 scope) return rc; } -static int delete_dt_node(__be32 phandle) +static int delete_dt_node(struct device_node *dn) { - struct device_node *dn; - - dn = of_find_node_by_phandle(be32_to_cpu(phandle)); - if (!dn) - return -ENOENT; - pr_debug("removing node %pOFfp\n", dn); - dlpar_detach_node(dn); - of_node_put(dn); return 0; } @@ -137,10 +129,9 @@ static int update_dt_property(struct device_node *dn, struct property **prop, return 0; } -static int update_dt_node(__be32 phandle, s32 scope) +static int update_dt_node(struct device_node *dn, s32 scope) { struct update_props_workarea *upwa; - struct device_node *dn; struct property *prop = NULL; int i, rc, rtas_rc; char *prop_data; @@ -157,14 +148,8 @@ static int update_dt_node(__be32 phandle, s32 scope) if (!rtas_buf) return -ENOMEM; - dn = of_find_node_by_phandle(be32_to_cpu(phandle)); - if (!dn) { - kfree(rtas_buf); - return -ENOENT; - } - upwa = (struct update_props_workarea *)_buf[0]; - upwa->phandle = phandle; + upwa->phandle = cpu_to_be32(dn->phandle); do { rtas_rc = mobility_rtas_call(update_properties_token, rtas_buf, @@ -224,26 +209,18 @@ static int update_dt_node(__be32 phandle, s32 scope) cond_resched(); } while (rtas_rc == 1); - of_node_put(dn); kfree(rtas_buf); return 0; } -static int add_dt_node(__be32 parent_phandle, __be32 drc_index) +static int add_dt_node(struct device_node *parent_dn, __be32 drc_index) { struct device_node *dn; - struct device_node *parent_dn; int rc; - parent_dn = of_find_node_by_phandle(be32_to_cpu(parent_phandle)); - if (!parent_dn) - return -ENOENT; - dn = dlpar_configure_connector(drc_index, parent_dn); - if (!dn) { - of_node_put(parent_dn); + if (!dn) return -ENOENT; - } rc = dlpar_attach_node(dn, parent_dn); if (rc) @@ -251,7 +228,6 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index) pr_debug("added node %pOFfp\n", dn); - of_node_put(parent_dn); return rc; } @@ -284,22 +260,31 @@ int pseries_devicetree_update(s32 scope) data++; for (i = 0; i < node_count; i++) { + struct device_node *np; __be32 phandle = *data++; __be32 drc_index; + np = of_find_node_by_phandle(be32_to_cpu(phandle)); + if (!np) { + pr_warn("Failed lookup: phandle 0x%x for action 0x%x\n", + be32_to_cpu(phandle), action); + continue; + } + switch (action) { case DELETE_DT_NODE: - delete_dt_node(phandle); + delete_dt_node(np); break; case UPDATE_DT_NODE: - update_dt_node(phandle, scope); + update_dt_node(np, scope); break; case ADD_DT_NODE: drc_index = *data++; - add_dt_node(phandle, drc_index); + add_dt_node(np, drc_index); break; } + of_node_put(np); cond_resched(); } } -- 2.28.0
[PATCH v2 00/28] partition suspend updates
This series aims to improve the pseries-specific partition migration and hibernation implementation, part of which has been living in kernel/rtas.c. Most of that code is eliminated or moved to platforms/pseries, and the following major functional changes are made: - Use stop_machine() instead of on_each_cpu() to avoid deadlock in the join/suspend sequence. - Retry the join/suspend sequence on errors that are likely to be transient. This is a mitigation for the fact that drivers currently have no way to prepare for an impending partition suspension, sometimes resulting in a virtual adapter being in a state which causes the platform to fail the suspend call. - Request cancellation of the migration via H_VASI_SIGNAL if Linux is going to error out of the suspend attempt. This allows the management console and other entities to promptly clean up their operations instead of relying on long timeouts to fail the migration. - Little-endian users of ibm,suspend-me, ibm,update-nodes and ibm,update-properties via sys_rtas are blocked when CONFIG_PPC_RTAS_FILTERS is enabled. - Legacy user space code (drmgr) historically has driven the migration process by using sys_rtas to separately call ibm,suspend-me, ibm,activate-firmware, and ibm,update-nodes/properties, in that order. With these changes, when sys_rtas() dispatches ibm,suspend-me, the kernel performs the device tree update and firmware activation before returning. This is more reliable, and drmgr does not seem bothered by it. - If the H_VASI_STATE hcall is absent, the implementation proceeds with the suspend instead of erroring out. This allows us to exercise these code paths in QEMU. Changes since v1: - Drop "powerpc/rtas: move rtas_call_reentrant() out of pseries guards". rtas_call_reentrant() actually is pseries-specific and this broke builds without CONFIG_PPC_PSERIES set. - Simplify polling logic in wait_for_vasi_session_suspending(). ("powerpc/pseries/mobility: extract VASI session polling logic") - Use direct return instead of goto in pseries_migrate_partition(). ("powerpc/pseries/mobility: use stop_machine for join/suspend") - Change dispatch of ibm,suspend-me in rtas syscall path to use conventional config symbol guards instead of a weak function. ("powerpc/rtas: dispatch partition migration requests to pseries") - Fix refcount imbalance in add_dt_node() error path. 
("powerpc/pseries/mobility: refactor node lookup during DT update") Nathan Lynch (28): powerpc/rtas: prevent suspend-related sys_rtas use on LE powerpc/rtas: complete ibm,suspend-me status codes powerpc/rtas: rtas_ibm_suspend_me -> rtas_ibm_suspend_me_unsafe powerpc/rtas: add rtas_ibm_suspend_me() powerpc/rtas: add rtas_activate_firmware() powerpc/hvcall: add token and codes for H_VASI_SIGNAL powerpc/pseries/mobility: don't error on absence of ibm,update-nodes powerpc/pseries/mobility: add missing break to default case powerpc/pseries/mobility: error message improvements powerpc/pseries/mobility: use rtas_activate_firmware() on resume powerpc/pseries/mobility: extract VASI session polling logic powerpc/pseries/mobility: use stop_machine for join/suspend powerpc/pseries/mobility: signal suspend cancellation to platform powerpc/pseries/mobility: retry partition suspend after error powerpc/rtas: dispatch partition migration requests to pseries powerpc/rtas: remove rtas_ibm_suspend_me_unsafe() powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops powerpc/pseries/hibernation: pass stream id via function arguments powerpc/pseries/hibernation: remove pseries_suspend_cpu() powerpc/machdep: remove suspend_disable_cpu() powerpc/rtas: remove rtas_suspend_cpu() powerpc/pseries/hibernation: switch to rtas_ibm_suspend_me() powerpc/rtas: remove unused rtas_suspend_last_cpu() powerpc/pseries/hibernation: remove redundant cacheinfo update powerpc/pseries/hibernation: perform post-suspend fixups later powerpc/pseries/hibernation: remove prepare_late() callback powerpc/rtas: remove unused rtas_suspend_me_data powerpc/pseries/mobility: refactor node lookup during DT update arch/powerpc/include/asm/hvcall.h | 9 + arch/powerpc/include/asm/machdep.h| 1 - arch/powerpc/include/asm/rtas-types.h | 8 - arch/powerpc/include/asm/rtas.h | 17 +- arch/powerpc/kernel/rtas.c| 243 ++- arch/powerpc/platforms/pseries/mobility.c | 358 ++ arch/powerpc/platforms/pseries/suspend.c | 79 + 7 files changed, 415 insertions(+), 300 deletions(-) -- 2.28.0
[PATCH v2 07/28] powerpc/pseries/mobility: don't error on absence of ibm,update-nodes
Treat the absence of the ibm,update-nodes function as benign instead of reporting an error. If the platform does not provide that facility, it's not a problem for Linux. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 6ff642e84c6a..e66359b00297 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -261,7 +261,7 @@ int pseries_devicetree_update(s32 scope) update_nodes_token = rtas_token("ibm,update-nodes"); if (update_nodes_token == RTAS_UNKNOWN_SERVICE) - return -EINVAL; + return 0; rtas_buf = kzalloc(RTAS_DATA_BUF_SIZE, GFP_KERNEL); if (!rtas_buf) -- 2.28.0
[PATCH v2 10/28] powerpc/pseries/mobility: use rtas_activate_firmware() on resume
It's incorrect to abort post-suspend processing if ibm,activate-firmware isn't available. Use rtas_activate_firmware(), which logs this condition appropriately and allows us to proceed. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 31d81b7da961..01ac7c03558e 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -312,21 +312,8 @@ int pseries_devicetree_update(s32 scope) void post_mobility_fixup(void) { int rc; - int activate_fw_token; - activate_fw_token = rtas_token("ibm,activate-firmware"); - if (activate_fw_token == RTAS_UNKNOWN_SERVICE) { - printk(KERN_ERR "Could not make post-mobility " - "activate-fw call.\n"); - return; - } - - do { - rc = rtas_call(activate_fw_token, 0, 1, NULL); - } while (rtas_busy_delay(rc)); - - if (rc) - printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc); + rtas_activate_firmware(); /* * We don't want CPUs to go online/offline while the device -- 2.28.0
[PATCH v2 22/28] powerpc/pseries/hibernation: switch to rtas_ibm_suspend_me()
rtas_suspend_last_cpu() and related code perform a lot of work that isn't relevant to the hibernation workflow. All other CPUs are offline when called so there is no need to place them in H_JOIN or prod them on resume, nor is there need for retries or operations on shared state. Call the rtas_ibm_suspend_me() wrapper function directly from pseries_suspend_enter() instead of using rtas_suspend_last_cpu(). Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 3315d698d5ab..703728cb95ec 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -76,11 +76,7 @@ static void pseries_suspend_enable_irqs(void) **/ static int pseries_suspend_enter(suspend_state_t state) { - int rc = rtas_suspend_last_cpu(_data); - - atomic_set(, 0); - atomic_set(_data.done, 1); - return rc; + return rtas_ibm_suspend_me(NULL); } /** -- 2.28.0
[PATCH v2 08/28] powerpc/pseries/mobility: add missing break to default case
update_dt_node() has a switch statement where the default case lacks a break statement. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index e66359b00297..527a64e2d89f 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -213,6 +213,7 @@ static int update_dt_node(__be32 phandle, s32 scope) } prop_data += vd; + break; } cond_resched(); -- 2.28.0
[PATCH v2 13/28] powerpc/pseries/mobility: signal suspend cancellation to platform
If we're returning an error to user space, use H_VASI_SIGNAL to send a cancellation request to the platform. This isn't strictly required but it communicates that Linux will not attempt to complete the suspend, which allows the various entities involved to promptly end the operation in progress. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 5a3951626a96..f234a7ed87aa 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -513,6 +513,35 @@ static int do_join(void *arg) return ret; } +/* + * Abort reason code byte 0. We use only the 'Migrating partition' value. + */ +enum vasi_aborting_entity { + ORCHESTRATOR= 1, + VSP_SOURCE = 2, + PARTITION_FIRMWARE = 3, + PLATFORM_FIRMWARE = 4, + VSP_TARGET = 5, + MIGRATING_PARTITION = 6, +}; + +static void pseries_cancel_migration(u64 handle, int err) +{ + u32 reason_code; + u32 detail; + u8 entity; + long hvrc; + + entity = MIGRATING_PARTITION; + detail = abs(err) & 0xff; + reason_code = (entity << 24) | detail; + + hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle, + H_VASI_SIGNAL_CANCEL, reason_code); + if (hvrc) + pr_err("H_VASI_SIGNAL error: %ld\n", hvrc); +} + static int pseries_migrate_partition(u64 handle) { atomic_t counter = ATOMIC_INIT(0); @@ -525,6 +554,8 @@ static int pseries_migrate_partition(u64 handle) ret = stop_machine(do_join, , cpu_online_mask); if (ret == 0) post_mobility_fixup(); + else + pseries_cancel_migration(handle, ret); return ret; } -- 2.28.0
[PATCH v2 21/28] powerpc/rtas: remove rtas_suspend_cpu()
rtas_suspend_cpu() no longer has users; remove it and __rtas_suspend_cpu() which now becomes unused as well. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - arch/powerpc/kernel/rtas.c | 52 - 2 files changed, 53 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 9a6107ffe378..97ccb40fb09f 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -256,7 +256,6 @@ extern bool rtas_indicator_present(int token, int *maxindex); extern int rtas_set_indicator(int indicator, int index, int new_value); extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); -extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me(int *fw_status); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 7e6024f570da..aedd46967b99 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -873,58 +873,6 @@ int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data) return __rtas_suspend_last_cpu(data, 0); } -static int __rtas_suspend_cpu(struct rtas_suspend_me_data *data, int wake_when_done) -{ - long rc = H_SUCCESS; - unsigned long msr_save; - int cpu; - - atomic_inc(>working); - - /* really need to ensure MSR.EE is off for H_JOIN */ - msr_save = mfmsr(); - mtmsr(msr_save & ~(MSR_EE)); - - while (rc == H_SUCCESS && !atomic_read(>done) && !atomic_read(>error)) - rc = plpar_hcall_norets(H_JOIN); - - mtmsr(msr_save); - - if (rc == H_SUCCESS) { - /* This cpu was prodded and the suspend is complete. */ - goto out; - } else if (rc == H_CONTINUE) { - /* All other cpus are in H_JOIN, this cpu does -* the suspend. -*/ - return __rtas_suspend_last_cpu(data, wake_when_done); - } else { - printk(KERN_ERR "H_JOIN on cpu %i failed with rc = %ld\n", - smp_processor_id(), rc); - atomic_set(>error, rc); - } - - if (wake_when_done) { - atomic_set(>done, 1); - - /* This cpu did the suspend or got an error; in either case, -* we need to prod all other other cpus out of join state. -* Extra prods are harmless. -*/ - for_each_online_cpu(cpu) - plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); - } -out: - if (atomic_dec_return(>working) == 0) - complete(data->complete); - return rc; -} - -int rtas_suspend_cpu(struct rtas_suspend_me_data *data) -{ - return __rtas_suspend_cpu(data, 0); -} - /** * rtas_call_reentrant() - Used for reentrant rtas calls * @token: Token for desired reentrant RTAS call -- 2.28.0
[PATCH v2 25/28] powerpc/pseries/hibernation: perform post-suspend fixups later
The pseries hibernate code calls post_mobility_fixup() which is sort of a dumping ground of fixups that need to run after resuming from suspend regardless of whether suspend was a hibernation or a migration. Calling post_mobility_fixup() from pseries_suspend_enable_irqs() runs this code early in resume with devices suspended and only one CPU up, while the much more commonly used migration case runs these fixups in a more typical process context. Call post_mobility_fixup() after the suspend core returns a success status to the hibernate sysfs store method and remove pseries_suspend_enable_irqs(). Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 21 - 1 file changed, 4 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 6a94cc0deb88..589a91730db8 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -50,21 +50,6 @@ static int pseries_suspend_begin(u64 stream_id) return 0; } -/** - * pseries_suspend_enable_irqs - * - * Post suspend configuration updates - * - **/ -static void pseries_suspend_enable_irqs(void) -{ - /* -* Update configuration which can be modified based on device tree -* changes during resume. -*/ - post_mobility_fixup(); -} - /** * pseries_suspend_enter - Final phase of hibernation * @@ -127,8 +112,11 @@ static ssize_t store_hibernate(struct device *dev, if (!rc) rc = pm_suspend(PM_SUSPEND_MEM); - if (!rc) + if (!rc) { rc = count; + post_mobility_fixup(); + } + return rc; } @@ -214,7 +202,6 @@ static int __init pseries_suspend_init(void) if ((rc = pseries_suspend_sysfs_register(_dev))) return rc; - ppc_md.suspend_enable_irqs = pseries_suspend_enable_irqs; suspend_set_ops(_suspend_ops); return 0; } -- 2.28.0
[PATCH v2 03/28] powerpc/rtas: rtas_ibm_suspend_me -> rtas_ibm_suspend_me_unsafe
The pseries partition suspend sequence requires that all active CPUs call H_JOIN, which suspends all but one of them with interrupts disabled. The "chosen" CPU is then to call ibm,suspend-me to complete the suspend. Upon returning from ibm,suspend-me, the chosen CPU is to use H_PROD to wake the joined CPUs. Using on_each_cpu() for this, as rtas_ibm_suspend_me() does to implement partition migration, is susceptible to deadlock with other users of on_each_cpu() and with users of stop_machine APIs. The callback passed to on_each_cpu() is not allowed to synchronize with other CPUs in the way it is used here. Complicating the fix is the fact that rtas_ibm_suspend_me() also occupies the function name that should be used to provide a more conventional wrapper for ibm,suspend-me. Rename rtas_ibm_suspend_me() to rtas_ibm_suspend_me_unsafe() to free up the name and indicate that it should not gain users. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 2 +- arch/powerpc/kernel/rtas.c| 6 +++--- arch/powerpc/platforms/pseries/mobility.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index f060181a0d32..8436ed01567b 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -257,7 +257,7 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -extern int rtas_ibm_suspend_me(u64 handle); +int rtas_ibm_suspend_me_unsafe(u64 handle); struct rtc_time; extern time64_t rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 4ed64aba37d6..0a8e5dc2c108 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -843,7 +843,7 @@ static void rtas_percpu_suspend_me(void *info) __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); } -int rtas_ibm_suspend_me(u64 handle) +int rtas_ibm_suspend_me_unsafe(u64 handle) { long state; long rc; @@ -949,7 +949,7 @@ int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...) } #else /* CONFIG_PPC_PSERIES */ -int rtas_ibm_suspend_me(u64 handle) +int rtas_ibm_suspend_me_unsafe(u64 handle) { return -ENOSYS; } @@ -1185,7 +1185,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me(handle); + rc = rtas_ibm_suspend_me_unsafe(handle); if (rc == -EAGAIN) args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); else if (rc == -EIO) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 2f73cb5bf12d..6ff642e84c6a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -370,7 +370,7 @@ static ssize_t migration_store(struct class *class, return rc; do { - rc = rtas_ibm_suspend_me(streamid); + rc = rtas_ibm_suspend_me_unsafe(streamid); if (rc == -EAGAIN) ssleep(1); } while (rc == -EAGAIN); -- 2.28.0
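To make the sequence described above concrete, here is a heavily simplified per-CPU sketch of the join/suspend ordering. It is not the implementation added later in this series (which coordinates through stop_machine() and an atomic counter), and it omits error handling and the H_SUCCESS bookkeeping; it only illustrates the H_JOIN / ibm,suspend-me / H_PROD ordering.

/* Simplified sketch only; see caveats above. */
static int join_and_maybe_suspend(void)
{
	long hvrc;
	int cpu, status = 0;

	hard_irq_disable();		/* H_JOIN requires MSR[EE] to be off */
	hvrc = plpar_hcall_norets(H_JOIN);

	if (hvrc == H_CONTINUE) {
		/* All other CPUs are offline or joined: this CPU suspends. */
		status = rtas_call(rtas_token("ibm,suspend-me"), 0, 1, NULL);

		/* Back from the suspend: wake the joined CPUs. */
		for_each_online_cpu(cpu)
			plpar_hcall_norets(H_PROD,
					   get_hard_smp_processor_id(cpu));
	}
	/* hvrc == H_SUCCESS means this CPU was prodded after the suspend completed. */
	return status;
}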
[PATCH v2 02/28] powerpc/rtas: complete ibm,suspend-me status codes
We don't completely account for the possible return codes for ibm,suspend-me. Add definitions for these. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 55f9a154c95d..f060181a0d32 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -23,11 +23,16 @@ #define RTAS_RMOBUF_MAX (64 * 1024) /* RTAS return status codes */ -#define RTAS_NOT_SUSPENDABLE -9004 #define RTAS_BUSY -2/* RTAS Busy */ #define RTAS_EXTENDED_DELAY_MIN9900 #define RTAS_EXTENDED_DELAY_MAX9905 +/* statuses specific to ibm,suspend-me */ +#define RTAS_SUSPEND_ABORTED 9000 /* Suspension aborted */ +#define RTAS_NOT_SUSPENDABLE-9004 /* Partition not suspendable */ +#define RTAS_THREADS_ACTIVE -9005 /* Multiple processor threads active */ +#define RTAS_OUTSTANDING_COPROC -9006 /* Outstanding coprocessor operations */ + /* * In general to call RTAS use rtas_token("string") to lookup * an RTAS token for the given string (e.g. "event-scan"). -- 2.28.0
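As a rough illustration of how a caller might consume these definitions, here is one plausible status-to-errno translation. This is a sketch only; the mapping actually used elsewhere in the series may differ.

/* Sketch: one plausible translation of ibm,suspend-me statuses to errnos. */
static int suspend_status_to_errno(int status)
{
	switch (status) {
	case 0:
		return 0;
	case RTAS_SUSPEND_ABORTED:
		return -ECANCELED;	/* operator aborted the suspend */
	case RTAS_NOT_SUSPENDABLE:
	case RTAS_THREADS_ACTIVE:
	case RTAS_OUTSTANDING_COPROC:
		return -EAGAIN;		/* conditions that may clear on retry */
	default:
		return -EIO;
	}
}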
[PATCH v2 16/28] powerpc/rtas: remove rtas_ibm_suspend_me_unsafe()
rtas_ibm_suspend_me_unsafe() is now unused; remove it and rtas_percpu_suspend_me() which becomes unused as a result. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - arch/powerpc/kernel/rtas.c | 67 + 2 files changed, 1 insertion(+), 67 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 3b52d8574fcc..9a6107ffe378 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -258,7 +258,6 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -int rtas_ibm_suspend_me_unsafe(u64 handle); int rtas_ibm_suspend_me(int *fw_status); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index d4b048571728..7e6024f570da 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -925,66 +925,6 @@ int rtas_suspend_cpu(struct rtas_suspend_me_data *data) return __rtas_suspend_cpu(data, 0); } -static void rtas_percpu_suspend_me(void *info) -{ - __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); -} - -int rtas_ibm_suspend_me_unsafe(u64 handle) -{ - long state; - long rc; - unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; - struct rtas_suspend_me_data data; - DECLARE_COMPLETION_ONSTACK(done); - - if (!rtas_service_present("ibm,suspend-me")) - return -ENOSYS; - - /* Make sure the state is valid */ - rc = plpar_hcall(H_VASI_STATE, retbuf, handle); - - state = retbuf[0]; - - if (rc) { - printk(KERN_ERR "rtas_ibm_suspend_me: vasi_state returned %ld\n",rc); - return rc; - } else if (state == H_VASI_ENABLED) { - return -EAGAIN; - } else if (state != H_VASI_SUSPENDING) { - printk(KERN_ERR "rtas_ibm_suspend_me: vasi_state returned state %ld\n", - state); - return -EIO; - } - - atomic_set(, 0); - atomic_set(, 0); - atomic_set(, 0); - data.token = rtas_token("ibm,suspend-me"); - data.complete = - - lock_device_hotplug(); - - cpu_hotplug_disable(); - - /* Call function on all CPUs. One of us will make the -* rtas call -*/ - on_each_cpu(rtas_percpu_suspend_me, , 0); - - wait_for_completion(); - - if (atomic_read() != 0) - printk(KERN_ERR "Error doing global join\n"); - - - cpu_hotplug_enable(); - - unlock_device_hotplug(); - - return atomic_read(); -} - /** * rtas_call_reentrant() - Used for reentrant rtas calls * @token: Token for desired reentrant RTAS call @@ -1035,12 +975,7 @@ int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...) return ret; } -#else /* CONFIG_PPC_PSERIES */ -int rtas_ibm_suspend_me_unsafe(u64 handle) -{ - return -ENOSYS; -} -#endif +#endif /* CONFIG_PPC_PSERIES */ /** * Find a specific pseries error log in an RTAS extended event log. -- 2.28.0
[PATCH v2 19/28] powerpc/pseries/hibernation: remove pseries_suspend_cpu()
Since commit 48f6e7f6d948 ("powerpc/pseries: remove cede offline state for CPUs"), ppc_md.suspend_disable_cpu() is no longer used and all CPUs (save one) are placed into true offline state as opposed to H_JOIN. So pseries_suspend_cpu() is effectively unused; remove it. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 15 --- 1 file changed, 15 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 232621f33510..3315d698d5ab 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -48,20 +48,6 @@ static int pseries_suspend_begin(u64 stream_id) vasi_state); return -EIO; } - - return 0; -} - -/** - * pseries_suspend_cpu - Suspend a single CPU - * - * Makes the H_JOIN call to suspend the CPU - * - **/ -static int pseries_suspend_cpu(void) -{ - if (atomic_read()) - return rtas_suspend_cpu(_data); return 0; } @@ -235,7 +221,6 @@ static int __init pseries_suspend_init(void) if ((rc = pseries_suspend_sysfs_register(_dev))) return rc; - ppc_md.suspend_disable_cpu = pseries_suspend_cpu; ppc_md.suspend_enable_irqs = pseries_suspend_enable_irqs; suspend_set_ops(_suspend_ops); return 0; -- 2.28.0
Re: [PATCH 12/29] powerpc/pseries/mobility: extract VASI session polling logic
Michael Ellerman writes: > Nathan Lynch writes: >> The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to >> the caller until the specified VASI suspend session state makes the >> transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest >> of separating concerns to prepare for a new implementation of the >> join/suspend sequence, extract VASI session polling logic into a >> couple of local functions. Waiting for the session state to reach >> H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures >> that we will never get an EAGAIN result necessitating a retry. No >> user-visible change in behavior is intended. >> >> Signed-off-by: Nathan Lynch >> --- >> arch/powerpc/platforms/pseries/mobility.c | 76 +-- >> 1 file changed, 71 insertions(+), 5 deletions(-) >> >> diff --git a/arch/powerpc/platforms/pseries/mobility.c >> b/arch/powerpc/platforms/pseries/mobility.c >> index dc6abf164db7..1b8ae221b98a 100644 >> --- a/arch/powerpc/platforms/pseries/mobility.c >> +++ b/arch/powerpc/platforms/pseries/mobility.c >> @@ -345,6 +345,73 @@ void post_mobility_fixup(void) > ... > >> + >> +static int wait_for_vasi_session_suspending(u64 handle) >> +{ >> +unsigned long state; >> +bool keep_polling; >> +int ret; >> + >> +/* >> + * Wait for transition from H_VASI_ENABLED to >> + * H_VASI_SUSPENDING. Treat anything else as an error. >> + */ >> +do { >> +keep_polling = false; >> +ret = poll_vasi_state(handle, &state); >> +if (ret != 0) >> +break; >> + >> +switch (state) { >> +case H_VASI_SUSPENDING: >> +break; >> +case H_VASI_ENABLED: >> +keep_polling = true; >> +ssleep(1); >> +break; >> +default: >> +pr_err("unexpected H_VASI_STATE result %lu\n", state); >> +ret = -EIO; >> +break; >> +} >> +} while (keep_polling); > > This seems like it could be simpler? > > eg: > > while (true) { > ret = poll_vasi_state(handle, &state); > > if (ret != 0 || state == H_VASI_SUSPENDING) > break; > else if (state == H_VASI_ENABLED) > ssleep(1); > else { > pr_err("unexpected H_VASI_STATE result %lu\n", state); > ret = -EIO; > break; > } > } > > > Or did I miss something? No I think you're right, thanks.
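Assuming the suggested restructuring is adopted essentially as-is, the helper would end up looking something like the following sketch (the -EOPNOTSUPP handling is carried over from the original patch):

/* Sketch of wait_for_vasi_session_suspending() with the suggested loop shape. */
static int wait_for_vasi_session_suspending(u64 handle)
{
	unsigned long state;
	int ret;

	/*
	 * Wait for the transition from H_VASI_ENABLED to H_VASI_SUSPENDING;
	 * treat anything else as an error.
	 */
	while (true) {
		ret = poll_vasi_state(handle, &state);

		if (ret != 0 || state == H_VASI_SUSPENDING)
			break;

		if (state == H_VASI_ENABLED) {
			ssleep(1);
		} else {
			pr_err("unexpected H_VASI_STATE result %lu\n", state);
			ret = -EIO;
			break;
		}
	}

	/*
	 * Proceed even if H_VASI_STATE is unavailable. If H_JOIN or
	 * ibm,suspend-me are also unimplemented, we'll recover then.
	 */
	if (ret == -EOPNOTSUPP)
		ret = 0;

	return ret;
}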
Re: [PATCH 03/29] powerpc/rtas: complete ibm,suspend-me status codes
Michael Ellerman writes: > Nathan Lynch writes: >> We don't completely account for the possible return codes for >> ibm,suspend-me. Add definitions for these. >> >> Signed-off-by: Nathan Lynch >> --- >> arch/powerpc/include/asm/rtas.h | 7 ++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/include/asm/rtas.h >> b/arch/powerpc/include/asm/rtas.h >> index 55f9a154c95d..f060181a0d32 100644 >> --- a/arch/powerpc/include/asm/rtas.h >> +++ b/arch/powerpc/include/asm/rtas.h >> @@ -23,11 +23,16 @@ >> #define RTAS_RMOBUF_MAX (64 * 1024) >> >> /* RTAS return status codes */ >> -#define RTAS_NOT_SUSPENDABLE-9004 >> #define RTAS_BUSY -2/* RTAS Busy */ >> #define RTAS_EXTENDED_DELAY_MIN 9900 >> #define RTAS_EXTENDED_DELAY_MAX 9905 >> >> +/* statuses specific to ibm,suspend-me */ >> +#define RTAS_SUSPEND_ABORTED 9000 /* Suspension aborted */ > > This made me ... pause. > > But it really is positive 9000, I checked PAPR. Yes, 9000 falls within the "vendor-specific success codes" range in the RTAS status value table. I guess aborting a suspend is operator-initiated and it's not considered an error condition from that point of view.
Re: [PATCH 13/29] powerpc/pseries/mobility: use stop_machine for join/suspend
Hi Michael, Michael Ellerman writes: > Nathan Lynch writes: >> The partition suspend sequence as specified in the platform >> architecture requires that all active processor threads call >> H_JOIN, which: > ... >> diff --git a/arch/powerpc/platforms/pseries/mobility.c >> b/arch/powerpc/platforms/pseries/mobility.c >> index 1b8ae221b98a..44ca7d4e143d 100644 >> --- a/arch/powerpc/platforms/pseries/mobility.c >> +++ b/arch/powerpc/platforms/pseries/mobility.c >> @@ -412,6 +414,128 @@ static int wait_for_vasi_session_suspending(u64 handle) > ... > >> + >> +static int do_join(void *arg) >> +{ >> +atomic_t *counter = arg; >> +long hvrc; >> +int ret; >> + >> +/* Must ensure MSR.EE off for H_JOIN. */ >> +hard_irq_disable(); > > Didn't stop_machine() already do that for us? > > In the state machine in multi_cpu_stop(). Yes, but I didn't want to rely on something that seems like an implementation detail of stop_machine(). I assumed it's benign and in keeping with hard_irq_disable()'s intended semantics to make multiple calls to it within a critical section. I don't feel strongly about this though. If I remove this hard_irq_disable() I'll modify the comment to indicate that this function relies on stop_machine() to satisfy this condition for H_JOIN. >> +hvrc = plpar_hcall_norets(H_JOIN); >> + >> +switch (hvrc) { >> +case H_CONTINUE: >> +/* >> + * All other CPUs are offline or in H_JOIN. This CPU >> + * attempts the suspend. >> + */ >> +ret = do_suspend(); >> +break; >> +case H_SUCCESS: >> +/* >> + * The suspend is complete and this cpu has received a >> + * prod. >> + */ >> +ret = 0; >> +break; >> +case H_BAD_MODE: >> +case H_HARDWARE: >> +default: >> +ret = -EIO; >> +pr_err_ratelimited("H_JOIN error %ld on CPU %i\n", >> + hvrc, smp_processor_id()); >> +break; >> +} >> + >> +if (atomic_inc_return(counter) == 1) { >> +pr_info("CPU %u waking all threads\n", smp_processor_id()); >> +prod_others(); >> +} > > Do we even need the counter? IIUC only one CPU receives H_CONTINUE. So > couldn't we just have that CPU do the prodding of others? CPUs may exit H_JOIN due to system reset interrupt at any time, and H_JOIN may return H_HARDWARE to a caller after other CPUs have entered the join state successfully. In these cases the counter ensures exactly one thread performs the prod sequence. > >> +/* >> + * Execution may have been suspended for several seconds, so >> + * reset the watchdog. >> + */ >> +touch_nmi_watchdog(); >> +return ret; >> +} >> + >> +static int pseries_migrate_partition(u64 handle) >> +{ >> +atomic_t counter = ATOMIC_INIT(0); >> +int ret; >> + >> +ret = wait_for_vasi_session_suspending(handle); >> +if (ret) >> +goto out; > > Direct return would be clearer IMHO. OK, I can change this.
Re: [PATCH 16/29] powerpc/rtas: dispatch partition migration requests to pseries
Michael Ellerman writes: > Nathan Lynch writes: >> sys_rtas() cannot call ibm,suspend-me directly in the same way it >> handles other inputs. Instead it must dispatch the request to code >> that can first perform the H_JOIN sequence before any call to >> ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair >> amount of platform-specific code to implement this. >> >> Since a different, more robust implementation of the suspend sequence >> is now in the pseries platform code, we want to dispatch the request >> there while minimizing additional dependence on pseries. >> >> Use a weak function that only pseries overrides. > > Over the years weak functions have caused their fair share of problems. > > There are cases where they are the cleanest option, but for intra-arch > code like this I think an ifdef is much simpler. Fair enough, I wasn't all that confident about this anyway. >> diff --git a/arch/powerpc/include/asm/rtas.h >> b/arch/powerpc/include/asm/rtas.h >> index fdefe6a974eb..be0fc2536673 100644 >> --- a/arch/powerpc/include/asm/rtas.h >> +++ b/arch/powerpc/include/asm/rtas.h >> @@ -260,6 +260,7 @@ extern int rtas_suspend_cpu(struct rtas_suspend_me_data >> *data); >> extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); >> int rtas_ibm_suspend_me_unsafe(u64 handle); >> int rtas_ibm_suspend_me(int *fw_status); >> +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); > > ie. we'd just do: > > #ifdef CONFIG_PPC_PSERIES > int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); > #else > int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) > { > return -EINVAL; > } > #endif Yep will do.
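Spelling the agreed approach out a little further: the stub for the non-pseries case would presumably be made static inline so it can live in the header without producing duplicate definitions. A sketch, assuming the function name from the patch is kept:

/* arch/powerpc/include/asm/rtas.h -- sketch of the #ifdef arrangement */
#ifdef CONFIG_PPC_PSERIES
int rtas_syscall_dispatch_ibm_suspend_me(u64 handle);
#else
static inline int rtas_syscall_dispatch_ibm_suspend_me(u64 handle)
{
	return -EINVAL;
}
#endif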
Re: [PATCH] powerpc/pseries/hotplug-cpu: Fix memleak when cpus node not exist
Zhang Xiaoxu writes: > From: zhangxiaoxu > > If the cpus nodes not exist, we lost to free the 'cpu_drcs', which > will leak memory. > > Fixes: a0ff72f9f5a7 ("powerpc/pseries/hotplug-cpu: Remove double free in > error path") > Reported-by: Hulk Robot > Signed-off-by: zhangxiaoxu > --- > arch/powerpc/platforms/pseries/hotplug-cpu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c > b/arch/powerpc/platforms/pseries/hotplug-cpu.c > index f2837e33bf5d..4bb1c9f2bb11 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c > @@ -743,6 +743,7 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add) > parent = of_find_node_by_path("/cpus"); > if (!parent) { > pr_warn("Could not find CPU root node in device tree\n"); > + kfree(cpu_drcs); > return -1; > } Thanks for finding this. a0ff72f9f5a7 ("powerpc/pseries/hotplug-cpu: Remove double free in error path") was posted in Sept 2019 but was not applied until July 2020: https://lore.kernel.org/linuxppc-dev/20190919231633.1344-1-nath...@linux.ibm.com/ Here is that change as posted; note the function context is find_dlpar_cpus_to_add(), not dlpar_cpu_add_by_count(): --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -726,7 +726,6 @@ static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add) parent = of_find_node_by_path("/cpus"); if (!parent) { pr_warn("Could not find CPU root node in device tree\n"); - kfree(cpu_drcs); return -1; } Meanwhile b015f6bc9547dbc056edde7177c7868ca8629c4c ("powerpc/pseries: Add cpu DLPAR support for drc-info property") was posted in Nov 2019 and committed a few days later: https://lore.kernel.org/linux-pci/1573449697-5448-4-git-send-email-tyr...@linux.ibm.com/ This change reorganized the same code, removing find_dlpar_cpus_to_add(), and it had the effect of fixing the same issue. However git apparently allowed the older change to still apply on top of this (changing a function different from the one in the original patch!), leading to a real bug. Your patch is correct but it should be framed as a revert of a0ff72f9f5a7 with this context in the commit message.
Re: [PATCH 29/29] powerpc/pseries/mobility: refactor node lookup during DT update
Nathan Lynch writes: > In pseries_devicetree_update(), with each call to ibm,update-nodes the > partition firmware communicates the node to be deleted or updated by > placing its phandle in the work buffer. Each of delete_dt_node(), > update_dt_node(), and add_dt_node() have duplicate lookups using the > phandle value and corresponding refcount management. ... I noticed that this introduces a reference count imbalance in an error path: > -static int add_dt_node(__be32 parent_phandle, __be32 drc_index) > +static int add_dt_node(struct device_node *parent_dn, __be32 drc_index) > { > struct device_node *dn; > - struct device_node *parent_dn; > int rc; > > - parent_dn = of_find_node_by_phandle(be32_to_cpu(parent_phandle)); > - if (!parent_dn) > - return -ENOENT; > - > dn = dlpar_configure_connector(drc_index, parent_dn); > if (!dn) { > of_node_put(parent_dn); here: ^^^ > @@ -251,7 +230,6 @@ static int add_dt_node(__be32 parent_phandle, __be32 > drc_index) > > pr_debug("added node %pOFfp\n", dn); > > - of_node_put(parent_dn); > return rc; > } The change correctly removes the of_node_put() from the happy path but the put in the error branch should have been removed too.
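For clarity, the failure branch would then be expected to look something like this fragment (sketch only; the reference on parent_dn is managed by the caller in the refactored code):

	/* Sketch: corrected failure branch in add_dt_node(). */
	dn = dlpar_configure_connector(drc_index, parent_dn);
	if (!dn) {
		/* No of_node_put(parent_dn) here: the caller owns that reference. */
		return -ENOENT;
	}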
Re: [PATCH 02/29] powerpc/rtas: prevent suspend-related sys_rtas use on LE
Andrew Donnellan writes: > On 30/10/20 12:17 pm, Nathan Lynch wrote: >> While drmgr has had work in some areas to make its RTAS syscall >> interactions endian-neutral, its code for performing partition >> migration via the syscall has never worked on LE. While it is able to >> complete ibm,suspend-me successfully, it crashes when attempting the >> subsequent ibm,update-nodes call. >> >> drmgr is the only known (or plausible) user of these ibm,suspend-me, >> ibm,update-nodes, and ibm,update-properties, so allow them only in >> big-endian configurations. > > And there's a zero chance that drmgr will ever be fixed on LE? It's always used the sysfs interface on LE, and the only way to provoke it to attempt the syscalls is by doing something like this before running the migration: # echo 0 > /tmp/fake_api_version # mount -o bind,ro /tmp/fake_api_version /sys/kernel/mobility/api_version So I'm not aware of any circumstance that would actually motivate someone to make it work on LE. I'd say it's more likely that drmgr will drop its support for using the syscall altogether.
[PATCH 10/29] powerpc/pseries/mobility: error message improvements
- Convert printk(KERN_ERR) to pr_err(). - Include errno in property update failure message. - Remove reference to "Post-mobility" from device tree update message: with pr_err() it will have a "mobility:" prefix. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index d85799b5464a..ef8f5641e700 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -208,8 +208,8 @@ static int update_dt_node(__be32 phandle, s32 scope) rc = update_dt_property(dn, &prop, prop_name, vd, prop_data); if (rc) { - printk(KERN_ERR "Could not update %s" - " property\n", prop_name); + pr_err("updating %s property failed: %d\n", + prop_name, rc); } prop_data += vd; @@ -343,8 +343,7 @@ void post_mobility_fixup(void) rc = pseries_devicetree_update(MIGRATION_SCOPE); if (rc) - printk(KERN_ERR "Post-mobility device tree update " - "failed: %d\n", rc); + pr_err("device tree update failed: %d\n", rc); cacheinfo_rebuild(); -- 2.25.4
[PATCH 11/29] powerpc/pseries/mobility: use rtas_activate_firmware() on resume
It's incorrect to abort post-suspend processing if ibm,activate-firmware isn't available. Use rtas_activate_firmware(), which logs this condition appropriately and allows us to proceed. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index ef8f5641e700..dc6abf164db7 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -312,21 +312,8 @@ int pseries_devicetree_update(s32 scope) void post_mobility_fixup(void) { int rc; - int activate_fw_token; - activate_fw_token = rtas_token("ibm,activate-firmware"); - if (activate_fw_token == RTAS_UNKNOWN_SERVICE) { - printk(KERN_ERR "Could not make post-mobility " - "activate-fw call.\n"); - return; - } - - do { - rc = rtas_call(activate_fw_token, 0, 1, NULL); - } while (rtas_busy_delay(rc)); - - if (rc) - printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc); + rtas_activate_firmware(); /* * We don't want CPUs to go online/offline while the device -- 2.25.4
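For context, a sketch of what rtas_activate_firmware() (introduced elsewhere in this series) is expected to do, based on the description above rather than a verbatim copy: tolerate an absent ibm,activate-firmware method instead of treating it as fatal, and retry on busy statuses.

/* Sketch only, assumptions noted above. */
void rtas_activate_firmware(void)
{
	int token = rtas_token("ibm,activate-firmware");
	int fwrc;

	if (token == RTAS_UNKNOWN_SERVICE) {
		pr_err("ibm,activate-firmware method unavailable\n");
		return;
	}

	do {
		fwrc = rtas_call(token, 0, 1, NULL);
	} while (rtas_busy_delay(fwrc));

	if (fwrc)
		pr_err("ibm,activate-firmware failed (%i)\n", fwrc);
}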
[PATCH 12/29] powerpc/pseries/mobility: extract VASI session polling logic
The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to the caller until the specified VASI suspend session state makes the transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest of separating concerns to prepare for a new implementation of the join/suspend sequence, extract VASI session polling logic into a couple of local functions. Waiting for the session state to reach H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures that we will never get an EAGAIN result necessitating a retry. No user-visible change in behavior is intended. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 76 +-- 1 file changed, 71 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index dc6abf164db7..1b8ae221b98a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -345,6 +345,73 @@ void post_mobility_fixup(void) return; } +static int poll_vasi_state(u64 handle, unsigned long *res) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + long hvrc; + int ret; + + hvrc = plpar_hcall(H_VASI_STATE, retbuf, handle); + switch (hvrc) { + case H_SUCCESS: + ret = 0; + *res = retbuf[0]; + break; + case H_PARAMETER: + ret = -EINVAL; + break; + case H_FUNCTION: + ret = -EOPNOTSUPP; + break; + case H_HARDWARE: + default: + pr_err("unexpected H_VASI_STATE result %ld\n", hvrc); + ret = -EIO; + break; + } + return ret; +} + +static int wait_for_vasi_session_suspending(u64 handle) +{ + unsigned long state; + bool keep_polling; + int ret; + + /* +* Wait for transition from H_VASI_ENABLED to +* H_VASI_SUSPENDING. Treat anything else as an error. +*/ + do { + keep_polling = false; + ret = poll_vasi_state(handle, &state); + if (ret != 0) + break; + + switch (state) { + case H_VASI_SUSPENDING: + break; + case H_VASI_ENABLED: + keep_polling = true; + ssleep(1); + break; + default: + pr_err("unexpected H_VASI_STATE result %lu\n", state); + ret = -EIO; + break; + } + } while (keep_polling); + + /* +* Proceed even if H_VASI_STATE is unavailable. If H_JOIN or +* ibm,suspend-me are also unimplemented, we'll recover then. +*/ + if (ret == -EOPNOTSUPP) + ret = 0; + + return ret; +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) @@ -356,12 +423,11 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; - do { - rc = rtas_ibm_suspend_me_unsafe(streamid); - if (rc == -EAGAIN) - ssleep(1); - } while (rc == -EAGAIN); + rc = wait_for_vasi_session_suspending(streamid); + if (rc) + return rc; + rc = rtas_ibm_suspend_me_unsafe(streamid); if (rc) return rc; -- 2.25.4
[PATCH 14/29] powerpc/pseries/mobility: signal suspend cancellation to platform
If we're returning an error to user space, use H_VASI_SIGNAL to send a cancellation request to the platform. This isn't strictly required but it communicates that Linux will not attempt to complete the suspend, which allows the various entities involved to promptly end the operation in progress. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 44ca7d4e143d..0f592246f345 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -520,6 +520,35 @@ static int do_join(void *arg) return ret; } +/* + * Abort reason code byte 0. We use only the 'Migrating partition' value. + */ +enum vasi_aborting_entity { + ORCHESTRATOR = 1, + VSP_SOURCE = 2, + PARTITION_FIRMWARE = 3, + PLATFORM_FIRMWARE = 4, + VSP_TARGET = 5, + MIGRATING_PARTITION = 6, +}; + +static void pseries_cancel_migration(u64 handle, int err) +{ + u32 reason_code; + u32 detail; + u8 entity; + long hvrc; + + entity = MIGRATING_PARTITION; + detail = abs(err) & 0xff; + reason_code = (entity << 24) | detail; + + hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle, + H_VASI_SIGNAL_CANCEL, reason_code); + if (hvrc) + pr_err("H_VASI_SIGNAL error: %ld\n", hvrc); +} + static int pseries_migrate_partition(u64 handle) { atomic_t counter = ATOMIC_INIT(0); @@ -532,6 +561,8 @@ static int pseries_migrate_partition(u64 handle) ret = stop_machine(do_join, &counter, cpu_online_mask); if (ret == 0) post_mobility_fixup(); + else + pseries_cancel_migration(handle, ret); out: return ret; } -- 2.25.4
[PATCH 16/29] powerpc/rtas: dispatch partition migration requests to pseries
sys_rtas() cannot call ibm,suspend-me directly in the same way it handles other inputs. Instead it must dispatch the request to code that can first perform the H_JOIN sequence before any call to ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair amount of platform-specific code to implement this. Since a different, more robust implementation of the suspend sequence is now in the pseries platform code, we want to dispatch the request there while minimizing additional dependence on pseries. Use a weak function that only pseries overrides. Note that invoking ibm,suspend-me via the RTAS syscall is all but deprecated; this change preserves ABI compatibility for old programs while providing to them the benefit of the new partition suspend implementation. This is a behavior change in that the kernel performs the device tree update and firmware activation before returning, but experimentation indicates this is tolerated fine by legacy user space. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 + arch/powerpc/kernel/rtas.c| 8 +++- arch/powerpc/platforms/pseries/mobility.c | 5 + 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index fdefe6a974eb..be0fc2536673 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -260,6 +260,7 @@ extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me_unsafe(u64 handle); int rtas_ibm_suspend_me(int *fw_status); +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); struct rtc_time; extern time64_t rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 58bbd69a233f..52fb394f15d6 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1221,6 +1221,12 @@ static bool block_rtas_call(int token, int nargs, #endif /* CONFIG_PPC_RTAS_FILTER */ +/* Only pseries should need to override this. */ +int __weak rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return -EINVAL; +} + /* We assume to be passed big endian arguments */ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) { @@ -1271,7 +1277,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me_unsafe(handle); + rc = rtas_syscall_dispatch_ibm_suspend_me(handle); if (rc == -EAGAIN) args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); else if (rc == -EIO) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index e459cdb8286f..d6417c9db201 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -622,6 +622,11 @@ static int pseries_migrate_partition(u64 handle) return ret; } +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return pseries_migrate_partition(handle); +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) -- 2.25.4