Re: [PATCH v4 1/2] powerpc/eeh: fix pseries_eeh_configure_bridge()
Sam Bobroff writes: > If a device is hot unplugged during EEH recovery, it's possible for the > RTAS call to ibm,configure-pe in pseries_eeh_configure() to return > parameter error (-3), however negative return values are not checked > for and this leads to an infinite loop. > > Fix this by correctly bailing out on negative values. Reviewed-by: Nathan Lynch Thanks.
Re: [PATCH v2 3/4] powerpc/memhotplug: Make lmb size 64bit
"Aneesh Kumar K.V" writes: > @@ -322,12 +322,16 @@ static int pseries_remove_mem_node(struct device_node > *np) > /* >* Find the base address and size of the memblock >*/ > - regs = of_get_property(np, "reg", NULL); > - if (!regs) > + prop = of_get_property(np, "reg", NULL); > + if (!prop) > return ret; > > - base = be64_to_cpu(*(unsigned long *)regs); > - lmb_size = be32_to_cpu(regs[3]); > + /* > + * "reg" property represents (addr,size) tuple. > + */ > + base = of_read_number(prop, mem_addr_cells); > + prop += mem_addr_cells; > + lmb_size = of_read_number(prop, mem_size_cells); Would of_n_size_cells() and of_n_addr_cells() work here?
Re: [PATCH v2 1/4] powerpc/drmem: Make lmb_size 64 bit
"Aneesh Kumar K.V" writes: > Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block > panic") > make sure different variables tracking lmb_size are updated to be 64 bit. > > This was found by code audit. > > Cc: sta...@vger.kernel.org > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/drmem.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/drmem.h > b/arch/powerpc/include/asm/drmem.h > index 17ccc6474ab6..d719cbac34b2 100644 > --- a/arch/powerpc/include/asm/drmem.h > +++ b/arch/powerpc/include/asm/drmem.h > @@ -21,7 +21,7 @@ struct drmem_lmb { > struct drmem_lmb_info { > struct drmem_lmb*lmbs; > int n_lmbs; > - u32 lmb_size; > + u64 lmb_size; > }; > > extern struct drmem_lmb_info *drmem_info; > @@ -67,7 +67,7 @@ struct of_drconf_cell_v2 { > #define DRCONF_MEM_RESERVED 0x0080 > #define DRCONF_MEM_HOTREMOVABLE 0x0100 > > -static inline u32 drmem_lmb_size(void) > +static inline u64 drmem_lmb_size(void) > { > return drmem_info->lmb_size; > } Looks fine. Acked-by: Nathan Lynch
Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
Scott Cheloha writes: > At memory hot-remove time we can retrieve an LMB's nid from its > corresponding memory_block. There is no need to store the nid > in multiple locations. > > Note that lmb_to_memblock() uses find_memory_block() to get the > corresponding memory_block. As find_memory_block() runs in sub-linear > time this approach is negligibly slower than what we do at present. > > In exchange for this lookup at hot-remove time we no longer need to > call memory_add_physaddr_to_nid() during drmem_init() for each LMB. > On powerpc, memory_add_physaddr_to_nid() is a linear search, so this > spares us an O(n^2) initialization during boot. > > On systems with many LMBs that initialization overhead is palpable and > disruptive. For example, on a box with 249854 LMBs we're seeing > drmem_init() take upwards of 30 seconds to complete: > > [ 53.721639] drmem: initializing drmem v2 > [ 80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! > [swapper/0:1] > [ 80.604377] Modules linked in: > [ 80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4 > [ 80.604397] NIP: c00a4980 LR: c00a4940 CTR: > > [ 80.604407] REGS: c0002dbff8493830 TRAP: 0901 Not tainted (5.6.0-rc2+) > [ 80.604412] MSR: 82009033 CR: > 44000248 XER: 000d > [ 80.604431] CFAR: c00a4a38 IRQMASK: 0 > [ 80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 > c0003cfede30 > [ 80.604431] GPR04: c0f4095a 002f > 1000 > [ 80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 > c00c0002fdfb2001 > [ 80.604431] GPR12: c0001e8ec200 > [ 80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0 > [ 80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 > [ 80.604492] Call Trace: > [ 80.604498] [c0002dbff8493ac0] [c00a4940] > hot_add_scn_to_nid+0x60/0x3e0 (unreliable) > [ 80.604509] [c0002dbff8493b20] [c0087c10] > memory_add_physaddr_to_nid+0x20/0x60 > [ 80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0 > [ 80.604530] [c0002dbff8493c10] [c0010154] > do_one_initcall+0x64/0x2c0 > [ 80.604540] [c0002dbff8493ce0] [c10c4aa0] > kernel_init_freeable+0x2d8/0x3a0 > [ 80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148 > [ 80.604560] [c0002dbff8493e20] [c000b648] > ret_from_kernel_thread+0x5c/0x74 > [ 80.604567] Instruction dump: > [ 80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 > 7d094214 > [ 80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac > e949 7fbe5040 > [ 89.047390] drmem: 249854 LMB(s) > > With a patched kernel on the same machine we're no longer seeing the > soft lockup. drmem_init() now completes in negligible time, even when > the LMB count is large. This has been a real annoyance, and it's great to see it fixed in a way that both simplifies the code and reduces data structure size. > > Signed-off-by: Scott Cheloha I think this also should have: Fixes: b2d3b5ee66f2 ("powerpc/pseries: Track LMB nid instead of using device tree") since you're essentially reverting it. The problem that commit was trying to address arose from the fact that the removal path was determining the Linux node for a block from the firmware description of the block: memory_add_physaddr_to_nid hot_add_scn_to_nid hot_add_drconf_scn_to_nid of_drconf_to_nid_single [parses the DT for the nid] However, this can become out of sync with Linux's NUMA assignment for the memory due to VM migration or other events. I'm satisfied that this change does not reintroduce that problem. The removal path still derives the node without going to the device tree, but now performs a more efficient lookup. 
Reviewed-by: Nathan Lynch
Re: [PATCH v3] pseries/drmem: don't cache node id in drmem_lmb struct
Nathan Lynch writes: > I'm satisfied that this change does not reintroduce that problem. The > removal path still derives the node without going to the device tree, > but now performs a more efficient lookup. Er, actually it's slightly less efficient in the remove path, as you note in your commit message. This doesn't alter my opinion of the change. The higher-level algorithmic inefficiencies (e.g. all the for_each_drmem_lmb loops) in this code are more of a concern.
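For context, a rough sketch of the kind of lookup being discussed follows; the helper name and details are illustrative assumptions, not the patch itself:

/* Derive the node id from the LMB's memory_block rather than a cached field. */
static int lmb_nid_from_memblock(const struct drmem_lmb *lmb)
{
	struct memory_block *mem;
	int nid;

	/* find_memory_block() takes a reference on the block's device. */
	mem = find_memory_block(__pfn_to_section(PFN_DOWN(lmb->base_addr)));
	if (!mem)
		return NUMA_NO_NODE;

	/* Node assignment as recorded by Linux, not re-read from the DT. */
	nid = mem->nid;
	put_device(&mem->dev);

	return nid;
}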
[PATCH 0/2] pseries extended cede cpu offline mode removal
I propose removing the "extended cede" offline mode for CPUs as well as the partition suspend code which accommodates it by temporarily onlining all CPUs prior to suspending the LPAR. Detailed justifications are within the individual commit messages. I'm hoping to move the pseries partition suspend implementation to the Linux suspend framework, and I expect that having only one offline mode for CPUs will make that task quite a bit less complex. Nathan Lynch (2): powerpc/pseries: remove cede offline state for CPUs powerpc/rtas: don't online CPUs for partition suspend Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/include/asm/rtas.h | 2 - arch/powerpc/kernel/rtas.c| 122 + arch/powerpc/platforms/pseries/hotplug-cpu.c | 170 ++ .../platforms/pseries/offline_states.h| 38 arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +-- arch/powerpc/platforms/pseries/suspend.c | 22 +-- 8 files changed, 18 insertions(+), 372 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h -- 2.25.4
[PATCH 2/2] powerpc/rtas: don't online CPUs for partition suspend
Partition suspension, used for hibernation and migration, requires that the OS place all but one of the LPAR's processor threads into one of two states prior to calling the ibm,suspend-me RTAS function: * the architected offline state (via RTAS stop-self); or * the H_JOIN hcall, which does not return until the partition resumes execution Using H_CEDE as the offline mode, introduced by commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), means that any threads which are offline from Linux's point of view must be moved to one of those two states before a partition suspension can proceed. This was eventually addressed in commit 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation"), which added code to temporarily bring up any offline processor threads so they can call H_JOIN. Conceptually this is fine, but the implementation has had multiple races with cpu hotplug operations initiated from user space[1][2][3], the error handling is fragile, and it generates user-visible cpu hotplug events which is a lot of noise for a platform feature that's supposed to minimize disruption to workloads. With commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") reverted, this code becomes unnecessary, so remove it. Since any offline CPUs now are truly offline from the platform's point of view, it is no longer necessary to bring up CPUs only to have them call H_JOIN and then go offline again upon resuming. Only active threads are required to call H_JOIN; stopped threads can be left alone. [1] commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") [2] commit 9fb603050ffd ("powerpc/rtas: retry when cpu offline races with suspend/migration") [3] commit dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration") Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 2 - arch/powerpc/kernel/rtas.c | 122 +-- arch/powerpc/platforms/pseries/suspend.c | 22 +--- 3 files changed, 3 insertions(+), 143 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 3c1887351c71..bd227e0eab07 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -368,8 +368,6 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -extern int rtas_online_cpus_mask(cpumask_var_t cpus); -extern int rtas_offline_cpus_mask(cpumask_var_t cpus); extern int rtas_ibm_suspend_me(u64 handle); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index c5fa251b8950..01210593d60c 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -842,96 +842,6 @@ static void rtas_percpu_suspend_me(void *info) __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); } -enum rtas_cpu_state { - DOWN, - UP, -}; - -#ifndef CONFIG_SMP -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - if (!cpumask_empty(cpus)) { - cpumask_clear(cpus); - return -EINVAL; - } else - return 0; -} -#else -/* On return cpumask will be altered to indicate CPUs changed. 
- * CPUs with states changed will be set in the mask, - * CPUs with status unchanged will be unset in the mask. */ -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - int cpu; - int cpuret = 0; - int ret = 0; - - if (cpumask_empty(cpus)) - return 0; - - for_each_cpu(cpu, cpus) { - struct device *dev = get_cpu_device(cpu); - - switch (state) { - case DOWN: - cpuret = device_offline(dev); - break; - case UP: - cpuret = device_online(dev); - break; - } - if (cpuret < 0) { - pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", - __func__, - ((state == UP) ? "up" : "down"), - cpu, cpuret); - if (!ret) - ret = cpuret; - if (state == UP) { - /* clear bits for unchanged cpus, return */ -
[PATCH 1/2] powerpc/pseries: remove cede offline state for CPUs
This effectively reverts commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), which added an offline mode for CPUs which uses the H_CEDE hcall instead of the architected stop-self RTAS function in order to facilitate "folding" of dedicated mode processors on PowerVM platforms to achieve energy savings. This has been the default offline mode since its introduction. There's nothing about stop-self that would prevent the hypervisor from achieving the energy savings available via H_CEDE, so the original premise of this change appears to be flawed. I also have encountered the claim that the transition to and from ceded state is much faster than stop-self/start-cpu. Certainly we would not want to use stop-self as an *idle* mode. That is what H_CEDE is for. However, this difference is insignificant in the context of Linux CPU hotplug, where the latency of an offline or online operation on current systems is on the order of 100ms, mainly attributable to all the various subsystems' cpuhp callbacks. The cede offline mode also prevents accurate accounting, as discussed before: https://lore.kernel.org/linuxppc-dev/1571740391-3251-1-git-send-email-...@linux.vnet.ibm.com/ Unconditionally use stop-self to offline processor threads. This is the architected method for offlining CPUs on PAPR systems. The "cede_offline" boot parameter is rendered obsolete. Removing this code enables the removal of the partition suspend code which temporarily onlines all present CPUs. Fixes: 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") Signed-off-by: Nathan Lynch --- Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 170 ++ .../platforms/pseries/offline_states.h| 38 arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +-- 5 files changed, 15 insertions(+), 229 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7817f7..b1ae1ac159cf 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -50,13 +50,6 @@ Command Line Switches This option is limited to the X86 and S390 architecture. -``cede_offline={"off","on"}`` - Use this option to disable/enable putting offlined processors to an extended - ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". - - This option is limited to the PowerPC architecture. - ``cpu0_hotplug`` Allow to shutdown CPU0. diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 3e8cbfe7a80f..d4b346355bb9 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -35,54 +35,10 @@ #include #include "pseries.h" -#include "offline_states.h" /* This version can't take the spinlock, because it never returns */ static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE; -static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = - CPU_STATE_OFFLINE; -static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE; - -static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; - -static bool cede_offline_enabled __read_mostly = true; - -/* - * Enable/disable cede_offline when available. 
- */ -static int __init setup_cede_offline(char *str) -{ - return (kstrtobool(str, _offline_enabled) == 0); -} - -__setup("cede_offline=", setup_cede_offline); - -enum cpu_state_vals get_cpu_current_state(int cpu) -{ - return per_cpu(current_state, cpu); -} - -void set_cpu_current_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(current_state, cpu) = state; -} - -enum cpu_state_vals get_preferred_offline_state(int cpu) -{ - return per_cpu(preferred_offline_state, cpu); -} - -void set_preferred_offline_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(preferred_offline_state, cpu) = state; -} - -void set_default_offline_state(int cpu) -{ - per_cpu(preferred_offline_state, cpu) = default_offline_state; -} - static void rtas_stop_self(void) { static struct rtas_args args; @@ -101,9 +57,7 @@ static void rtas_stop_self(void) static void pseries_mach_cpu_die(void) { - unsigned int cpu = smp_processor_id(); unsigned int hwcpu = hard_smp_processor_id(); - u8 cede_latency_hint = 0; local_irq_disable(); idle_task_exit(); @@ -112,49 +66,6 @@ static void pseries_mach_cpu_die(void) else xics_teardown_cpu(); - if (get_preferred_offli
Re: Build regressions/improvements in v5.10-rc1
Geert Uytterhoeven writes: > On Mon, Oct 26, 2020 at 10:46 AM Geert Uytterhoeven > wrote: >> Below is the list of build error/warning regressions/improvements in >> v5.10-rc1[1] compared to v5.9[2]. >> >> Summarized: >> - build errors: +3/-7 >> - build warnings: +26/-28 >> >> Happy fixing! ;-) >> >> Thanks to the linux-next team for providing the build service. >> >> [1] >> http://kisskb.ellerman.id.au/kisskb/branch/linus/head/3650b228f83adda7e5ee532e2b90429c03f7b9ec/ >> (all 192 configs) >> [2] >> http://kisskb.ellerman.id.au/kisskb/branch/linus/head/bbf5c979011a099af5dc76498918ed7df445635b/ >> (all 192 configs) >> >> >> *** ERRORS *** >> >> 3 error regressions: >> + /kisskb/src/arch/um/kernel/skas/clone.c: error: expected declaration >> specifiers or '...' before string constant: => 24:16 > > um-all{mod,yes}config > >> + error: hotplug-memory.c: undefined reference to >> `of_drconf_to_nid_single': => .text+0x5e0) > > powerpc-gcc5/pseries_le_defconfig+NO_NUMA Probably introduced by: commit 72cdd117c449896c707fc6cfe5b90978160697d0 Author: Scott Cheloha Date: Wed Sep 16 09:51:22 2020 -0500 pseries/hotplug-memory: hot-add: skip redundant LMB lookup Scott?
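For what it's worth, undefined references like this in !CONFIG_NUMA configs are usually resolved with a stub next to the real declaration, roughly along these lines (an illustrative sketch, not necessarily the fix that ends up being applied):

#ifdef CONFIG_NUMA
int of_drconf_to_nid_single(struct drmem_lmb *lmb);
#else
static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
{
	return first_online_node;
}
#endif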
Re: [PATCH 2/2] powerpc/numa: Remove a redundant variable
Srikar Dronamraju writes: > In of_drconf_to_nid_single() default_nid always refers to NUMA_NO_NODE. > Hence replace it with NUMA_NO_NODE. > > No functional changes. > ... > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index a2c5fe0d0cad..b066d89c2975 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -432,16 +432,15 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa) > static int of_drconf_to_nid_single(struct drmem_lmb *lmb) > { > struct assoc_arrays aa = { .arrays = NULL }; > - int default_nid = NUMA_NO_NODE; > - int nid = default_nid; > + int nid = NUMA_NO_NODE; > int rc, index; > > if ((min_common_depth < 0) || !numa_enabled) > - return default_nid; > + return NUMA_NO_NODE; > > rc = of_get_assoc_arrays(&aa); > if (rc) > - return default_nid; > + return NUMA_NO_NODE; > > if (min_common_depth <= aa.array_sz && > !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < > aa.n_arrays) { > @@ -449,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb) > nid = of_read_number(&aa.arrays[index], 1); > > if (nid == 0xffff || nid >= num_possible_nodes()) > - nid = default_nid; > + nid = NUMA_NO_NODE; > > if (nid > 0) { > index = lmb->aa_index * aa.array_sz; Doesn't seem like an improvement to me, sorry. Admittedly it's a small point, but a local variable named "default_nid" communicates that it's a default or fallback value. That information is lost with this change.
Re: [PATCH 1/2] powerpc/numa: Limit possible nodes to within num_possible_nodes
Srikar Dronamraju writes: > MAX_NUMNODES is a theoretical maximum number of nodes that is supported > by the kernel. Device tree properties expose the number of possible > nodes on the current platform. The kernel would detect this and would > use it for most of its resource allocations. If the platform now > increases the nodes to over what was already exposed, then it may lead > to inconsistencies. Hence limit it to the already exposed nodes. > > Suggested-by: Nathan Lynch > Cc: linuxppc-dev > Cc: Michael Ellerman > Cc: Nathan Lynch > Cc: Tyrel Datwyler > Signed-off-by: Srikar Dronamraju > --- > arch/powerpc/mm/numa.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 8ec7ff05ae47..a2c5fe0d0cad 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -221,7 +221,7 @@ static void initialize_distance_lookup_table(int nid, > } > } > > -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa > +/* Returns nid in the range [0..num_possible_nodes()-1], or -1 if no useful > numa > * info is found. > */ > static int associativity_to_nid(const __be32 *associativity) > @@ -235,7 +235,7 @@ static int associativity_to_nid(const __be32 > *associativity) > nid = of_read_number(&associativity[min_common_depth], 1); > > /* POWER4 LPAR uses 0xffff as invalid node */ > - if (nid == 0xffff || nid >= MAX_NUMNODES) > + if (nid == 0xffff || nid >= num_possible_nodes()) > nid = NUMA_NO_NODE; > > if (nid > 0 && > @@ -448,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb) > index = lmb->aa_index * aa.array_sz + min_common_depth - 1; > nid = of_read_number(&aa.arrays[index], 1); > > - if (nid == 0xffff || nid >= MAX_NUMNODES) > + if (nid == 0xffff || nid >= num_possible_nodes()) > nid = default_nid; > > if (nid > 0) { Yes, looks fine to me. Reviewed-by: Nathan Lynch
Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
"Aneesh Kumar K.V" writes: > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index e437a9ac4956..6c659aada55b 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid, > } > } > > +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = > NUMA_NO_NODE}; It's odd to me to use MAX_NUMNODES for this array when it's going to be indexed not by Linux's logical node IDs but by the platform-provided domain number, which has no relation to MAX_NUMNODES. > + > +int firmware_group_id_to_nid(int firmware_gid) > +{ > + static int last_nid = 0; > + > + /* > + * For PowerNV we don't change the node id. This helps to avoid > + * confusion w.r.t the expected node ids. On pseries, node numbers > + * are virtualized. Hence do logical node id for pseries. > + */ > + if (!firmware_has_feature(FW_FEATURE_LPAR)) > + return firmware_gid; > + > + if (firmware_gid == -1) > + return NUMA_NO_NODE; > + > + if (nid_map[firmware_gid] == NUMA_NO_NODE) > + nid_map[firmware_gid] = last_nid++; This should at least be bounds-checked in case of domain numbering in excess of MAX_NUMNODES. Or a different data structure should be used? Not sure. I'd prefer Linux's logical node type not be easily interchangeable with the firmware node/group id type. The firmware type could be something like: struct affinity_domain { u32 val; }; typedef struct affinity_domain affinity_domain_t; with appropriate accessors/APIs. This can prevent a whole class of errors that is currently possible with CPUs, since the logical and "hardware" ID types are both simple integers.
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Hi everyone, Michael Ellerman writes: > Greg Kurz writes: >> On Tue, 04 Aug 2020 23:35:10 +1000 >> Michael Ellerman wrote: >>> Spinning forever seems like a bad idea, but as has been demonstrated at >>> least twice now, continuing when we don't know the state of the other >>> CPU can lead to straight up crashes. >>> >>> So I think I'm persuaded that it's preferable to have the kernel stuck >>> spinning rather than oopsing. >>> >> >> +1 >> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My >>> first instinct is no, if we're stuck here for 20s a stack trace would be >>> good. But then we will probably hit that on some big and/or heavily >>> loaded machine. >>> >>> So possibly we should call cond_resched() but have some custom logic in >>> the loop to print a warning if we are stuck for more than some >>> sufficiently long amount of time. >> >> How long should that be ? > > Yeah good question. > > I guess step one would be seeing how long it can take on the 384 vcpu > machine. And we can probably test on some other big machines. > > Hopefully Nathan can give us some idea of how long he's seen it take on > large systems? I know he was concerned about the 20s timeout of the > softlockup detector. Maybe I'm not quite clear what this is referring to, but I don't think stop-self/query-stopped-state latency increases with processor count, at least not on PowerVM. And IIRC I was concerned with the earlier patch's potential for causing the softlockup watchdog to rightly complain by polling the stopped state without ever scheduling away. The fact that smp_query_cpu_stopped() kind of collapses the two distinct results from the query-cpu-stopped-state RTAS call into one return value may make it harder than necessary to reason about the questions around cond_resched() and whether to warn. Sorry to pull this stunt but I have had some code sitting in a neglected branch that I think gets the logic around this right. What we should have is a simple C wrapper for the RTAS call that reflects the architected inputs and outputs: (-- rtas.c --) /** * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state. * @hwcpu: Identifies the processor thread to be queried. * @status: Pointer to status, valid only on success. * * Determine whether the given processor thread is in the stopped * state. If successful and @status is non-NULL, the thread's status * is stored to @status. * * Return: * * 0 - Success * * -1 - Hardware error * * -2 - Busy, try again later */ int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status) { unsigned int cpu_status; int token; int fwrc; token = rtas_token("query-cpu-stopped-state"); fwrc = rtas_call(token, 1, 2, _status, hwcpu); if (fwrc != 0) goto out; if (status != NULL) *status = cpu_status; out: return fwrc; } And then a utility function that waits for the remote thread to enter stopped state, with higher-level logic for rescheduling and warning. The fact that smp_query_cpu_stopped() currently does not handle a -2/busy status is a bug, fixed below by using rtas_busy_delay(). Note the justification for the explicit cond_resched() in the outer loop: (-- rtas.h --) /* query-cpu-stopped-state CPU_status */ #define RTAS_QCSS_STATUS_STOPPED 0 #define RTAS_QCSS_STATUS_IN_PROGRESS 1 #define RTAS_QCSS_STATUS_NOT_STOPPED 2 (-- pseries/hotplug-cpu.c --) /** * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state. 
*/ static void wait_for_cpu_stopped(unsigned int cpu) { unsigned int status; unsigned int hwcpu; hwcpu = get_hard_smp_processor_id(cpu); do { int fwrc; /* * rtas_busy_delay() will yield only if RTAS returns a * busy status. Since query-cpu-stopped-state can * yield RTAS_QCSS_STATUS_IN_PROGRESS or * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded * period before the target thread stops, we must take * care to explicitly reschedule while polling. */ cond_resched(); do { fwrc = rtas_query_cpu_stopped_state(hwcpu, &status); } while (rtas_busy_delay(fwrc)); if (fwrc == 0) continue; pr_err_ratelimited("query-cpu-stopped-state for " "thread 0x%x returned %d\n", hwcpu, fwrc); goto out; } while (status == RTAS_QCSS_STATUS_NOT_STOPPED || status == RTAS_QCSS_STATUS_IN_PROGRESS); if (status != RTAS_QCSS_STATUS_STOPPED) {
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Nathan Lynch writes: > Michael Ellerman writes: >> One thought, which I possibly should not put in writing, is that we >> could use the alignment of the pointer as a poor man's substitute for a >> counter, eg: >> >> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) >> +{ >> +if (lmb % PAGE_SIZE == 0) >> +cond_resched(); >> + >> +return ++lmb; >> +} >> >> I think the lmbs are allocated in a block, so I think that will work. >> Maybe PAGE_SIZE is not the right size to use, but you get the idea. >> >> Gross I know, but might be OK as short term solution? > > OK, looking into this. To follow up: I wasn't able to measure more than ~1% difference in DLPAR memory performance with my original version of this, but that was on a relatively small configuration - hundreds of elements in the array as opposed to thousands. I took an educated guess at an appropriate interval and posted v2: https://lore.kernel.org/linuxppc-dev/20200812012005.1919255-1-nath...@linux.ibm.com/
[PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield. Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it. Call cond_resched() on every 20th element. Each iteration of the loop in DLPAR code paths can involve around ten RTAS calls which can each take up to 250us, so this ensures the check is performed at worst every few milliseconds. Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/drmem.h | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) Changes since v1: * Add bounds assertions in drmem_lmb_next(). * Call cond_resched() in the iterator on only every 20th element instead of on every iteration, to reduce overhead in tight loops. diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 17ccc6474ab6..583277e30dd2 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -8,6 +8,9 @@ #ifndef _ASM_POWERPC_LMB_H #define _ASM_POWERPC_LMB_H +#include +#include + struct drmem_lmb { u64 base_addr; u32 drc_index; @@ -26,8 +29,21 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) +{ + const unsigned int resched_interval = 20; + + BUG_ON(lmb < drmem_info->lmbs); + BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs); + + if ((lmb - drmem_info->lmbs) % resched_interval == 0) + cond_resched(); + + return ++lmb; +} + #define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) < (end); (lmb)++) + for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb)) #define for_each_drmem_lmb(lmb)\ for_each_drmem_lmb_in_range((lmb), \ -- 2.25.4
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > One thought, which I possibly should not put in writing, is that we > could use the alignment of the pointer as a poor man's substitute for a > counter, eg: > > +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) > +{ > + if (lmb % PAGE_SIZE == 0) > + cond_resched(); > + > + return ++lmb; > +} > > I think the lmbs are allocated in a block, so I think that will work. > Maybe PAGE_SIZE is not the right size to use, but you get the idea. > > Gross I know, but might be OK as short term solution? OK, looking into this.
Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Hi Christophe, Christophe Leroy writes: >> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) >> +{ >> +const unsigned int resched_interval = 20; >> + >> +BUG_ON(lmb < drmem_info->lmbs); >> +BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs); > > BUG_ON() shall be avoided unless absolutely necessary. > Wouldn't WARN_ON() together with an early return be enough ? Not sure what a sensible early return behavior would be. If the iterator doesn't advance the pointer the behavior will be a hang. BUG_ON a bounds-check failure is appropriate here; many users of this API will corrupt memory otherwise. >> + >> +if ((lmb - drmem_info->lmbs) % resched_interval == 0) >> +cond_resched(); > > Do you need something that precise ? Can't you use 16 or 32 and use a > logical AND instead of a MODULO ? Eh if you're executing in this code you've already lost with respect to performance considerations at this level, see the discussion on v1. I'll use 16 since I'm going to reroll the patch though. > And what garanties that lmb is always an element of a table based at > drmem_info->lmbs ? Well, that's its only intended use right now. There should not be any other arrays of drmem_lmb objects, and I hope we don't gain any. > What about: > > static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb, > struct drmem_lmb *start) > { > const unsigned int resched_interval = 16; > > if ((++lmb - start) & resched_interval == 0) ^^^ Did you mean '%' here? The bitwise AND doesn't do what I want. Otherwise, making drmem_lmb_next() more general by adding a 'start' argument could ease refactoring to come, so I'll do that.
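To spell out the difference between the two expressions being discussed, a standalone illustration (not part of the patch):

static inline bool want_resched(unsigned long idx)
{
	const unsigned long interval = 16;

	/*
	 * The two tests below are equivalent when interval is a power
	 * of two. Note the mask form uses (interval - 1) and needs its
	 * own parentheses: "idx & interval == 0" would parse as
	 * "idx & (interval == 0)" because == binds tighter than &.
	 */
	return (idx % interval) == 0 &&
	       (idx & (interval - 1)) == 0;
}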
Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
"Aneesh Kumar K.V" writes: > On 8/7/20 9:54 AM, Nathan Lynch wrote: >> "Aneesh Kumar K.V" writes: >>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >>> index e437a9ac4956..6c659aada55b 100644 >>> --- a/arch/powerpc/mm/numa.c >>> +++ b/arch/powerpc/mm/numa.c >>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid, >>> } >>> } >>> >>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = >>> NUMA_NO_NODE}; >> >> It's odd to me to use MAX_NUMNODES for this array when it's going to be >> indexed not by Linux's logical node IDs but by the platform-provided >> domain number, which has no relation to MAX_NUMNODES. > > > I didn't want to dynamically allocate this. We could fetch > "ibm,max-associativity-domains" to find the size for that. The current > code do assume firmware group id to not exceed MAX_NUMNODES. Hence kept > the array size to be MAX_NUMNODEs. I do agree that it is confusing. May > be we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES? Well, consider: - ibm,max-associativity-domains can change at runtime with LPM. This doesn't happen in practice yet, but we should probably start thinking about how to support that. - The domain numbering isn't clearly specified to have any particular properties such as beginning at zero or a contiguous range. While the current code likely contains assumptions contrary to these points, a change such as this is an opportunity to think about whether those assumptions can be reduced or removed. In particular I think it would be good to gracefully degrade when the number of NUMA affinity domains can exceed MAX_NUMNODES. Using the platform-supplied domain numbers to directly index Linux data structures will make that impossible. So, maybe genradix or even xarray wouldn't actually be overengineering here.
Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
Michael Ellerman writes: > Tyrel Datwyler writes: >> On 7/27/20 11:46 AM, Scott Cheloha wrote: >>> The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall, >>> Affinity_Domain_Info_By_Partition, which returns, among other things, >>> a "partition affinity score" for a given LPAR. This score, a value on >>> [0-100], represents the processor-memory affinity for the LPAR in >>> question. A score of 0 indicates the worst possible affinity while a >>> score of 100 indicates perfect affinity. The score can be used to >>> reason about performance. >>> >>> This patch adds the score for the local LPAR to the lparcfg procfile >>> under a new 'partition_affinity_score' key. >>> >>> Signed-off-by: Scott Cheloha >> >> I was hoping Michael would chime in the first time around on this patch >> series >> about adding another key/value pair to lparcfg. > > That guy is so unreliable. > > I don't love adding new stuff in lparcfg, but given the file already > exists and there's no prospect of removing it, it's probably not worth > the effort to put the new field anywhere else. > > My other query with this was how on earth anyone is meant to interpret > the metric. ie. if my metric is 50, what does that mean? If it's 90 > should I worry? Here's some more background. This interface is just passing up what the platform provides, and it's identical to the partition affinity score described in the documentation for the management console's lsmemopt command: https://www.ibm.com/support/knowledgecenter/POWER9/p9edm/lsmemopt.html The score is 0-100, higher values are better. To illustrate: I believe a partition's score will be 100 (or very close to it) if all of its CPUs and memory reside within one node. It will be lower than that when a partition has some memory without local CPUs, and lower still when there is no CPU-memory affinity within the partition. Beyond that I don't have more specific information and the algorithm and scale are set by the platform. The intent is for this to be a metric to gather during problem determination e.g. via sosreport or similar, but as far as Linux is concerned this should be treated as an opaque value.
Re: [PATCH v2 2/2] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
Scott Cheloha writes: > The H_GetPerformanceCounterInfo (GPCI) PHYP hypercall has a subcall, > Affinity_Domain_Info_By_Partition, which returns, among other things, > a "partition affinity score" for a given LPAR. This score, a value on > [0-100], represents the processor-memory affinity for the LPAR in > question. A score of 0 indicates the worst possible affinity while a > score of 100 indicates perfect affinity. The score can be used to > reason about performance. > > This patch adds the score for the local LPAR to the lparcfg procfile > under a new 'partition_affinity_score' key. > > Signed-off-by: Scott Cheloha > --- > arch/powerpc/platforms/pseries/lparcfg.c | 35 > 1 file changed, 35 insertions(+) > > diff --git a/arch/powerpc/platforms/pseries/lparcfg.c > b/arch/powerpc/platforms/pseries/lparcfg.c > index b8d28ab88178..e278390ab28d 100644 > --- a/arch/powerpc/platforms/pseries/lparcfg.c > +++ b/arch/powerpc/platforms/pseries/lparcfg.c > @@ -136,6 +136,39 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data > *ppp_data) > return rc; > } > > +static void show_gpci_data(struct seq_file *m) > +{ > + struct hv_gpci_request_buffer *buf; > + unsigned int affinity_score; > + long ret; > + > + buf = kmalloc(sizeof(*buf), GFP_KERNEL); > + if (buf == NULL) > + return; > + > + /* > + * Show the local LPAR's affinity score. > + * > + * 0xB1 selects the Affinity_Domain_Info_By_Partition subcall. > + * The score is at byte 0xB in the output buffer. > + */ > + memset(>params, 0, sizeof(buf->params)); > + buf->params.counter_request = cpu_to_be32(0xB1); > + buf->params.starting_index = cpu_to_be32(-1); /* local LPAR */ > + buf->params.counter_info_version_in = 0x5; /* v5+ for score */ > + ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf), > + sizeof(*buf)); > + if (ret != H_SUCCESS) { > + pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n", > + ret, be32_to_cpu(buf->params.detail_rc)); > + goto out; > + } > + affinity_score = buf->bytes[0xB]; > + seq_printf(m, "partition_affinity_score=%u\n", affinity_score); > +out: > + kfree(buf); > +} > + > static unsigned h_pic(unsigned long *pool_idle_time, > unsigned long *num_procs) > { > @@ -487,6 +520,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void > *v) > partition_active_processors * 100); > } > > + show_gpci_data(m); > + > seq_printf(m, "partition_active_processors=%d\n", > partition_active_processors); Acked-by: Nathan Lynch
Re: [PATCH v2 1/2] powerpc/perf: consolidate GPCI hcall structs into asm/hvcall.h
Scott Cheloha writes: > The H_GetPerformanceCounterInfo (GPCI) hypercall input/output structs are > useful to modules outside of perf/, so move them into asm/hvcall.h to live > alongside the other powerpc hypercall structs. > > Leave the perf-specific GPCI stuff in perf/hv-gpci.h. > > Signed-off-by: Scott Cheloha Acked-by: Nathan Lynch
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > Nathan Lynch writes: >> Michael Ellerman writes: >>> Nathan Lynch writes: >>>> Laurent Dufour writes: >>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >>>>>> The drmem lmb list can have hundreds of thousands of entries, and >>>>>> unfortunately lookups take the form of linear searches. As long as >>>>>> this is the case, traversals have the potential to monopolize the CPU >>>>>> and provoke lockup reports, workqueue stalls, and the like unless >>>>>> they explicitly yield. >>>>>> >>>>>> Rather than placing cond_resched() calls within various >>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >>>>>> expression of the loop macro itself so users can't omit it. >>>>> >>>>> Is that not too much to call cond_resched() on every LMB? >>>>> >>>>> Could that be less frequent, every 10, or 100, I don't really know ? >>>> >>>> Everything done within for_each_drmem_lmb is relatively heavyweight >>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens >>>> of milliseconds. I don't think cond_resched() is an expensive check in >>>> this context. >>> >>> Hmm, mostly. >>> >>> But there are quite a few cases like drmem_update_dt_v1(): >>> >>> for_each_drmem_lmb(lmb) { >>> dr_cell->base_addr = cpu_to_be64(lmb->base_addr); >>> dr_cell->drc_index = cpu_to_be32(lmb->drc_index); >>> dr_cell->aa_index = cpu_to_be32(lmb->aa_index); >>> dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb)); >>> >>> dr_cell++; >>> } >>> >>> Which will compile to a pretty tight loop at the moment. >>> >>> Or drmem_update_dt_v2() which has two loops over all lmbs. >>> >>> And although the actual TIF check is cheap the function call to do it is >>> not free. >>> >>> So I worry this is going to make some of those long loops take even >>> longer. >> >> That's fair, and I was wrong - some of the loop bodies are relatively >> simple, not doing allocations or taking locks, etc. >> >> One way to deal is to keep for_each_drmem_lmb() as-is and add a new >> iterator that can reschedule, e.g. for_each_drmem_lmb_slow(). > > If we did that, how many call-sites would need converting? > Is it ~2 or ~20 or ~200? At a glance I would convert 15-20 out of the 24 users in the tree I'm looking at. Let me know if I should do a v2 with that approach.
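For reference, the alternative iterator mentioned above might look something like this (name and placement are assumptions; this variant was not pursued):

/* Rescheduling variant for the heavyweight DLPAR paths. */
#define for_each_drmem_lmb_slow(lmb)					\
	for ((lmb) = &drmem_info->lmbs[0];				\
	     (lmb) < &drmem_info->lmbs[drmem_info->n_lmbs];		\
	     (lmb)++, cond_resched())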
Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > Tyrel Datwyler writes: >> On 8/11/20 6:20 PM, Nathan Lynch wrote: >>> >>> +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) >>> +{ >>> + const unsigned int resched_interval = 20; >>> + >>> + BUG_ON(lmb < drmem_info->lmbs); >>> + BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs); >> >> I think BUG_ON is a pretty big no-no these days unless there is no other >> option. > > It's complicated, but yes we would like to avoid adding them if we can. > > In a case like this there is no other option, *if* the check has to be > in drmem_lmb_next(). > > But I don't think we really need to check that there. > > If for some reason this was called with an *lmb pointing outside of the > lmbs array it would confuse the cond_resched() logic, but it's not worth > crashing the box for that IMHO. The BUG_ONs are pretty much orthogonal to the cond_resched(). It's not apparent from the context of the change, but some users of the for_each_drmem_lmb* macros modify elements of the drmem_info->lmbs array. If the lmb passed to drmem_lmb_next() violates the bounds check (say, if the callsite inappropriately modifies it within the loop), such users are guaranteed to corrupt other objects in memory. This was my thinking in adding the BUG_ONs, and I don't see another place to do it.
[PATCH v3] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield. Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it. Introduce a drmem_lmb_next() iteration helper function which calls cond_resched() at a regular interval during array traversal. Each iteration of the loop in DLPAR code paths can involve around ten RTAS calls which can each take up to 250us, so this ensures the check is performed at worst every few milliseconds. Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format") Signed-off-by: Nathan Lynch --- Notes: Changes since v2: * Make drmem_lmb_next() more general. * Adjust reschedule interval for better code generation. * Add commentary to drmem_lmb_next() to explain the cond_resched() call. * Remove bounds assertions. Changes since v1: * Add bounds assertions in drmem_lmb_next(). * Call cond_resched() in the iterator on only every 20th element instead of on every iteration, to reduce overhead in tight loops. arch/powerpc/include/asm/drmem.h | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 17ccc6474ab6..6fb928605ed1 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -8,6 +8,8 @@ #ifndef _ASM_POWERPC_LMB_H #define _ASM_POWERPC_LMB_H +#include + struct drmem_lmb { u64 base_addr; u32 drc_index; @@ -26,8 +28,22 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb, + const struct drmem_lmb *start) +{ + /* +* DLPAR code paths can take several milliseconds per element +* when interacting with firmware. Ensure that we don't +* unfairly monopolize the CPU. +*/ + if (((++lmb - start) % 16) == 0) + cond_resched(); + + return lmb; +} + #define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) < (end); (lmb)++) + for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb, start)) #define for_each_drmem_lmb(lmb)\ for_each_drmem_lmb_in_range((lmb), \ -- 2.25.4
Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Hi Aneesh, "Aneesh Kumar K.V" writes: > "Aneesh Kumar K.V" writes: >> On 8/8/20 2:15 AM, Nathan Lynch wrote: >>> "Aneesh Kumar K.V" writes: >>>> On 8/7/20 9:54 AM, Nathan Lynch wrote: >>>>> "Aneesh Kumar K.V" writes: >>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >>>>>> index e437a9ac4956..6c659aada55b 100644 >>>>>> --- a/arch/powerpc/mm/numa.c >>>>>> +++ b/arch/powerpc/mm/numa.c >>>>>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int >>>>>> nid, >>>>>> } >>>>>>} >>>>>> >>>>>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = >>>>>> NUMA_NO_NODE}; >>>>> >>>>> It's odd to me to use MAX_NUMNODES for this array when it's going to be >>>>> indexed not by Linux's logical node IDs but by the platform-provided >>>>> domain number, which has no relation to MAX_NUMNODES. >>>> >>>> >>>> I didn't want to dynamically allocate this. We could fetch >>>> "ibm,max-associativity-domains" to find the size for that. The current >>>> code do assume firmware group id to not exceed MAX_NUMNODES. Hence kept >>>> the array size to be MAX_NUMNODEs. I do agree that it is confusing. May >>>> be we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES? >>> >>> Well, consider: >>> >>> - ibm,max-associativity-domains can change at runtime with LPM. This >>>doesn't happen in practice yet, but we should probably start thinking >>>about how to support that. >>> - The domain numbering isn't clearly specified to have any particular >>>properties such as beginning at zero or a contiguous range. >>> >>> While the current code likely contains assumptions contrary to these >>> points, a change such as this is an opportunity to think about whether >>> those assumptions can be reduced or removed. In particular I think it >>> would be good to gracefully degrade when the number of NUMA affinity >>> domains can exceed MAX_NUMNODES. Using the platform-supplied domain >>> numbers to directly index Linux data structures will make that >>> impossible. >>> >>> So, maybe genradix or even xarray wouldn't actually be overengineering >>> here. >>> >> >> One of the challenges with such a data structure is that we initialize >> the nid_map before the slab is available. This means a memblock based >> allocation and we would end up implementing such a sparse data structure >> ourselves here. Yes, good point. >> As you mentioned above, since we know that hypervisor as of now limits >> the max affinity domain id below ibm,max-associativity-domains we are >> good with an array-like nid_map we have here. This keeps the code simpler. >> >> This will also allow us to switch to a more sparse data structure as you >> requested here in the future because the main change that is pushed in >> this series is the usage of firmare_group_id_to_nid(). The details of >> the data structure we use to keep track of that mapping are pretty much >> internal to that function. > > How about this? This makes it not a direct index. But it do limit the > search to max numa node on the system. > > static int domain_id_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = -1 }; > > static int __affinity_domain_to_nid(int domain_id, int max_nid) > { > int i; > > for (i = 0; i < max_nid; i++) { > if (domain_id_map[i] == domain_id) > return i; > } > return NUMA_NO_NODE; > } OK, this indexes the array by Linux's node id, good. 
I was wondering if I could persuade you to flip it around like this :-) Walking through the code below: > int affinity_domain_to_nid(struct affinity_domain *domain) > { > int nid, domain_id; > static int last_nid = 0; > static DEFINE_SPINLOCK(node_id_lock); > > domain_id = domain->id; > /* >* For PowerNV we don't change the node id. This helps to avoid >* confusion w.r.t the expected node ids. On pseries, node numbers >* are virtualized. Hence do logical node id for pseries. >*/ > if (!firmware_has_feature(FW_FEATURE_LPAR)) > return domain_id; > > if (domain_id == -1 || last_nid == MAX_NUMNODES) > return NUMA_NO_
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Michael Ellerman writes: > Michael Roth writes: >> Quoting Nathan Lynch (2020-08-07 02:05:09) > ... >>> wait_for_cpu_stopped() should be able to accommodate a time-based >>> warning if necessary, but speaking as a likely recipient of any bug >>> reports that would arise here, I'm not convinced of the need and I >>> don't know what a good value would be. It's relatively easy to sample >>> the stack of a task that's apparently failing to make progress, plus I >>> probably would use 'perf probe' or similar to report the inputs and >>> outputs for the RTAS call. >> >> I think if we make the timeout sufficiently high like 2 minutes or so >> it wouldn't hurt and if we did seem them it would probably point to an >> actual bug. But I don't have a strong feeling either way. > > I think we should print a warning after 2 minutes. > > It's true that there are fairly easy mechanisms to work out where the > thread is stuck, but customers are unlikely to use them. They're just > going to report that it's stuck with no further info, and probably > reboot the machine before we get a chance to get any further info. > > Whereas if the kernel prints a warning with a stack trace we at least > have that to go on in an initial bug report. > >>> I'm happy to make this a proper submission after I can clean it up and >>> retest it, or Michael R. is welcome to appropriate it, assuming it's >>> acceptable. >>> >> >> I've given it a shot with this patch and it seems to be holding up in >> testing. If we don't think the ~2 minutes warning message is needed I >> can clean it up to post: >> >> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e >> >> I'd likely break the refactoring patches out to a separate patch under >> Nathan's name since it fixes a separate bug potentially. > > While I like Nathan's refactoring, we probably want to do the minimal > fix first to ease backporting. > > Then do the refactoring on top of that. Fair enough, thanks.
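As a sketch of the kind of warning being discussed (the 2-minute interval and the stopped-state predicate here are assumptions for illustration):

static void wait_for_cpu_stopped_with_warning(unsigned int cpu)
{
	unsigned long warn_at = jiffies + 120 * HZ;

	/*
	 * cpu_thread_is_stopped() is a hypothetical predicate standing
	 * in for the query-cpu-stopped-state polling shown earlier.
	 */
	while (!cpu_thread_is_stopped(cpu)) {
		cond_resched();
		if (time_after(jiffies, warn_at)) {
			pr_warn("CPU %u has not stopped after 2 minutes\n", cpu);
			dump_stack();
			warn_at = jiffies + 120 * HZ;
		}
	}
}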
Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
Srikar Dronamraju writes: > * Michael Ellerman [2020-07-07 15:02:17]: > >> Srikar Dronamraju writes: >> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains >> > /proc/device-tree/rtas/ibm,current-associativity-domains >> > 0005 0001 0002 0002 0002 0010 >> > /proc/device-tree/rtas/ibm,max-associativity-domains >> > 0005 0001 0008 0020 0020 0100 >> > >> > $ cat /sys/devices/system/node/possible ##Before patch >> > 0-31 >> > >> > $ cat /sys/devices/system/node/possible ##After patch >> > 0-1 >> > >> > Note the maximum nodes this platform can support is only 2 but the >> > possible nodes is set to 32. >> >> But what about LPM to a system with more nodes? >> > > I have very less info on LPM, so I checked with Nathan Lynch before posting > and as per Nathan in the current design of LPM, Linux wouldn't use the new > node numbers. (I see a v2 has been posted already but I needed a little time to check with our hypervisor people on this point.) It's less of a design and more of a least-bad option in the absence of a more flexible NUMA architecture in Linux. For now, the "rule" with LPM and NUMA has to be that Linux uses the NUMA information from the device tree that it was booted with, and it must disregard ibm,associativity and similar information after LPM or certain other platform events. Historically there has been code that tried to honor changes in NUMA information but it caused much worse problems than degraded performance. That code has been disabled by default since last year and is now subject to removal: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897 Most NUMA-aware code happens to follow that rule because the device tree associativity information tends to get cached on first access in Linux's own data structures. It all feels a little fragile to me though, especially since we can DLPAR add processors and memory after LPM with "new" associativity properties which don't relate to the logical topology Linux has already built. However, on currently available hw, as long as we're using ibm,max-associativity-domains to limit the set of possible nodes, I believe such resources will always receive valid (but possibly suboptimal) NUMA assignments. That's because as of this writing ibm,max-associativity-domains has the same contents on all currently available PowerVM systems. Now if we change to using ibm,current-associativity-domains, which we *can* expect to differ between differently configured systems, post-LPM DLPAR additions can yield resources with node assignments that fall outside of the possible range, especially when we've migrated from a smaller system to a larger one. Is the current code robust against that possibility? I don't think so: it looks like of_node_to_nid_single(), of_drconf_to_nid_single() and possibly more code need to guard against this in order to prevent NODE_DATA() null dereferences and the like. Probably those functions should be made to clamp the nid assignment at num_possible_nodes() instead of MAX_NUMNODES, which strikes me as more correct regardless of your patch.
Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
Srikar Dronamraju writes: > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 9fcf2d195830..3d55cef1a2dc 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void) > return; > > if (of_property_read_u32_index(rtas, > - "ibm,max-associativity-domains", > + "ibm,current-associativity-domains", > min_common_depth, )) Looks good if ibm,current-associativity-domains[min_common_depth] actually denotes the range of possible values, i.e. a value of 2 implies node numbers 0 and 1. PAPR+ says it's the "number of unique values", which isn't how I would specify the property if it's supposed to express a range. But it's probably OK...
Re: [PATCH] powerpc/pseries: new lparcfg key/value pair: partition_affinity_score
Hi Tyrel, Tyrel Datwyler writes: > On 6/19/20 8:34 AM, Scott Cheloha wrote: >> The H_GetPerformanceCounterInfo PHYP hypercall has a subcall, >> Affinity_Domain_Info_By_Partition, which returns, among other things, >> a "partition affinity score" for a given LPAR. This score, a value on >> [0-100], represents the processor-memory affinity for the LPAR in >> question. A score of 0 indicates the worst possible affinity while a >> score of 100 indicates perfect affinity. The score can be used to >> reason about performance. >> >> This patch adds the score for the local LPAR to the lparcfg procfile >> under a new 'partition_affinity_score' key. > > I expect that you will probably get a NACK from Michael on this. The overall > desire is to move away from these dated /proc interfaces. While its true that > I > did add a new value recently it was strictly to facilitate and correct the > calculation of a derived value that was already dependent on a couple other > existing values in lparcfg. > > With that said I would expect that you would likely be advised to expose this > as > a sysfs attribute. The question is where? We probably should put some thought > in > to this as I would like to port each lparcfg value over to sysfs so that we > can > move to deprecating lparcfg. Putting everything under something like > /sys/kernel/lparcfg/* maybe. Michael may have a better suggestion. I think this score fits pretty naturally in lparcfg: it's a simple metric that is specific to the pseries/papr platform, like everything else in there. A few dozen key=value pairs contained in a single file is simple and efficient, unlike sysfs with its rather inconsistently applied one-value-per-file convention. Surely it's OK if lparcfg gains a line every few years? >> The H_GetPerformanceCounterInfo hypercall is already used elsewhere in >> the kernel, in powerpc/perf/hv-gpci.c. Refactoring that code and this >> code into a more general API might be worthwhile if additional modules >> require the hypercall in the future. > > If you are duplicating code its likely you should already be doing this. See > the > rest of my comments about below. > >> >> Signed-off-by: Scott Cheloha >> --- >> arch/powerpc/platforms/pseries/lparcfg.c | 53 >> 1 file changed, 53 insertions(+) >> >> diff --git a/arch/powerpc/platforms/pseries/lparcfg.c >> b/arch/powerpc/platforms/pseries/lparcfg.c >> index b8d28ab88178..b75151eee0f0 100644 >> --- a/arch/powerpc/platforms/pseries/lparcfg.c >> +++ b/arch/powerpc/platforms/pseries/lparcfg.c >> @@ -136,6 +136,57 @@ static unsigned int h_get_ppp(struct hvcall_ppp_data >> *ppp_data) >> return rc; >> } >> >> +/* >> + * Based on H_GetPerformanceCounterInfo v1.10. >> + */ >> +static void show_gpci_data(struct seq_file *m) >> +{ >> +struct perf_counter_info_params { >> +__be32 counter_request; >> +__be32 starting_index; >> +__be16 secondary_index; >> +__be16 returned_values; >> +__be32 detail_rc; >> +__be16 counter_value_element_size; >> +u8 counter_info_version_in; >> +u8 counter_info_version_out; >> +u8 reserved[0xC]; >> +} __packed; > > This looks to duplicate the hv_get_perf_counter_info_params struct from > arch/powerpc/perf/hv-gpci.h. Maybe this include file needs to move to > arch/powerpc/asm/inlcude so you don't have to redefine this struct. 
> >> +struct hv_gpci_request_buffer { >> +struct perf_counter_info_params params; >> +u8 output[4096 - sizeof(struct perf_counter_info_params)]; >> +} __packed; > > This struct is code duplication of the one defined in > arch/powerpc/perf/hv-gpci.c and could be moved into hv-gpci.h along with > HGPCI_MAX_DATA_BYTES so that you can use those versions here. I tend to agree with these comments. >> +struct hv_gpci_request_buffer *buf; >> +long ret; >> +unsigned int affinity_score; >> + >> +buf = kmalloc(sizeof(*buf), GFP_KERNEL); >> +if (buf == NULL) >> +return; >> + >> +/* >> + * Show the local LPAR's affinity score. >> + * >> + * 0xB1 selects the Affinity_Domain_Info_By_Partition subcall. >> + * The score is at byte 0xB in the output buffer. >> + */ >> +memset(>params, 0, sizeof(buf->params)); >> +buf->params.counter_request = cpu_to_be32(0xB1); >> +buf->params.starting_index = cpu_to_be32(-1); /* local LPAR */ >> +buf->params.counter_info_version_in = 0x5; /* v5+ for score */ >> +ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO, virt_to_phys(buf), >> + sizeof(*buf)); >> +if (ret != H_SUCCESS) { >> +pr_debug("hcall failed: H_GET_PERF_COUNTER_INFO: %ld, %x\n", >> + ret, be32_to_cpu(buf->params.detail_rc)); >> +goto out; >> +} >> +affinity_score = buf->output[0xB]; >> +seq_printf(m,
[PATCH 03/18] powerpc/numa: remove ability to enable topology updates
Remove the /proc/powerpc/topology_updates interface and the topology_updates=on/off command line argument. The internal topology_updates_enabled flag remains for now, but always false. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 71 +- 1 file changed, 1 insertion(+), 70 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 9fcf2d195830..34d95de77bdd 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -984,27 +984,7 @@ static int __init early_numa(char *p) } early_param("numa", early_numa); -/* - * The platform can inform us through one of several mechanisms - * (post-migration device tree updates, PRRN or VPHN) that the NUMA - * assignment of a resource has changed. This controls whether we act - * on that. Disabled by default. - */ -static bool topology_updates_enabled; - -static int __init early_topology_updates(char *p) -{ - if (!p) - return 0; - - if (!strcmp(p, "on")) { - pr_warn("Caution: enabling topology updates\n"); - topology_updates_enabled = true; - } - - return 0; -} -early_param("topology_updates", early_topology_updates); +static const bool topology_updates_enabled; #ifdef CONFIG_MEMORY_HOTPLUG /* @@ -1632,52 +1612,6 @@ int prrn_is_enabled(void) return prrn_enabled; } -static int topology_read(struct seq_file *file, void *v) -{ - if (vphn_enabled || prrn_enabled) - seq_puts(file, "on\n"); - else - seq_puts(file, "off\n"); - - return 0; -} - -static int topology_open(struct inode *inode, struct file *file) -{ - return single_open(file, topology_read, NULL); -} - -static ssize_t topology_write(struct file *file, const char __user *buf, - size_t count, loff_t *off) -{ - char kbuf[4]; /* "on" or "off" plus null. */ - int read_len; - - read_len = count < 3 ? count : 3; - if (copy_from_user(kbuf, buf, read_len)) - return -EINVAL; - - kbuf[read_len] = '\0'; - - if (!strncmp(kbuf, "on", 2)) { - topology_updates_enabled = true; - start_topology_update(); - } else if (!strncmp(kbuf, "off", 3)) { - stop_topology_update(); - topology_updates_enabled = false; - } else - return -EINVAL; - - return count; -} - -static const struct proc_ops topology_proc_ops = { - .proc_read = seq_read, - .proc_write = topology_write, - .proc_open = topology_open, - .proc_release = single_release, -}; - static int topology_update_init(void) { start_topology_update(); @@ -1685,9 +1619,6 @@ static int topology_update_init(void) if (vphn_enabled) topology_schedule_update(); - if (!proc_create("powerpc/topology_updates", 0644, NULL, _proc_ops)) - return -ENOMEM; - topology_inited = 1; return 0; } -- 2.25.4
[PATCH 12/18] powerpc/rtasd: simplify handle_rtas_event(), emit message on events
prrn_is_enabled() always returns false/0, so handle_rtas_event() can be simplified and some dead code can be removed. Use machine_is() instead of #ifdef to run this code only on pseries, and add an informational ratelimited message that we are ignoring the events. PRRN events are relatively rare in normal operation and usually arise from operator-initiated actions such as a DPO (Dynamic Platform Optimizer) run. Eventually we do want to consume these events and update the device tree, but that needs more care to be safe vs LPM and DLPAR. Signed-off-by: Nathan Lynch --- arch/powerpc/kernel/rtasd.c | 28 +++- 1 file changed, 3 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c index 89b798f8f656..8561dfb33f24 100644 --- a/arch/powerpc/kernel/rtasd.c +++ b/arch/powerpc/kernel/rtasd.c @@ -273,37 +273,15 @@ void pSeries_log_error(char *buf, unsigned int err_type, int fatal) } } -#ifdef CONFIG_PPC_PSERIES -static void handle_prrn_event(s32 scope) -{ - /* -* For PRRN, we must pass the negative of the scope value in -* the RTAS event. -*/ - pseries_devicetree_update(-scope); - numa_update_cpu_topology(false); -} - static void handle_rtas_event(const struct rtas_error_log *log) { - if (rtas_error_type(log) != RTAS_TYPE_PRRN || !prrn_is_enabled()) + if (!machine_is(pseries)) return; - /* For PRRN Events the extended log length is used to denote -* the scope for calling rtas update-nodes. -*/ - handle_prrn_event(rtas_error_extended_log_length(log)); + if (rtas_error_type(log) == RTAS_TYPE_PRRN) + pr_info_ratelimited("Platform resource reassignment ignored.\n"); } -#else - -static void handle_rtas_event(const struct rtas_error_log *log) -{ - return; -} - -#endif - static int rtas_log_open(struct inode * inode, struct file * file) { return 0; -- 2.25.4
[PATCH 05/18] powerpc/numa: make vphn_enabled, prrn_enabled flags const
Previous changes have made it so these flags are never changed; enforce this by making them const. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 9e20f12e6caf..1b89bacb8975 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1132,8 +1132,8 @@ struct topology_update_data { #define TOPOLOGY_DEF_TIMER_SECS60 static cpumask_t cpu_associativity_changes_mask; -static int vphn_enabled; -static int prrn_enabled; +static const int vphn_enabled; +static const int prrn_enabled; static void reset_topology_timer(void); static int topology_timer_secs = 1; static int topology_inited; -- 2.25.4
[PATCH 06/18] powerpc/numa: remove unreachable topology timer code
Since vphn_enabled is always 0, we can stub out timed_topology_update() and remove the code which becomes unreachable. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 21 - 1 file changed, 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 1b89bacb8975..6207297490a8 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1129,13 +1129,9 @@ struct topology_update_data { int new_nid; }; -#define TOPOLOGY_DEF_TIMER_SECS60 - static cpumask_t cpu_associativity_changes_mask; static const int vphn_enabled; static const int prrn_enabled; -static void reset_topology_timer(void); -static int topology_timer_secs = 1; static int topology_inited; /* @@ -1143,15 +1139,6 @@ static int topology_inited; */ int timed_topology_update(int nsecs) { - if (vphn_enabled) { - if (nsecs > 0) - topology_timer_secs = nsecs; - else - topology_timer_secs = TOPOLOGY_DEF_TIMER_SECS; - - reset_topology_timer(); - } - return 0; } @@ -1438,14 +1425,6 @@ static void topology_schedule_update(void) schedule_work(_work); } -static struct timer_list topology_timer; - -static void reset_topology_timer(void) -{ - if (vphn_enabled) - mod_timer(_timer, jiffies + topology_timer_secs * HZ); -} - /* * Start polling for associativity changes. */ -- 2.25.4
[PATCH 16/18] powerpc/pseries: remove memory "re-add" implementation
dlpar_memory() no longer has any callers which pass PSERIES_HP_ELOG_ACTION_READD. Remove this case and the corresponding unreachable code. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - .../platforms/pseries/hotplug-memory.c| 42 --- 2 files changed, 43 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 0107d724e9da..55f9a154c95d 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -215,7 +215,6 @@ inline uint16_t pseries_errorlog_length(struct pseries_errorlog *sect) #define PSERIES_HP_ELOG_ACTION_ADD 1 #define PSERIES_HP_ELOG_ACTION_REMOVE 2 -#define PSERIES_HP_ELOG_ACTION_READD 3 #define PSERIES_HP_ELOG_ID_DRC_NAME1 #define PSERIES_HP_ELOG_ID_DRC_INDEX 2 diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 5ace2f9a277e..67ece3ac9ac2 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -487,40 +487,6 @@ static int dlpar_memory_remove_by_index(u32 drc_index) return rc; } -static int dlpar_memory_readd_by_index(u32 drc_index) -{ - struct drmem_lmb *lmb; - int lmb_found; - int rc; - - pr_info("Attempting to update LMB, drc index %x\n", drc_index); - - lmb_found = 0; - for_each_drmem_lmb(lmb) { - if (lmb->drc_index == drc_index) { - lmb_found = 1; - rc = dlpar_remove_lmb(lmb); - if (!rc) { - rc = dlpar_add_lmb(lmb); - if (rc) - dlpar_release_drc(lmb->drc_index); - } - break; - } - } - - if (!lmb_found) - rc = -EINVAL; - - if (rc) - pr_info("Failed to update memory at %llx\n", - lmb->base_addr); - else - pr_info("Memory at %llx was updated\n", lmb->base_addr); - - return rc; -} - static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) { struct drmem_lmb *lmb, *start_lmb, *end_lmb; @@ -617,10 +583,6 @@ static int dlpar_memory_remove_by_index(u32 drc_index) { return -EOPNOTSUPP; } -static int dlpar_memory_readd_by_index(u32 drc_index) -{ - return -EOPNOTSUPP; -} static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) { @@ -902,10 +864,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) break; } - break; - case PSERIES_HP_ELOG_ACTION_READD: - drc_index = hp_elog->_drc_u.drc_index; - rc = dlpar_memory_readd_by_index(drc_index); break; default: pr_err("Invalid action (%d) specified\n", hp_elog->action); -- 2.25.4
[PATCH 01/18] powerpc/pseries: remove cede offline state for CPUs
This effectively reverts commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), which added an offline mode for CPUs which uses the H_CEDE hcall instead of the architected stop-self RTAS function in order to facilitate "folding" of dedicated mode processors on PowerVM platforms to achieve energy savings. This has been the default offline mode since its introduction. There's nothing about stop-self that would prevent the hypervisor from achieving the energy savings available via H_CEDE, so the original premise of this change appears to be flawed. I also have encountered the claim that the transition to and from ceded state is much faster than stop-self/start-cpu. Certainly we would not want to use stop-self as an *idle* mode. That is what H_CEDE is for. However, this difference is insignificant in the context of Linux CPU hotplug, where the latency of an offline or online operation on current systems is on the order of 100ms, mainly attributable to all the various subsystems' cpuhp callbacks. The cede offline mode also prevents accurate accounting, as discussed before: https://lore.kernel.org/linuxppc-dev/1571740391-3251-1-git-send-email-...@linux.vnet.ibm.com/ Unconditionally use stop-self to offline processor threads. This is the architected method for offlining CPUs on PAPR systems. The "cede_offline" boot parameter is rendered obsolete. Removing this code enables the removal of the partition suspend code which temporarily onlines all present CPUs. Fixes: 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy --- Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 170 ++ .../platforms/pseries/offline_states.h| 38 arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +-- 5 files changed, 15 insertions(+), 229 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7817f7..b1ae1ac159cf 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -50,13 +50,6 @@ Command Line Switches This option is limited to the X86 and S390 architecture. -``cede_offline={"off","on"}`` - Use this option to disable/enable putting offlined processors to an extended - ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". - - This option is limited to the PowerPC architecture. - ``cpu0_hotplug`` Allow to shutdown CPU0. diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 3e8cbfe7a80f..d4b346355bb9 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -35,54 +35,10 @@ #include #include "pseries.h" -#include "offline_states.h" /* This version can't take the spinlock, because it never returns */ static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE; -static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = - CPU_STATE_OFFLINE; -static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE; - -static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; - -static bool cede_offline_enabled __read_mostly = true; - -/* - * Enable/disable cede_offline when available. 
- */ -static int __init setup_cede_offline(char *str) -{ - return (kstrtobool(str, _offline_enabled) == 0); -} - -__setup("cede_offline=", setup_cede_offline); - -enum cpu_state_vals get_cpu_current_state(int cpu) -{ - return per_cpu(current_state, cpu); -} - -void set_cpu_current_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(current_state, cpu) = state; -} - -enum cpu_state_vals get_preferred_offline_state(int cpu) -{ - return per_cpu(preferred_offline_state, cpu); -} - -void set_preferred_offline_state(int cpu, enum cpu_state_vals state) -{ - per_cpu(preferred_offline_state, cpu) = state; -} - -void set_default_offline_state(int cpu) -{ - per_cpu(preferred_offline_state, cpu) = default_offline_state; -} - static void rtas_stop_self(void) { static struct rtas_args args; @@ -101,9 +57,7 @@ static void rtas_stop_self(void) static void pseries_mach_cpu_die(void) { - unsigned int cpu = smp_processor_id(); unsigned int hwcpu = hard_smp_processor_id(); - u8 cede_latency_hint = 0; local_irq_disable(); idle_task_exit(); @@ -112,49 +66,6 @@ static void pseries_mach_cpu_die(void) else xics_teardown_cpu(); - if
[PATCH 00/18] remove extended cede offline mode and bogus topology update code
Two major parts to this series: 1. Removal of the extended cede offline mode for CPUs as well as the partition suspend code which accommodates it by temporarily onlining all CPUs prior to suspending the LPAR. This solves some accounting problems, simplifies the pseries CPU hotplug code, and greatly uncomplicates the existing partition suspend code, easing a much-needed transition to the Linux suspend framework. The two patches which make up this part have been posted before: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=180718 and they are simply incorporated unchanged into the larger series here, with Gautham's Reviewed-by added to patch #1. 2. Removal of the long-disabled "topology update" code, most of which resides in mm/numa.c, but there are pieces in pseries and rtasd to excise as well. This code was an attempt to honor changes in a partition's NUMA properties arising from resource reassignments which occur as part of a migration, VPHN change, or a Dynamic Platform Optimizer operation. Its main technique is to remove and re-add affected processors and LMBs and hope in vain that the changes in cpu-node and physaddr-node relationships aren't disruptive. We want to provide user space with some indication that Linux's logical NUMA representation has become out of sync with the platform's assignments, but we need to get this unusable stuff out of the way before this code can sustain new features. Nathan Lynch (18): powerpc/pseries: remove cede offline state for CPUs powerpc/rtas: don't online CPUs for partition suspend powerpc/numa: remove ability to enable topology updates powerpc/numa: remove unreachable topology update code powerpc/numa: make vphn_enabled, prrn_enabled flags const powerpc/numa: remove unreachable topology timer code powerpc/numa: remove unreachable topology workqueue code powerpc/numa: remove vphn_enabled and prrn_enabled internal flags powerpc/numa: stub out numa_update_cpu_topology() powerpc/numa: remove timed_topology_update() powerpc/numa: remove start/stop_topology_update() powerpc/rtasd: simplify handle_rtas_event(), emit message on events powerpc/numa: remove prrn_is_enabled() powerpc/numa: remove arch_update_cpu_topology powerpc/pseries: remove prrn special case from DT update path powerpc/pseries: remove memory "re-add" implementation powerpc/pseries: remove dlpar_cpu_readd() powerpc/pseries: remove obsolete memory hotplug DT notifier code Documentation/core-api/cpu_hotplug.rst| 7 - arch/powerpc/include/asm/rtas.h | 3 - arch/powerpc/include/asm/topology.h | 27 - arch/powerpc/kernel/rtas.c| 122 + arch/powerpc/kernel/rtasd.c | 28 +- arch/powerpc/mm/numa.c| 486 -- arch/powerpc/platforms/pseries/hotplug-cpu.c | 189 +-- .../platforms/pseries/hotplug-memory.c| 107 +--- arch/powerpc/platforms/pseries/mobility.c | 31 -- .../platforms/pseries/offline_states.h| 38 -- arch/powerpc/platforms/pseries/pmem.c | 1 - arch/powerpc/platforms/pseries/smp.c | 28 +- arch/powerpc/platforms/pseries/suspend.c | 27 +- 13 files changed, 22 insertions(+), 1072 deletions(-) delete mode 100644 arch/powerpc/platforms/pseries/offline_states.h -- 2.25.4
[PATCH 11/18] powerpc/numa: remove start/stop_topology_update()
These APIs have become no-ops, so remove them and all call sites. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 10 -- arch/powerpc/mm/numa.c| 20 arch/powerpc/platforms/pseries/mobility.c | 4 arch/powerpc/platforms/pseries/suspend.c | 5 + 4 files changed, 1 insertion(+), 38 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 379e2cc3789f..537c638582eb 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -93,19 +93,9 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) -extern int start_topology_update(void); -extern int stop_topology_update(void); extern int prrn_is_enabled(void); extern int find_and_online_cpu_nid(int cpu); #else -static inline int start_topology_update(void) -{ - return 0; -} -static inline int stop_topology_update(void) -{ - return 0; -} static inline int prrn_is_enabled(void) { return 0; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 6c579ac3e679..dec7ce3b5e67 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1157,8 +1157,6 @@ static long vphn_get_associativity(unsigned long cpu, , rc); break; } - - stop_topology_update(); out: return rc; } @@ -1212,22 +1210,6 @@ int arch_update_cpu_topology(void) return numa_update_cpu_topology(true); } -/* - * Start polling for associativity changes. - */ -int start_topology_update(void) -{ - return 0; -} - -/* - * Disable polling for VPHN associativity changes. - */ -int stop_topology_update(void) -{ - return 0; -} - int prrn_is_enabled(void) { return 0; @@ -1235,8 +1217,6 @@ int prrn_is_enabled(void) static int topology_update_init(void) { - start_topology_update(); - topology_inited = 1; return 0; } diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 10d982997736..c0b09f6f0ae3 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -388,8 +388,6 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; - stop_topology_update(); - do { rc = rtas_ibm_suspend_me(streamid); if (rc == -EAGAIN) @@ -401,8 +399,6 @@ static ssize_t migration_store(struct class *class, post_mobility_fixup(); - start_topology_update(); - return count; } diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index f789693f61f4..81e0ac58d620 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -145,11 +145,8 @@ static ssize_t store_hibernate(struct device *dev, ssleep(1); } while (rc == -EAGAIN); - if (!rc) { - stop_topology_update(); + if (!rc) rc = pm_suspend(PM_SUSPEND_MEM); - start_topology_update(); - } stream_id = 0; -- 2.25.4
[PATCH 07/18] powerpc/numa: remove unreachable topology workqueue code
Since vphn_enabled is always 0, we can remove the call to topology_schedule_update() and remove the code which becomes unreachable as a result. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 14 -- 1 file changed, 14 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 6207297490a8..8415481a7f13 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1414,17 +1414,6 @@ int arch_update_cpu_topology(void) return numa_update_cpu_topology(true); } -static void topology_work_fn(struct work_struct *work) -{ - rebuild_sched_domains(); -} -static DECLARE_WORK(topology_work, topology_work_fn); - -static void topology_schedule_update(void) -{ - schedule_work(_work); -} - /* * Start polling for associativity changes. */ @@ -1450,9 +1439,6 @@ static int topology_update_init(void) { start_topology_update(); - if (vphn_enabled) - topology_schedule_update(); - topology_inited = 1; return 0; } -- 2.25.4
[PATCH 18/18] powerpc/pseries: remove obsolete memory hotplug DT notifier code
pseries_update_drconf_memory() runs from a DT notifier in response to an update to the ibm,dynamic-memory property of the /ibm,dynamic-reconfiguration-memory node. This property is an older less compact format than the ibm,dynamic-memory-v2 property used in most currently supported firmwares. There has never been an equivalent function for the v2 property. pseries_update_drconf_memory() compares the 'assigned' flag for each LMB in the old vs new properties and adds or removes the block accordingly. However it appears to be of no actual utility: * Partition suspension and PRRNs are specified only to change LMBs' NUMA affinity information. This notifier should be a no-op for those scenarios since the assigned flags should not change. * The memory hotplug/DLPAR path has a hack which short-circuits execution of the notifier: dlpar_memory() ... rtas_hp_event = true; drmem_update_dt() of_update_property() pseries_memory_notifier() pseries_update_drconf_memory() if (rtas_hp_event) return; So this code only makes sense as a relic of the time when more of the DLPAR workflow took place in user space. I don't see a purpose for it now. Signed-off-by: Nathan Lynch --- .../platforms/pseries/hotplug-memory.c| 65 +-- 1 file changed, 1 insertion(+), 64 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 67ece3ac9ac2..73a5dcd977e1 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -22,8 +22,6 @@ #include #include "pseries.h" -static bool rtas_hp_event; - unsigned long pseries_memory_block_size(void) { struct device_node *np; @@ -871,11 +869,8 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) break; } - if (!rc) { - rtas_hp_event = true; + if (!rc) rc = drmem_update_dt(); - rtas_hp_event = false; - } unlock_device_hotplug(); return rc; @@ -911,60 +906,6 @@ static int pseries_add_mem_node(struct device_node *np) return (ret < 0) ? -EINVAL : 0; } -static int pseries_update_drconf_memory(struct of_reconfig_data *pr) -{ - struct of_drconf_cell_v1 *new_drmem, *old_drmem; - unsigned long memblock_size; - u32 entries; - __be32 *p; - int i, rc = -EINVAL; - - if (rtas_hp_event) - return 0; - - memblock_size = pseries_memory_block_size(); - if (!memblock_size) - return -EINVAL; - - if (!pr->old_prop) - return 0; - - p = (__be32 *) pr->old_prop->value; - if (!p) - return -EINVAL; - - /* The first int of the property is the number of lmb's described -* by the property. This is followed by an array of of_drconf_cell -* entries. Get the number of entries and skip to the array of -* of_drconf_cell's. -*/ - entries = be32_to_cpu(*p++); - old_drmem = (struct of_drconf_cell_v1 *)p; - - p = (__be32 *)pr->prop->value; - p++; - new_drmem = (struct of_drconf_cell_v1 *)p; - - for (i = 0; i < entries; i++) { - if ((be32_to_cpu(old_drmem[i].flags) & DRCONF_MEM_ASSIGNED) && - (!(be32_to_cpu(new_drmem[i].flags) & DRCONF_MEM_ASSIGNED))) { - rc = pseries_remove_memblock( - be64_to_cpu(old_drmem[i].base_addr), -memblock_size); - break; - } else if ((!(be32_to_cpu(old_drmem[i].flags) & - DRCONF_MEM_ASSIGNED)) && - (be32_to_cpu(new_drmem[i].flags) & - DRCONF_MEM_ASSIGNED)) { - rc = memblock_add(be64_to_cpu(old_drmem[i].base_addr), - memblock_size); - rc = (rc < 0) ? 
-EINVAL : 0; - break; - } - } - return rc; -} - static int pseries_memory_notifier(struct notifier_block *nb, unsigned long action, void *data) { @@ -978,10 +919,6 @@ static int pseries_memory_notifier(struct notifier_block *nb, case OF_RECONFIG_DETACH_NODE: err = pseries_remove_mem_node(rd->dn); break; - case OF_RECONFIG_UPDATE_PROPERTY: - if (!strcmp(rd->prop->name, "ibm,dynamic-memory")) - err = pseries_update_drconf_memory(rd); - break; } return notifier_from_errno(err); } -- 2.25.4
[PATCH 04/18] powerpc/numa: remove unreachable topology update code
Since the topology_updates_enabled flag is now always false, remove it and the code which has become unreachable. This is the minimum change that prevents 'defined but unused' warnings emitted by the compiler after stubbing out the start/stop_topology_updates() functions. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 149 + 1 file changed, 2 insertions(+), 147 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 34d95de77bdd..9e20f12e6caf 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -984,8 +984,6 @@ static int __init early_numa(char *p) } early_param("numa", early_numa); -static const bool topology_updates_enabled; - #ifdef CONFIG_MEMORY_HOTPLUG /* * Find the node associated with a hot added memory section for @@ -1133,7 +1131,6 @@ struct topology_update_data { #define TOPOLOGY_DEF_TIMER_SECS60 -static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS]; static cpumask_t cpu_associativity_changes_mask; static int vphn_enabled; static int prrn_enabled; @@ -1158,63 +1155,6 @@ int timed_topology_update(int nsecs) return 0; } -/* - * Store the current values of the associativity change counters in the - * hypervisor. - */ -static void setup_cpu_associativity_change_counters(void) -{ - int cpu; - - /* The VPHN feature supports a maximum of 8 reference points */ - BUILD_BUG_ON(MAX_DISTANCE_REF_POINTS > 8); - - for_each_possible_cpu(cpu) { - int i; - u8 *counts = vphn_cpu_change_counts[cpu]; - volatile u8 *hypervisor_counts = lppaca_of(cpu).vphn_assoc_counts; - - for (i = 0; i < distance_ref_points_depth; i++) - counts[i] = hypervisor_counts[i]; - } -} - -/* - * The hypervisor maintains a set of 8 associativity change counters in - * the VPA of each cpu that correspond to the associativity levels in the - * ibm,associativity-reference-points property. When an associativity - * level changes, the corresponding counter is incremented. - * - * Set a bit in cpu_associativity_changes_mask for each cpu whose home - * node associativity levels have changed. - * - * Returns the number of cpus with unhandled associativity changes. - */ -static int update_cpu_associativity_changes_mask(void) -{ - int cpu; - cpumask_t *changes = _associativity_changes_mask; - - for_each_possible_cpu(cpu) { - int i, changed = 0; - u8 *counts = vphn_cpu_change_counts[cpu]; - volatile u8 *hypervisor_counts = lppaca_of(cpu).vphn_assoc_counts; - - for (i = 0; i < distance_ref_points_depth; i++) { - if (hypervisor_counts[i] != counts[i]) { - counts[i] = hypervisor_counts[i]; - changed = 1; - } - } - if (changed) { - cpumask_or(changes, changes, cpu_sibling_mask(cpu)); - cpu = cpu_last_thread_sibling(cpu); - } - } - - return cpumask_weight(changes); -} - /* * Retrieve the new associativity information for a virtual processor's * home node. 
@@ -1498,16 +1438,6 @@ static void topology_schedule_update(void) schedule_work(_work); } -static void topology_timer_fn(struct timer_list *unused) -{ - if (prrn_enabled && cpumask_weight(_associativity_changes_mask)) - topology_schedule_update(); - else if (vphn_enabled) { - if (update_cpu_associativity_changes_mask() > 0) - topology_schedule_update(); - reset_topology_timer(); - } -} static struct timer_list topology_timer; static void reset_topology_timer(void) @@ -1516,69 +1446,12 @@ static void reset_topology_timer(void) mod_timer(_timer, jiffies + topology_timer_secs * HZ); } -#ifdef CONFIG_SMP - -static int dt_update_callback(struct notifier_block *nb, - unsigned long action, void *data) -{ - struct of_reconfig_data *update = data; - int rc = NOTIFY_DONE; - - switch (action) { - case OF_RECONFIG_UPDATE_PROPERTY: - if (of_node_is_type(update->dn, "cpu") && - !of_prop_cmp(update->prop->name, "ibm,associativity")) { - u32 core_id; - of_property_read_u32(update->dn, "reg", _id); - rc = dlpar_cpu_readd(core_id); - rc = NOTIFY_OK; - } - break; - } - - return rc; -} - -static struct notifier_block dt_update_nb = { - .notifier_call = dt_update_callback, -}; - -#endif - /* * Start polling for associativity changes. */ int start_topology_update(void) { -
[PATCH 09/18] powerpc/numa: stub out numa_update_cpu_topology()
Previous changes have removed the code which sets bits in cpu_associativity_changes_mask and thus it is never modifed at runtime. From this we can reason that numa_update_cpu_topology() always returns 0 without doing anything. Remove the body of numa_update_cpu_topology() and remove all code which becomes unreachable as a result. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 193 + 1 file changed, 1 insertion(+), 192 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8749d7f2b1a6..b220e5b60140 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1122,14 +1122,6 @@ u64 memory_hotplug_max(void) /* Virtual Processor Home Node (VPHN) support */ #ifdef CONFIG_PPC_SPLPAR -struct topology_update_data { - struct topology_update_data *next; - unsigned int cpu; - int old_nid; - int new_nid; -}; - -static cpumask_t cpu_associativity_changes_mask; static int topology_inited; /* @@ -1219,192 +1211,9 @@ int find_and_online_cpu_nid(int cpu) return new_nid; } -/* - * Update the CPU maps and sysfs entries for a single CPU when its NUMA - * characteristics change. This function doesn't perform any locking and is - * only safe to call from stop_machine(). - */ -static int update_cpu_topology(void *data) -{ - struct topology_update_data *update; - unsigned long cpu; - - if (!data) - return -EINVAL; - - cpu = smp_processor_id(); - - for (update = data; update; update = update->next) { - int new_nid = update->new_nid; - if (cpu != update->cpu) - continue; - - unmap_cpu_from_node(cpu); - map_cpu_to_node(cpu, new_nid); - set_cpu_numa_node(cpu, new_nid); - set_cpu_numa_mem(cpu, local_memory_node(new_nid)); - vdso_getcpu_init(); - } - - return 0; -} - -static int update_lookup_table(void *data) -{ - struct topology_update_data *update; - - if (!data) - return -EINVAL; - - /* -* Upon topology update, the numa-cpu lookup table needs to be updated -* for all threads in the core, including offline CPUs, to ensure that -* future hotplug operations respect the cpu-to-node associativity -* properly. -*/ - for (update = data; update; update = update->next) { - int nid, base, j; - - nid = update->new_nid; - base = cpu_first_thread_sibling(update->cpu); - - for (j = 0; j < threads_per_core; j++) { - update_numa_cpu_lookup_table(base + j, nid); - } - } - - return 0; -} - -/* - * Update the node maps and sysfs entries for each cpu whose home node - * has changed. Returns 1 when the topology has changed, and 0 otherwise. - * - * cpus_locked says whether we already hold cpu_hotplug_lock. - */ int numa_update_cpu_topology(bool cpus_locked) { - unsigned int cpu, sibling, changed = 0; - struct topology_update_data *updates, *ud; - cpumask_t updated_cpus; - struct device *dev; - int weight, new_nid, i = 0; - - if (topology_inited) - return 0; - - weight = cpumask_weight(_associativity_changes_mask); - if (!weight) - return 0; - - updates = kcalloc(weight, sizeof(*updates), GFP_KERNEL); - if (!updates) - return 0; - - cpumask_clear(_cpus); - - for_each_cpu(cpu, _associativity_changes_mask) { - /* -* If siblings aren't flagged for changes, updates list -* will be too short. Skip on this update and set for next -* update. 
-*/ - if (!cpumask_subset(cpu_sibling_mask(cpu), - _associativity_changes_mask)) { - pr_info("Sibling bits not set for associativity " - "change, cpu%d\n", cpu); - cpumask_or(_associativity_changes_mask, - _associativity_changes_mask, - cpu_sibling_mask(cpu)); - cpu = cpu_last_thread_sibling(cpu); - continue; - } - - new_nid = find_and_online_cpu_nid(cpu); - - if (new_nid == numa_cpu_lookup_table[cpu]) { - cpumask_andnot(_associativity_changes_mask, - _associativity_changes_mask, - cpu_sibling_mask(cpu)); - dbg("Assoc chg gives same node %d for cpu%d\n", - new_nid, cpu); - cpu = cpu_last_th
[PATCH 08/18] powerpc/numa: remove vphn_enabled and prrn_enabled internal flags
These flags are always zero now; remove them and suitably adjust the remaining references to them. Signed-off-by: Nathan Lynch --- arch/powerpc/mm/numa.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8415481a7f13..8749d7f2b1a6 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1130,8 +1130,6 @@ struct topology_update_data { }; static cpumask_t cpu_associativity_changes_mask; -static const int vphn_enabled; -static const int prrn_enabled; static int topology_inited; /* @@ -1292,7 +1290,7 @@ int numa_update_cpu_topology(bool cpus_locked) struct device *dev; int weight, new_nid, i = 0; - if (!prrn_enabled && !vphn_enabled && topology_inited) + if (topology_inited) return 0; weight = cpumask_weight(_associativity_changes_mask); @@ -1432,7 +1430,7 @@ int stop_topology_update(void) int prrn_is_enabled(void) { - return prrn_enabled; + return 0; } static int topology_update_init(void) -- 2.25.4
[PATCH 15/18] powerpc/pseries: remove prrn special case from DT update path
pseries_devicetree_update() is no longer called with PRRN_SCOPE. The purpose of prrn_update_node() was to remove and then add back a LMB whose NUMA assignment had changed. This has never been reliable, and this codepath has been default-disabled for several releases. Remove prrn_update_node(). Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 27 --- 1 file changed, 27 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index c0b09f6f0ae3..78cd772a579b 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -244,29 +244,6 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index) return rc; } -static void prrn_update_node(__be32 phandle) -{ - struct pseries_hp_errorlog hp_elog; - struct device_node *dn; - - /* -* If a node is found from a the given phandle, the phandle does not -* represent the drc index of an LMB and we can ignore. -*/ - dn = of_find_node_by_phandle(be32_to_cpu(phandle)); - if (dn) { - of_node_put(dn); - return; - } - - hp_elog.resource = PSERIES_HP_ELOG_RESOURCE_MEM; - hp_elog.action = PSERIES_HP_ELOG_ACTION_READD; - hp_elog.id_type = PSERIES_HP_ELOG_ID_DRC_INDEX; - hp_elog._drc_u.drc_index = phandle; - - handle_dlpar_errorlog(_elog); -} - int pseries_devicetree_update(s32 scope) { char *rtas_buf; @@ -305,10 +282,6 @@ int pseries_devicetree_update(s32 scope) break; case UPDATE_DT_NODE: update_dt_node(phandle, scope); - - if (scope == PRRN_SCOPE) - prrn_update_node(phandle); - break; case ADD_DT_NODE: drc_index = *data++; -- 2.25.4
[PATCH 14/18] powerpc/numa: remove arch_update_cpu_topology
Since arch_update_cpu_topology() doesn't do anything on powerpc now, remove it and associated dead code. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 6 -- arch/powerpc/mm/numa.c | 10 -- 2 files changed, 16 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 658aad65912b..b2c346c5e16f 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -43,7 +43,6 @@ extern void __init dump_numa_cpu_topology(void); extern int sysfs_add_device_to_node(struct device *dev, int nid); extern void sysfs_remove_device_from_node(struct device *dev, int nid); -extern int numa_update_cpu_topology(bool cpus_locked); static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { @@ -78,11 +77,6 @@ static inline void sysfs_remove_device_from_node(struct device *dev, { } -static inline int numa_update_cpu_topology(bool cpus_locked) -{ - return 0; -} - static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {} static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 26fcc947dd2d..e437a9ac4956 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1200,16 +1200,6 @@ int find_and_online_cpu_nid(int cpu) return new_nid; } -int numa_update_cpu_topology(bool cpus_locked) -{ - return 0; -} - -int arch_update_cpu_topology(void) -{ - return numa_update_cpu_topology(true); -} - static int topology_update_init(void) { topology_inited = 1; -- 2.25.4
[PATCH 10/18] powerpc/numa: remove timed_topology_update()
timed_topology_update is a no-op now, so remove it and all call sites. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 5 - arch/powerpc/mm/numa.c | 9 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 -- 3 files changed, 16 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 2db7ba789720..379e2cc3789f 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -97,7 +97,6 @@ extern int start_topology_update(void); extern int stop_topology_update(void); extern int prrn_is_enabled(void); extern int find_and_online_cpu_nid(int cpu); -extern int timed_topology_update(int nsecs); #else static inline int start_topology_update(void) { @@ -115,10 +114,6 @@ static inline int find_and_online_cpu_nid(int cpu) { return 0; } -static inline int timed_topology_update(int nsecs) -{ - return 0; -} #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */ diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index b220e5b60140..6c579ac3e679 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1124,14 +1124,6 @@ u64 memory_hotplug_max(void) #ifdef CONFIG_PPC_SPLPAR static int topology_inited; -/* - * Change polling interval for associativity changes. - */ -int timed_topology_update(int nsecs) -{ - return 0; -} - /* * Retrieve the new associativity information for a virtual processor's * home node. @@ -1147,7 +1139,6 @@ static long vphn_get_associativity(unsigned long cpu, switch (rc) { case H_SUCCESS: dbg("VPHN hcall succeeded. Reset polling...\n"); - timed_topology_update(0); goto out; case H_FUNCTION: diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index d4b346355bb9..dbfabb185eb5 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -263,7 +263,6 @@ static int dlpar_offline_cpu(struct device_node *dn) break; cpu_maps_update_done(); - timed_topology_update(1); rc = device_offline(get_cpu_device(cpu)); if (rc) goto out; @@ -302,7 +301,6 @@ static int dlpar_online_cpu(struct device_node *dn) if (get_hard_smp_processor_id(cpu) != thread) continue; cpu_maps_update_done(); - timed_topology_update(1); find_and_online_cpu_nid(cpu); rc = device_online(get_cpu_device(cpu)); if (rc) { -- 2.25.4
[PATCH 13/18] powerpc/numa: remove prrn_is_enabled()
All users of this prrn_is_enabled() are gone; remove it. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 5 - arch/powerpc/mm/numa.c | 5 - 2 files changed, 10 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 537c638582eb..658aad65912b 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -93,13 +93,8 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) -extern int prrn_is_enabled(void); extern int find_and_online_cpu_nid(int cpu); #else -static inline int prrn_is_enabled(void) -{ - return 0; -} static inline int find_and_online_cpu_nid(int cpu) { return 0; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dec7ce3b5e67..26fcc947dd2d 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1210,11 +1210,6 @@ int arch_update_cpu_topology(void) return numa_update_cpu_topology(true); } -int prrn_is_enabled(void) -{ - return 0; -} - static int topology_update_init(void) { topology_inited = 1; -- 2.25.4
[PATCH 02/18] powerpc/rtas: don't online CPUs for partition suspend
Partition suspension, used for hibernation and migration, requires that the OS place all but one of the LPAR's processor threads into one of two states prior to calling the ibm,suspend-me RTAS function: * the architected offline state (via RTAS stop-self); or * the H_JOIN hcall, which does not return until the partition resumes execution Using H_CEDE as the offline mode, introduced by commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state"), means that any threads which are offline from Linux's point of view must be moved to one of those two states before a partition suspension can proceed. This was eventually addressed in commit 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation"), which added code to temporarily bring up any offline processor threads so they can call H_JOIN. Conceptually this is fine, but the implementation has had multiple races with cpu hotplug operations initiated from user space[1][2][3], the error handling is fragile, and it generates user-visible cpu hotplug events which is a lot of noise for a platform feature that's supposed to minimize disruption to workloads. With commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") reverted, this code becomes unnecessary, so remove it. Since any offline CPUs now are truly offline from the platform's point of view, it is no longer necessary to bring up CPUs only to have them call H_JOIN and then go offline again upon resuming. Only active threads are required to call H_JOIN; stopped threads can be left alone. [1] commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") [2] commit 9fb603050ffd ("powerpc/rtas: retry when cpu offline races with suspend/migration") [3] commit dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration") Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 2 - arch/powerpc/kernel/rtas.c | 122 +-- arch/powerpc/platforms/pseries/suspend.c | 22 +--- 3 files changed, 3 insertions(+), 143 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 014968f25f7e..0107d724e9da 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -253,8 +253,6 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -extern int rtas_online_cpus_mask(cpumask_var_t cpus); -extern int rtas_offline_cpus_mask(cpumask_var_t cpus); extern int rtas_ibm_suspend_me(u64 handle); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index a09eba03f180..806d554ce357 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -843,96 +843,6 @@ static void rtas_percpu_suspend_me(void *info) __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); } -enum rtas_cpu_state { - DOWN, - UP, -}; - -#ifndef CONFIG_SMP -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - if (!cpumask_empty(cpus)) { - cpumask_clear(cpus); - return -EINVAL; - } else - return 0; -} -#else -/* On return cpumask will be altered to indicate CPUs changed. 
- * CPUs with states changed will be set in the mask, - * CPUs with status unchanged will be unset in the mask. */ -static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, - cpumask_var_t cpus) -{ - int cpu; - int cpuret = 0; - int ret = 0; - - if (cpumask_empty(cpus)) - return 0; - - for_each_cpu(cpu, cpus) { - struct device *dev = get_cpu_device(cpu); - - switch (state) { - case DOWN: - cpuret = device_offline(dev); - break; - case UP: - cpuret = device_online(dev); - break; - } - if (cpuret < 0) { - pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", - __func__, - ((state == UP) ? "up" : "down"), - cpu, cpuret); - if (!ret) - ret = cpuret; - if (state == UP) { - /* clear bits for unchanged cpus, return */ -
[PATCH 17/18] powerpc/pseries: remove dlpar_cpu_readd()
dlpar_cpu_readd() is unused now. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/topology.h | 1 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 19 --- 2 files changed, 20 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index b2c346c5e16f..f0b6300e7dd3 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -115,7 +115,6 @@ int get_physical_package_id(int cpu); #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_core_id(cpu) (cpu_to_core_id(cpu)) -int dlpar_cpu_readd(int cpu); #endif #endif diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index dbfabb185eb5..4bad7a83addc 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -779,25 +779,6 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add) return rc; } -int dlpar_cpu_readd(int cpu) -{ - struct device_node *dn; - struct device *dev; - u32 drc_index; - int rc; - - dev = get_cpu_device(cpu); - dn = dev->of_node; - - rc = of_property_read_u32(dn, "ibm,my-drc-index", _index); - - rc = dlpar_cpu_remove_by_index(drc_index); - if (!rc) - rc = dlpar_cpu_add(drc_index); - - return rc; -} - int dlpar_cpu(struct pseries_hp_errorlog *hp_elog) { u32 count, drc_index; -- 2.25.4
Re: [PATCH 1/2] powerpc/pseries: remove cede offline state for CPUs
Hi Gautham, Gautham R Shenoy writes: > On Mon, Jun 01, 2020 at 11:31:39PM -0500, Nathan Lynch wrote: >> This effectively reverts commit 3aa565f53c39 ("powerpc/pseries: Add >> hooks to put the CPU into an appropriate offline state"), which added >> an offline mode for CPUs which uses the H_CEDE hcall instead of the >> architected stop-self RTAS function in order to facilitate "folding" >> of dedicated mode processors on PowerVM platforms to achieve energy >> savings. This has been the default offline mode since its >> introduction. >> >> There's nothing about stop-self that would prevent the hypervisor from >> achieving the energy savings available via H_CEDE, so the original >> premise of this change appears to be flawed. > > > IIRC, back in 2009, when the Extended-CEDE was introduced, it couldn't > be exposed via the cpuidle subsystem since this state needs an > explicit H_PROD as opposed to the H_IPI which wakes up the regular > CEDE call. So, the alterative was to use the CPU-Hotplug way by having > a userspace daemon fold the cores which weren't needed currently and > bring them back online when they were needed. Back then, Long Term > CEDE was definitely faster compared to stop-self call (It is a pity > that I didn't post the numbers when I wrote the patch) and the > time-taken to unfold a core was definitely one of the concerns. > (https://lkml.org/lkml/2009/9/23/522). Thanks for the background. Ideally, we would understand what has changed since then... certainly the offline path would have been slower when pseries_cpu_die() slept for 200ms between issuing query-cpu-stopped-state calls, but that was changed in 2008 in b906cfa397fd ("powerpc/pseries: Fix cpu hotplug") so perhaps it wasn't a factor in your measurements. Even less sure about what could have made the online path more expensive. (For the benefit of anybody else referring to the 2009 discussion that you linked: that exchange is a bit strange to me because there seems to be a conception that stop-self releases the processor to the platform in a way that could prevent the OS from restarting it on demand. I don't believe this has ever been the case. Processor deallocation is a related but separate workflow involving different RTAS functions.) > The patch looks good to me. > Reviewed-by: Gautham R. Shenoy I appreciate the review, thanks!
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
Pingfan Liu writes: > On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch wrote: >> Pingfan Liu writes: >> > This will introduce extra dt updating payload for each involved lmb when >> > hotplug. >> > But it should be fine since drmem_update_dt() is memory based operation and >> > hotplug is not a hot path. >> >> This is great analysis but the performance implications of the change >> are grave. The add/remove paths here are already O(n) where n is the >> quantity of memory assigned to the LP, this change would make it O(n^2): >> >> dlpar_memory_add_by_count >> for_each_drmem_lmb <-- >> dlpar_add_lmb >> drmem_update_dt(_v1|_v2) >> for_each_drmem_lmb <-- >> >> Memory add/remove isn't a hot path but quadratic runtime complexity >> isn't acceptable. Its current performance is bad enough that I have > Yes, the quadratic runtime complexity sounds terrible. > And I am curious about the bug. Does the system have thousands of lmb? Yes. >> Not to mention we leak memory every time drmem_update_dt is called >> because we can't safely free device tree properties :-( > Do you know what block us to free it? It's a longstanding problem. References to device tree properties aren't counted or tracked so there's no way to safely free them unless the node itself is released. But the ibm,dynamic-reconfiguration-memory node does not ever go away and its properties are only subject to updates. Maybe there's a way to address the specific case of ibm,dynamic-reconfiguration-memory and the ibm,dynamic-memory(-v2) properties, instead of tackling the general problem. Regardless of all that, the drmem code needs better data structures and lookup functions.
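To make that last point concrete, here is the kind of lookup a better-organized drmem array could support. This is purely illustrative, not proposed kernel code: it assumes the LMB array were kept sorted by drc_index (it is not today), and the function name is invented.

/*
 * Sketch only: binary search instead of the linear
 * for_each_drmem_lmb() scan, assuming lmbs[] is sorted by drc_index.
 */
static struct drmem_lmb *drmem_lookup_by_drc_index(u32 drc_index)
{
	struct drmem_lmb *lmbs = drmem_info->lmbs;
	int lo = 0, hi = drmem_info->n_lmbs - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (lmbs[mid].drc_index == drc_index)
			return &lmbs[mid];
		if (lmbs[mid].drc_index < drc_index)
			lo = mid + 1;
		else
			hi = mid - 1;
	}

	return NULL;
}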
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
Pingfan Liu writes: > A bug is observed on pseries by taking the following steps on rhel: > -1. drmgr -c mem -r -q 5 > -2. echo c > /proc/sysrq-trigger > > And then, the failure looks like: > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/ > kdump: saving vmcore-dmesg.txt > kdump: saving vmcore-dmesg.txt complete > kdump: saving vmcore > Checking for memory holes : [ 0.0 %] / > Checking for memory holes : [100.0 %] | > Excluding unnecessary pages : [100.0 %] \ > Copying data : [ 0.3 %] - > eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! > EA=0x7fffba40 access=0x8004 current=makedumpfile > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > psize 2 pte=0xc0005504 > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 > access=0x8004 current=makedumpfile > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > psize 2 pte=0xc0005504 > [ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip > 7fffbbc4d7fc lr 00011356ca3c code 2 > [ 44.338548] Core dump to |/bin/false pipe failed > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error > $CORE_COLLECTOR /proc/vmcore > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete > kdump: saving vmcore failed > > * Root cause * > After analyzing, it turns out that in the current implementation, > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as > the code __remove_memory() comes before drmem_update_dt(). > So in kdump kernel, when read_from_oldmem() resorts to > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it > can be observed "Bus error" > > From a viewpoint of listener and publisher, the publisher notifies the > listener before data is ready. This introduces a problem where udev > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before > updating. And in capture kernel, makedumpfile will access the memory based > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb. > > * Fix * > In order to fix this issue, update dt before __remove_memory(), and > accordingly the same rule in hot-add path. > > This will introduce extra dt updating payload for each involved lmb when > hotplug. > But it should be fine since drmem_update_dt() is memory based operation and > hotplug is not a hot path. This is great analysis but the performance implications of the change are grave. The add/remove paths here are already O(n) where n is the quantity of memory assigned to the LP, this change would make it O(n^2): dlpar_memory_add_by_count for_each_drmem_lmb <-- dlpar_add_lmb drmem_update_dt(_v1|_v2) for_each_drmem_lmb <-- Memory add/remove isn't a hot path but quadratic runtime complexity isn't acceptable. Its current performance is bad enough that I have internal bugs open on it. Not to mention we leak memory every time drmem_update_dt is called because we can't safely free device tree properties :-( Also note that this sort of reverts (fixes?) 063b8b1251fd ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR request").
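For reference, the batching that 063b8b1251fd introduced looks roughly like the sketch below (simplified; do_request_lmb_ops() is a stand-in name for the existing add/remove-by-count/index helpers). The per-LMB work runs once over the affected LMBs, and the device tree property is regenerated a single time per request, which is what keeps the whole operation O(n).

/* Simplified sketch, not actual kernel code. */
static int dlpar_memory_batched(struct pseries_hp_errorlog *hp_elog)
{
	int rc;

	lock_device_hotplug();

	/* do_request_lmb_ops() stands in for the existing helpers;
	 * it walks the affected LMBs once: O(n) worst case. */
	rc = do_request_lmb_ops(hp_elog);

	/* Regenerate ibm,dynamic-memory(-v2) once per request, not per LMB. */
	if (!rc)
		rc = drmem_update_dt();

	unlock_device_hotplug();

	return rc;
}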
Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
"Aneesh Kumar K.V" writes: > This is the next version of the fixes for memory unplug on radix. > The issues and the fix are described in the actual patches. I guess this isn't actually causing problems at runtime right now, but I notice calls to resize_hpt_for_hotplug() from arch_add_memory() and arch_remove_memory(), which ought to be mmu-agnostic: int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; int rc; resize_hpt_for_hotplug(memblock_phys_mem_size()); start = (unsigned long)__va(start); rc = create_section_mapping(start, start + size, nid, params->pgprot); ...
[PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
The drmem lmb list can have hundreds of thousands of entries, and unfortunately lookups take the form of linear searches. As long as this is the case, traversals have the potential to monopolize the CPU and provoke lockup reports, workqueue stalls, and the like unless they explicitly yield. Rather than placing cond_resched() calls within various for_each_drmem_lmb() loop blocks in the code, put it in the iteration expression of the loop macro itself so users can't omit it. Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT format") Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/drmem.h | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h index 414d209f45bb..36d0ed04bda8 100644 --- a/arch/powerpc/include/asm/drmem.h +++ b/arch/powerpc/include/asm/drmem.h @@ -8,6 +8,8 @@ #ifndef _ASM_POWERPC_LMB_H #define _ASM_POWERPC_LMB_H +#include + struct drmem_lmb { u64 base_addr; u32 drc_index; @@ -26,8 +28,14 @@ struct drmem_lmb_info { extern struct drmem_lmb_info *drmem_info; +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb) +{ + cond_resched(); + return ++lmb; +} + #define for_each_drmem_lmb_in_range(lmb, start, end) \ - for ((lmb) = (start); (lmb) < (end); (lmb)++) + for ((lmb) = (start); (lmb) < (end); lmb = drmem_lmb_next(lmb)) #define for_each_drmem_lmb(lmb)\ for_each_drmem_lmb_in_range((lmb), \ -- 2.25.4
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Hi Laurent, Laurent Dufour writes: > Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >> The drmem lmb list can have hundreds of thousands of entries, and >> unfortunately lookups take the form of linear searches. As long as >> this is the case, traversals have the potential to monopolize the CPU >> and provoke lockup reports, workqueue stalls, and the like unless >> they explicitly yield. >> >> Rather than placing cond_resched() calls within various >> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >> expression of the loop macro itself so users can't omit it. > > Is that not too much to call cond_resched() on every LMB? > > Could that be less frequent, every 10, or 100, I don't really know ? Everything done within for_each_drmem_lmb is relatively heavyweight already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens of milliseconds. I don't think cond_resched() is an expensive check in this context.
Re: [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's
Pingfan Liu writes: > This patch prepares for the incoming patch which swaps the order of KOBJ_ > uevent and dt's updating. > > It has no functional effect, just groups lmb operation and memblock's in > order to insert dt updating operation easily, and makes it easier to > review. ... > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c > b/arch/powerpc/platforms/pseries/hotplug-memory.c > index 5d545b7..1a3ac3b 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c > @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *); > static int dlpar_remove_lmb(struct drmem_lmb *lmb) > { > unsigned long block_sz; > - int rc; > + phys_addr_t base_addr; > + int rc, nid; > > if (!lmb_is_removable(lmb)) > return -EINVAL; > @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) > if (rc) > return rc; > > + base_addr = lmb->base_addr; > + nid = lmb->nid; > block_sz = pseries_memory_block_size(); > > - __remove_memory(lmb->nid, lmb->base_addr, block_sz); > - > - /* Update memory regions for memory remove */ > - memblock_remove(lmb->base_addr, block_sz); > - > invalidate_lmb_associativity_index(lmb); > lmb_clear_nid(lmb); > lmb->flags &= ~DRCONF_MEM_ASSIGNED; > > + __remove_memory(nid, base_addr, block_sz); > + > + /* Update memory regions for memory remove */ > + memblock_remove(base_addr, block_sz); > + > return 0; > } I don't understand; the commit message should not claim this has no functional effect when it changes the order of operations like this. Maybe this is an improvement over the current behavior, but it's not explained why it would be.
Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
Michael Ellerman writes: > Nathan Lynch writes: >> Laurent Dufour writes: >>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit : >>>> The drmem lmb list can have hundreds of thousands of entries, and >>>> unfortunately lookups take the form of linear searches. As long as >>>> this is the case, traversals have the potential to monopolize the CPU >>>> and provoke lockup reports, workqueue stalls, and the like unless >>>> they explicitly yield. >>>> >>>> Rather than placing cond_resched() calls within various >>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration >>>> expression of the loop macro itself so users can't omit it. >>> >>> Is that not too much to call cond_resched() on every LMB? >>> >>> Could that be less frequent, every 10, or 100, I don't really know ? >> >> Everything done within for_each_drmem_lmb is relatively heavyweight >> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens >> of milliseconds. I don't think cond_resched() is an expensive check in >> this context. > > Hmm, mostly. > > But there are quite a few cases like drmem_update_dt_v1(): > > for_each_drmem_lmb(lmb) { > dr_cell->base_addr = cpu_to_be64(lmb->base_addr); > dr_cell->drc_index = cpu_to_be32(lmb->drc_index); > dr_cell->aa_index = cpu_to_be32(lmb->aa_index); > dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb)); > > dr_cell++; > } > > Which will compile to a pretty tight loop at the moment. > > Or drmem_update_dt_v2() which has two loops over all lmbs. > > And although the actual TIF check is cheap the function call to do it is > not free. > > So I worry this is going to make some of those long loops take even > longer. That's fair, and I was wrong - some of the loop bodies are relatively simple, not doing allocations or taking locks, etc. One way to deal is to keep for_each_drmem_lmb() as-is and add a new iterator that can reschedule, e.g. for_each_drmem_lmb_slow(). On the other hand... it's probably not too strong to say that the drmem/hotplug code is in crisis with respect to correctness and algorithmic complexity, so those are my overriding concerns right now. Yes, this change will pessimize loops that are reinitializing the entire drmem_lmb array on every DLPAR operation, but: 1. it doesn't make any user of for_each_drmem_lmb() less correct; 2. why is this code doing that in the first place, other than to accommodate a poor data structure choice? The duration of the system calls where this code runs are measured in minutes or hours on large configurations because of all the behaviors that are at best O(n) with the amount of memory assigned to the partition. For simplicity's sake I'd rather defer lower-level performance considerations like this until the drmem data structures' awful lookup properties are fixed -- hopefully in the 5.10 timeframe. Thoughts?
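If the per-iteration call ever does show up in profiles, one compromise (a hypothetical sketch, not what the posted patch does) is to keep the macro but only yield every N entries:

static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb,
                                               const struct drmem_lmb *start)
{
        ++lmb;
        if ((lmb - start) % 128 == 0)   /* 128 is an arbitrary stride */
                cond_resched();
        return lmb;
}

#define for_each_drmem_lmb_in_range(lmb, start, end)                    \
        for ((lmb) = (start); (lmb) < (end); (lmb) = drmem_lmb_next((lmb), (start)))

That keeps the tight loops close to their current cost while still bounding the time between reschedule points.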
Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
Tyrel Datwyler writes: >> --- a/arch/powerpc/mm/numa.c >> +++ b/arch/powerpc/mm/numa.c >> @@ -897,7 +897,7 @@ static void __init find_possible_nodes(void) >> return; >> >> if (of_property_read_u32_index(rtas, >> -"ibm,max-associativity-domains", >> +"ibm,current-associativity-domains", > > I'm not sure ibm,current-associativity-domains is guaranteed to exist on older > firmware. You may need check that it exists and fall back to > ibm,max-associativity-domains in the event it doesn't Yes. Looks like it's a PowerVM-specific property.
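A hedged sketch of that fallback (read_assoc_domains() is a made-up helper name; the surrounding find_possible_nodes() changes are left out):

static int read_assoc_domains(struct device_node *rtas, u32 index, u32 *out)
{
        int rc;

        /* preferred where available (PowerVM): the current domain counts */
        rc = of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
                                        index, out);
        if (rc)
                /* older firmware: fall back to the original property */
                rc = of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
                                                index, out);
        return rc;
}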
[PATCH] powerpc/pseries/dlpar: handle ibm,configure-connector delay status
dlpar_configure_connector() has two problems in its handling of ibm,configure-connector's return status: 1. When the status is -2 (busy, call again), we call ibm,configure-connector again immediately without checking whether to schedule, which can result in monopolizing the CPU. 2. Extended delay status (9900..9905) goes completely unhandled, causing the configuration to unnecessarily terminate. Fix both of these issues by using rtas_busy_delay(). Fixes: ab519a011caa ("powerpc/pseries: Kernel DLPAR Infrastructure") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/dlpar.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 16e86ba8aa20..f6b7749d6ada 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -127,7 +127,6 @@ void dlpar_free_cc_nodes(struct device_node *dn) #define NEXT_PROPERTY 3 #define PREV_PARENT 4 #define MORE_MEMORY 5 -#define CALL_AGAIN -2 #define ERR_CFG_USE -9003 struct device_node *dlpar_configure_connector(__be32 drc_index, @@ -168,6 +167,9 @@ struct device_node *dlpar_configure_connector(__be32 drc_index, spin_unlock(_data_buf_lock); + if (rtas_busy_delay(rc)) + continue; + switch (rc) { case COMPLETE: break; @@ -216,9 +218,6 @@ struct device_node *dlpar_configure_connector(__be32 drc_index, last_dn = last_dn->parent; break; - case CALL_AGAIN: - break; - case MORE_MEMORY: case ERR_CFG_USE: default: -- 2.29.2
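For reference, rtas_busy_delay() sleeps for the hinted interval on an extended-delay status (9900..9905), handles the plain busy status (-2), and returns nonzero whenever the call should be retried, so the usual idiom looks like this (a sketch; "example-service" is a placeholder name and token validity checking is elided):

        int rc;

        do {
                rc = rtas_call(rtas_token("example-service"), 0, 1, NULL);
        } while (rtas_busy_delay(rc));

dlpar_configure_connector() can't use the plain do/while because it has real work to do between calls, hence the "continue" form in the patch below.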
Re: [PATCH 01/29] powerpc/rtas: move rtas_call_reentrant() out of pseries guards
Nathan Lynch writes: > rtas_call_reentrant() isn't platform-dependent; move it out of > CONFIG_PPC_PSERIES-guarded code. As reported by the 0-day CI, this is mistaken, and it breaks non-pseries builds: >> arch/powerpc/kernel/rtas.c:938:21: error: no member named >> 'rtas_args_reentrant' in 'struct paca_struct' args = local_paca->rtas_args_reentrant; ~~ ^ 1 error generated. https://lore.kernel.org/linuxppc-dev/202012050432.sfbbjwmw-...@intel.com/ I'll drop this and reroll the series.
[PATCH v2 01/28] powerpc/rtas: prevent suspend-related sys_rtas use on LE
While drmgr has had work in some areas to make its RTAS syscall interactions endian-neutral, its code for performing partition migration via the syscall has never worked on LE. While it is able to complete ibm,suspend-me successfully, it crashes when attempting the subsequent ibm,update-nodes call. drmgr is the only known (or plausible) user of ibm,suspend-me, ibm,update-nodes, and ibm,update-properties, so allow them only in big-endian configurations. Signed-off-by: Nathan Lynch --- arch/powerpc/kernel/rtas.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 954f41676f69..4ed64aba37d6 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1050,9 +1050,11 @@ static struct rtas_filter rtas_filters[] __ro_after_init = { { "set-time-for-power-on", -1, -1, -1, -1, -1 }, { "ibm,set-system-parameter", -1, 1, -1, -1, -1 }, { "set-time-of-day", -1, -1, -1, -1, -1 }, +#ifdef CONFIG_CPU_BIG_ENDIAN { "ibm,suspend-me", -1, -1, -1, -1, -1 }, { "ibm,update-nodes", -1, 0, -1, -1, -1, 4096 }, { "ibm,update-properties", -1, 0, -1, -1, -1, 4096 }, +#endif { "ibm,physical-attestation", -1, 0, 1, -1, -1 }, }; -- 2.28.0
[PATCH v2 05/28] powerpc/rtas: add rtas_activate_firmware()
Provide a documented wrapper function for the ibm,activate-firmware service, which must be called after a partition migration or hibernation. If the function is absent or the call fails, the OS will continue to run normally with the current firmware, so there is no need to perform any recovery. Just log it and continue. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 + arch/powerpc/kernel/rtas.c | 30 ++ 2 files changed, 31 insertions(+) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index b43165fc6c2a..fdefe6a974eb 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -247,6 +247,7 @@ extern void __noreturn rtas_restart(char *cmd); extern void rtas_power_off(void); extern void __noreturn rtas_halt(void); extern void rtas_os_term(char *str); +void rtas_activate_firmware(void); extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_sensor_fast(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 8a618a3c4beb..3a740ae933f8 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -798,6 +798,36 @@ void rtas_os_term(char *str) printk(KERN_EMERG "ibm,os-term call failed %d\n", status); } +/** + * rtas_activate_firmware() - Activate a new version of firmware. + * + * Activate a new version of partition firmware. The OS must call this + * after resuming from a partition hibernation or migration in order + * to maintain the ability to perform live firmware updates. It's not + * catastrophic for this method to be absent or to fail; just log the + * condition in that case. + * + * Context: This function may sleep. + */ +void rtas_activate_firmware(void) +{ + int token; + int fwrc; + + token = rtas_token("ibm,activate-firmware"); + if (token == RTAS_UNKNOWN_SERVICE) { + pr_notice("ibm,activate-firmware method unavailable\n"); + return; + } + + do { + fwrc = rtas_call(token, 0, 1, NULL); + } while (rtas_busy_delay(fwrc)); + + if (fwrc) + pr_err("ibm,activate-firmware failed (%i)\n", fwrc); +} + static int ibm_suspend_me_token = RTAS_UNKNOWN_SERVICE; #ifdef CONFIG_PPC_PSERIES static int __rtas_suspend_last_cpu(struct rtas_suspend_me_data *data, int wake_when_done) -- 2.28.0
[PATCH v2 09/28] powerpc/pseries/mobility: error message improvements
- Convert printk(KERN_ERR) to pr_err(). - Include errno in property update failure message. - Remove reference to "Post-mobility" from device tree update message: with pr_err() it will have a "mobility:" prefix. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 527a64e2d89f..31d81b7da961 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -208,8 +208,8 @@ static int update_dt_node(__be32 phandle, s32 scope) rc = update_dt_property(dn, , prop_name, vd, prop_data); if (rc) { - printk(KERN_ERR "Could not update %s" - " property\n", prop_name); + pr_err("updating %s property failed: %d\n", + prop_name, rc); } prop_data += vd; @@ -343,8 +343,7 @@ void post_mobility_fixup(void) rc = pseries_devicetree_update(MIGRATION_SCOPE); if (rc) - printk(KERN_ERR "Post-mobility device tree update " - "failed: %d\n", rc); + pr_err("device tree update failed: %d\n", rc); cacheinfo_rebuild(); -- 2.28.0
[PATCH v2 15/28] powerpc/rtas: dispatch partition migration requests to pseries
sys_rtas() cannot call ibm,suspend-me directly in the same way it handles other inputs. Instead it must dispatch the request to code that can first perform the H_JOIN sequence before any call to ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair amount of platform-specific code to implement this. Since a different, more robust implementation of the suspend sequence is now in the pseries platform code, we want to dispatch the request there. Note that invoking ibm,suspend-me via the RTAS syscall is all but deprecated; this change preserves ABI compatibility for old programs while providing to them the benefit of the new partition suspend implementation. This is a behavior change in that the kernel performs the device tree update and firmware activation before returning, but experimentation indicates this is tolerated fine by legacy user space. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 5 + arch/powerpc/kernel/rtas.c| 2 +- arch/powerpc/platforms/pseries/mobility.c | 5 + 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index fdefe6a974eb..3b52d8574fcc 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -279,8 +279,13 @@ extern time64_t last_rtas_event; extern int clobbering_unread_rtas_event(void); extern int pseries_devicetree_update(s32 scope); extern void post_mobility_fixup(void); +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); #else static inline int clobbering_unread_rtas_event(void) { return 0; } +static inline int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return -EINVAL; +} #endif #ifdef CONFIG_PPC_RTAS_DAEMON diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 3a740ae933f8..d4b048571728 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1272,7 +1272,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me_unsafe(handle); + rc = rtas_syscall_dispatch_ibm_suspend_me(handle); if (rc == -EAGAIN) args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); else if (rc == -EIO) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index fe7e35cdc9d5..e670180f311d 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -615,6 +615,11 @@ static int pseries_migrate_partition(u64 handle) return ret; } +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return pseries_migrate_partition(handle); +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) -- 2.28.0
[PATCH v2 20/28] powerpc/machdep: remove suspend_disable_cpu()
There are no users left of the suspend_disable_cpu() callback, remove it. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/machdep.h | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 475687f24f4a..cf6ebbc16cb4 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -207,7 +207,6 @@ struct machdep_calls { void (*suspend_disable_irqs)(void); void (*suspend_enable_irqs)(void); #endif - int (*suspend_disable_cpu)(void); #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE ssize_t (*cpu_probe)(const char *, size_t); -- 2.28.0
[PATCH v2 23/28] powerpc/rtas: remove unused rtas_suspend_last_cpu()
rtas_suspend_last_cpu() is now unused, remove it and __rtas_suspend_last_cpu() which also becomes unused. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - arch/powerpc/kernel/rtas.c | 43 - 2 files changed, 44 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 97ccb40fb09f..332e1000ca0f 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -256,7 +256,6 @@ extern bool rtas_indicator_present(int token, int *maxindex); extern int rtas_set_indicator(int indicator, int index, int new_value); extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); -extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me(int *fw_status); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index aedd46967b99..9a7d1bba3ef7 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -830,49 +830,6 @@ void rtas_activate_firmware(void) static int ibm_suspend_me_token = RTAS_UNKNOWN_SERVICE; #ifdef CONFIG_PPC_PSERIES -static int __rtas_suspend_last_cpu(struct rtas_suspend_me_data *data, int wake_when_done) -{ - u16 slb_size = mmu_slb_size; - int rc = H_MULTI_THREADS_ACTIVE; - int cpu; - - slb_set_size(SLB_MIN_SIZE); - printk(KERN_DEBUG "calling ibm,suspend-me on cpu %i\n", smp_processor_id()); - - while (rc == H_MULTI_THREADS_ACTIVE && !atomic_read(>done) && - !atomic_read(>error)) - rc = rtas_call(data->token, 0, 1, NULL); - - if (rc || atomic_read(>error)) { - printk(KERN_DEBUG "ibm,suspend-me returned %d\n", rc); - slb_set_size(slb_size); - } - - if (atomic_read(>error)) - rc = atomic_read(>error); - - atomic_set(>error, rc); - pSeries_coalesce_init(); - - if (wake_when_done) { - atomic_set(>done, 1); - - for_each_online_cpu(cpu) - plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); - } - - if (atomic_dec_return(>working) == 0) - complete(data->complete); - - return rc; -} - -int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data) -{ - atomic_inc(>working); - return __rtas_suspend_last_cpu(data, 0); -} - /** * rtas_call_reentrant() - Used for reentrant rtas calls * @token: Token for desired reentrant RTAS call -- 2.28.0
[PATCH v2 11/28] powerpc/pseries/mobility: extract VASI session polling logic
The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to the caller until the specified VASI suspend session state makes the transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest of separating concerns to prepare for a new implementation of the join/suspend sequence, extract VASI session polling logic into a couple of local functions. Waiting for the session state to reach H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures that we will never get an EAGAIN result necessitating a retry. No user-visible change in behavior is intended. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 69 +-- 1 file changed, 64 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 01ac7c03558e..573ed48b43d8 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -345,6 +345,66 @@ void post_mobility_fixup(void) return; } +static int poll_vasi_state(u64 handle, unsigned long *res) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + long hvrc; + int ret; + + hvrc = plpar_hcall(H_VASI_STATE, retbuf, handle); + switch (hvrc) { + case H_SUCCESS: + ret = 0; + *res = retbuf[0]; + break; + case H_PARAMETER: + ret = -EINVAL; + break; + case H_FUNCTION: + ret = -EOPNOTSUPP; + break; + case H_HARDWARE: + default: + pr_err("unexpected H_VASI_STATE result %ld\n", hvrc); + ret = -EIO; + break; + } + return ret; +} + +static int wait_for_vasi_session_suspending(u64 handle) +{ + unsigned long state; + int ret; + + /* +* Wait for transition from H_VASI_ENABLED to +* H_VASI_SUSPENDING. Treat anything else as an error. +*/ + while (true) { + ret = poll_vasi_state(handle, ); + + if (ret != 0 || state == H_VASI_SUSPENDING) { + break; + } else if (state == H_VASI_ENABLED) { + ssleep(1); + } else { + pr_err("unexpected H_VASI_STATE result %lu\n", state); + ret = -EIO; + break; + } + } + + /* +* Proceed even if H_VASI_STATE is unavailable. If H_JOIN or +* ibm,suspend-me are also unimplemented, we'll recover then. +*/ + if (ret == -EOPNOTSUPP) + ret = 0; + + return ret; +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) @@ -356,12 +416,11 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; - do { - rc = rtas_ibm_suspend_me_unsafe(streamid); - if (rc == -EAGAIN) - ssleep(1); - } while (rc == -EAGAIN); + rc = wait_for_vasi_session_suspending(streamid); + if (rc) + return rc; + rc = rtas_ibm_suspend_me_unsafe(streamid); if (rc) return rc; -- 2.28.0
[PATCH v2 04/28] powerpc/rtas: add rtas_ibm_suspend_me()
Now that the name is available, provide a simple wrapper for ibm,suspend-me which returns both a Linux errno and optionally the actual RTAS status to the caller. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 + arch/powerpc/kernel/rtas.c | 57 + 2 files changed, 58 insertions(+) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 8436ed01567b..b43165fc6c2a 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -258,6 +258,7 @@ extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me_unsafe(u64 handle); +int rtas_ibm_suspend_me(int *fw_status); struct rtc_time; extern time64_t rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 0a8e5dc2c108..8a618a3c4beb 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -684,6 +684,63 @@ int rtas_set_indicator_fast(int indicator, int index, int new_value) return rc; } +/** + * rtas_ibm_suspend_me() - Call ibm,suspend-me to suspend the LPAR. + * + * @fw_status: RTAS call status will be placed here if not NULL. + * + * rtas_ibm_suspend_me() should be called only on a CPU which has + * received H_CONTINUE from the H_JOIN hcall. All other active CPUs + * should be waiting to return from H_JOIN. + * + * rtas_ibm_suspend_me() may suspend execution of the OS + * indefinitely. Callers should take appropriate measures upon return, such as + * resetting watchdog facilities. + * + * Callers may choose to retry this call if @fw_status is + * %RTAS_THREADS_ACTIVE. + * + * Return: + * 0 - The partition has resumed from suspend, possibly after + * migration to a different host. + * -ECANCELED - The operation was aborted. + * -EAGAIN- There were other CPUs not in H_JOIN at the time of the call. + * -EBUSY - Some other condition prevented the suspend from succeeding. + * -EIO - Hardware/platform error. + */ +int rtas_ibm_suspend_me(int *fw_status) +{ + int fwrc; + int ret; + + fwrc = rtas_call(rtas_token("ibm,suspend-me"), 0, 1, NULL); + + switch (fwrc) { + case 0: + ret = 0; + break; + case RTAS_SUSPEND_ABORTED: + ret = -ECANCELED; + break; + case RTAS_THREADS_ACTIVE: + ret = -EAGAIN; + break; + case RTAS_NOT_SUSPENDABLE: + case RTAS_OUTSTANDING_COPROC: + ret = -EBUSY; + break; + case -1: + default: + ret = -EIO; + break; + } + + if (fw_status) + *fw_status = fwrc; + + return ret; +} + void __noreturn rtas_restart(char *cmd) { if (rtas_flash_term_hook) -- 2.28.0
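A minimal caller sketch (do_suspend_sketch() is a made-up name; this is essentially the shape the later stop_machine patch's do_suspend() takes):

static int do_suspend_sketch(void)
{
        int status;
        int ret;

        ret = rtas_ibm_suspend_me(&status);
        if (ret)
                pr_err("ibm,suspend-me: ret %d, fw status %d\n", ret, status);

        return ret;
}

Having both the errno and the raw RTAS status available lets callers decide their own retry policy on -EAGAIN / RTAS_THREADS_ACTIVE without re-parsing firmware codes.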
[PATCH v2 12/28] powerpc/pseries/mobility: use stop_machine for join/suspend
The partition suspend sequence as specified in the platform architecture requires that all active processor threads call H_JOIN, which: - suspends the calling thread until it is the target of an H_PROD; or - immediately returns H_CONTINUE, if the calling thread is the last to call H_JOIN. This thread is expected to call ibm,suspend-me to completely suspend the partition. Upon returning from ibm,suspend-me the calling thread must wake all others using H_PROD. rtas_ibm_suspend_me_unsafe() uses on_each_cpu() to implement this protocol, but because of its synchronizing nature this is susceptible to deadlock versus users of stop_machine() or other callers of on_each_cpu(). Not only is stop_machine() intended for use cases like this, it handles error propagation and allows us to keep the data shared between CPUs minimal: a single atomic counter which ensures exactly one CPU will wake the others from their joined states. Switch the migration code to use stop_machine() and a less complex local implementation of the H_JOIN/ibm,suspend-me logic, which carries additional benefits: - more informative error reporting, appropriately ratelimited - resets the lockup detector / watchdog on resume to prevent lockup warnings when the OS has been suspended for a time exceeding the threshold. Fixes: 91dc182ca6e2 ("[PATCH] powerpc: special-case ibm,suspend-me RTAS call") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 132 -- 1 file changed, 125 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 573ed48b43d8..5a3951626a96 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -12,9 +12,11 @@ #include #include #include +#include #include #include #include +#include #include #include #include @@ -405,6 +407,128 @@ static int wait_for_vasi_session_suspending(u64 handle) return ret; } +static void prod_single(unsigned int target_cpu) +{ + long hvrc; + int hwid; + + hwid = get_hard_smp_processor_id(target_cpu); + hvrc = plpar_hcall_norets(H_PROD, hwid); + if (hvrc == H_SUCCESS) + return; + pr_err_ratelimited("H_PROD of CPU %u (hwid %d) error: %ld\n", + target_cpu, hwid, hvrc); +} + +static void prod_others(void) +{ + unsigned int cpu; + + for_each_online_cpu(cpu) { + if (cpu != smp_processor_id()) + prod_single(cpu); + } +} + +static u16 clamp_slb_size(void) +{ + u16 prev = mmu_slb_size; + + slb_set_size(SLB_MIN_SIZE); + + return prev; +} + +static int do_suspend(void) +{ + u16 saved_slb_size; + int status; + int ret; + + pr_info("calling ibm,suspend-me on CPU %i\n", smp_processor_id()); + + /* +* The destination processor model may have fewer SLB entries +* than the source. We reduce mmu_slb_size to a safe minimum +* before suspending in order to minimize the possibility of +* programming non-existent entries on the destination. If +* suspend fails, we restore it before returning. On success +* the OF reconfig path will update it from the new device +* tree after resuming on the destination. +*/ + saved_slb_size = clamp_slb_size(); + + ret = rtas_ibm_suspend_me(); + if (ret != 0) { + pr_err("ibm,suspend-me error: %d\n", status); + slb_set_size(saved_slb_size); + } + + return ret; +} + +static int do_join(void *arg) +{ + atomic_t *counter = arg; + long hvrc; + int ret; + + /* Must ensure MSR.EE off for H_JOIN. */ + hard_irq_disable(); + hvrc = plpar_hcall_norets(H_JOIN); + + switch (hvrc) { + case H_CONTINUE: + /* +* All other CPUs are offline or in H_JOIN. 
This CPU +* attempts the suspend. +*/ + ret = do_suspend(); + break; + case H_SUCCESS: + /* +* The suspend is complete and this cpu has received a +* prod. +*/ + ret = 0; + break; + case H_BAD_MODE: + case H_HARDWARE: + default: + ret = -EIO; + pr_err_ratelimited("H_JOIN error %ld on CPU %i\n", + hvrc, smp_processor_id()); + break; + } + + if (atomic_inc_return(counter) == 1) { + pr_info("CPU %u waking all threads\n", smp_processor_id()); + prod_others(); + } + /* +* Execution may have been suspended for several seconds, so +* reset the watchdog. +*/ + touch_nmi_watchdog(); + retur
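The tail of this patch is truncated above; as a sketch of the dispatch site, consistent with the diff context visible in the later retry patch (migrate_sketch() is a placeholder name):

static int migrate_sketch(u64 handle)
{
        atomic_t counter = ATOMIC_INIT(0);
        int ret;

        ret = wait_for_vasi_session_suspending(handle);
        if (ret)
                return ret;

        /* run do_join() on every online CPU with the machine stopped */
        ret = stop_machine(do_join, &counter, cpu_online_mask);
        if (ret == 0)
                post_mobility_fixup();

        return ret;
}

The single shared atomic makes the "exactly one waker" guarantee cheap: whichever CPU increments it to 1 first, whether it ran the suspend or was prodded awake, performs the H_PROD sweep.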
[PATCH v2 24/28] powerpc/pseries/hibernation: remove redundant cacheinfo update
Partitions with cache nodes in the device tree can encounter the following warning on resume: CPU 0 already accounted in PowerPC,POWER9@0(Data) WARNING: CPU: 0 PID: 3177 at arch/powerpc/kernel/cacheinfo.c:197 cacheinfo_cpu_online+0x640/0x820 These calls to cacheinfo_cpu_offline/online have been redundant since commit e610a466d16a ("powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration"). Fixes: e610a466d16a ("powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 703728cb95ec..6a94cc0deb88 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -13,7 +13,6 @@ #include #include #include -#include "../../kernel/cacheinfo.h" static struct device suspend_dev; static DECLARE_COMPLETION(suspend_work); @@ -63,9 +62,7 @@ static void pseries_suspend_enable_irqs(void) * Update configuration which can be modified based on device tree * changes during resume. */ - cacheinfo_cpu_offline(smp_processor_id()); post_mobility_fixup(); - cacheinfo_cpu_online(smp_processor_id()); } /** -- 2.28.0
[PATCH v2 06/28] powerpc/hvcall: add token and codes for H_VASI_SIGNAL
H_VASI_SIGNAL can be used by a partition to request cancellation of its migration. To be used in future changes. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/hvcall.h | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index c1fbccb04390..c98f5141e3fc 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -155,6 +155,14 @@ #define H_VASI_RESUMED 5 #define H_VASI_COMPLETED6 +/* VASI signal codes. Only the Cancel code is valid for H_VASI_SIGNAL. */ +#define H_VASI_SIGNAL_CANCEL1 +#define H_VASI_SIGNAL_ABORT 2 +#define H_VASI_SIGNAL_SUSPEND 3 +#define H_VASI_SIGNAL_COMPLETE 4 +#define H_VASI_SIGNAL_ENABLE5 +#define H_VASI_SIGNAL_FAILOVER 6 + /* Each control block has to be on a 4K boundary */ #define H_CB_ALIGNMENT 4096 @@ -261,6 +269,7 @@ #define H_ADD_CONN 0x284 #define H_DEL_CONN 0x288 #define H_JOIN 0x298 +#define H_VASI_SIGNAL 0x2A0 #define H_VASI_STATE0x2A4 #define H_VIOCTL 0x2A8 #define H_ENABLE_CRQ 0x2B0 -- 2.28.0
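A usage sketch for the cancel path, with handle and err standing in for the eventual caller's stream handle and error code (the reason-code layout, aborting entity in the top byte and detail in the low byte, matches how a later patch in this series builds it):

        u32 detail = abs(err) & 0xff;           /* e.g. errno from the failed suspend */
        u32 reason_code = (6u << 24) | detail;  /* 6 = migrating partition */
        long hvrc;

        hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle,
                                  H_VASI_SIGNAL_CANCEL, reason_code);
        if (hvrc != H_SUCCESS)
                pr_err("H_VASI_SIGNAL error: %ld\n", hvrc);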
[PATCH v2 17/28] powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
There are three ways pseries_suspend_begin() can be reached: 1. When "mem" is written to /sys/power/state: kobj_attr_store() -> state_store() -> pm_suspend() -> suspend_devices_and_enter() -> pseries_suspend_begin() This never works because there is no way to supply a valid stream id using this interface, and H_VASI_STATE is called with a stream id of zero. So this call path is useless at best. 2. When a stream id is written to /sys/devices/system/power/hibernate. pseries_suspend_begin() is polled directly from store_hibernate() until the stream is in the "Suspending" state (i.e. the platform is ready for the OS to suspend execution): dev_attr_store() -> store_hibernate() -> pseries_suspend_begin() 3. When a stream id is written to /sys/devices/system/power/hibernate (continued). After #2, pseries_suspend_begin() is called once again from the pm core: dev_attr_store() -> store_hibernate() -> pm_suspend() -> suspend_devices_and_enter() -> pseries_suspend_begin() This is redundant because the VASI suspend state is already known to be Suspending. The begin() callback of platform_suspend_ops is optional, so we can simply remove that assignment with no loss of function. Fixes: 32d8ad4e621d ("powerpc/pseries: Partition hibernation support") Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 81e0ac58d620..3eaa9d59dc7a 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -187,7 +187,6 @@ static struct bus_type suspend_subsys = { static const struct platform_suspend_ops pseries_suspend_ops = { .valid = suspend_valid_only_mem, - .begin = pseries_suspend_begin, .prepare_late = pseries_prepare_late, .enter = pseries_suspend_enter, }; -- 2.28.0
[PATCH v2 27/28] powerpc/rtas: remove unused rtas_suspend_me_data
All code which used this type has been removed. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas-types.h | 8 1 file changed, 8 deletions(-) diff --git a/arch/powerpc/include/asm/rtas-types.h b/arch/powerpc/include/asm/rtas-types.h index aa420561bc10..8df6235d64d1 100644 --- a/arch/powerpc/include/asm/rtas-types.h +++ b/arch/powerpc/include/asm/rtas-types.h @@ -23,14 +23,6 @@ struct rtas_t { struct device_node *dev;/* virtual address pointer */ }; -struct rtas_suspend_me_data { - atomic_t working; /* number of cpus accessing this struct */ - atomic_t done; - int token; /* ibm,suspend-me */ - atomic_t error; - struct completion *complete; /* wait on this until working == 0 */ -}; - struct rtas_error_log { /* Byte 0 */ u8 byte0; /* Architectural version */ -- 2.28.0
[PATCH v2 18/28] powerpc/pseries/hibernation: pass stream id via function arguments
There is no need for the stream id to be a file-global variable; pass it from hibernate_store() to pseries_suspend_begin() for the H_VASI_STATE call. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 3eaa9d59dc7a..232621f33510 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -15,7 +15,6 @@ #include #include "../../kernel/cacheinfo.h" -static u64 stream_id; static struct device suspend_dev; static DECLARE_COMPLETION(suspend_work); static struct rtas_suspend_me_data suspend_data; @@ -29,7 +28,7 @@ static atomic_t suspending; * Return value: * 0 on success / other on failure **/ -static int pseries_suspend_begin(suspend_state_t state) +static int pseries_suspend_begin(u64 stream_id) { long vasi_state, rc; unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; @@ -132,6 +131,7 @@ static ssize_t store_hibernate(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { + u64 stream_id; int rc; if (!capable(CAP_SYS_ADMIN)) @@ -140,7 +140,7 @@ static ssize_t store_hibernate(struct device *dev, stream_id = simple_strtoul(buf, NULL, 16); do { - rc = pseries_suspend_begin(PM_SUSPEND_MEM); + rc = pseries_suspend_begin(stream_id); if (rc == -EAGAIN) ssleep(1); } while (rc == -EAGAIN); @@ -148,8 +148,6 @@ static ssize_t store_hibernate(struct device *dev, if (!rc) rc = pm_suspend(PM_SUSPEND_MEM); - stream_id = 0; - if (!rc) rc = count; -- 2.28.0
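Tangentially, since the stream id is now parsed locally in store_hibernate(), a checked parse would be easy to adopt later if wanted (a sketch only; the posted patch keeps simple_strtoul()):

        u64 stream_id;
        int rc;

        rc = kstrtou64(buf, 16, &stream_id);
        if (rc)
                return rc;

kstrtou64() rejects malformed input and overflow, which simple_strtoul() with a NULL endp silently ignores.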
[PATCH v2 26/28] powerpc/pseries/hibernation: remove prepare_late() callback
The pseries hibernate code no longer calls into the original join/suspend code in kernel/rtas.c, so pseries_prepare_late() and related code don't accomplish anything now. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 25 1 file changed, 25 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 589a91730db8..1b902cbf85c5 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -15,9 +15,6 @@ #include static struct device suspend_dev; -static DECLARE_COMPLETION(suspend_work); -static struct rtas_suspend_me_data suspend_data; -static atomic_t suspending; /** * pseries_suspend_begin - First phase of hibernation @@ -61,23 +58,6 @@ static int pseries_suspend_enter(suspend_state_t state) return rtas_ibm_suspend_me(NULL); } -/** - * pseries_prepare_late - Prepare to suspend all other CPUs - * - * Return value: - * 0 on success / other on failure - **/ -static int pseries_prepare_late(void) -{ - atomic_set(, 1); - atomic_set(_data.working, 0); - atomic_set(_data.done, 0); - atomic_set(_data.error, 0); - suspend_data.complete = _work; - reinit_completion(_work); - return 0; -} - /** * store_hibernate - Initiate partition hibernation * @dev: subsys root device @@ -152,7 +132,6 @@ static struct bus_type suspend_subsys = { static const struct platform_suspend_ops pseries_suspend_ops = { .valid = suspend_valid_only_mem, - .prepare_late = pseries_prepare_late, .enter = pseries_suspend_enter, }; @@ -195,10 +174,6 @@ static int __init pseries_suspend_init(void) if (!firmware_has_feature(FW_FEATURE_LPAR)) return 0; - suspend_data.token = rtas_token("ibm,suspend-me"); - if (suspend_data.token == RTAS_UNKNOWN_SERVICE) - return 0; - if ((rc = pseries_suspend_sysfs_register(_dev))) return rc; -- 2.28.0
[PATCH v2 14/28] powerpc/pseries/mobility: retry partition suspend after error
This is a mitigation for the relatively rare occurrence where a virtual IOA can be in a transient state that prevents the suspend/migration from succeeding, resulting in an error from ibm,suspend-me. If the join/suspend sequence returns an error, it is acceptable to retry as long as the VASI suspend session state is still "Suspending" (i.e. the platform is still waiting for the OS to suspend). Retry a few times on suspend failure while this condition holds, progressively increasing the delay between attempts. We don't want to retry indefinitey because firmware emits an error log event on each unsuccessful attempt. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 59 ++- 1 file changed, 57 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index f234a7ed87aa..fe7e35cdc9d5 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -542,16 +542,71 @@ static void pseries_cancel_migration(u64 handle, int err) pr_err("H_VASI_SIGNAL error: %ld\n", hvrc); } +static int pseries_suspend(u64 handle) +{ + const unsigned int max_attempts = 5; + unsigned int retry_interval_ms = 1; + unsigned int attempt = 1; + int ret; + + while (true) { + atomic_t counter = ATOMIC_INIT(0); + unsigned long vasi_state; + int vasi_err; + + ret = stop_machine(do_join, , cpu_online_mask); + if (ret == 0) + break; + /* +* Encountered an error. If the VASI stream is still +* in Suspending state, it's likely a transient +* condition related to some device in the partition +* and we can retry in the hope that the cause has +* cleared after some delay. +* +* A better design would allow drivers etc to prepare +* for the suspend and avoid conditions which prevent +* the suspend from succeeding. For now, we have this +* mitigation. +*/ + pr_notice("Partition suspend attempt %u of %u error: %d\n", + attempt, max_attempts, ret); + + if (attempt == max_attempts) + break; + + vasi_err = poll_vasi_state(handle, _state); + if (vasi_err == 0) { + if (vasi_state != H_VASI_SUSPENDING) { + pr_notice("VASI state %lu after failed suspend\n", + vasi_state); + break; + } + } else if (vasi_err != -EOPNOTSUPP) { + pr_err("VASI state poll error: %d", vasi_err); + break; + } + + pr_notice("Will retry partition suspend after %u ms\n", + retry_interval_ms); + + msleep(retry_interval_ms); + retry_interval_ms *= 10; + attempt++; + } + + return ret; +} + static int pseries_migrate_partition(u64 handle) { - atomic_t counter = ATOMIC_INIT(0); int ret; ret = wait_for_vasi_session_suspending(handle); if (ret) return ret; - ret = stop_machine(do_join, , cpu_online_mask); + ret = pseries_suspend(handle); if (ret == 0) post_mobility_fixup(); else -- 2.28.0
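For reference, the backoff this gives: failed attempts 1 through 4 are each followed by a sleep of 1, 10, 100 and 1000 ms, so a persistent failure costs five ibm,suspend-me attempts plus roughly 1.1 s of sleeping. A trivial user-space check of that arithmetic:

#include <stdio.h>

int main(void)
{
        unsigned int interval_ms = 1, attempt, total_ms = 0;

        for (attempt = 1; attempt < 5; attempt++) {     /* no sleep after the 5th attempt */
                total_ms += interval_ms;
                interval_ms *= 10;
        }
        printf("worst-case added sleep: %u ms\n", total_ms);    /* prints 1111 */
        return 0;
}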
[PATCH v2 28/28] powerpc/pseries/mobility: refactor node lookup during DT update
In pseries_devicetree_update(), with each call to ibm,update-nodes the partition firmware communicates the node to be deleted or updated by placing its phandle in the work buffer. Each of delete_dt_node(), update_dt_node(), and add_dt_node() have duplicate lookups using the phandle value and corresponding refcount management. Move the lookup and of_node_put() into pseries_devicetree_update(), and emit a warning on any failed lookups. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 49 --- 1 file changed, 17 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index e670180f311d..ea4d6a660e0d 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -61,18 +61,10 @@ static int mobility_rtas_call(int token, char *buf, s32 scope) return rc; } -static int delete_dt_node(__be32 phandle) +static int delete_dt_node(struct device_node *dn) { - struct device_node *dn; - - dn = of_find_node_by_phandle(be32_to_cpu(phandle)); - if (!dn) - return -ENOENT; - pr_debug("removing node %pOFfp\n", dn); - dlpar_detach_node(dn); - of_node_put(dn); return 0; } @@ -137,10 +129,9 @@ static int update_dt_property(struct device_node *dn, struct property **prop, return 0; } -static int update_dt_node(__be32 phandle, s32 scope) +static int update_dt_node(struct device_node *dn, s32 scope) { struct update_props_workarea *upwa; - struct device_node *dn; struct property *prop = NULL; int i, rc, rtas_rc; char *prop_data; @@ -157,14 +148,8 @@ static int update_dt_node(__be32 phandle, s32 scope) if (!rtas_buf) return -ENOMEM; - dn = of_find_node_by_phandle(be32_to_cpu(phandle)); - if (!dn) { - kfree(rtas_buf); - return -ENOENT; - } - upwa = (struct update_props_workarea *)_buf[0]; - upwa->phandle = phandle; + upwa->phandle = cpu_to_be32(dn->phandle); do { rtas_rc = mobility_rtas_call(update_properties_token, rtas_buf, @@ -224,26 +209,18 @@ static int update_dt_node(__be32 phandle, s32 scope) cond_resched(); } while (rtas_rc == 1); - of_node_put(dn); kfree(rtas_buf); return 0; } -static int add_dt_node(__be32 parent_phandle, __be32 drc_index) +static int add_dt_node(struct device_node *parent_dn, __be32 drc_index) { struct device_node *dn; - struct device_node *parent_dn; int rc; - parent_dn = of_find_node_by_phandle(be32_to_cpu(parent_phandle)); - if (!parent_dn) - return -ENOENT; - dn = dlpar_configure_connector(drc_index, parent_dn); - if (!dn) { - of_node_put(parent_dn); + if (!dn) return -ENOENT; - } rc = dlpar_attach_node(dn, parent_dn); if (rc) @@ -251,7 +228,6 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index) pr_debug("added node %pOFfp\n", dn); - of_node_put(parent_dn); return rc; } @@ -284,22 +260,31 @@ int pseries_devicetree_update(s32 scope) data++; for (i = 0; i < node_count; i++) { + struct device_node *np; __be32 phandle = *data++; __be32 drc_index; + np = of_find_node_by_phandle(be32_to_cpu(phandle)); + if (!np) { + pr_warn("Failed lookup: phandle 0x%x for action 0x%x\n", + be32_to_cpu(phandle), action); + continue; + } + switch (action) { case DELETE_DT_NODE: - delete_dt_node(phandle); + delete_dt_node(np); break; case UPDATE_DT_NODE: - update_dt_node(phandle, scope); + update_dt_node(np, scope); break; case ADD_DT_NODE: drc_index = *data++; - add_dt_node(phandle, drc_index); + add_dt_node(np, drc_index); break; } + of_node_put(np); cond_resched(); } } -- 2.28.0
[PATCH v2 00/28] partition suspend updates
This series aims to improve the pseries-specific partition migration and hibernation implementation, part of which has been living in kernel/rtas.c. Most of that code is eliminated or moved to platforms/pseries, and the following major functional changes are made: - Use stop_machine() instead of on_each_cpu() to avoid deadlock in the join/suspend sequence. - Retry the join/suspend sequence on errors that are likely to be transient. This is a mitigation for the fact that drivers currently have no way to prepare for an impending partition suspension, sometimes resulting in a virtual adapter being in a state which causes the platform to fail the suspend call. - Request cancellation of the migration via H_VASI_SIGNAL if Linux is going to error out of the suspend attempt. This allows the management console and other entities to promptly clean up their operations instead of relying on long timeouts to fail the migration. - Little-endian users of ibm,suspend-me, ibm,update-nodes and ibm,update-properties via sys_rtas are blocked when CONFIG_PPC_RTAS_FILTERS is enabled. - Legacy user space code (drmgr) historically has driven the migration process by using sys_rtas to separately call ibm,suspend-me, ibm,activate-firmware, and ibm,update-nodes/properties, in that order. With these changes, when sys_rtas() dispatches ibm,suspend-me, the kernel performs the device tree update and firmware activation before returning. This is more reliable, and drmgr does not seem bothered by it. - If the H_VASI_STATE hcall is absent, the implementation proceeds with the suspend instead of erroring out. This allows us to exercise these code paths in QEMU. Changes since v1: - Drop "powerpc/rtas: move rtas_call_reentrant() out of pseries guards". rtas_call_reentrant() actually is pseries-specific and this broke builds without CONFIG_PPC_PSERIES set. - Simplify polling logic in wait_for_vasi_session_suspending(). ("powerpc/pseries/mobility: extract VASI session polling logic") - Use direct return instead of goto in pseries_migrate_partition(). ("powerpc/pseries/mobility: use stop_machine for join/suspend") - Change dispatch of ibm,suspend-me in rtas syscall path to use conventional config symbol guards instead of a weak function. ("powerpc/rtas: dispatch partition migration requests to pseries") - Fix refcount imbalance in add_dt_node() error path. 
("powerpc/pseries/mobility: refactor node lookup during DT update") Nathan Lynch (28): powerpc/rtas: prevent suspend-related sys_rtas use on LE powerpc/rtas: complete ibm,suspend-me status codes powerpc/rtas: rtas_ibm_suspend_me -> rtas_ibm_suspend_me_unsafe powerpc/rtas: add rtas_ibm_suspend_me() powerpc/rtas: add rtas_activate_firmware() powerpc/hvcall: add token and codes for H_VASI_SIGNAL powerpc/pseries/mobility: don't error on absence of ibm,update-nodes powerpc/pseries/mobility: add missing break to default case powerpc/pseries/mobility: error message improvements powerpc/pseries/mobility: use rtas_activate_firmware() on resume powerpc/pseries/mobility: extract VASI session polling logic powerpc/pseries/mobility: use stop_machine for join/suspend powerpc/pseries/mobility: signal suspend cancellation to platform powerpc/pseries/mobility: retry partition suspend after error powerpc/rtas: dispatch partition migration requests to pseries powerpc/rtas: remove rtas_ibm_suspend_me_unsafe() powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops powerpc/pseries/hibernation: pass stream id via function arguments powerpc/pseries/hibernation: remove pseries_suspend_cpu() powerpc/machdep: remove suspend_disable_cpu() powerpc/rtas: remove rtas_suspend_cpu() powerpc/pseries/hibernation: switch to rtas_ibm_suspend_me() powerpc/rtas: remove unused rtas_suspend_last_cpu() powerpc/pseries/hibernation: remove redundant cacheinfo update powerpc/pseries/hibernation: perform post-suspend fixups later powerpc/pseries/hibernation: remove prepare_late() callback powerpc/rtas: remove unused rtas_suspend_me_data powerpc/pseries/mobility: refactor node lookup during DT update arch/powerpc/include/asm/hvcall.h | 9 + arch/powerpc/include/asm/machdep.h| 1 - arch/powerpc/include/asm/rtas-types.h | 8 - arch/powerpc/include/asm/rtas.h | 17 +- arch/powerpc/kernel/rtas.c| 243 ++- arch/powerpc/platforms/pseries/mobility.c | 358 ++ arch/powerpc/platforms/pseries/suspend.c | 79 + 7 files changed, 415 insertions(+), 300 deletions(-) -- 2.28.0
[PATCH v2 07/28] powerpc/pseries/mobility: don't error on absence of ibm,update-nodes
Treat the absence of the ibm,update-nodes function as benign instead of reporting an error. If the platform does not provide that facility, it's not a problem for Linux. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 6ff642e84c6a..e66359b00297 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -261,7 +261,7 @@ int pseries_devicetree_update(s32 scope) update_nodes_token = rtas_token("ibm,update-nodes"); if (update_nodes_token == RTAS_UNKNOWN_SERVICE) - return -EINVAL; + return 0; rtas_buf = kzalloc(RTAS_DATA_BUF_SIZE, GFP_KERNEL); if (!rtas_buf) -- 2.28.0
[PATCH v2 10/28] powerpc/pseries/mobility: use rtas_activate_firmware() on resume
It's incorrect to abort post-suspend processing if ibm,activate-firmware isn't available. Use rtas_activate_firmware(), which logs this condition appropriately and allows us to proceed. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 31d81b7da961..01ac7c03558e 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -312,21 +312,8 @@ int pseries_devicetree_update(s32 scope) void post_mobility_fixup(void) { int rc; - int activate_fw_token; - activate_fw_token = rtas_token("ibm,activate-firmware"); - if (activate_fw_token == RTAS_UNKNOWN_SERVICE) { - printk(KERN_ERR "Could not make post-mobility " - "activate-fw call.\n"); - return; - } - - do { - rc = rtas_call(activate_fw_token, 0, 1, NULL); - } while (rtas_busy_delay(rc)); - - if (rc) - printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc); + rtas_activate_firmware(); /* * We don't want CPUs to go online/offline while the device -- 2.28.0
[PATCH v2 22/28] powerpc/pseries/hibernation: switch to rtas_ibm_suspend_me()
rtas_suspend_last_cpu() and related code perform a lot of work that isn't relevant to the hibernation workflow. All other CPUs are offline when called so there is no need to place them in H_JOIN or prod them on resume, nor is there need for retries or operations on shared state. Call the rtas_ibm_suspend_me() wrapper function directly from pseries_suspend_enter() instead of using rtas_suspend_last_cpu(). Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 3315d698d5ab..703728cb95ec 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -76,11 +76,7 @@ static void pseries_suspend_enable_irqs(void) **/ static int pseries_suspend_enter(suspend_state_t state) { - int rc = rtas_suspend_last_cpu(_data); - - atomic_set(, 0); - atomic_set(_data.done, 1); - return rc; + return rtas_ibm_suspend_me(NULL); } /** -- 2.28.0
[PATCH v2 08/28] powerpc/pseries/mobility: add missing break to default case
update_dt_node() has a switch statement where the default case lacks a break statement. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index e66359b00297..527a64e2d89f 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -213,6 +213,7 @@ static int update_dt_node(__be32 phandle, s32 scope) } prop_data += vd; + break; } cond_resched(); -- 2.28.0
[PATCH v2 13/28] powerpc/pseries/mobility: signal suspend cancellation to platform
If we're returning an error to user space, use H_VASI_SIGNAL to send a cancellation request to the platform. This isn't strictly required but it communicates that Linux will not attempt to complete the suspend, which allows the various entities involved to promptly end the operation in progress. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 5a3951626a96..f234a7ed87aa 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -513,6 +513,35 @@ static int do_join(void *arg) return ret; } +/* + * Abort reason code byte 0. We use only the 'Migrating partition' value. + */ +enum vasi_aborting_entity { + ORCHESTRATOR= 1, + VSP_SOURCE = 2, + PARTITION_FIRMWARE = 3, + PLATFORM_FIRMWARE = 4, + VSP_TARGET = 5, + MIGRATING_PARTITION = 6, +}; + +static void pseries_cancel_migration(u64 handle, int err) +{ + u32 reason_code; + u32 detail; + u8 entity; + long hvrc; + + entity = MIGRATING_PARTITION; + detail = abs(err) & 0xff; + reason_code = (entity << 24) | detail; + + hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle, + H_VASI_SIGNAL_CANCEL, reason_code); + if (hvrc) + pr_err("H_VASI_SIGNAL error: %ld\n", hvrc); +} + static int pseries_migrate_partition(u64 handle) { atomic_t counter = ATOMIC_INIT(0); @@ -525,6 +554,8 @@ static int pseries_migrate_partition(u64 handle) ret = stop_machine(do_join, , cpu_online_mask); if (ret == 0) post_mobility_fixup(); + else + pseries_cancel_migration(handle, ret); return ret; } -- 2.28.0
[PATCH v2 21/28] powerpc/rtas: remove rtas_suspend_cpu()
rtas_suspend_cpu() no longer has users; remove it and __rtas_suspend_cpu() which now becomes unused as well. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - arch/powerpc/kernel/rtas.c | 52 - 2 files changed, 53 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 9a6107ffe378..97ccb40fb09f 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -256,7 +256,6 @@ extern bool rtas_indicator_present(int token, int *maxindex); extern int rtas_set_indicator(int indicator, int index, int new_value); extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); -extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me(int *fw_status); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 7e6024f570da..aedd46967b99 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -873,58 +873,6 @@ int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data) return __rtas_suspend_last_cpu(data, 0); } -static int __rtas_suspend_cpu(struct rtas_suspend_me_data *data, int wake_when_done) -{ - long rc = H_SUCCESS; - unsigned long msr_save; - int cpu; - - atomic_inc(>working); - - /* really need to ensure MSR.EE is off for H_JOIN */ - msr_save = mfmsr(); - mtmsr(msr_save & ~(MSR_EE)); - - while (rc == H_SUCCESS && !atomic_read(>done) && !atomic_read(>error)) - rc = plpar_hcall_norets(H_JOIN); - - mtmsr(msr_save); - - if (rc == H_SUCCESS) { - /* This cpu was prodded and the suspend is complete. */ - goto out; - } else if (rc == H_CONTINUE) { - /* All other cpus are in H_JOIN, this cpu does -* the suspend. -*/ - return __rtas_suspend_last_cpu(data, wake_when_done); - } else { - printk(KERN_ERR "H_JOIN on cpu %i failed with rc = %ld\n", - smp_processor_id(), rc); - atomic_set(>error, rc); - } - - if (wake_when_done) { - atomic_set(>done, 1); - - /* This cpu did the suspend or got an error; in either case, -* we need to prod all other other cpus out of join state. -* Extra prods are harmless. -*/ - for_each_online_cpu(cpu) - plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); - } -out: - if (atomic_dec_return(>working) == 0) - complete(data->complete); - return rc; -} - -int rtas_suspend_cpu(struct rtas_suspend_me_data *data) -{ - return __rtas_suspend_cpu(data, 0); -} - /** * rtas_call_reentrant() - Used for reentrant rtas calls * @token: Token for desired reentrant RTAS call -- 2.28.0
[PATCH v2 25/28] powerpc/pseries/hibernation: perform post-suspend fixups later
The pseries hibernate code calls post_mobility_fixup() which is sort of a dumping ground of fixups that need to run after resuming from suspend regardless of whether suspend was a hibernation or a migration. Calling post_mobility_fixup() from pseries_suspend_enable_irqs() runs this code early in resume with devices suspended and only one CPU up, while the much more commonly used migration case runs these fixups in a more typical process context. Call post_mobility_fixup() after the suspend core returns a success status to the hibernate sysfs store method and remove pseries_suspend_enable_irqs(). Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 21 - 1 file changed, 4 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 6a94cc0deb88..589a91730db8 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -50,21 +50,6 @@ static int pseries_suspend_begin(u64 stream_id) return 0; } -/** - * pseries_suspend_enable_irqs - * - * Post suspend configuration updates - * - **/ -static void pseries_suspend_enable_irqs(void) -{ - /* -* Update configuration which can be modified based on device tree -* changes during resume. -*/ - post_mobility_fixup(); -} - /** * pseries_suspend_enter - Final phase of hibernation * @@ -127,8 +112,11 @@ static ssize_t store_hibernate(struct device *dev, if (!rc) rc = pm_suspend(PM_SUSPEND_MEM); - if (!rc) + if (!rc) { rc = count; + post_mobility_fixup(); + } + return rc; } @@ -214,7 +202,6 @@ static int __init pseries_suspend_init(void) if ((rc = pseries_suspend_sysfs_register(_dev))) return rc; - ppc_md.suspend_enable_irqs = pseries_suspend_enable_irqs; suspend_set_ops(_suspend_ops); return 0; } -- 2.28.0
[PATCH v2 03/28] powerpc/rtas: rtas_ibm_suspend_me -> rtas_ibm_suspend_me_unsafe
The pseries partition suspend sequence requires that all active CPUs call H_JOIN, which suspends all but one of them with interrupts disabled. The "chosen" CPU is then to call ibm,suspend-me to complete the suspend. Upon returning from ibm,suspend-me, the chosen CPU is to use H_PROD to wake the joined CPUs. Using on_each_cpu() for this, as rtas_ibm_suspend_me() does to implement partition migration, is susceptible to deadlock with other users of on_each_cpu() and with users of stop_machine APIs. The callback passed to on_each_cpu() is not allowed to synchronize with other CPUs in the way it is used here. Complicating the fix is the fact that rtas_ibm_suspend_me() also occupies the function name that should be used to provide a more conventional wrapper for ibm,suspend-me. Rename rtas_ibm_suspend_me() to rtas_ibm_suspend_me_unsafe() to free up the name and indicate that it should not gain users. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 2 +- arch/powerpc/kernel/rtas.c| 6 +++--- arch/powerpc/platforms/pseries/mobility.c | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index f060181a0d32..8436ed01567b 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -257,7 +257,7 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -extern int rtas_ibm_suspend_me(u64 handle); +int rtas_ibm_suspend_me_unsafe(u64 handle); struct rtc_time; extern time64_t rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 4ed64aba37d6..0a8e5dc2c108 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -843,7 +843,7 @@ static void rtas_percpu_suspend_me(void *info) __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); } -int rtas_ibm_suspend_me(u64 handle) +int rtas_ibm_suspend_me_unsafe(u64 handle) { long state; long rc; @@ -949,7 +949,7 @@ int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...) } #else /* CONFIG_PPC_PSERIES */ -int rtas_ibm_suspend_me(u64 handle) +int rtas_ibm_suspend_me_unsafe(u64 handle) { return -ENOSYS; } @@ -1185,7 +1185,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me(handle); + rc = rtas_ibm_suspend_me_unsafe(handle); if (rc == -EAGAIN) args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); else if (rc == -EIO) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 2f73cb5bf12d..6ff642e84c6a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -370,7 +370,7 @@ static ssize_t migration_store(struct class *class, return rc; do { - rc = rtas_ibm_suspend_me(streamid); + rc = rtas_ibm_suspend_me_unsafe(streamid); if (rc == -EAGAIN) ssleep(1); } while (rc == -EAGAIN); -- 2.28.0
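To make the sequence described above concrete, here is a heavily simplified per-CPU sketch of the join/suspend ordering. It is not the implementation added later in this series (which coordinates through stop_machine() and an atomic counter), and it omits error handling and the H_SUCCESS bookkeeping; it only illustrates the H_JOIN / ibm,suspend-me / H_PROD ordering.

/* Simplified sketch only; see caveats above. */
static int join_and_maybe_suspend(void)
{
	long hvrc;
	int cpu, status = 0;

	hard_irq_disable();		/* H_JOIN requires MSR[EE] to be off */
	hvrc = plpar_hcall_norets(H_JOIN);

	if (hvrc == H_CONTINUE) {
		/* All other CPUs are offline or joined: this CPU suspends. */
		status = rtas_call(rtas_token("ibm,suspend-me"), 0, 1, NULL);

		/* Back from the suspend: wake the joined CPUs. */
		for_each_online_cpu(cpu)
			plpar_hcall_norets(H_PROD,
					   get_hard_smp_processor_id(cpu));
	}
	/* hvrc == H_SUCCESS means this CPU was prodded after the suspend completed. */
	return status;
}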
[PATCH v2 02/28] powerpc/rtas: complete ibm,suspend-me status codes
We don't completely account for the possible return codes for ibm,suspend-me. Add definitions for these. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 55f9a154c95d..f060181a0d32 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -23,11 +23,16 @@ #define RTAS_RMOBUF_MAX (64 * 1024) /* RTAS return status codes */ -#define RTAS_NOT_SUSPENDABLE -9004 #define RTAS_BUSY -2/* RTAS Busy */ #define RTAS_EXTENDED_DELAY_MIN9900 #define RTAS_EXTENDED_DELAY_MAX9905 +/* statuses specific to ibm,suspend-me */ +#define RTAS_SUSPEND_ABORTED 9000 /* Suspension aborted */ +#define RTAS_NOT_SUSPENDABLE-9004 /* Partition not suspendable */ +#define RTAS_THREADS_ACTIVE -9005 /* Multiple processor threads active */ +#define RTAS_OUTSTANDING_COPROC -9006 /* Outstanding coprocessor operations */ + /* * In general to call RTAS use rtas_token("string") to lookup * an RTAS token for the given string (e.g. "event-scan"). -- 2.28.0
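As a rough illustration of how a caller might consume these definitions, here is one plausible status-to-errno translation. This is a sketch only; the mapping actually used elsewhere in the series may differ.

/* Sketch: one plausible translation of ibm,suspend-me statuses to errnos. */
static int suspend_status_to_errno(int status)
{
	switch (status) {
	case 0:
		return 0;
	case RTAS_SUSPEND_ABORTED:
		return -ECANCELED;	/* operator aborted the suspend */
	case RTAS_NOT_SUSPENDABLE:
	case RTAS_THREADS_ACTIVE:
	case RTAS_OUTSTANDING_COPROC:
		return -EAGAIN;		/* conditions that may clear on retry */
	default:
		return -EIO;
	}
}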
[PATCH v2 16/28] powerpc/rtas: remove rtas_ibm_suspend_me_unsafe()
rtas_ibm_suspend_me_unsafe() is now unused; remove it and rtas_percpu_suspend_me() which becomes unused as a result. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 - arch/powerpc/kernel/rtas.c | 67 + 2 files changed, 1 insertion(+), 67 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 3b52d8574fcc..9a6107ffe378 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -258,7 +258,6 @@ extern int rtas_set_indicator_fast(int indicator, int index, int new_value); extern void rtas_progress(char *s, unsigned short hex); extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); -int rtas_ibm_suspend_me_unsafe(u64 handle); int rtas_ibm_suspend_me(int *fw_status); struct rtc_time; diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index d4b048571728..7e6024f570da 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -925,66 +925,6 @@ int rtas_suspend_cpu(struct rtas_suspend_me_data *data) return __rtas_suspend_cpu(data, 0); } -static void rtas_percpu_suspend_me(void *info) -{ - __rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 1); -} - -int rtas_ibm_suspend_me_unsafe(u64 handle) -{ - long state; - long rc; - unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; - struct rtas_suspend_me_data data; - DECLARE_COMPLETION_ONSTACK(done); - - if (!rtas_service_present("ibm,suspend-me")) - return -ENOSYS; - - /* Make sure the state is valid */ - rc = plpar_hcall(H_VASI_STATE, retbuf, handle); - - state = retbuf[0]; - - if (rc) { - printk(KERN_ERR "rtas_ibm_suspend_me: vasi_state returned %ld\n",rc); - return rc; - } else if (state == H_VASI_ENABLED) { - return -EAGAIN; - } else if (state != H_VASI_SUSPENDING) { - printk(KERN_ERR "rtas_ibm_suspend_me: vasi_state returned state %ld\n", - state); - return -EIO; - } - - atomic_set(, 0); - atomic_set(, 0); - atomic_set(, 0); - data.token = rtas_token("ibm,suspend-me"); - data.complete = - - lock_device_hotplug(); - - cpu_hotplug_disable(); - - /* Call function on all CPUs. One of us will make the -* rtas call -*/ - on_each_cpu(rtas_percpu_suspend_me, , 0); - - wait_for_completion(); - - if (atomic_read() != 0) - printk(KERN_ERR "Error doing global join\n"); - - - cpu_hotplug_enable(); - - unlock_device_hotplug(); - - return atomic_read(); -} - /** * rtas_call_reentrant() - Used for reentrant rtas calls * @token: Token for desired reentrant RTAS call @@ -1035,12 +975,7 @@ int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...) return ret; } -#else /* CONFIG_PPC_PSERIES */ -int rtas_ibm_suspend_me_unsafe(u64 handle) -{ - return -ENOSYS; -} -#endif +#endif /* CONFIG_PPC_PSERIES */ /** * Find a specific pseries error log in an RTAS extended event log. -- 2.28.0
[PATCH v2 19/28] powerpc/pseries/hibernation: remove pseries_suspend_cpu()
Since commit 48f6e7f6d948 ("powerpc/pseries: remove cede offline state for CPUs"), ppc_md.suspend_disable_cpu() is no longer used and all CPUs (save one) are placed into true offline state as opposed to H_JOIN. So pseries_suspend_cpu() is effectively unused; remove it. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/suspend.c | 15 --- 1 file changed, 15 deletions(-) diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c index 232621f33510..3315d698d5ab 100644 --- a/arch/powerpc/platforms/pseries/suspend.c +++ b/arch/powerpc/platforms/pseries/suspend.c @@ -48,20 +48,6 @@ static int pseries_suspend_begin(u64 stream_id) vasi_state); return -EIO; } - - return 0; -} - -/** - * pseries_suspend_cpu - Suspend a single CPU - * - * Makes the H_JOIN call to suspend the CPU - * - **/ -static int pseries_suspend_cpu(void) -{ - if (atomic_read()) - return rtas_suspend_cpu(_data); return 0; } @@ -235,7 +221,6 @@ static int __init pseries_suspend_init(void) if ((rc = pseries_suspend_sysfs_register(_dev))) return rc; - ppc_md.suspend_disable_cpu = pseries_suspend_cpu; ppc_md.suspend_enable_irqs = pseries_suspend_enable_irqs; suspend_set_ops(_suspend_ops); return 0; -- 2.28.0
Re: [PATCH 12/29] powerpc/pseries/mobility: extract VASI session polling logic
Michael Ellerman writes: > Nathan Lynch writes: >> The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to >> the caller until the specified VASI suspend session state makes the >> transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest >> of separating concerns to prepare for a new implementation of the >> join/suspend sequence, extract VASI session polling logic into a >> couple of local functions. Waiting for the session state to reach >> H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures >> that we will never get an EAGAIN result necessitating a retry. No >> user-visible change in behavior is intended. >> >> Signed-off-by: Nathan Lynch >> --- >> arch/powerpc/platforms/pseries/mobility.c | 76 +-- >> 1 file changed, 71 insertions(+), 5 deletions(-) >> >> diff --git a/arch/powerpc/platforms/pseries/mobility.c >> b/arch/powerpc/platforms/pseries/mobility.c >> index dc6abf164db7..1b8ae221b98a 100644 >> --- a/arch/powerpc/platforms/pseries/mobility.c >> +++ b/arch/powerpc/platforms/pseries/mobility.c >> @@ -345,6 +345,73 @@ void post_mobility_fixup(void) > ... > >> + >> +static int wait_for_vasi_session_suspending(u64 handle) >> +{ >> +unsigned long state; >> +bool keep_polling; >> +int ret; >> + >> +/* >> + * Wait for transition from H_VASI_ENABLED to >> + * H_VASI_SUSPENDING. Treat anything else as an error. >> + */ >> +do { >> +keep_polling = false; >> +ret = poll_vasi_state(handle, &state); >> +if (ret != 0) >> +break; >> + >> +switch (state) { >> +case H_VASI_SUSPENDING: >> +break; >> +case H_VASI_ENABLED: >> +keep_polling = true; >> +ssleep(1); >> +break; >> +default: >> +pr_err("unexpected H_VASI_STATE result %lu\n", state); >> +ret = -EIO; >> +break; >> +} >> +} while (keep_polling); > > This seems like it could be simpler? > > eg: > > while (true) { > ret = poll_vasi_state(handle, &state); > > if (ret != 0 || state == H_VASI_SUSPENDING) > break; > else if (state == H_VASI_ENABLED) > ssleep(1); > else { > pr_err("unexpected H_VASI_STATE result %lu\n", state); > ret = -EIO; > break; > } > } > > > Or did I miss something? No I think you're right, thanks.
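Assuming the suggested restructuring is adopted essentially as-is, the helper would end up looking something like the following sketch (the -EOPNOTSUPP handling is carried over from the original patch):

/* Sketch of wait_for_vasi_session_suspending() with the suggested loop shape. */
static int wait_for_vasi_session_suspending(u64 handle)
{
	unsigned long state;
	int ret;

	/*
	 * Wait for the transition from H_VASI_ENABLED to H_VASI_SUSPENDING;
	 * treat anything else as an error.
	 */
	while (true) {
		ret = poll_vasi_state(handle, &state);

		if (ret != 0 || state == H_VASI_SUSPENDING)
			break;

		if (state == H_VASI_ENABLED) {
			ssleep(1);
		} else {
			pr_err("unexpected H_VASI_STATE result %lu\n", state);
			ret = -EIO;
			break;
		}
	}

	/*
	 * Proceed even if H_VASI_STATE is unavailable. If H_JOIN or
	 * ibm,suspend-me are also unimplemented, we'll recover then.
	 */
	if (ret == -EOPNOTSUPP)
		ret = 0;

	return ret;
}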
Re: [PATCH 03/29] powerpc/rtas: complete ibm,suspend-me status codes
Michael Ellerman writes: > Nathan Lynch writes: >> We don't completely account for the possible return codes for >> ibm,suspend-me. Add definitions for these. >> >> Signed-off-by: Nathan Lynch >> --- >> arch/powerpc/include/asm/rtas.h | 7 ++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/include/asm/rtas.h >> b/arch/powerpc/include/asm/rtas.h >> index 55f9a154c95d..f060181a0d32 100644 >> --- a/arch/powerpc/include/asm/rtas.h >> +++ b/arch/powerpc/include/asm/rtas.h >> @@ -23,11 +23,16 @@ >> #define RTAS_RMOBUF_MAX (64 * 1024) >> >> /* RTAS return status codes */ >> -#define RTAS_NOT_SUSPENDABLE-9004 >> #define RTAS_BUSY -2/* RTAS Busy */ >> #define RTAS_EXTENDED_DELAY_MIN 9900 >> #define RTAS_EXTENDED_DELAY_MAX 9905 >> >> +/* statuses specific to ibm,suspend-me */ >> +#define RTAS_SUSPEND_ABORTED 9000 /* Suspension aborted */ > > This made me ... pause. > > But it really is positive 9000, I checked PAPR. Yes, 9000 falls within the "vendor-specific success codes" range in the RTAS status value table. I guess aborting a suspend is operator-initiated and it's not considered an error condition from that point of view.
Re: [PATCH 13/29] powerpc/pseries/mobility: use stop_machine for join/suspend
Hi Michael, Michael Ellerman writes: > Nathan Lynch writes: >> The partition suspend sequence as specified in the platform >> architecture requires that all active processor threads call >> H_JOIN, which: > ... >> diff --git a/arch/powerpc/platforms/pseries/mobility.c >> b/arch/powerpc/platforms/pseries/mobility.c >> index 1b8ae221b98a..44ca7d4e143d 100644 >> --- a/arch/powerpc/platforms/pseries/mobility.c >> +++ b/arch/powerpc/platforms/pseries/mobility.c >> @@ -412,6 +414,128 @@ static int wait_for_vasi_session_suspending(u64 handle) > ... > >> + >> +static int do_join(void *arg) >> +{ >> +atomic_t *counter = arg; >> +long hvrc; >> +int ret; >> + >> +/* Must ensure MSR.EE off for H_JOIN. */ >> +hard_irq_disable(); > > Didn't stop_machine() already do that for us? > > In the state machine in multi_cpu_stop(). Yes, but I didn't want to rely on something that seems like an implementation detail of stop_machine(). I assumed it's benign and in keeping with hard_irq_disable()'s intended semantics to make multiple calls to it within a critical section. I don't feel strongly about this though. If I remove this hard_irq_disable() I'll modify the comment to indicate that this function relies on stop_machine() to satisfy this condition for H_JOIN. >> +hvrc = plpar_hcall_norets(H_JOIN); >> + >> +switch (hvrc) { >> +case H_CONTINUE: >> +/* >> + * All other CPUs are offline or in H_JOIN. This CPU >> + * attempts the suspend. >> + */ >> +ret = do_suspend(); >> +break; >> +case H_SUCCESS: >> +/* >> + * The suspend is complete and this cpu has received a >> + * prod. >> + */ >> +ret = 0; >> +break; >> +case H_BAD_MODE: >> +case H_HARDWARE: >> +default: >> +ret = -EIO; >> +pr_err_ratelimited("H_JOIN error %ld on CPU %i\n", >> + hvrc, smp_processor_id()); >> +break; >> +} >> + >> +if (atomic_inc_return(counter) == 1) { >> +pr_info("CPU %u waking all threads\n", smp_processor_id()); >> +prod_others(); >> +} > > Do we even need the counter? IIUC only one CPU receives H_CONTINUE. So > couldn't we just have that CPU do the prodding of others? CPUs may exit H_JOIN due to system reset interrupt at any time, and H_JOIN may return H_HARDWARE to a caller after other CPUs have entered the join state successfully. In these cases the counter ensures exactly one thread performs the prod sequence. > >> +/* >> + * Execution may have been suspended for several seconds, so >> + * reset the watchdog. >> + */ >> +touch_nmi_watchdog(); >> +return ret; >> +} >> + >> +static int pseries_migrate_partition(u64 handle) >> +{ >> +atomic_t counter = ATOMIC_INIT(0); >> +int ret; >> + >> +ret = wait_for_vasi_session_suspending(handle); >> +if (ret) >> +goto out; > > Direct return would be clearer IMHO. OK, I can change this.
Re: [PATCH 16/29] powerpc/rtas: dispatch partition migration requests to pseries
Michael Ellerman writes: > Nathan Lynch writes: >> sys_rtas() cannot call ibm,suspend-me directly in the same way it >> handles other inputs. Instead it must dispatch the request to code >> that can first perform the H_JOIN sequence before any call to >> ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair >> amount of platform-specific code to implement this. >> >> Since a different, more robust implementation of the suspend sequence >> is now in the pseries platform code, we want to dispatch the request >> there while minimizing additional dependence on pseries. >> >> Use a weak function that only pseries overrides. > > Over the years weak functions have caused their fair share of problems. > > There are cases where they are the cleanest option, but for intra-arch > code like this I think an ifdef is much simpler. Fair enough, I wasn't all that confident about this anyway. >> diff --git a/arch/powerpc/include/asm/rtas.h >> b/arch/powerpc/include/asm/rtas.h >> index fdefe6a974eb..be0fc2536673 100644 >> --- a/arch/powerpc/include/asm/rtas.h >> +++ b/arch/powerpc/include/asm/rtas.h >> @@ -260,6 +260,7 @@ extern int rtas_suspend_cpu(struct rtas_suspend_me_data >> *data); >> extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); >> int rtas_ibm_suspend_me_unsafe(u64 handle); >> int rtas_ibm_suspend_me(int *fw_status); >> +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); > > ie. we'd just do: > > #ifdef CONFIG_PPC_PSERIES > int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); > #else > int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) > { > return -EINVAL; > } > #endif Yep will do.
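Spelling the agreed approach out a little further: the stub for the non-pseries case would presumably be made static inline so it can live in the header without producing duplicate definitions. A sketch, assuming the function name from the patch is kept:

/* arch/powerpc/include/asm/rtas.h -- sketch of the #ifdef arrangement */
#ifdef CONFIG_PPC_PSERIES
int rtas_syscall_dispatch_ibm_suspend_me(u64 handle);
#else
static inline int rtas_syscall_dispatch_ibm_suspend_me(u64 handle)
{
	return -EINVAL;
}
#endif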
Re: [PATCH] powerpc/pseries/hotplug-cpu: Fix memleak when cpus node not exist
Zhang Xiaoxu writes: > From: zhangxiaoxu > > If the cpus nodes not exist, we lost to free the 'cpu_drcs', which > will leak memory. > > Fixes: a0ff72f9f5a7 ("powerpc/pseries/hotplug-cpu: Remove double free in > error path") > Reported-by: Hulk Robot > Signed-off-by: zhangxiaoxu > --- > arch/powerpc/platforms/pseries/hotplug-cpu.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c > b/arch/powerpc/platforms/pseries/hotplug-cpu.c > index f2837e33bf5d..4bb1c9f2bb11 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c > @@ -743,6 +743,7 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add) > parent = of_find_node_by_path("/cpus"); > if (!parent) { > pr_warn("Could not find CPU root node in device tree\n"); > + kfree(cpu_drcs); > return -1; > } Thanks for finding this. a0ff72f9f5a7 ("powerpc/pseries/hotplug-cpu: Remove double free in error path") was posted in Sept 2019 but was not applied until July 2020: https://lore.kernel.org/linuxppc-dev/20190919231633.1344-1-nath...@linux.ibm.com/ Here is that change as posted; note the function context is find_dlpar_cpus_to_add(), not dlpar_cpu_add_by_count(): --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -726,7 +726,6 @@ static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add) parent = of_find_node_by_path("/cpus"); if (!parent) { pr_warn("Could not find CPU root node in device tree\n"); - kfree(cpu_drcs); return -1; } Meanwhile b015f6bc9547dbc056edde7177c7868ca8629c4c ("powerpc/pseries: Add cpu DLPAR support for drc-info property") was posted in Nov 2019 and committed a few days later: https://lore.kernel.org/linux-pci/1573449697-5448-4-git-send-email-tyr...@linux.ibm.com/ This change reorganized the same code, removing find_dlpar_cpus_to_add(), and it had the effect of fixing the same issue. However git apparently allowed the older change to still apply on top of this (changing a function different from the one in the original patch!), leading to a real bug. Your patch is correct but it should be framed as a revert of a0ff72f9f5a7 with this context in the commit message.
Re: [PATCH 29/29] powerpc/pseries/mobility: refactor node lookup during DT update
Nathan Lynch writes: > In pseries_devicetree_update(), with each call to ibm,update-nodes the > partition firmware communicates the node to be deleted or updated by > placing its phandle in the work buffer. Each of delete_dt_node(), > update_dt_node(), and add_dt_node() have duplicate lookups using the > phandle value and corresponding refcount management. ... I noticed that this introduces a reference count imbalance in an error path: > -static int add_dt_node(__be32 parent_phandle, __be32 drc_index) > +static int add_dt_node(struct device_node *parent_dn, __be32 drc_index) > { > struct device_node *dn; > - struct device_node *parent_dn; > int rc; > > - parent_dn = of_find_node_by_phandle(be32_to_cpu(parent_phandle)); > - if (!parent_dn) > - return -ENOENT; > - > dn = dlpar_configure_connector(drc_index, parent_dn); > if (!dn) { > of_node_put(parent_dn); here: ^^^ > @@ -251,7 +230,6 @@ static int add_dt_node(__be32 parent_phandle, __be32 > drc_index) > > pr_debug("added node %pOFfp\n", dn); > > - of_node_put(parent_dn); > return rc; > } The change correctly removes the of_node_put() from the happy path but the put in the error branch should have been removed too.
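For clarity, the failure branch would then be expected to look something like this fragment (sketch only; the reference on parent_dn is managed by the caller in the refactored code):

	/* Sketch: corrected failure branch in add_dt_node(). */
	dn = dlpar_configure_connector(drc_index, parent_dn);
	if (!dn) {
		/* No of_node_put(parent_dn) here: the caller owns that reference. */
		return -ENOENT;
	}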
Re: [PATCH 02/29] powerpc/rtas: prevent suspend-related sys_rtas use on LE
Andrew Donnellan writes: > On 30/10/20 12:17 pm, Nathan Lynch wrote: >> While drmgr has had work in some areas to make its RTAS syscall >> interactions endian-neutral, its code for performing partition >> migration via the syscall has never worked on LE. While it is able to >> complete ibm,suspend-me successfully, it crashes when attempting the >> subsequent ibm,update-nodes call. >> >> drmgr is the only known (or plausible) user of these ibm,suspend-me, >> ibm,update-nodes, and ibm,update-properties, so allow them only in >> big-endian configurations. > > And there's a zero chance that drmgr will ever be fixed on LE? It's always used the sysfs interface on LE, and the only way to provoke it to attempt the syscalls is by doing something like this before running the migration: # echo 0 > /tmp/fake_api_version # mount -o bind,ro /tmp/fake_api_version /sys/kernel/mobility/api_version So I'm not aware of any circumstance that would actually motivate someone to make it work on LE. I'd say it's more likely that drmgr will drop its support for using the syscall altogether.
[PATCH 10/29] powerpc/pseries/mobility: error message improvements
- Convert printk(KERN_ERR) to pr_err(). - Include errno in property update failure message. - Remove reference to "Post-mobility" from device tree update message: with pr_err() it will have a "mobility:" prefix. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index d85799b5464a..ef8f5641e700 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -208,8 +208,8 @@ static int update_dt_node(__be32 phandle, s32 scope) rc = update_dt_property(dn, &prop, prop_name, vd, prop_data); if (rc) { - printk(KERN_ERR "Could not update %s" - " property\n", prop_name); + pr_err("updating %s property failed: %d\n", + prop_name, rc); } prop_data += vd; @@ -343,8 +343,7 @@ void post_mobility_fixup(void) rc = pseries_devicetree_update(MIGRATION_SCOPE); if (rc) - printk(KERN_ERR "Post-mobility device tree update " - "failed: %d\n", rc); + pr_err("device tree update failed: %d\n", rc); cacheinfo_rebuild(); -- 2.25.4
[PATCH 11/29] powerpc/pseries/mobility: use rtas_activate_firmware() on resume
It's incorrect to abort post-suspend processing if ibm,activate-firmware isn't available. Use rtas_activate_firmware(), which logs this condition appropriately and allows us to proceed. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index ef8f5641e700..dc6abf164db7 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -312,21 +312,8 @@ int pseries_devicetree_update(s32 scope) void post_mobility_fixup(void) { int rc; - int activate_fw_token; - activate_fw_token = rtas_token("ibm,activate-firmware"); - if (activate_fw_token == RTAS_UNKNOWN_SERVICE) { - printk(KERN_ERR "Could not make post-mobility " - "activate-fw call.\n"); - return; - } - - do { - rc = rtas_call(activate_fw_token, 0, 1, NULL); - } while (rtas_busy_delay(rc)); - - if (rc) - printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc); + rtas_activate_firmware(); /* * We don't want CPUs to go online/offline while the device -- 2.25.4
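For context, a sketch of what rtas_activate_firmware() (introduced elsewhere in this series) is expected to do, based on the description above rather than a verbatim copy: tolerate an absent ibm,activate-firmware method instead of treating it as fatal, and retry on busy statuses.

/* Sketch only, assumptions noted above. */
void rtas_activate_firmware(void)
{
	int token = rtas_token("ibm,activate-firmware");
	int fwrc;

	if (token == RTAS_UNKNOWN_SERVICE) {
		pr_err("ibm,activate-firmware method unavailable\n");
		return;
	}

	do {
		fwrc = rtas_call(token, 0, 1, NULL);
	} while (rtas_busy_delay(fwrc));

	if (fwrc)
		pr_err("ibm,activate-firmware failed (%i)\n", fwrc);
}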
[PATCH 12/29] powerpc/pseries/mobility: extract VASI session polling logic
The behavior of rtas_ibm_suspend_me_unsafe() is to return -EAGAIN to the caller until the specified VASI suspend session state makes the transition from H_VASI_ENABLED to H_VASI_SUSPENDING. In the interest of separating concerns to prepare for a new implementation of the join/suspend sequence, extract VASI session polling logic into a couple of local functions. Waiting for the session state to reach H_VASI_SUSPENDING before calling rtas_ibm_suspend_me_unsafe() ensures that we will never get an EAGAIN result necessitating a retry. No user-visible change in behavior is intended. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 76 +-- 1 file changed, 71 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index dc6abf164db7..1b8ae221b98a 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -345,6 +345,73 @@ void post_mobility_fixup(void) return; } +static int poll_vasi_state(u64 handle, unsigned long *res) +{ + unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; + long hvrc; + int ret; + + hvrc = plpar_hcall(H_VASI_STATE, retbuf, handle); + switch (hvrc) { + case H_SUCCESS: + ret = 0; + *res = retbuf[0]; + break; + case H_PARAMETER: + ret = -EINVAL; + break; + case H_FUNCTION: + ret = -EOPNOTSUPP; + break; + case H_HARDWARE: + default: + pr_err("unexpected H_VASI_STATE result %ld\n", hvrc); + ret = -EIO; + break; + } + return ret; +} + +static int wait_for_vasi_session_suspending(u64 handle) +{ + unsigned long state; + bool keep_polling; + int ret; + + /* +* Wait for transition from H_VASI_ENABLED to +* H_VASI_SUSPENDING. Treat anything else as an error. +*/ + do { + keep_polling = false; + ret = poll_vasi_state(handle, &state); + if (ret != 0) + break; + + switch (state) { + case H_VASI_SUSPENDING: + break; + case H_VASI_ENABLED: + keep_polling = true; + ssleep(1); + break; + default: + pr_err("unexpected H_VASI_STATE result %lu\n", state); + ret = -EIO; + break; + } + } while (keep_polling); + + /* +* Proceed even if H_VASI_STATE is unavailable. If H_JOIN or +* ibm,suspend-me are also unimplemented, we'll recover then. +*/ + if (ret == -EOPNOTSUPP) + ret = 0; + + return ret; +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) @@ -356,12 +423,11 @@ static ssize_t migration_store(struct class *class, if (rc) return rc; - do { - rc = rtas_ibm_suspend_me_unsafe(streamid); - if (rc == -EAGAIN) - ssleep(1); - } while (rc == -EAGAIN); + rc = wait_for_vasi_session_suspending(streamid); + if (rc) + return rc; + rc = rtas_ibm_suspend_me_unsafe(streamid); if (rc) return rc; -- 2.25.4
[PATCH 14/29] powerpc/pseries/mobility: signal suspend cancellation to platform
If we're returning an error to user space, use H_VASI_SIGNAL to send a cancellation request to the platform. This isn't strictly required but it communicates that Linux will not attempt to complete the suspend, which allows the various entities involved to promptly end the operation in progress. Signed-off-by: Nathan Lynch --- arch/powerpc/platforms/pseries/mobility.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 44ca7d4e143d..0f592246f345 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -520,6 +520,35 @@ static int do_join(void *arg) return ret; } +/* + * Abort reason code byte 0. We use only the 'Migrating partition' value. + */ +enum vasi_aborting_entity { + ORCHESTRATOR = 1, + VSP_SOURCE = 2, + PARTITION_FIRMWARE = 3, + PLATFORM_FIRMWARE = 4, + VSP_TARGET = 5, + MIGRATING_PARTITION = 6, +}; + +static void pseries_cancel_migration(u64 handle, int err) +{ + u32 reason_code; + u32 detail; + u8 entity; + long hvrc; + + entity = MIGRATING_PARTITION; + detail = abs(err) & 0xff; + reason_code = (entity << 24) | detail; + + hvrc = plpar_hcall_norets(H_VASI_SIGNAL, handle, + H_VASI_SIGNAL_CANCEL, reason_code); + if (hvrc) + pr_err("H_VASI_SIGNAL error: %ld\n", hvrc); +} + static int pseries_migrate_partition(u64 handle) { atomic_t counter = ATOMIC_INIT(0); @@ -532,6 +561,8 @@ static int pseries_migrate_partition(u64 handle) ret = stop_machine(do_join, &counter, cpu_online_mask); if (ret == 0) post_mobility_fixup(); + else + pseries_cancel_migration(handle, ret); out: return ret; } -- 2.25.4
[PATCH 16/29] powerpc/rtas: dispatch partition migration requests to pseries
sys_rtas() cannot call ibm,suspend-me directly in the same way it handles other inputs. Instead it must dispatch the request to code that can first perform the H_JOIN sequence before any call to ibm,suspend-me can succeed. Over time kernel/rtas.c has accreted a fair amount of platform-specific code to implement this. Since a different, more robust implementation of the suspend sequence is now in the pseries platform code, we want to dispatch the request there while minimizing additional dependence on pseries. Use a weak function that only pseries overrides. Note that invoking ibm,suspend-me via the RTAS syscall is all but deprecated; this change preserves ABI compatibility for old programs while providing to them the benefit of the new partition suspend implementation. This is a behavior change in that the kernel performs the device tree update and firmware activation before returning, but experimentation indicates this is tolerated fine by legacy user space. Signed-off-by: Nathan Lynch --- arch/powerpc/include/asm/rtas.h | 1 + arch/powerpc/kernel/rtas.c| 8 +++- arch/powerpc/platforms/pseries/mobility.c | 5 + 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index fdefe6a974eb..be0fc2536673 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -260,6 +260,7 @@ extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); int rtas_ibm_suspend_me_unsafe(u64 handle); int rtas_ibm_suspend_me(int *fw_status); +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle); struct rtc_time; extern time64_t rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 58bbd69a233f..52fb394f15d6 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1221,6 +1221,12 @@ static bool block_rtas_call(int token, int nargs, #endif /* CONFIG_PPC_RTAS_FILTER */ +/* Only pseries should need to override this. */ +int __weak rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return -EINVAL; +} + /* We assume to be passed big endian arguments */ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) { @@ -1271,7 +1277,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) << 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me_unsafe(handle); + rc = rtas_syscall_dispatch_ibm_suspend_me(handle); if (rc == -EAGAIN) args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); else if (rc == -EIO) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index e459cdb8286f..d6417c9db201 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -622,6 +622,11 @@ static int pseries_migrate_partition(u64 handle) return ret; } +int rtas_syscall_dispatch_ibm_suspend_me(u64 handle) +{ + return pseries_migrate_partition(handle); +} + static ssize_t migration_store(struct class *class, struct class_attribute *attr, const char *buf, size_t count) -- 2.25.4