Re: [RFC PATCH 1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core
On Thu, Feb 15, 2024 at 9:09 PM Michael Ellerman wrote:
>
> On Fri, 29 Dec 2023 23:01:03 +1100, Michael Ellerman wrote:
> > If nr_cpu_ids is too low to include at least all the threads of a single
> > core adjust nr_cpu_ids upwards. This avoids triggering odd bugs in code
> > that assumes all threads of a core are available.
> >
> >
>
> Applied to powerpc/next.
>

Great! After all these years, finally we are close to the conclusion of
this feature.

Thanks,

Pingfan

> [1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core
>       https://git.kernel.org/powerpc/c/5580e96dad5a439d561d9648ffcbccb739c2a120
> [2/5] powerpc/smp: Increase nr_cpu_ids to include the boot CPU
>       https://git.kernel.org/powerpc/c/777f81f0a9c780a6443bcf2c7785f0cc2e87c1ef
> [3/5] powerpc/smp: Lookup avail once per device tree node
>       https://git.kernel.org/powerpc/c/dca79603fbc592ec7ea8bd7ba274052d3984e882
> [4/5] powerpc/smp: Factor out assign_threads()
>       https://git.kernel.org/powerpc/c/9832de654499f0bf797a3719c4d4c5bd401f18f5
> [5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids
>       https://git.kernel.org/powerpc/c/0875f1ceba974042069f04946aa8f1d4d1e688da
>
> cheers
>
Re: [PATCH v6 (proposal)] powerpc/cpu: enable nr_cpus for crash kernel
Hi Christophe,

The latest series is
https://lore.kernel.org/linuxppc-dev/20231017022806.4523-1-pi...@redhat.com/

And Michael has his implementation at:
https://lore.kernel.org/all/20231229120107.2281153-3-...@ellerman.id.au/T/#m46128446bce1095631162a1927415733a3bf0633

Thanks,

Pingfan

On Fri, Jan 26, 2024 at 3:40 AM Christophe Leroy wrote:
>
> Hi,
>
> On 22/05/2018 at 10:23, Pingfan Liu wrote:
> > For kexec -p, the boot cpu can be not the cpu0, this causes the problem
> > to alloc paca[]. In theory, there is no requirement to assign cpu's logical
> > id as its present seq by device tree. But we have something like
> > cpu_first_thread_sibling(), which makes assumption on the mapping inside
> > a core. Hence partially changing the mapping, i.e. unbind the mapping of
> > core while keep the mapping inside a core. After this patch, the core with
> > boot-cpu will always be mapped into core 0.
> >
> > And at present, the code to discovery cpu spreads over two functions:
> > early_init_dt_scan_cpus() and smp_setup_cpu_maps().
> > This patch tries to fold smp_setup_cpu_maps() into the "previous" one
>
> This patch is pretty old and doesn't apply anymore. If still relevant
> can you please rebase and resubmit.
> > Thanks > Christophe > > > > > Signed-off-by: Pingfan Liu > > --- > > v5 -> v6: > >simplify the loop logic (Hope it can answer Benjamin's concern) > >concentrate the cpu recovery code to early stage (Hope it can answer > > Michael's concern) > > Todo: (if this method is accepted) > >fold the whole smp_setup_cpu_maps() > > > > arch/powerpc/include/asm/smp.h | 1 + > > arch/powerpc/kernel/prom.c | 123 > > - > > arch/powerpc/kernel/setup-common.c | 58 ++--- > > drivers/of/fdt.c | 2 +- > > include/linux/of_fdt.h | 2 + > > 5 files changed, 103 insertions(+), 83 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h > > index fac963e..80c7693 100644 > > --- a/arch/powerpc/include/asm/smp.h > > +++ b/arch/powerpc/include/asm/smp.h > > @@ -30,6 +30,7 @@ > > #include > > > > extern int boot_cpuid; > > +extern int threads_in_core; > > extern int spinning_secondaries; > > > > extern void cpu_die(void); > > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c > > index 4922162..2ae0b4a 100644 > > --- a/arch/powerpc/kernel/prom.c > > +++ b/arch/powerpc/kernel/prom.c > > @@ -77,7 +77,6 @@ unsigned long tce_alloc_start, tce_alloc_end; > > u64 ppc64_rma_size; > > #endif > > static phys_addr_t first_memblock_size; > > -static int __initdata boot_cpu_count; > > > > static int __init early_parse_mem(char *p) > > { > > @@ -305,6 +304,14 @@ static void __init > > check_cpu_feature_properties(unsigned long node) > > } > > } > > > > +struct bootinfo { > > + int boot_thread_id; > > + unsigned int cpu_cnt; > > + int cpu_hwids[NR_CPUS]; > > + bool avail[NR_CPUS]; > > +}; > > +static struct bootinfo *bt_info; > > + > > static int __init early_init_dt_scan_cpus(unsigned long node, > > const char *uname, int depth, > > void *data) > > @@ -312,10 +319,12 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > > const char *type = of_get_flat_dt_prop(node, "device_type", NULL); > > const __be32 *prop; > > const __be32 
*intserv; > > - int i, nthreads; > > + int i, nthreads, maxidx; > > int len; > > - int found = -1; > > - int found_thread = 0; > > + int found_thread = -1; > > + struct bootinfo *info = data; > > + bool avail; > > + int rotate_cnt, id; > > > > /* We are scanning "cpu" nodes only */ > > if (type == NULL || strcmp(type, "cpu") != 0) > > @@ -325,8 +334,15 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > > intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", > > ); > > if (!intserv) > > intserv = of_get_flat_dt_prop(node, "reg", ); > > + avail = of_fdt_device_is_available(initial_boot_params, node); > > +#if 0 > > + //todo > > + if (!avail) > > + avail = !of_fdt_property
Re: [RFC PATCH 5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids
On Fri, Dec 29, 2023 at 8:07 PM Michael Ellerman wrote:
>
> Michael Ellerman writes:
> > If nr_cpu_ids is too low to include the boot CPU, remap the boot CPU
> > onto logical core 0.
>
> Hi guys,
>
> I finally got time to look at this issue. I think this series should fix

Thanks a lot for making time for this. I hope we can close this
prolonged issue soon.

I am also looping in Wen Xiong and Ming Lei, who care about this issue
too.

Best Regards,

Pingfan

> the problems that have been seen. I've tested this fairly thoroughly
> with a qemu script, and also a few boots on a real machine.
>
> If you can test it with your setups that would be great. Hopefully there
> isn't some obscure case I've missed.
>
> cheers
>
[PATCHv10 3/3] powerpc/smp: Allow hole in paca_ptrs to accommodate boot_cpu
From: Pingfan Liu

This patch always forces the first core to be onlined, because some
subsystems need cpu0. After core 0, a hole may follow, and then comes the
crashed core.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: Sourabh Jain
Cc: Hari Bathini
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/smp.h     |  1 +
 arch/powerpc/kernel/paca.c         |  7 +--
 arch/powerpc/kernel/prom.c         |  6 ++
 arch/powerpc/kernel/setup-common.c | 24
 4 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..f01c7891b0d7 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -27,6 +27,7 @@
 
 extern int boot_cpuid;
 extern int boot_cpu_hwid; /* PPC64 only */
+extern int threads_in_core;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 840c74dd17d6..1fe0fd2a6021 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,12 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-	paca_last_cpu_num = nr_cpu_ids;
+	unsigned int cnt;
 
-	paca_ptrs_size = sizeof(struct paca_struct *) * paca_last_cpu_num;
+	/* paca_ptrs should be big enough to hold boot cpu */
+	cnt = max((unsigned int)ALIGN(boot_cpuid + 1, threads_in_core), nr_cpu_ids);
+	paca_last_cpu_num = cnt;
+	paca_ptrs_size = sizeof(struct paca_struct *) * cnt;
 	paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
 	if (!paca_ptrs)
 		panic("Failed to allocate %d bytes for paca pointers\n",
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..e1a671156941 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -371,9 +371,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	DBG("boot cpu: logical %d physical %d\n", found,
 	    be32_to_cpu(intserv[found_thread]));
 	boot_cpuid = found;
+	/* This forces all threads in a core to be onlined */
+	set_nr_cpu_ids(ALIGN(nr_cpu_ids, nthreads));
+	/* Core 0 is always onlined and assure enough room for boot core */
+	if (nthreads -1 < boot_cpuid && nr_cpu_ids < 2 * nthreads)
+		set_nr_cpu_ids(2 * nthreads);
 
 	if (IS_ENABLED(CONFIG_PPC64))
 		boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+	threads_in_core = nthreads;
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index f9f5f313abf0..b70474e1b5fe 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -86,6 +86,7 @@ EXPORT_SYMBOL(machine_id);
 
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
+int __initdata threads_in_core = 1;
 
 #ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
@@ -448,8 +449,9 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int cpu = 0;
+	int cpu_onlined = 0, cpu = 0;
 	int nthreads = 1;
+	bool bootcpu_covered = false;
 
 	DBG("smp_setup_cpu_maps()\n");
 
@@ -484,7 +486,19 @@ void __init smp_setup_cpu_maps(void)
 
 		nthreads = len / sizeof(int);
 
-		for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+		if (!bootcpu_covered) {
+			if (cpu == ALIGN_DOWN(boot_cpuid, nthreads)) {
+				bootcpu_covered = true;
+				goto scan;
+
+			/* Reserve the last online slot for boot core */
+			} else if (cpu >= nr_cpu_ids - nthreads && !bootcpu_covered) {
+				cpu += nthreads;
+				continue;
+			}
+		}
+scan:
+		for (j = 0; j < nthreads && cpu_onlined < nr_cpu_ids; j++) {
 			bool avail;
 
 			DBG("thread %d -> cpu %d (hard id %d)\n",
@@ -499,9 +513,10 @@ void __init smp_setup_cpu_maps(void)
 			set_cpu_possible(cpu, true);
 			cpu_to_phys_id[cpu] = be32_to_cpu(intserv[j]);
 			cpu++;
+			cpu_onlined++;
 		}
 
-		if (cpu >= nr_cpu_ids) {
+		if (cpu_onlined >= nr_cpu_ids) {
 			of_node_put(dn);
 			break;
 		}
@@ -547,7 +562,8 @@ vo
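The sizing arithmetic in the patch above can be modeled in plain userspace C. This is only a sketch under stated assumptions: `align_up`, `adjust_nr_cpu_ids` and `paca_ptrs_count` are hypothetical helper names (not kernel functions), and `align_up` rounds to any multiple, whereas the kernel's `ALIGN()` expects a power-of-two alignment.

```c
#include <assert.h>

/* Round x up to the next multiple of a (hypothetical stand-in for ALIGN()). */
static unsigned int align_up(unsigned int x, unsigned int a)
{
	assert(a > 0);
	return (x + a - 1) / a * a;
}

/*
 * Model of the nr_cpu_ids adjustment added to early_init_dt_scan_cpus():
 * cover whole cores, and keep room for core 0 plus the boot core when
 * the boot cpu does not live in core 0.
 */
static unsigned int adjust_nr_cpu_ids(unsigned int nr_cpu_ids,
				      unsigned int nthreads,
				      unsigned int boot_cpuid)
{
	nr_cpu_ids = align_up(nr_cpu_ids, nthreads);
	if (nthreads - 1 < boot_cpuid && nr_cpu_ids < 2 * nthreads)
		nr_cpu_ids = 2 * nthreads;
	return nr_cpu_ids;
}

/* Model of the paca_ptrs[] entry count chosen in allocate_paca_ptrs(). */
static unsigned int paca_ptrs_count(unsigned int nr_cpu_ids,
				    unsigned int threads_in_core,
				    unsigned int boot_cpuid)
{
	unsigned int need = align_up(boot_cpuid + 1, threads_in_core);

	return need > nr_cpu_ids ? need : nr_cpu_ids;
}
```

In this model, nr_cpus=1 with eight threads per core and a boot cpu with logical id 58 gives nr_cpu_ids = 16 and 64 paca_ptrs[] slots, i.e. the pointer array grows while the number of onlined cpus stays at two cores.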
[PATCHv10 2/3] powerpc/kernel: Extend arrays' size to make room for a hole in cpu_possible_mask
From: Pingfan Liu

This patch marks all the arrays whose size is determined by nr_cpu_ids or
num_possible_cpus(). Later, if a hole is allowed in cpu_possible_mask,
each such array must be extended to cover the last bit number in
cpu_possible_mask.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: Sourabh Jain
Cc: Hari Bathini
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/paca.h    | 2 ++
 arch/powerpc/kernel/paca.c         | 8
 arch/powerpc/kernel/setup-common.c | 2 +-
 arch/powerpc/kernel/smp.c          | 3 ++-
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e667d455ecb4..a577d98dd0d8 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -299,5 +299,7 @@ static inline void free_unused_pacas(void) { }
 
 #endif /* CONFIG_PPC64 */
 
+extern int paca_last_cpu_num;
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_PACA_H */
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 760f371cf096..840c74dd17d6 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -236,15 +236,15 @@ void setup_paca(struct paca_struct *new_paca)
 
 }
 
-static int __initdata paca_nr_cpu_ids;
+int __initdata paca_last_cpu_num;
 static int __initdata paca_ptrs_size;
 static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-	paca_nr_cpu_ids = nr_cpu_ids;
+	paca_last_cpu_num = nr_cpu_ids;
 
-	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_ptrs_size = sizeof(struct paca_struct *) * paca_last_cpu_num;
 	paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
 	if (!paca_ptrs)
 		panic("Failed to allocate %d bytes for paca pointers\n",
@@ -258,7 +258,7 @@ void __init allocate_paca(int cpu)
 	u64 limit;
 	struct paca_struct *paca;
 
-	BUG_ON(cpu >= paca_nr_cpu_ids);
+	BUG_ON(cpu >= paca_last_cpu_num);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	/*
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2f1026fba00d..f9f5f313abf0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -453,7 +453,7 @@ void __init smp_setup_cpu_maps(void)
 
 	DBG("smp_setup_cpu_maps()\n");
 
-	cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
+	cpu_to_phys_id = memblock_alloc(paca_last_cpu_num * sizeof(u32),
 					__alignof__(u32));
 	if (!cpu_to_phys_id)
 		panic("%s: Failed to allocate %zu bytes align=0x%zx\n",
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5826f5108a12..6fefe22fd118 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1140,7 +1140,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	}
 
 	if (cpu_to_chip_id(boot_cpuid) != -1) {
-		int idx = DIV_ROUND_UP(num_possible_cpus(), threads_per_core);
+		int idx = DIV_ROUND_UP(cpumask_last(cpu_possible_mask),
+					threads_per_core);
 
 		/*
 		 * All threads of a core will all belong to the same core,
-- 
2.31.1
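Why the number of possible cpus stops being a safe array bound once cpu_possible_mask has a hole can be shown with a small userspace model. This is a sketch only: a `uint64_t` stands in for the cpumask, and `mask_weight`, `mask_last` and `div_round_up` are hypothetical stand-ins for `num_possible_cpus()`, `cpumask_last()` and `DIV_ROUND_UP()`.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits: a stand-in for num_possible_cpus(). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask != 0; mask &= mask - 1)
		n++;
	return n;
}

/* Index of the highest set bit: a stand-in for cpumask_last(). */
static unsigned int mask_last(uint64_t mask)
{
	unsigned int last = 0;

	assert(mask != 0);
	while ((mask >>= 1) != 0)
		last++;
	return last;
}

/* Same rounding as the kernel's DIV_ROUND_UP(). */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}
```

With cpus 0-7 and 24-31 possible (mask 0xFF0000FF) and eight threads per core, the weight is 16, so a bound derived from the count gives 2 cores, while the last possible cpu is 31 and the bound derived from it gives 4. A per-core array indexed by the core of cpu 31 (index 3) only fits in the second case.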
[PATCHv10 1/3] powerpc/kernel: Remove check on paca_ptrs_size
From: Pingfan Liu

Between early_setup()->allocate_paca_ptrs() and
smp_setup_cpu_maps()->free_unused_pacas(), there is no call to
set_nr_cpu_ids(), which means nr_cpu_ids is unchanged. Hence remove the
check.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: Sourabh Jain
Cc: Hari Bathini
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..760f371cf096 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -286,16 +286,6 @@ void __init allocate_paca(int cpu)
 
 void __init free_unused_pacas(void)
 {
-	int new_ptrs_size;
-
-	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
-	if (new_ptrs_size < paca_ptrs_size)
-		memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
-				   paca_ptrs_size - new_ptrs_size);
-
-	paca_nr_cpu_ids = nr_cpu_ids;
-	paca_ptrs_size = new_ptrs_size;
-
 #ifdef CONFIG_PPC_64S_HASH_MMU
 	if (early_radix_enabled()) {
 		/* Ugly fixup, see new_slb_shadow() */
@@ -304,9 +294,6 @@ void __init free_unused_pacas(void)
 		paca_ptrs[boot_cpuid]->slb_shadow_ptr = NULL;
 	}
 #endif
-
-	printk(KERN_DEBUG "Allocated %u bytes for %u pacas\n",
-			paca_ptrs_size + paca_struct_size, nr_cpu_ids);
 }
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
-- 
2.31.1
[PATCHv10 0/3] enable nr_cpus for powerpc without re-ordering cpu number
From: Pingfan Liu

This series addresses the nr_cpus issue for PowerPC without re-ordering
the cpu numbers. To save the memory used by the percpu area, it also limits
the number of possible cpus by allowing a hole in cpu_possible_mask.
Because the last cpu number can then be bigger than nr_cpu_ids, some
pointer arrays indexed by cpu, e.g. paca_ptrs, must be extended to hold
the pointer.

Please note that this series still has an issue (some cpus can not be
brought up), but before I resolve it, please share your thoughts about the
approach.

Thanks

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: Sourabh Jain
Cc: Hari Bathini
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (3):
  powerpc/kernel: Remove check on paca_ptrs_size
  powerpc/kernel: Extend arrays' size to make room for a hole in
    cpu_possible_mask
  powerpc/smp: Allow hole in paca_ptrs to accommodate boot_cpu

 arch/powerpc/include/asm/paca.h    |  2 ++
 arch/powerpc/include/asm/smp.h     |  1 +
 arch/powerpc/kernel/paca.c         | 24 +++-
 arch/powerpc/kernel/prom.c         |  6 ++
 arch/powerpc/kernel/setup-common.c | 26 +-
 arch/powerpc/kernel/smp.c          |  3 ++-
 6 files changed, 39 insertions(+), 23 deletions(-)

-- 
2.31.1
Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
Hi Hari,

On Mon, Nov 27, 2023 at 12:30 PM Hari Bathini wrote:
>
> Hi Pingfan, Michael,
>
> On 17/10/23 4:03 pm, Hari Bathini wrote:
> >
> >
> > On 17/10/23 7:58 am, Pingfan Liu wrote:
> >> *** Idea ***
> >> For kexec -p, the boot cpu can be not the cpu0, this causes the problem
> >> of allocating memory for paca_ptrs[]. However, in theory, there is no
> >> requirement to assign cpu's logical id as its present sequence in the
> >> device tree. But there is something like cpu_first_thread_sibling(),
> >> which makes assumption on the mapping inside a core. Hence partially
> >> loosening the mapping, i.e. unbind the mapping of core while keep the
> >> mapping inside a core.
> >>
> >> *** Implement ***
> >> At this early stage, there are plenty of memory to utilize. Hence, this
> >> patch allocates interim memory to link the cpu info on a list, then
> >> reorder cpus by changing the list head. As a result, there is a rotate
> >> shift between the sequence number in dt and the cpu logical number.
> >>
> >> *** Result ***
> >> After this patch, a boot-cpu's logical id will always be mapped into the
> >> range [0,threads_per_core).
> >>
> >> Besides this, at this phase, all threads in the boot core are forced to
> >> be onlined. This restriction will be lifted in a later patch with
> >> extra effort.
> >>
> >> Signed-off-by: Pingfan Liu
> >> Cc: Michael Ellerman
> >> Cc: Nicholas Piggin
> >> Cc: Christophe Leroy
> >> Cc: Mahesh Salgaonkar
> >> Cc: Wen Xiong
> >> Cc: Baoquan He
> >> Cc: Ming Lei
> >> Cc: Sourabh Jain
> >> Cc: Hari Bathini
> >> Cc: ke...@lists.infradead.org
> >> To: linuxppc-dev@lists.ozlabs.org
> >
> > Thanks for working on this, Pingfan.
> > Looks good to me.
> >
> > Acked-by: Hari Bathini
> >
>
> On second thoughts, probably better off with no impact for
> bootcpu < nr_cpu_ids case and changing only two cores logical
> numbering otherwise. Something like the below (Please share
> your thoughts):
>

I am afraid that it may not be as ideal as it looks, considering the
following factors:

-1. For the case of 'bootcpu < nr_cpu_ids', a crash can happen with equal
likelihood on any cpu in the system, which seriously undermines the
protection intended here (under the most optimistic scenario, there is a
50% chance of success).

-2. For the re-ordering of logical numbering, IMHO, if there is a concern
that re-ordering will break something, partial re-ordering can not avoid
that either. We ought to spot the probable hazards so as to ease the
worries.

Thanks,

Pingfan

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index ec82f5bda908..78a8312aa8c4 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -76,7 +76,9 @@ u64 ppc64_rma_size;
>  unsigned int boot_cpu_node_count __ro_after_init;
>  #endif
>  static phys_addr_t first_memblock_size;
> +#ifdef CONFIG_SMP
>  static int __initdata boot_cpu_count;
> +#endif
>
>  static int __init early_parse_mem(char *p)
>  {
> @@ -357,6 +359,25 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
>  			fdt_boot_cpuid_phys(initial_boot_params)) {
>  			found = boot_cpu_count;
>  			found_thread = i;
> +			/*
> +			 * Map boot-cpu logical id into the range
> +			 * of [0, thread_per_core) if it can't be
> +			 * accommodated within nr_cpu_ids.
> +			 */
> +			if (i != boot_cpu_count && boot_cpu_count >= nr_cpu_ids) {
> +				boot_cpuid = i;
> +				DBG("Logical CPU number for boot CPU changed from %d to %d\n",
> +				    boot_cpu_count, i);
> +			} else {
> +				boot_cpuid = boot_cpu_count;
> +			}
> +
> +			/* Ensure boot thread is accounted for in nr_cpu_ids */
> +			if (boot_cpuid >= nr_cpu_ids) {
> +				set_nr_cpu_ids(boot_cpuid + 1);
> +				DBG("Adjusted nr_cpu_ids to %u, to include boot CPU.\n",
> +				    nr_cpu_ids);
> +			}
>  		}
>  #ifdef CONFIG_SMP
>  		/* logical cpu id is always 0 on UP kernels */
> @@ -368,9 +389,8 @@ static int __ini
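The decision logic in the quoted proposal can be read as a small pure function. The sketch below is a userspace model only: `struct boot_map` and `map_boot_cpu` are hypothetical names, `boot_cpu_count` is taken to be the boot thread's logical id in device-tree order, and `thread_idx` its index inside its core (the loop variable `i` in the quoted hunk).

```c
#include <assert.h>

struct boot_map {
	unsigned int boot_cpuid;	/* logical id chosen for the boot cpu */
	unsigned int nr_cpu_ids;	/* possibly raised to include it */
};

/* Model of the remapping in the quoted hunk (hypothetical helper). */
static struct boot_map map_boot_cpu(unsigned int boot_cpu_count,
				    unsigned int thread_idx,
				    unsigned int nr_cpu_ids)
{
	struct boot_map m;

	assert(nr_cpu_ids > 0);

	/* Remap into [0, threads_per_core) only if the boot cpu does not fit. */
	if (thread_idx != boot_cpu_count && boot_cpu_count >= nr_cpu_ids)
		m.boot_cpuid = thread_idx;
	else
		m.boot_cpuid = boot_cpu_count;

	/* Make sure the boot thread itself is covered by nr_cpu_ids. */
	if (m.boot_cpuid >= nr_cpu_ids)
		nr_cpu_ids = m.boot_cpuid + 1;
	m.nr_cpu_ids = nr_cpu_ids;

	return m;
}
```

For example, a boot cpu with device-tree order 58 (thread 2 of its core) and nr_cpus=1 ends up as logical cpu 2 with nr_cpu_ids raised to 3, whereas a boot cpu with order 10 and nr_cpu_ids = 16 keeps its number, which is the "no impact" case discussed above.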
Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
On Tue, Oct 17, 2023 at 6:39 PM Hari Bathini wrote:
>
>
>
> On 17/10/23 7:58 am, Pingfan Liu wrote:
> > *** Idea ***
> > For kexec -p, the boot cpu can be not the cpu0, this causes the problem
> > of allocating memory for paca_ptrs[]. However, in theory, there is no
> > requirement to assign cpu's logical id as its present sequence in the
> > device tree. But there is something like cpu_first_thread_sibling(),
> > which makes assumption on the mapping inside a core. Hence partially
> > loosening the mapping, i.e. unbind the mapping of core while keep the
> > mapping inside a core.
> >
> > *** Implement ***
> > At this early stage, there are plenty of memory to utilize. Hence, this
> > patch allocates interim memory to link the cpu info on a list, then
> > reorder cpus by changing the list head. As a result, there is a rotate
> > shift between the sequence number in dt and the cpu logical number.
> >
> > *** Result ***
> > After this patch, a boot-cpu's logical id will always be mapped into the
> > range [0,threads_per_core).
> >
> > Besides this, at this phase, all threads in the boot core are forced to
> > be onlined. This restriction will be lifted in a later patch with
> > extra effort.
> >
> > Signed-off-by: Pingfan Liu
> > Cc: Michael Ellerman
> > Cc: Nicholas Piggin
> > Cc: Christophe Leroy
> > Cc: Mahesh Salgaonkar
> > Cc: Wen Xiong
> > Cc: Baoquan He
> > Cc: Ming Lei
> > Cc: Sourabh Jain
> > Cc: Hari Bathini
> > Cc: ke...@lists.infradead.org
> > To: linuxppc-dev@lists.ozlabs.org
>
> Thanks for working on this, Pingfan.
> Looks good to me.
>
> Acked-by: Hari Bathini
>

Thank you for kindly reviewing. I hope that after all these years, we
have accomplished the objective.

Best Regards,

Pingfan
[PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. In theory, however, there is no
requirement that a cpu's logical id follow its order of appearance in the
device tree. But helpers such as cpu_first_thread_sibling() make
assumptions about the mapping inside a core. Hence this patch partially
loosens the mapping, i.e. it unbinds the mapping between cores while
keeping the mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in the dt and the cpu logical number.

*** Result ***
After this patch, the boot cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.

Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Mahesh Salgaonkar Cc: Wen Xiong Cc: Baoquan He Cc: Ming Lei Cc: Sourabh Jain Cc: Hari Bathini Cc: ke...@lists.infradead.org To: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/kernel/prom.c | 25 + arch/powerpc/kernel/setup-common.c | 84 +++--- 2 files changed, 82 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index ec82f5bda908..7ed9034912ca 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -76,7 +76,9 @@ u64 ppc64_rma_size; unsigned int boot_cpu_node_count __ro_after_init; #endif static phys_addr_t first_memblock_size; +#ifdef CONFIG_SMP static int __initdata boot_cpu_count; +#endif static int __init early_parse_mem(char *p) { @@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node, const __be32 *intserv; int i, nthreads; int len; - int found = -1; - int found_thread = 0; + bool found = false; /* We are scanning "cpu" nodes only */ if (type == NULL || strcmp(type, "cpu") != 0) @@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node, for (i = 0; i < nthreads; i++) { if (be32_to_cpu(intserv[i]) == fdt_boot_cpuid_phys(initial_boot_params)) { - found = boot_cpu_count; - found_thread = i; + /* +* always map the boot-cpu logical id into the +* range of [0, thread_per_core) +*/ + boot_cpuid = i; + found = true; + /* This forces all threads in a core to be online */ + if (nr_cpu_ids % nthreads != 0) + set_nr_cpu_ids(ALIGN(nr_cpu_ids, nthreads)); } #ifdef CONFIG_SMP /* logical cpu id is always 0 on UP kernels */ @@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned long node, } /* Not the boot CPU */ - if (found < 0) + if (!found) return 0; - DBG("boot cpu: logical %d physical %d\n", found, - be32_to_cpu(intserv[found_thread])); - boot_cpuid = found; + DBG("boot cpu: logical %d physical %d\n", boot_cpuid, + 
be32_to_cpu(intserv[boot_cpuid])); - boot_cpu_hwid = be32_to_cpu(intserv[found_thread]); + boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]); /* * PAPR defines "logical" PVR values for cpus that diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 707f0490639d..9802c7e5ee2f 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -425,6 +426,13 @@ static void __init cpu_init_thread_core_maps(int tpc) u32 *cpu_to_phys_id = NULL; +struct interrupt_server_node { + struct list_head node; + boolavail; + int len; + __be32 intserv[]; +}; + /** * setup_cpu_maps - initialize the following cpu maps: * cpu_possible_mask @@ -446,11 +454,16 @@ u32 *cpu_to_phys_id = NULL; void __init smp_setup_cpu_maps(void) { struct device_node *dn; - int cpu = 0; - int nthreads = 1; + int shift = 0, cpu = 0; + int j, nthreads = 1; + int len; + struct interrupt_server_node *intserv_node, *n; + struct list_head *bt_node, head; + bool avail, found_boot_cpu = false; DBG("smp_setup_cpu_maps()\n"); + INIT_LIST_HEAD(); cpu_to_phys_id = memblock_alloc(nr_cpu_ids
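The "rotate shift" described in the commit message can be written as simple modular arithmetic. Note that this is an arithmetic model, not the patch's implementation, which links per-core nodes on a list and moves the list head. `rotate_logical_id` is a hypothetical helper, and the model assumes every core has the same number of threads.

```c
#include <assert.h>

/*
 * Map a cpu's sequence number in the device tree to a logical id such
 * that the core containing the boot cpu becomes core 0 and the order of
 * threads inside each core is preserved.
 */
static unsigned int rotate_logical_id(unsigned int seq, unsigned int boot_seq,
				      unsigned int nthreads, unsigned int total)
{
	unsigned int shift;

	assert(nthreads > 0 && total > 0);
	assert(total % nthreads == 0);
	assert(seq < total && boot_seq < total);

	/* Sequence number of the first thread of the boot core. */
	shift = boot_seq / nthreads * nthreads;

	return (seq + total - shift) % total;
}
```

With 64 threads, eight per core, and a boot cpu at sequence number 58, the boot cpu maps to logical id 2, its first sibling (56) maps to 0, and device-tree cpu 0 moves to logical id 8. The boot cpu therefore always lands in [0, threads_per_core), as the Result section states.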
[PATCHv9 1/2] powerpc/setup : Enable boot_cpu_hwid for PPC32
In order to identify the boot cpu, its intserv[] should be recorded and
checked in smp_setup_cpu_maps().

smp_setup_cpu_maps() is shared between PPC64 and PPC32. Since PPC64
already uses boot_cpu_hwid to carry that information, enable this variable
on PPC32 as well, so that the coming patch can use it to carry the same
information for PPC32.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: Sourabh Jain
Cc: Hari Bathini
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/smp.h     | 2 +-
 arch/powerpc/kernel/prom.c         | 3 +--
 arch/powerpc/kernel/setup-common.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..5db9178cc800 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -26,7 +26,7 @@
 #include
 
 extern int boot_cpuid;
-extern int boot_cpu_hwid; /* PPC64 only */
+extern int boot_cpu_hwid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..ec82f5bda908 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -372,8 +372,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	    be32_to_cpu(intserv[found_thread]));
 	boot_cpuid = found;
 
-	if (IS_ENABLED(CONFIG_PPC64))
-		boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+	boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2f1026fba00d..707f0490639d 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -87,9 +87,7 @@ EXPORT_SYMBOL(machine_id);
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
 
-#ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
-#endif
 
 /*
  * These are used in binfmt_elf.c to put aux entries on the stack
-- 
2.31.1
[PATCHv9 0/2] enable nr_cpus for powerpc
From: Pingfan Liu

Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and is now indexed by paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
non-present cpus in the middle.

However, reordering the logical cpu numbers can further decrease the size
of paca_ptrs[] in the kdump case.

These two patches rotate-shift the cpu's sequence number in the device
tree to obtain the logical cpu id.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

---
v8 -> v9
  put aside [3-5/5] in v8 for the time being, as they complicate the code.
  optimize out some unnecessary initialization according to Hari's
  suggestion

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: Sourabh Jain
Cc: Hari Bathini
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (2):
  powerpc/setup : Enable boot_cpu_hwid for PPC32
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
    in dt

 arch/powerpc/include/asm/smp.h     |  2 +-
 arch/powerpc/kernel/prom.c         | 26 +
 arch/powerpc/kernel/setup-common.c | 86 +++---
 3 files changed, 83 insertions(+), 31 deletions(-)

-- 
2.31.1
Re: [PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32
On Mon, Oct 16, 2023 at 12:13:53PM +0530, Sourabh Jain wrote: > Hello Pingfan, > > > > > > > With this patch series applied, the kdump kernel fails to boot on > > > > > > powerpc with nr_cpus=1. > > > > > > > > > > > > Console logs: > > > > > > --- > > > > > > [root]# echo c > /proc/sysrq-trigger > > > > > > [ 74.783235] sysrq: Trigger a crash > > > > > > [ 74.783244] Kernel panic - not syncing: sysrq triggered crash > > > > > > [ 74.783252] CPU: 58 PID: 3838 Comm: bash Kdump: loaded Not > > > > > > tainted > > > > > > 6.6.0-rc5pf-nr-cpus+ #3 > > > > > > [ 74.783259] Hardware name: POWER10 (raw) phyp pSeries > > > > > > [ 74.783275] Call Trace: > > > > > > [ 74.783280] [c0020f4ebac0] [c0ed9f38] > > > > > > dump_stack_lvl+0x6c/0x9c (unreliable) > > > > > > [ 74.783291] [c0020f4ebaf0] [c0150300] > > > > > > panic+0x178/0x438 > > > > > > [ 74.783298] [c0020f4ebb90] [c0936d48] > > > > > > sysrq_handle_crash+0x28/0x30 > > > > > > [ 74.783304] [c0020f4ebbf0] [c093773c] > > > > > > __handle_sysrq+0x10c/0x250 > > > > > > [ 74.783309] [c0020f4ebc90] [c0937fa8] > > > > > > write_sysrq_trigger+0xc8/0x168 > > > > > > [ 74.783314] [c0020f4ebcd0] [c0665d8c] > > > > > > proc_reg_write+0x10c/0x1b0 > > > > > > [ 74.783321] [c0020f4ebd00] [c058da54] > > > > > > vfs_write+0x104/0x4b0 > > > > > > [ 74.783326] [c0020f4ebdc0] [c058dfdc] > > > > > > ksys_write+0x7c/0x140 > > > > > > [ 74.783331] [c0020f4ebe10] [c0033a64] > > > > > > system_call_exception+0x144/0x3a0 > > > > > > [ 74.783337] [c0020f4ebe50] [c000c554] > > > > > > system_call_common+0xf4/0x258 > > > > > > [ 74.783343] --- interrupt: c00 at 0x7fffa0721594 > > > > > > [ 74.783352] NIP: 7fffa0721594 LR: 7fffa0697bf4 CTR: > > > > > > > > > > > > [ 74.783364] REGS: c0020f4ebe80 TRAP: 0c00 Not tainted > > > > > > (6.6.0-rc5pf-nr-cpus+) > > > > > > [ 74.783376] MSR: 8280f033 > > > > > > CR: 2802 XER: > > > > > > [ 74.783394] IRQMASK: 0 > > > > > > [ 74.783394] GPR00: 0004 7c4b6800 > > > > > > 7fffa0807300 > > > > > > 0001 > > > > 
> > [ 74.783394] GPR04: 00013549ea60 0002 > > > > > > 0010 > > > > > > > > > > > > [ 74.783394] GPR08: > > > > > > > > > > > > > > > > > > [ 74.783394] GPR12: 7fffa0abaf70 > > > > > > 4000 > > > > > > 00011a0f9798 > > > > > > [ 74.783394] GPR16: 00011a0f9724 00011a097688 > > > > > > 00011a02ff70 > > > > > > 00011a0fd568 > > > > > > [ 74.783394] GPR20: 000135554bf0 0001 > > > > > > 00011a0aa478 > > > > > > 7c4b6a24 > > > > > > [ 74.783394] GPR24: 7c4b6a20 00011a0faf94 > > > > > > 0002 > > > > > > 00013549ea60 > > > > > > [ 74.783394] GPR28: 0002 7fffa08017a0 > > > > > > 00013549ea60 > > > > > > 0002 > > > > > > [ 74.783440] NIP [7fffa0721594] 0x7fffa0721594 > > > > > > [ 74.783443] LR [7fffa0697bf4] 0x7fffa0697bf4 > > > > > > [ 74.783447] --- interrupt: c00 > > > > > > I'm in purgatory > > > > > > [0.00] radix-mmu: Page sizes from device-tree: > > > > > > [0.00] radix-mmu: Page size shift = 12 AP=0x0 > > > > > > [0.00] radix-mmu: Page size shift = 16 AP=0x5 > > > > > > [0.00] radix-mmu: Page size shift = 21 AP=0x1 > > > > > > [0.00] radix-mmu: Page size shift = 30 AP=0x2 > > > > > > [0.00] Activating Kernel Userspace Access Prevention > > > > > > [0.00] Activating Kernel Userspace Execution Prevention > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x-0x0001 > > > > > > with 64.0 KiB pages (exec) > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x0001-0x0020 > > > > > > with 64.0 KiB pages > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x0020-0x2000 > > > > > > with 2.00 MiB pages > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x2000-0x2260 > > > > > > with 2.00 MiB pages (exec) > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x2260-0x4000 > > > > > > with 2.00 MiB pages > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x4000-0x00018000 > > > > > > with 1.00 GiB pages > > > > > > [0.00] radix-mmu: Mapped > > > > > > 0x00018000-0x0001a000 > > > > > > with 2.00 MiB pages > > > > > > [0.00] lpar: Using radix MMU under hypervisor > > > > > > [0.00] Linux version 
6.6.0-rc5pf-nr-cpus+ > > > > > > (r...@ltcever7x0-lp1.aus.stglabs.ibm.com) (gcc (GCC) 8.5.0 20210514 > > > > > > (Red > > > > > > Hat
Re: [PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32
On Wed, Oct 11, 2023 at 6:53 PM Sourabh Jain wrote: > > Hello Pingfan, > >>> With this patch series applied, the kdump kernel fails to boot on > >>> powerpc with nr_cpus=1. > >>> > >>> Console logs: > >>> --- > >>> [root]# echo c > /proc/sysrq-trigger > >>> [ 74.783235] sysrq: Trigger a crash > >>> [ 74.783244] Kernel panic - not syncing: sysrq triggered crash > >>> [ 74.783252] CPU: 58 PID: 3838 Comm: bash Kdump: loaded Not tainted > >>> 6.6.0-rc5pf-nr-cpus+ #3 > >>> [ 74.783259] Hardware name: POWER10 (raw) phyp pSeries > >>> [ 74.783275] Call Trace: > >>> [ 74.783280] [c0020f4ebac0] [c0ed9f38] > >>> dump_stack_lvl+0x6c/0x9c (unreliable) > >>> [ 74.783291] [c0020f4ebaf0] [c0150300] panic+0x178/0x438 > >>> [ 74.783298] [c0020f4ebb90] [c0936d48] > >>> sysrq_handle_crash+0x28/0x30 > >>> [ 74.783304] [c0020f4ebbf0] [c093773c] > >>> __handle_sysrq+0x10c/0x250 > >>> [ 74.783309] [c0020f4ebc90] [c0937fa8] > >>> write_sysrq_trigger+0xc8/0x168 > >>> [ 74.783314] [c0020f4ebcd0] [c0665d8c] > >>> proc_reg_write+0x10c/0x1b0 > >>> [ 74.783321] [c0020f4ebd00] [c058da54] > >>> vfs_write+0x104/0x4b0 > >>> [ 74.783326] [c0020f4ebdc0] [c058dfdc] > >>> ksys_write+0x7c/0x140 > >>> [ 74.783331] [c0020f4ebe10] [c0033a64] > >>> system_call_exception+0x144/0x3a0 > >>> [ 74.783337] [c0020f4ebe50] [c000c554] > >>> system_call_common+0xf4/0x258 > >>> [ 74.783343] --- interrupt: c00 at 0x7fffa0721594 > >>> [ 74.783352] NIP: 7fffa0721594 LR: 7fffa0697bf4 CTR: > >>> > >>> [ 74.783364] REGS: c0020f4ebe80 TRAP: 0c00 Not tainted > >>> (6.6.0-rc5pf-nr-cpus+) > >>> [ 74.783376] MSR: 8280f033 > >>> CR: 2802 XER: > >>> [ 74.783394] IRQMASK: 0 > >>> [ 74.783394] GPR00: 0004 7c4b6800 7fffa0807300 > >>> 0001 > >>> [ 74.783394] GPR04: 00013549ea60 0002 0010 > >>> > >>> [ 74.783394] GPR08: > >>> > >>> [ 74.783394] GPR12: 7fffa0abaf70 4000 > >>> 00011a0f9798 > >>> [ 74.783394] GPR16: 00011a0f9724 00011a097688 00011a02ff70 > >>> 00011a0fd568 > >>> [ 74.783394] GPR20: 000135554bf0 0001 00011a0aa478 > >>> 
7c4b6a24 > >>> [ 74.783394] GPR24: 7c4b6a20 00011a0faf94 0002 > >>> 00013549ea60 > >>> [ 74.783394] GPR28: 0002 7fffa08017a0 00013549ea60 > >>> 0002 > >>> [ 74.783440] NIP [7fffa0721594] 0x7fffa0721594 > >>> [ 74.783443] LR [7fffa0697bf4] 0x7fffa0697bf4 > >>> [ 74.783447] --- interrupt: c00 > >>> I'm in purgatory > >>> [0.00] radix-mmu: Page sizes from device-tree: > >>> [0.00] radix-mmu: Page size shift = 12 AP=0x0 > >>> [0.00] radix-mmu: Page size shift = 16 AP=0x5 > >>> [0.00] radix-mmu: Page size shift = 21 AP=0x1 > >>> [0.00] radix-mmu: Page size shift = 30 AP=0x2 > >>> [0.00] Activating Kernel Userspace Access Prevention > >>> [0.00] Activating Kernel Userspace Execution Prevention > >>> [0.00] radix-mmu: Mapped 0x-0x0001 > >>> with 64.0 KiB pages (exec) > >>> [0.00] radix-mmu: Mapped 0x0001-0x0020 > >>> with 64.0 KiB pages > >>> [0.00] radix-mmu: Mapped 0x0020-0x2000 > >>> with 2.00 MiB pages > >>> [0.00] radix-mmu: Mapped 0x2000-0x2260 > >>> with 2.00 MiB pages (exec) > >>> [0.00] radix-mmu: Mapped 0x2260-0x4000 > >>> with 2.00 MiB pages > >>> [0.00] radix-mmu: Mapped 0x4000-0x00018000 > >>> with 1.00 GiB pages > >>> [0.00] radix-mmu: Mapped 0x00018000-0x0001a000 > >>> with 2.00 MiB pages > >>> [0.00] lpar: Using radix MMU under hypervisor > >>> [0.00] Linux version 6.6.0-rc5pf-nr-cpus+ > >>> (r...@ltcever7x0-lp1.aus.stglabs.ibm.com) (gcc (GCC) 8.5.0 20210514 (Red > >>> Hat 8.5.0-20), GNU ld version 2.30-123.el8) #3 SMP Mon Oct 9 11:07: > >>> 41 CDT 2023 > >>> [0.00] Found initrd at 0xc00022e6:0xc000248f08d8 > >>> [0.00] Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 > >>> 0xf06 of:IBM,FW1060.00 (NM1060_016) hv:phyp pSeries > >>> [0.00] printk: bootconsole [udbg0] enabled > >>> [0.00] the round shift between dt seq and the cpu logic number: > >>> 56 > >>> [0.00] BUG: Unable to handle kernel data access on write at > >>> 0xc001a000 > >>> [0.00] Faulting instruction address: 0xc00022009c64 > >>> [0.00] Oops: Kernel access of bad area, sig: 11 [#1] > >>> 
[0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > >>>
Re: [PATCHv8 2/5] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
On Tue, Oct 10, 2023 at 04:07:00PM +0530, Hari Bathini wrote: > > > On 09/10/23 5:00 pm, Pingfan Liu wrote: > > *** Idea *** > > For kexec -p, the boot cpu can be not the cpu0, this causes the problem > > of allocating memory for paca_ptrs[]. However, in theory, there is no > > requirement to assign cpu's logical id as its present sequence in the > > device tree. But there is something like cpu_first_thread_sibling(), > > which makes assumption on the mapping inside a core. Hence partially > > loosening the mapping, i.e. unbind the mapping of core while keep the > > mapping inside a core. > > > > *** Implement *** > > At this early stage, there are plenty of memory to utilize. Hence, this > > patch allocates interim memory to link the cpu info on a list, then > > reorder cpus by changing the list head. As a result, there is a rotate > > shift between the sequence number in dt and the cpu logical number. > > > > *** Result *** > > After this patch, a boot-cpu's logical id will always be mapped into the > > range [0,threads_per_core). > > > > Besides this, at this phase, all threads in the boot core are forced to > > be onlined. This restriction will be lifted in a later patch with > > extra effort. 
> > > > Signed-off-by: Pingfan Liu > > Cc: Michael Ellerman > > Cc: Nicholas Piggin > > Cc: Christophe Leroy > > Cc: Mahesh Salgaonkar > > Cc: Wen Xiong > > Cc: Baoquan He > > Cc: Ming Lei > > Cc: ke...@lists.infradead.org > > To: linuxppc-dev@lists.ozlabs.org > > --- > > arch/powerpc/kernel/prom.c | 25 + > > arch/powerpc/kernel/setup-common.c | 87 +++--- > > 2 files changed, 85 insertions(+), 27 deletions(-) > > > > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c > > index ec82f5bda908..87272a2d8c10 100644 > > --- a/arch/powerpc/kernel/prom.c > > +++ b/arch/powerpc/kernel/prom.c > > @@ -76,7 +76,9 @@ u64 ppc64_rma_size; > > unsigned int boot_cpu_node_count __ro_after_init; > > #endif > > static phys_addr_t first_memblock_size; > > +#ifdef CONFIG_SMP > > static int __initdata boot_cpu_count; > > +#endif > > static int __init early_parse_mem(char *p) > > { > > @@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long > > node, > > const __be32 *intserv; > > int i, nthreads; > > int len; > > - int found = -1; > > - int found_thread = 0; > > + bool found = false; > > /* We are scanning "cpu" nodes only */ > > if (type == NULL || strcmp(type, "cpu") != 0) > > @@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > > for (i = 0; i < nthreads; i++) { > > if (be32_to_cpu(intserv[i]) == > > fdt_boot_cpuid_phys(initial_boot_params)) { > > - found = boot_cpu_count; > > - found_thread = i; > > + /* > > +* always map the boot-cpu logical id into the > > +* range of [0, thread_per_core) > > +*/ > > + boot_cpuid = i; > > + found = true; > > + /* This works around the hole in paca_ptrs[]. 
*/ > > + if (nr_cpu_ids < nthreads) > > + set_nr_cpu_ids(nthreads); > > } > > #ifdef CONFIG_SMP > > /* logical cpu id is always 0 on UP kernels */ > > @@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned > > long node, > > } > > /* Not the boot CPU */ > > - if (found < 0) > > + if (!found) > > return 0; > > - DBG("boot cpu: logical %d physical %d\n", found, > > - be32_to_cpu(intserv[found_thread])); > > - boot_cpuid = found; > > + DBG("boot cpu: logical %d physical %d\n", boot_cpuid, > > + be32_to_cpu(intserv[boot_cpuid])); > > - boot_cpu_hwid = be32_to_cpu(intserv[found_thread]); > > + boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]); > > /* > > * PAPR defines "logical" PVR values for cpus that > > diff --git a/arch/powerpc/kernel/setup-common.c > > b/arch/powerpc/kernel/setup-common.c > > index 1b19a9815672..81291e13dec0 100644 > > --- a/arch/powerpc/kernel/setup-common.c > > +++ b/arch/powerpc/kernel/setup-common.c > > @@ -36,6 +36,7 @@ > > #includ
Re: [PATCHv8 3/5] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
On Tue, Oct 10, 2023 at 01:56:13PM +0530, Hari Bathini wrote: > > > On 09/10/23 5:00 pm, Pingfan Liu wrote: > > If the boot_cpuid is smaller than nr_cpus, it requires extra effort to > > ensure the boot_cpu is in cpu_present_mask. This can be achieved by > > reserving the last quota for the boot cpu. > > > > Note: the restriction on nr_cpus will be lifted with more effort in the > > successive patches > > > > Signed-off-by: Pingfan Liu > > Cc: Michael Ellerman > > Cc: Nicholas Piggin > > Cc: Christophe Leroy > > Cc: Mahesh Salgaonkar > > Cc: Wen Xiong > > Cc: Baoquan He > > Cc: Ming Lei > > Cc: ke...@lists.infradead.org > > To: linuxppc-dev@lists.ozlabs.org > > --- > > arch/powerpc/kernel/setup-common.c | 25 ++--- > > 1 file changed, 22 insertions(+), 3 deletions(-) > > > > diff --git a/arch/powerpc/kernel/setup-common.c > > b/arch/powerpc/kernel/setup-common.c > > index 81291e13dec0..f9ef0a2666b0 100644 > > --- a/arch/powerpc/kernel/setup-common.c > > +++ b/arch/powerpc/kernel/setup-common.c > > @@ -454,8 +454,8 @@ struct interrupt_server_node { > > void __init smp_setup_cpu_maps(void) > > { > > struct device_node *dn; > > - int shift = 0, cpu = 0; > > - int j, nthreads = 1; > > + int terminate, shift = 0, cpu = 0; > > + int j, bt_thread = 0, nthreads = 1; > > int len; > > struct interrupt_server_node *intserv_node, *n; > > struct list_head *bt_node, head; > > @@ -518,6 +518,7 @@ void __init smp_setup_cpu_maps(void) > > for (j = 0 ; j < nthreads; j++) { > > if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) { > > bt_node = _node->node; > > + bt_thread = j; > > found_boot_cpu = true; > > /* > > * Record the round-shift between dt > > @@ -537,11 +538,21 @@ void __init smp_setup_cpu_maps(void) > > /* Select the primary thread, the boot cpu's slibing, as the logic 0 */ > > list_add_tail(, bt_node); > > pr_info("the round shift between dt seq and the cpu logic number: > > %d\n", shift); > > + terminate = nr_cpu_ids; > > list_for_each_entry(intserv_node, , node) { > > + j = 
0; > > > + /* Choose a start point to cover the boot cpu */
> > +		if (nr_cpu_ids - 1 < bt_thread) {
> > +			/*
> > +			 * The processor core puts assumption on the thread id,
> > +			 * not to breach the assumption.
> > +			 */
> > +			terminate = nr_cpu_ids - 1;
>
> nthreads is anyway assumed to be same for all cores. So, enforcing
> nr_cpu_ids to a minimum of nthreads (and multiple of nthreads) should
> make the code much simpler without the need for above check and the
> other complexities addressed in the subsequent patches...
>

Indeed, this series can be split into two parts, [1-2/5] and [3-5/5].
In [1-2/5], if nr_cpu_ids is smaller than nthreads, it is enforced to be
equal to nthreads. I will make it align upward on nthreads in the next
version. So [1-2/5] can be totally independent from the rest of the
patches in this series.

From an engineer's perspective, [3-5/5] are added to maintain the
nr_cpus semantics. (Finally, nr_cpus=1 can be achieved, but it requires
effort in other subsystems.)

Testing result on my Power9 machine with SMT=4:

-1. taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger'
    kdump:/# cat /proc/meminfo | grep Percpu
    Percpu:              896 kB
    kdump:/# cat /sys/devices/system/cpu/possible
    0

-2. taskset -c 5 bash -c 'echo c > /proc/sysrq-trigger'
    kdump:/# cat /proc/meminfo | grep Percpu
    Percpu:             1792 kB
    kdump:/# cat /sys/devices/system/cpu/possible
    0-1

-3. taskset -c 6 bash -c 'echo c > /proc/sysrq-trigger'
    kdump:/# cat /proc/meminfo | grep Percpu
    Percpu:             1792 kB
    kdump:/# cat /sys/devices/system/cpu/possible
    0,2

-4. taskset -c 7 bash -c 'echo c > /proc/sysrq-trigger'
    kdump:/# cat /proc/meminfo | grep Percpu
    Percpu:             1792 kB
    kdump:/# cat /sys/devices/system/cpu/possible
    0,3

Thanks,

Pingfan
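The "align upward on nthreads" idea discussed above could look roughly like the following. This is only a user-space sketch, not code from the series: align_nr_cpu_ids() is a hypothetical helper name, and it assumes every core has the same number of threads.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the posted patches): raise a requested
 * CPU limit so that it covers at least one full core and is a multiple of
 * the per-core thread count.
 */
static unsigned int align_nr_cpu_ids(unsigned int nr_cpu_ids,
				     unsigned int nthreads)
{
	assert(nthreads > 0);

	/* Never go below the threads of a single core. */
	if (nr_cpu_ids < nthreads)
		return nthreads;

	/* Round up to the next multiple of nthreads. */
	return ((nr_cpu_ids + nthreads - 1) / nthreads) * nthreads;
}
```

With four threads per core, a request of nr_cpus=1 would become 4 and a request of 5 would become 8, so no core is ever split by the limit.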
Re: [PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32
On Tue, Oct 10, 2023 at 02:38:40PM +0530, Sourabh Jain wrote: > Hello Pingfan, > > > > > With this patch series applied, the kdump kernel fails to boot on > > powerpc with nr_cpus=1. > > > > Console logs: > > --- > > [root]# echo c > /proc/sysrq-trigger > > [ 74.783235] sysrq: Trigger a crash > > [ 74.783244] Kernel panic - not syncing: sysrq triggered crash > > [ 74.783252] CPU: 58 PID: 3838 Comm: bash Kdump: loaded Not tainted > > 6.6.0-rc5pf-nr-cpus+ #3 > > [ 74.783259] Hardware name: POWER10 (raw) phyp pSeries > > [ 74.783275] Call Trace: > > [ 74.783280] [c0020f4ebac0] [c0ed9f38] > > dump_stack_lvl+0x6c/0x9c (unreliable) > > [ 74.783291] [c0020f4ebaf0] [c0150300] panic+0x178/0x438 > > [ 74.783298] [c0020f4ebb90] [c0936d48] > > sysrq_handle_crash+0x28/0x30 > > [ 74.783304] [c0020f4ebbf0] [c093773c] > > __handle_sysrq+0x10c/0x250 > > [ 74.783309] [c0020f4ebc90] [c0937fa8] > > write_sysrq_trigger+0xc8/0x168 > > [ 74.783314] [c0020f4ebcd0] [c0665d8c] > > proc_reg_write+0x10c/0x1b0 > > [ 74.783321] [c0020f4ebd00] [c058da54] > > vfs_write+0x104/0x4b0 > > [ 74.783326] [c0020f4ebdc0] [c058dfdc] > > ksys_write+0x7c/0x140 > > [ 74.783331] [c0020f4ebe10] [c0033a64] > > system_call_exception+0x144/0x3a0 > > [ 74.783337] [c0020f4ebe50] [c000c554] > > system_call_common+0xf4/0x258 > > [ 74.783343] --- interrupt: c00 at 0x7fffa0721594 > > [ 74.783352] NIP: 7fffa0721594 LR: 7fffa0697bf4 CTR: > > > > [ 74.783364] REGS: c0020f4ebe80 TRAP: 0c00 Not tainted > > (6.6.0-rc5pf-nr-cpus+) > > [ 74.783376] MSR: 8280f033 > > CR: 2802 XER: > > [ 74.783394] IRQMASK: 0 > > [ 74.783394] GPR00: 0004 7c4b6800 7fffa0807300 > > 0001 > > [ 74.783394] GPR04: 00013549ea60 0002 0010 > > > > [ 74.783394] GPR08: > > > > [ 74.783394] GPR12: 7fffa0abaf70 4000 > > 00011a0f9798 > > [ 74.783394] GPR16: 00011a0f9724 00011a097688 00011a02ff70 > > 00011a0fd568 > > [ 74.783394] GPR20: 000135554bf0 0001 00011a0aa478 > > 7c4b6a24 > > [ 74.783394] GPR24: 7c4b6a20 00011a0faf94 0002 > > 00013549ea60 > > [ 
74.783394] GPR28: 0002 7fffa08017a0 00013549ea60 > > 0002 > > [ 74.783440] NIP [7fffa0721594] 0x7fffa0721594 > > [ 74.783443] LR [7fffa0697bf4] 0x7fffa0697bf4 > > [ 74.783447] --- interrupt: c00 > > I'm in purgatory > > [ 0.00] radix-mmu: Page sizes from device-tree: > > [ 0.00] radix-mmu: Page size shift = 12 AP=0x0 > > [ 0.00] radix-mmu: Page size shift = 16 AP=0x5 > > [ 0.00] radix-mmu: Page size shift = 21 AP=0x1 > > [ 0.00] radix-mmu: Page size shift = 30 AP=0x2 > > [ 0.00] Activating Kernel Userspace Access Prevention > > [ 0.00] Activating Kernel Userspace Execution Prevention > > [ 0.00] radix-mmu: Mapped 0x-0x0001 > > with 64.0 KiB pages (exec) > > [ 0.00] radix-mmu: Mapped 0x0001-0x0020 > > with 64.0 KiB pages > > [ 0.00] radix-mmu: Mapped 0x0020-0x2000 > > with 2.00 MiB pages > > [ 0.00] radix-mmu: Mapped 0x2000-0x2260 > > with 2.00 MiB pages (exec) > > [ 0.00] radix-mmu: Mapped 0x2260-0x4000 > > with 2.00 MiB pages > > [ 0.00] radix-mmu: Mapped 0x4000-0x00018000 > > with 1.00 GiB pages > > [ 0.00] radix-mmu: Mapped 0x00018000-0x0001a000 > > with 2.00 MiB pages > > [ 0.00] lpar: Using radix MMU under hypervisor > > [ 0.00] Linux version 6.6.0-rc5pf-nr-cpus+ > > (r...@ltcever7x0-lp1.aus.stglabs.ibm.com) (gcc (GCC) 8.5.0 20210514 (Red > > Hat 8.5.0-20), GNU ld version 2.30-123.el8) #3 SMP Mon Oct 9 11:07: > > 41 CDT 2023 > > [ 0.00] Found initrd at 0xc00022e6:0xc000248f08d8 > > [ 0.00] Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 > > 0xf06 of:IBM,FW1060.00 (NM1060_016) hv:phyp pSeries > > [ 0.00] printk: bootconsole [udbg0] enabled > > [ 0.00] the round shift between dt seq and the cpu logic number: > > 56 > > [ 0.00] BUG: Unable to handle kernel data access on write at > > 0xc001a000 > > [ 0.00] Faulting instruction address: 0xc00022009c64 > > [ 0.00] Oops: Kernel access of bad area, sig: 11 [#1] > > [ 0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > > [ 0.00] Modules linked in: > > [ 0.00] CPU: 2 PID: 0 Comm: swapper Not tainted > > 
6.6.0-rc5pf-nr-cpus+ #3 > > [ 0.00] Hardware name: POWER10 (raw)
[PATCHv8 5/5] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid
paca_ptrs should be large enough to hold the boot_cpuid, hence, its
lower bound is set to the larger of boot_cpuid+1 and nr_cpus.

On the other hand, some kernel components make assumptions about cpu0:
-1. The timer assumes cpu0 is online, since the timer_list->flags
    subfield 'TIMER_CPUMASK' is zero if not initialized to a proper
    present cpu.
-2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
boot cpu is not cpu0.

Result:
When nr_cpus=1, taskset -c 14 bash -c 'echo c > /proc/sysrq-trigger'
the kdump kernel brings up two cpus.
While when taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger',
the kdump kernel brings up one cpu.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-	paca_nr_cpu_ids = nr_cpu_ids;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_nr_cpu_ids = n;
+	paca_ptrs_size = sizeof(struct paca_struct *) * n;
 	paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
 	if (!paca_ptrs)
 		panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
 	int new_ptrs_size;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	new_ptrs_size = sizeof(struct paca_struct *) * n;
 	if (new_ptrs_size < paca_ptrs_size)
 		memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
 				   paca_ptrs_size - new_ptrs_size);
 
-	paca_nr_cpu_ids = nr_cpu_ids;
+	paca_nr_cpu_ids = n;
 	paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 87272a2d8c10..15c994f54bf9 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 			 */
 			boot_cpuid = i;
 			found = true;
-			/* This works around the hole in paca_ptrs[]. */
-			if (nr_cpu_ids < nthreads)
-				set_nr_cpu_ids(nthreads);
+			/*
+			 * Ideally, nr_cpus=1 can be achieved if each kernel
+			 * component does not assume cpu0 is onlined.
+			 */
+			if (boot_cpuid != 0 && nr_cpu_ids < 2)
+				set_nr_cpu_ids(2);
 		}
 #ifdef CONFIG_SMP
 		/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1
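The sizing rule in this patch can be checked outside the kernel with a small model. The sketch below is an assumption-laden illustration, not the patch itself: paca_ptrs_slots(), paca_ptrs_bytes() and lift_nr_cpu_ids() are hypothetical names, and struct paca_struct is only forward-declared.

```c
#include <stddef.h>

/* Stand-in for the kernel type; only pointers to it are used here. */
struct paca_struct;

/* Slots needed in paca_ptrs[]: the larger of boot_cpuid + 1 and nr_cpu_ids. */
static int paca_ptrs_slots(int boot_cpuid, int nr_cpu_ids)
{
	int need = boot_cpuid + 1;

	return need > nr_cpu_ids ? need : nr_cpu_ids;
}

/* Size in bytes of the pointer array for that many slots. */
static size_t paca_ptrs_bytes(int boot_cpuid, int nr_cpu_ids)
{
	return sizeof(struct paca_struct *) *
	       (size_t)paca_ptrs_slots(boot_cpuid, nr_cpu_ids);
}

/* Mirrors the prom.c hunk: keep room for cpu0 when the boot cpu is not cpu0. */
static int lift_nr_cpu_ids(int boot_cpuid, int nr_cpu_ids)
{
	if (boot_cpuid != 0 && nr_cpu_ids < 2)
		return 2;
	return nr_cpu_ids;
}
```

For example, a boot cpu with logical id 2 and nr_cpus=1 would need three pointer slots while nr_cpu_ids itself is only lifted to 2, which is why the array size and the cpu limit are computed separately.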
[PATCHv8 4/5] powerpc/cpu: Skip impossible cpu during iteration on a core
The threads in a core have equal status, so the code commonly uses a
for-loop pattern to execute the same task on each thread:
  for (i = first_thread; i < first_thread + threads_per_core; i++)

Now that some threads may not be in the cpu_possible_mask, the iteration
skips those threads by checking the mask. In this way, the unpopulated
pcpu struct is skipped and never accessed.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/cputhreads.h    |  6 +
 arch/powerpc/kernel/smp.c                |  2 +-
 arch/powerpc/kvm/book3s_hv.c             |  7 ++
 arch/powerpc/platforms/powernv/idle.c    | 32 
 arch/powerpc/platforms/powernv/subcore.c |  5 +++-
 5 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
index f26c430f3982..fdb71ff7f6a9 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -65,6 +65,12 @@ static inline int cpu_last_thread_sibling(int cpu)
 	return cpu | (threads_per_core - 1);
 }
 
+#define for_each_possible_cpu_in_core(start, iter)				\
+	for (iter = start; iter < start + threads_per_core; iter++)	\
+		if (unlikely(!cpu_possible(iter)))			\
+			continue;					\
+		else
+
 /*
  * tlb_thread_siblings are siblings which share a TLB. This is not
  * architected, is not something a hypervisor could emulate and a future
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index fbbb695bae3d..2936f7a2240d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -933,7 +933,7 @@ static int __init update_mask_from_threadgroup(cpumask_var_t *mask, struct threa
 
 	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
-	for (i = first_thread; i < first_thread + threads_per_core; i++) {
+	for_each_possible_cpu_in_core(first_thread, i) {
 		int i_group_start = get_cpu_thread_group_start(i, tg);
 
 		if (unlikely(i_group_start == -1)) {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..ff4b3f8affba 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6235,12 +6235,9 @@ static int kvm_init_subcore_bitmap(void)
 			return -ENOMEM;
 
 
-		for (j = 0; j < threads_per_core; j++) {
-			int cpu = first_cpu + j;
-
-			paca_ptrs[cpu]->sibling_subcore_state =
+		for_each_possible_cpu_in_core(first_cpu, j)
+			paca_ptrs[j]->sibling_subcore_state =
 						sibling_subcore_state;
-		}
 	}
 	return 0;
 }
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index ad41dffe4d92..79d81ce5cf4c 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -823,36 +823,36 @@ void pnv_power9_force_smt4_catch(void)
 
 	cpu = smp_processor_id();
 	cpu0 = cpu & ~(threads_per_core - 1);
-	for (thr = 0; thr < threads_per_core; ++thr) {
-		if (cpu != cpu0 + thr)
-			atomic_inc(&paca_ptrs[cpu0+thr]->dont_stop);
+	for_each_possible_cpu_in_core(cpu0, thr) {
+		if (cpu != thr)
+			atomic_inc(&paca_ptrs[thr]->dont_stop);
 	}
 	/* order setting dont_stop vs testing requested_psscr */
 	smp_mb();
-	for (thr = 0; thr < threads_per_core; ++thr) {
-		if (!paca_ptrs[cpu0+thr]->requested_psscr)
+	for_each_possible_cpu_in_core(cpu0, thr) {
+		if (!paca_ptrs[thr]->requested_psscr)
 			++awake_threads;
 		else
-			poke_threads |= (1 << thr);
+			poke_threads |= (1 << (thr - cpu0));
 	}
 
 	/* If at least 3 threads are awake, the core is in SMT4 already */
 	if (awake_threads < need_awake) {
 		/* We have to wake some threads; we'll use msgsnd */
-		for (thr = 0; thr < threads_per_core; ++thr) {
-			if (poke_threads & (1 << thr)) {
+		for_each_possible_cpu_in_core(cpu0, thr) {
+			if (poke_threads & (1 << (thr - cpu0))) {
 				ppc_msgsnd_sync();
 				ppc_msgsnd(PPC_DBELL_MSGTYPE, 0,
-					   paca_ptrs[cpu0+thr]->hw_cpu_id);
+					   paca_ptrs[thr]->hw_cp
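The iteration pattern introduced by this patch can be tried in user space. The following is a minimal model under stated assumptions: THREADS_PER_CORE, possible_mask and possible_threads_in_core() are hypothetical stand-ins, and cpu_possible() here is a plain bit test rather than the kernel's cpumask API.

```c
#include <stdbool.h>

/* Hypothetical fixed SMT width for this model. */
#define THREADS_PER_CORE 4

/* Stand-in for cpu_possible_mask: bit n set means cpu n is possible. */
static unsigned long possible_mask;

static bool cpu_possible(int cpu)
{
	return (possible_mask >> cpu) & 1UL;
}

/* Same shape as the macro in the patch: skip threads that are not possible. */
#define for_each_possible_cpu_in_core(start, iter)				\
	for ((iter) = (start); (iter) < (start) + THREADS_PER_CORE; (iter)++)	\
		if (!cpu_possible(iter))					\
			continue;						\
		else

/* Return a bitmap of thread offsets within the core that were visited. */
static unsigned int possible_threads_in_core(int first_thread)
{
	unsigned int bits = 0;
	int thr;

	for_each_possible_cpu_in_core(first_thread, thr)
		bits |= 1U << (thr - first_thread);

	return bits;
}
```

Note that the iterator now holds the logical cpu number rather than the thread offset, which is why the idle.c hunks subtract cpu0 before shifting.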
[PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32
In order to identify the boot cpu, its intserv[] should be recorded and
checked in smp_setup_cpu_maps().

smp_setup_cpu_maps() is shared between PPC64 and PPC32. Since PPC64
already uses boot_cpu_hwid to carry that information, enable this
variable on PPC32 as well, so that the coming patch can use it to carry
the same information for PPC32.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/smp.h     | 2 +-
 arch/powerpc/kernel/prom.c         | 3 +--
 arch/powerpc/kernel/setup-common.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..5db9178cc800 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -26,7 +26,7 @@
 #include
 
 extern int boot_cpuid;
-extern int boot_cpu_hwid; /* PPC64 only */
+extern int boot_cpu_hwid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..ec82f5bda908 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -372,8 +372,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	    be32_to_cpu(intserv[found_thread]));
 	boot_cpuid = found;
 
-	if (IS_ENABLED(CONFIG_PPC64))
-		boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+	boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index d2a446216444..1b19a9815672 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -87,9 +87,7 @@ EXPORT_SYMBOL(machine_id);
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
 
-#ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
-#endif
 
 /*
  * These are used in binfmt_elf.c to put aux entries on the stack
-- 
2.31.1
[PATCHv8 3/5] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
If the boot_cpuid is not smaller than nr_cpus, it requires extra effort
to ensure the boot_cpu is in cpu_present_mask. This can be achieved by
reserving the last slot for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
successive patches.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 81291e13dec0..f9ef0a2666b0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -454,8 +454,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int shift = 0, cpu = 0;
-	int j, nthreads = 1;
+	int terminate, shift = 0, cpu = 0;
+	int j, bt_thread = 0, nthreads = 1;
 	int len;
 	struct interrupt_server_node *intserv_node, *n;
 	struct list_head *bt_node, head;
@@ -518,6 +518,7 @@ void __init smp_setup_cpu_maps(void)
 		for (j = 0 ; j < nthreads; j++) {
 			if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
 				bt_node = &intserv_node->node;
+				bt_thread = j;
 				found_boot_cpu = true;
 				/*
 				 * Record the round-shift between dt
@@ -537,11 +538,21 @@ void __init smp_setup_cpu_maps(void)
 	/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
 	list_add_tail(&head, bt_node);
 	pr_info("the round shift between dt seq and the cpu logic number: %d\n", shift);
+	terminate = nr_cpu_ids;
 	list_for_each_entry(intserv_node, &head, node) {
 
+		j = 0;
+		/* Choose a start point to cover the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			/*
+			 * The processor core puts assumption on the thread id,
+			 * not to breach the assumption.
+			 */
+			terminate = nr_cpu_ids - 1;
+		}
 		avail = intserv_node->avail;
 		nthreads = intserv_node->len / sizeof(int);
-		for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+		for (; j < nthreads && cpu < terminate; j++) {
 			set_cpu_present(cpu, avail);
 			set_cpu_possible(cpu, true);
 			cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
@@ -549,6 +560,14 @@ void __init smp_setup_cpu_maps(void)
 				j, cpu, be32_to_cpu(intserv_node->intserv[j]));
 			cpu++;
 		}
+		/* Online the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			set_cpu_present(bt_thread, avail);
+			set_cpu_possible(bt_thread, true);
+			cpu_to_phys_id[bt_thread] = be32_to_cpu(intserv_node->intserv[bt_thread]);
+			DBG("thread %d -> cpu %d (hard id %d)\n",
+			    bt_thread, bt_thread, be32_to_cpu(intserv_node->intserv[bt_thread]));
+		}
 	}
 
 	list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1
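The effect of reserving the last slot can be modelled for the boot core alone. This is a simplified sketch, not the kernel loop: boot_core_possible_mask() is a hypothetical name, only the boot core is considered, and the result is a plain bitmask instead of cpu_possible_mask.

```c
#include <assert.h>

/*
 * Simplified model of the boot core only: return a bitmask of the logical
 * cpu ids that would be marked possible for a given limit and boot thread.
 */
static unsigned int boot_core_possible_mask(int nr_cpu_ids, int nthreads,
					    int bt_thread)
{
	unsigned int mask = 0;
	int terminate = nr_cpu_ids;
	int cpu;

	assert(nr_cpu_ids > 0 && bt_thread >= 0 && bt_thread < nthreads);

	/* The boot thread lies beyond the limit: give up one regular slot. */
	if (nr_cpu_ids - 1 < bt_thread)
		terminate = nr_cpu_ids - 1;

	for (cpu = 0; cpu < nthreads && cpu < terminate; cpu++)
		mask |= 1U << cpu;

	/* The reserved slot goes to the boot thread itself. */
	if (nr_cpu_ids - 1 < bt_thread)
		mask |= 1U << bt_thread;

	return mask;
}
```

Assuming nr_cpu_ids = 2 and four threads per core, the model gives cpus 0-1 for boot thread 1, cpus 0,2 for boot thread 2 and cpus 0,3 for boot thread 3, which appears to match the possible sets reported from the Power9 test in the reply to this patch.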
[PATCHv8 2/5] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. However, in theory, there is no
requirement to assign a cpu's logical id according to its sequence in
the device tree. But there is something like cpu_first_thread_sibling(),
which makes assumptions about the mapping inside a core. Hence partially
loosen the mapping, i.e. unbind the mapping of cores while keeping the
mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in dt and the cpu logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c         | 25 +
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index ec82f5bda908..87272a2d8c10 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	const __be32 *intserv;
 	int i, nthreads;
 	int len;
-	int found = -1;
-	int found_thread = 0;
+	bool found = false;
 
 	/* We are scanning "cpu" nodes only */
 	if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	for (i = 0; i < nthreads; i++) {
 		if (be32_to_cpu(intserv[i]) ==
 		    fdt_boot_cpuid_phys(initial_boot_params)) {
-			found = boot_cpu_count;
-			found_thread = i;
+			/*
+			 * always map the boot-cpu logical id into the
+			 * range of [0, thread_per_core)
+			 */
+			boot_cpuid = i;
+			found = true;
+			/* This works around the hole in paca_ptrs[]. */
+			if (nr_cpu_ids < nthreads)
+				set_nr_cpu_ids(nthreads);
 		}
 #ifdef CONFIG_SMP
 		/* logical cpu id is always 0 on UP kernels */
@@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	}
 
 	/* Not the boot CPU */
-	if (found < 0)
+	if (!found)
 		return 0;
 
-	DBG("boot cpu: logical %d physical %d\n", found,
-	    be32_to_cpu(intserv[found_thread]));
-	boot_cpuid = found;
+	DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+	    be32_to_cpu(intserv[boot_cpuid]));
 
-	boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+	boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 1b19a9815672..81291e13dec0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -425,6 +426,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+	struct list_head node;
+	bool	avail;
+	int	len;
+	__be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *                  cpu_possible_mask
@@ -446,11 +454,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int cpu = 0;
-	int nthreads = 1;
+	int shift = 0, cpu = 0;
+	int j, nthreads = 1;
+	int len;
+	struct interrupt_server_node *intserv_node, *n;
+	struct list_head *bt_node, head;
+	bool avail, found_boot_cpu = false;
 
 	DBG("smp_setup_cpu_maps()\n");
 
+	INIT_LIST_HEAD(&head);
 	cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
 					__alignof_
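The "rotate shift" described in the commit message can be illustrated with plain arithmetic. The sketch below is an assumption, not the list-based kernel code: rotated_logical_id() is a hypothetical helper, device-tree sequence numbers are taken to run from 0 to ncores * tpc - 1, and every core is assumed to have tpc threads.

```c
#include <assert.h>

/*
 * Hypothetical model: rotating the core list so that the boot core comes
 * first shifts every device-tree sequence number by the position of the
 * boot core's first thread, wrapping around at the end.
 */
static int rotated_logical_id(int dt_seq, int boot_dt_seq, int ncores, int tpc)
{
	int total, shift;

	assert(tpc > 0 && ncores > 0);
	total = ncores * tpc;
	assert(dt_seq >= 0 && dt_seq < total);
	assert(boot_dt_seq >= 0 && boot_dt_seq < total);

	/* Sequence number of the first thread in the boot core. */
	shift = (boot_dt_seq / tpc) * tpc;

	return (dt_seq - shift + total) % total;
}
```

Because the shift is always a multiple of tpc, the thread position inside a core is preserved, so helpers such as cpu_first_thread_sibling() keep working, and the boot cpu always lands in [0, tpc).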
[PATCHv8 0/5] enable nr_cpus for powerpc
Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and indexed by paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
unpresent cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [1-2/5], which
rotate-shift the cpu's sequence number in the device tree to obtain the
logical cpu id.

Patches [3-5/5] make further efforts to decrease nr_cpus to be less
than or equal to two.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

---
v7 -> v8
  Fix bug when turning on DEBUG macro
  Introducing [PATCHv7 4/5] powerpc/cpu: Skip impossible cpu during
  iteration on a core, which avoids access to unpopulated pcpu data.

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (5):
  powerpc/setup : Enable boot_cpu_hwid for PPC32
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
    in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/cpu: Skip impossible cpu during iteration on a core
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/include/asm/cputhreads.h    |   6 ++
 arch/powerpc/include/asm/smp.h           |   2 +-
 arch/powerpc/kernel/paca.c               |  10 ++-
 arch/powerpc/kernel/prom.c               |  29 +++---
 arch/powerpc/kernel/setup-common.c       | 108 ++-
 arch/powerpc/kernel/smp.c                |   2 +-
 arch/powerpc/kvm/book3s_hv.c             |   7 +-
 arch/powerpc/platforms/powernv/idle.c    |  32 +++
 arch/powerpc/platforms/powernv/subcore.c |   5 +-
 9 files changed, 143 insertions(+), 58 deletions(-)

-- 
2.31.1
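The rotate shift described in the cover letter can be illustrated in plain C. This is only a sketch and not code from the series: the helper name dt_seq_to_logical() is hypothetical, and it assumes every core has the same number of threads. The series itself reorders a list of device tree nodes rather than computing an index.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the series: map a thread's
 * sequence number in the device tree to a logical cpu id by rotating
 * whole cores so that the core holding the boot cpu becomes core 0.
 * The thread's position inside its core is preserved.
 *
 * Assumes every core has the same number of threads (tpc).
 */
int dt_seq_to_logical(int dt_seq, int boot_dt_seq, int ncores, int tpc)
{
	int core, thread, boot_core;

	assert(ncores > 0 && tpc > 0);
	assert(dt_seq >= 0 && dt_seq < ncores * tpc);
	assert(boot_dt_seq >= 0 && boot_dt_seq < ncores * tpc);

	core = dt_seq / tpc;
	thread = dt_seq % tpc;
	boot_core = boot_dt_seq / tpc;

	/* Rotate the core index; leave the in-core thread index untouched. */
	return ((core - boot_core + ncores) % ncores) * tpc + thread;
}
```

With two cores of eight threads each and the boot cpu at dt sequence 14, this sketch gives the boot cpu logical id 6 (inside [0, threads_per_core)), while the thread at dt sequence 0 moves to logical id 8.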
Re: [PATCHv7 4/4] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid
On Wed, Oct 4, 2023 at 2:07 AM Mahesh J Salgaonkar wrote:
>
> On 2023-09-25 15:53:48 Mon, Pingfan Liu wrote:
> > paca_ptrs should be large enough to hold the boot_cpuid, hence, its
> > lower boundary is set to the bigger one between boot_cpuid+1 and
> > nr_cpus.
> >
> > On the other hand, some kernel component:
> > -1. the timer assumes cpu0 online since the timer_list->flags subfield
> >     'TIMER_CPUMASK' is zero if not initialized to a proper present cpu.
> > -2. power9_idle_stop() assumes the primary thread's paca is allocated.
> >
> > Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
> > boot cpu is not cpu0.
> >
> > Result:
> > When nr_cpus=1, taskset -c 14 bash -c 'echo c > /proc/sysrq-trigger'
> > the kdump kernel brings up two cpus.
> > While when taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger',
> > the kdump kernel brings up one cpu.
>
> I tried your changes on power9 and power10 systems. However, on power10 lpar I
> see below backtrace in kdump kernel bootup with nr_cpus=1.
>

Thanks for the testing. I have only tried this series on Power9 bare
metal. I think the bug is related to the following code snippet in
update_mask_from_threadgroup():

	for (i = first_thread; i < first_thread + threads_per_core; i++) {
		int i_group_start = get_cpu_thread_group_start(i, tg);
		                                               ^^^

Here it iterates over each thread in the core, but some of them are not
online. I will try to bring up a remedy.

Thanks,

Pingfan

> $ taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger'
> [...]
> [0.00] Hardware name: IBM,9105-22A POWER10 (raw) 0x800200 0xf06 of:IBM,FW1040.00 (NL1040_005) hv:phyp pSeries
> [0.00] printk: bootconsole [udbg0] enabled
> [0.00] the round shift between dt seq and the cpu logic number: 8
> [0.00] Partition configured for 16 cpus, operating system maximum is 2.
> [0.00] CPU maps initialized for 8 threads per core
> [...]
> [0.002249] BUG: Unable to handle kernel data access at 0x88c0 > [0.002260] Faulting instruction address: 0xc0001201226c > [0.002268] Oops: Kernel access of bad area, sig: 11 [#1] > [0.002274] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > [0.002282] Modules linked in: > [0.002288] CPU: 4 PID: 1 Comm: swapper/4 Not tainted 6.6.0-rc4 #1 > [0.002296] Hardware name: IBM,9105-22A POWER10 (raw) 0x800200 0xf06 > of:IBM,FW1040.00 (NL1040_005) hv:phyp pSeries > [0.002305] NIP: c0001201226c LR: c00012012234 CTR: > 0004 > [0.002312] REGS: c000167ff8f0 TRAP: 0380 Not tainted (6.6.0-rc4) > [0.002321] MSR: 82009033 CR: > 24000844 XER: 000a > [0.002346] CFAR: c0001201231c IRQMASK: 0 > [0.002346] GPR00: c00012012234 c000167ffb90 c00011b61900 > 0002 > [0.002346] GPR04: 0001 0001 > c0004ffeff80 > [0.002346] GPR08: 0002 > > [0.002346] GPR12: c00013141000 c00010011058 > > [0.002346] GPR16: > > [0.002346] GPR20: 0028 c00012170968 c000120a3e80 > 0016 > [0.002346] GPR24: c0004ffdcfd0 c00012b82058 > > [0.002346] GPR28: c0004fc80a68 c00012bf0350 c000120a3e2c > > [0.002426] NIP [c0001201226c] update_mask_from_threadgroup+0x98/0x174 > [0.002437] LR [c00012012234] update_mask_from_threadgroup+0x60/0x174 > [0.002444] Call Trace: > [0.002451] [c000167ffb90] [c00012012234] > update_mask_from_threadgroup+0x60/0x174 (unreliable) > [0.002464] [c000167ffbe0] [c000120125f8] > init_thread_group_cache_map+0x2b0/0x328 > [0.002477] [c000167ffc50] [c0001201296c] > smp_prepare_cpus+0x2fc/0x4f0 > [0.002497] [c000167ffd10] [c00012004e40] > kernel_init_freeable+0x198/0x3cc > [0.002509] [c000167ffde0] [c00010011084] kernel_init+0x34/0x1b0 > [0.002531] [c000167ffe50] [c0001000dd3c] > ret_from_kernel_user_thread+0x14/0x1c > [0.002547] --- interrupt: 0 at 0x0 > [0.002553] NIP: LR: CTR: > > [0.002563] REGS: c000167ffe80 TRAP: Not tainted (6.6.0-rc4) > [0.002569] MSR: <> CR: XER: > [0.002576] CFAR: IRQMASK: 0 > [0.002576] GPR00: 00
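The loop discussed in the reply above walks every thread id of a core, including ids that have no per-cpu data behind them. A minimal sketch of the guard such a loop needs follows. It is not the fix that was posted: the 64-bit mask is only a stand-in for cpu_possible_mask, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_possible_mask: bit n set means cpu n is possible. */
bool cpu_is_possible(uint64_t possible_mask, int cpu)
{
	return cpu >= 0 && cpu < 64 && ((possible_mask >> cpu) & 1);
}

/*
 * Walk all thread ids of one core, but only touch those that are
 * possible. Returns how many threads were actually visited.
 */
int count_usable_threads(uint64_t possible_mask, int first_thread,
			 int threads_per_core)
{
	int i, visited = 0;

	assert(first_thread >= 0 && threads_per_core > 0);

	for (i = first_thread; i < first_thread + threads_per_core; i++) {
		if (!cpu_is_possible(possible_mask, i))
			continue;	/* no per-cpu data behind this id */
		visited++;
	}
	return visited;
}
```

In the failing case from the log (8 threads per core, operating system maximum of 2 cpus), only two of the eight ids in the boot core would pass the check in this model.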
Re: [PATCHv7 2/4] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
On Fri, Sep 29, 2023 at 4:36 AM Wen Xiong wrote:
>
> Hi Pingfan,
>
> +               avail = intserv_node->avail;
> +               nthreads = intserv_node->len / sizeof(int);
> +               for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
>                         set_cpu_present(cpu, avail);
>                         set_cpu_possible(cpu, true);
> -                       cpu_to_phys_id[cpu] = be32_to_cpu(intserv[j]);
> +                       cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
> +                       DBG("thread %d -> cpu %d (hard id %d)\n",
> +                           j, cpu, be32_to_cpu(intserv[j]));
>
> Intserv is not defined. Should "be32_to_cpu(intserv_node->intserv[j])?

Yes, thanks. Sorry that I did not turn on the DBG macro and so did not
catch this bug.

Thanks,

Pingfan

>                         cpu++;
>                 }
> +       }
>
> -Original Message-
> From: Pingfan Liu
> Sent: Monday, September 25, 2023 2:54 AM
> To: linuxppc-dev@lists.ozlabs.org
> Cc: Pingfan Liu; Michael Ellerman; Nicholas Piggin; Christophe Leroy;
>     Mahesh Salgaonkar; Wen Xiong; Baoquan He; Ming Lei;
>     ke...@lists.infradead.org
> Subject: [EXTERNAL] [PATCHv7 2/4] powerpc/setup: Loosen the mapping between
> cpu logical id and its seq in dt
>
> *** Idea ***
> For kexec -p, the boot cpu can be not the cpu0, this causes the problem of
> allocating memory for paca_ptrs[]. However, in theory, there is no
> requirement to assign cpu's logical id as its present sequence in the device
> tree. But there is something like cpu_first_thread_sibling(), which makes
> assumption on the mapping inside a core. Hence partially loosening the
> mapping, i.e. unbind the mapping of core while keep the mapping inside a core.
>
> *** Implement ***
> At this early stage, there are plenty of memory to utilize. Hence, this patch
> allocates interim memory to link the cpu info on a list, then reorder cpus by
> changing the list head. As a result, there is a rotate shift between the
> sequence number in dt and the cpu logical number.
>
> *** Result ***
> After this patch, a boot-cpu's logical id will always be mapped into the
> range [0,threads_per_core).
> > Besides this, at this phase, all threads in the boot core are forced to be > onlined. This restriction will be lifted in a later patch with extra effort. > > Signed-off-by: Pingfan Liu > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Christophe Leroy > Cc: Mahesh Salgaonkar > Cc: Wen Xiong > Cc: Baoquan He > Cc: Ming Lei > Cc: ke...@lists.infradead.org > To: linuxppc-dev@lists.ozlabs.org > --- > arch/powerpc/kernel/prom.c | 25 + > arch/powerpc/kernel/setup-common.c | 87 +++--- > 2 files changed, 85 insertions(+), 27 deletions(-) > > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index > ec82f5bda908..87272a2d8c10 100644 > --- a/arch/powerpc/kernel/prom.c > +++ b/arch/powerpc/kernel/prom.c > @@ -76,7 +76,9 @@ u64 ppc64_rma_size; > unsigned int boot_cpu_node_count __ro_after_init; #endif static > phys_addr_t first_memblock_size; > +#ifdef CONFIG_SMP > static int __initdata boot_cpu_count; > +#endif > > static int __init early_parse_mem(char *p) { @@ -331,8 +333,7 @@ static int > __init early_init_dt_scan_cpus(unsigned long node, > const __be32 *intserv; > int i, nthreads; > int len; > - int found = -1; > - int found_thread = 0; > + bool found = false; > > /* We are scanning "cpu" nodes only */ > if (type == NULL || strcmp(type, "cpu") != 0) @@ -355,8 +356,15 @@ > static int __init early_init_dt_scan_cpus(unsigned long node, > for (i = 0; i < nthreads; i++) { > if (be32_to_cpu(intserv[i]) == > fdt_boot_cpuid_phys(initial_boot_params)) { > - found = boot_cpu_count; > - found_thread = i; > + /* > +* always map the boot-cpu logical id into the > +* range of [0, thread_per_core) > +*/ > + boot_cpuid = i; > + found = true; > + /* This works around the hole in paca_ptrs[]. */ > + if (nr_cpu_ids < nthreads) > + set_nr_cpu_ids(nthreads); > } > #ifdef CONFIG_SMP > /* logical cpu id is always 0 on UP kernels */ @@ -365,14 > +373,13 @@ static int __init early_init_
[PATCHv7 4/4] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid
paca_ptrs should be large enough to hold the boot_cpuid, hence, its
lower boundary is set to the bigger one between boot_cpuid+1 and
nr_cpus.

On the other hand, some kernel component:
-1. the timer assumes cpu0 online since the timer_list->flags subfield
    'TIMER_CPUMASK' is zero if not initialized to a proper present cpu.
-2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
boot cpu is not cpu0.

Result:
When nr_cpus=1, taskset -c 14 bash -c 'echo c > /proc/sysrq-trigger'
the kdump kernel brings up two cpus.
While when taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger',
the kdump kernel brings up one cpu.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-	paca_nr_cpu_ids = nr_cpu_ids;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_nr_cpu_ids = n;
+	paca_ptrs_size = sizeof(struct paca_struct *) * n;
 	paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
 	if (!paca_ptrs)
 		panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
 	int new_ptrs_size;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	new_ptrs_size = sizeof(struct paca_struct *) * n;
 	if (new_ptrs_size < paca_ptrs_size)
 		memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
 				   paca_ptrs_size - new_ptrs_size);
 
-	paca_nr_cpu_ids = nr_cpu_ids;
+	paca_nr_cpu_ids = n;
 	paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 87272a2d8c10..15c994f54bf9 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 				 */
 				boot_cpuid = i;
 				found = true;
-				/* This works around the hole in paca_ptrs[]. */
-				if (nr_cpu_ids < nthreads)
-					set_nr_cpu_ids(nthreads);
+				/*
+				 * Ideally, nr_cpus=1 can be achieved if each kernel
+				 * component does not assume cpu0 is onlined.
+				 */
+				if (boot_cpuid != 0 && nr_cpu_ids < 2)
+					set_nr_cpu_ids(2);
 			}
 #ifdef CONFIG_SMP
 			/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1
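The two sizing rules in this patch can be checked in isolation. The sketch below pulls the arithmetic out of the kernel context; the function names paca_ptrs_entries() and lift_nr_cpu_ids() are hypothetical, and only the expressions themselves come from the diff above.

```c
#include <assert.h>

/*
 * Number of paca_ptrs[] slots: large enough for both nr_cpu_ids and
 * the slot indexed by boot_cpuid (same expression as in the patch).
 */
int paca_ptrs_entries(int boot_cpuid, int nr_cpu_ids)
{
	assert(boot_cpuid >= 0 && nr_cpu_ids > 0);
	return (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
}

/*
 * nr_cpu_ids after the adjustment in early_init_dt_scan_cpus(): when
 * the boot cpu is not logical cpu 0, make room for cpu0 as well.
 */
int lift_nr_cpu_ids(int boot_cpuid, int nr_cpu_ids)
{
	assert(boot_cpuid >= 0 && nr_cpu_ids > 0);
	if (boot_cpuid != 0 && nr_cpu_ids < 2)
		return 2;
	return nr_cpu_ids;
}
```

With nr_cpus=1, a boot cpu whose in-core thread index is 2 ends up with nr_cpu_ids lifted to 2 and three paca_ptrs[] slots, while a boot cpu at thread index 0 keeps nr_cpu_ids at 1, which matches the "two cpus" versus "one cpu" outcomes reported in the commit message.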
[PATCHv7 3/4] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
If the boot_cpuid is not smaller than nr_cpus, it requires extra effort
to ensure the boot_cpu is in cpu_present_mask. This can be achieved by
reserving the last quota for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index f6d32324b5a5..a72d00a6cff2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -454,8 +454,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int shift = 0, cpu = 0;
-	int j, nthreads = 1;
+	int terminate, shift = 0, cpu = 0;
+	int j, bt_thread = 0, nthreads = 1;
 	int len;
 	struct interrupt_server_node *intserv_node, *n;
 	struct list_head *bt_node, head;
@@ -518,6 +518,7 @@ void __init smp_setup_cpu_maps(void)
 			for (j = 0 ; j < nthreads; j++) {
 				if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
 					bt_node = &intserv_node->node;
+					bt_thread = j;
 					found_boot_cpu = true;
 					/*
 					 * Record the round-shift between dt
@@ -537,11 +538,21 @@ void __init smp_setup_cpu_maps(void)
 	/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
 	list_add_tail(&head, bt_node);
 	pr_info("the round shift between dt seq and the cpu logic number: %d\n", shift);
+	terminate = nr_cpu_ids;
 	list_for_each_entry(intserv_node, &head, node) {
+		j = 0;
+		/* Choose a start point to cover the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			/*
+			 * The processor core puts assumption on the thread id,
+			 * not to breach the assumption.
+			 */
+			terminate = nr_cpu_ids - 1;
+		}
 		avail = intserv_node->avail;
 		nthreads = intserv_node->len / sizeof(int);
-		for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+		for (; j < nthreads && cpu < terminate; j++) {
 			set_cpu_present(cpu, avail);
 			set_cpu_possible(cpu, true);
 			cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
@@ -549,6 +560,14 @@ void __init smp_setup_cpu_maps(void)
 			    j, cpu, be32_to_cpu(intserv[j]));
 			cpu++;
 		}
+		/* Online the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			set_cpu_present(bt_thread, avail);
+			set_cpu_possible(bt_thread, true);
+			cpu_to_phys_id[bt_thread] = be32_to_cpu(intserv_node->intserv[bt_thread]);
+			DBG("thread %d -> cpu %d (hard id %d)\n",
+			    bt_thread, bt_thread, be32_to_cpu(intserv[bt_thread]));
+		}
 	}
 
 	list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1
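The "reserve the last quota for the boot cpu" rule is easier to follow on a small model. The sketch below handles the boot core only and uses a plain bool array in place of cpu_possible_mask; the function name mark_boot_core() and its parameters are hypothetical, and the real patch additionally walks the remaining cores on the list.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the boot core only; not the kernel code.
 * possible[] stands in for cpu_possible_mask and must have at least
 * nthreads entries. Returns the number of logical cpus marked possible.
 */
int mark_boot_core(bool *possible, int size, int nr_cpu_ids,
		   int bt_thread, int nthreads)
{
	int terminate = nr_cpu_ids;
	int j, cpu = 0, marked = 0;

	assert(nr_cpu_ids > 0 && nthreads > 0 && nthreads <= size);
	assert(bt_thread >= 0 && bt_thread < nthreads);

	/* Hold one slot back when the boot thread would not fit otherwise. */
	if (nr_cpu_ids - 1 < bt_thread)
		terminate = nr_cpu_ids - 1;

	for (j = 0; j < nthreads && cpu < terminate; j++) {
		possible[cpu++] = true;
		marked++;
	}

	/* Spend the reserved slot on the boot thread, keeping its in-core id. */
	if (nr_cpu_ids - 1 < bt_thread) {
		possible[bt_thread] = true;
		marked++;
	}
	return marked;
}
```

For nr_cpu_ids = 2 and a boot thread at in-core index 6, the model marks cpus 0 and 6. Because cpu 6 lies beyond nr_cpu_ids, the follow-up patch in the series has to enlarge paca_ptrs[] to cover it.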
[PATCHv7 2/4] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. In theory, there is no requirement
that a cpu's logical id follow its sequence in the device tree. However,
helpers such as cpu_first_thread_sibling() make assumptions about the
mapping inside a core. Hence this patch loosens the mapping only
partially: it unbinds the mapping between cores while keeping the
mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a
rotate shift between the sequence number in the dt and the cpu logical
number.

*** Result ***
After this patch, the boot cpu's logical id will always be mapped into
the range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with extra
effort.

Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Mahesh Salgaonkar Cc: Wen Xiong Cc: Baoquan He Cc: Ming Lei Cc: ke...@lists.infradead.org To: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/kernel/prom.c | 25 + arch/powerpc/kernel/setup-common.c | 87 +++--- 2 files changed, 85 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index ec82f5bda908..87272a2d8c10 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -76,7 +76,9 @@ u64 ppc64_rma_size; unsigned int boot_cpu_node_count __ro_after_init; #endif static phys_addr_t first_memblock_size; +#ifdef CONFIG_SMP static int __initdata boot_cpu_count; +#endif static int __init early_parse_mem(char *p) { @@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node, const __be32 *intserv; int i, nthreads; int len; - int found = -1; - int found_thread = 0; + bool found = false; /* We are scanning "cpu" nodes only */ if (type == NULL || strcmp(type, "cpu") != 0) @@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node, for (i = 0; i < nthreads; i++) { if (be32_to_cpu(intserv[i]) == fdt_boot_cpuid_phys(initial_boot_params)) { - found = boot_cpu_count; - found_thread = i; + /* +* always map the boot-cpu logical id into the +* range of [0, thread_per_core) +*/ + boot_cpuid = i; + found = true; + /* This works around the hole in paca_ptrs[]. 
*/ + if (nr_cpu_ids < nthreads) + set_nr_cpu_ids(nthreads); } #ifdef CONFIG_SMP /* logical cpu id is always 0 on UP kernels */ @@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned long node, } /* Not the boot CPU */ - if (found < 0) + if (!found) return 0; - DBG("boot cpu: logical %d physical %d\n", found, - be32_to_cpu(intserv[found_thread])); - boot_cpuid = found; + DBG("boot cpu: logical %d physical %d\n", boot_cpuid, + be32_to_cpu(intserv[boot_cpuid])); - boot_cpu_hwid = be32_to_cpu(intserv[found_thread]); + boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]); /* * PAPR defines "logical" PVR values for cpus that diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 1b19a9815672..f6d32324b5a5 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -425,6 +426,13 @@ static void __init cpu_init_thread_core_maps(int tpc) u32 *cpu_to_phys_id = NULL; +struct interrupt_server_node { + struct list_head node; + boolavail; + int len; + __be32 *intserv; +}; + /** * setup_cpu_maps - initialize the following cpu maps: * cpu_possible_mask @@ -446,11 +454,16 @@ u32 *cpu_to_phys_id = NULL; void __init smp_setup_cpu_maps(void) { struct device_node *dn; - int cpu = 0; - int nthreads = 1; + int shift = 0, cpu = 0; + int j, nthreads = 1; + int len; + struct interrupt_server_node *intserv_node, *n; + struct list_head *bt_node, head; + bool avail, found_boot_cpu = false; DBG("smp_setup_cpu_maps()\n"); + INIT_LIST_HEAD(); cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32), __alignof_
[PATCHv7 1/4] powerpc/setup : Enable boot_cpu_hwid for PPC32
In order to identify the boot cpu, its intserv[] should be recorded and
checked in smp_setup_cpu_maps().

smp_setup_cpu_maps() is shared between PPC64 and PPC32. Since PPC64 has
already used boot_cpu_hwid to carry that information, enabling this
variable on PPC32 so later it can also be used to carry that information
for PPC32 in the coming patch.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202309130232.n2rewhbv-...@intel.com/
---
 arch/powerpc/include/asm/smp.h     | 2 +-
 arch/powerpc/kernel/prom.c         | 3 +--
 arch/powerpc/kernel/setup-common.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..5db9178cc800 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -26,7 +26,7 @@
 #include
 
 extern int boot_cpuid;
-extern int boot_cpu_hwid; /* PPC64 only */
+extern int boot_cpu_hwid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..ec82f5bda908 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -372,8 +372,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	    be32_to_cpu(intserv[found_thread]));
 	boot_cpuid = found;
 
-	if (IS_ENABLED(CONFIG_PPC64))
-		boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+	boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index d2a446216444..1b19a9815672 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -87,9 +87,7 @@ EXPORT_SYMBOL(machine_id);
 
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
-#ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
-#endif
 
 /*
  * These are used in binfmt_elf.c to put aux entries on the stack
-- 
2.31.1
[PATCHv7 0/4] enable nr_cpus for powerpc
Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and indexed by paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
unpresent cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [2/4], which
rotate-shifts the cpu's sequence number in the device tree to obtain the
logical cpu id.

Patches [3-4/4] make efforts to decrease nr_cpus to be less than or
equal to two.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

---
v6 -> v7
  Add [1/4], which fixes compilation error on PPC32

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (4):
  powerpc/setup : Enable boot_cpu_hwid for PPC32
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
    in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/include/asm/smp.h     |   2 +-
 arch/powerpc/kernel/paca.c         |  10 +--
 arch/powerpc/kernel/prom.c         |  29 +---
 arch/powerpc/kernel/setup-common.c | 108 +++--
 4 files changed, 114 insertions(+), 35 deletions(-)

-- 
2.31.1
Re: [RFC PATCH] powerpc: Make crashing cpu to be discovered first in kdump kernel.
Hi Mahesh,

I am not quite sure about fdt, so I skip that part, and leave some
comments from the kexec view.

On Thu, Sep 7, 2023 at 1:59 AM Mahesh Salgaonkar wrote:
>
> The kernel boot parameter 'nr_cpus=' allows one to specify number of
> possible cpus in the system. In the normal scenario the first cpu (cpu0)
> that shows up is the boot cpu and hence it gets covered under nr_cpus
> limit.
>
> But this assumption is broken in kdump scenario where kdump kernel after a
> crash can boot up on an non-zero boot cpu. The paca structure allocation
> depends on value of nr_cpus and is indexed using logical cpu ids. The cpu
> discovery code brings up the cpus as they appear sequentially on device
> tree and assigns logical cpu ids starting from 0. This definitely becomes
> an issue if boot cpu id > nr_cpus. When this occurs it results into
>
> In past there were proposals to fix this by making changes to cpu discovery
> code to identify non-zero boot cpu and map it to logical cpu 0. However,
> the changes were very invasive, making discovery code more complicated and
> risky.
>
> Considering that the non-zero boot cpu scenario is more specific to kdump
> kernel, limiting the changes in panic/crash kexec path would probably be a
> best approach to have.
>
> Hence proposed change is, in crash kexec path, move the crashing cpu's
> device node to the first position under '/cpus' node, which will make the
> crashing cpu to be discovered as part of the first core in kdump kernel.
>
> In order to accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids,
> align up the nr_cpu_ids to SMT threads in early_init_dt_scan_cpus(). This
> will allow kdump kernel to work with nr_cpus=X where X will be aligned up
> in multiple of SMT threads per core.
> > Signed-off-by: Mahesh Salgaonkar > --- > arch/powerpc/include/asm/kexec.h |1 > arch/powerpc/kernel/prom.c| 13 > arch/powerpc/kexec/core_64.c | 128 > + > arch/powerpc/kexec/file_load_64.c |2 - > 4 files changed, 143 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/kexec.h > b/arch/powerpc/include/asm/kexec.h > index a1ddba01e7d13..f5a6f4a1b8eb0 100644 > --- a/arch/powerpc/include/asm/kexec.h > +++ b/arch/powerpc/include/asm/kexec.h > @@ -144,6 +144,7 @@ unsigned int kexec_extra_fdt_size_ppc64(struct kimage > *image); > int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, > unsigned long initrd_load_addr, > unsigned long initrd_len, const char *cmdline); > +int add_node_props(void *fdt, int node_offset, const struct device_node *dn); > #endif /* CONFIG_PPC64 */ > > #endif /* CONFIG_KEXEC_FILE */ > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c > index 0b5878c3125b1..c2d4f55042d72 100644 > --- a/arch/powerpc/kernel/prom.c > +++ b/arch/powerpc/kernel/prom.c > @@ -322,6 +322,9 @@ static void __init check_cpu_feature_properties(unsigned > long node) > } > } > > +/* align addr on a size boundary - adjust address up */ > +#define _ALIGN_UP(addr, size) > (((addr)+((size)-1))&(~((typeof(addr))(size)-1))) > + > static int __init early_init_dt_scan_cpus(unsigned long node, > const char *uname, int depth, > void *data) > @@ -348,6 +351,16 @@ static int __init early_init_dt_scan_cpus(unsigned long > node, > > nthreads = len / sizeof(int); > > + /* > +* Align nr_cpu_ids to correct SMT value. This will help us to > allocate > +* pacas correctly to accomodate boot_cpu != 0 scenario e.g. in kdump > +* kernel the boot cpu can be any cpu between 0 through nthreads. > +*/ > + if (nr_cpu_ids % nthreads) { > + nr_cpu_ids = _ALIGN_UP(nr_cpu_ids, nthreads); It is better to use set_nr_cpu_ids(), which can hide the difference of nr_cpus_ids under different kernel configuration. 
> + pr_info("Aligned nr_cpus to SMT=%d, nr_cpu_ids = %d\n", > nthreads, nr_cpu_ids); > + } > + > /* > * Now see if any of these threads match our boot cpu. > * NOTE: This must match the parsing done in smp_setup_cpu_maps. > diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c > index a79e28c91e2be..168bef43e22c2 100644 > --- a/arch/powerpc/kexec/core_64.c > +++ b/arch/powerpc/kexec/core_64.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -298,6 +299,119 @@ extern void kexec_sequence(void *newstack, unsigned > long start, >void (*clear_all)(void), >bool copy_with_mmu_off) __noreturn; > > +/* > + * Move the crashing cpus FDT node as the first node under '/cpus' node. > + * > + * - Get the FDT segment from the crash image segments. > + * - Locate the crashing CPUs fdt subnode 'A' under '/cpus' node. > + * -
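The alignment step proposed in the quoted RFC rounds nr_cpu_ids up to a whole number of cores. The arithmetic can be sketched as below; the function name align_nr_cpu_ids() is hypothetical. Note that the sketch uses a division-based round-up, whereas the _ALIGN_UP() macro in the quoted diff uses a bit mask, which is only correct when the thread count is a power of two.

```c
#include <assert.h>

/*
 * Round nr_cpu_ids up to the next multiple of nthreads (threads per
 * core). Hypothetical helper; works for any nthreads > 0, not only
 * powers of two.
 */
unsigned int align_nr_cpu_ids(unsigned int nr_cpu_ids, unsigned int nthreads)
{
	assert(nthreads > 0);
	return ((nr_cpu_ids + nthreads - 1) / nthreads) * nthreads;
}
```

For example, nr_cpus=1 on a core with 8 threads becomes 8, and nr_cpus=9 becomes 16, so every thread of the boot core fits under the limit whichever thread the kdump kernel boots on.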
[PATCHv6 3/3] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid
paca_ptrs should be large enough to hold the boot_cpuid, hence, its
lower boundary is set to the bigger one between boot_cpuid+1 and
nr_cpus.

On the other hand, some kernel component:
-1. the timer assumes cpu0 online since the timer_list->flags subfield
    'TIMER_CPUMASK' is zero if not initialized to a proper present cpu.
-2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
boot cpu is not cpu0.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-	paca_nr_cpu_ids = nr_cpu_ids;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_nr_cpu_ids = n;
+	paca_ptrs_size = sizeof(struct paca_struct *) * n;
 	paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
 	if (!paca_ptrs)
 		panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
 	int new_ptrs_size;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	new_ptrs_size = sizeof(struct paca_struct *) * n;
 	if (new_ptrs_size < paca_ptrs_size)
 		memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
 				   paca_ptrs_size - new_ptrs_size);
 
-	paca_nr_cpu_ids = nr_cpu_ids;
+	paca_nr_cpu_ids = n;
 	paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index cb3f3e040455..28441edbc42d 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 				 */
 				boot_cpuid = i;
 				found = true;
-				/* This works around the hole in paca_ptrs[]. */
-				if (nr_cpu_ids < nthreads)
-					set_nr_cpu_ids(nthreads);
+				/*
+				 * Ideally, nr_cpus=1 can be achieved if each kernel
+				 * component does not assume cpu0 is onlined.
+				 */
+				if (boot_cpuid != 0 && nr_cpu_ids < 2)
+					set_nr_cpu_ids(2);
 			}
 #ifdef CONFIG_SMP
 			/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1
[PATCHv6 2/3] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
If boot_cpuid is not smaller than nr_cpus, extra effort is required to
ensure the boot cpu is in cpu_present_mask. This can be achieved by
reserving the last quota for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index a07af8de6674..58a988c64dd2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -456,8 +456,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int shift = 0, cpu = 0;
-	int j, nthreads = 1;
+	int terminate, shift = 0, cpu = 0;
+	int j, bt_thread = 0, nthreads = 1;
 	int len;
 	struct interrupt_server_node *intserv_node, *n;
 	struct list_head *bt_node, head;
@@ -520,6 +520,7 @@ void __init smp_setup_cpu_maps(void)
 		for (j = 0 ; j < nthreads; j++) {
 			if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
 				bt_node = &intserv_node->node;
+				bt_thread = j;
 				found_boot_cpu = true;
 				/*
 				 * Record the round-shift between dt
@@ -539,11 +540,21 @@ void __init smp_setup_cpu_maps(void)
 	/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
 	list_add_tail(&head, bt_node);
 	pr_info("the round shift between dt seq and the cpu logic number: %d\n", shift);
+	terminate = nr_cpu_ids;
 	list_for_each_entry(intserv_node, &head, node) {
+		j = 0;
+		/* Choose a start point to cover the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			/*
+			 * The processor core puts assumption on the thread id,
+			 * not to breach the assumption.
+			 */
+			terminate = nr_cpu_ids - 1;
+		}
 		avail = intserv_node->avail;
 		nthreads = intserv_node->len / sizeof(int);
-		for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+		for (; j < nthreads && cpu < terminate; j++) {
 			set_cpu_present(cpu, avail);
 			set_cpu_possible(cpu, true);
 			cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
@@ -551,6 +562,14 @@ void __init smp_setup_cpu_maps(void)
 			    j, cpu, be32_to_cpu(intserv[j]));
 			cpu++;
 		}
+		/* Online the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			set_cpu_present(bt_thread, avail);
+			set_cpu_possible(bt_thread, true);
+			cpu_to_phys_id[bt_thread] = be32_to_cpu(intserv_node->intserv[bt_thread]);
+			DBG("thread %d -> cpu %d (hard id %d)\n",
+			    bt_thread, bt_thread, be32_to_cpu(intserv[bt_thread]));
+		}
 	}
 
 	list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1
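The slot reservation in the patch above can be illustrated in isolation. The following is only a sketch for the boot core, assuming fewer than 32 threads per core; boot_core_present_mask() is a hypothetical helper, not part of the patch, and it returns a plain bitmask instead of touching cpu_present_mask.

```c
/*
 * Hypothetical model of the boot-core handling: return a bitmask of the
 * logical cpu ids that would be marked present for the boot core.
 * Assumes nthreads and bt_thread are both below 32.
 */
static inline unsigned int boot_core_present_mask(int nr_cpu_ids,
						  int bt_thread, int nthreads)
{
	unsigned int mask = 0;
	int terminate = nr_cpu_ids;
	int cpu;

	/* Boot thread would not fit: keep the last slot free for it. */
	if (nr_cpu_ids - 1 < bt_thread)
		terminate = nr_cpu_ids - 1;

	/* Threads keep their in-core position as logical id. */
	for (cpu = 0; cpu < nthreads && cpu < terminate; cpu++)
		mask |= 1u << cpu;

	/* The reserved slot goes to the boot thread itself. */
	if (nr_cpu_ids - 1 < bt_thread)
		mask |= 1u << bt_thread;

	return mask;
}
```

With nr_cpu_ids = 2 and the boot thread at position 5 of an 8-thread core, the sketch marks logical cpu 0 and logical cpu 5, which shows why paca_ptrs[] later has to cover boot_cpuid + 1 entries.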
[PATCHv6 1/3] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which may waste plenty of
room when allocating memory for paca_ptrs[]. However, in theory, there is
no requirement to assign a cpu's logical id as its present sequence in
the device tree. But there is something like cpu_first_thread_sibling(),
which makes assumptions about the mapping inside a core. Hence the
mapping is only partially loosened, i.e. the mapping of cores is unbound
while the mapping inside a core is kept.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a
rotate shift between the sequence number in dt and the cpu logical
number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with extra
effort.
Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c         | 25 +
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..cb3f3e040455 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	const __be32 *intserv;
 	int i, nthreads;
 	int len;
-	int found = -1;
-	int found_thread = 0;
+	bool found = false;
 
 	/* We are scanning "cpu" nodes only */
 	if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	for (i = 0; i < nthreads; i++) {
 		if (be32_to_cpu(intserv[i]) ==
 		    fdt_boot_cpuid_phys(initial_boot_params)) {
-			found = boot_cpu_count;
-			found_thread = i;
+			/*
+			 * always map the boot-cpu logical id into the
+			 * range of [0, thread_per_core)
+			 */
+			boot_cpuid = i;
+			found = true;
+			/* This works around the hole in paca_ptrs[]. */
+			if (nr_cpu_ids < nthreads)
+				set_nr_cpu_ids(nthreads);
 		}
 #ifdef CONFIG_SMP
 		/* logical cpu id is always 0 on UP kernels */
@@ -365,15 +373,14 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	}
 
 	/* Not the boot CPU */
-	if (found < 0)
+	if (!found)
 		return 0;
 
-	DBG("boot cpu: logical %d physical %d\n", found,
-	    be32_to_cpu(intserv[found_thread]));
-	boot_cpuid = found;
+	DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+	    be32_to_cpu(intserv[boot_cpuid]));
 
 	if (IS_ENABLED(CONFIG_PPC64))
-		boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+		boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index d2a446216444..a07af8de6674 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -427,6 +428,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+	struct list_head node;
+	bool avail;
+	int len;
+	__be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *                  cpu_possible_mask
@@ -448,11 +456,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int cpu = 0;
-	int nthreads = 1;
+	int shift = 0, cpu = 0;
+	int j, nthreads = 1;
+	int len;
+	struct interrupt_server_node *intserv_node, *n;
+	struct list_head *bt_node, head;
+	bool avail, found_boot_cpu = false;
 
 	DBG("smp_setup_cpu_maps()\n");
 
+	INIT_LIST_HEAD(&head);
 	cpu_to_phys_id = memblock_alloc(nr_cpu_i
[PATCHv6 0/3] enable nr_cpus for powerpc
Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and indexed by paca_ptrs[], which dramatically
decreases the memory consumption even if there are many unpresent cpus in
the middle.

However, reordering the logical cpu numbers can further decrease the size
of paca_ptrs[] in the kdump case. So I keep [1/3], which rotate-shifts
the cpu's sequence number in the device tree to obtain the logical cpu
id.

Patches [2-3/3] make it possible to lower nr_cpus to two or less.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

v5 -> v6:
  assign nr_cpu_ids by set_nr_cpu_ids() to handle the case where
  nr_cpu_ids is configured as a constant

Pingfan Liu (3):
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
    in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/kernel/paca.c         |  10 +--
 arch/powerpc/kernel/prom.c         |  28 +---
 arch/powerpc/kernel/setup-common.c | 106 -
 3 files changed, 113 insertions(+), 31 deletions(-)

-- 
2.31.1
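The rotate shift described in this cover letter can be written down as plain arithmetic. This is a minimal sketch, assuming cores are numbered by their position in the device tree and every core has the same number of threads; rotate_core() and logical_cpu() are hypothetical names, and the patch itself obtains the same ordering by moving a list head rather than by computing indices.

```c
/*
 * Hypothetical helper: position of a core after rotating the device-tree
 * order so that the boot core comes first. The relative order of the
 * other cores is preserved.
 */
static inline int rotate_core(int dt_core, int boot_core, int ncores)
{
	return (dt_core - boot_core + ncores) % ncores;
}

/* The thread's position inside its core is left untouched. */
static inline int logical_cpu(int dt_core, int thread, int boot_core,
			      int ncores, int threads_per_core)
{
	return rotate_core(dt_core, boot_core, ncores) * threads_per_core
	       + thread;
}
```

For four cores of eight threads with the crash on core 2, thread 1, the boot cpu gets logical id 1, which is inside [0,threads_per_core), and device-tree core 0 moves to logical ids 16..23.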
[PATCHv5 3/3] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid
paca_ptrs should be large enough to hold the boot_cpuid, hence its lower
boundary is set to the bigger one between boot_cpuid+1 and nr_cpus.

On the other hand, some kernel components make assumptions about cpu0:
-1. the timer assumes cpu0 is online, since the timer_list->flags
    subfield 'TIMER_CPUMASK' is zero if not initialized to a proper
    present cpu.
-2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
boot cpu is not cpu0.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-	paca_nr_cpu_ids = nr_cpu_ids;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_nr_cpu_ids = n;
+	paca_ptrs_size = sizeof(struct paca_struct *) * n;
 	paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
 	if (!paca_ptrs)
 		panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
 	int new_ptrs_size;
+	int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	new_ptrs_size = sizeof(struct paca_struct *) * n;
 	if (new_ptrs_size < paca_ptrs_size)
 		memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
 				   paca_ptrs_size - new_ptrs_size);
 
-	paca_nr_cpu_ids = nr_cpu_ids;
+	paca_nr_cpu_ids = n;
 	paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 72be75d4f003..eca6a1568749 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -360,9 +360,12 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 			 */
 			boot_cpuid = i;
 			found = true;
-			/* This works around the hole in paca_ptrs[]. */
-			if (nr_cpu_ids < nthreads)
-				nr_cpu_ids = nthreads;
+			/*
+			 * Ideally, nr_cpus=1 can be achieved if each kernel
+			 * component does not assume cpu0 is onlined.
+			 */
+			if (boot_cpuid != 0 && nr_cpu_ids < 2)
+				nr_cpu_ids = 2;
 		}
 #ifdef CONFIG_SMP
 		/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1
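The two sizing rules in the patch above are small enough to check by hand. The sketch below restates them as pure functions so the arithmetic can be tried outside the kernel; paca_slots() and lift_nr_cpu_ids() are hypothetical names used only for this illustration.

```c
/*
 * Number of paca_ptrs[] entries needed: the array must reach index
 * boot_cpuid even when nr_cpu_ids is smaller than that.
 */
static inline int paca_slots(int boot_cpuid, int nr_cpu_ids)
{
	int need = boot_cpuid + 1;

	return need > nr_cpu_ids ? need : nr_cpu_ids;
}

/*
 * nr_cpu_ids after the adjustment: raised to two only when the boot cpu
 * is not cpu0 and fewer than two ids were requested.
 */
static inline int lift_nr_cpu_ids(int boot_cpuid, int nr_cpu_ids)
{
	if (boot_cpuid != 0 && nr_cpu_ids < 2)
		return 2;
	return nr_cpu_ids;
}
```

For example, booting on thread 5 with nr_cpus=1 gives lift_nr_cpu_ids(5, 1) == 2 and then paca_slots(5, 2) == 6, so six pointers are allocated although only two logical cpus are usable.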
[PATCHv5 2/3] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
If boot_cpuid is not smaller than nr_cpus, extra effort is required to
ensure the boot cpu is in cpu_present_mask. This can be achieved by
reserving the last quota for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index a07af8de6674..58a988c64dd2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -456,8 +456,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int shift = 0, cpu = 0;
-	int j, nthreads = 1;
+	int terminate, shift = 0, cpu = 0;
+	int j, bt_thread = 0, nthreads = 1;
 	int len;
 	struct interrupt_server_node *intserv_node, *n;
 	struct list_head *bt_node, head;
@@ -520,6 +520,7 @@ void __init smp_setup_cpu_maps(void)
 		for (j = 0 ; j < nthreads; j++) {
 			if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
 				bt_node = &intserv_node->node;
+				bt_thread = j;
 				found_boot_cpu = true;
 				/*
 				 * Record the round-shift between dt
@@ -539,11 +540,21 @@ void __init smp_setup_cpu_maps(void)
 	/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
 	list_add_tail(&head, bt_node);
 	pr_info("the round shift between dt seq and the cpu logic number: %d\n", shift);
+	terminate = nr_cpu_ids;
 	list_for_each_entry(intserv_node, &head, node) {
+		j = 0;
+		/* Choose a start point to cover the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			/*
+			 * The processor core puts assumption on the thread id,
+			 * not to breach the assumption.
+			 */
+			terminate = nr_cpu_ids - 1;
+		}
 		avail = intserv_node->avail;
 		nthreads = intserv_node->len / sizeof(int);
-		for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+		for (; j < nthreads && cpu < terminate; j++) {
 			set_cpu_present(cpu, avail);
 			set_cpu_possible(cpu, true);
 			cpu_to_phys_id[cpu] = be32_to_cpu(intserv_node->intserv[j]);
@@ -551,6 +562,14 @@ void __init smp_setup_cpu_maps(void)
 			    j, cpu, be32_to_cpu(intserv[j]));
 			cpu++;
 		}
+		/* Online the boot cpu */
+		if (nr_cpu_ids - 1 < bt_thread) {
+			set_cpu_present(bt_thread, avail);
+			set_cpu_possible(bt_thread, true);
+			cpu_to_phys_id[bt_thread] = be32_to_cpu(intserv_node->intserv[bt_thread]);
+			DBG("thread %d -> cpu %d (hard id %d)\n",
+			    bt_thread, bt_thread, be32_to_cpu(intserv[bt_thread]));
+		}
 	}
 
 	list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1
[PATCHv5 0/3] enable nr_cpus for powerpc
It has been a long time since my last v4 [1], and the code has undergone
great changes. The paca[] array has been reorganized and indexed by
paca_ptrs[], which dramatically decreases the memory consumption even if
there are many unpresent cpus in the middle.

However, reordering the logical cpu numbers can further decrease the size
of paca_ptrs[] in the kdump case. So I keep [1/3], which rotate-shifts
the cpu's sequence number in the device tree to obtain the logical cpu
id.

Patches [2-3/3] make it possible to lower nr_cpus to two or less.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (3):
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
    in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/kernel/paca.c         |  10 +--
 arch/powerpc/kernel/prom.c         |  26 ---
 arch/powerpc/kernel/setup-common.c | 106 -
 3 files changed, 111 insertions(+), 31 deletions(-)

-- 
2.31.1
[PATCHv5 1/3] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. However, in theory, there is no
requirement to assign a cpu's logical id as its present sequence in the
device tree. But there is something like cpu_first_thread_sibling(),
which makes assumptions about the mapping inside a core. Hence the
mapping is only partially loosened, i.e. the mapping of cores is unbound
while the mapping inside a core is kept.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a
rotate shift between the sequence number in dt and the cpu logical
number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with extra
effort.
Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: Mahesh Salgaonkar
Cc: Wen Xiong
Cc: Baoquan He
Cc: Ming Lei
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c         | 23 
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 83 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..72be75d4f003 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -331,8 +331,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	const __be32 *intserv;
 	int i, nthreads;
 	int len;
-	int found = -1;
-	int found_thread = 0;
+	bool found = false;
 
 	/* We are scanning "cpu" nodes only */
 	if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +354,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	for (i = 0; i < nthreads; i++) {
 		if (be32_to_cpu(intserv[i]) ==
 		    fdt_boot_cpuid_phys(initial_boot_params)) {
-			found = boot_cpu_count;
-			found_thread = i;
+			/*
+			 * always map the boot-cpu logical id into the
+			 * range of [0, thread_per_core)
+			 */
+			boot_cpuid = i;
+			found = true;
+			/* This works around the hole in paca_ptrs[]. */
+			if (nr_cpu_ids < nthreads)
+				nr_cpu_ids = nthreads;
 		}
 #ifdef CONFIG_SMP
 		/* logical cpu id is always 0 on UP kernels */
@@ -365,15 +371,14 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	}
 
 	/* Not the boot CPU */
-	if (found < 0)
+	if (!found)
 		return 0;
 
-	DBG("boot cpu: logical %d physical %d\n", found,
-	    be32_to_cpu(intserv[found_thread]));
-	boot_cpuid = found;
+	DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+	    be32_to_cpu(intserv[boot_cpuid]));
 
 	if (IS_ENABLED(CONFIG_PPC64))
-		boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+		boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index d2a446216444..a07af8de6674 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -427,6 +428,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+	struct list_head node;
+	bool avail;
+	int len;
+	__be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *                  cpu_possible_mask
@@ -448,11 +456,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
 	struct device_node *dn;
-	int cpu = 0;
-	int nthreads = 1;
+	int shift = 0, cpu = 0;
+	int j, nthreads = 1;
+	int len;
+	struct interrupt_server_node *intserv_node, *n;
+	struct list_head *bt_node, head;
+	bool avail, found_boot_cpu = false;
 
 	DBG("smp_setup_cpu_maps()\n");
 
+	INIT_LIST_HEAD(&head);
 	cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
 					__alignof__(u32));
 	if (!cpu_to_phys_id)
@@ -462,7 +475,6 @@ void __init smp_setup_cpu_maps(void)
 	for_each_node_by_type(dn, "cpu") {
 		const __be32 *intserv;
Re: [RFC PATCH] powerpc: Make crashing cpu to be discovered first in kdump kernel.
Hi Mahesh,

Thanks for sharing your great idea. I was in the middle of my v5 and
finished it today. My v5 is based on the same idea as my v4 [1], with
improvements to the code, and I will send it out.

[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

I will have a close look at your patch later.

Thanks,

	Pingfan

On Thu, Sep 7, 2023 at 1:59 AM Mahesh Salgaonkar wrote:
>
> The kernel boot parameter 'nr_cpus=' allows one to specify number of
> possible cpus in the system. In the normal scenario the first cpu (cpu0)
> that shows up is the boot cpu and hence it gets covered under nr_cpus
> limit.
>
> But this assumption is broken in kdump scenario where kdump kernel after a
> crash can boot up on an non-zero boot cpu. The paca structure allocation
> depends on value of nr_cpus and is indexed using logical cpu ids. The cpu
> discovery code brings up the cpus as they appear sequentially on device
> tree and assigns logical cpu ids starting from 0. This definitely becomes
> an issue if boot cpu id > nr_cpus. When this occurs it results into
>
> In past there were proposals to fix this by making changes to cpu discovery
> code to identify non-zero boot cpu and map it to logical cpu 0. However,
> the changes were very invasive, making discovery code more complicated and
> risky.
>
> Considering that the non-zero boot cpu scenario is more specific to kdump
> kernel, limiting the changes in panic/crash kexec path would probably be a
> best approach to have.
>
> Hence proposed change is, in crash kexec path, move the crashing cpu's
> device node to the first position under '/cpus' node, which will make the
> crashing cpu to be discovered as part of the first core in kdump kernel.
>
> In order to accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids,
> align up the nr_cpu_ids to SMT threads in early_init_dt_scan_cpus(). This
> will allow kdump kernel to work with nr_cpus=X where X will be aligned up
> in multiple of SMT threads per core.
>
> Signed-off-by: Mahesh Salgaonkar
> ---
>  arch/powerpc/include/asm/kexec.h  |    1
>  arch/powerpc/kernel/prom.c        |   13
>  arch/powerpc/kexec/core_64.c      |  128 +
>  arch/powerpc/kexec/file_load_64.c |    2 -
>  4 files changed, 143 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
> index a1ddba01e7d13..f5a6f4a1b8eb0 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -144,6 +144,7 @@ unsigned int kexec_extra_fdt_size_ppc64(struct kimage *image);
>  int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>  			unsigned long initrd_load_addr,
>  			unsigned long initrd_len, const char *cmdline);
> +int add_node_props(void *fdt, int node_offset, const struct device_node *dn);
>  #endif /* CONFIG_PPC64 */
>
>  #endif /* CONFIG_KEXEC_FILE */
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 0b5878c3125b1..c2d4f55042d72 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -322,6 +322,9 @@ static void __init check_cpu_feature_properties(unsigned long node)
>  	}
>  }
>
> +/* align addr on a size boundary - adjust address up */
> +#define _ALIGN_UP(addr, size)   (((addr)+((size)-1))&(~((typeof(addr))(size)-1)))
> +
>  static int __init early_init_dt_scan_cpus(unsigned long node,
>  					  const char *uname, int depth,
>  					  void *data)
> @@ -348,6 +351,16 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
>
>  	nthreads = len / sizeof(int);
>
> +	/*
> +	 * Align nr_cpu_ids to correct SMT value. This will help us to allocate
> +	 * pacas correctly to accomodate boot_cpu != 0 scenario e.g. in kdump
> +	 * kernel the boot cpu can be any cpu between 0 through nthreads.
> +	 */
> +	if (nr_cpu_ids % nthreads) {
> +		nr_cpu_ids = _ALIGN_UP(nr_cpu_ids, nthreads);
> +		pr_info("Aligned nr_cpus to SMT=%d, nr_cpu_ids = %d\n", nthreads, nr_cpu_ids);
> +	}
> +
>  	/*
>  	 * Now see if any of these threads match our boot cpu.
>  	 * NOTE: This must match the parsing done in smp_setup_cpu_maps.
> diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
> index a79e28c91e2be..168bef43e22c2 100644
> --- a/arch/powerpc/kexec/core_64.c
> +++ b/arch/powerpc/kexec/core_64.c
> @@ -17,6 +17,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -298,6 +299,119 @@ extern void kexec_sequence(void *newstack, unsigned long start,
>  			   void (*clear_all)(void),
>  			   bool copy_with_mmu_off) __noreturn;
>
> +/*
> + * Move the crashing cpus FDT node as the first node under '/cpus' node.
> + *
>
Re: [PATCH 2/2] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues
Hi Ming,

I do not have "[PATCH 1/2] blk-mq: add blk_mq_max_nr_hw_queues()" in my
inbox, so I am replying here.

At first glance, I think that the cpu hotplug callback hook should be the
remedy instead of the newly introduced blk_mq_max_nr_hw_queues(),
although it is more complicated. Consider the scenario where nr_cpus=4 is
used to speed up the dumping process: blk_mq_max_nr_hw_queues() cannot
utilize the other three cpus.

Thanks,

	Pingfan

On Mon, Jul 10, 2023 at 5:16 PM Ming Lei wrote:
>
> On Mon, Jul 10, 2023 at 08:41:09AM +0200, Christoph Hellwig wrote:
> > On Sat, Jul 08, 2023 at 10:02:59AM +0800, Ming Lei wrote:
> > > Take blk-mq's knowledge into account for calculating io queues.
> > >
> > > Fix wrong queue mapping in case of kdump kernel.
> > >
> > > On arm and ppc64, 'maxcpus=1' is passed to kdump command line, see
> > > `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> > > still returns all CPUs.
> >
> > That's simply broken.  Please fix the arch code to make sure
> > it does not return a bogus num_possible_cpus value for these
>
> That is documented in Documentation/admin-guide/kdump/kdump.rst.
>
> On arm and ppc64, 'maxcpus=1' is passed for kdump kernel, and "maxcpu=1"
> simply keep one of CPU cores as online, and others as offline.
>
> So Cc our arch(arm & ppc64) & kdump guys wrt. passing 'maxcpus=1' for
> kdump kernel.
>
> > setups, otherwise you'll have to paper over it in all kind of
> > drivers.
>
> The issue is only triggered for drivers which use managed irq &
> multiple hw queues.
>
>
> Thanks,
> Ming
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>
Re: [PATCH 2/2] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues
On Mon, Jul 10, 2023 at 5:16 PM Ming Lei wrote:
>
> On Mon, Jul 10, 2023 at 08:41:09AM +0200, Christoph Hellwig wrote:
> > On Sat, Jul 08, 2023 at 10:02:59AM +0800, Ming Lei wrote:
> > > Take blk-mq's knowledge into account for calculating io queues.
> > >
> > > Fix wrong queue mapping in case of kdump kernel.
> > >
> > > On arm and ppc64, 'maxcpus=1' is passed to kdump command line, see
> > > `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> > > still returns all CPUs.
> >
> > That's simply broken.  Please fix the arch code to make sure
> > it does not return a bogus num_possible_cpus value for these
>

In fact, num_possible_cpus is not broken. Quoting from
admin-guide/kernel-parameters.txt:

	maxcpus=	[SMP] Maximum number of processors that an SMP kernel
			will bring up during bootup.  maxcpus=n : n >= 0 limits
			the kernel to bring up 'n' processors. Surely after
			bootup you can bring up the other plugged cpu by
			executing "echo 1 > /sys/devices/system/cpu/cpuX/online".
			So maxcpus only takes effect during system bootup.
			While n=0 is a special case, it is equivalent to "nosmp",
			which also disables the IO APIC.

As explained there, maxcpus only affects bootup; extra cpus can be
brought online later.

> That is documented in Documentation/admin-guide/kdump/kdump.rst.
>
> On arm and ppc64, 'maxcpus=1' is passed for kdump kernel, and "maxcpu=1"

On aarch64 and x86, nr_cpus=1 is used, while on ppc64, due to the
implementation, nr_cpus=1 cannot be supported.

Thanks,

	Pingfan

> simply keep one of CPU cores as online, and others as offline.
>
> So Cc our arch(arm & ppc64) & kdump guys wrt. passing 'maxcpus=1' for
> kdump kernel.
>
> > setups, otherwise you'll have to paper over it in all kind of
> > drivers.
>
> The issue is only triggered for drivers which use managed irq &
> multiple hw queues.
>
>
> Thanks,
> Ming
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>
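The difference between the two parameters discussed in this message can be modelled with a few lines of arithmetic. This is a deliberately simplified sketch, not kernel code: model_cpu_limits() and struct cpu_counts are hypothetical, a negative argument stands for "parameter not given", and maxcpus=0 is treated as leaving only the boot cpu online, following the "nosmp" equivalence quoted above.

```c
struct cpu_counts {
	int possible;		/* what num_possible_cpus() would report */
	int online_at_boot;	/* cpus brought up during bootup */
};

/* Simplified model: nr_cpus caps the possible set, maxcpus only the boot-time online set. */
static inline struct cpu_counts model_cpu_limits(int hw_cpus, int nr_cpus,
						 int maxcpus)
{
	struct cpu_counts c;

	c.possible = hw_cpus;
	if (nr_cpus > 0 && nr_cpus < c.possible)
		c.possible = nr_cpus;

	c.online_at_boot = c.possible;
	if (maxcpus >= 0 && maxcpus < c.online_at_boot)
		c.online_at_boot = maxcpus;
	if (c.online_at_boot < 1)
		c.online_at_boot = 1;	/* the boot cpu is always online */

	return c;
}
```

On a 16-cpu machine, maxcpus=1 leaves possible at 16 with one cpu online, whereas nr_cpus=1 reduces both to one. That is why a driver sizing its queues from the possible count sees all cpus in a kdump kernel booted with maxcpus=1.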
Re: // a kdump hang caused by PPC pci patch series
On Mon, Nov 21, 2022 at 8:57 PM Cédric Le Goater wrote:
>
> On 11/21/22 12:57, Pingfan Liu wrote:
> > Sorry that forget a subject.
> >
> > On Mon, Nov 21, 2022 at 7:54 PM Pingfan Liu wrote:
> >>
> >> Hello Powerpc folks,
> >>
> >> I encounter an kdump bug, which I bisect and pin commit 174db9e7f775
> >> ("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")
> >> In that case, using Fedora 36 as host, the mentioned commit as the
> >> guest kernel, and virto-block disk, the kdump kernel will hang:
>
> The host kernel should be using the PowerNV platform and not pseries
> or are you running a nested L2 guest on KVM/pseries L1 ?
>
> And as far as I remember, the patch above only impacts the IBM PowerVM
> hypervisor, not KVM, and PHB hotplug, or kdump induces some hot-plugging
> I am not aware of.
>
> Also, if indeed, this is a L2 guest, the XIVE interrupt controller is
> emulated in QEMU, "info pic" should return:
>
>    ...
>    irqchip: emulated
>
> >>
> >> [0.00] Kernel command line: elfcorehdr=0x22c0
> >> no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
> >> irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
> >> numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
> >> kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
> >> hugetlb_cma=0
> >> ...
> >> [7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
> >> [7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
> >> blocks (10.7 GB/10.0 GiB)
> >> [ 68.398234] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 is taking a long time
> >> [ 188.398258] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 killed
> >>
> >>
> >> During my test, I found that in very rare cases, the kdump can success
> >> (I guess it may be due to the cpu id). And if using either maxcpus=2
> >> or using scsi-disk, then kdump can also success. And before the
> >> mentioned commit, kdump can also success.
> >>
> >> The attachment contains the xml to reproduce that bug.
> >>
> >> Do you have any ideas?
>
> Most certainly an interrupt not being delivered. You can check the status
> on the host with :
>
>    virsh qemu-monitor-command --hmp "info pic"
>

Please pick it up from the attachment.

Thanks,

	Pingfan

Script started on 2022-11-24 03:22:55-05:00 [TERM="xterm-256color" TTY="/dev/pts/0" COLUMNS="172" LINES="41"]
[root@ibm-p9wr-02 ~]# virsh qemu-monitor-command --hmp rhel9 "info pic"
CPU[]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2
CPU[]: USER00 00 0000 00 00 00 00
CPU[]: OS00 ff 0000 ff 00 ff ff 8400
CPU[]: POOL00 00 0000 00 00 00 00
CPU[]: PHYS00 00 0000 00 00 00 ff
CPU[0001]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2
CPU[0001]: USER00 00 0000 00 00 00 00
CPU[0001]: OS00 ff 0000 ff 00 ff ff 8401
CPU[0001]: POOL00 00 0000 00 00 00 00
CPU[0001]: PHYS00 00 0000 00 00 00 ff
CPU[0002]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2
CPU[0002]: USER00 00 0000 00 00 00 00
CPU[0002]: OS00 ff 0000 ff 00 ff ff 8402
CPU[0002]: POOL00 00 0000 00 00 00 00
CPU[0002]: PHYS00 00 0000 00 00 00 ff
CPU[0003]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2
CPU[0003]: USER00 00 0000 00 00 00 00
CPU[0003]: OS00 ff 0000 ff 00 ff ff 8403
CPU[0003]: POOL00 00 0000 00 00 00 00
CPU[0003]: PHYS00 00 0000 00 00 00 ff
CPU[0004]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2
CPU[0004]: USER00 00 0000 00 00 00 00
CPU[0004]: OS00 ff 0000 ff 00 ff ff 8404
CPU[0004]: POOL00 00 0000 00 00 00 00
CPU[0004]: PHYS00 00 0000 00 00 00 ff
CPU[0005]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2
CPU[0005]: USER00 00 0000 00 00 00 00
CPU[0005]: OS00 ff 0000 ff 00 ff ff 8405
CPU[0005]: POOL00 00 0000 00 00 00 00
CPU[0005]: PHYS00 00 0000 00 00 00 ff
CPU[0006]: QW NSR CPPR IPB LSM
Re: // a kdump hang caused by PPC pci patch series
Hi Cédric,

I appreciate your insight. Please see the comments inline below.

On Mon, Nov 21, 2022 at 8:57 PM Cédric Le Goater wrote:
>
> On 11/21/22 12:57, Pingfan Liu wrote:
> > Sorry that forget a subject.
> >
> > On Mon, Nov 21, 2022 at 7:54 PM Pingfan Liu wrote:
> >>
> >> Hello Powerpc folks,
> >>
> >> I encounter an kdump bug, which I bisect and pin commit 174db9e7f775
> >> ("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")
> >> In that case, using Fedora 36 as host, the mentioned commit as the
> >> guest kernel, and virto-block disk, the kdump kernel will hang:
>
> The host kernel should be using the PowerNV platform and not pseries
> or are you running a nested L2 guest on KVM/pseries L1 ?
>

The host kernel ran on P9 bare metal, and PowerKVM is used here.

> And as far as I remember, the patch above only impacts the IBM PowerVM
> hypervisor, not KVM, and PHB hotplug, or kdump induces some hot-plugging
> I am not aware of.
>

Sorry that my information was not clear. The suspect series is "[PATCH
00/31] powerpc: Modernize the PCI/MSI support", which in mainline begins
with commit 786e5b102a00 ("powerpc/pseries/pci: Introduce
__find_pe_total_msi()").

I tried to bisect, and commit a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI
domains") hangs even the first kernel. So I went ahead to find the next
functional change on pseries, which is commit 174db9e7f775
("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug").

> Also, if indeed, this is a L2 guest, the XIVE interrupt controller is
> emulated in QEMU, "info pic" should return:
>
>    ...
>    irqchip: emulated
>
> >>
> >> [0.00] Kernel command line: elfcorehdr=0x22c0
> >> no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
> >> irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
> >> numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
> >> kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
> >> hugetlb_cma=0
> >> ...
> >> [7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
> >> [7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
> >> blocks (10.7 GB/10.0 GiB)
> >> [ 68.398234] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 is taking a long time
> >> [ 188.398258] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 killed
> >>
> >>
> >> During my test, I found that in very rare cases, the kdump can success
> >> (I guess it may be due to the cpu id). And if using either maxcpus=2
> >> or using scsi-disk, then kdump can also success. And before the
> >> mentioned commit, kdump can also success.
> >>
> >> The attachment contains the xml to reproduce that bug.
> >>
> >> Do you have any ideas?
>
> Most certainly an interrupt not being delivered. You can check the status
> on the host with :
>
>    virsh qemu-monitor-command --hmp "info pic"
>

OK, I will try to get hold of a P9 machine and give it a shot. I will
update the info later.

Thanks,

	Pingfan

>
>
> Thanks,
>
> C.
Re: // a kdump hang caused by PPC pci patch series
Sorry, I forgot to add a subject.

On Mon, Nov 21, 2022 at 7:54 PM Pingfan Liu wrote:
>
> Hello Powerpc folks,
>
> I encountered a kdump bug, which I bisected down to commit 174db9e7f775
> ("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")
>
> In that case, using Fedora 36 as the host, the mentioned commit as the
> guest kernel, and a virtio-block disk, the kdump kernel hangs:
>
> [0.00] Kernel command line: elfcorehdr=0x22c0
> no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
> irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
> numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
> kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
> hugetlb_cma=0
> ...
> [7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
> [7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
> blocks (10.7 GB/10.0 GiB)
> [ 68.398234] systemd-udevd[187]: virtio2: Worker [190]
> processing SEQNUM=1193 is taking a long time
> [ 188.398258] systemd-udevd[187]: virtio2: Worker [190]
> processing SEQNUM=1193 killed
>
>
> During my tests, I found that in very rare cases kdump can succeed
> (I guess it may depend on the cpu id). With either maxcpus=2 or a
> scsi-disk, kdump also succeeds. Before the mentioned commit, kdump
> succeeds as well.
>
> The attachment contains the xml to reproduce that bug.
>
> Do you have any ideas?
>
> Thanks
[no subject]
Hello Powerpc folks,

I encountered a kdump bug, which I bisected down to commit 174db9e7f775
("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")

In that case, using Fedora 36 as the host, the mentioned commit as the
guest kernel, and a virtio-block disk, the kdump kernel hangs:

[0.00] Kernel command line: elfcorehdr=0x22c0
no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
hugetlb_cma=0
...
[7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
[7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
blocks (10.7 GB/10.0 GiB)
[ 68.398234] systemd-udevd[187]: virtio2: Worker [190]
processing SEQNUM=1193 is taking a long time
[ 188.398258] systemd-udevd[187]: virtio2: Worker [190]
processing SEQNUM=1193 killed


During my tests, I found that in very rare cases kdump can succeed
(I guess it may depend on the cpu id). With either maxcpus=2 or a
scsi-disk, kdump also succeeds. Before the mentioned commit, kdump
succeeds as well.

The attachment contains the xml to reproduce that bug.

Do you have any ideas?

Thanks

[Attached libvirt domain XML; markup lost in the archive. Surviving values: rhel9, 6266c1c1-1e74-4046-b959-33d94877b387, 16777216, 16777216, 16, hvm, POWER9, destroy, restart, destroy, /usr/libexec/qemu-kvm, /dev/urandom]
[RFC 08/10] cpuhp: Replace cpumask_any_but(cpu_online_mask, cpu)
In the kexec quick-reboot path, the dying CPUs are still set in
cpu_online_mask. During CPU teardown, a subsystem needs to migrate its
broker to a CPU that is really online.

This patch replaces cpumask_any_but(cpu_online_mask, cpu) in the teardown
procedures with cpumask_not_dying_but(cpu_online_mask, cpu).

Signed-off-by: Pingfan Liu
Cc: Russell King
Cc: Shawn Guo
Cc: Sascha Hauer
Cc: Pengutronix Kernel Team
Cc: Fabio Estevam
Cc: NXP Linux Team
Cc: Fenghua Yu
Cc: Dave Jiang
Cc: Vinod Koul
Cc: Wu Hao
Cc: Tom Rix
Cc: Moritz Fischer
Cc: Xu Yilun
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Tvrtko Ursulin
Cc: David Airlie
Cc: Daniel Vetter
Cc: Will Deacon
Cc: Mark Rutland
Cc: Frank Li
Cc: Shaokun Zhang
Cc: Qi Liu
Cc: Andy Gross
Cc: Bjorn Andersson
Cc: Konrad Dybcio
Cc: Khuong Dinh
Cc: Li Yang
Cc: Yury Norov
To: linux-arm-ker...@lists.infradead.org
To: dmaeng...@vger.kernel.org
To: linux-f...@vger.kernel.org
To: intel-...@lists.freedesktop.org
To: dri-de...@lists.freedesktop.org
To: linux-arm-...@vger.kernel.org
To: linuxppc-dev@lists.ozlabs.org
To: linux-ker...@vger.kernel.org
---
 arch/arm/mach-imx/mmdc.c                 | 2 +-
 arch/arm/mm/cache-l2x0-pmu.c             | 2 +-
 drivers/dma/idxd/perfmon.c               | 2 +-
 drivers/fpga/dfl-fme-perf.c              | 2 +-
 drivers/gpu/drm/i915/i915_pmu.c          | 2 +-
 drivers/perf/arm-cci.c                   | 2 +-
 drivers/perf/arm-ccn.c                   | 2 +-
 drivers/perf/arm-cmn.c                   | 4 ++--
 drivers/perf/arm_dmc620_pmu.c            | 2 +-
 drivers/perf/arm_dsu_pmu.c               | 2 +-
 drivers/perf/arm_smmuv3_pmu.c            | 2 +-
 drivers/perf/fsl_imx8_ddr_perf.c         | 2 +-
 drivers/perf/hisilicon/hisi_uncore_pmu.c | 2 +-
 drivers/perf/marvell_cn10k_tad_pmu.c     | 2 +-
 drivers/perf/qcom_l2_pmu.c               | 2 +-
 drivers/perf/qcom_l3_pmu.c               | 2 +-
 drivers/perf/xgene_pmu.c                 | 2 +-
 drivers/soc/fsl/qbman/bman_portal.c      | 2 +-
 drivers/soc/fsl/qbman/qman_portal.c      | 2 +-
 19 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/arm/mach-imx/mmdc.c b/arch/arm/mach-imx/mmdc.c
index af12668d0bf5..a109a7ea8613 100644
--- a/arch/arm/mach-imx/mmdc.c
+++ b/arch/arm/mach-imx/mmdc.c
@@ -220,7 +220,7 @@ static int mmdc_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
 	if (!cpumask_test_and_clear_cpu(cpu, _mmdc->cpu))
 		return 0;

-	target = cpumask_any_but(cpu_online_mask, cpu);
+	target = cpumask_not_dying_but(cpu_online_mask, cpu);
 	if (target >= nr_cpu_ids)
 		return 0;

diff --git a/arch/arm/mm/cache-l2x0-pmu.c b/arch/arm/mm/cache-l2x0-pmu.c
index 993fefdc167a..1b0037ef7fa5 100644
--- a/arch/arm/mm/cache-l2x0-pmu.c
+++ b/arch/arm/mm/cache-l2x0-pmu.c
@@ -428,7 +428,7 @@ static int l2x0_pmu_offline_cpu(unsigned int cpu)
 	if (!cpumask_test_and_clear_cpu(cpu, _cpu))
 		return 0;

-	target = cpumask_any_but(cpu_online_mask, cpu);
+	target = cpumask_not_dying_but(cpu_online_mask, cpu);
 	if (target >= nr_cpu_ids)
 		return 0;

diff --git a/drivers/dma/idxd/perfmon.c b/drivers/dma/idxd/perfmon.c
index d73004f47cf4..f3f1ccb55f73 100644
--- a/drivers/dma/idxd/perfmon.c
+++ b/drivers/dma/idxd/perfmon.c
@@ -528,7 +528,7 @@ static int perf_event_cpu_offline(unsigned int cpu, struct hlist_node *node)
 	if (!cpumask_test_and_clear_cpu(cpu, _dsa_cpu_mask))
 		return 0;

-	target = cpumask_any_but(cpu_online_mask, cpu);
+	target = cpumask_not_dying_but(cpu_online_mask, cpu);

 	/* migrate events if there is a valid target */
 	if (target < nr_cpu_ids)
diff --git a/drivers/fpga/dfl-fme-perf.c b/drivers/fpga/dfl-fme-perf.c
index 587c82be12f7..57804f28357e 100644
--- a/drivers/fpga/dfl-fme-perf.c
+++ b/drivers/fpga/dfl-fme-perf.c
@@ -948,7 +948,7 @@ static int fme_perf_offline_cpu(unsigned int cpu, struct hlist_node *node)
 	if (cpu != priv->cpu)
 		return 0;

-	target = cpumask_any_but(cpu_online_mask, cpu);
+	target = cpumask_not_dying_but(cpu_online_mask, cpu);
 	if (target >= nr_cpu_ids)
 		return 0;

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 958b37123bf1..f866f9223492 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -1068,7 +1068,7 @@ static int i915_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
 		return 0;

 	if (cpumask_test_and_clear_cpu(cpu, _pmu_cpumask)) {
-		target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
+		target = cpumask_not_dying_but(topology_sibling_cpumask(cpu), cpu);

 		/* Migrate events if there is a valid target */
 		if (target < nr_cpu_ids) {
diff --git a/drivers/perf/arm-cci.c b/drivers/perf/arm-cci.c
index 03b1309875ae
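
For readers following the thread: the definition of cpumask_not_dying_but() is not part of this message, so the following is only a userspace sketch of the behaviour the commit message describes, under the assumption that "dying" CPUs are tracked in a separate mask. The names model_any_but(), model_not_dying_but() and MODEL_NR_CPU_IDS are hypothetical, and a plain 64-bit word stands in for struct cpumask.

```c
#include <stdint.h>

/* Hypothetical stand-in for nr_cpu_ids; one bit per CPU in a 64-bit word. */
#define MODEL_NR_CPU_IDS 64u

/*
 * Model of "any CPU in mask except cpu": returns the lowest such CPU,
 * or MODEL_NR_CPU_IDS when there is none (callers in the diff test
 * "target >= nr_cpu_ids" for that case).
 */
static unsigned int model_any_but(uint64_t mask, unsigned int cpu)
{
	for (unsigned int i = 0; i < MODEL_NR_CPU_IDS; i++) {
		if (i != cpu && (mask & (UINT64_C(1) << i)))
			return i;
	}
	return MODEL_NR_CPU_IDS;
}

/*
 * Assumed semantics of the new helper: same as above, but CPUs that are
 * already marked dying are skipped even though they are still online.
 */
static unsigned int model_not_dying_but(uint64_t mask, uint64_t dying,
					unsigned int cpu)
{
	return model_any_but(mask & ~dying, cpu);
}
```

With CPUs 0-2 online and CPU 1 already dying, model_any_but() called for CPU 0 would hand back CPU 1 as the migration target, while model_not_dying_but() skips it and returns CPU 2.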
[PATCHv4 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable
smp_shutdown_nonboot_cpus() repeats the same code chunk as
migrate_to_reboot_cpu() to ensure that the reboot happens on a valid CPU:

	if (!cpu_online(primary_cpu))
		primary_cpu = cpumask_first(cpu_online_mask);

This is due to an unexpected cpu-down event like the following:

kernel_kexec()
   migrate_to_reboot_cpu();
   cpu_hotplug_enable();
                        ---> comes a cpu_down(this_cpu) on other cpu
   machine_shutdown();
     smp_shutdown_nonboot_cpus();   which needs to re-check "if (!cpu_online(primary_cpu))"

Although the kexec-reboot task can get through a cpu_down() on its CPU,
this code looks a little confusing.

Tracing down the git history, the cpu_hotplug_enable() called by
kernel_kexec() was introduced by commit 011e4b02f1da ("powerpc, kexec: Fix
"Processor X is stuck" issue during kexec from ST mode"), which wakes up
all offline CPUs by cpu_up(cpu). Later, it became required by the
architectures (arm/arm64/ia64/riscv) which resort to CPU hot-removal to
achieve kexec-reboot by
smp_shutdown_nonboot_cpus()->cpu_down_maps_locked().

Hence, the cpu_hotplug_enable() in kernel_kexec() is an architecture
requirement. By deferring the CPU hotplug enable to a more proper point,
where smp_shutdown_nonboot_cpus() holds cpu_add_remove_lock, the
unexpected cpu-down event is ruled out and the rebooting CPU stays
unchanged. (For powerpc, there are no gains from this change.)

As a result, the repeated code chunk can be removed, and in [2/2] the
call sites of smp_shutdown_nonboot_cpus() can be made consistent.
Signed-off-by: Pingfan Liu Cc: Eric Biederman Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Donnefort Cc: Ingo Molnar Cc: Michael Ellerman Cc: Mark Rutland Cc: YueHaibing Cc: Baokun Li Cc: Randy Dunlap Cc: Valentin Schneider Cc: ke...@lists.infradead.org To: linuxppc-dev@lists.ozlabs.org To: linux-ker...@vger.kernel.org --- arch/powerpc/kexec/core_64.c | 1 + kernel/cpu.c | 10 +- kernel/kexec_core.c | 11 +-- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c index 6cc7793b8420..8ccf22197f08 100644 --- a/arch/powerpc/kexec/core_64.c +++ b/arch/powerpc/kexec/core_64.c @@ -224,6 +224,7 @@ static void wake_offline_cpus(void) static void kexec_prepare_cpus(void) { + cpu_hotplug_enable(); wake_offline_cpus(); smp_call_function(kexec_smp_down, NULL, /* wait */0); local_irq_disable(); diff --git a/kernel/cpu.c b/kernel/cpu.c index d0a9aa0b42e8..4415370f0e91 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1236,12 +1236,12 @@ void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) cpu_maps_update_begin(); /* -* Make certain the cpu I'm about to reboot on is online. -* -* This is inline to what migrate_to_reboot_cpu() already do. +* At this point, the cpu hotplug is still disabled by +* migrate_to_reboot_cpu() to guarantee that the rebooting happens on +* the selected CPU. But cpu_down_maps_locked() returns -EBUSY, if +* cpu_hotplug_disabled. So re-enable CPU hotplug here. 
*/ - if (!cpu_online(primary_cpu)) - primary_cpu = cpumask_first(cpu_online_mask); + __cpu_hotplug_enable(); for_each_online_cpu(cpu) { if (cpu == primary_cpu) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 68480f731192..1bd5a8c95a20 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1168,14 +1168,13 @@ int kernel_kexec(void) kexec_in_progress = true; kernel_restart_prepare("kexec reboot"); migrate_to_reboot_cpu(); - /* -* migrate_to_reboot_cpu() disables CPU hotplug assuming that -* no further code needs to use CPU hotplug (which is true in -* the reboot case). However, the kexec path depends on using -* CPU hotplug again; so re-enable it here. +* migrate_to_reboot_cpu() disables CPU hotplug and pin the +* rebooting thread on the selected CPU. If an architecture +* requires CPU hotplug to achieve kexec reboot, it should +* enable the hotplug in the architecture specific code */ - cpu_hotplug_enable(); + pr_notice("Starting new kernel\n"); machine_shutdown(); } -- 2.31.1
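
To make the ordering argument above easier to follow, here is a small userspace model. It is only a sketch: struct model_cpus and the model_*() functions are hypothetical, and the one behaviour taken from the patch is that a cpu-down attempt fails with -EBUSY while hotplug is disabled, so nothing can take the rebooting CPU down between migrate_to_reboot_cpu() and the point where smp_shutdown_nonboot_cpus() re-enables hotplug.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical state: which CPUs are online, and the hotplug disable count. */
struct model_cpus {
	bool online[MODEL_NR_CPUS];
	int hotplug_disabled;
};

/* Stands in for cpu_down_maps_locked(): refused while hotplug is disabled. */
static int model_cpu_down(struct model_cpus *s, int cpu)
{
	if (s->hotplug_disabled)
		return -EBUSY;
	s->online[cpu] = false;
	return 0;
}

/* Stands in for the hotplug-disabling side effect of migrate_to_reboot_cpu(). */
static void model_migrate_to_reboot_cpu(struct model_cpus *s)
{
	s->hotplug_disabled++;
}

/*
 * Stands in for smp_shutdown_nonboot_cpus() after the patch: hotplug is
 * re-enabled only here, then every CPU except primary_cpu is taken down.
 * Returns the number of CPUs that could not be taken down.
 */
static int model_shutdown_nonboot_cpus(struct model_cpus *s, int primary_cpu)
{
	int failed = 0;

	s->hotplug_disabled--;	/* models __cpu_hotplug_enable() */
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu == primary_cpu || !s->online[cpu])
			continue;
		if (model_cpu_down(s, cpu))
			failed++;
	}
	return failed;
}
```

In this model a stray model_cpu_down() issued after model_migrate_to_reboot_cpu() is rejected, which is why the re-check of cpu_online(primary_cpu) becomes unnecessary.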
Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo
Correct mail address of Kazuhito On Tue, Jun 8, 2021 at 6:34 PM Pingfan Liu wrote: > > As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo: > Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the > formula: > #define SECTIONS_SHIFT(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) > > Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate > PAGES_PER_SECTION in makedumpfile just like kernel. > > Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g. > recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce > SECTION_SIZE_BITS"). But user space wants a stable interface to get this > info. Such info is impossible to be deduced from a crashdump vmcore. > Hence append SECTION_SIZE_BITS to vmcoreinfo. > > Signed-off-by: Pingfan Liu > Cc: Bhupesh Sharma > Cc: Kazuhito Hagio > Cc: Dave Young > Cc: Baoquan He > Cc: Boris Petkov > Cc: Ingo Molnar > Cc: Thomas Gleixner > Cc: James Morse > Cc: Mark Rutland > Cc: Will Deacon > Cc: Catalin Marinas > Cc: Michael Ellerman > Cc: Paul Mackerras > Cc: Benjamin Herrenschmidt > Cc: Dave Anderson > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-ker...@vger.kernel.org > Cc: ke...@lists.infradead.org > Cc: x...@kernel.org > Cc: linux-arm-ker...@lists.infradead.org > --- > kernel/crash_core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index 825284baaf46..684a6061a13a 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void) > VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS); > VMCOREINFO_STRUCT_SIZE(mem_section); > VMCOREINFO_OFFSET(mem_section, section_mem_map); > + VMCOREINFO_NUMBER(SECTION_SIZE_BITS); > VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS); > #endif > VMCOREINFO_STRUCT_SIZE(page); > -- > 2.29.2 >
[PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS appears in the
formula:
    #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
PAGES_PER_SECTION in makedumpfile, just as in the kernel.

Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
SECTION_SIZE_BITS"). But user space wants a stable interface to get this
info, and it cannot be deduced from a crashdump vmcore. Hence append
SECTION_SIZE_BITS to vmcoreinfo.

Signed-off-by: Pingfan Liu
Cc: Bhupesh Sharma
Cc: Kazuhito Hagio
Cc: Dave Young
Cc: Baoquan He
Cc: Boris Petkov
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: James Morse
Cc: Mark Rutland
Cc: Will Deacon
Cc: Catalin Marinas
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Dave Anderson
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: ke...@lists.infradead.org
Cc: x...@kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
 kernel/crash_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 825284baaf46..684a6061a13a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
 	VMCOREINFO_STRUCT_SIZE(mem_section);
 	VMCOREINFO_OFFSET(mem_section, section_mem_map);
+	VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
 	VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
 #endif
 	VMCOREINFO_STRUCT_SIZE(page);
--
2.29.2
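
As an illustration of why a dump tool needs the exported value, the two derived quantities named above can be computed as in the sketch below. The function names are hypothetical (this is not makedumpfile code), and the numbers in the note that follows are example inputs rather than values read from any particular kernel.

```c
#include <stdint.h>

/* SECTIONS_SHIFT = MAX_PHYSMEM_BITS - SECTION_SIZE_BITS, as in the formula above. */
static unsigned int model_sections_shift(unsigned int max_physmem_bits,
					 unsigned int section_size_bits)
{
	return max_physmem_bits - section_size_bits;
}

/*
 * PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT): the number of
 * pages covered by one sparsemem section.
 */
static uint64_t model_pages_per_section(unsigned int section_size_bits,
					unsigned int page_shift)
{
	return UINT64_C(1) << (section_size_bits - page_shift);
}
```

With 4 KiB pages (page shift 12), a section size of 30 bits gives 262144 pages per section, while 27 bits gives 32768. A tool that hard-codes one value will mis-walk the section array of a kernel built with the other, which is the case for exporting SECTION_SIZE_BITS.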
Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
On Sat, Apr 10, 2021 at 12:33 AM Michal Suchánek wrote: > > Hello, > > On Fri, Aug 28, 2020 at 04:10:09PM +0800, Pingfan Liu wrote: > > On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour > > wrote: > > > > > > Le 10/08/2020 à 10:52, Pingfan Liu a écrit : > > > > A bug is observed on pseries by taking the following steps on rhel: > > > > -1. drmgr -c mem -r -q 5 > > > > -2. echo c > /proc/sysrq-trigger > > > > > > > > And then, the failure looks like: > > > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/ > > > > kdump: saving vmcore-dmesg.txt > > > > kdump: saving vmcore-dmesg.txt complete > > > > kdump: saving vmcore > > > > Checking for memory holes : [ 0.0 %] / > > > > Checking for memory holes : [100.0 > > > > %] | Excluding unnecessary pages > > > > : [100.0 %] \ Copying data > > > >: [ 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: > > > > Hashing failure ! EA=0x7fffba40 access=0x8004 > > > > current=makedumpfile > > > > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base > > > > psize=2 psize 2 pte=0xc0005504 > > > > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 > > > > access=0x8004 current=makedumpfile > > > > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base > > > > psize=2 psize 2 pte=0xc0005504 > > > > [ 44.337708] makedumpfile[469]: unhandled signal 7 at > > > > 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2 > > > > [ 44.338548] Core dump to |/bin/false pipe failed > > > > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error > > > > $CORE_COLLECTOR /proc/vmcore > > > > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete > > > > kdump: saving vmcore failed > > > > > > > > * Root cause * > > > >After analyzing, it turns out that in the current implementation, > > > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt > > > > updating as > > > > the code __remove_memory() comes before drmem_update_dt(). 
> > > > So in kdump kernel, when read_from_oldmem() resorts to > > > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to > > > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, > > > > as it > > > > can be observed "Bus error" > > > > > > > > From a viewpoint of listener and publisher, the publisher notifies the > > > > listener before data is ready. This introduces a problem where udev > > > > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before > > > > updating. And in capture kernel, makedumpfile will access the memory > > > > based > > > > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb. > > > > > > > > * Fix * > > > > This bug is introduced by commit 063b8b1251fd > > > > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR > > > > request"), which tried to combine all the dt updating into one. > > > > > > > > To fix this issue, meanwhile not to introduce a quadratic runtime > > > > complexity by the model: > > > >dlpar_memory_add_by_count > > > > for_each_drmem_lmb <-- > > > >dlpar_add_lmb > > > > drmem_update_dt(_v1|_v2) > > > >for_each_drmem_lmb <-- > > > > The dt should still be only updated once, and just before the last > > > > memory > > > > online/offline event is ejected to user space. Achieve this by tracing > > > > the > > > > num of lmb added or removed. > > > > > > > > Signed-off-by: Pingfan Liu > > > > Cc: Michael Ellerman > > > > Cc: Hari Bathini > > > > Cc: Nathan Lynch > > > > Cc: Nathan Fontenot > > > > Cc: Laurent Dufour > > > > To: linuxppc-dev@lists.ozlabs.org > > > > Cc: ke...@lists.infradead.org > > > > --- > > > > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() pro
Re: [PATCH 0/3] warn and suppress irqflood
On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner wrote:
>
> On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > I hit an irqflood bug on a powerpc platform, and two years ago, on an x86
> > platform.
> > When the bug happens, the kernel is totally occupied by irq. Currently,
> > there may be nothing or just a soft lockup warning shown in the console.
> > It is better to warn users with irq flood info.
> >
> > In the kdump case, the kernel can move on by suppressing the irq flood.
>
> You're curing the symptom not the cause and the cure is just magic and
> can't work reliably.
Yeah, it is magic. But at least it is better to printk something and alert
users to what is happening. With the current code, nothing may be shown
when the system hangs.
>
> Where is that irq flood originated from and why is none of the
> mechanisms we have in place to shut it up working?
The bug originates from the tpm_i2c_nuvoton driver, which calls the
i2c-bus driver (i2c-opal.c). The bug is triggered after
i2c_opal_send_request().

But things are complicated by a firmware layer: Skiboot. This software
layer hides the details of manipulating the hardware from Linux.

I guess the firmware logic cannot enter a sane state when the kernel
crashes.

Cc Skiboot and the ppc64 community to see whether anyone has an idea about
it.

Thanks,
Pingfan
Re: [PATCH] powerpc/time: enable sched clock for irqtime
I encountered an irq flood on a Power9 machine, and tried to work around it
with https://www.spinics.net/lists/kernel/msg3705028.html

As irq time accounting is the foundation of that method, irq time
accounting needs to take effect on the powerpc platform.

On Thu, Oct 22, 2020 at 2:51 PM Pingfan Liu wrote:
>
> When CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are
> both enabled, powerpc does not enable "sched_clock_irqtime" and cannot
> utilize irq time accounting.
>
> Like x86, powerpc does not use the sched_clock_register() interface. So it
> needs a dedicated call to enable_sched_clock_irqtime() to enable irq time
> accounting.
>
> Signed-off-by: Pingfan Liu
> Cc: Michael Ellerman
> Cc: Christophe Leroy
> Cc: Nicholas Piggin
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/time.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f85539e..4083b59e 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -53,6 +53,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>
> @@ -1134,6 +1135,7 @@ void __init time_init(void)
>  	tick_setup_hrtimer_broadcast();
>
>  	of_clk_init(NULL);
> +	enable_sched_clock_irqtime();
>  }
>
>  /*
> --
> 2.7.5
>
[PATCH] powerpc/time: enable sched clock for irqtime
When CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are
both enabled, powerpc does not enable "sched_clock_irqtime" and cannot
utilize irq time accounting.

Like x86, powerpc does not use the sched_clock_register() interface. So it
needs a dedicated call to enable_sched_clock_irqtime() to enable irq time
accounting.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Christophe Leroy
Cc: Nicholas Piggin
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/time.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f85539e..4083b59e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -53,6 +53,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -1134,6 +1135,7 @@ void __init time_init(void)
 	tick_setup_hrtimer_broadcast();

 	of_clk_init(NULL);
+	enable_sched_clock_irqtime();
 }

 /*
--
2.7.5
Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour wrote: > > Le 10/08/2020 à 10:52, Pingfan Liu a écrit : > > A bug is observed on pseries by taking the following steps on rhel: > > -1. drmgr -c mem -r -q 5 > > -2. echo c > /proc/sysrq-trigger > > > > And then, the failure looks like: > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/ > > kdump: saving vmcore-dmesg.txt > > kdump: saving vmcore-dmesg.txt complete > > kdump: saving vmcore > > Checking for memory holes : [ 0.0 %] / > > Checking for memory holes : [100.0 %] | > > Excluding unnecessary pages : [100.0 %] > > \ Copying data : [ > > 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! > > EA=0x7fffba40 access=0x8004 current=makedumpfile > > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > > psize 2 pte=0xc0005504 > > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 > > access=0x8004 current=makedumpfile > > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > > psize 2 pte=0xc0005504 > > [ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 > > nip 7fffbbc4d7fc lr 00011356ca3c code 2 > > [ 44.338548] Core dump to |/bin/false pipe failed > > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error > > $CORE_COLLECTOR /proc/vmcore > > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete > > kdump: saving vmcore failed > > > > * Root cause * > >After analyzing, it turns out that in the current implementation, > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating > > as > > the code __remove_memory() comes before drmem_update_dt(). > > So in kdump kernel, when read_from_oldmem() resorts to > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it > > can be observed "Bus error" > > > > From a viewpoint of listener and publisher, the publisher notifies the > > listener before data is ready. 
This introduces a problem where udev > > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before > > updating. And in capture kernel, makedumpfile will access the memory based > > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb. > > > > * Fix * > > This bug is introduced by commit 063b8b1251fd > > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR > > request"), which tried to combine all the dt updating into one. > > > > To fix this issue, meanwhile not to introduce a quadratic runtime > > complexity by the model: > >dlpar_memory_add_by_count > > for_each_drmem_lmb <-- > >dlpar_add_lmb > > drmem_update_dt(_v1|_v2) > >for_each_drmem_lmb <-- > > The dt should still be only updated once, and just before the last memory > > online/offline event is ejected to user space. Achieve this by tracing the > > num of lmb added or removed. > > > > Signed-off-by: Pingfan Liu > > Cc: Michael Ellerman > > Cc: Hari Bathini > > Cc: Nathan Lynch > > Cc: Nathan Fontenot > > Cc: Laurent Dufour > > To: linuxppc-dev@lists.ozlabs.org > > Cc: ke...@lists.infradead.org > > --- > > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report > >whether dt is updated successfully. > >Fix a condition boundary check bug > > v3 -> v4: resolve a quadratic runtime complexity issue. > >This series is applied on next-test branch > > arch/powerpc/platforms/pseries/hotplug-memory.c | 102 > > +++- > > 1 file changed, 80 insertions(+), 22 deletions(-) > > > > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c > > b/arch/powerpc/platforms/pseries/hotplug-memory.c > > index 46cbcd1..1567d9f 100644 > > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c > > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c > > @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb) > > return true; > > } > > > > -static int dlpar_add_lmb(struct drmem_lmb *); > > +enum dt_update_status { > > + DT_NOUPDATE, > > + DT_TOUP
Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
Hello guys. Do you have further comments on this version? Thanks, Pingfan On Mon, Aug 10, 2020 at 4:53 PM Pingfan Liu wrote: > > A bug is observed on pseries by taking the following steps on rhel: > -1. drmgr -c mem -r -q 5 > -2. echo c > /proc/sysrq-trigger > > And then, the failure looks like: > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/ > kdump: saving vmcore-dmesg.txt > kdump: saving vmcore-dmesg.txt complete > kdump: saving vmcore > Checking for memory holes : [ 0.0 %] / > Checking for memory holes : [100.0 %] | > Excluding unnecessary pages : [100.0 %] \ > Copying data : [ 0.3 %] - > eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! > EA=0x7fffba40 access=0x8004 current=makedumpfile > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > psize 2 pte=0xc0005504 > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 > access=0x8004 current=makedumpfile > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > psize 2 pte=0xc0005504 > [ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip > 7fffbbc4d7fc lr 00011356ca3c code 2 > [ 44.338548] Core dump to |/bin/false pipe failed > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error > $CORE_COLLECTOR /proc/vmcore > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete > kdump: saving vmcore failed > > * Root cause * > After analyzing, it turns out that in the current implementation, > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as > the code __remove_memory() comes before drmem_update_dt(). > So in kdump kernel, when read_from_oldmem() resorts to > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it > can be observed "Bus error" > > From a viewpoint of listener and publisher, the publisher notifies the > listener before data is ready. 
This introduces a problem where udev > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before > updating. And in capture kernel, makedumpfile will access the memory based > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb. > > * Fix * > This bug is introduced by commit 063b8b1251fd > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR > request"), which tried to combine all the dt updating into one. > > To fix this issue, meanwhile not to introduce a quadratic runtime > complexity by the model: > dlpar_memory_add_by_count > for_each_drmem_lmb <-- > dlpar_add_lmb > drmem_update_dt(_v1|_v2) > for_each_drmem_lmb <-- > The dt should still be only updated once, and just before the last memory > online/offline event is ejected to user space. Achieve this by tracing the > num of lmb added or removed. > > Signed-off-by: Pingfan Liu > Cc: Michael Ellerman > Cc: Hari Bathini > Cc: Nathan Lynch > Cc: Nathan Fontenot > Cc: Laurent Dufour > To: linuxppc-dev@lists.ozlabs.org > Cc: ke...@lists.infradead.org > --- > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report > whether dt is updated successfully. > Fix a condition boundary check bug > v3 -> v4: resolve a quadratic runtime complexity issue. 
> This series is applied on next-test branch > arch/powerpc/platforms/pseries/hotplug-memory.c | 102 > +++- > 1 file changed, 80 insertions(+), 22 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c > b/arch/powerpc/platforms/pseries/hotplug-memory.c > index 46cbcd1..1567d9f 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c > @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb) > return true; > } > > -static int dlpar_add_lmb(struct drmem_lmb *); > +enum dt_update_status { > + DT_NOUPDATE, > + DT_TOUPDATE, > + DT_UPDATED, > +}; > + > +/* "*dt_update" returns DT_UPDATED if updated */ > +static int dlpar_add_lmb(struct drmem_lmb *lmb, > + enum dt_update_status *dt_update); > > -static int dlpar_remove_lmb(struct drmem_lmb *lmb) > +static int dlpar_remove_lmb(struct drmem_lmb *lmb, > + enum dt_update_status *dt_update) > { > unsigned long block_sz;
[PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [ 0.0 %] /
 Checking for memory holes : [100.0 %] |
 Excluding unnecessary pages : [100.0 %] \
 Copying data : [ 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[ 44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98: 469 Bus error $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
Analysis shows that in the current implementation, when an LMB is
hot-removed, the KOBJ_REMOVE event is emitted before the device tree is
updated, because __remove_memory() is called before drmem_update_dt().
So in the kdump kernel, read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install an hpte, which fails with -2 because
the pfn does not exist. Finally, low_hash_fault() raises SIGBUS to the
process, which is observed as "Bus error".

From the viewpoint of listener and publisher, the publisher notifies the
listener before the data is ready. This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before it is
updated. In the capture kernel, makedumpfile then accesses memory based on
the stale dt info and hits a SIGBUS error due to a non-existent LMB.

* Fix *
This bug was introduced by commit 063b8b1251fd
("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
request"), which tried to combine all the dt updates into one.

To fix this issue without introducing a quadratic runtime complexity
through the model:
  dlpar_memory_add_by_count
    for_each_drmem_lmb             <--
      dlpar_add_lmb
        drmem_update_dt(_v1|_v2)
          for_each_drmem_lmb       <--
the dt should still be updated only once, just before the last memory
online/offline event is emitted to user space. Achieve this by tracking the
number of LMBs added or removed.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Hari Bathini
Cc: Nathan Lynch
Cc: Nathan Fontenot
Cc: Laurent Dufour
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
          whether dt is updated successfully.
          Fix a condition boundary check bug
v3 -> v4: resolve a quadratic runtime complexity issue.
          This series is applied on next-test branch
 arch/powerpc/platforms/pseries/hotplug-memory.c | 102 +++-
 1 file changed, 80 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 46cbcd1..1567d9f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 	return true;
 }

-static int dlpar_add_lmb(struct drmem_lmb *);
+enum dt_update_status {
+	DT_NOUPDATE,
+	DT_TOUPDATE,
+	DT_UPDATED,
+};
+
+/* "*dt_update" returns DT_UPDATED if updated */
+static int dlpar_add_lmb(struct drmem_lmb *lmb,
+		enum dt_update_status *dt_update);

-static int dlpar_remove_lmb(struct drmem_lmb *lmb)
+static int dlpar_remove_lmb(struct drmem_lmb *lmb,
+		enum dt_update_status *dt_update)
 {
 	unsigned long block_sz;
 	phys_addr_t base_addr;
-	int rc, nid;
+	int rc, ret, nid;

 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
@@ -372,6 +381,13 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+	if (*dt_update) {
+		ret = drmem_update_dt();
+		if (ret)
+			pr_warn("%s fail to update dt, but continue\n", __func__);
+		else
+			*dt_update = DT_UPDATED;
+	}
 	__remove_m
[PATCHv5 1/2] powerpc/pseries: group lmb operation and memblock's
This patch prepares for the incoming patch which swaps the order of
KOBJ_ADD/REMOVE uevent and dt's updating.

The dt updating should come after lmb operations, and before
__remove_memory()/__add_memory(). Accordingly, grouping all lmb operations
before the memblock's.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Hari Bathini
Cc: Nathan Lynch
Cc: Nathan Fontenot
Cc: Laurent Dufour
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v4 -> v5: fix the miss of clearing DRCONF_MEM_ASSIGNED in a failure path
 arch/powerpc/platforms/pseries/hotplug-memory.c | 28 +
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..46cbcd1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
-	int rc;
+	phys_addr_t base_addr;
+	int rc, nid;
 
 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	if (rc)
 		return rc;
 
+	base_addr = lmb->base_addr;
+	nid = lmb->nid;
 	block_sz = pseries_memory_block_size();
 
-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-	/* Update memory regions for memory remove */
-	memblock_remove(lmb->base_addr, block_sz);
-
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+	__remove_memory(nid, base_addr, block_sz);
+
+	/* Update memory regions for memory remove */
+	memblock_remove(base_addr, block_sz);
+
 	return 0;
 }
 
@@ -603,22 +606,29 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	}
 
 	lmb_set_nid(lmb);
+	lmb->flags |= DRCONF_MEM_ASSIGNED;
+
 	block_sz = memory_block_size_bytes();
 
 	/* Add the memory */
 	rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
+		lmb_clear_nid(lmb);
+		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 		return rc;
 	}
 
 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+		int nid = lmb->nid;
+		phys_addr_t base_addr = lmb->base_addr;
+
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
-	} else {
-		lmb->flags |= DRCONF_MEM_ASSIGNED;
+		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+		__remove_memory(nid, base_addr, block_sz);
 	}
 
 	return rc;
-- 
2.7.5
Re: [PATCHv4 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
On Tue, Aug 4, 2020 at 12:29 AM Laurent Dufour wrote:
> [...]
> >  	lmb_set_nid(lmb);
> >  	lmb->flags |= DRCONF_MEM_ASSIGNED;
> > +	if (dt_update) {
> > +		ret = drmem_update_dt();
> > +		if (ret)
> > +			pr_warn("%s fail to update dt, but continue\n", __func__);
> > +	}
> >
> >  	block_sz = memory_block_size_bytes();
>
> In the case the call to __add_memory is failing, the flag DRCONF_MEM_ASSIGNED
> should be cleared as I mentioned in your previous patch. In addition to this,

Yes.

> the DT should be updated, or the caller should manage that but that will be hard
> since it doesn't know where the error happened in this function.

Yeah, it is hard for the caller to manage, so just updating the dt here is
the easier method.

> >
> > @@ -625,7 +653,11 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >  		invalidate_lmb_associativity_index(lmb);
> >  		lmb_clear_nid(lmb);
> >  		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > -
> > +		if (dt_update) {
> > +			ret = drmem_update_dt();
> > +			if (ret)
> > +				pr_warn("%s fail to update dt during rollback, but continue\n", __func__);
> > +		}
> >  		__remove_memory(nid, base_addr, block_sz);
> >  	}
> >
> > @@ -638,6 +670,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >  	int lmbs_available = 0;
> >  	int lmbs_added = 0;
> >  	int rc;
> > +	bool dt_update = false;
> >
> >  	pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);
> >
> > @@ -664,7 +697,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >  		if (rc)
> >  			continue;
> >
> > -		rc = dlpar_add_lmb(lmb);
> > +		rc = dlpar_add_lmb(lmb, dt_update);
> >  		if (rc) {
> >  			dlpar_release_drc(lmb->drc_index);
> >  			continue;
> > @@ -678,16 +711,23 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >  		lmbs_added++;
> >  		if (lmbs_added == lmbs_to_add)
> >  			break;
> > +		else if (lmbs_added == lmbs_to_add - 1)
> > +			dt_update = true;
>
> In the case the number of LMB to add is 1, dt_update is never set to true, and
> the device tree is never updated. You need to set dt_update to true earlier in
> the loop block.

Oh, I will fix it in V5.

> >  	}
> >
> >  	if (lmbs_added != lmbs_to_add) {
> > +		bool rollback_dt_update = false;
> > +
> >  		pr_err("Memory hot-add failed, removing any added LMBs\n");
> >
> >  		for_each_drmem_lmb(lmb) {
> >  			if (!drmem_lmb_reserved(lmb))
> >  				continue;
> >
> > -			rc = dlpar_remove_lmb(lmb);
> > +			if (--lmbs_added == 0 && dt_update)
> > +				rollback_dt_update = true;
>
> That test may have to be review to deal with error during the last LMB addition,
> the DT may have been updated before __add_memory() is failing in
> dlpar_add_lmb(). In that case, lmbs_added == 0 and that branch is not covered.
> That's not an issue if dlpar_add_lmb() is handling that case (see my comment
> above).

I will fix it in the next version. Thanks for your review.

Regards,
Pingfan
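The boundary case Laurent points out can be reproduced outside the kernel. What follows is a minimal userspace sketch, not the kernel code: count_dt_updates_v4() and count_dt_updates_fixed() are hypothetical names, every add is assumed to succeed, and a plain counter stands in for drmem_update_dt(). The "fixed" variant is only one possible way to set the flag earlier; the thread does not say how V5 actually does it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the v4 add loop. Every add is assumed to succeed,
 * and "updates" counts how often drmem_update_dt() would have been called.
 */
int count_dt_updates_v4(int lmbs_to_add)
{
	bool dt_update = false;
	int lmbs_added = 0;
	int updates = 0;

	if (lmbs_to_add <= 0)
		return 0;

	for (;;) {
		/* stands in for dlpar_add_lmb(lmb, dt_update) */
		if (dt_update)
			updates++;

		lmbs_added++;
		if (lmbs_added == lmbs_to_add)
			break;
		else if (lmbs_added == lmbs_to_add - 1)
			dt_update = true;	/* never reached when lmbs_to_add == 1 */
	}
	return updates;
}

/* One possible fix: decide before each add whether it is the last one. */
int count_dt_updates_fixed(int lmbs_to_add)
{
	int lmbs_added = 0;
	int updates = 0;

	while (lmbs_added < lmbs_to_add) {
		bool dt_update = (lmbs_added == lmbs_to_add - 1);

		/* stands in for dlpar_add_lmb(lmb, dt_update) */
		if (dt_update)
			updates++;
		lmbs_added++;
	}
	return updates;
}
```

In this model a request for a single LMB produces no dt update with the v4 logic, while the variant that evaluates the condition before the add produces exactly one update for any positive count.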
Re: [PATCHv4 1/2] powerpc/pseries: group lmb operation and memblock's
On Mon, Aug 3, 2020 at 9:52 PM Laurent Dufour wrote:
>
> > @@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >  	}
> >
> >  	lmb_set_nid(lmb);
> > +	lmb->flags |= DRCONF_MEM_ASSIGNED;
> > +
> >  	block_sz = memory_block_size_bytes();
> >
> >  	/* Add the memory */
>
> Since the lmb->flags is now set earlier, you should unset it in the case the
> call to __add_memory() fails, something like:
>
> @@ -614,6 +614,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  	rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
>  	if (rc) {
>  		invalidate_lmb_associativity_index(lmb);
> +		lmb->flags &= ~DRCONF_MEM_ASSIGNED;

You are right. I will fix it in V5.

Thanks,
Pingfan
[PATCHv4 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [ 0.0 %] /
 Checking for memory holes : [100.0 %] |
 Excluding unnecessary pages : [100.0 %] \
 Copying data : [ 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[ 44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98: 469 Bus error $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
After analyzing, it turns out that in the current implementation,
when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as
the code __remove_memory() comes before drmem_update_dt().
So in kdump kernel, when read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it
can be observed "Bus error"

From a viewpoint of listener and publisher, the publisher notifies the
listener before data is ready. This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
updating. And in capture kernel, makedumpfile will access the memory based
on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.

* Fix * This bug is introduced by commit 063b8b1251fd ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR request"), which tried to combine all the dt updating into one. To fix this issue, meanwhile not to introduce a quadratic runtime complexity by the model: dlpar_memory_add_by_count for_each_drmem_lmb <-- dlpar_add_lmb drmem_update_dt(_v1|_v2) for_each_drmem_lmb <-- The dt should still be only updated once, and just before the last memory online/offline event is ejected to user space. Achieve this by tracing the num of lmb added or removed. Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Hari Bathini Cc: Nathan Lynch Cc: Nathan Fontenot Cc: ke...@lists.infradead.org To: linuxppc-dev@lists.ozlabs.org --- v3 -> v4: resolve a quadratic runtime complexity issue. This series is applied on next-test branch arch/powerpc/platforms/pseries/hotplug-memory.c | 88 ++--- 1 file changed, 66 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 1a3ac3b..e07d5b1 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -350,13 +350,13 @@ static bool lmb_is_removable(struct drmem_lmb *lmb) return true; } -static int dlpar_add_lmb(struct drmem_lmb *); +static int dlpar_add_lmb(struct drmem_lmb *lmb, bool dt_update); -static int dlpar_remove_lmb(struct drmem_lmb *lmb) +static int dlpar_remove_lmb(struct drmem_lmb *lmb, bool dt_update) { unsigned long block_sz; phys_addr_t base_addr; - int rc, nid; + int rc, ret, nid; if (!lmb_is_removable(lmb)) return -EINVAL; @@ -372,6 +372,11 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + if (dt_update) { + ret = drmem_update_dt(); + if (ret) + pr_warn("%s fail to update dt, but continue\n", __func__); + } __remove_memory(nid, base_addr, block_sz); @@ -387,6 +392,7 @@ 
static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) int lmbs_removed = 0; int lmbs_available = 0; int rc; + bool dt_update = false; pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove); @@ -409,7 +415,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove) } for_each_drmem_lmb(lmb) { - rc = dlpar_remove_
[PATCHv4 1/2] powerpc/pseries: group lmb operation and memblock's
This patch prepares for the incoming patch which swaps the order of KOBJ_ADD/REMOVE uevent and dt's updating. The dt updating should come after lmb operations, and before __remove_memory()/__add_memory(). Accordingly, grouping all lmb operations before the memblock's. Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Hari Bathini Cc: Nathan Lynch Cc: Nathan Fontenot Cc: ke...@lists.infradead.org To: linuxppc-dev@lists.ozlabs.org --- v3 -> v4: improve commit log arch/powerpc/platforms/pseries/hotplug-memory.c | 26 - 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 5d545b7..1a3ac3b 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *); static int dlpar_remove_lmb(struct drmem_lmb *lmb) { unsigned long block_sz; - int rc; + phys_addr_t base_addr; + int rc, nid; if (!lmb_is_removable(lmb)) return -EINVAL; @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) if (rc) return rc; + base_addr = lmb->base_addr; + nid = lmb->nid; block_sz = pseries_memory_block_size(); - __remove_memory(lmb->nid, lmb->base_addr, block_sz); - - /* Update memory regions for memory remove */ - memblock_remove(lmb->base_addr, block_sz); - invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + __remove_memory(nid, base_addr, block_sz); + + /* Update memory regions for memory remove */ + memblock_remove(base_addr, block_sz); + return 0; } @@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) } lmb_set_nid(lmb); + lmb->flags |= DRCONF_MEM_ASSIGNED; + block_sz = memory_block_size_bytes(); /* Add the memory */ @@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) rc = dlpar_online_lmb(lmb); if (rc) { - __remove_memory(lmb->nid, lmb->base_addr, block_sz); + int nid = 
lmb->nid; + phys_addr_t base_addr = lmb->base_addr; + invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); - } else { - lmb->flags |= DRCONF_MEM_ASSIGNED; + lmb->flags &= ~DRCONF_MEM_ASSIGNED; + + __remove_memory(nid, base_addr, block_sz); } return rc; -- 2.7.5
Re: [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's
On Thu, Jul 23, 2020 at 10:41 PM Nathan Lynch wrote:
>
> Pingfan Liu writes:
> > This patch prepares for the incoming patch which swaps the order of KOBJ_
> > uevent and dt's updating.
> >
> > It has no functional effect, just groups lmb operation and memblock's in
> > order to insert dt updating operation easily, and makes it easier to
> > review.
>
> ...
>
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 5d545b7..1a3ac3b 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
> >  static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >  {
> >  	unsigned long block_sz;
> > -	int rc;
> > +	phys_addr_t base_addr;
> > +	int rc, nid;
> >
> >  	if (!lmb_is_removable(lmb))
> >  		return -EINVAL;
> > @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >  	if (rc)
> >  		return rc;
> >
> > +	base_addr = lmb->base_addr;
> > +	nid = lmb->nid;
> >  	block_sz = pseries_memory_block_size();
> >
> > -	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
> > -
> > -	/* Update memory regions for memory remove */
> > -	memblock_remove(lmb->base_addr, block_sz);
> > -
> >  	invalidate_lmb_associativity_index(lmb);
> >  	lmb_clear_nid(lmb);
> >  	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> >
> > +	__remove_memory(nid, base_addr, block_sz);
> > +
> > +	/* Update memory regions for memory remove */
> > +	memblock_remove(base_addr, block_sz);
> > +
> >  	return 0;
> >  }
>
> I don't understand; the commit message should not claim this has no
> functional effect when it changes the order of operations like
> this. Maybe this is an improvement over the current behavior, but it's
> not explained why it would be.

One group of functions, whose names contain "lmb", is powerpc specific and is
used to form the dt. The other group, __remove_memory() and memblock_remove(),
is integrated with the Linux mm. And [2/2] arranges the dt update just before
__remove_memory().

Thanks,
Pingfan
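The reordering discussed above only works because nid and base_addr are copied out before the lmb bookkeeping runs. A minimal userspace sketch of that pattern follows, under loud assumptions: struct fake_lmb, FAKE_MEM_ASSIGNED, fake_remove_memory() and fake_remove_lmb() are all hypothetical stand-ins, and resetting nid to -1 is an assumed model of what lmb_clear_nid() does.

```c
#include <assert.h>

/* Hypothetical stand-in for struct drmem_lmb; only the fields used here. */
struct fake_lmb {
	unsigned long long base_addr;
	int nid;
	unsigned int flags;
};

#define FAKE_MEM_ASSIGNED 0x8u	/* placeholder value, not the real flag */

/* Records what the mm-side removal was asked to remove. */
struct removal_record {
	int nid;
	unsigned long long base_addr;
};

/* Stand-in for __remove_memory() followed by memblock_remove(). */
void fake_remove_memory(struct removal_record *rec, int nid,
			unsigned long long base_addr)
{
	rec->nid = nid;
	rec->base_addr = base_addr;
}

void fake_remove_lmb(struct fake_lmb *lmb, struct removal_record *rec)
{
	/* Cache the values first: the lmb bookkeeping below resets nid. */
	unsigned long long base_addr = lmb->base_addr;
	int nid = lmb->nid;

	/* Group 1: lmb bookkeeping, the data the dt is formed from. */
	lmb->nid = -1;	/* assumed effect of lmb_clear_nid() */
	lmb->flags &= ~FAKE_MEM_ASSIGNED;

	/* Per the thread, patch 2/2 places the dt update at this point. */

	/* Group 2: mm-side removal, using the cached values. */
	fake_remove_memory(rec, nid, base_addr);
}
```

With the cached copies, the mm-side call still receives the original node id and base address even though the lmb entry has already been cleared.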
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch wrote:
>
> Pingfan Liu writes:
> > A bug is observed on pseries by taking the following steps on rhel:
> > -1. drmgr -c mem -r -q 5
> > -2. echo c > /proc/sysrq-trigger
> >
> > And then, the failure looks like:
> > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > kdump: saving vmcore-dmesg.txt
> > kdump: saving vmcore-dmesg.txt complete
> > kdump: saving vmcore
> >  Checking for memory holes : [ 0.0 %] /
> >  Checking for memory holes : [100.0 %] |
> >  Excluding unnecessary pages : [100.0 %] \
> >  Copying data : [ 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > [ 44.338548] Core dump to |/bin/false pipe failed
> > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > kdump: saving vmcore failed
> >
> > * Root cause *
> > After analyzing, it turns out that in the current implementation,
> > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as
> > the code __remove_memory() comes before drmem_update_dt().
> > So in kdump kernel, when read_from_oldmem() resorts to
> > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
> > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it
> > can be observed "Bus error"
> >
> > From a viewpoint of listener and publisher, the publisher notifies the
> > listener before data is ready. This introduces a problem where udev
> > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > updating. And in capture kernel, makedumpfile will access the memory based
> > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.
> >
> > * Fix *
> > In order to fix this issue, update dt before __remove_memory(), and
> > accordingly the same rule in hot-add path.
> >
> > This will introduce extra dt updating payload for each involved lmb when hotplug.
> > But it should be fine since drmem_update_dt() is memory based operation and
> > hotplug is not a hot path.
>
> This is great analysis but the performance implications of the change
> are grave. The add/remove paths here are already O(n) where n is the
> quantity of memory assigned to the LP, this change would make it O(n^2):
>
> dlpar_memory_add_by_count
>   for_each_drmem_lmb             <--
>     dlpar_add_lmb
>       drmem_update_dt(_v1|_v2)
>         for_each_drmem_lmb       <--
>
> Memory add/remove isn't a hot path but quadratic runtime complexity
> isn't acceptable. Its current performance is bad enough that I have

Yes, the quadratic runtime complexity sounds terrible.
And I am curious about the bug. Does the system have thousands of lmbs?

> internal bugs open on it.
>
> Not to mention we leak memory every time drmem_update_dt is called
> because we can't safely free device tree properties :-(

Do you know what blocks us from freeing them?

> Also note that this sort of reverts (fixes?) 063b8b1251fd
> ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> request").

Yes. And now, I think I need to come up with another method to fix it.

Thanks,
Pingfan
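To put rough numbers on the O(n) versus O(n^2) point, here is a small userspace sketch that only counts LMB visits. It is an illustration under stated assumptions, not a measurement of the kernel: dt_update_visits(), visits_update_per_lmb() and visits_update_once() are hypothetical names, every add is assumed to succeed, and a dt rebuild is modelled as touching each LMB exactly once.

```c
#include <assert.h>

/* Models drmem_update_dt(): assumed cost is one visit per LMB in the system. */
long dt_update_visits(long nr_lmbs)
{
	return nr_lmbs;
}

/* v3 approach: rebuild the dt inside the loop, once per added LMB. */
long visits_update_per_lmb(long nr_lmbs, long lmbs_to_add)
{
	long visits = 0;
	long added = 0;

	for (long i = 0; i < nr_lmbs && added < lmbs_to_add; i++) {
		visits++;				/* outer for_each_drmem_lmb step */
		visits += dt_update_visits(nr_lmbs);	/* inner for_each_drmem_lmb walk */
		added++;
	}
	return visits;
}

/* Rebuild the dt a single time for the whole request. */
long visits_update_once(long nr_lmbs, long lmbs_to_add)
{
	long visits = 0;
	long added = 0;

	for (long i = 0; i < nr_lmbs && added < lmbs_to_add; i++) {
		visits++;				/* outer for_each_drmem_lmb step */
		added++;
	}
	if (added > 0)
		visits += dt_update_visits(nr_lmbs);	/* single walk at the end */
	return visits;
}
```

Under this model, adding 1000 LMBs on a system with 1000 LMBs costs 1,001,000 visits when the dt is rebuilt per LMB and 2,000 visits when it is rebuilt once, while a single-LMB request costs the same either way.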
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
On Wed, Jul 22, 2020 at 12:57 PM Michael Ellerman wrote:
>
> Pingfan Liu writes:
> > A bug is observed on pseries by taking the following steps on rhel:
>                                                                   ^
>                                                                   RHEL
>
> I assume it happens on mainline too?

Yes, it does.

> [...]
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 1a3ac3b..def8cb3f 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -372,6 +372,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >  	invalidate_lmb_associativity_index(lmb);
> >  	lmb_clear_nid(lmb);
> >  	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > +	drmem_update_dt();
>
> No error checking?

Hmm, this needs a more careful design. Please see the comment at the end.

> >  	__remove_memory(nid, base_addr, block_sz);
> >
> > @@ -607,6 +608,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >
> >  	lmb_set_nid(lmb);
> >  	lmb->flags |= DRCONF_MEM_ASSIGNED;
> > +	drmem_update_dt();
>
> And here ..
>
> >  	block_sz = memory_block_size_bytes();
> >
> > @@ -625,6 +627,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >  		invalidate_lmb_associativity_index(lmb);
> >  		lmb_clear_nid(lmb);
> >  		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > +		drmem_update_dt();
>
> And here ..
>
> >  		__remove_memory(nid, base_addr, block_sz);
> >  	}
> > @@ -877,9 +880,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
> >  		break;
> >  	}
> >
> > -	if (!rc)
> > -		rc = drmem_update_dt();
> > -
> >  	unlock_device_hotplug();
> >  	return rc;
>
> Whereas previously we did check it.

drmem_update_dt() fails if and only if a memory allocation fails. And in
that case, even the original code does not roll back the effect of
__add_memory()/__remove_memory().

I plan to do the following in V4: if drmem_update_dt() fails in
dlpar_add_lmb(), bail out immediately.

Thanks,
Pingfan
[PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [ 0.0 %] /
 Checking for memory holes : [100.0 %] |
 Excluding unnecessary pages : [100.0 %] \
 Copying data : [ 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[ 44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98: 469 Bus error $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
After analyzing, it turns out that in the current implementation,
when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as
the code __remove_memory() comes before drmem_update_dt().
So in kdump kernel, when read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it
can be observed "Bus error"

From a viewpoint of listener and publisher, the publisher notifies the
listener before data is ready. This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
updating. And in capture kernel, makedumpfile will access the memory based
on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.

* Fix * In order to fix this issue, update dt before __remove_memory(), and accordingly the same rule in hot-add path. This will introduce extra dt updating payload for each involved lmb when hotplug. But it should be fine since drmem_update_dt() is memory based operation and hotplug is not a hot path. Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Hari Bathini Cc: Nathan Lynch To: linuxppc-dev@lists.ozlabs.org Cc: ke...@lists.infradead.org --- v2 -> v3: rebase onto ppc next-test branch --- arch/powerpc/platforms/pseries/hotplug-memory.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 1a3ac3b..def8cb3f 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -372,6 +372,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + drmem_update_dt(); __remove_memory(nid, base_addr, block_sz); @@ -607,6 +608,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) lmb_set_nid(lmb); lmb->flags |= DRCONF_MEM_ASSIGNED; + drmem_update_dt(); block_sz = memory_block_size_bytes(); @@ -625,6 +627,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + drmem_update_dt(); __remove_memory(nid, base_addr, block_sz); } @@ -877,9 +880,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) break; } - if (!rc) - rc = drmem_update_dt(); - unlock_device_hotplug(); return rc; } -- 2.7.5
[PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's
This patch prepares for the incoming patch which swaps the order of KOBJ_ uevent and dt's updating. It has no functional effect, just groups lmb operation and memblock's in order to insert dt updating operation easily, and makes it easier to review. Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Hari Bathini Cc: Nathan Lynch To: linuxppc-dev@lists.ozlabs.org Cc: ke...@lists.infradead.org --- v2 -> v3: rebase onto ppc next-test branch --- arch/powerpc/platforms/pseries/hotplug-memory.c | 26 - 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 5d545b7..1a3ac3b 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *); static int dlpar_remove_lmb(struct drmem_lmb *lmb) { unsigned long block_sz; - int rc; + phys_addr_t base_addr; + int rc, nid; if (!lmb_is_removable(lmb)) return -EINVAL; @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) if (rc) return rc; + base_addr = lmb->base_addr; + nid = lmb->nid; block_sz = pseries_memory_block_size(); - __remove_memory(lmb->nid, lmb->base_addr, block_sz); - - /* Update memory regions for memory remove */ - memblock_remove(lmb->base_addr, block_sz); - invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + __remove_memory(nid, base_addr, block_sz); + + /* Update memory regions for memory remove */ + memblock_remove(base_addr, block_sz); + return 0; } @@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) } lmb_set_nid(lmb); + lmb->flags |= DRCONF_MEM_ASSIGNED; + block_sz = memory_block_size_bytes(); /* Add the memory */ @@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) rc = dlpar_online_lmb(lmb); if (rc) { - __remove_memory(lmb->nid, lmb->base_addr, block_sz); + int nid = lmb->nid; + phys_addr_t 
base_addr = lmb->base_addr; + invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); - } else { - lmb->flags |= DRCONF_MEM_ASSIGNED; + lmb->flags &= ~DRCONF_MEM_ASSIGNED; + + __remove_memory(nid, base_addr, block_sz); } return rc; -- 2.7.5
[PATCHv2 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [ 0.0 %] /
 Checking for memory holes : [100.0 %] |
 Excluding unnecessary pages : [100.0 %] \
 Copying data : [ 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[ 44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98: 469 Bus error $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
After analyzing, it turns out that in the current implementation,
when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating as
the code __remove_memory() comes before drmem_update_dt().
So in kdump kernel, when read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it
can be observed "Bus error"

From a viewpoint of listener and publisher, the publisher notifies the
listener before data is ready. This introduces a problem where udev
launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
updating. And in capture kernel, makedumpfile will access the memory based
on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.

* Fix * In order to fix this issue, update dt before __remove_memory(), and accordingly the same rule in hot-add path. This will introduce extra dt updating payload for each involved lmb when hotplug. But it should be fine since drmem_update_dt() is memory based operation and hotplug is not a hot path. Signed-off-by: Pingfan Liu Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Hari Bathini Cc: Leonardo Bras Cc: Libor Pechacek Cc: Nathan Fontenot To: linuxppc-dev@lists.ozlabs.org Cc: ke...@lists.infradead.org --- v1 -> v2: improve commit, and more detail about the SIGBUG failure path arch/powerpc/platforms/pseries/hotplug-memory.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 4bd9004..72cd4a5 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -394,6 +394,9 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb) invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + rtas_hp_event = true; + drmem_update_dt(); + rtas_hp_event = false; __remove_memory(nid, base_addr, block_sz); @@ -667,6 +670,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) lmb_set_nid(lmb); lmb->flags |= DRCONF_MEM_ASSIGNED; + rtas_hp_event = true; + drmem_update_dt(); + rtas_hp_event = false; block_sz = memory_block_size_bytes(); @@ -685,6 +691,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) invalidate_lmb_associativity_index(lmb); lmb_clear_nid(lmb); lmb->flags &= ~DRCONF_MEM_ASSIGNED; + rtas_hp_event = true; + drmem_update_dt(); + rtas_hp_event = false; __remove_memory(nid, base_addr, block_sz); } @@ -941,12 +950,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) break; } - if (!rc) { - rtas_hp_event = true; - rc = drmem_update_dt(); - rtas_hp_event = false; - } - unlock_device_hotplug(); return rc; } -- 2.7.5
[PATCHv2 1/2] powerpc/pseries: group lmb operation and memblock's
This patch prepares for the incoming patch, which swaps the order of the
KOBJ_ uevent and the dt update.

It has no functional effect. It only groups the lmb operations and the
memblock operations, so that the dt update can be inserted easily and the
change is easier to review.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Hari Bathini
Cc: Leonardo Bras
Cc: Libor Pechacek
Cc: Nathan Fontenot
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index b2cde17..4bd9004 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -377,7 +377,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
-	int rc;
+	phys_addr_t base_addr;
+	int rc, nid;

 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
@@ -386,17 +387,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	if (rc)
 		return rc;

+	base_addr = lmb->base_addr;
+	nid = lmb->nid;
 	block_sz = pseries_memory_block_size();

-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-	/* Update memory regions for memory remove */
-	memblock_remove(lmb->base_addr, block_sz);
-
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;

+	__remove_memory(nid, base_addr, block_sz);
+
+	/* Update memory regions for memory remove */
+	memblock_remove(base_addr, block_sz);
+
 	return 0;
 }

@@ -663,6 +666,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	}

 	lmb_set_nid(lmb);
+	lmb->flags |= DRCONF_MEM_ASSIGNED;
+
 	block_sz = memory_block_size_bytes();

 	/* Add the memory */
@@ -674,11 +679,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)

 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+		int nid = lmb->nid;
+		phys_addr_t base_addr = lmb->base_addr;
+
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
-	} else {
-		lmb->flags |= DRCONF_MEM_ASSIGNED;
+		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+		__remove_memory(nid, base_addr, block_sz);
 	}

 	return rc;
--
2.7.5
[PATCHv4] powerpc/crashkernel: take "mem=" option into account
The "mem=" option is an easy way to put high pressure on memory during
testing. Hence, after the memory limit is applied, the actual usable memory,
rather than the total memory, should be considered when reserving memory for
the crashkernel. Otherwise boot may run into OOM.

E.g. when passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G" and
mem=5G on a 256G machine, 4G is reserved prior to this change and 512M
afterward.

This issue is powerpc specific because powerpc gives the fadump and kdump
reservations a higher priority than "mem=". See the following code:
    if (fadump_reserve_mem() == 0)
        reserve_crashkernel();
    ...
    /* Ensure that total memory size is page-aligned. */
    limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
    memblock_enforce_memory_limit(limit);

On other arches, "mem=" takes a higher priority and is already reflected in
memblock_phys_mem_size() before reserve_crashkernel() is called.

Signed-off-by: Pingfan Liu
To: linuxppc-dev@lists.ozlabs.org
Cc: Hari Bathini
Cc: Michael Ellerman
Cc: ke...@lists.infradead.org
---
v3 -> v4: fix total_mem_sz based on adjusted memory_limit

 arch/powerpc/kexec/core.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 078fe3d..56da5eb 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -115,11 +115,12 @@ void machine_kexec(struct kimage *image)

 void __init reserve_crashkernel(void)
 {
-	unsigned long long crash_size, crash_base;
+	unsigned long long crash_size, crash_base, total_mem_sz;
 	int ret;

+	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
 	/* use common parsing */
-	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+	ret = parse_crashkernel(boot_command_line, total_mem_sz,
 			&crash_size, &crash_base);
 	if (ret == 0 && crash_size > 0) {
 		crashk_res.start = crash_base;
@@ -178,6 +179,7 @@ void __init reserve_crashkernel(void)
 	/* Crash kernel trumps memory limit */
 	if (memory_limit && memory_limit <= crashk_res.end) {
 		memory_limit = crashk_res.end + 1;
+		total_mem_sz = memory_limit;
 		printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
 		       memory_limit);
 	}
@@ -186,7 +188,7 @@ void __init reserve_crashkernel(void)
 			"for crashkernel (System RAM: %ldMB)\n",
 			(unsigned long)(crash_size >> 20),
 			(unsigned long)(crashk_res.start >> 20),
-			(unsigned long)(memblock_phys_mem_size() >> 20));
+			(unsigned long)(total_mem_sz >> 20));

 	if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
 	    memblock_reserve(crashk_res.start, crash_size)) {
--
2.7.5
Re: [PATCHv3 2/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel
On Mon, Mar 16, 2020 at 10:53 AM Aneesh Kumar K.V wrote:
>
> On 3/4/20 2:17 PM, Pingfan Liu wrote:
> > At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
> > if dumping to fsdax, it will take a very long time.
> >
>
>
> that should be fixed by
>
> faa6d21153fd11e139dd880044521389b34a24f2
> Author: Aneesh Kumar K.V
> AuthorDate: Tue Sep 3 18:04:52 2019 +0530
> Commit: Michael Ellerman
> CommitDate: Wed Sep 25 08:32:59 2019 +1000
>
>     powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
>
>     Right now we force an unbind of SCM memory at drcindex on H_OVERLAP error.
>     This really slows down operations like kexec where we get the H_OVERLAP
>     error because we don't go through a full hypervisor re init.
>
>     H_OVERLAP error for a H_SCM_BIND_MEM hcall indicates that SCM memory at
>     drc index is already bound. Since we don't specify a logical memory
>     address for bind hcall, we can use the H_SCM_QUERY hcall to query
>     the already bound logical address.

Good to know it.

Thanks,
Pingfan
>
> >
> >
> > Take a closer look, during the papr_scm initialization, the only
> > configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM,
> > ...), which helps to set up the bound address.
> >
> > On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this
> > step can be stepped around to save times. So the pmem bound address can be
> > passed to the 2nd kernel through a dynamic added property "bound-addr" in
> > dt node 'ibm,pmemory'.
> >
>
> -aneesh
>
Re: [PATCHv3 2/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel
Appreciate for your kind review. And I have some comment as below. On Fri, Mar 13, 2020 at 11:18 AM Oliver O'Halloran wrote: > > On Wed, Mar 4, 2020 at 7:50 PM Pingfan Liu wrote: > > > > At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so > > if dumping to fsdax, it will take a very long time. > > > > Take a closer look, during the papr_scm initialization, the only > > configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM, > > ...), which helps to set up the bound address. > > > > On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this > > step can be stepped around to save times. So the pmem bound address can be > > passed to the 2nd kernel through a dynamic added property "bound-addr" in > > dt node 'ibm,pmemory'. > > > > Signed-off-by: Pingfan Liu > > To: linuxppc-dev@lists.ozlabs.org > > Cc: Benjamin Herrenschmidt > > Cc: Paul Mackerras > > Cc: Michael Ellerman > > Cc: Hari Bathini > > Cc: Aneesh Kumar K.V > > Cc: Oliver O'Halloran > > Cc: Dan Williams > > Cc: Andrew Donnellan > > Cc: Christophe Leroy > > Cc: Rob Herring > > Cc: Frank Rowand > > Cc: ke...@lists.infradead.org > > --- > > note: This patch has not been tested since I can not get such a pseries > > with pmem. > > Please kindly to give some suggestion, thanks. > > There was some qemu patches to implement the Hcall interface floating > around a while ago. I'm not sure they ever made it into upstream qemu > though. Unfortunately, it does not appear in latest qemu code. I think probably virt-pmem has achieved the same feature. 
> > > --- > > arch/powerpc/platforms/pseries/of_helpers.c | 1 + > > arch/powerpc/platforms/pseries/papr_scm.c | 33 > > - > > drivers/of/base.c | 1 + > > 3 files changed, 25 insertions(+), 10 deletions(-) > > > > diff --git a/arch/powerpc/platforms/pseries/of_helpers.c > > b/arch/powerpc/platforms/pseries/of_helpers.c > > index 1022e0f..2c7bab4 100644 > > --- a/arch/powerpc/platforms/pseries/of_helpers.c > > +++ b/arch/powerpc/platforms/pseries/of_helpers.c > > @@ -34,6 +34,7 @@ struct property *new_property(const char *name, const int > > length, > > kfree(new); > > return NULL; > > } > > +EXPORT_SYMBOL(new_property); > > > > /** > > * pseries_of_derive_parent - basically like dirname(1) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > > b/arch/powerpc/platforms/pseries/papr_scm.c > > index 0b4467e..54ae903 100644 > > --- a/arch/powerpc/platforms/pseries/papr_scm.c > > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > > @@ -14,6 +14,7 @@ > > #include > > > > #include > > +#include "of_helpers.h" > > > > #define BIND_ANY_ADDR (~0ul) > > > > @@ -383,7 +384,7 @@ static int papr_scm_probe(struct platform_device *pdev) > > { > > struct device_node *dn = pdev->dev.of_node; > > u32 drc_index, metadata_size; > > - u64 blocks, block_size; > > + u64 blocks, block_size, bound_addr = 0; > > struct papr_scm_priv *p; > > const char *uuid_str; > > u64 uuid[2]; > > @@ -440,17 +441,29 @@ static int papr_scm_probe(struct platform_device > > *pdev) > > p->metadata_size = metadata_size; > > p->pdev = pdev; > > > > - /* request the hypervisor to bind this region to somewhere in > > memory */ > > - rc = drc_pmem_bind(p); > > + of_property_read_u64(dn, "bound-addr", _addr); > > + if (bound_addr) { > > + p->bound_addr = bound_addr; > > + } else { > > + struct property *property; > > + u64 big; > > > > - /* If phyp says drc memory still bound then force unbound and retry > > */ > > - if (rc == H_OVERLAP) > > - rc = drc_pmem_query_n_bind(p); > > + /* request the hypervisor to 
bind this region to somewhere > > in memory */ > > + rc = drc_pmem_bind(p); > > > > - if (rc != H_SUCCESS) { > > - dev_err(>pdev->dev, "bind err: %d\n", rc); > > - rc = -ENXIO; > > - goto err; > > + /* If phyp says drc memory still bound then force unbound > > and retry */ > > + if (rc == H_OVERLAP) > > +
Re: [PATCHv3 1/2] powerpc/of: split out new_property() for reusing
On Sat, Mar 7, 2020 at 3:59 AM Nathan Lynch wrote:
>
> Hi,
>
> Pingfan Liu writes:
> > Splitting out new_property() for coming reusing and moving it to
> > of_helpers.c.
>
> [...]
>
> > +struct property *new_property(const char *name, const int length,
> > +                              const unsigned char *value, struct property *last)
> > +{
> > +       struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
> > +
> > +       if (!new)
> > +               return NULL;
> > +
> > +       new->name = kstrdup(name, GFP_KERNEL);
> > +       if (!new->name)
> > +               goto cleanup;
> > +       new->value = kmalloc(length + 1, GFP_KERNEL);
> > +       if (!new->value)
> > +               goto cleanup;
> > +
> > +       memcpy(new->value, value, length);
> > +       *(((char *)new->value) + length) = 0;
> > +       new->length = length;
> > +       new->next = last;
> > +       return new;
> > +
> > +cleanup:
> > +       kfree(new->name);
> > +       kfree(new->value);
> > +       kfree(new);
> > +       return NULL;
> > +}
>
> This function in its current form isn't suitable for more general use:
>
> * It appears to be tailored to string properties - note the char * value
>   parameter, the length + 1 allocation and nul termination.
>
> * Most code shouldn't need the 'last' argument. The code where this
>   currently resides builds a list of properties and attaches it to a new
>   node, bypassing of_add_property().
>
> Let's look at the call site you add in your next patch:
>
> +               big = cpu_to_be64(p->bound_addr);
> +               property = new_property("bound-addr", sizeof(u64), (const unsigned char *)&big,
> +                       NULL);
> +               of_add_property(dn, property);
>
> So you have to use a cast, and this is going to allocate (sizeof(u64) + 1)
> for the value, is that what you want?
>
> I think you should leave that legacy pseries reconfig code undisturbed
> (frankly that stuff should get deprecated and removed) and if you want a
> generic helper it should look more like:
>
> struct property *of_property_new(const char *name, size_t length,
>                                  const void *value, gfp_t allocflags)
>
> __of_prop_dup() looks like a good model/guide here.

Thanks for your good suggestion. I will re-code based on your suggestion, if [2/2] turns out acceptable.

Regards,
Pingfan
[PATCHv3 2/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel
At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
dumping to fsdax takes a very long time as well.

Looking closer, the only configuration done during papr_scm initialization
is drc_pmem_bind() -> plpar_hcall(H_SCM_BIND_MEM, ...), which sets up the
bound address.

On pseries, there is no hardware reset for a kexec -l/-p kernel, so this
step can be skipped to save time. The pmem bound address can be passed to
the 2nd kernel through a dynamically added property "bound-addr" in the dt
node 'ibm,pmemory'.

Signed-off-by: Pingfan Liu
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: Hari Bathini
Cc: Aneesh Kumar K.V
Cc: Oliver O'Halloran
Cc: Dan Williams
Cc: Andrew Donnellan
Cc: Christophe Leroy
Cc: Rob Herring
Cc: Frank Rowand
Cc: ke...@lists.infradead.org
---
note: This patch has not been tested, since I cannot get a pseries machine with pmem.
Suggestions are welcome, thanks.
---
 arch/powerpc/platforms/pseries/of_helpers.c | 1 +
 arch/powerpc/platforms/pseries/papr_scm.c | 33 -
 drivers/of/base.c | 1 +
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c
index 1022e0f..2c7bab4 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -34,6 +34,7 @@ struct property *new_property(const char *name, const int length,
 	kfree(new);
 	return NULL;
 }
+EXPORT_SYMBOL(new_property);

 /**
  * pseries_of_derive_parent - basically like dirname(1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e..54ae903 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -14,6 +14,7 @@
 #include

 #include
+#include "of_helpers.h"

 #define BIND_ANY_ADDR (~0ul)

@@ -383,7 +384,7 @@ static int papr_scm_probe(struct platform_device *pdev)
 {
 	struct device_node *dn = pdev->dev.of_node;
 	u32 drc_index, metadata_size;
-	u64 blocks, block_size;
+	u64 blocks, block_size, bound_addr = 0;
 	struct papr_scm_priv *p;
 	const char *uuid_str;
 	u64 uuid[2];
@@ -440,17 +441,29 @@ static int papr_scm_probe(struct platform_device *pdev)
 	p->metadata_size = metadata_size;
 	p->pdev = pdev;

-	/* request the hypervisor to bind this region to somewhere in memory */
-	rc = drc_pmem_bind(p);
+	of_property_read_u64(dn, "bound-addr", &bound_addr);
+	if (bound_addr) {
+		p->bound_addr = bound_addr;
+	} else {
+		struct property *property;
+		u64 big;

-	/* If phyp says drc memory still bound then force unbound and retry */
-	if (rc == H_OVERLAP)
-		rc = drc_pmem_query_n_bind(p);
+		/* request the hypervisor to bind this region to somewhere in memory */
+		rc = drc_pmem_bind(p);

-	if (rc != H_SUCCESS) {
-		dev_err(&p->pdev->dev, "bind err: %d\n", rc);
-		rc = -ENXIO;
-		goto err;
+		/* If phyp says drc memory still bound then force unbound and retry */
+		if (rc == H_OVERLAP)
+			rc = drc_pmem_query_n_bind(p);
+
+		if (rc != H_SUCCESS) {
+			dev_err(&p->pdev->dev, "bind err: %d\n", rc);
+			rc = -ENXIO;
+			goto err;
+		}
+		big = cpu_to_be64(p->bound_addr);
+		property = new_property("bound-addr", sizeof(u64), (const unsigned char *)&big,
+			NULL);
+		of_add_property(dn, property);
 	}

 	/* setup the resource for the newly bound range */
diff --git a/drivers/of/base.c b/drivers/of/base.c
index ae03b12..602d2a9 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1817,6 +1817,7 @@ int of_add_property(struct device_node *np, struct property *prop)

 	return rc;
 }
+EXPORT_SYMBOL_GPL(of_add_property);

 int __of_remove_property(struct device_node *np, struct property *prop)
 {
--
2.7.5
[PATCHv3 1/2] powerpc/of: split out new_property() for reusing
Splitting out new_property() for coming reusing and moving it to of_helpers.c. Also do some coding style cleanup. Signed-off-by: Pingfan Liu To: linuxppc-dev@lists.ozlabs.org Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Hari Bathini Cc: Aneesh Kumar K.V Cc: Oliver O'Halloran Cc: Dan Williams Cc: Andrew Donnellan Cc: Christophe Leroy Cc: Rob Herring Cc: Frank Rowand Cc: ke...@lists.infradead.org --- arch/powerpc/platforms/pseries/of_helpers.c | 28 arch/powerpc/platforms/pseries/of_helpers.h | 3 +++ arch/powerpc/platforms/pseries/reconfig.c | 26 -- 3 files changed, 31 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c index 66dfd82..1022e0f 100644 --- a/arch/powerpc/platforms/pseries/of_helpers.c +++ b/arch/powerpc/platforms/pseries/of_helpers.c @@ -7,6 +7,34 @@ #include "of_helpers.h" +struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last) +{ + struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); + + if (!new) + return NULL; + + new->name = kstrdup(name, GFP_KERNEL); + if (!new->name) + goto cleanup; + new->value = kmalloc(length + 1, GFP_KERNEL); + if (!new->value) + goto cleanup; + + memcpy(new->value, value, length); + *(((char *)new->value) + length) = 0; + new->length = length; + new->next = last; + return new; + +cleanup: + kfree(new->name); + kfree(new->value); + kfree(new); + return NULL; +} + /** * pseries_of_derive_parent - basically like dirname(1) * @path: the full_name of a node to be added to the tree diff --git a/arch/powerpc/platforms/pseries/of_helpers.h b/arch/powerpc/platforms/pseries/of_helpers.h index decad65..34add82 100644 --- a/arch/powerpc/platforms/pseries/of_helpers.h +++ b/arch/powerpc/platforms/pseries/of_helpers.h @@ -4,6 +4,9 @@ #include +struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last); + 
struct device_node *pseries_of_derive_parent(const char *path); #endif /* _PSERIES_OF_HELPERS_H */ diff --git a/arch/powerpc/platforms/pseries/reconfig.c b/arch/powerpc/platforms/pseries/reconfig.c index 7f7369f..8e5a2ba 100644 --- a/arch/powerpc/platforms/pseries/reconfig.c +++ b/arch/powerpc/platforms/pseries/reconfig.c @@ -165,32 +165,6 @@ static char * parse_next_property(char *buf, char *end, char **name, int *length return tmp; } -static struct property *new_property(const char *name, const int length, -const unsigned char *value, struct property *last) -{ - struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); - - if (!new) - return NULL; - - if (!(new->name = kstrdup(name, GFP_KERNEL))) - goto cleanup; - if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) - goto cleanup; - - memcpy(new->value, value, length); - *(((char *)new->value) + length) = 0; - new->length = length; - new->next = last; - return new; - -cleanup: - kfree(new->name); - kfree(new->value); - kfree(new); - return NULL; -} - static int do_add_node(char *buf, size_t bufsize) { char *path, *end, *name; -- 2.7.5
[PATCHv3 0/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel
V2 -> V3: in [2/2], EXPORT_SYMBOL(new_property) and EXPORT_SYMBOL_GPL(of_add_property)

To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: Hari Bathini
Cc: Aneesh Kumar K.V
Cc: Oliver O'Halloran
Cc: Dan Williams
Cc: Andrew Donnellan
Cc: Christophe Leroy
Cc: Rob Herring
Cc: Frank Rowand
Cc: ke...@lists.infradead.org

Pingfan Liu (2):
  powerpc/of: split out new_property() for reusing
  pseries/scm: buffer pmem's bound addr in dt for kexec kernel

 arch/powerpc/platforms/pseries/of_helpers.c | 29 +
 arch/powerpc/platforms/pseries/of_helpers.h | 3 +++
 arch/powerpc/platforms/pseries/papr_scm.c | 33 -
 arch/powerpc/platforms/pseries/reconfig.c | 26 ---
 drivers/of/base.c | 1 +
 5 files changed, 56 insertions(+), 36 deletions(-)

--
2.7.5
[PATCHv2 2/2] pSeries/papr_scm: buffer pmem's bound addr in dt for kexec kernel
At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so if dumping to fsdax, it will take a very long time. Take a closer look, during the papr_scm initialization, the only configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM, ...), which helps to set up the bound address. On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this step can be stepped around to save times. So the pmem bound address can be passed to the 2nd kernel through a dynamic added property "bound-addr" in dt node 'ibm,pmemory'. Signed-off-by: Pingfan Liu To: linuxppc-dev@lists.ozlabs.org Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Hari Bathini Cc: Aneesh Kumar K.V Cc: Oliver O'Halloran Cc: Dan Williams Cc: Andrew Donnellan Cc: Christophe Leroy Cc: ke...@lists.infradead.org --- note: This patch has not been tested since I can not get such a pseries with pmem. Please kindly to give some suggestion, thanks. arch/powerpc/platforms/pseries/papr_scm.c | 32 +-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 0b4467e..40cd214 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -14,6 +14,7 @@ #include #include +#include "of_helpers.h" #define BIND_ANY_ADDR (~0ul) @@ -383,7 +384,7 @@ static int papr_scm_probe(struct platform_device *pdev) { struct device_node *dn = pdev->dev.of_node; u32 drc_index, metadata_size; - u64 blocks, block_size; + u64 blocks, block_size, bound_addr = 0; struct papr_scm_priv *p; const char *uuid_str; u64 uuid[2]; @@ -440,17 +441,28 @@ static int papr_scm_probe(struct platform_device *pdev) p->metadata_size = metadata_size; p->pdev = pdev; - /* request the hypervisor to bind this region to somewhere in memory */ - rc = drc_pmem_bind(p); + of_property_read_u64(dn, "bound-addr", _addr); + if (bound_addr) { + p->bound_addr = bound_addr; + } else { 
+ struct property *property; + u64 big; - /* If phyp says drc memory still bound then force unbound and retry */ - if (rc == H_OVERLAP) - rc = drc_pmem_query_n_bind(p); + /* request the hypervisor to bind this region to somewhere in memory */ + rc = drc_pmem_bind(p); - if (rc != H_SUCCESS) { - dev_err(>pdev->dev, "bind err: %d\n", rc); - rc = -ENXIO; - goto err; + /* If phyp says drc memory still bound then force unbound and retry */ + if (rc == H_OVERLAP) + rc = drc_pmem_query_n_bind(p); + + if (rc != H_SUCCESS) { + dev_err(>pdev->dev, "bind err: %d\n", rc); + rc = -ENXIO; + goto err; + } + big = cpu_to_be64(p->bound_addr); + property = new_property("bound-addr", sizeof(u64), , NULL); + of_add_property(dn, property); } /* setup the resource for the newly bound range */ -- 2.7.5
[PATCHv2 1/2] powerpc/of: split out new_property() for reusing
Splitting out new_property() for coming reusing and moving it to of_helpers.c. Also do some coding style cleanup. Signed-off-by: Pingfan Liu To: linuxppc-dev@lists.ozlabs.org Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Hari Bathini Cc: Aneesh Kumar K.V Cc: Oliver O'Halloran Cc: Dan Williams Cc: Andrew Donnellan Cc: Christophe Leroy Cc: ke...@lists.infradead.org --- arch/powerpc/platforms/pseries/of_helpers.c | 28 arch/powerpc/platforms/pseries/of_helpers.h | 3 +++ arch/powerpc/platforms/pseries/reconfig.c | 26 -- 3 files changed, 31 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/platforms/pseries/of_helpers.c b/arch/powerpc/platforms/pseries/of_helpers.c index 66dfd82..1022e0f 100644 --- a/arch/powerpc/platforms/pseries/of_helpers.c +++ b/arch/powerpc/platforms/pseries/of_helpers.c @@ -7,6 +7,34 @@ #include "of_helpers.h" +struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last) +{ + struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); + + if (!new) + return NULL; + + new->name = kstrdup(name, GFP_KERNEL); + if (!new->name) + goto cleanup; + new->value = kmalloc(length + 1, GFP_KERNEL); + if (!new->value) + goto cleanup; + + memcpy(new->value, value, length); + *(((char *)new->value) + length) = 0; + new->length = length; + new->next = last; + return new; + +cleanup: + kfree(new->name); + kfree(new->value); + kfree(new); + return NULL; +} + /** * pseries_of_derive_parent - basically like dirname(1) * @path: the full_name of a node to be added to the tree diff --git a/arch/powerpc/platforms/pseries/of_helpers.h b/arch/powerpc/platforms/pseries/of_helpers.h index decad65..34add82 100644 --- a/arch/powerpc/platforms/pseries/of_helpers.h +++ b/arch/powerpc/platforms/pseries/of_helpers.h @@ -4,6 +4,9 @@ #include +struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last); + struct device_node 
*pseries_of_derive_parent(const char *path); #endif /* _PSERIES_OF_HELPERS_H */ diff --git a/arch/powerpc/platforms/pseries/reconfig.c b/arch/powerpc/platforms/pseries/reconfig.c index 7f7369f..8e5a2ba 100644 --- a/arch/powerpc/platforms/pseries/reconfig.c +++ b/arch/powerpc/platforms/pseries/reconfig.c @@ -165,32 +165,6 @@ static char * parse_next_property(char *buf, char *end, char **name, int *length return tmp; } -static struct property *new_property(const char *name, const int length, -const unsigned char *value, struct property *last) -{ - struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); - - if (!new) - return NULL; - - if (!(new->name = kstrdup(name, GFP_KERNEL))) - goto cleanup; - if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) - goto cleanup; - - memcpy(new->value, value, length); - *(((char *)new->value) + length) = 0; - new->length = length; - new->next = last; - return new; - -cleanup: - kfree(new->name); - kfree(new->value); - kfree(new); - return NULL; -} - static int do_add_node(char *buf, size_t bufsize) { char *path, *end, *name; -- 2.7.5
Re: [PATCH 3/3] pseries/scm: buffer pmem's bound addr in dt for kexec kernel
On Fri, Feb 28, 2020 at 2:52 PM Christophe Leroy wrote:
>
>
>
> On 28/02/2020 at 06:53, Pingfan Liu wrote:
> > At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
> > if dumping to fsdax, it will take a very long time.
> >
> > Take a closer look, during the papr_scm initialization, the only
> > configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM,
> > ...), which helps to set up the bound address.
> >
> > On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this
> > step can be stepped around to save times. So the pmem bound address can be
> > passed to the 2nd kernel through a dynamic added property "bound-addr" in
> > dt node 'ibm,pmemory'.
> >
> > Signed-off-by: Pingfan Liu
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: Benjamin Herrenschmidt
> > Cc: Paul Mackerras
> > Cc: Michael Ellerman
> > Cc: Hari Bathini
> > Cc: Aneesh Kumar K.V
> > Cc: Oliver O'Halloran
> > Cc: Dan Williams
> > Cc: ke...@lists.infradead.org
> > ---
> > note: I can not find such a pseries machine, and not finish it yet.
> > ---
> >   arch/powerpc/platforms/pseries/papr_scm.c | 32
> > +--
> >   1 file changed, 22 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
> > b/arch/powerpc/platforms/pseries/papr_scm.c
> > index c2ef320..555e746 100644
> > --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > @@ -382,7 +382,7 @@ static int papr_scm_probe(struct platform_device *pdev)
> >   {
> >       struct device_node *dn = pdev->dev.of_node;
> >       u32 drc_index, metadata_size;
> > -     u64 blocks, block_size;
> > +     u64 blocks, block_size, bound_addr = 0;
> >       struct papr_scm_priv *p;
> >       const char *uuid_str;
> >       u64 uuid[2];
> > @@ -439,17 +439,29 @@ static int papr_scm_probe(struct platform_device
> > *pdev)
> >       p->metadata_size = metadata_size;
> >       p->pdev = pdev;
> >
> > -     /* request the hypervisor to bind this region to somewhere in memory
> > */
> > -     rc = drc_pmem_bind(p);
> > +     of_property_read_u64(dn, "bound-addr", &bound_addr);
> > +     if (bound_addr)
> > +             p->bound_addr = bound_addr;
> > +     else {
>
> All legs of an if/else must have { } when one leg needs them, see coding
> style.

OK,
>
> > +             struct property *property;
> > +             u64 big;
> >
> > -     /* If phyp says drc memory still bound then force unbound and retry */
> > -     if (rc == H_OVERLAP)
> > -             rc = drc_pmem_query_n_bind(p);
> > +             /* request the hypervisor to bind this region to somewhere in
> > memory */
> > +             rc = drc_pmem_bind(p);
> >
> > -     if (rc != H_SUCCESS) {
> > -             dev_err(&p->pdev->dev, "bind err: %d\n", rc);
> > -             rc = -ENXIO;
> > -             goto err;
> > +             /* If phyp says drc memory still bound then force unbound and
> > retry */
> > +             if (rc == H_OVERLAP)
> > +                     rc = drc_pmem_query_n_bind(p);
> > +
> > +             if (rc != H_SUCCESS) {
> > +                     dev_err(&p->pdev->dev, "bind err: %d\n", rc);
> > +                     rc = -ENXIO;
> > +                     goto err;
> > +             }
> > +             big = cpu_to_be64(p->bound_addr);
> > +             property = new_property("bound-addr", sizeof(u64), &big,
> > +                     NULL);
>
> Why split this line in two parts? You have lines far longer above.
> In powerpc we allow lines 90 chars long.

OK, good to know it.

Thanks,
Pingfan
Re: [PATCH 1/3] powerpc/of: split out new_property() for reusing
On Fri, Feb 28, 2020 at 2:03 PM Andrew Donnellan wrote:
>
> On 28/2/20 4:53 pm, Pingfan Liu wrote:
> > Since new_property() is used in several calling sites, splitting it out for
> > reusing.
> >
> > To ease the review, although the split out part has coding style issue,
> > keeping it untouched and fixed in next patch.
> >
> > Signed-off-by: Pingfan Liu
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: Benjamin Herrenschmidt
> > Cc: Paul Mackerras
> > Cc: Michael Ellerman
> > Cc: Hari Bathini
> > Cc: Aneesh Kumar K.V
> > Cc: Oliver O'Halloran
> > Cc: Dan Williams
> > Cc: ke...@lists.infradead.org
>
> Which tree does this apply to? I don't see a new_property() in mm/drmem.c...

Sorry, my git tree was dirty. I checked, and neither the linux git tree nor
the powerpc git tree has this function.

Nack this series, and I will send out V2 for patch 3/3.

Thanks,
Pingfan
>
> --
> Andrew Donnellan OzLabs, ADL Canberra
> a...@linux.ibm.com IBM Australia Limited
>
[PATCH 3/3] pseries/scm: buffer pmem's bound addr in dt for kexec kernel
At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so dumping to fsdax is very slow as well.

Taking a closer look, during papr_scm initialization the only configuration step is drc_pmem_bind() -> plpar_hcall(H_SCM_BIND_MEM, ...), which sets up the bound address.

On pseries, for a kexec -l/-p kernel, there is no reset of hardware, so this step can be skipped to save time. The pmem bound address can be passed to the 2nd kernel through a dynamically added property "bound-addr" in the dt node 'ibm,pmemory'.

Signed-off-by: Pingfan Liu
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: Hari Bathini
Cc: Aneesh Kumar K.V
Cc: Oliver O'Halloran
Cc: Dan Williams
Cc: ke...@lists.infradead.org
---
note: I can not find such a pseries machine, and have not finished this yet.
---
 arch/powerpc/platforms/pseries/papr_scm.c | 32 +--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index c2ef320..555e746 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -382,7 +382,7 @@ static int papr_scm_probe(struct platform_device *pdev)
 {
 	struct device_node *dn = pdev->dev.of_node;
 	u32 drc_index, metadata_size;
-	u64 blocks, block_size;
+	u64 blocks, block_size, bound_addr = 0;
 	struct papr_scm_priv *p;
 	const char *uuid_str;
 	u64 uuid[2];
@@ -439,17 +439,29 @@ static int papr_scm_probe(struct platform_device *pdev)
 	p->metadata_size = metadata_size;
 	p->pdev = pdev;
 
-	/* request the hypervisor to bind this region to somewhere in memory */
-	rc = drc_pmem_bind(p);
+	of_property_read_u64(dn, "bound-addr", &bound_addr);
+	if (bound_addr)
+		p->bound_addr = bound_addr;
+	else {
+		struct property *property;
+		u64 big;
 
-	/* If phyp says drc memory still bound then force unbound and retry */
-	if (rc == H_OVERLAP)
-		rc = drc_pmem_query_n_bind(p);
+		/* request the hypervisor to bind this region to somewhere in memory */
+		rc = drc_pmem_bind(p);
 
-	if (rc != H_SUCCESS) {
-		dev_err(&p->pdev->dev, "bind err: %d\n", rc);
-		rc = -ENXIO;
-		goto err;
+		/* If phyp says drc memory still bound then force unbound and retry */
+		if (rc == H_OVERLAP)
+			rc = drc_pmem_query_n_bind(p);
+
+		if (rc != H_SUCCESS) {
+			dev_err(&p->pdev->dev, "bind err: %d\n", rc);
+			rc = -ENXIO;
+			goto err;
+		}
+		big = cpu_to_be64(p->bound_addr);
+		property = new_property("bound-addr", sizeof(u64), &big,
+					NULL);
+		of_add_property(dn, property);
 	}
 
 	/* setup the resource for the newly bound range */
--
2.7.5
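The patch stores the bound address with cpu_to_be64() because device tree property values are kept big-endian, and of_property_read_u64() converts back on the reading side. A minimal user-space sketch of that round trip follows; encode_be64() and decode_be64() are hypothetical names chosen for illustration, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store a 64-bit value in big-endian byte order,
 * which is the layout device tree properties use.
 */
static void encode_be64(uint64_t v, unsigned char out[8])
{
	for (size_t i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Hypothetical helper: read the value back, independent of host endianness. */
static uint64_t decode_be64(const unsigned char in[8])
{
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}
```

Because both helpers work byte by byte, the value written by the first kernel would read back identically in the second kernel whatever the host byte order is.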
[PATCH 2/3] powerpc/of: coding style cleanup
Signed-off-by: Pingfan Liu To: linuxppc-dev@lists.ozlabs.org Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Hari Bathini Cc: Aneesh Kumar K.V Cc: Oliver O'Halloran Cc: Dan Williams Cc: ke...@lists.infradead.org --- arch/powerpc/kernel/of_property.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/of_property.c b/arch/powerpc/kernel/of_property.c index e56c832..c6abf7e 100644 --- a/arch/powerpc/kernel/of_property.c +++ b/arch/powerpc/kernel/of_property.c @@ -5,16 +5,18 @@ #include struct property *new_property(const char *name, const int length, -const unsigned char *value, struct property *last) + const unsigned char *value, struct property *last) { struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); if (!new) return NULL; - if (!(new->name = kstrdup(name, GFP_KERNEL))) + new->name = kstrdup(name, GFP_KERNEL); + if (!new->name) goto cleanup; - if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) + new->value = kmalloc(length + 1, GFP_KERNEL); + if (!new->value) goto cleanup; memcpy(new->value, value, length); -- 2.7.5
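For readers who want to try the cleaned-up allocation pattern outside the kernel, here is a user-space sketch of the same shape. It is an analogue under stated assumptions, not the kernel function: struct prop is a simplified stand-in for struct property, new_prop() is a hypothetical name, and calloc()/malloc()/free() replace kzalloc()/kstrdup()/kmalloc()/kfree().

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct property (assumption). */
struct prop {
	char *name;
	int length;
	void *value;
	struct prop *next;
};

/* Hypothetical user-space analogue of new_property(). */
static struct prop *new_prop(const char *name, int length,
			     const unsigned char *value, struct prop *last)
{
	struct prop *new = calloc(1, sizeof(*new));
	size_t name_len;

	if (!new)
		return NULL;

	/* Assignment and check are separate statements, as in patch 2/3. */
	name_len = strlen(name) + 1;
	new->name = malloc(name_len);
	if (!new->name)
		goto cleanup;
	memcpy(new->name, name, name_len);

	/* One extra byte so the value is always NUL-terminated. */
	new->value = malloc((size_t)length + 1);
	if (!new->value)
		goto cleanup;

	memcpy(new->value, value, (size_t)length);
	((char *)new->value)[length] = '\0';
	new->length = length;
	new->next = last;
	return new;

cleanup:
	/* free(NULL) is a no-op, so a partially built object is safe here. */
	free(new->name);
	free(new->value);
	free(new);
	return NULL;
}
```

The single cleanup label works because the object is zero-initialised, so any pointer that was never allocated is still NULL when it is freed.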
[PATCH 1/3] powerpc/of: split out new_property() for reusing
Since new_property() is used in several calling sites, splitting it out for reusing. To ease the review, although the split out part has coding style issue, keeping it untouched and fixed in next patch. Signed-off-by: Pingfan Liu To: linuxppc-dev@lists.ozlabs.org Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Hari Bathini Cc: Aneesh Kumar K.V Cc: Oliver O'Halloran Cc: Dan Williams Cc: ke...@lists.infradead.org --- arch/powerpc/include/asm/prom.h | 2 ++ arch/powerpc/kernel/Makefile | 2 +- arch/powerpc/kernel/of_property.c | 32 +++ arch/powerpc/mm/drmem.c | 26 - arch/powerpc/platforms/pseries/reconfig.c | 26 - 5 files changed, 35 insertions(+), 53 deletions(-) create mode 100644 arch/powerpc/kernel/of_property.c diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h index 94e3fd5..02f7b1b 100644 --- a/arch/powerpc/include/asm/prom.h +++ b/arch/powerpc/include/asm/prom.h @@ -90,6 +90,8 @@ struct of_drc_info { extern int of_read_drc_info_cell(struct property **prop, const __be32 **curval, struct of_drc_info *data); +extern struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last); /* * There are two methods for telling firmware what our capabilities are. 
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 157b014..23375fd 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -47,7 +47,7 @@ obj-y := cputable.o ptrace.o syscalls.o \ signal.o sysfs.o cacheinfo.o time.o \ prom.o traps.o setup-common.o \ udbg.o misc.o io.o misc_$(BITS).o \ - of_platform.o prom_parse.o + of_platform.o prom_parse.o of_property.o obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \ signal_64.o ptrace32.o \ paca.o nvram_64.o firmware.o note.o diff --git a/arch/powerpc/kernel/of_property.c b/arch/powerpc/kernel/of_property.c new file mode 100644 index 000..e56c832 --- /dev/null +++ b/arch/powerpc/kernel/of_property.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include + +struct property *new_property(const char *name, const int length, +const unsigned char *value, struct property *last) +{ + struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); + + if (!new) + return NULL; + + if (!(new->name = kstrdup(name, GFP_KERNEL))) + goto cleanup; + if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) + goto cleanup; + + memcpy(new->value, value, length); + *(((char *)new->value) + length) = 0; + new->length = length; + new->next = last; + return new; + +cleanup: + kfree(new->name); + kfree(new->value); + kfree(new); + return NULL; +} + diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c index 85b088a..888227e 100644 --- a/arch/powerpc/mm/drmem.c +++ b/arch/powerpc/mm/drmem.c @@ -99,32 +99,6 @@ static void init_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell, extern int test_hotplug; -static struct property *new_property(const char *name, const int length, -const unsigned char *value, struct property *last) -{ - struct property *new = kzalloc(sizeof(*new), GFP_KERNEL); - - if (!new) - return NULL; - - if (!(new->name = kstrdup(name, GFP_KERNEL))) - goto cleanup; - if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) - goto cleanup; - - 
memcpy(new->value, value, length); - *(((char *)new->value) + length) = 0; - new->length = length; - new->next = last; - return new; - -cleanup: - kfree(new->name); - kfree(new->value); - kfree(new); - return NULL; -} - static int drmem_update_dt_v2(struct device_node *memory, struct property *prop) { diff --git a/arch/powerpc/platforms/pseries/reconfig.c b/arch/powerpc/platforms/pseries/reconfig.c index 7f7369f..8e5a2ba 100644 --- a/arch/powerpc/platforms/pseries/reconfig.c +++ b/arch/powerpc/platforms/pseries/reconfig.c @@ -165,32 +165,6 @@ static char * parse_next_property(char *buf, char *end, char **name, int *length return tmp; } -static struct property *new_property(const char *name, const int length, -const unsigned char *value, struct property *last) -{ - struct
[PATCHv3] powerpc/crashkernel: take "mem=" option into account
The "mem=" option is an easy way to put high pressure on memory during some tests. Hence, after applying the memory limit, the actual usable memory, rather than the total memory, should be considered when reserving memory for the crashkernel. Otherwise the boot may hit an OOM issue.

E.g. when passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G"
and mem=5G on a 256G machine, it would reserve 4G prior to the change and 512M afterward.

This issue is powerpc specific because powerpc gives the fadump and kdump reservations higher priority than "mem=". See the following code:

	if (fadump_reserve_mem() == 0)
		reserve_crashkernel();
	...
	/* Ensure that total memory size is page-aligned. */
	limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
	memblock_enforce_memory_limit(limit);

On other arches, "mem=" takes higher priority: its effect is already reflected in memblock_phys_mem_size() before reserve_crashkernel() is called.

Signed-off-by: Pingfan Liu
To: linuxppc-dev@lists.ozlabs.org
Cc: Hari Bathini
Cc: Michael Ellerman
Cc: ke...@lists.infradead.org
---
v2 -> v3: improve commit log
 arch/powerpc/kernel/machine_kexec.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
index c4ed328..eec96dc 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
 
 void __init reserve_crashkernel(void)
 {
-	unsigned long long crash_size, crash_base;
+	unsigned long long crash_size, crash_base, total_mem_sz;
 	int ret;
 
+	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
 	/* use common parsing */
-	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+	ret = parse_crashkernel(boot_command_line, total_mem_sz,
 			&crash_size, &crash_base);
 	if (ret == 0 && crash_size > 0) {
 		crashk_res.start = crash_base;
@@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
 			"for crashkernel (System RAM: %ldMB)\n",
 			(unsigned long)(crash_size >> 20),
 			(unsigned long)(crashk_res.start >> 20),
-			(unsigned long)(memblock_phys_mem_size() >> 20));
+			(unsigned long)(total_mem_sz >> 20));
 
 	if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
 	    memblock_reserve(crashk_res.start, crash_size)) {
--
2.7.5
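To make the 4G versus 512M example in the commit log concrete, here is a small user-space model of the range selection. It is only a sketch: pick_crash_size() and parse_size() are hypothetical names, the parser handles just the "start-[end]:size" list with K/M/G suffixes, and it is not the kernel's parse_crashkernel().

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal number with an optional K/M/G suffix; *end points past it. */
static uint64_t parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 10);

	switch (*e) {
	case 'G': v <<= 10; /* fall through */
	case 'M': v <<= 10; /* fall through */
	case 'K': v <<= 10; e++; break;
	default: break;
	}
	*end = e;
	return v;
}

/*
 * Hypothetical, simplified model of the "start-[end]:size,..." syntax:
 * return the size of the first range containing system_ram, or 0 if no
 * range matches or the string is malformed.
 */
static uint64_t pick_crash_size(const char *spec, uint64_t system_ram)
{
	const char *p = spec;

	while (*p) {
		uint64_t start, end = UINT64_MAX, size;

		start = parse_size(p, &p);
		if (*p != '-')
			return 0;
		p++;
		if (*p != ':')		/* an empty end means "no upper bound" */
			end = parse_size(p, &p);
		if (*p != ':')
			return 0;
		p++;
		size = parse_size(p, &p);
		if (system_ram >= start && system_ram < end)
			return size;
		if (*p != ',')
			break;
		p++;
	}
	return 0;
}
```

With the crashkernel string from the commit log, passing 256G (what memblock_phys_mem_size() reports before the limit is enforced) selects the "128G-" range, while passing 5G (the value of mem=) selects the "4G-16G" range.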
[PATCH 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Checking for memory holes : [ 0.0 %] /
Checking for memory holes : [100.0 %] |
Excluding unnecessary pages : [100.0 %] \
Copying data : [ 0.3 %] - eta: 38s
[ 44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[ 44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98: 469 Bus error $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
In the current implementation, when hot-removing an lmb, the KOBJ_REMOVE event is emitted before the dt is updated, because __remove_memory() comes before drmem_update_dt().

From the viewpoint of listener and publisher, the publisher notifies the listener before the data is ready. As a result, udev launches kexec-tools (due to KOBJ_REMOVE), which loads a stale dt before it has been updated. In the capture kernel, makedumpfile then accesses memory based on the stale dt info and hits a SIGBUS error on a nonexistent lmb.

* Fix *
To fix this issue, update the dt before __remove_memory(), and apply the same rule in the hot-add path.

This introduces an extra dt update for each lmb involved in a hotplug operation. But it should be fine, since drmem_update_dt() is a memory-based operation and hotplug is not a hot path.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Hari Bathini
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index a3a9353..1f623c3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -392,6 +392,9 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+	rtas_hp_event = true;
+	drmem_update_dt();
+	rtas_hp_event = false;
 
 	__remove_memory(nid, base_addr, block_sz);
 
@@ -665,6 +668,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	lmb_set_nid(lmb);
 	lmb->flags |= DRCONF_MEM_ASSIGNED;
+	rtas_hp_event = true;
+	drmem_update_dt();
+	rtas_hp_event = false;
 
 	block_sz = memory_block_size_bytes();
 
@@ -683,6 +689,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
 		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+		rtas_hp_event = true;
+		drmem_update_dt();
+		rtas_hp_event = false;
 
 		__remove_memory(nid, base_addr, block_sz);
 	}
@@ -939,12 +948,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
 		break;
 	}
 
-	if (!rc) {
-		rtas_hp_event = true;
-		rc = drmem_update_dt();
-		rtas_hp_event = false;
-	}
-
 	unlock_device_hotplug();
 	return rc;
 }
--
2.7.5
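The publisher/listener ordering problem described in the root cause can be shown with a toy model. Everything below is an illustration under loud assumptions: a single boolean stands in for the device tree, on_remove_event() stands in for udev reloading the kdump kernel on KOBJ_REMOVE, and none of these names exist in the kernel.

```c
#include <stdbool.h>

/* Toy "device tree": records whether the LMB is still assigned. */
static bool dt_lmb_assigned = true;

/* What the listener captured when the remove event arrived. */
static bool snapshot_lmb_assigned;

/* Listener: snapshots the device tree at event time (hypothetical). */
static void on_remove_event(void)
{
	snapshot_lmb_assigned = dt_lmb_assigned;
}

/* Old order: emit the event first, update the device tree afterwards. */
static void remove_lmb_event_first(void)
{
	dt_lmb_assigned = true;		/* LMB present before removal */
	on_remove_event();		/* stands in for __remove_memory() */
	dt_lmb_assigned = false;	/* stands in for drmem_update_dt() */
}

/* New order: update the device tree, then emit the event. */
static void remove_lmb_dt_first(void)
{
	dt_lmb_assigned = true;		/* LMB present before removal */
	dt_lmb_assigned = false;	/* stands in for drmem_update_dt() */
	on_remove_event();		/* stands in for __remove_memory() */
}
```

In the first ordering the listener's snapshot still claims the LMB is assigned, which corresponds to the stale dt that makedumpfile later trips over; in the second the snapshot matches the final state.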
[PATCH 1/2] powerpc/pseries: group lmb operation and memblock's
This patch prepares for the next patch, which swaps the order of the KOBJ_ uevent and the dt update. It has no functional effect; it only groups the lmb operations and the memblock operations so that the dt update can be inserted easily, which also makes the change easier to review.

Signed-off-by: Pingfan Liu
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Hari Bathini
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94..a3a9353 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -375,7 +375,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
 	unsigned long block_sz;
-	int rc;
+	phys_addr_t base_addr;
+	int rc, nid;
 
 	if (!lmb_is_removable(lmb))
 		return -EINVAL;
@@ -384,17 +385,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 	if (rc)
 		return rc;
 
+	base_addr = lmb->base_addr;
+	nid = lmb->nid;
 	block_sz = pseries_memory_block_size();
 
-	__remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-	/* Update memory regions for memory remove */
-	memblock_remove(lmb->base_addr, block_sz);
-
 	invalidate_lmb_associativity_index(lmb);
 	lmb_clear_nid(lmb);
 	lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+	__remove_memory(nid, base_addr, block_sz);
+
+	/* Update memory regions for memory remove */
+	memblock_remove(base_addr, block_sz);
+
 	return 0;
 }
 
@@ -661,6 +664,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	}
 
 	lmb_set_nid(lmb);
+	lmb->flags |= DRCONF_MEM_ASSIGNED;
+
 	block_sz = memory_block_size_bytes();
 
 	/* Add the memory */
@@ -672,11 +677,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+		int nid = lmb->nid;
+		phys_addr_t base_addr = lmb->base_addr;
+
 		invalidate_lmb_associativity_index(lmb);
 		lmb_clear_nid(lmb);
-	} else {
-		lmb->flags |= DRCONF_MEM_ASSIGNED;
+		lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+		__remove_memory(nid, base_addr, block_sz);
 	}
 
 	return rc;
--
2.7.5
[PATCH] powerpc/pseries: in lmb_is_removable(), advance pfn if section is not present
In lmb_is_removable(), if a section is not present, the loop should go on to test the remaining sections in the block. The current code fails to do so: it hits 'continue' without advancing phys_addr, so every later iteration tests the same non-present section again.

Signed-off-by: Pingfan Liu
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94..a4d40a3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -360,8 +360,10 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 
 	for (i = 0; i < scns_per_block; i++) {
 		pfn = PFN_DOWN(phys_addr);
-		if (!pfn_present(pfn))
+		if (!pfn_present(pfn)) {
+			phys_addr += MIN_MEMORY_BLOCK_SIZE;
 			continue;
+		}
 
 		rc = rc && is_mem_section_removable(pfn, PAGES_PER_SECTION);
 		phys_addr += MIN_MEMORY_BLOCK_SIZE;
--
2.7.5
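The effect of the missing advance is easy to reproduce in user space. The sketch below is a toy model, not kernel code: struct section, removable_buggy() and removable_fixed() are hypothetical names, and an array index plays the role of phys_addr.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one memory section inside an LMB (assumption). */
struct section {
	bool present;
	bool removable;
};

/* Original loop shape: 'continue' skips the advance. */
static bool removable_buggy(const struct section *s, size_t n)
{
	bool rc = true;
	size_t idx = 0;		/* plays the role of phys_addr */

	for (size_t i = 0; i < n; i++) {
		if (!s[idx].present)
			continue;	/* idx unchanged: same section next time */
		rc = rc && s[idx].removable;
		idx++;
	}
	return rc;
}

/* Patched loop shape: advance before 'continue'. */
static bool removable_fixed(const struct section *s, size_t n)
{
	bool rc = true;
	size_t idx = 0;

	for (size_t i = 0; i < n; i++) {
		if (!s[idx].present) {
			idx++;
			continue;
		}
		rc = rc && s[idx].removable;
		idx++;
	}
	return rc;
}
```

For a block whose first section is absent and whose second section is present but not removable, the buggy shape never looks at the second section and reports the block as removable, while the fixed shape reports it as not removable.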
Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock
On Mon, Oct 14, 2019 at 10:03:03PM +0200, Jan Kara wrote: > On Mon 14-10-19 08:23:39, Eric Sandeen wrote: > > On 10/14/19 4:43 AM, Jan Kara wrote: > > > On Mon 14-10-19 16:33:15, Pingfan Liu wrote: > > > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote: > > > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote: > > > > > > When using fadump (fireware assist dump) mode on powerpc, a mismatch > > > > > > between grub xfs driver and kernel xfs driver has been obsevered. > > > > > > Note: > > > > > > fadump boots up in the following sequence: fireware -> grub reads > > > > > > kernel > > > > > > and initramfs -> kernel boots. > > > > > > > > > > > > The process to reproduce this mismatch: > > > > > >- On powerpc, boot kernel with fadump=on and edit > > > > > > /etc/kdump.conf. > > > > > >- Replacing "path /var/crash" with "path /var/crashnew", then, > > > > > > "kdumpctl > > > > > > restart" to rebuild the initramfs. Detail about the rebuilding > > > > > > looks > > > > > > like: mkdumprd /boot/initramfs-`uname -r`.img.tmp; > > > > > >mv /boot/initramfs-`uname -r`.img.tmp > > > > > > /boot/initramfs-`uname -r`.img > > > > > >sync > > > > > >- "echo c >/proc/sysrq-trigger". > > > > > > > > > > > > The result: > > > > > > The dump image will not be saved under /var/crashnew/* as expected, > > > > > > but > > > > > > still saved under /var/crash. > > > > > > > > > > > > The root cause: > > > > > > As Eric pointed out that on xfs, 'sync' ensures the consistency by > > > > > > writing > > > > > > back metadata to xlog, but not necessary to fsblock. This raises > > > > > > issue if > > > > > > grub can not replay the xlog before accessing the xfs files. Since > > > > > > the > > > > > > above dir entry of initramfs should be saved as inline data with > > > > > > xfs_inode, > > > > > > so xfs_fs_sync_fs() does not guarantee it written to fsblock. 
> > > > > > > > > > > > umount can be used to write metadata fsblock, but the filesystem > > > > > > can not be > > > > > > umounted if still in use. > > > > > > > > > > > > There are two ways to fix this mismatch, either grub or xfs. It may > > > > > > be > > > > > > easier to do this in xfs side by introducing an interface to flush > > > > > > metadata > > > > > > to fsblock explicitly. > > > > > > > > > > > > With this patch, metadata can be written to fsblock by: > > > > > ># update AIL > > > > > >sync > > > > > ># new introduced interface to flush metadata to fsblock > > > > > >mount -o remount,metasync mountpoint > > > > > > > > > > I think this ought to be an ioctl or some sort of generic call since > > > > > the > > > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER > > > > > is too dumb to recover logs but still wants to write to the fs" > > > > > checkpointing problem. > > > > Yes, a syscall sounds more reasonable. > > > > > > > > > > (Or maybe we should just put all that stuff in a vfat filesystem, I > > > > > don't know...) > > > > I think it is unavoidable to involve in each fs' implementation. What > > > > about introducing an interface sync_to_fsblock(struct super_block *sb) > > > > in > > > > the struct super_operations, then let each fs manage its own case? > > > > > > Well, we already have a way to achieve what you need: fsfreeze. > > > Traditionally, that is guaranteed to put fs into a "clean" state very much > > > equivalent to the fs being unmounted and that seems to be what the > > > bootloader wants so that it can access the filesystem without worrying > > > about some recovery details. So do you see any problem with replacing > > > 'sync
Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock
On Mon, Oct 14, 2019 at 08:23:39AM -0500, Eric Sandeen wrote: > On 10/14/19 4:43 AM, Jan Kara wrote: > > On Mon 14-10-19 16:33:15, Pingfan Liu wrote: > > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote: > > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote: > > > > > When using fadump (fireware assist dump) mode on powerpc, a mismatch > > > > > between grub xfs driver and kernel xfs driver has been obsevered. > > > > > Note: > > > > > fadump boots up in the following sequence: fireware -> grub reads > > > > > kernel > > > > > and initramfs -> kernel boots. > > > > > > > > > > The process to reproduce this mismatch: > > > > >- On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf. > > > > >- Replacing "path /var/crash" with "path /var/crashnew", then, > > > > > "kdumpctl > > > > > restart" to rebuild the initramfs. Detail about the rebuilding > > > > > looks > > > > > like: mkdumprd /boot/initramfs-`uname -r`.img.tmp; > > > > >mv /boot/initramfs-`uname -r`.img.tmp > > > > > /boot/initramfs-`uname -r`.img > > > > >sync > > > > >- "echo c >/proc/sysrq-trigger". > > > > > > > > > > The result: > > > > > The dump image will not be saved under /var/crashnew/* as expected, > > > > > but > > > > > still saved under /var/crash. > > > > > > > > > > The root cause: > > > > > As Eric pointed out that on xfs, 'sync' ensures the consistency by > > > > > writing > > > > > back metadata to xlog, but not necessary to fsblock. This raises > > > > > issue if > > > > > grub can not replay the xlog before accessing the xfs files. Since the > > > > > above dir entry of initramfs should be saved as inline data with > > > > > xfs_inode, > > > > > so xfs_fs_sync_fs() does not guarantee it written to fsblock. > > > > > > > > > > umount can be used to write metadata fsblock, but the filesystem can > > > > > not be > > > > > umounted if still in use. > > > > > > > > > > There are two ways to fix this mismatch, either grub or xfs. 
It may be > > > > > easier to do this in xfs side by introducing an interface to flush > > > > > metadata > > > > > to fsblock explicitly. > > > > > > > > > > With this patch, metadata can be written to fsblock by: > > > > ># update AIL > > > > >sync > > > > ># new introduced interface to flush metadata to fsblock > > > > >mount -o remount,metasync mountpoint > > > > > > > > I think this ought to be an ioctl or some sort of generic call since the > > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER > > > > is too dumb to recover logs but still wants to write to the fs" > > > > checkpointing problem. > > > Yes, a syscall sounds more reasonable. > > > > > > > > (Or maybe we should just put all that stuff in a vfat filesystem, I > > > > don't know...) > > > I think it is unavoidable to involve in each fs' implementation. What > > > about introducing an interface sync_to_fsblock(struct super_block *sb) in > > > the struct super_operations, then let each fs manage its own case? > > > > Well, we already have a way to achieve what you need: fsfreeze. > > Traditionally, that is guaranteed to put fs into a "clean" state very much > > equivalent to the fs being unmounted and that seems to be what the > > bootloader wants so that it can access the filesystem without worrying > > about some recovery details. So do you see any problem with replacing > > 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'? > > > > Honza > > The problem with fsfreeze is that if the device you want to quiesce is, say, > the root fs, freeze isn't really a good option.
Yes, that is the difference between my patch and fsfreeze. But honestly, it is rare for a system not to have a /boot partition. Since activity on /boot is very low, fsfreeze may meet the need, or it can be retried repeatedly until it succeeds.
> > But the other thing I want to highlight about this approach is that it does > not > solve the root problem: something is trying to read the block device without > first replaying the log. > > A call such as the proposal here is only going to leave consistent metadata at > the time the call returns; at any time after that, all guarantees are off > again,
My patch assumes that grub only accesses a limited set of files, and it ensures consistency only for those files (kernel, initramfs).
> so the problem hasn't been solved.
Agree. The perfect solution would be a log-aware bootloader.

Thanks and regards,
Pingfan
Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock
On Mon, Oct 14, 2019 at 01:40:27AM -0700, Christoph Hellwig wrote:
> On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > When using fadump (firmware-assisted dump) mode on powerpc, a mismatch
> > between grub xfs driver and kernel xfs driver has been observed. Note:
> > fadump boots up in the following sequence: firmware -> grub reads kernel
> > and initramfs -> kernel boots.
>
> This isn't something new. To fundamentally fix this you need to
> implement (in-memory) log recovery in grub. That is the only really safe
> long-term solution. But the equivalent of your patch you can already

Agree. For the consistency of the whole fs, grub needs to be aware of the log. This patch only assumes that the files accessed by grub are known, and forces consistency on just those files.

> get by freezing and unfreezing the file system using the FIFREEZE and
> FITHAW ioctls. And if my memory is serving me correctly Dave has been

Freeze will block any further modification to the fs. That is different from my patch, which does not have such a limitation.

> preaching that to the bootloader folks for a long time, but apparently
> without visible results.

Yes, it is a pity. And maybe it is not easy to do.

Thanks and regards,
Pingfan
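For reference, the freeze/thaw alternative mentioned above can be driven from user space with the FIFREEZE and FITHAW ioctls from <linux/fs.h>. The following is only a sketch: freeze_thaw() is a hypothetical helper name, error handling is minimal, and the call needs CAP_SYS_ADMIN on a filesystem that supports freezing.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: freeze and immediately thaw the filesystem that
 * contains 'mountpoint'. Returns 0 on success and -1 on any failure.
 */
static int freeze_thaw(const char *mountpoint)
{
	int fd = open(mountpoint, O_RDONLY);
	int rc = -1;

	if (fd < 0)
		return -1;

	if (ioctl(fd, FIFREEZE, 0) == 0) {
		/* Only thaw when the freeze itself succeeded. */
		rc = (ioctl(fd, FITHAW, 0) == 0) ? 0 : -1;
	}
	close(fd);
	return rc;
}
```

Calling freeze_thaw("/boot") would be the programmatic counterpart of 'fsfreeze /boot && fsfreeze -u /boot'. As noted in the thread, writers to that filesystem are blocked while it is frozen, which is why this is less attractive when the kernel and initramfs live on the root fs.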
Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock
On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote: > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote: > > When using fadump (fireware assist dump) mode on powerpc, a mismatch > > between grub xfs driver and kernel xfs driver has been obsevered. Note: > > fadump boots up in the following sequence: fireware -> grub reads kernel > > and initramfs -> kernel boots. > > > > The process to reproduce this mismatch: > > - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf. > > - Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl > > restart" to rebuild the initramfs. Detail about the rebuilding looks > > like: mkdumprd /boot/initramfs-`uname -r`.img.tmp; > > mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname > > -r`.img > > sync > > - "echo c >/proc/sysrq-trigger". > > > > The result: > > The dump image will not be saved under /var/crashnew/* as expected, but > > still saved under /var/crash. > > > > The root cause: > > As Eric pointed out that on xfs, 'sync' ensures the consistency by writing > > back metadata to xlog, but not necessary to fsblock. This raises issue if > > grub can not replay the xlog before accessing the xfs files. Since the > > above dir entry of initramfs should be saved as inline data with xfs_inode, > > so xfs_fs_sync_fs() does not guarantee it written to fsblock. > > > > umount can be used to write metadata fsblock, but the filesystem can not be > > umounted if still in use. > > > > There are two ways to fix this mismatch, either grub or xfs. It may be > > easier to do this in xfs side by introducing an interface to flush metadata > > to fsblock explicitly. 
> > > > With this patch, metadata can be written to fsblock by: > > # update AIL > > sync > > # new introduced interface to flush metadata to fsblock > > mount -o remount,metasync mountpoint > > I think this ought to be an ioctl or some sort of generic call since the > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER > is too dumb to recover logs but still wants to write to the fs" > checkpointing problem. Yes, a syscall sounds more reasonable. > > (Or maybe we should just put all that stuff in a vfat filesystem, I > don't know...) I think it is unavoidable to involve in each fs' implementation. What about introducing an interface sync_to_fsblock(struct super_block *sb) in the struct super_operations, then let each fs manage its own case? > > --D > > > Signed-off-by: Pingfan Liu > > Cc: "Darrick J. Wong" > > Cc: Dave Chinner > > Cc: Eric Sandeen > > Cc: Hari Bathini > > Cc: linuxppc-dev@lists.ozlabs.org > > To: linux-...@vger.kernel.org > > --- > > fs/xfs/xfs_mount.h | 1 + > > fs/xfs/xfs_super.c | 15 ++- > > fs/xfs/xfs_trans.h | 2 ++ > > fs/xfs/xfs_trans_ail.c | 26 +- > > fs/xfs/xfs_trans_priv.h | 1 + > > 5 files changed, 43 insertions(+), 2 deletions(-) > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > index fdb60e0..85f32e6 100644 > > --- a/fs/xfs/xfs_mount.h > > +++ b/fs/xfs/xfs_mount.h > > @@ -243,6 +243,7 @@ typedef struct xfs_mount { > > #define XFS_MOUNT_FILESTREAMS (1ULL << 24)/* enable the > > filestreams > >allocator */ > > #define XFS_MOUNT_NOATTR2 (1ULL << 25)/* disable use of attr2 format > > */ > > +#define XFS_MOUNT_METASYNC (1ull << 26)/* write meta to fsblock */ > > > > #define XFS_MOUNT_DAX (1ULL << 62)/* TEST ONLY! 
*/ > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > > index 8d1df9f..41df810 100644 > > --- a/fs/xfs/xfs_super.c > > +++ b/fs/xfs/xfs_super.c > > @@ -59,7 +59,7 @@ enum { > > Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, > > Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota, > > Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce, > > - Opt_discard, Opt_nodiscard, Opt_dax, Opt_err, > > + Opt_discard, Opt_nodiscard, Opt_dax, Opt_metasync, Opt_err > > }; > > > > static const match_table_t tokens = { > > @@ -106,6 +106,7 @@ static const match_table_t tokens = { > > {Opt_discard, "discard"}, /* Discard unused blocks */ > > {Opt_nodiscard, "nodiscard"}, /* Do not dis
[PATCH] xfs: introduce "metasync" api to sync metadata to fsblock
When using fadump (firmware-assisted dump) mode on powerpc, a mismatch between the grub xfs driver and the kernel xfs driver has been observed. Note: fadump boots up in the following sequence: firmware -> grub reads kernel and initramfs -> kernel boots.

The process to reproduce this mismatch:
 - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
 - Replace "path /var/crash" with "path /var/crashnew", then run "kdumpctl restart" to rebuild the initramfs. The rebuild looks like:
     mkdumprd /boot/initramfs-`uname -r`.img.tmp;
     mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
     sync
 - "echo c >/proc/sysrq-trigger".

The result:
The dump image will not be saved under /var/crashnew/* as expected, but is still saved under /var/crash.

The root cause:
As Eric pointed out, on xfs 'sync' ensures consistency by writing metadata back to the xlog, but not necessarily to the fsblock. This raises an issue if grub can not replay the xlog before accessing the xfs files. Since the above dir entry of the initramfs should be saved as inline data with the xfs_inode, xfs_fs_sync_fs() does not guarantee that it is written to the fsblock.

umount can be used to write metadata to the fsblock, but the filesystem can not be unmounted while it is still in use.

There are two ways to fix this mismatch: in grub or in xfs. It may be easier to do this on the xfs side by introducing an interface to flush metadata to the fsblock explicitly.

With this patch, metadata can be written to the fsblock by:
 # update AIL
 sync
 # newly introduced interface to flush metadata to fsblock
 mount -o remount,metasync mountpoint

Signed-off-by: Pingfan Liu
Cc: "Darrick J.
Wong" Cc: Dave Chinner Cc: Eric Sandeen Cc: Hari Bathini Cc: linuxppc-dev@lists.ozlabs.org To: linux-...@vger.kernel.org --- fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_super.c | 15 ++- fs/xfs/xfs_trans.h | 2 ++ fs/xfs/xfs_trans_ail.c | 26 +- fs/xfs/xfs_trans_priv.h | 1 + 5 files changed, 43 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index fdb60e0..85f32e6 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -243,6 +243,7 @@ typedef struct xfs_mount { #define XFS_MOUNT_FILESTREAMS (1ULL << 24)/* enable the filestreams allocator */ #define XFS_MOUNT_NOATTR2 (1ULL << 25)/* disable use of attr2 format */ +#define XFS_MOUNT_METASYNC (1ull << 26)/* write meta to fsblock */ #define XFS_MOUNT_DAX (1ULL << 62)/* TEST ONLY! */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 8d1df9f..41df810 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -59,7 +59,7 @@ enum { Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota, Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce, - Opt_discard, Opt_nodiscard, Opt_dax, Opt_err, + Opt_discard, Opt_nodiscard, Opt_dax, Opt_metasync, Opt_err }; static const match_table_t tokens = { @@ -106,6 +106,7 @@ static const match_table_t tokens = { {Opt_discard, "discard"}, /* Discard unused blocks */ {Opt_nodiscard, "nodiscard"}, /* Do not discard unused blocks */ {Opt_dax, "dax"}, /* Enable direct access to bdev pages */ + {Opt_metasync, "metasync"},/* one shot to write meta to fsblock */ {Opt_err, NULL}, }; @@ -338,6 +339,9 @@ xfs_parseargs( mp->m_flags |= XFS_MOUNT_DAX; break; #endif + case Opt_metasync: + mp->m_flags |= XFS_MOUNT_METASYNC; + break; default: xfs_warn(mp, "unknown mount option [%s].", p); return -EINVAL; @@ -1259,6 +1263,9 @@ xfs_fs_remount( mp->m_flags |= XFS_MOUNT_SMALL_INUMS; mp->m_maxagi = xfs_set_inode_alloc(mp, sbp->sb_agcount); break; + case Opt_metasync: + mp->m_flags |= 
XFS_MOUNT_METASYNC; + break; default: /* * Logically we would return an error here to prevent @@ -1286,6 +1293,12 @@ xfs_fs_remount( } } + if (mp->m_flags & XFS_MOUNT_METASYNC) { + xfs_ail_push_sync(mp->m_ail); + /* one shot flag */ + mp->m_flags &= ~XFS_MOUNT_METASYNC; + } + /* ro -> rw */ if ((mp->m_flags & XFS_MOUNT_RDONLY) && !(*flags & SB_RDONLY)) { if (mp-&g
Re: [PATCH] powerpc/crashkernel: take mem option into account
On Wed, Sep 18, 2019 at 7:23 PM Michael Ellerman wrote:
>
> Pingfan Liu writes:
> > Cc Kexec list. And keep the original content.
> >
> > On Thu, Sep 12, 2019 at 10:50 AM Pingfan Liu wrote:
> >>
> >> 'mem=" option is an easy way to put high pressure on memory during some
> >> test. Hence in stead of total mem, the effective usable memory size
>                 ^                          ^
>                 instead                    "actual" would be clearer
>
> I think adding: "after applying the memory limit"
>
> would help here.
>
> >> should be considered when reserving mem for crashkernel. Otherwise
> >> the boot up may experience oom issue.
>                                ^
>                                OOM
> >>
> >> E.g passing
> >> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> >> mem=5G on a 256G machine.
>
> Spelling out the behaviour before and after would help here, eg:
>
> .. "would reserve 4G prior to the change and 512M afterward."
>
Thanks for the kind review. I will update the commit message based on
your suggestions.

> >> Signed-off-by: Pingfan Liu
> >> Cc: Hari Bathini
> >> Cc: Michael Ellerman
> >> To: linuxppc-dev@lists.ozlabs.org
> >> ---
> >> v1 -> v2: fix the printk info about the total mem
> >>  arch/powerpc/kernel/machine_kexec.c | 7 ++++---
> >>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> >> index c4ed328..eec96dc 100644
> >> --- a/arch/powerpc/kernel/machine_kexec.c
> >> +++ b/arch/powerpc/kernel/machine_kexec.c
> >> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
> >>
> >>  void __init reserve_crashkernel(void)
> >>  {
> >> -	unsigned long long crash_size, crash_base;
> >> +	unsigned long long crash_size, crash_base, total_mem_sz;
> >>  	int ret;
> >>
> >> +	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
> >>  	/* use common parsing */
> >> -	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> >> +	ret = parse_crashkernel(boot_command_line, total_mem_sz,
> >>  			&crash_size, &crash_base);
>
> I think this change makes sense. But we have multiple arches that
> implement similar logic, and I wonder if we should keep them all the
> same.
>
> eg:
>
> arch/arm/kernel/setup.c:	ret = parse_crashkernel(boot_command_line, total_mem,
> arch/arm64/mm/init.c:	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> arch/ia64/kernel/setup.c:	ret = parse_crashkernel(boot_command_line, total,
> arch/mips/kernel/setup.c:	ret = parse_crashkernel(boot_command_line, total_mem,
> arch/powerpc/kernel/fadump.c:	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> arch/powerpc/kernel/machine_kexec.c:	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> arch/s390/kernel/setup.c:	rc = parse_crashkernel(boot_command_line, memory_end, &crash_size,
> arch/sh/kernel/machine_kexec.c:	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> arch/x86/kernel/setup.c:	ret = parse_crashkernel(boot_command_line, total_mem, &crash_size, &crash_base);
>
>
> From a quick glance most of them don't seem to take the memory limit
> into account.
>
> So I guess the question is do we want all arches to implement the same
> behaviour or do we think it doesn't matter if they differ in details
> like this?

On powerpc, the current code gives fadump/kdump a higher priority than
the "mem=" option, as the comment in fadump_reserve_mem() says:
"
	/*
	 * Calculate the memory boundary.
	 * If memory_limit is less than actual memory boundary then reserve
	 * the memory for fadump beyond the memory_limit and adjust the
	 * memory_limit accordingly, so that the running kernel can run with
	 * specified memory_limit.
	 */
"
Other arches, in contrast, pack the "mem=" info into memblock before
calling memblock_phys_mem_size(). So by the time memblock_phys_mem_size()
is passed to parse_crashkernel(), "mem=" has already taken effect.
E.g. for x86, in arch/x86/kernel/e820.c:

static int __init parse_memopt(char *p)
{
...
	e820__range_remove(mem_size, ULLONG_MAX - mem_size, E820_TYPE_RAM, 1);
	// this packs the "mem=" info into e820, which is finally fed to memblock
}

Thanks,
Pingfan
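
The ordering difference described in this reply can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions: `struct mem_region`, `phys_mem_size()` and `enforce_limit()` are hypothetical names that mimic the idea of a memory map being trimmed by "mem=". They are not the kernel's memblock or e820 API.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of a memory map. */
struct mem_region {
	unsigned long long base;
	unsigned long long size;
};

/* Sum of all region sizes, in the spirit of memblock_phys_mem_size(). */
unsigned long long phys_mem_size(const struct mem_region *r, size_t n)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += r[i].size;
	return total;
}

/*
 * Trim the map so that the total does not exceed 'limit'. This models an
 * arch that applies "mem=" to its memory map before the crashkernel
 * reservation is computed.
 */
void enforce_limit(struct mem_region *r, size_t n, unsigned long long limit)
{
	unsigned long long seen = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (seen >= limit)
			r[i].size = 0;			/* entirely above the limit */
		else if (r[i].size > limit - seen)
			r[i].size = limit - seen;	/* straddles the limit */
		seen += r[i].size;
	}
}
```

With two 128G regions and a 5G limit, `phys_mem_size()` reports 256G before `enforce_limit()` runs and 5G afterwards. An arch that trims its map first therefore hands the limited size to parse_crashkernel(), while an arch that trims later hands over the full size unless it consults the limit explicitly.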
Re: [PATCH] powerpc/crashkernel: take mem option into account
Cc Kexec list. And keep the original content.

On Thu, Sep 12, 2019 at 10:50 AM Pingfan Liu wrote:
>
> 'mem=" option is an easy way to put high pressure on memory during some
> test. Hence in stead of total mem, the effective usable memory size should
> be considered when reserving mem for crashkernel. Otherwise the boot up may
> experience oom issue.
>
> E.g passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G on a 256G machine.
>
> Signed-off-by: Pingfan Liu
> Cc: Hari Bathini
> Cc: Michael Ellerman
> To: linuxppc-dev@lists.ozlabs.org
> ---
> v1 -> v2: fix the printk info about the total mem
>  arch/powerpc/kernel/machine_kexec.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..eec96dc 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>
>  void __init reserve_crashkernel(void)
>  {
> -	unsigned long long crash_size, crash_base;
> +	unsigned long long crash_size, crash_base, total_mem_sz;
>  	int ret;
>
> +	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
>  	/* use common parsing */
> -	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +	ret = parse_crashkernel(boot_command_line, total_mem_sz,
>  			&crash_size, &crash_base);
>  	if (ret == 0 && crash_size > 0) {
>  		crashk_res.start = crash_base;
> @@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
>  			"for crashkernel (System RAM: %ldMB)\n",
>  			(unsigned long)(crash_size >> 20),
>  			(unsigned long)(crashk_res.start >> 20),
> -			(unsigned long)(memblock_phys_mem_size() >> 20));
> +			(unsigned long)(total_mem_sz >> 20));
>
>  	if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
>  	    memblock_reserve(crashk_res.start, crash_size)) {
> --
> 2.7.5
>
[PATCH] powerpc/crashkernel: take mem option into account
'mem=" option is an easy way to put high pressure on memory during some
test. Hence in stead of total mem, the effective usable memory size should
be considered when reserving mem for crashkernel. Otherwise the boot up may
experience oom issue.

E.g passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
mem=5G on a 256G machine.

Signed-off-by: Pingfan Liu
Cc: Hari Bathini
Cc: Michael Ellerman
To: linuxppc-dev@lists.ozlabs.org
---
v1 -> v2: fix the printk info about the total mem
 arch/powerpc/kernel/machine_kexec.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
index c4ed328..eec96dc 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)

 void __init reserve_crashkernel(void)
 {
-	unsigned long long crash_size, crash_base;
+	unsigned long long crash_size, crash_base, total_mem_sz;
 	int ret;

+	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
 	/* use common parsing */
-	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+	ret = parse_crashkernel(boot_command_line, total_mem_sz,
 			&crash_size, &crash_base);
 	if (ret == 0 && crash_size > 0) {
 		crashk_res.start = crash_base;
@@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
 			"for crashkernel (System RAM: %ldMB)\n",
 			(unsigned long)(crash_size >> 20),
 			(unsigned long)(crashk_res.start >> 20),
-			(unsigned long)(memblock_phys_mem_size() >> 20));
+			(unsigned long)(total_mem_sz >> 20));

 	if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
 	    memblock_reserve(crashk_res.start, crash_size)) {
--
2.7.5
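
The effect of the patch on the example in the commit message can be worked through with a short userspace sketch. It is not the kernel's parse_crashkernel() implementation: `struct crash_range`, `example_ranges`, `pick_crash_size()` and `effective_ram()` are hypothetical names, and the table is hard-coded instead of being parsed from the command line. The only assumption about the real parser is that a range matches when the system RAM is at or above its start and below its end.

```c
#include <stddef.h>

#define MiB (1ULL << 20)
#define GiB (1ULL << 30)

/* One "start-end:size" entry; end == 0 marks an open-ended range. */
struct crash_range {
	unsigned long long start;
	unsigned long long end;
	unsigned long long size;
};

/* 2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G */
const struct crash_range example_ranges[] = {
	{   2 * GiB,   4 * GiB, 384 * MiB },
	{   4 * GiB,  16 * GiB, 512 * MiB },
	{  16 * GiB,  64 * GiB,   1 * GiB },
	{  64 * GiB, 128 * GiB,   2 * GiB },
	{ 128 * GiB,         0,   4 * GiB },
};

/* Return the reservation size of the first range containing system_ram. */
unsigned long long pick_crash_size(const struct crash_range *r, size_t n,
				   unsigned long long system_ram)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (system_ram >= r[i].start &&
		    (r[i].end == 0 || system_ram < r[i].end))
			return r[i].size;
	}
	return 0;	/* no range matched: reserve nothing */
}

/* Mirrors the ternary added by the patch. */
unsigned long long effective_ram(unsigned long long memory_limit,
				 unsigned long long phys_size)
{
	return memory_limit ? memory_limit : phys_size;
}
```

Calling `pick_crash_size()` with 256G selects the `128G-:4G` entry, so 4G is reserved. Calling it with `effective_ram(5 * GiB, 256 * GiB)` selects `4G-16G:512M`, so 512M is reserved, which fits much more comfortably inside a 5G limit.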
Re: [PATCH] powerpc/crashkernel: take mem option into account
NACK. I missed updating the printk info, so I will send out a v2.

On Mon, Sep 9, 2019 at 12:05 PM Pingfan Liu wrote:
>
> 'mem=" option is an easy way to put high pressure on memory during some
> test. Hence in stead of total mem, the effective usable memory size should
> be considered when reserving mem for crashkernel. Otherwise the boot up may
> experience oom issue.
>
> E.g passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G.
>
> Signed-off-by: Pingfan Liu
> Cc: Hari Bathini
> Cc: Michael Ellerman
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/machine_kexec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..714b733 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>
>  void __init reserve_crashkernel(void)
>  {
> -	unsigned long long crash_size, crash_base;
> +	unsigned long long crash_size, crash_base, total_mem_sz;
>  	int ret;
>
> +	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
>  	/* use common parsing */
> -	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +	ret = parse_crashkernel(boot_command_line, total_mem_sz,
>  			&crash_size, &crash_base);
>  	if (ret == 0 && crash_size > 0) {
>  		crashk_res.start = crash_base;
> --
> 2.7.5
>
Re: [PATCH] powerpc/crashkernel: take mem option into account
On Mon, Sep 9, 2019 at 12:05 PM Pingfan Liu wrote:
>
> 'mem=" option is an easy way to put high pressure on memory during some
> test. Hence in stead of total mem, the effective usable memory size should
> be considered when reserving mem for crashkernel. Otherwise the boot up may
> experience oom issue.
>
> E.g passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G.
>
> Signed-off-by: Pingfan Liu
> Cc: Hari Bathini
> Cc: Michael Ellerman
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/machine_kexec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..714b733 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>
>  void __init reserve_crashkernel(void)
>  {
> -	unsigned long long crash_size, crash_base;
> +	unsigned long long crash_size, crash_base, total_mem_sz;
>  	int ret;
>
> +	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
Here memory_limit is used for estimation and may be changed later. So I
think it is better to use memory_limit here than to move
memblock_enforce_memory_limit() before the call to reserve_crashkernel().

Thanks,
Pingfan
>  	/* use common parsing */
> -	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +	ret = parse_crashkernel(boot_command_line, total_mem_sz,
>  			&crash_size, &crash_base);
>  	if (ret == 0 && crash_size > 0) {
>  		crashk_res.start = crash_base;
> --
> 2.7.5
>
[PATCH] powerpc/crashkernel: take mem option into account
'mem=" option is an easy way to put high pressure on memory during some
test. Hence in stead of total mem, the effective usable memory size should
be considered when reserving mem for crashkernel. Otherwise the boot up may
experience oom issue.

E.g passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
mem=5G.

Signed-off-by: Pingfan Liu
Cc: Hari Bathini
Cc: Michael Ellerman
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/machine_kexec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
index c4ed328..714b733 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)

 void __init reserve_crashkernel(void)
 {
-	unsigned long long crash_size, crash_base;
+	unsigned long long crash_size, crash_base, total_mem_sz;
 	int ret;

+	total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
 	/* use common parsing */
-	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+	ret = parse_crashkernel(boot_command_line, total_mem_sz,
 			&crash_size, &crash_base);
 	if (ret == 0 && crash_size > 0) {
 		crashk_res.start = crash_base;
--
2.7.5
Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant
Matthias, ping? Any suggestions?

Thanks,
Pingfan

On Thu, May 2, 2019 at 2:22 PM Pingfan Liu wrote:
>
> On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu wrote:
> >
> > On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger wrote:
> > >
> > [...]
> > > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char *cmdline,
> > > >  		pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> > > >  		return -EINVAL;
> > > >  	}
> > > > +	if (*crash_size == 0)
> > > > +		return -EINVAL;
> > >
> > > This covers the case where I pass an argument like "crashkernel=0M" ?
> > > Can't we fix that by using kstrtoull() in memparse and check if the return value
> > > is < 0? In that case we could return without updating the retptr and we will be
> > > fine.
> After a series of work, I suddenly realized that it can not be done
> like this way. "0M" causes kstrtoull() to return -EINVAL, but this is
> caused by "M", not "0". If passing "0" to kstrtoull(), it will return
> 0 on success.
>
> > >
> > It seems that kstrtoull() treats 0M as invalid parameter, while
> > simple_strtoull() does not.
> >
> My careless going through the code. And I tested with a valid value
> "256M" using kstrtoull(), it also returned -EINVAL.
>
> So I think there is no way to distinguish 0 from a positive value
> inside this basic math function.
> Do I miss anything?
>
> Thanks and regards,
> Pingfan
Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant
On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu wrote:
>
> On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger wrote:
> >
> [...]
> > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char *cmdline,
> > >  		pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> > >  		return -EINVAL;
> > >  	}
> > > +	if (*crash_size == 0)
> > > +		return -EINVAL;
> >
> > This covers the case where I pass an argument like "crashkernel=0M" ?
> > Can't we fix that by using kstrtoull() in memparse and check if the return value
> > is < 0? In that case we could return without updating the retptr and we will be
> > fine.
After a series of work, I suddenly realized that it can not be done
like this way. "0M" causes kstrtoull() to return -EINVAL, but this is
caused by "M", not "0". If passing "0" to kstrtoull(), it will return
0 on success.

> >
> It seems that kstrtoull() treats 0M as invalid parameter, while
> simple_strtoull() does not.
>
My careless going through the code. And I tested with a valid value
"256M" using kstrtoull(), it also returned -EINVAL.

So I think there is no way to distinguish 0 from a positive value
inside this basic math function.
Do I miss anything?

Thanks and regards,
Pingfan
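
The behaviour reported above can be reproduced in userspace with a rough sketch built on the C library's strtoull(). Both functions are hypothetical stand-ins, not the kernel routines: `strict_parse_ull()` only imitates the "whole string must be a number" rule attributed to kstrtoull() in this thread (the real function has further details that are ignored here), and `suffix_parse()` imitates a memparse()-style parser that accepts a K/M/G suffix.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Strict parse: succeed only if the entire string is a number.
 * Returns 0 on success, a negative errno value otherwise.
 */
int strict_parse_ull(const char *s, unsigned long long *res)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s || *end != '\0')
		return -EINVAL;		/* no digits, or trailing characters */
	*res = v;
	return 0;
}

/*
 * Suffix-aware parse: scale the number by an optional K/M/G suffix and
 * report where parsing stopped through retptr (may be NULL).
 */
unsigned long long suffix_parse(const char *s, char **retptr)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (retptr)
		*retptr = end;
	return v;
}
```

In this sketch the strict parser rejects both "0M" and "256M" because of the trailing "M", while it accepts a bare "0". The suffix-aware parser returns 0 for "0M" and 256 << 20 for "256M", so a caller cannot tell from the return value alone whether the size was an explicit zero. That is consistent with checking `*crash_size == 0` in the caller, as the quoted hunk does.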