Re: [RFC PATCH 1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core

2024-02-15 Thread Pingfan Liu
On Thu, Feb 15, 2024 at 9:09 PM Michael Ellerman
 wrote:
>
> On Fri, 29 Dec 2023 23:01:03 +1100, Michael Ellerman wrote:
> > If nr_cpu_ids is too low to include at least all the threads of a single
> > core adjust nr_cpu_ids upwards. This avoids triggering odd bugs in code
> > that assumes all threads of a core are available.
> >
> >
>
> Applied to powerpc/next.
>

Great! After all these years, finally we are close to the conclusion
of this feature.

Thanks,

Pingfan

> [1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core
>   
> https://git.kernel.org/powerpc/c/5580e96dad5a439d561d9648ffcbccb739c2a120
> [2/5] powerpc/smp: Increase nr_cpu_ids to include the boot CPU
>   
> https://git.kernel.org/powerpc/c/777f81f0a9c780a6443bcf2c7785f0cc2e87c1ef
> [3/5] powerpc/smp: Lookup avail once per device tree node
>   
> https://git.kernel.org/powerpc/c/dca79603fbc592ec7ea8bd7ba274052d3984e882
> [4/5] powerpc/smp: Factor out assign_threads()
>   
> https://git.kernel.org/powerpc/c/9832de654499f0bf797a3719c4d4c5bd401f18f5
> [5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids
>   
> https://git.kernel.org/powerpc/c/0875f1ceba974042069f04946aa8f1d4d1e688da
>
> cheers
>
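
The rule stated for patch 1/5 above can be sketched in a few lines of C.
This is only an illustration and not the code from the applied commit;
the helper name cover_one_core() is made up, and nthreads is assumed to
be the thread count of a single core.

```c
/*
 * Hypothetical helper (not from the applied commit): return the value
 * nr_cpu_ids should take so that it covers at least all the threads of
 * one core. Larger values are left untouched.
 */
unsigned int cover_one_core(unsigned int nr_cpu_ids, unsigned int nthreads)
{
	return nr_cpu_ids < nthreads ? nthreads : nr_cpu_ids;
}
```

With nr_cpus=1 on an 8-thread core this yields 8, so code that assumes
all sibling threads of a core exist keeps working.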



Re: [PATCH v6 (proposal)] powerpc/cpu: enable nr_cpus for crash kernel

2024-01-29 Thread Pingfan Liu
Hi Christophe,

The latest series is
https://lore.kernel.org/linuxppc-dev/20231017022806.4523-1-pi...@redhat.com/

And Michael has his implementation at:
https://lore.kernel.org/all/20231229120107.2281153-3-...@ellerman.id.au/T/#m46128446bce1095631162a1927415733a3bf0633

Thanks,

Pingfan

On Fri, Jan 26, 2024 at 3:40 AM Christophe Leroy
 wrote:
>
> Hi,
>
> Le 22/05/2018 à 10:23, Pingfan Liu a écrit :
> > For kexec -p, the boot cpu may not be cpu0, which causes a problem when
> > allocating paca[]. In theory, there is no requirement that a cpu's logical
> > id match its order of appearance in the device tree. But helpers such as
> > cpu_first_thread_sibling() make assumptions about the mapping inside
> > a core. Hence this patch changes the mapping only partially: it unbinds the
> > mapping of cores while keeping the mapping inside a core. After this patch,
> > the core containing the boot cpu will always be mapped to core 0.
> >
> > At present, the cpu discovery code is spread over two functions:
> > early_init_dt_scan_cpus() and smp_setup_cpu_maps().
> > This patch tries to fold smp_setup_cpu_maps() into the former.
>
> This patch is pretty old and doesn't apply anymore. If still relevant
> can you please rebase and resubmit.
>
> Thanks
> Christophe
>
> >
> > Signed-off-by: Pingfan Liu 
> > ---
> > v5 -> v6:
> >simplify the loop logic (Hope it can answer Benjamin's concern)
> > >concentrate the cpu recovery code to early stage (Hope it can answer Michael's concern)
> > Todo: (if this method is accepted)
> >fold the whole smp_setup_cpu_maps()
> >
> >   arch/powerpc/include/asm/smp.h |   1 +
> >   arch/powerpc/kernel/prom.c | 123 
> > -
> >   arch/powerpc/kernel/setup-common.c |  58 ++---
> >   drivers/of/fdt.c   |   2 +-
> >   include/linux/of_fdt.h |   2 +
> >   5 files changed, 103 insertions(+), 83 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> > index fac963e..80c7693 100644
> > --- a/arch/powerpc/include/asm/smp.h
> > +++ b/arch/powerpc/include/asm/smp.h
> > @@ -30,6 +30,7 @@
> >   #include 
> >
> >   extern int boot_cpuid;
> > +extern int threads_in_core;
> >   extern int spinning_secondaries;
> >
> >   extern void cpu_die(void);
> > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> > index 4922162..2ae0b4a 100644
> > --- a/arch/powerpc/kernel/prom.c
> > +++ b/arch/powerpc/kernel/prom.c
> > @@ -77,7 +77,6 @@ unsigned long tce_alloc_start, tce_alloc_end;
> >   u64 ppc64_rma_size;
> >   #endif
> >   static phys_addr_t first_memblock_size;
> > -static int __initdata boot_cpu_count;
> >
> >   static int __init early_parse_mem(char *p)
> >   {
> > @@ -305,6 +304,14 @@ static void __init 
> > check_cpu_feature_properties(unsigned long node)
> >   }
> >   }
> >
> > +struct bootinfo {
> > + int boot_thread_id;
> > + unsigned int cpu_cnt;
> > + int cpu_hwids[NR_CPUS];
> > + bool avail[NR_CPUS];
> > +};
> > +static struct bootinfo *bt_info;
> > +
> >   static int __init early_init_dt_scan_cpus(unsigned long node,
> > const char *uname, int depth,
> > void *data)
> > @@ -312,10 +319,12 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> >   const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> >   const __be32 *prop;
> >   const __be32 *intserv;
> > - int i, nthreads;
> > + int i, nthreads, maxidx;
> >   int len;
> > - int found = -1;
> > - int found_thread = 0;
> > + int found_thread = -1;
> > + struct bootinfo *info = data;
> > + bool avail;
> > + int rotate_cnt, id;
> >
> >   /* We are scanning "cpu" nodes only */
> >   if (type == NULL || strcmp(type, "cpu") != 0)
> > @@ -325,8 +334,15 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> >   intserv = of_get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", 
> > );
> >   if (!intserv)
> >   intserv = of_get_flat_dt_prop(node, "reg", );
> > + avail = of_fdt_device_is_available(initial_boot_params, node);
> > +#if 0
> > + //todo
> > + if (!avail)
> > + avail = !of_fdt_property

Re: [RFC PATCH 5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids

2024-01-01 Thread Pingfan Liu
On Fri, Dec 29, 2023 at 8:07 PM Michael Ellerman  wrote:
>
> Michael Ellerman  writes:
> > If nr_cpu_ids is too low to include the boot CPU, remap the boot CPU
> > onto logical core 0.
>
> Hi guys,
>
> I finally got time to look at this issue. I think this series should fix

Thanks a lot for taking the time to look at this. I hope we can close this
long-standing issue soon.

I am also looping in Wen Xiong and Ming Lei, who care about this issue too.

Best Regards,

Pingfan

> the problems that have been seen. I've tested this fairly thoroughly
> with a qemu script, and also a few boots on a real machine.
>
> If you can test it with your setups that would be great. Hopefully there
> isn't some obscure case I've missed.
>
> cheers
>
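
One way to read the remap described for patch 5/5 is sketched below. It
is a hedged illustration only, not Michael's implementation: the helper
name remap_boot_cpu() is hypothetical, and it assumes that sequential
cpu numbers are laid out as core * nthreads + thread, so that the boot
thread keeps its position inside the core when it moves to core 0.

```c
/*
 * Hypothetical helper: if the boot CPU's sequential number does not fit
 * below nr_cpu_ids, move it onto logical core 0 while keeping its thread
 * slot. Assumes seq_cpu = core * nthreads + thread.
 */
int remap_boot_cpu(int seq_cpu, int nthreads, int nr_cpu_ids)
{
	if (seq_cpu >= nr_cpu_ids)
		return seq_cpu % nthreads;	/* same thread slot, core 0 */

	return seq_cpu;				/* already fits, no remap */
}
```

The result only fits below nr_cpu_ids if nr_cpu_ids already covers a
whole core, which is what patch 1/5 of the series arranges.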



[PATCHv10 3/3] powerpc/smp: Allow hole in paca_ptrs to accommodate boot_cpu

2023-12-26 Thread Pingfan Liu
From: Pingfan Liu 

This patch always forces the first core to be onlined, because some
subsystems need cpu0. After core 0 a hole may follow, and then comes the
crashed core.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
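The sizing arithmetic in this patch can be tried out in isolation with
the sketch below. It mirrors the expressions in the prom.c and paca.c
hunks, but the function names and the ALIGN_UP macro are stand-ins
invented for the example, and nthreads is assumed to be at least 1.

```c
/* Round x up to a multiple of a (stand-in for the kernel's ALIGN()). */
#define ALIGN_UP(x, a)	((((x) + (a) - 1) / (a)) * (a))

/*
 * Hypothetical restatement of the prom.c hunk: round nr_cpu_ids up to
 * whole cores, then make room for core 0 plus the boot core when the
 * boot cpu does not live in core 0.
 */
unsigned int adjust_nr_cpu_ids(unsigned int nr_cpu_ids,
			       unsigned int boot_cpuid,
			       unsigned int nthreads)
{
	nr_cpu_ids = ALIGN_UP(nr_cpu_ids, nthreads);
	if (boot_cpuid >= nthreads && nr_cpu_ids < 2 * nthreads)
		nr_cpu_ids = 2 * nthreads;

	return nr_cpu_ids;
}

/*
 * Hypothetical restatement of the paca.c hunk: paca_ptrs[] must reach
 * the end of the boot core even if that lies beyond nr_cpu_ids.
 */
unsigned int paca_slots(unsigned int nr_cpu_ids,
			unsigned int boot_cpuid,
			unsigned int threads_in_core)
{
	unsigned int need = ALIGN_UP(boot_cpuid + 1, threads_in_core);

	return need > nr_cpu_ids ? need : nr_cpu_ids;
}
```

For example, with nr_cpus=1, 8 threads per core and a boot cpu with
logical id 58, nr_cpu_ids becomes 16 while paca_ptrs[] gets 64 slots.
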
 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/paca.c |  7 +--
 arch/powerpc/kernel/prom.c |  6 ++
 arch/powerpc/kernel/setup-common.c | 24 
 4 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..f01c7891b0d7 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -27,6 +27,7 @@
 
 extern int boot_cpuid;
 extern int boot_cpu_hwid; /* PPC64 only */
+extern int threads_in_core;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 840c74dd17d6..1fe0fd2a6021 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,12 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_last_cpu_num = nr_cpu_ids;
+   unsigned int cnt;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * paca_last_cpu_num;
+   /* paca_ptrs should be big enough to hold boot cpu */
+   cnt = max((unsigned int)ALIGN(boot_cpuid + 1, threads_in_core), nr_cpu_ids);
+   paca_last_cpu_num = cnt;
+   paca_ptrs_size = sizeof(struct paca_struct *) * cnt;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..e1a671156941 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -371,9 +371,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
DBG("boot cpu: logical %d physical %d\n", found,
be32_to_cpu(intserv[found_thread]));
boot_cpuid = found;
+   /* This forces all threads in a core to be onlined */
+   set_nr_cpu_ids(ALIGN(nr_cpu_ids, nthreads));
+   /* Core 0 is always onlined and assure enough room for boot core */
+   if (nthreads -1 < boot_cpuid && nr_cpu_ids < 2 * nthreads)
+   set_nr_cpu_ids(2 * nthreads);
 
if (IS_ENABLED(CONFIG_PPC64))
boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   threads_in_core = nthreads;
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index f9f5f313abf0..b70474e1b5fe 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -86,6 +86,7 @@ EXPORT_SYMBOL(machine_id);
 
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
+int __initdata threads_in_core = 1;
 
 #ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
@@ -448,8 +449,9 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
+   int cpu_onlined = 0, cpu = 0;
int nthreads = 1;
+   bool bootcpu_covered = false;
 
DBG("smp_setup_cpu_maps()\n");
 
@@ -484,7 +486,19 @@ void __init smp_setup_cpu_maps(void)
 
nthreads = len / sizeof(int);
 
-   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+   if (!bootcpu_covered) {
+   if (cpu == ALIGN_DOWN(boot_cpuid, nthreads)) {
+   bootcpu_covered = true;
+   goto scan;
+
+   /* Reserve the last online slot for boot core */
+   } else if (cpu >= nr_cpu_ids - nthreads && !bootcpu_covered) {
+   cpu += nthreads;
+   continue;
+   }
+   }
+scan:
+   for (j = 0; j < nthreads && cpu_onlined < nr_cpu_ids; j++) {
bool avail;
 
DBG("thread %d -> cpu %d (hard id %d)\n",
@@ -499,9 +513,10 @@ void __init smp_setup_cpu_maps(void)
set_cpu_possible(cpu, true);
cpu_to_phys_id[cpu] = be32_to_cpu(intserv[j]);
cpu++;
+   cpu_onlined++;
}
 
-   if (cpu >= nr_cpu_ids) {
+   if (cpu_onlined >= nr_cpu_ids) {
of_node_put(dn);
break;
}
@@ -547,7 +562,8 @@ vo

[PATCHv10 2/3] powerpc/kernel: Extend arrays' size to make room for a hole in cpu_possible_mask

2023-12-26 Thread Pingfan Liu
From: Pingfan Liu 

This patch marks all the arrays whose size is determined by
nr_cpu_ids or num_possible_cpus().  Later, if a hole is allowed in
cpu_possible_mask, each such array must be extended to cover the
last bit number in cpu_possible_mask.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
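To illustrate why counting possible cpus stops being enough once the
mask has a hole, here is a small user-space sketch. It models
cpu_possible_mask as a single 64-bit word, which is a simplification,
and the helper names are hypothetical rather than kernel API.

```c
#include <stdint.h>

/* Number of set bits, i.e. what a num_possible_cpus()-style count sees. */
unsigned int mask_weight(uint64_t possible)
{
	unsigned int n = 0;

	for (; possible; possible &= possible - 1)
		n++;

	return n;
}

/* Index of the highest set bit, or -1 for an empty mask. */
int mask_last(uint64_t possible)
{
	int last = -1;

	for (int i = 0; i < 64; i++)
		if (possible & (UINT64_C(1) << i))
			last = i;

	return last;
}

/* Entries an array indexed by cpu number needs to cover the last bit. */
unsigned int cpu_array_len(uint64_t possible)
{
	return (unsigned int)(mask_last(possible) + 1);
}
```

With cpus 0-7 and 56-63 possible, the weight is 16 but an array indexed
by cpu number needs 64 entries.
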
 arch/powerpc/include/asm/paca.h| 2 ++
 arch/powerpc/kernel/paca.c | 8 
 arch/powerpc/kernel/setup-common.c | 2 +-
 arch/powerpc/kernel/smp.c  | 3 ++-
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e667d455ecb4..a577d98dd0d8 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -299,5 +299,7 @@ static inline void free_unused_pacas(void) { }
 
 #endif /* CONFIG_PPC64 */
 
+extern int paca_last_cpu_num;
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_PACA_H */
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 760f371cf096..840c74dd17d6 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -236,15 +236,15 @@ void setup_paca(struct paca_struct *new_paca)
 
 }
 
-static int __initdata paca_nr_cpu_ids;
+int __initdata paca_last_cpu_num;
 static int __initdata paca_ptrs_size;
 static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_nr_cpu_ids = nr_cpu_ids;
+   paca_last_cpu_num = nr_cpu_ids;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   paca_ptrs_size = sizeof(struct paca_struct *) * paca_last_cpu_num;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
@@ -258,7 +258,7 @@ void __init allocate_paca(int cpu)
u64 limit;
struct paca_struct *paca;
 
-   BUG_ON(cpu >= paca_nr_cpu_ids);
+   BUG_ON(cpu >= paca_last_cpu_num);
 
 #ifdef CONFIG_PPC_BOOK3S_64
/*
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2f1026fba00d..f9f5f313abf0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -453,7 +453,7 @@ void __init smp_setup_cpu_maps(void)
 
DBG("smp_setup_cpu_maps()\n");
 
-   cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
+   cpu_to_phys_id = memblock_alloc(paca_last_cpu_num * sizeof(u32),
__alignof__(u32));
if (!cpu_to_phys_id)
panic("%s: Failed to allocate %zu bytes align=0x%zx\n",
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5826f5108a12..6fefe22fd118 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1140,7 +1140,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
}
 
if (cpu_to_chip_id(boot_cpuid) != -1) {
-   int idx = DIV_ROUND_UP(num_possible_cpus(), threads_per_core);
+   int idx = DIV_ROUND_UP(cpumask_last(cpu_possible_mask),
+   threads_per_core);
 
/*
 * All threads of a core will all belong to the same core,
-- 
2.31.1



[PATCHv10 1/3] powerpc/kernel: Remove check on paca_ptrs_size

2023-12-26 Thread Pingfan Liu
From: Pingfan Liu 

Between early_setup()->allocate_paca_ptrs() and
smp_setup_cpu_maps()->free_unused_pacas(), there is no call to
set_nr_cpu_ids(), which means nr_cpu_ids is unchanged.

Hence remove the check.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..760f371cf096 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -286,16 +286,6 @@ void __init allocate_paca(int cpu)
 
 void __init free_unused_pacas(void)
 {
-   int new_ptrs_size;
-
-   new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
-   if (new_ptrs_size < paca_ptrs_size)
-   memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
-  paca_ptrs_size - new_ptrs_size);
-
-   paca_nr_cpu_ids = nr_cpu_ids;
-   paca_ptrs_size = new_ptrs_size;
-
 #ifdef CONFIG_PPC_64S_HASH_MMU
if (early_radix_enabled()) {
/* Ugly fixup, see new_slb_shadow() */
@@ -304,9 +294,6 @@ void __init free_unused_pacas(void)
paca_ptrs[boot_cpuid]->slb_shadow_ptr = NULL;
}
 #endif
-
-   printk(KERN_DEBUG "Allocated %u bytes for %u pacas\n",
-   paca_ptrs_size + paca_struct_size, nr_cpu_ids);
 }
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
-- 
2.31.1



[PATCHv10 0/3] enable nr_cpus for powerpc without re-ordering cpu number

2023-12-26 Thread Pingfan Liu
From: Pingfan Liu 

This series addresses the nr_cpus issue for PowerPC without re-ordering
cpu numbers. To save the memory used by the percpu area, it also limits the
number of possible cpus by allowing a hole in cpu_possible_mask.

Because the last cpu number can be bigger than nr_cpu_ids in this scheme,
some pointer arrays indexed by cpu must be extended to hold the
pointer, e.g. paca_ptrs.

Please note that this series still has an issue (some cpus cannot be
brought up). Before I resolve it, please share your thoughts about the
approach.

Thanks


Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (3):
  powerpc/kernel: Remove check on paca_ptrs_size
  powerpc/kernel: Extend arrays' size to make room for a hole in
cpu_possible_mask
  powerpc/smp: Allow hole in paca_ptrs to accommodate boot_cpu

 arch/powerpc/include/asm/paca.h|  2 ++
 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/paca.c | 24 +++-
 arch/powerpc/kernel/prom.c |  6 ++
 arch/powerpc/kernel/setup-common.c | 26 +-
 arch/powerpc/kernel/smp.c  |  3 ++-
 6 files changed, 39 insertions(+), 23 deletions(-)

-- 
2.31.1



Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-11-28 Thread Pingfan Liu
Hi Hari,


On Mon, Nov 27, 2023 at 12:30 PM Hari Bathini  wrote:
>
> Hi Pingfan, Michael,
>
> On 17/10/23 4:03 pm, Hari Bathini wrote:
> >
> >
> > On 17/10/23 7:58 am, Pingfan Liu wrote:
> >> *** Idea ***
> > >> For kexec -p, the boot cpu may not be cpu0, which causes a problem when
> > >> allocating memory for paca_ptrs[]. However, in theory, there is no
> > >> requirement that a cpu's logical id match its order of appearance in the
> > >> device tree. But helpers such as cpu_first_thread_sibling() make
> > >> assumptions about the mapping inside a core. Hence this patch loosens the
> > >> mapping only partially: it unbinds the mapping of cores while keeping the
> > >> mapping inside a core.
> > >>
> > >> *** Implement ***
> > >> At this early stage, there is plenty of memory to utilize. Hence, this
> > >> patch allocates interim memory to link the cpu info on a list, then
> > >> reorders the cpus by changing the list head. As a result, there is a rotate
> > >> shift between the sequence number in dt and the cpu logical number.
> > >>
> > >> *** Result ***
> > >> After this patch, the boot cpu's logical id will always be mapped into the
> > >> range [0,threads_per_core).
> > >>
> > >> Besides this, at this phase, all threads in the boot core are forced to
> > >> be onlined. This restriction will be lifted in a later patch with
> > >> extra effort.
> >>
> >> Signed-off-by: Pingfan Liu 
> >> Cc: Michael Ellerman 
> >> Cc: Nicholas Piggin 
> >> Cc: Christophe Leroy 
> >> Cc: Mahesh Salgaonkar 
> >> Cc: Wen Xiong 
> >> Cc: Baoquan He 
> >> Cc: Ming Lei 
> >> Cc: Sourabh Jain 
> >> Cc: Hari Bathini 
> >> Cc: ke...@lists.infradead.org
> >> To: linuxppc-dev@lists.ozlabs.org
> >
> > Thanks for working on this, Pingfan.
> > Looks good to me.
> >
> > Acked-by: Hari Bathini 
> >
>
> > On second thoughts, it is probably better to have no impact in the
> > bootcpu < nr_cpu_ids case and to change only two cores' logical
> > numbering otherwise. Something like the below (Please share
> > your thoughts):
>

I am afraid that it may not be as ideal as it looks, considering the
following factors:
-1. For the case of 'bootcpu < nr_cpu_ids', a crash can happen with equal
probability on any cpu in the system, which seriously undermines the
protection intended here (under the most optimistic scenario, there is
a 50% chance of success).

-2. As for the re-ordering of logical numbering, IMHO, if there is a
concern that re-ordering will break something, a partial re-ordering
cannot avoid that either.  We ought to identify the probable hazards so
as to ease the worries.


Thanks,

Pingfan

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index ec82f5bda908..78a8312aa8c4 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -76,7 +76,9 @@ u64 ppc64_rma_size;
>   unsigned int boot_cpu_node_count __ro_after_init;
>   #endif
>   static phys_addr_t first_memblock_size;
> +#ifdef CONFIG_SMP
>   static int __initdata boot_cpu_count;
> +#endif
>
>   static int __init early_parse_mem(char *p)
>   {
> @@ -357,6 +359,25 @@ static int __init early_init_dt_scan_cpus(unsigned
> long node,
> fdt_boot_cpuid_phys(initial_boot_params)) {
> found = boot_cpu_count;
> found_thread = i;
> +   /*
> +* Map boot-cpu logical id into the range
> +* of [0, thread_per_core) if it can't be
> +* accommodated within nr_cpu_ids.
> +*/
> > +   if (i != boot_cpu_count && boot_cpu_count >= nr_cpu_ids) {
> > +   boot_cpuid = i;
> > +   DBG("Logical CPU number for boot CPU changed from %d to %d\n",
> > +   boot_cpu_count, i);
> > +   } else {
> > +   boot_cpuid = boot_cpu_count;
> > +   }
> > +
> > +   /* Ensure boot thread is accounted for in nr_cpu_ids */
> > +   if (boot_cpuid >= nr_cpu_ids) {
> > +   set_nr_cpu_ids(boot_cpuid + 1);
> > +   DBG("Adjusted nr_cpu_ids to %u, to include boot CPU.\n",
> > +   nr_cpu_ids);
> > +   }
> }
>   #ifdef CONFIG_SMP
> /* logical cpu id is always 0 on UP kernels */
> @@ -368,9 +389,8 @@ static int __ini

Re: [PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-10-18 Thread Pingfan Liu
On Tue, Oct 17, 2023 at 6:39 PM Hari Bathini  wrote:
>
>
>
> On 17/10/23 7:58 am, Pingfan Liu wrote:
> > *** Idea ***
> > > For kexec -p, the boot cpu may not be cpu0, which causes a problem when
> > > allocating memory for paca_ptrs[]. However, in theory, there is no
> > > requirement that a cpu's logical id match its order of appearance in the
> > > device tree. But helpers such as cpu_first_thread_sibling() make
> > > assumptions about the mapping inside a core. Hence this patch loosens the
> > > mapping only partially: it unbinds the mapping of cores while keeping the
> > > mapping inside a core.
> > >
> > > *** Implement ***
> > > At this early stage, there is plenty of memory to utilize. Hence, this
> > > patch allocates interim memory to link the cpu info on a list, then
> > > reorders the cpus by changing the list head. As a result, there is a rotate
> > > shift between the sequence number in dt and the cpu logical number.
> > >
> > > *** Result ***
> > > After this patch, the boot cpu's logical id will always be mapped into the
> > > range [0,threads_per_core).
> > >
> > > Besides this, at this phase, all threads in the boot core are forced to
> > > be onlined. This restriction will be lifted in a later patch with
> > > extra effort.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Michael Ellerman 
> > Cc: Nicholas Piggin 
> > Cc: Christophe Leroy 
> > Cc: Mahesh Salgaonkar 
> > Cc: Wen Xiong 
> > Cc: Baoquan He 
> > Cc: Ming Lei 
> > Cc: Sourabh Jain 
> > Cc: Hari Bathini 
> > Cc: ke...@lists.infradead.org
> > To: linuxppc-dev@lists.ozlabs.org
>
> Thanks for working on this, Pingfan.
> Looks good to me.
>
> Acked-by: Hari Bathini 
>

Thank you for kindly reviewing. I hope that after all these years, we
have accomplished the objective.

Best Regards,

Pingfan



[PATCHv9 2/2] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-10-16 Thread Pingfan Liu
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. However, in theory, there is no
requirement that a cpu's logical id match its order of appearance in the
device tree. But helpers such as cpu_first_thread_sibling() make
assumptions about the mapping inside a core. Hence this patch loosens the
mapping only partially: it unbinds the mapping of cores while keeping the
mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in dt and the cpu logical number.

*** Result ***
After this patch, the boot cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
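The rotate shift described above can also be written as plain
arithmetic. The patch itself relinks a list of cores; the function
below is only an equivalent sketch under two assumptions that are not
taken from the patch: every core has exactly tpc threads, and the
device-tree sequence number is core * tpc + thread. The name
rotate_logical_id() is hypothetical.

```c
/*
 * Hypothetical helper: map a device-tree sequence number to a logical
 * cpu id by rotating the cores so that the boot core becomes core 0.
 * The thread position inside a core is preserved.
 */
int rotate_logical_id(int seq, int boot_seq, int tpc, int ncores)
{
	int core = seq / tpc;
	int thread = seq % tpc;
	int boot_core = boot_seq / tpc;
	int logical_core = (core - boot_core + ncores) % ncores;

	return logical_core * tpc + thread;
}
```

With 10 cores of 8 threads and the boot cpu at sequence number 58, the
boot cpu gets logical id 2, the core after it starts at 8, and the
core that used to be core 0 starts at 24.
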
 arch/powerpc/kernel/prom.c | 25 +
 arch/powerpc/kernel/setup-common.c | 84 +++---
 2 files changed, 82 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index ec82f5bda908..7ed9034912ca 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
const __be32 *intserv;
int i, nthreads;
int len;
-   int found = -1;
-   int found_thread = 0;
+   bool found = false;
 
/* We are scanning "cpu" nodes only */
if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
for (i = 0; i < nthreads; i++) {
if (be32_to_cpu(intserv[i]) ==
fdt_boot_cpuid_phys(initial_boot_params)) {
-   found = boot_cpu_count;
-   found_thread = i;
+   /*
+* always map the boot-cpu logical id into the
+* range of [0, thread_per_core)
+*/
+   boot_cpuid = i;
+   found = true;
+   /* This forces all threads in a core to be online */
+   if (nr_cpu_ids % nthreads != 0)
+   set_nr_cpu_ids(ALIGN(nr_cpu_ids, nthreads));
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
@@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
}
 
/* Not the boot CPU */
-   if (found < 0)
+   if (!found)
return 0;
 
-   DBG("boot cpu: logical %d physical %d\n", found,
-   be32_to_cpu(intserv[found_thread]));
-   boot_cpuid = found;
+   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+   be32_to_cpu(intserv[boot_cpuid]));
 
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 707f0490639d..9802c7e5ee2f 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -425,6 +426,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+   struct list_head node;
+   boolavail;
+   int len;
+   __be32 intserv[];
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *  cpu_possible_mask
@@ -446,11 +454,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
-   int nthreads = 1;
+   int shift = 0, cpu = 0;
+   int j, nthreads = 1;
+   int len;
+   struct interrupt_server_node *intserv_node, *n;
+   struct list_head *bt_node, head;
+   bool avail, found_boot_cpu = false;
 
DBG("smp_setup_cpu_maps()\n");
 
+   INIT_LIST_HEAD(&head);
cpu_to_phys_id = memblock_alloc(nr_cpu_ids 

[PATCHv9 1/2] powerpc/setup : Enable boot_cpu_hwid for PPC32

2023-10-16 Thread Pingfan Liu
In order to identify the boot cpu, its intserv[] should be recorded and
checked in smp_setup_cpu_maps().

smp_setup_cpu_maps() is shared between PPC64 and PPC32. Since PPC64
already uses boot_cpu_hwid to carry that information, enable this
variable on PPC32 as well, so that the coming patch can use it to carry
the same information for PPC32.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/smp.h | 2 +-
 arch/powerpc/kernel/prom.c | 3 +--
 arch/powerpc/kernel/setup-common.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..5db9178cc800 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -26,7 +26,7 @@
 #include 
 
 extern int boot_cpuid;
-extern int boot_cpu_hwid; /* PPC64 only */
+extern int boot_cpu_hwid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..ec82f5bda908 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -372,8 +372,7 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
be32_to_cpu(intserv[found_thread]));
boot_cpuid = found;
 
-   if (IS_ENABLED(CONFIG_PPC64))
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2f1026fba00d..707f0490639d 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -87,9 +87,7 @@ EXPORT_SYMBOL(machine_id);
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
 
-#ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
-#endif
 
 /*
  * These are used in binfmt_elf.c to put aux entries on the stack
-- 
2.31.1



[PATCHv9 0/2] enable nr_cpus for powerpc

2023-10-16 Thread Pingfan Liu
From: Pingfan Liu 


Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and is now indexed through paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
non-present cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. These two patches rotate-shift
the cpu's sequence number in the device tree to obtain the logical cpu
id.


[1]: https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/
---
v8 -> v9
  put aside [3-5/5] in v8 for the time being, which complicates the code.
  optimize out some unnecessary initialization according to Hari's
suggestion

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: Sourabh Jain 
Cc: Hari Bathini 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (2):
  powerpc/setup : Enable boot_cpu_hwid for PPC32
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
in dt

 arch/powerpc/include/asm/smp.h |  2 +-
 arch/powerpc/kernel/prom.c | 26 +
 arch/powerpc/kernel/setup-common.c | 86 +++---
 3 files changed, 83 insertions(+), 31 deletions(-)

-- 
2.31.1



Re: [PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32

2023-10-16 Thread Pingfan Liu
On Mon, Oct 16, 2023 at 12:13:53PM +0530, Sourabh Jain wrote:
> Hello Pingfan,
> 
> > > > > > With this patch series applied, the kdump kernel fails to boot on
> > > > > > powerpc with nr_cpus=1.
> > > > > > 
> > > > > > Console logs:
> > > > > > ---
> > > > > > [root]# echo c > /proc/sysrq-trigger
> > > > > > [   74.783235] sysrq: Trigger a crash
> > > > > > [   74.783244] Kernel panic - not syncing: sysrq triggered crash
> > > > > > > [   74.783252] CPU: 58 PID: 3838 Comm: bash Kdump: loaded Not tainted 6.6.0-rc5pf-nr-cpus+ #3
> > > > > > > [   74.783259] Hardware name: POWER10 (raw) phyp pSeries
> > > > > > > [   74.783275] Call Trace:
> > > > > > > [   74.783280] [c0020f4ebac0] [c0ed9f38] dump_stack_lvl+0x6c/0x9c (unreliable)
> > > > > > > [   74.783291] [c0020f4ebaf0] [c0150300] panic+0x178/0x438
> > > > > > > [   74.783298] [c0020f4ebb90] [c0936d48] sysrq_handle_crash+0x28/0x30
> > > > > > > [   74.783304] [c0020f4ebbf0] [c093773c] __handle_sysrq+0x10c/0x250
> > > > > > > [   74.783309] [c0020f4ebc90] [c0937fa8] write_sysrq_trigger+0xc8/0x168
> > > > > > > [   74.783314] [c0020f4ebcd0] [c0665d8c] proc_reg_write+0x10c/0x1b0
> > > > > > > [   74.783321] [c0020f4ebd00] [c058da54] vfs_write+0x104/0x4b0
> > > > > > > [   74.783326] [c0020f4ebdc0] [c058dfdc] ksys_write+0x7c/0x140
> > > > > > > [   74.783331] [c0020f4ebe10] [c0033a64] system_call_exception+0x144/0x3a0
> > > > > > > [   74.783337] [c0020f4ebe50] [c000c554] system_call_common+0xf4/0x258
> > > > > > > [   74.783343] --- interrupt: c00 at 0x7fffa0721594
> > > > > > > [   74.783352] NIP:  7fffa0721594 LR: 7fffa0697bf4 CTR:
> > > > > > > [   74.783364] REGS: c0020f4ebe80 TRAP: 0c00   Not tainted (6.6.0-rc5pf-nr-cpus+)
> > > > > > > [   74.783376] MSR:  8280f033  CR: 2802  XER:
> > > > > > > [   74.783394] IRQMASK: 0
> > > > > > > [   74.783394] GPR00: 0004 7c4b6800 7fffa0807300 0001
> > > > > > > [   74.783394] GPR04: 00013549ea60 0002 0010
> > > > > > > [   74.783394] GPR08:
> > > > > > > [   74.783394] GPR12:  7fffa0abaf70 4000 00011a0f9798
> > > > > > > [   74.783394] GPR16: 00011a0f9724 00011a097688 00011a02ff70 00011a0fd568
> > > > > > > [   74.783394] GPR20: 000135554bf0 0001 00011a0aa478 7c4b6a24
> > > > > > > [   74.783394] GPR24: 7c4b6a20 00011a0faf94 0002 00013549ea60
> > > > > > > [   74.783394] GPR28: 0002 7fffa08017a0 00013549ea60 0002
> > > > > > [   74.783440] NIP [7fffa0721594] 0x7fffa0721594
> > > > > > [   74.783443] LR [7fffa0697bf4] 0x7fffa0697bf4
> > > > > > [   74.783447] --- interrupt: c00
> > > > > > I'm in purgatory
> > > > > > [0.00] radix-mmu: Page sizes from device-tree:
> > > > > > [0.00] radix-mmu: Page size shift = 12 AP=0x0
> > > > > > [0.00] radix-mmu: Page size shift = 16 AP=0x5
> > > > > > [0.00] radix-mmu: Page size shift = 21 AP=0x1
> > > > > > [0.00] radix-mmu: Page size shift = 30 AP=0x2
> > > > > > [0.00] Activating Kernel Userspace Access Prevention
> > > > > > [0.00] Activating Kernel Userspace Execution Prevention
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x-0x0001
> > > > > > with 64.0 KiB pages (exec)
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x0001-0x0020
> > > > > > with 64.0 KiB pages
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x0020-0x2000
> > > > > > with 2.00 MiB pages
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x2000-0x2260
> > > > > > with 2.00 MiB pages (exec)
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x2260-0x4000
> > > > > > with 2.00 MiB pages
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x4000-0x00018000
> > > > > > with 1.00 GiB pages
> > > > > > [0.00] radix-mmu: Mapped 
> > > > > > 0x00018000-0x0001a000
> > > > > > with 2.00 MiB pages
> > > > > > [0.00] lpar: Using radix MMU under hypervisor
> > > > > > [0.00] Linux version 6.6.0-rc5pf-nr-cpus+
> > > > > > (r...@ltcever7x0-lp1.aus.stglabs.ibm.com) (gcc (GCC) 8.5.0 20210514 
> > > > > > (Red
> > > > > > Hat 

Re: [PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32

2023-10-12 Thread Pingfan Liu
On Wed, Oct 11, 2023 at 6:53 PM Sourabh Jain  wrote:
>
> Hello Pingfan,
> >>> With this patch series applied, the kdump kernel fails to boot on
> >>> powerpc with nr_cpus=1.
> >>>
> >>> Console logs:
> >>> ---
> >>> [root]# echo c > /proc/sysrq-trigger
> >>> [   74.783235] sysrq: Trigger a crash
> >>> [   74.783244] Kernel panic - not syncing: sysrq triggered crash
> >>> [   74.783252] CPU: 58 PID: 3838 Comm: bash Kdump: loaded Not tainted
> >>> 6.6.0-rc5pf-nr-cpus+ #3
> >>> [   74.783259] Hardware name: POWER10 (raw) phyp pSeries
> >>> [   74.783275] Call Trace:
> >>> [   74.783280] [c0020f4ebac0] [c0ed9f38]
> >>> dump_stack_lvl+0x6c/0x9c (unreliable)
> >>> [   74.783291] [c0020f4ebaf0] [c0150300] panic+0x178/0x438
> >>> [   74.783298] [c0020f4ebb90] [c0936d48]
> >>> sysrq_handle_crash+0x28/0x30
> >>> [   74.783304] [c0020f4ebbf0] [c093773c]
> >>> __handle_sysrq+0x10c/0x250
> >>> [   74.783309] [c0020f4ebc90] [c0937fa8]
> >>> write_sysrq_trigger+0xc8/0x168
> >>> [   74.783314] [c0020f4ebcd0] [c0665d8c]
> >>> proc_reg_write+0x10c/0x1b0
> >>> [   74.783321] [c0020f4ebd00] [c058da54]
> >>> vfs_write+0x104/0x4b0
> >>> [   74.783326] [c0020f4ebdc0] [c058dfdc]
> >>> ksys_write+0x7c/0x140
> >>> [   74.783331] [c0020f4ebe10] [c0033a64]
> >>> system_call_exception+0x144/0x3a0
> >>> [   74.783337] [c0020f4ebe50] [c000c554]
> >>> system_call_common+0xf4/0x258
> >>> [   74.783343] --- interrupt: c00 at 0x7fffa0721594
> >>> [   74.783352] NIP:  7fffa0721594 LR: 7fffa0697bf4 CTR:
> >>> 
> >>> [   74.783364] REGS: c0020f4ebe80 TRAP: 0c00   Not tainted
> >>> (6.6.0-rc5pf-nr-cpus+)
> >>> [   74.783376] MSR:  8280f033
> >>>   CR: 2802  XER: 
> >>> [   74.783394] IRQMASK: 0
> >>> [   74.783394] GPR00: 0004 7c4b6800 7fffa0807300
> >>> 0001
> >>> [   74.783394] GPR04: 00013549ea60 0002 0010
> >>> 
> >>> [   74.783394] GPR08:   
> >>> 
> >>> [   74.783394] GPR12:  7fffa0abaf70 4000
> >>> 00011a0f9798
> >>> [   74.783394] GPR16: 00011a0f9724 00011a097688 00011a02ff70
> >>> 00011a0fd568
> >>> [   74.783394] GPR20: 000135554bf0 0001 00011a0aa478
> >>> 7c4b6a24
> >>> [   74.783394] GPR24: 7c4b6a20 00011a0faf94 0002
> >>> 00013549ea60
> >>> [   74.783394] GPR28: 0002 7fffa08017a0 00013549ea60
> >>> 0002
> >>> [   74.783440] NIP [7fffa0721594] 0x7fffa0721594
> >>> [   74.783443] LR [7fffa0697bf4] 0x7fffa0697bf4
> >>> [   74.783447] --- interrupt: c00
> >>> I'm in purgatory
> >>> [0.00] radix-mmu: Page sizes from device-tree:
> >>> [0.00] radix-mmu: Page size shift = 12 AP=0x0
> >>> [0.00] radix-mmu: Page size shift = 16 AP=0x5
> >>> [0.00] radix-mmu: Page size shift = 21 AP=0x1
> >>> [0.00] radix-mmu: Page size shift = 30 AP=0x2
> >>> [0.00] Activating Kernel Userspace Access Prevention
> >>> [0.00] Activating Kernel Userspace Execution Prevention
> >>> [0.00] radix-mmu: Mapped 0x-0x0001
> >>> with 64.0 KiB pages (exec)
> >>> [0.00] radix-mmu: Mapped 0x0001-0x0020
> >>> with 64.0 KiB pages
> >>> [0.00] radix-mmu: Mapped 0x0020-0x2000
> >>> with 2.00 MiB pages
> >>> [0.00] radix-mmu: Mapped 0x2000-0x2260
> >>> with 2.00 MiB pages (exec)
> >>> [0.00] radix-mmu: Mapped 0x2260-0x4000
> >>> with 2.00 MiB pages
> >>> [0.00] radix-mmu: Mapped 0x4000-0x00018000
> >>> with 1.00 GiB pages
> >>> [0.00] radix-mmu: Mapped 0x00018000-0x0001a000
> >>> with 2.00 MiB pages
> >>> [0.00] lpar: Using radix MMU under hypervisor
> >>> [0.00] Linux version 6.6.0-rc5pf-nr-cpus+
> >>> (r...@ltcever7x0-lp1.aus.stglabs.ibm.com) (gcc (GCC) 8.5.0 20210514 (Red
> >>> Hat 8.5.0-20), GNU ld version 2.30-123.el8) #3 SMP Mon Oct  9 11:07:
> >>> 41 CDT 2023
> >>> [0.00] Found initrd at 0xc00022e6:0xc000248f08d8
> >>> [0.00] Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200
> >>> 0xf06 of:IBM,FW1060.00 (NM1060_016) hv:phyp pSeries
> >>> [0.00] printk: bootconsole [udbg0] enabled
> >>> [0.00] the round shift between dt seq and the cpu logic number:
> >>> 56
> >>> [0.00] BUG: Unable to handle kernel data access on write at
> >>> 0xc001a000
> >>> [0.00] Faulting instruction address: 0xc00022009c64
> >>> [0.00] Oops: Kernel access of bad area, sig: 11 [#1]
> >>> [0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> >>> 

Re: [PATCHv8 2/5] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-10-10 Thread Pingfan Liu
On Tue, Oct 10, 2023 at 04:07:00PM +0530, Hari Bathini wrote:
> 
> 
> On 09/10/23 5:00 pm, Pingfan Liu wrote:
> > *** Idea ***
> > For kexec -p, the boot cpu may not be cpu0, which causes a problem when
> > allocating memory for paca_ptrs[]. In theory, nothing requires a cpu's
> > logical id to follow the order in which it is presented in the device
> > tree. However, helpers such as cpu_first_thread_sibling() make
> > assumptions about the mapping inside a core. Hence loosen the mapping
> > only partially, i.e. unbind the mapping between cores while keeping the
> > mapping inside a core.
> > 
> > *** Implement ***
> > At this early stage, there is plenty of memory to use. Hence, this
> > patch allocates interim memory to link the cpu info on a list, then
> > reorders the cpus by changing the list head. As a result, there is a
> > rotate shift between the sequence number in dt and the cpu logical number.
> > 
> > *** Result ***
> > After this patch, a boot-cpu's logical id will always be mapped into the
> > range [0,threads_per_core).
> > 
> > Besides this, at this phase, all threads in the boot core are forced to
> > be onlined. This restriction will be lifted in a later patch with
> > extra effort.
> > 
> > Signed-off-by: Pingfan Liu 
> > Cc: Michael Ellerman 
> > Cc: Nicholas Piggin 
> > Cc: Christophe Leroy 
> > Cc: Mahesh Salgaonkar 
> > Cc: Wen Xiong 
> > Cc: Baoquan He 
> > Cc: Ming Lei 
> > Cc: ke...@lists.infradead.org
> > To: linuxppc-dev@lists.ozlabs.org
> > ---
> >   arch/powerpc/kernel/prom.c | 25 +
> >   arch/powerpc/kernel/setup-common.c | 87 +++---
> >   2 files changed, 85 insertions(+), 27 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> > index ec82f5bda908..87272a2d8c10 100644
> > --- a/arch/powerpc/kernel/prom.c
> > +++ b/arch/powerpc/kernel/prom.c
> > @@ -76,7 +76,9 @@ u64 ppc64_rma_size;
> >   unsigned int boot_cpu_node_count __ro_after_init;
> >   #endif
> >   static phys_addr_t first_memblock_size;
> > +#ifdef CONFIG_SMP
> >   static int __initdata boot_cpu_count;
> > +#endif
> >   static int __init early_parse_mem(char *p)
> >   {
> > @@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
> > node,
> > const __be32 *intserv;
> > int i, nthreads;
> > int len;
> > -   int found = -1;
> > -   int found_thread = 0;
> > +   bool found = false;
> > /* We are scanning "cpu" nodes only */
> > if (type == NULL || strcmp(type, "cpu") != 0)
> > @@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> > for (i = 0; i < nthreads; i++) {
> > if (be32_to_cpu(intserv[i]) ==
> > fdt_boot_cpuid_phys(initial_boot_params)) {
> > -   found = boot_cpu_count;
> > -   found_thread = i;
> > +   /*
> > +* always map the boot-cpu logical id into the
> > +* range of [0, thread_per_core)
> > +*/
> > +   boot_cpuid = i;
> > +   found = true;
> > +   /* This works around the hole in paca_ptrs[]. */
> > +   if (nr_cpu_ids < nthreads)
> > +   set_nr_cpu_ids(nthreads);
> > }
> >   #ifdef CONFIG_SMP
> > /* logical cpu id is always 0 on UP kernels */
> > @@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned 
> > long node,
> > }
> > /* Not the boot CPU */
> > -   if (found < 0)
> > +   if (!found)
> > return 0;
> > -   DBG("boot cpu: logical %d physical %d\n", found,
> > -   be32_to_cpu(intserv[found_thread]));
> > -   boot_cpuid = found;
> > +   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
> > +   be32_to_cpu(intserv[boot_cpuid]));
> > -   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
> > +   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
> > /*
> >  * PAPR defines "logical" PVR values for cpus that
> > diff --git a/arch/powerpc/kernel/setup-common.c 
> > b/arch/powerpc/kernel/setup-common.c
> > index 1b19a9815672..81291e13dec0 100644
> > --- a/arch/powerpc/kernel/setup-common.c
> > +++ b/arch/powerpc/kernel/setup-common.c
> > @@ -36,6 +36,7 @@
> >   #includ

Re: [PATCHv8 3/5] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus

2023-10-10 Thread Pingfan Liu
On Tue, Oct 10, 2023 at 01:56:13PM +0530, Hari Bathini wrote:
> 
> 
> On 09/10/23 5:00 pm, Pingfan Liu wrote:
> > If the boot_cpuid is not smaller than nr_cpus, extra effort is required
> > to ensure the boot cpu is in cpu_present_mask. This is achieved by
> > reserving the last slot of the nr_cpus quota for the boot cpu.
> > 
> > Note: the restriction on nr_cpus will be lifted with more effort in the
> > successive patches
> > 
> > Signed-off-by: Pingfan Liu 
> > Cc: Michael Ellerman 
> > Cc: Nicholas Piggin 
> > Cc: Christophe Leroy 
> > Cc: Mahesh Salgaonkar 
> > Cc: Wen Xiong 
> > Cc: Baoquan He 
> > Cc: Ming Lei 
> > Cc: ke...@lists.infradead.org
> > To: linuxppc-dev@lists.ozlabs.org
> > ---
> >   arch/powerpc/kernel/setup-common.c | 25 ++---
> >   1 file changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/setup-common.c 
> > b/arch/powerpc/kernel/setup-common.c
> > index 81291e13dec0..f9ef0a2666b0 100644
> > --- a/arch/powerpc/kernel/setup-common.c
> > +++ b/arch/powerpc/kernel/setup-common.c
> > @@ -454,8 +454,8 @@ struct interrupt_server_node {
> >   void __init smp_setup_cpu_maps(void)
> >   {
> > struct device_node *dn;
> > -   int shift = 0, cpu = 0;
> > -   int j, nthreads = 1;
> > +   int terminate, shift = 0, cpu = 0;
> > +   int j, bt_thread = 0, nthreads = 1;
> > int len;
> > struct interrupt_server_node *intserv_node, *n;
> > struct list_head *bt_node, head;
> > @@ -518,6 +518,7 @@ void __init smp_setup_cpu_maps(void)
> > for (j = 0 ; j < nthreads; j++) {
> > if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
> > bt_node = &intserv_node->node;
> > +   bt_thread = j;
> > found_boot_cpu = true;
> > /*
> >  * Record the round-shift between dt
> > @@ -537,11 +538,21 @@ void __init smp_setup_cpu_maps(void)
> > /* Select the primary thread, the boot cpu's slibing, as the logic 0 */
> > list_add_tail(&head, bt_node);
> > pr_info("the round shift between dt seq and the cpu logic number: 
> > %d\n", shift);
> > +   terminate = nr_cpu_ids;
> > list_for_each_entry(intserv_node, &head, node) {
> > +   j = 0;
> 
> > +   /* Choose a start point to cover the boot cpu */
> > +   if (nr_cpu_ids - 1 < bt_thread) {
> > +   /*
> > +* The processor core puts assumption on the thread id,
> > +* not to breach the assumption.
> > +*/
> > +   terminate = nr_cpu_ids - 1;
> 
> nthreads is anyway assumed to be same for all cores. So, enforcing
> nr_cpu_ids to a minimum of nthreads (and multiple of nthreads) should
> make the code much simpler without the need for above check and the
> other complexities addressed in the subsequent patches...
> 

Indeed, this series can be split into two parts, [1-2/5] and [3-5/5].
In [1-2/5], nr_cpu_ids is raised to nthreads if it is smaller. I will
make it align upward to a multiple of nthreads in the next version, so
[1-2/5] can be totally independent of the remaining patches in this
series.
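
To make the planned alignment concrete, here is a minimal sketch of the
rounding. It is not code from the series: align_nr_cpu_ids() and its
parameter names are hypothetical, and it assumes, as discussed above,
that nthreads is the same for every core.

```c
#include <assert.h>

/*
 * Hypothetical helper, not from the series: round nr_cpu_ids up to the
 * next multiple of nthreads so that only whole cores are covered.
 */
static unsigned int align_nr_cpu_ids(unsigned int nr_cpu_ids,
				     unsigned int nthreads)
{
	assert(nthreads > 0);
	return ((nr_cpu_ids + nthreads - 1) / nthreads) * nthreads;
}
```

With SMT=4, nr_cpus=1 would become 4 and nr_cpus=9 would become 12 under
this rounding, while a value that is already a multiple of nthreads is
left unchanged.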


From an engineer's perspective, [3-5/5] are added to preserve the
nr_cpus semantics. (Eventually nr_cpus=1 can be achieved, but that
requires work in other subsystems.)


Test results on my Power9 machine with SMT=4:

-1. taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger'

kdump:/# cat /proc/meminfo | grep Percpu
Percpu:  896 kB
kdump:/# cat /sys/devices/system/cpu/possible
0


-2. taskset -c 5 bash -c 'echo c > /proc/sysrq-trigger'

kdump:/# cat /proc/meminfo | grep Percpu
Percpu: 1792 kB
kdump:/# cat /sys/devices/system/cpu/possible
0-1



-3. taskset -c 6 bash -c 'echo c > /proc/sysrq-trigger'

kdump:/# cat /proc/meminfo | grep Percpu
Percpu: 1792 kB
kdump:/# cat /sys/devices/system/cpu/possible
0,2


-4. taskset -c 7 bash -c 'echo c > /proc/sysrq-trigger'

kdump:/# cat /proc/meminfo | grep Percpu
Percpu: 1792 kB
kdump:/# cat /sys/devices/system/cpu/possible
0,3
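
The four results above are consistent with a simple model of what 3/5
and 5/5 do to the possible mask. The sketch below is only that model,
not the kernel code. It assumes the kdump kernel was booted with
nr_cpus=1 (not stated above), it ignores the per-core nthreads bound of
the real loop, and possible_mask() with its parameters is a hypothetical
name.

```c
#include <assert.h>

/*
 * Simplified model, not kernel code. bt_thread is the boot cpu's thread
 * index inside its core; the return value has bit n set when logical
 * cpu n would be marked possible.
 */
static unsigned int possible_mask(int nr_cpu_ids, int bt_thread)
{
	unsigned int mask = 0;
	int terminate, cpu;

	assert(nr_cpu_ids >= 1 && bt_thread >= 0 && bt_thread < 32);

	/* 5/5: keep cpu0 available when the boot cpu is not cpu0 */
	if (bt_thread != 0 && nr_cpu_ids < 2)
		nr_cpu_ids = 2;

	/* 3/5: reserve the last slot if the boot cpu would not fit */
	terminate = nr_cpu_ids;
	if (nr_cpu_ids - 1 < bt_thread)
		terminate = nr_cpu_ids - 1;

	for (cpu = 0; cpu < terminate && cpu < 32; cpu++)
		mask |= 1u << cpu;

	/* 3/5: the boot cpu takes the reserved slot */
	if (nr_cpu_ids - 1 < bt_thread)
		mask |= 1u << bt_thread;

	return mask;
}
```

With nr_cpu_ids = 1, thread indexes 0 to 3 (cpus 4 to 7 on SMT=4) give
the masks 0x1, 0x3, 0x5 and 0x9, which correspond to the possible lists
0, 0-1, 0,2 and 0,3 shown above.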


Thanks,
Pingfan





Re: [PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32

2023-10-10 Thread Pingfan Liu
On Tue, Oct 10, 2023 at 02:38:40PM +0530, Sourabh Jain wrote:
> Hello Pingfan,
> 
> > 
> > With this patch series applied, the kdump kernel fails to boot on
> > powerpc with nr_cpus=1.
> > 
> > Console logs:
> > ---
> > [root]# echo c > /proc/sysrq-trigger
> > [   74.783235] sysrq: Trigger a crash
> > [   74.783244] Kernel panic - not syncing: sysrq triggered crash
> > [   74.783252] CPU: 58 PID: 3838 Comm: bash Kdump: loaded Not tainted
> > 6.6.0-rc5pf-nr-cpus+ #3
> > [   74.783259] Hardware name: POWER10 (raw) phyp pSeries
> > [   74.783275] Call Trace:
> > [   74.783280] [c0020f4ebac0] [c0ed9f38]
> > dump_stack_lvl+0x6c/0x9c (unreliable)
> > [   74.783291] [c0020f4ebaf0] [c0150300] panic+0x178/0x438
> > [   74.783298] [c0020f4ebb90] [c0936d48]
> > sysrq_handle_crash+0x28/0x30
> > [   74.783304] [c0020f4ebbf0] [c093773c]
> > __handle_sysrq+0x10c/0x250
> > [   74.783309] [c0020f4ebc90] [c0937fa8]
> > write_sysrq_trigger+0xc8/0x168
> > [   74.783314] [c0020f4ebcd0] [c0665d8c]
> > proc_reg_write+0x10c/0x1b0
> > [   74.783321] [c0020f4ebd00] [c058da54]
> > vfs_write+0x104/0x4b0
> > [   74.783326] [c0020f4ebdc0] [c058dfdc]
> > ksys_write+0x7c/0x140
> > [   74.783331] [c0020f4ebe10] [c0033a64]
> > system_call_exception+0x144/0x3a0
> > [   74.783337] [c0020f4ebe50] [c000c554]
> > system_call_common+0xf4/0x258
> > [   74.783343] --- interrupt: c00 at 0x7fffa0721594
> > [   74.783352] NIP:  7fffa0721594 LR: 7fffa0697bf4 CTR:
> > 
> > [   74.783364] REGS: c0020f4ebe80 TRAP: 0c00   Not tainted
> > (6.6.0-rc5pf-nr-cpus+)
> > [   74.783376] MSR:  8280f033
> >   CR: 2802  XER: 
> > [   74.783394] IRQMASK: 0
> > [   74.783394] GPR00: 0004 7c4b6800 7fffa0807300
> > 0001
> > [   74.783394] GPR04: 00013549ea60 0002 0010
> > 
> > [   74.783394] GPR08:   
> > 
> > [   74.783394] GPR12:  7fffa0abaf70 4000
> > 00011a0f9798
> > [   74.783394] GPR16: 00011a0f9724 00011a097688 00011a02ff70
> > 00011a0fd568
> > [   74.783394] GPR20: 000135554bf0 0001 00011a0aa478
> > 7c4b6a24
> > [   74.783394] GPR24: 7c4b6a20 00011a0faf94 0002
> > 00013549ea60
> > [   74.783394] GPR28: 0002 7fffa08017a0 00013549ea60
> > 0002
> > [   74.783440] NIP [7fffa0721594] 0x7fffa0721594
> > [   74.783443] LR [7fffa0697bf4] 0x7fffa0697bf4
> > [   74.783447] --- interrupt: c00
> > I'm in purgatory
> > [    0.00] radix-mmu: Page sizes from device-tree:
> > [    0.00] radix-mmu: Page size shift = 12 AP=0x0
> > [    0.00] radix-mmu: Page size shift = 16 AP=0x5
> > [    0.00] radix-mmu: Page size shift = 21 AP=0x1
> > [    0.00] radix-mmu: Page size shift = 30 AP=0x2
> > [    0.00] Activating Kernel Userspace Access Prevention
> > [    0.00] Activating Kernel Userspace Execution Prevention
> > [    0.00] radix-mmu: Mapped 0x-0x0001
> > with 64.0 KiB pages (exec)
> > [    0.00] radix-mmu: Mapped 0x0001-0x0020
> > with 64.0 KiB pages
> > [    0.00] radix-mmu: Mapped 0x0020-0x2000
> > with 2.00 MiB pages
> > [    0.00] radix-mmu: Mapped 0x2000-0x2260
> > with 2.00 MiB pages (exec)
> > [    0.00] radix-mmu: Mapped 0x2260-0x4000
> > with 2.00 MiB pages
> > [    0.00] radix-mmu: Mapped 0x4000-0x00018000
> > with 1.00 GiB pages
> > [    0.00] radix-mmu: Mapped 0x00018000-0x0001a000
> > with 2.00 MiB pages
> > [    0.00] lpar: Using radix MMU under hypervisor
> > [    0.00] Linux version 6.6.0-rc5pf-nr-cpus+
> > (r...@ltcever7x0-lp1.aus.stglabs.ibm.com) (gcc (GCC) 8.5.0 20210514 (Red
> > Hat 8.5.0-20), GNU ld version 2.30-123.el8) #3 SMP Mon Oct  9 11:07:
> > 41 CDT 2023
> > [    0.00] Found initrd at 0xc00022e6:0xc000248f08d8
> > [    0.00] Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200
> > 0xf06 of:IBM,FW1060.00 (NM1060_016) hv:phyp pSeries
> > [    0.00] printk: bootconsole [udbg0] enabled
> > [    0.00] the round shift between dt seq and the cpu logic number:
> > 56
> > [    0.00] BUG: Unable to handle kernel data access on write at
> > 0xc001a000
> > [    0.00] Faulting instruction address: 0xc00022009c64
> > [    0.00] Oops: Kernel access of bad area, sig: 11 [#1]
> > [    0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > [    0.00] Modules linked in:
> > [    0.00] CPU: 2 PID: 0 Comm: swapper Not tainted
> > 6.6.0-rc5pf-nr-cpus+ #3
> > [    0.00] Hardware name:  POWER10 (raw)  

[PATCHv8 5/5] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

2023-10-09 Thread Pingfan Liu
paca_ptrs[] must be large enough to hold an entry for boot_cpuid, so its
size is set to the larger of boot_cpuid + 1 and nr_cpu_ids.

In addition, some kernel components assume cpu0 is available:
-1. The timer code assumes cpu0 is online, because the TIMER_CPUMASK
    subfield of timer_list->flags is zero unless it is initialized to a
    proper present cpu.
-2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two, so that cpu0 is onlined when the
boot cpu is not cpu0.

Result:
With nr_cpus=1, after taskset -c 14 bash -c 'echo c > /proc/sysrq-trigger'
the kdump kernel brings up two cpus, whereas after
taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger'
it brings up one cpu.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_nr_cpu_ids = nr_cpu_ids;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   paca_nr_cpu_ids = n;
+   paca_ptrs_size = sizeof(struct paca_struct *) * n;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
int new_ptrs_size;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   new_ptrs_size = sizeof(struct paca_struct *) * n;
if (new_ptrs_size < paca_ptrs_size)
memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
   paca_ptrs_size - new_ptrs_size);
 
-   paca_nr_cpu_ids = nr_cpu_ids;
+   paca_nr_cpu_ids = n;
paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 87272a2d8c10..15c994f54bf9 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
 */
boot_cpuid = i;
found = true;
-   /* This works around the hole in paca_ptrs[]. */
-   if (nr_cpu_ids < nthreads)
-   set_nr_cpu_ids(nthreads);
+   /*
+* Ideally, nr_cpus=1 can be achieved if each kernel
+* component does not assume cpu0 is onlined.
+*/
+   if (boot_cpuid != 0 && nr_cpu_ids < 2)
+   set_nr_cpu_ids(2);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1



[PATCHv8 4/5] powerpc/cpu: Skip impossible cpu during iteration on a core

2023-10-09 Thread Pingfan Liu
The threads in a core have equal status, so existing code uses a
for-loop pattern to execute the same task on each thread:
for (i = first_thread; i < first_thread + threads_per_core; i++)

Now that some threads may not be in the cpu_possible_mask, the iteration
skips those threads by checking the mask. In this way, the unpopulated
pcpu struct can be skipped and left unaccessed.
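
As a user-space illustration of the iteration, the sketch below
reproduces the shape of the new macro with stand-ins for the kernel
pieces. possible_bits, the local cpu_possible(), THREADS_PER_CORE and
count_possible_in_core() are assumptions made for this example only.
Note that the iterator holds the absolute cpu number, not a thread
offset, which is why the idle.c hunk switches to (thr - cpu0) for the
poke_threads bits.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 4	/* assumption for the example: SMT4 */

/* Stand-in for cpu_possible_mask: bit n set means cpu n is possible. */
static unsigned long possible_bits;

/* Stand-in for the kernel's cpu_possible(). */
static bool cpu_possible(int cpu)
{
	return possible_bits & (1UL << cpu);
}

/*
 * Same shape as the macro in the patch. The trailing "else" attaches
 * the caller's loop body to the "cpu is possible" branch.
 */
#define for_each_possible_cpu_in_core(start, iter)			\
	for (iter = start; iter < start + THREADS_PER_CORE; iter++)	\
		if (!cpu_possible(iter))				\
			continue;					\
		else

/* Count the possible threads of the core starting at first_thread. */
static int count_possible_in_core(int first_thread)
{
	int i, n = 0;

	assert(first_thread % THREADS_PER_CORE == 0);
	for_each_possible_cpu_in_core(first_thread, i)
		n++;
	return n;
}
```

For instance, with only cpus 0 and 2 possible (possible_bits = 0x5),
the body runs twice for the core starting at cpu 0 and never touches
the per-cpu data of cpus 1 and 3.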

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/cputhreads.h|  6 +
 arch/powerpc/kernel/smp.c|  2 +-
 arch/powerpc/kvm/book3s_hv.c |  7 ++
 arch/powerpc/platforms/powernv/idle.c| 32 
 arch/powerpc/platforms/powernv/subcore.c |  5 +++-
 5 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h 
b/arch/powerpc/include/asm/cputhreads.h
index f26c430f3982..fdb71ff7f6a9 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -65,6 +65,12 @@ static inline int cpu_last_thread_sibling(int cpu)
return cpu | (threads_per_core - 1);
 }
 
+#define for_each_possible_cpu_in_core(start, iter) \
+   for (iter = start; iter < start + threads_per_core; iter++) \
+   if (unlikely(!cpu_possible(iter)))  \
+   continue;   \
+   else
+
 /*
  * tlb_thread_siblings are siblings which share a TLB. This is not
  * architected, is not something a hypervisor could emulate and a future
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index fbbb695bae3d..2936f7a2240d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -933,7 +933,7 @@ static int __init 
update_mask_from_threadgroup(cpumask_var_t *mask, struct threa
 
zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
-   for (i = first_thread; i < first_thread + threads_per_core; i++) {
+   for_each_possible_cpu_in_core(first_thread, i) {
int i_group_start = get_cpu_thread_group_start(i, tg);
 
if (unlikely(i_group_start == -1)) {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..ff4b3f8affba 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6235,12 +6235,9 @@ static int kvm_init_subcore_bitmap(void)
return -ENOMEM;
 
 
-   for (j = 0; j < threads_per_core; j++) {
-   int cpu = first_cpu + j;
-
-   paca_ptrs[cpu]->sibling_subcore_state =
+   for_each_possible_cpu_in_core(first_cpu, j)
+   paca_ptrs[j]->sibling_subcore_state =
sibling_subcore_state;
-   }
}
return 0;
 }
diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index ad41dffe4d92..79d81ce5cf4c 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -823,36 +823,36 @@ void pnv_power9_force_smt4_catch(void)
 
cpu = smp_processor_id();
cpu0 = cpu & ~(threads_per_core - 1);
-   for (thr = 0; thr < threads_per_core; ++thr) {
-   if (cpu != cpu0 + thr)
-   atomic_inc(&paca_ptrs[cpu0+thr]->dont_stop);
+   for_each_possible_cpu_in_core(cpu0, thr) {
+   if (cpu != thr)
+   atomic_inc(&paca_ptrs[thr]->dont_stop);
}
/* order setting dont_stop vs testing requested_psscr */
smp_mb();
-   for (thr = 0; thr < threads_per_core; ++thr) {
-   if (!paca_ptrs[cpu0+thr]->requested_psscr)
+   for_each_possible_cpu_in_core(cpu0, thr) {
+   if (!paca_ptrs[thr]->requested_psscr)
++awake_threads;
else
-   poke_threads |= (1 << thr);
+   poke_threads |= (1 << (thr - cpu0));
}
 
/* If at least 3 threads are awake, the core is in SMT4 already */
if (awake_threads < need_awake) {
/* We have to wake some threads; we'll use msgsnd */
-   for (thr = 0; thr < threads_per_core; ++thr) {
-   if (poke_threads & (1 << thr)) {
+   for_each_possible_cpu_in_core(cpu0, thr) {
+   if (poke_threads & (1 << (thr - cpu0))) {
ppc_msgsnd_sync();
ppc_msgsnd(PPC_DBELL_MSGTYPE, 0,
-  paca_ptrs[cpu0+thr]->hw_cpu_id);
+  paca_ptrs[thr]->hw_cp

[PATCHv8 1/5] powerpc/setup : Enable boot_cpu_hwid for PPC32

2023-10-09 Thread Pingfan Liu
In order to identify the boot cpu, its intserv[] entry should be recorded
and checked in smp_setup_cpu_maps().

smp_setup_cpu_maps() is shared between PPC64 and PPC32. PPC64 already
uses boot_cpu_hwid to carry that information, so enable the variable on
PPC32 as well, so that a later patch can use it for the same purpose.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/smp.h | 2 +-
 arch/powerpc/kernel/prom.c | 3 +--
 arch/powerpc/kernel/setup-common.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..5db9178cc800 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -26,7 +26,7 @@
 #include 
 
 extern int boot_cpuid;
-extern int boot_cpu_hwid; /* PPC64 only */
+extern int boot_cpu_hwid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..ec82f5bda908 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -372,8 +372,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
be32_to_cpu(intserv[found_thread]));
boot_cpuid = found;
 
-   if (IS_ENABLED(CONFIG_PPC64))
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index d2a446216444..1b19a9815672 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -87,9 +87,7 @@ EXPORT_SYMBOL(machine_id);
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
 
-#ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
-#endif
 
 /*
  * These are used in binfmt_elf.c to put aux entries on the stack
-- 
2.31.1



[PATCHv8 3/5] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus

2023-10-09 Thread Pingfan Liu
If the boot_cpuid is not smaller than nr_cpus, extra effort is required
to ensure the boot cpu is in cpu_present_mask. This is achieved by
reserving the last slot of the nr_cpus quota for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
successive patches

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 81291e13dec0..f9ef0a2666b0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -454,8 +454,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int shift = 0, cpu = 0;
-   int j, nthreads = 1;
+   int terminate, shift = 0, cpu = 0;
+   int j, bt_thread = 0, nthreads = 1;
int len;
struct interrupt_server_node *intserv_node, *n;
struct list_head *bt_node, head;
@@ -518,6 +518,7 @@ void __init smp_setup_cpu_maps(void)
for (j = 0 ; j < nthreads; j++) {
if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
bt_node = &intserv_node->node;
+   bt_thread = j;
found_boot_cpu = true;
/*
 * Record the round-shift between dt
@@ -537,11 +538,21 @@ void __init smp_setup_cpu_maps(void)
/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
list_add_tail(&head, bt_node);
pr_info("the round shift between dt seq and the cpu logic number: 
%d\n", shift);
+   terminate = nr_cpu_ids;
list_for_each_entry(intserv_node, &head, node) {
 
+   j = 0;
+   /* Choose a start point to cover the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   /*
+* The processor core puts assumption on the thread id,
+* not to breach the assumption.
+*/
+   terminate = nr_cpu_ids - 1;
+   }
avail = intserv_node->avail;
nthreads = intserv_node->len / sizeof(int);
-   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+   for (; j < nthreads && cpu < terminate; j++) {
set_cpu_present(cpu, avail);
set_cpu_possible(cpu, true);
cpu_to_phys_id[cpu] = 
be32_to_cpu(intserv_node->intserv[j]);
@@ -549,6 +560,14 @@ void __init smp_setup_cpu_maps(void)
j, cpu, be32_to_cpu(intserv_node->intserv[j]));
cpu++;
}
+   /* Online the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   set_cpu_present(bt_thread, avail);
+   set_cpu_possible(bt_thread, true);
+   cpu_to_phys_id[bt_thread] = 
be32_to_cpu(intserv_node->intserv[bt_thread]);
+   DBG("thread %d -> cpu %d (hard id %d)\n",
+   bt_thread, bt_thread, 
be32_to_cpu(intserv_node->intserv[bt_thread]));
+   }
}
 
list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1



[PATCHv8 2/5] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-10-09 Thread Pingfan Liu
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. In theory, nothing requires a cpu's
logical id to follow the order in which it is presented in the device
tree. However, helpers such as cpu_first_thread_sibling() make
assumptions about the mapping inside a core. Hence loosen the mapping
only partially, i.e. unbind the mapping between cores while keeping the
mapping inside a core.

*** Implement ***
At this early stage, there are plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorder cpus by changing the list head. As a result, there is a rotate
shift between the sequence number in dt and the cpu logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.
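
For illustration only (not part of the patch): the rotate shift described
above can be sketched as plain arithmetic, assuming every core has the same
number of threads. The function name and all parameters below are
hypothetical and are not kernel APIs. The sketch only shows that whole cores
are rotated while the thread index inside a core is preserved.

```c
#include <assert.h>

/*
 * Illustrative only: map a thread's position in the device tree to a
 * logical cpu id when whole cores are rotated so that the boot core
 * comes first. All names here are hypothetical, not kernel APIs.
 *
 * dt_core:   index of the core in device-tree order
 * thread:    index of the thread inside its core (unchanged by rotation)
 * boot_core: device-tree index of the core that holds the boot cpu
 * ncores:    number of cores in the device tree
 * tpc:       threads per core (assumed identical for all cores)
 */
static int rotated_logical_cpu(int dt_core, int thread, int boot_core,
			       int ncores, int tpc)
{
	assert(ncores > 0 && tpc > 0);
	assert(dt_core >= 0 && dt_core < ncores);
	assert(boot_core >= 0 && boot_core < ncores);
	assert(thread >= 0 && thread < tpc);

	/* Rotate at core granularity; the thread offset is preserved. */
	int logical_core = (dt_core - boot_core + ncores) % ncores;

	return logical_core * tpc + thread;
}
```

With two cores of eight threads and the boot cpu in the second core, thread 4
of that core would get logical id 4, which lies in [0, threads_per_core).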

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c | 25 +
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index ec82f5bda908..87272a2d8c10 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
const __be32 *intserv;
int i, nthreads;
int len;
-   int found = -1;
-   int found_thread = 0;
+   bool found = false;
 
/* We are scanning "cpu" nodes only */
if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
for (i = 0; i < nthreads; i++) {
if (be32_to_cpu(intserv[i]) ==
fdt_boot_cpuid_phys(initial_boot_params)) {
-   found = boot_cpu_count;
-   found_thread = i;
+   /*
+* always map the boot-cpu logical id into the
+* range of [0, thread_per_core)
+*/
+   boot_cpuid = i;
+   found = true;
+   /* This works around the hole in paca_ptrs[]. */
+   if (nr_cpu_ids < nthreads)
+   set_nr_cpu_ids(nthreads);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
@@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
}
 
/* Not the boot CPU */
-   if (found < 0)
+   if (!found)
return 0;
 
-   DBG("boot cpu: logical %d physical %d\n", found,
-   be32_to_cpu(intserv[found_thread]));
-   boot_cpuid = found;
+   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+   be32_to_cpu(intserv[boot_cpuid]));
 
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 1b19a9815672..81291e13dec0 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -425,6 +426,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+   struct list_head node;
+   bool avail;
+   int len;
+   __be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *  cpu_possible_mask
@@ -446,11 +454,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
-   int nthreads = 1;
+   int shift = 0, cpu = 0;
+   int j, nthreads = 1;
+   int len;
+   struct interrupt_server_node *intserv_node, *n;
+   struct list_head *bt_node, head;
+   bool avail, found_boot_cpu = false;
 
DBG("smp_setup_cpu_maps()\n");
 
+   INIT_LIST_HEAD(&head);
cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
__alignof_

[PATCHv8 0/5] enable nr_cpus for powerpc

2023-10-09 Thread Pingfan Liu
Since my last v4 [1], the code has changed a great deal. The paca[]
array has been reorganized and is now indexed through paca_ptrs[], which
dramatically decreases memory consumption even if there are many
non-present cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [1-2/5], which
rotate-shift the cpu's sequence number in the device tree to obtain the
logical cpu id.


Patches [3-5/5] make further efforts to allow nr_cpus to be reduced to
two or fewer.

[1]: 
https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/
---
v7 -> v8
  Fix a bug that shows up when the DEBUG macro is turned on
  Introduce [PATCHv7 4/5] powerpc/cpu: Skip impossible cpu during iteration on
a core, which avoids access to unpopulated pcpu data.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org


Pingfan Liu (5):
  powerpc/setup : Enable boot_cpu_hwid for PPC32
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/cpu: Skip impossible cpu during iteration on a core
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/include/asm/cputhreads.h|   6 ++
 arch/powerpc/include/asm/smp.h   |   2 +-
 arch/powerpc/kernel/paca.c   |  10 ++-
 arch/powerpc/kernel/prom.c   |  29 +++---
 arch/powerpc/kernel/setup-common.c   | 108 ++-
 arch/powerpc/kernel/smp.c|   2 +-
 arch/powerpc/kvm/book3s_hv.c |   7 +-
 arch/powerpc/platforms/powernv/idle.c|  32 +++
 arch/powerpc/platforms/powernv/subcore.c |   5 +-
 9 files changed, 143 insertions(+), 58 deletions(-)

-- 
2.31.1



Re: [PATCHv7 4/4] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

2023-10-06 Thread Pingfan Liu
On Wed, Oct 4, 2023 at 2:07 AM Mahesh J Salgaonkar  wrote:
>
> On 2023-09-25 15:53:48 Mon, Pingfan Liu wrote:
> > paca_ptrs should be large enough to hold the boot_cpuid, hence, its
> > lower boundary is set to the bigger one between boot_cpuid+1 and
> > nr_cpus.
> >
> > On the other hand, some kernel component: -1. the timer assumes cpu0
> > online since the timer_list->flags subfield 'TIMER_CPUMASK' is zero if
> > not initialized to a proper present cpu.  -2. power9_idle_stop() assumes
> > the primary thread's paca is allocated.
> >
> > Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
> > boot cpu is not cpu0.
> >
> > Result:
> > When nr_cpus=1, taskset -c 14 bash -c 'echo c > /proc/sysrq-trigger'
> > the kdump kernel brings up two cpus.
> > While when taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger',
> > the kdump kernel brings up one cpu.
>
> I tried your changes on power9 and power10 systems. However, on power10 lpar I
> see bellow backtrace in kdump kernel bootup with nr_cpus=1.
>

Thanks for the testing. I have only tried this series on Power9 bare
metal. I think the bug is related to this code snippet in
update_mask_from_threadgroup():
  for (i = first_thread; i < first_thread + threads_per_core; i++) {
int i_group_start = get_cpu_thread_group_start(i, tg);
  ^^^

Here it iterates over each thread in the core, but some of them are not online.

I will try to come up with a fix.
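
A minimal userspace sketch of the kind of guard this suggests, assuming the
fix is to skip ids that were never made possible. The bitmask and both helper
names are hypothetical stand-ins for cpu_possible_mask and the real loop; this
is not the actual change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_possible_mask: bit n set => cpu n possible. */
static bool sketch_cpu_possible(uint64_t possible_mask, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return (possible_mask >> cpu) & 1;
}

/*
 * Count the threads of one core that are safe to touch. A loop that
 * blindly walks first_thread .. first_thread + threads_per_core - 1 also
 * visits ids that were never made possible (for example when nr_cpus is
 * smaller than threads_per_core) and could then dereference per-cpu data
 * that was never populated.
 */
static int count_usable_threads(uint64_t possible_mask, int first_thread,
				int threads_per_core)
{
	int usable = 0;

	for (int i = first_thread; i < first_thread + threads_per_core; i++) {
		if (!sketch_cpu_possible(possible_mask, i))
			continue;	/* skip: no per-cpu data behind this id */
		usable++;
	}
	return usable;
}
```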

Thanks,

Pingfan


> $ taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger'
> [...]
> [0.00] Hardware name: IBM,9105-22A POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1040.00 (NL1040_005) hv:phyp pSeries
> [0.00] printk: bootconsole [udbg0] enabled
> [0.00] the round shift between dt seq and the cpu logic number: 8
> [0.00] Partition configured for 16 cpus, operating system maximum is 
> 2.
> [0.00] CPU maps initialized for 8 threads per core
> [...]
> [0.002249] BUG: Unable to handle kernel data access at 0x88c0
> [0.002260] Faulting instruction address: 0xc0001201226c
> [0.002268] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.002274] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [0.002282] Modules linked in:
> [0.002288] CPU: 4 PID: 1 Comm: swapper/4 Not tainted 6.6.0-rc4 #1
> [0.002296] Hardware name: IBM,9105-22A POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1040.00 (NL1040_005) hv:phyp pSeries
> [0.002305] NIP:  c0001201226c LR: c00012012234 CTR: 
> 0004
> [0.002312] REGS: c000167ff8f0 TRAP: 0380   Not tainted  (6.6.0-rc4)
> [0.002321] MSR:  82009033   CR: 
> 24000844  XER: 000a
> [0.002346] CFAR: c0001201231c IRQMASK: 0
> [0.002346] GPR00: c00012012234 c000167ffb90 c00011b61900 
> 0002
> [0.002346] GPR04:  0001 0001 
> c0004ffeff80
> [0.002346] GPR08:   0002 
> 
> [0.002346] GPR12:  c00013141000 c00010011058 
> 
> [0.002346] GPR16:    
> 
> [0.002346] GPR20: 0028 c00012170968 c000120a3e80 
> 0016
> [0.002346] GPR24: c0004ffdcfd0  c00012b82058 
> 
> [0.002346] GPR28: c0004fc80a68 c00012bf0350 c000120a3e2c 
> 
> [0.002426] NIP [c0001201226c] update_mask_from_threadgroup+0x98/0x174
> [0.002437] LR [c00012012234] update_mask_from_threadgroup+0x60/0x174
> [0.002444] Call Trace:
> [0.002451] [c000167ffb90] [c00012012234] 
> update_mask_from_threadgroup+0x60/0x174 (unreliable)
> [0.002464] [c000167ffbe0] [c000120125f8] 
> init_thread_group_cache_map+0x2b0/0x328
> [0.002477] [c000167ffc50] [c0001201296c] 
> smp_prepare_cpus+0x2fc/0x4f0
> [0.002497] [c000167ffd10] [c00012004e40] 
> kernel_init_freeable+0x198/0x3cc
> [0.002509] [c000167ffde0] [c00010011084] kernel_init+0x34/0x1b0
> [0.002531] [c000167ffe50] [c0001000dd3c] 
> ret_from_kernel_user_thread+0x14/0x1c
> [0.002547] --- interrupt: 0 at 0x0
> [0.002553] NIP:   LR:  CTR: 
> 
> [0.002563] REGS: c000167ffe80 TRAP:    Not tainted  (6.6.0-rc4)
> [0.002569] MSR:   <>  CR:   XER: 
> [0.002576] CFAR:  IRQMASK: 0
> [0.002576] GPR00:  00

Re: [PATCHv7 2/4] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-09-28 Thread Pingfan Liu
On Fri, Sep 29, 2023 at 4:36 AM Wen Xiong  wrote:
>
> Hi Pingfan,
>
> +   avail = intserv_node->avail;
> +   nthreads = intserv_node->len / sizeof(int);
> +   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
> set_cpu_present(cpu, avail);
> set_cpu_possible(cpu, true);
> -   cpu_to_phys_id[cpu] = be32_to_cpu(intserv[j]);
> +   cpu_to_phys_id[cpu] = 
> be32_to_cpu(intserv_node->intserv[j]);
> +   DBG("thread %d -> cpu %d (hard id %d)\n",
> +   j, cpu, be32_to_cpu(intserv[j]));
>
> Intserv is not defined. Should "be32_to_cpu(intserv_node->intserv[j])?

Yes, thanks. Sorry, I did not turn on the DBG macro, so I did not catch this bug.

Thanks,

Pingfan
>             cpu++;
> }
> +   }
>
> -Original Message-
> From: Pingfan Liu 
> Sent: Monday, September 25, 2023 2:54 AM
> To: linuxppc-dev@lists.ozlabs.org
> Cc: Pingfan Liu ; Michael Ellerman ; 
> Nicholas Piggin ; Christophe Leroy 
> ; Mahesh Salgaonkar ; Wen 
> Xiong ; Baoquan He ; Ming Lei 
> ; ke...@lists.infradead.org
> Subject: [EXTERNAL] [PATCHv7 2/4] powerpc/setup: Loosen the mapping between 
> cpu logical id and its seq in dt
>
> *** Idea ***
> For kexec -p, the boot cpu can be not the cpu0, this causes the problem of 
> allocating memory for paca_ptrs[]. However, in theory, there is no 
> requirement to assign cpu's logical id as its present sequence in the device 
> tree. But there is something like cpu_first_thread_sibling(), which makes 
> assumption on the mapping inside a core. Hence partially loosening the 
> mapping, i.e. unbind the mapping of core while keep the mapping inside a core.
>
> *** Implement ***
> At this early stage, there are plenty of memory to utilize. Hence, this patch 
> allocates interim memory to link the cpu info on a list, then reorder cpus by 
> changing the list head. As a result, there is a rotate shift between the 
> sequence number in dt and the cpu logical number.
>
> *** Result ***
> After this patch, a boot-cpu's logical id will always be mapped into the 
> range [0,threads_per_core).
>
> Besides this, at this phase, all threads in the boot core are forced to be 
> onlined. This restriction will be lifted in a later patch with extra effort.
>
> Signed-off-by: Pingfan Liu 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: Mahesh Salgaonkar 
> Cc: Wen Xiong 
> Cc: Baoquan He 
> Cc: Ming Lei 
> Cc: ke...@lists.infradead.org
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/prom.c | 25 +
>  arch/powerpc/kernel/setup-common.c | 87 +++---
>  2 files changed, 85 insertions(+), 27 deletions(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index 
> ec82f5bda908..87272a2d8c10 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -76,7 +76,9 @@ u64 ppc64_rma_size;
>  unsigned int boot_cpu_node_count __ro_after_init;  #endif  static 
> phys_addr_t first_memblock_size;
> +#ifdef CONFIG_SMP
>  static int __initdata boot_cpu_count;
> +#endif
>
>  static int __init early_parse_mem(char *p)  { @@ -331,8 +333,7 @@ static int 
> __init early_init_dt_scan_cpus(unsigned long node,
> const __be32 *intserv;
> int i, nthreads;
> int len;
> -   int found = -1;
> -   int found_thread = 0;
> +   bool found = false;
>
> /* We are scanning "cpu" nodes only */
> if (type == NULL || strcmp(type, "cpu") != 0) @@ -355,8 +356,15 @@ 
> static int __init early_init_dt_scan_cpus(unsigned long node,
> for (i = 0; i < nthreads; i++) {
> if (be32_to_cpu(intserv[i]) ==
> fdt_boot_cpuid_phys(initial_boot_params)) {
> -   found = boot_cpu_count;
> -   found_thread = i;
> +   /*
> +* always map the boot-cpu logical id into the
> +* range of [0, thread_per_core)
> +*/
> +   boot_cpuid = i;
> +   found = true;
> +   /* This works around the hole in paca_ptrs[]. */
> +   if (nr_cpu_ids < nthreads)
> +   set_nr_cpu_ids(nthreads);
> }
>  #ifdef CONFIG_SMP
> /* logical cpu id is always 0 on UP kernels */ @@ -365,14 
> +373,13 @@ static int __init early_init_

[PATCHv7 4/4] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

2023-09-25 Thread Pingfan Liu
paca_ptrs should be large enough to hold the boot_cpuid; hence, its
lower bound is set to the larger of boot_cpuid+1 and nr_cpus.

On the other hand, some kernel components make assumptions about cpu0:
 1. The timer assumes cpu0 is online, since the timer_list->flags
    subfield 'TIMER_CPUMASK' is zero if not initialized to a proper
    present cpu.
 2. power9_idle_stop() assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined if the
boot cpu is not cpu0.

Result:
With nr_cpus=1, after taskset -c 14 bash -c 'echo c > /proc/sysrq-trigger'
the kdump kernel brings up two cpus,
while after taskset -c 4 bash -c 'echo c > /proc/sysrq-trigger'
the kdump kernel brings up one cpu.
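
For illustration only: the two sizing rules above, restated as standalone
helpers. The function names are hypothetical; the expressions mirror the ones
used in the diff below, but this is a userspace sketch and not kernel code.

```c
#include <assert.h>

/* Number of paca_ptrs[] slots: the array must cover index boot_cpuid. */
static int paca_slots_needed(int boot_cpuid, int nr_cpu_ids)
{
	assert(boot_cpuid >= 0 && nr_cpu_ids >= 1);
	return (boot_cpuid + 1 > nr_cpu_ids) ? boot_cpuid + 1 : nr_cpu_ids;
}

/* nr_cpu_ids after the adjustment: at least 2 when the boot cpu is not cpu0. */
static int adjusted_nr_cpu_ids(int boot_cpuid, int nr_cpu_ids)
{
	assert(boot_cpuid >= 0 && nr_cpu_ids >= 1);
	if (boot_cpuid != 0 && nr_cpu_ids < 2)
		return 2;
	return nr_cpu_ids;
}
```

If the test machine has four threads per core (an assumption, the message
does not say), cpu 14 is thread 2 of its core, so boot_cpuid is 2 and
nr_cpu_ids is lifted to two, while cpu 4 is thread 0, so nr_cpu_ids stays at
one. That would match the result reported above.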

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_nr_cpu_ids = nr_cpu_ids;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   paca_nr_cpu_ids = n;
+   paca_ptrs_size = sizeof(struct paca_struct *) * n;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
int new_ptrs_size;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   new_ptrs_size = sizeof(struct paca_struct *) * n;
if (new_ptrs_size < paca_ptrs_size)
memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
   paca_ptrs_size - new_ptrs_size);
 
-   paca_nr_cpu_ids = nr_cpu_ids;
+   paca_nr_cpu_ids = n;
paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 87272a2d8c10..15c994f54bf9 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
 */
boot_cpuid = i;
found = true;
-   /* This works around the hole in paca_ptrs[]. */
-   if (nr_cpu_ids < nthreads)
-   set_nr_cpu_ids(nthreads);
+   /*
+* Ideally, nr_cpus=1 can be achieved if each kernel
+* component does not assume cpu0 is onlined.
+*/
+   if (boot_cpuid != 0 && nr_cpu_ids < 2)
+   set_nr_cpu_ids(2);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1



[PATCHv7 3/4] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus

2023-09-25 Thread Pingfan Liu
If the boot_cpuid is not smaller than nr_cpus, it requires extra effort to
ensure the boot cpu is in cpu_present_mask. This can be achieved by
reserving the last slot for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch.
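
For illustration only: a userspace model of "reserving the last slot", limited
to the boot core and using a bitmask in place of cpu_present_mask. The
function name is hypothetical; the variable names follow the diff below, but
this sketch is not the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a bitmask of the logical cpu ids marked present in the boot core
 * (bit n set => cpu n present) under a limit of nr_cpu_ids cpus.
 * bt_thread is the boot cpu's thread index inside the core.
 */
static uint64_t boot_core_present_mask(int nthreads, int nr_cpu_ids,
				       int bt_thread)
{
	assert(nthreads > 0 && nthreads <= 64);
	assert(nr_cpu_ids > 0);
	assert(bt_thread >= 0 && bt_thread < nthreads);

	int terminate = nr_cpu_ids;
	uint64_t mask = 0;

	/* Boot thread lies beyond the limit: hold one slot back for it. */
	if (nr_cpu_ids - 1 < bt_thread)
		terminate = nr_cpu_ids - 1;

	for (int cpu = 0; cpu < nthreads && cpu < terminate; cpu++)
		mask |= UINT64_C(1) << cpu;

	/* Spend the reserved slot on the boot thread, keeping its in-core id. */
	if (nr_cpu_ids - 1 < bt_thread)
		mask |= UINT64_C(1) << bt_thread;

	return mask;
}
```

With eight threads per core, a limit of four cpus and the boot cpu on thread
6, the model marks cpus 0, 1, 2 and 6 present: four cpus in total, one of
which is the boot cpu.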

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index f6d32324b5a5..a72d00a6cff2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -454,8 +454,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int shift = 0, cpu = 0;
-   int j, nthreads = 1;
+   int terminate, shift = 0, cpu = 0;
+   int j, bt_thread = 0, nthreads = 1;
int len;
struct interrupt_server_node *intserv_node, *n;
struct list_head *bt_node, head;
@@ -518,6 +518,7 @@ void __init smp_setup_cpu_maps(void)
for (j = 0 ; j < nthreads; j++) {
if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
bt_node = &intserv_node->node;
+   bt_thread = j;
found_boot_cpu = true;
/*
 * Record the round-shift between dt
@@ -537,11 +538,21 @@ void __init smp_setup_cpu_maps(void)
/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
list_add_tail(&head, bt_node);
pr_info("the round shift between dt seq and the cpu logic number: 
%d\n", shift);
+   terminate = nr_cpu_ids;
list_for_each_entry(intserv_node, &head, node) {
 
+   j = 0;
+   /* Choose a start point to cover the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   /*
+* The processor core puts assumption on the thread id,
+* not to breach the assumption.
+*/
+   terminate = nr_cpu_ids - 1;
+   }
avail = intserv_node->avail;
nthreads = intserv_node->len / sizeof(int);
-   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+   for (; j < nthreads && cpu < terminate; j++) {
set_cpu_present(cpu, avail);
set_cpu_possible(cpu, true);
cpu_to_phys_id[cpu] = 
be32_to_cpu(intserv_node->intserv[j]);
@@ -549,6 +560,14 @@ void __init smp_setup_cpu_maps(void)
j, cpu, be32_to_cpu(intserv[j]));
cpu++;
}
+   /* Online the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   set_cpu_present(bt_thread, avail);
+   set_cpu_possible(bt_thread, true);
+   cpu_to_phys_id[bt_thread] = 
be32_to_cpu(intserv_node->intserv[bt_thread]);
+   DBG("thread %d -> cpu %d (hard id %d)\n",
+   bt_thread, bt_thread, 
be32_to_cpu(intserv[bt_thread]));
+   }
}
 
list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1



[PATCHv7 2/4] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-09-25 Thread Pingfan Liu
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. In theory, nothing requires a cpu's
logical id to follow its sequence in the device tree. However, helpers
such as cpu_first_thread_sibling() make assumptions about the mapping
inside a core. Hence the mapping is only partially loosened: the mapping
of cores is unbound, while the mapping inside a core is kept.

*** Implementation ***
At this early stage there is plenty of memory to use. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a
rotate shift between the sequence number in the dt and the logical cpu
number.

*** Result ***
After this patch, the boot cpu's logical id is always mapped into the
range [0, threads_per_core).

Besides this, at this phase all threads in the boot core are forced
online. This restriction will be lifted in a later patch with
extra effort.
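
For illustration only: one plausible reading of "reorder cpus by changing the
list head", written against a tiny userspace list modelled on the kernel's
circular list_head. All type and function names are hypothetical. The sketch
assumes the head is unlinked and then re-inserted just before the boot core's
node, so that a walk from the head starts at the boot core.

```c
#include <assert.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node {
	struct node *prev, *next;
	int core;		/* device-tree sequence number; -1 for the head */
};

static void node_init(struct node *n, int core)
{
	n->prev = n->next = n;
	n->core = core;
}

/* Insert new_node immediately before pos. */
static void insert_before(struct node *new_node, struct node *pos)
{
	new_node->prev = pos->prev;
	new_node->next = pos;
	pos->prev->next = new_node;
	pos->prev = new_node;
}

/* Take n out of whatever list it is on. */
static void unlink_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Rotate the list so that a walk from head starts at boot. */
static void rotate_to(struct node *head, struct node *boot)
{
	assert(head != boot);
	unlink_node(head);
	insert_before(head, boot);
}

/* Return the core number at position idx of a walk from head, or -1. */
static int core_at(const struct node *head, int idx)
{
	const struct node *n = head->next;

	while (n != head && idx-- > 0)
		n = n->next;
	return n == head ? -1 : n->core;
}
```

No node is copied or re-sorted; only the position of the head changes, which
is why the result is a rotation of the device-tree order.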

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c | 25 +
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index ec82f5bda908..87272a2d8c10 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
const __be32 *intserv;
int i, nthreads;
int len;
-   int found = -1;
-   int found_thread = 0;
+   bool found = false;
 
/* We are scanning "cpu" nodes only */
if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
for (i = 0; i < nthreads; i++) {
if (be32_to_cpu(intserv[i]) ==
fdt_boot_cpuid_phys(initial_boot_params)) {
-   found = boot_cpu_count;
-   found_thread = i;
+   /*
+* always map the boot-cpu logical id into the
+* range of [0, thread_per_core)
+*/
+   boot_cpuid = i;
+   found = true;
+   /* This works around the hole in paca_ptrs[]. */
+   if (nr_cpu_ids < nthreads)
+   set_nr_cpu_ids(nthreads);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
@@ -365,14 +373,13 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
}
 
/* Not the boot CPU */
-   if (found < 0)
+   if (!found)
return 0;
 
-   DBG("boot cpu: logical %d physical %d\n", found,
-   be32_to_cpu(intserv[found_thread]));
-   boot_cpuid = found;
+   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+   be32_to_cpu(intserv[boot_cpuid]));
 
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 1b19a9815672..f6d32324b5a5 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -425,6 +426,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+   struct list_head node;
+   bool avail;
+   int len;
+   __be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *  cpu_possible_mask
@@ -446,11 +454,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
-   int nthreads = 1;
+   int shift = 0, cpu = 0;
+   int j, nthreads = 1;
+   int len;
+   struct interrupt_server_node *intserv_node, *n;
+   struct list_head *bt_node, head;
+   bool avail, found_boot_cpu = false;
 
DBG("smp_setup_cpu_maps()\n");
 
+   INIT_LIST_HEAD(&head);
cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
__alignof_

[PATCHv7 1/4] powerpc/setup : Enable boot_cpu_hwid for PPC32

2023-09-25 Thread Pingfan Liu
In order to identify the boot cpu, its intserv[] entry should be recorded
and checked in smp_setup_cpu_maps().

smp_setup_cpu_maps() is shared between PPC64 and PPC32. Since PPC64
already uses boot_cpu_hwid to carry that information, enable this
variable on PPC32 as well, so that it can carry the same information
there in the coming patch.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202309130232.n2rewhbv-...@intel.com/
---
 arch/powerpc/include/asm/smp.h | 2 +-
 arch/powerpc/kernel/prom.c | 3 +--
 arch/powerpc/kernel/setup-common.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 576d0e15..5db9178cc800 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -26,7 +26,7 @@
 #include 
 
 extern int boot_cpuid;
-extern int boot_cpu_hwid; /* PPC64 only */
+extern int boot_cpu_hwid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..ec82f5bda908 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -372,8 +372,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
be32_to_cpu(intserv[found_thread]));
boot_cpuid = found;
 
-   if (IS_ENABLED(CONFIG_PPC64))
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index d2a446216444..1b19a9815672 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -87,9 +87,7 @@ EXPORT_SYMBOL(machine_id);
 int boot_cpuid = -1;
 EXPORT_SYMBOL_GPL(boot_cpuid);
 
-#ifdef CONFIG_PPC64
 int boot_cpu_hwid = -1;
-#endif
 
 /*
  * These are used in binfmt_elf.c to put aux entries on the stack
-- 
2.31.1



[PATCHv7 0/4] enable nr_cpus for powerpc

2023-09-25 Thread Pingfan Liu
Since my last v4 [1], the code has changed a great deal. The paca[]
array has been reorganized and is now indexed through paca_ptrs[], which
dramatically decreases memory consumption even if there are many
non-present cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [2/4], which
rotate-shifts the cpu's sequence number in the device tree to obtain the
logical cpu id.

Patches [3-4/4] make efforts to allow nr_cpus to be reduced to two or
fewer.

[1]: 
https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/
---
v6 -> v7
  Add [1/4], which fixes a compilation error on PPC32

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org


Pingfan Liu (4):
  powerpc/setup : Enable boot_cpu_hwid for PPC32
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/include/asm/smp.h |   2 +-
 arch/powerpc/kernel/paca.c |  10 +--
 arch/powerpc/kernel/prom.c |  29 +---
 arch/powerpc/kernel/setup-common.c | 108 +++--
 4 files changed, 114 insertions(+), 35 deletions(-)

-- 
2.31.1



Re: [RFC PATCH] powerpc: Make crashing cpu to be discovered first in kdump kernel.

2023-09-11 Thread Pingfan Liu
Hi Mahesh,

I am not very familiar with the fdt side, so I will skip that part and
leave some comments from the kexec point of view.

On Thu, Sep 7, 2023 at 1:59 AM Mahesh Salgaonkar  wrote:
>
> The kernel boot parameter 'nr_cpus=' allows one to specify number of
> possible cpus in the system. In the normal scenario the first cpu (cpu0)
> that shows up is the boot cpu and hence it gets covered under nr_cpus
> limit.
>
> But this assumption is broken in kdump scenario where kdump kernel after a
> crash can boot up on an non-zero boot cpu. The paca structure allocation
> depends on value of nr_cpus and is indexed using logical cpu ids. The cpu
> discovery code brings up the cpus as they appear sequentially on device
> tree and assigns logical cpu ids starting from 0. This definitely becomes
> an issue if boot cpu id > nr_cpus. When this occurs it results into
>
> In past there were proposals to fix this by making changes to cpu discovery
> code to identify non-zero boot cpu and map it to logical cpu 0. However,
> the changes were very invasive, making discovery code more complicated and
> risky.
>
> Considering that the non-zero boot cpu scenario is more specific to kdump
> kernel, limiting the changes in panic/crash kexec path would probably be a
> best approach to have.
>
> Hence proposed change is, in crash kexec path, move the crashing cpu's
> device node to the first position under '/cpus' node, which will make the
> crashing cpu to be discovered as part of the first core in kdump kernel.
>
> In order to accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids,
> align up the nr_cpu_ids to SMT threads in early_init_dt_scan_cpus(). This
> will allow kdump kernel to work with nr_cpus=X where X will be aligned up
> in multiple of SMT threads per core.
>
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/kexec.h  |1
>  arch/powerpc/kernel/prom.c|   13 
>  arch/powerpc/kexec/core_64.c  |  128 
> +
>  arch/powerpc/kexec/file_load_64.c |2 -
>  4 files changed, 143 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/kexec.h 
> b/arch/powerpc/include/asm/kexec.h
> index a1ddba01e7d13..f5a6f4a1b8eb0 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -144,6 +144,7 @@ unsigned int kexec_extra_fdt_size_ppc64(struct kimage 
> *image);
>  int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
> unsigned long initrd_load_addr,
> unsigned long initrd_len, const char *cmdline);
> +int add_node_props(void *fdt, int node_offset, const struct device_node *dn);
>  #endif /* CONFIG_PPC64 */
>
>  #endif /* CONFIG_KEXEC_FILE */
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 0b5878c3125b1..c2d4f55042d72 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -322,6 +322,9 @@ static void __init check_cpu_feature_properties(unsigned 
> long node)
> }
>  }
>
> +/* align addr on a size boundary - adjust address up */
> +#define _ALIGN_UP(addr, size)   
> (((addr)+((size)-1))&(~((typeof(addr))(size)-1)))
> +
>  static int __init early_init_dt_scan_cpus(unsigned long node,
>   const char *uname, int depth,
>   void *data)
> @@ -348,6 +351,16 @@ static int __init early_init_dt_scan_cpus(unsigned long 
> node,
>
> nthreads = len / sizeof(int);
>
> +   /*
> +* Align nr_cpu_ids to correct SMT value. This will help us to 
> allocate
> +* pacas correctly to accomodate boot_cpu != 0 scenario e.g. in kdump
> +* kernel the boot cpu can be any cpu between 0 through nthreads.
> +*/
> +   if (nr_cpu_ids % nthreads) {
> +   nr_cpu_ids = _ALIGN_UP(nr_cpu_ids, nthreads);

It is better to use set_nr_cpu_ids(), which hides the difference in how
nr_cpu_ids is defined under different kernel configurations.
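
To illustrate the suggestion, here is a minimal userspace sketch. All
sketch_* names are hypothetical stand-ins, not the kernel's definitions.
It goes through a setter instead of assigning the variable directly, and
it rounds with a division, so the thread count does not have to be a
power of two as the mask-based _ALIGN_UP() above requires.

```c
/* Userspace stand-in for the CPU limit; hypothetical, not the kernel symbol. */
static unsigned int sketch_nr_cpu_ids = 1;

/* Stand-in setter: callers never assign the limit directly. */
static void sketch_set_nr_cpu_ids(unsigned int nr)
{
	sketch_nr_cpu_ids = nr;
}

/* Round n up to the next multiple of step; valid for any step > 0. */
static unsigned int sketch_round_up(unsigned int n, unsigned int step)
{
	return ((n + step - 1) / step) * step;
}

/*
 * Align the CPU limit to a whole number of cores.
 * Returns 1 if the limit was changed, 0 otherwise.
 */
static int sketch_align_nr_cpu_ids(unsigned int nthreads)
{
	if (nthreads == 0 || sketch_nr_cpu_ids % nthreads == 0)
		return 0;

	sketch_set_nr_cpu_ids(sketch_round_up(sketch_nr_cpu_ids, nthreads));
	return 1;
}
```

With a limit of 1 and 8 threads per core, the sketch raises the limit to
8; a second call leaves it unchanged.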

> +   pr_info("Aligned nr_cpus to SMT=%d, nr_cpu_ids = %d\n", 
> nthreads, nr_cpu_ids);
> +   }
> +
> /*
>  * Now see if any of these threads match our boot cpu.
>  * NOTE: This must match the parsing done in smp_setup_cpu_maps.
> diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
> index a79e28c91e2be..168bef43e22c2 100644
> --- a/arch/powerpc/kexec/core_64.c
> +++ b/arch/powerpc/kexec/core_64.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -298,6 +299,119 @@ extern void kexec_sequence(void *newstack, unsigned 
> long start,
>void (*clear_all)(void),
>bool copy_with_mmu_off) __noreturn;
>
> +/*
> + * Move the crashing cpus FDT node as the first node under '/cpus' node.
> + *
> + * - Get the FDT segment from the crash image segments.
> + * - Locate the crashing CPUs fdt subnode 'A' under '/cpus' node.
> + * - 

[PATCHv6 3/3] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

2023-09-11 Thread Pingfan Liu
paca_ptrs should be large enough to hold the boot_cpuid, hence, its
lower boundary is set to the bigger one between boot_cpuid+1 and
nr_cpus.

On the other hand, some kernel components assume that cpu0 is online:
1. the timer, since the timer_list->flags subfield 'TIMER_CPUMASK' is
zero if not initialized to a proper present cpu; 2. power9_idle_stop(),
which assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
boot cpu is not cpu0.
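
The two rules above can be sketched in plain C as follows. This is only
an illustration under the assumption that the helpers (sketch_* names,
which are hypothetical) take the values as parameters; the patch itself
works on the kernel globals.

```c
#include <stddef.h>

/* Slots needed in paca_ptrs[]: the larger of boot_cpuid + 1 and nr_cpu_ids. */
static unsigned int sketch_paca_slots(unsigned int boot_cpuid,
				      unsigned int nr_cpu_ids)
{
	unsigned int need = boot_cpuid + 1;

	return need > nr_cpu_ids ? need : nr_cpu_ids;
}

/* Size in bytes of the pointer array for that many slots. */
static size_t sketch_paca_ptrs_bytes(unsigned int boot_cpuid,
				     unsigned int nr_cpu_ids)
{
	return sizeof(void *) * sketch_paca_slots(boot_cpuid, nr_cpu_ids);
}

/* Lift the limit from one to two when the boot cpu is not cpu0. */
static unsigned int sketch_min_nr_cpu_ids(unsigned int boot_cpuid,
					  unsigned int nr_cpu_ids)
{
	if (boot_cpuid != 0 && nr_cpu_ids < 2)
		return 2;
	return nr_cpu_ids;
}
```

For example, with boot_cpuid = 3 and nr_cpus=1 the sketch asks for four
pointer slots and a limit of two.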

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_nr_cpu_ids = nr_cpu_ids;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   paca_nr_cpu_ids = n;
+   paca_ptrs_size = sizeof(struct paca_struct *) * n;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
int new_ptrs_size;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   new_ptrs_size = sizeof(struct paca_struct *) * n;
if (new_ptrs_size < paca_ptrs_size)
memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
   paca_ptrs_size - new_ptrs_size);
 
-   paca_nr_cpu_ids = nr_cpu_ids;
+   paca_nr_cpu_ids = n;
paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index cb3f3e040455..28441edbc42d 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -362,9 +362,12 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
 */
boot_cpuid = i;
found = true;
-   /* This works around the hole in paca_ptrs[]. */
-   if (nr_cpu_ids < nthreads)
-   set_nr_cpu_ids(nthreads);
+   /*
+* Ideally, nr_cpus=1 can be achieved if each kernel
+* component does not assume cpu0 is onlined.
+*/
+   if (boot_cpuid != 0 && nr_cpu_ids < 2)
+   set_nr_cpu_ids(2);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1



[PATCHv6 2/3] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus

2023-09-11 Thread Pingfan Liu
If the boot_cpuid is not smaller than nr_cpus, extra effort is required
to ensure the boot cpu is in cpu_present_mask. This is achieved by
reserving the last slot for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch
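
A rough sketch of the reservation, restricted to the boot core and using
hypothetical names (sketch_*, SKETCH_MAX_CPUS); the real loop also walks
the remaining cores and updates the kernel cpumasks, which is left out
here.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_CPUS 64

/*
 * Mark which logical ids of the boot core become present.
 * If the boot thread lies beyond the limit, one slot is held back
 * and given to the boot thread instead.
 * Returns the number of ids marked, or 0 on invalid input.
 */
static unsigned int sketch_assign_boot_core(bool *present,
					    unsigned int nthreads,
					    unsigned int nr_cpu_ids,
					    unsigned int bt_thread)
{
	unsigned int limit = nr_cpu_ids;
	unsigned int cpu, count = 0;

	memset(present, 0, SKETCH_MAX_CPUS * sizeof(*present));

	if (nr_cpu_ids == 0 || nthreads == 0 ||
	    nthreads > SKETCH_MAX_CPUS || bt_thread >= nthreads)
		return 0;

	/* Boot thread falls outside the limit: keep the last slot for it. */
	if (bt_thread >= nr_cpu_ids)
		limit = nr_cpu_ids - 1;

	for (cpu = 0; cpu < nthreads && cpu < limit; cpu++) {
		present[cpu] = true;
		count++;
	}

	/* Hand the reserved slot to the boot thread. */
	if (bt_thread >= nr_cpu_ids) {
		present[bt_thread] = true;
		count++;
	}

	return count;
}
```

With 8 threads per core, nr_cpus=2 and the boot cpu on thread 5, the
sketch marks cpu 0 and cpu 5 and nothing else.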

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index a07af8de6674..58a988c64dd2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -456,8 +456,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int shift = 0, cpu = 0;
-   int j, nthreads = 1;
+   int terminate, shift = 0, cpu = 0;
+   int j, bt_thread = 0, nthreads = 1;
int len;
struct interrupt_server_node *intserv_node, *n;
struct list_head *bt_node, head;
@@ -520,6 +520,7 @@ void __init smp_setup_cpu_maps(void)
for (j = 0 ; j < nthreads; j++) {
if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
bt_node = &intserv_node->node;
+   bt_thread = j;
found_boot_cpu = true;
/*
 * Record the round-shift between dt
@@ -539,11 +540,21 @@ void __init smp_setup_cpu_maps(void)
/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
list_add_tail(&head, bt_node);
pr_info("the round shift between dt seq and the cpu logic number: 
%d\n", shift);
+   terminate = nr_cpu_ids;
list_for_each_entry(intserv_node, &head, node) {
 
+   j = 0;
+   /* Choose a start point to cover the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   /*
+* The processor core puts assumption on the thread id,
+* not to breach the assumption.
+*/
+   terminate = nr_cpu_ids - 1;
+   }
avail = intserv_node->avail;
nthreads = intserv_node->len / sizeof(int);
-   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+   for (; j < nthreads && cpu < terminate; j++) {
set_cpu_present(cpu, avail);
set_cpu_possible(cpu, true);
cpu_to_phys_id[cpu] = 
be32_to_cpu(intserv_node->intserv[j]);
@@ -551,6 +562,14 @@ void __init smp_setup_cpu_maps(void)
j, cpu, be32_to_cpu(intserv[j]));
cpu++;
}
+   /* Online the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   set_cpu_present(bt_thread, avail);
+   set_cpu_possible(bt_thread, true);
+   cpu_to_phys_id[bt_thread] = 
be32_to_cpu(intserv_node->intserv[bt_thread]);
+   DBG("thread %d -> cpu %d (hard id %d)\n",
+   bt_thread, bt_thread, 
be32_to_cpu(intserv[bt_thread]));
+   }
}
 
list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1



[PATCHv6 1/3] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-09-11 Thread Pingfan Liu
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which may waste plenty of
room when allocating memory for paca_ptrs[]. However, in theory, there
is no requirement to assign a cpu's logical id according to its
sequence in the device tree. But there is something like
cpu_first_thread_sibling(), which makes assumptions about the mapping
inside a core. Hence partially loosen the mapping, i.e. unbind the
mapping between cores while keeping the mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a
rotate shift between the sequence number in dt and the cpu logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.
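
The rotate shift can be pictured with the small sketch below. It assumes
the shift is counted in whole cores and that every core has the same
number of threads; the helper name and parameters are hypothetical and
only meant to show the arithmetic.

```c
/*
 * Logical cpu id of thread `thread` in the core found at position
 * `dt_core` in the device tree, after rotating the core order so that
 * the boot core (position `boot_core`) comes first.
 * The thread position inside a core is left untouched.
 */
static unsigned int sketch_logical_cpu(unsigned int dt_core,
				       unsigned int thread,
				       unsigned int boot_core,
				       unsigned int ncores,
				       unsigned int threads_per_core)
{
	unsigned int rotated = (dt_core + ncores - boot_core) % ncores;

	return rotated * threads_per_core + thread;
}
```

With 4 cores of 8 threads and the boot cpu on thread 3 of core 2, the
boot cpu gets logical id 3, which lies in [0,threads_per_core), and
core 0 of the device tree moves to logical ids 16..23.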

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c | 25 +
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..cb3f3e040455 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,7 +76,9 @@ u64 ppc64_rma_size;
 unsigned int boot_cpu_node_count __ro_after_init;
 #endif
 static phys_addr_t first_memblock_size;
+#ifdef CONFIG_SMP
 static int __initdata boot_cpu_count;
+#endif
 
 static int __init early_parse_mem(char *p)
 {
@@ -331,8 +333,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
const __be32 *intserv;
int i, nthreads;
int len;
-   int found = -1;
-   int found_thread = 0;
+   bool found = false;
 
/* We are scanning "cpu" nodes only */
if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +356,15 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
for (i = 0; i < nthreads; i++) {
if (be32_to_cpu(intserv[i]) ==
fdt_boot_cpuid_phys(initial_boot_params)) {
-   found = boot_cpu_count;
-   found_thread = i;
+   /*
+* always map the boot-cpu logical id into the
+* range of [0, thread_per_core)
+*/
+   boot_cpuid = i;
+   found = true;
+   /* This works around the hole in paca_ptrs[]. */
+   if (nr_cpu_ids < nthreads)
+   set_nr_cpu_ids(nthreads);
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
@@ -365,15 +373,14 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
}
 
/* Not the boot CPU */
-   if (found < 0)
+   if (!found)
return 0;
 
-   DBG("boot cpu: logical %d physical %d\n", found,
-   be32_to_cpu(intserv[found_thread]));
-   boot_cpuid = found;
+   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+   be32_to_cpu(intserv[boot_cpuid]));
 
if (IS_ENABLED(CONFIG_PPC64))
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index d2a446216444..a07af8de6674 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -427,6 +428,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+   struct list_head node;
+   boolavail;
+   int len;
+   __be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *  cpu_possible_mask
@@ -448,11 +456,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
-   int nthreads = 1;
+   int shift = 0, cpu = 0;
+   int j, nthreads = 1;
+   int len;
+   struct interrupt_server_node *intserv_node, *n;
+   struct list_head *bt_node, head;
+   bool avail, found_boot_cpu = false;
 
DBG("smp_setup_cpu_maps()\n");
 
+   INIT_LIST_HEAD(&head);
cpu_to_phys_id = memblock_alloc(nr_cpu_i

[PATCHv6 0/3] enable nr_cpus for powerpc

2023-09-11 Thread Pingfan Liu
Since my last v4 [1], the code has undergone great changes. The paca[]
array has been reorganized and indexed by paca_ptrs[], which
dramatically decreases the memory consumption even if there are many
unpresent cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [1/3], which
rotate-shifts the cpu's sequence number in the device tree to obtain the
logical cpu id.

Patch [2-3/3] make efforts to decrease the nr_cpus to be less than or
equal to two.

[1]: 
https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

v5 -> v6:
  assign nr_cpu_ids via set_nr_cpu_ids() to handle the case where
  nr_cpu_ids is configured as a constant

Pingfan Liu (3):
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/kernel/paca.c |  10 +--
 arch/powerpc/kernel/prom.c |  28 +---
 arch/powerpc/kernel/setup-common.c | 106 -
 3 files changed, 113 insertions(+), 31 deletions(-)

-- 
2.31.1



[PATCHv5 3/3] powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

2023-09-08 Thread Pingfan Liu
paca_ptrs should be large enough to hold the boot_cpuid, hence, its
lower boundary is set to the bigger one between boot_cpuid+1 and
nr_cpus.

On the other hand, some kernel components assume that cpu0 is online:
1. the timer, since the timer_list->flags subfield 'TIMER_CPUMASK' is
zero if not initialized to a proper present cpu; 2. power9_idle_stop(),
which assumes the primary thread's paca is allocated.

Hence lift nr_cpu_ids from one to two to ensure cpu0 is onlined, if the
boot cpu is not cpu0.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/paca.c | 10 ++
 arch/powerpc/kernel/prom.c |  9 ++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index cda4e00b67c1..91e2401de1bd 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -242,9 +242,10 @@ static int __initdata paca_struct_size;
 
 void __init allocate_paca_ptrs(void)
 {
-   paca_nr_cpu_ids = nr_cpu_ids;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   paca_nr_cpu_ids = n;
+   paca_ptrs_size = sizeof(struct paca_struct *) * n;
paca_ptrs = memblock_alloc_raw(paca_ptrs_size, SMP_CACHE_BYTES);
if (!paca_ptrs)
panic("Failed to allocate %d bytes for paca pointers\n",
@@ -287,13 +288,14 @@ void __init allocate_paca(int cpu)
 void __init free_unused_pacas(void)
 {
int new_ptrs_size;
+   int n = (boot_cpuid + 1) > nr_cpu_ids ? (boot_cpuid + 1) : nr_cpu_ids;
 
-   new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+   new_ptrs_size = sizeof(struct paca_struct *) * n;
if (new_ptrs_size < paca_ptrs_size)
memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
   paca_ptrs_size - new_ptrs_size);
 
-   paca_nr_cpu_ids = nr_cpu_ids;
+   paca_nr_cpu_ids = n;
paca_ptrs_size = new_ptrs_size;
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 72be75d4f003..eca6a1568749 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -360,9 +360,12 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
 */
boot_cpuid = i;
found = true;
-   /* This works around the hole in paca_ptrs[]. */
-   if (nr_cpu_ids < nthreads)
-   nr_cpu_ids = nthreads;
+   /*
+* Ideally, nr_cpus=1 can be achieved if each kernel
+* component does not assume cpu0 is onlined.
+*/
+   if (boot_cpuid != 0 && nr_cpu_ids < 2)
+   nr_cpu_ids = 2;
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
-- 
2.31.1



[PATCHv5 2/3] powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus

2023-09-08 Thread Pingfan Liu
If the boot_cpuid is not smaller than nr_cpus, extra effort is required
to ensure the boot cpu is in cpu_present_mask. This is achieved by
reserving the last slot for the boot cpu.

Note: the restriction on nr_cpus will be lifted with more effort in the
next patch

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/setup-common.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index a07af8de6674..58a988c64dd2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -456,8 +456,8 @@ struct interrupt_server_node {
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int shift = 0, cpu = 0;
-   int j, nthreads = 1;
+   int terminate, shift = 0, cpu = 0;
+   int j, bt_thread = 0, nthreads = 1;
int len;
struct interrupt_server_node *intserv_node, *n;
struct list_head *bt_node, head;
@@ -520,6 +520,7 @@ void __init smp_setup_cpu_maps(void)
for (j = 0 ; j < nthreads; j++) {
if (be32_to_cpu(intserv[j]) == boot_cpu_hwid) {
bt_node = &intserv_node->node;
+   bt_thread = j;
found_boot_cpu = true;
/*
 * Record the round-shift between dt
@@ -539,11 +540,21 @@ void __init smp_setup_cpu_maps(void)
/* Select the primary thread, the boot cpu's slibing, as the logic 0 */
list_add_tail(&head, bt_node);
pr_info("the round shift between dt seq and the cpu logic number: 
%d\n", shift);
+   terminate = nr_cpu_ids;
list_for_each_entry(intserv_node, &head, node) {
 
+   j = 0;
+   /* Choose a start point to cover the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   /*
+* The processor core puts assumption on the thread id,
+* not to breach the assumption.
+*/
+   terminate = nr_cpu_ids - 1;
+   }
avail = intserv_node->avail;
nthreads = intserv_node->len / sizeof(int);
-   for (j = 0; j < nthreads && cpu < nr_cpu_ids; j++) {
+   for (; j < nthreads && cpu < terminate; j++) {
set_cpu_present(cpu, avail);
set_cpu_possible(cpu, true);
cpu_to_phys_id[cpu] = 
be32_to_cpu(intserv_node->intserv[j]);
@@ -551,6 +562,14 @@ void __init smp_setup_cpu_maps(void)
j, cpu, be32_to_cpu(intserv[j]));
cpu++;
}
+   /* Online the boot cpu */
+   if (nr_cpu_ids - 1 < bt_thread) {
+   set_cpu_present(bt_thread, avail);
+   set_cpu_possible(bt_thread, true);
+   cpu_to_phys_id[bt_thread] = 
be32_to_cpu(intserv_node->intserv[bt_thread]);
+   DBG("thread %d -> cpu %d (hard id %d)\n",
+   bt_thread, bt_thread, 
be32_to_cpu(intserv[bt_thread]));
+   }
}
 
list_for_each_entry_safe(intserv_node, n, &head, node) {
-- 
2.31.1



[PATCHv5 0/3] enable nr_cpus for powerpc

2023-09-08 Thread Pingfan Liu
It has been a long time since my last v4 [1].

The code has undergone great changes. The paca[] array has been
reorganized and indexed by paca_ptrs[], which dramatically decreases the
memory consumption even if there are many unpresent cpus in the middle.

However, reordering the logical cpu numbers can further decrease the
size of paca_ptrs[] in the kdump case. So I keep [1/3], which
rotate-shifts the cpu's sequence number in the device tree to obtain the
logical cpu id.

Patch [2-3/3] make efforts to decrease the nr_cpus to be less than or
equal to two.

[1]: 
https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org

Pingfan Liu (3):
  powerpc/setup: Loosen the mapping between cpu logical id and its seq
in dt
  powerpc/setup: Handle the case when boot_cpuid greater than nr_cpus
  powerpc/setup: alloc extra paca_ptrs to hold boot_cpuid

 arch/powerpc/kernel/paca.c |  10 +--
 arch/powerpc/kernel/prom.c |  26 ---
 arch/powerpc/kernel/setup-common.c | 106 -
 3 files changed, 111 insertions(+), 31 deletions(-)

-- 
2.31.1



[PATCHv5 1/3] powerpc/setup: Loosen the mapping between cpu logical id and its seq in dt

2023-09-08 Thread Pingfan Liu
*** Idea ***
For kexec -p, the boot cpu may not be cpu0, which causes a problem when
allocating memory for paca_ptrs[]. However, in theory, there is no
requirement to assign a cpu's logical id according to its sequence in
the device tree. But there is something like cpu_first_thread_sibling(),
which makes assumptions about the mapping inside a core. Hence partially
loosen the mapping, i.e. unbind the mapping between cores while keeping
the mapping inside a core.

*** Implement ***
At this early stage, there is plenty of memory to utilize. Hence, this
patch allocates interim memory to link the cpu info on a list, then
reorders the cpus by changing the list head. As a result, there is a
rotate shift between the sequence number in dt and the cpu logical number.

*** Result ***
After this patch, a boot-cpu's logical id will always be mapped into the
range [0,threads_per_core).

Besides this, at this phase, all threads in the boot core are forced to
be onlined. This restriction will be lifted in a later patch with
extra effort.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Mahesh Salgaonkar 
Cc: Wen Xiong 
Cc: Baoquan He 
Cc: Ming Lei 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/prom.c | 23 
 arch/powerpc/kernel/setup-common.c | 87 +++---
 2 files changed, 83 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..72be75d4f003 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -331,8 +331,7 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
const __be32 *intserv;
int i, nthreads;
int len;
-   int found = -1;
-   int found_thread = 0;
+   bool found = false;
 
/* We are scanning "cpu" nodes only */
if (type == NULL || strcmp(type, "cpu") != 0)
@@ -355,8 +354,15 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
for (i = 0; i < nthreads; i++) {
if (be32_to_cpu(intserv[i]) ==
fdt_boot_cpuid_phys(initial_boot_params)) {
-   found = boot_cpu_count;
-   found_thread = i;
+   /*
+* always map the boot-cpu logical id into the
+* range of [0, thread_per_core)
+*/
+   boot_cpuid = i;
+   found = true;
+   /* This works around the hole in paca_ptrs[]. */
+   if (nr_cpu_ids < nthreads)
+   nr_cpu_ids = nthreads;
}
 #ifdef CONFIG_SMP
/* logical cpu id is always 0 on UP kernels */
@@ -365,15 +371,14 @@ static int __init early_init_dt_scan_cpus(unsigned long 
node,
}
 
/* Not the boot CPU */
-   if (found < 0)
+   if (!found)
return 0;
 
-   DBG("boot cpu: logical %d physical %d\n", found,
-   be32_to_cpu(intserv[found_thread]));
-   boot_cpuid = found;
+   DBG("boot cpu: logical %d physical %d\n", boot_cpuid,
+   be32_to_cpu(intserv[boot_cpuid]));
 
if (IS_ENABLED(CONFIG_PPC64))
-   boot_cpu_hwid = be32_to_cpu(intserv[found_thread]);
+   boot_cpu_hwid = be32_to_cpu(intserv[boot_cpuid]);
 
/*
 * PAPR defines "logical" PVR values for cpus that
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index d2a446216444..a07af8de6674 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -427,6 +428,13 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+struct interrupt_server_node {
+   struct list_head node;
+   boolavail;
+   int len;
+   __be32 *intserv;
+};
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *  cpu_possible_mask
@@ -448,11 +456,16 @@ u32 *cpu_to_phys_id = NULL;
 void __init smp_setup_cpu_maps(void)
 {
struct device_node *dn;
-   int cpu = 0;
-   int nthreads = 1;
+   int shift = 0, cpu = 0;
+   int j, nthreads = 1;
+   int len;
+   struct interrupt_server_node *intserv_node, *n;
+   struct list_head *bt_node, head;
+   bool avail, found_boot_cpu = false;
 
DBG("smp_setup_cpu_maps()\n");
 
+   INIT_LIST_HEAD(&head);
cpu_to_phys_id = memblock_alloc(nr_cpu_ids * sizeof(u32),
__alignof__(u32));
if (!cpu_to_phys_id)
@@ -462,7 +475,6 @@ void __init smp_setup_cpu_maps(void)
for_each_node_by_type(dn, "cpu") {
const __be32 *intserv;

Re: [RFC PATCH] powerpc: Make crashing cpu to be discovered first in kdump kernel.

2023-09-08 Thread Pingfan Liu
Hi Mahesh,

Thanks for sharing your great idea.  I was in the middle of v5 and
finished it today.

My v5 is based on the same idea as my v4 [1], with improvements to the
code. I will send it out.

[1]: 
https://lore.kernel.org/linuxppc-dev/1520829790-14029-1-git-send-email-kernelf...@gmail.com/

I will have a close look at your patch later.

Thanks,

Pingfan

On Thu, Sep 7, 2023 at 1:59 AM Mahesh Salgaonkar  wrote:
>
> The kernel boot parameter 'nr_cpus=' allows one to specify number of
> possible cpus in the system. In the normal scenario the first cpu (cpu0)
> that shows up is the boot cpu and hence it gets covered under nr_cpus
> limit.
>
> But this assumption is broken in kdump scenario where kdump kernel after a
> crash can boot up on an non-zero boot cpu. The paca structure allocation
> depends on value of nr_cpus and is indexed using logical cpu ids. The cpu
> discovery code brings up the cpus as they appear sequentially on device
> tree and assigns logical cpu ids starting from 0. This definitely becomes
> an issue if boot cpu id > nr_cpus. When this occurs it results into
>
> In past there were proposals to fix this by making changes to cpu discovery
> code to identify non-zero boot cpu and map it to logical cpu 0. However,
> the changes were very invasive, making discovery code more complicated and
> risky.
>
> Considering that the non-zero boot cpu scenario is more specific to kdump
> kernel, limiting the changes in panic/crash kexec path would probably be a
> best approach to have.
>
> Hence proposed change is, in crash kexec path, move the crashing cpu's
> device node to the first position under '/cpus' node, which will make the
> crashing cpu to be discovered as part of the first core in kdump kernel.
>
> In order to accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids,
> align up the nr_cpu_ids to SMT threads in early_init_dt_scan_cpus(). This
> will allow kdump kernel to work with nr_cpus=X where X will be aligned up
> in multiple of SMT threads per core.
>
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/kexec.h  |1
>  arch/powerpc/kernel/prom.c|   13 
>  arch/powerpc/kexec/core_64.c  |  128 
> +
>  arch/powerpc/kexec/file_load_64.c |2 -
>  4 files changed, 143 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/kexec.h 
> b/arch/powerpc/include/asm/kexec.h
> index a1ddba01e7d13..f5a6f4a1b8eb0 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -144,6 +144,7 @@ unsigned int kexec_extra_fdt_size_ppc64(struct kimage 
> *image);
>  int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
> unsigned long initrd_load_addr,
> unsigned long initrd_len, const char *cmdline);
> +int add_node_props(void *fdt, int node_offset, const struct device_node *dn);
>  #endif /* CONFIG_PPC64 */
>
>  #endif /* CONFIG_KEXEC_FILE */
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 0b5878c3125b1..c2d4f55042d72 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -322,6 +322,9 @@ static void __init check_cpu_feature_properties(unsigned 
> long node)
> }
>  }
>
> +/* align addr on a size boundary - adjust address up */
> +#define _ALIGN_UP(addr, size)   
> (((addr)+((size)-1))&(~((typeof(addr))(size)-1)))
> +
>  static int __init early_init_dt_scan_cpus(unsigned long node,
>   const char *uname, int depth,
>   void *data)
> @@ -348,6 +351,16 @@ static int __init early_init_dt_scan_cpus(unsigned long 
> node,
>
> nthreads = len / sizeof(int);
>
> +   /*
> +* Align nr_cpu_ids to correct SMT value. This will help us to 
> allocate
> +* pacas correctly to accomodate boot_cpu != 0 scenario e.g. in kdump
> +* kernel the boot cpu can be any cpu between 0 through nthreads.
> +*/
> +   if (nr_cpu_ids % nthreads) {
> +   nr_cpu_ids = _ALIGN_UP(nr_cpu_ids, nthreads);
> +   pr_info("Aligned nr_cpus to SMT=%d, nr_cpu_ids = %d\n", 
> nthreads, nr_cpu_ids);
> +   }
> +
> /*
>  * Now see if any of these threads match our boot cpu.
>  * NOTE: This must match the parsing done in smp_setup_cpu_maps.
> diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
> index a79e28c91e2be..168bef43e22c2 100644
> --- a/arch/powerpc/kexec/core_64.c
> +++ b/arch/powerpc/kexec/core_64.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -298,6 +299,119 @@ extern void kexec_sequence(void *newstack, unsigned 
> long start,
>void (*clear_all)(void),
>bool copy_with_mmu_off) __noreturn;
>
> +/*
> + * Move the crashing cpus FDT node as the first node under '/cpus' node.
> + *
> 

Re: [PATCH 2/2] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues

2023-07-10 Thread Pingfan Liu
Hi Ming,

I do not have [PATCH 1/2] blk-mq: add blk_mq_max_nr_hw_queues() in my
inbox, so I am replying here.

At first glance, I think that  the cpu hot plug callback hook should
be the remedy for the newly introduced blk_mq_max_nr_hw_queues(),
although it is more complicated.

Consider the scenario where nr_cpus=4 is used to speed up the dumping
process: blk_mq_max_nr_hw_queues() cannot utilize the other three
cpus.


Thanks,

Pingfan

On Mon, Jul 10, 2023 at 5:16 PM Ming Lei  wrote:
>
> On Mon, Jul 10, 2023 at 08:41:09AM +0200, Christoph Hellwig wrote:
> > On Sat, Jul 08, 2023 at 10:02:59AM +0800, Ming Lei wrote:
> > > Take blk-mq's knowledge into account for calculating io queues.
> > >
> > > Fix wrong queue mapping in case of kdump kernel.
> > >
> > > On arm and ppc64, 'maxcpus=1' is passed to kdump command line, see
> > > `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> > > still returns all CPUs.
> >
> > That's simply broken.  Please fix the arch code to make sure
> > it does not return a bogus num_possible_cpus value for these
>
> That is documented in Documentation/admin-guide/kdump/kdump.rst.
>
> On arm and ppc64, 'maxcpus=1' is passed for kdump kernel, and "maxcpu=1"
> simply keep one of CPU cores as online, and others as offline.
>
> So Cc our arch(arm & ppc64) & kdump guys wrt. passing 'maxcpus=1' for
> kdump kernel.
>
> > setups, otherwise you'll have to paper over it in all kind of
> > drivers.
>
> The issue is only triggered for drivers which use managed irq &
> multiple hw queues.
>
>
> Thanks,
> Ming
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>



Re: [PATCH 2/2] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues

2023-07-10 Thread Pingfan Liu
On Mon, Jul 10, 2023 at 5:16 PM Ming Lei  wrote:
>
> On Mon, Jul 10, 2023 at 08:41:09AM +0200, Christoph Hellwig wrote:
> > On Sat, Jul 08, 2023 at 10:02:59AM +0800, Ming Lei wrote:
> > > Take blk-mq's knowledge into account for calculating io queues.
> > >
> > > Fix wrong queue mapping in case of kdump kernel.
> > >
> > > On arm and ppc64, 'maxcpus=1' is passed to kdump command line, see
> > > `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> > > still returns all CPUs.
> >
> > That's simply broken.  Please fix the arch code to make sure
> > it does not return a bogus num_possible_cpus value for these
>

In fact, num_possible_cpus is not broken.

Quote from admin-guide/kernel-parameters.txt:
   maxcpus=[SMP] Maximum number of processors that an SMP kernel
   will bring up during bootup.  maxcpus=n : n >= 0 limits
   the kernel to bring up 'n' processors. Surely after
   bootup you can bring up the other plugged cpu by executing
   "echo 1 > /sys/devices/system/cpu/cpuX/online". So maxcpus
   only takes effect during system bootup.
   While n=0 is a special case, it is equivalent to "nosmp",
   which also disables the IO APIC.

Here, as it explains, maxcpus only affects bootup; afterwards, extra
cpus can be brought online.

> That is documented in Documentation/admin-guide/kdump/kdump.rst.
>
> On arm and ppc64, 'maxcpus=1' is passed for kdump kernel, and "maxcpu=1"

On aarch64 and x86, nr_cpus=1 is used, while on ppc64, due to the
implementation, nr_cpus=1 can not be supported.


Thanks,

Pingfan

> simply keep one of CPU cores as online, and others as offline.
>
> So Cc our arch(arm & ppc64) & kdump guys wrt. passing 'maxcpus=1' for
> kdump kernel.
>
> > setups, otherwise you'll have to paper over it in all kind of
> > drivers.
>
> The issue is only triggered for drivers which use managed irq &
> multiple hw queues.
>
>
> Thanks,
> Ming
>
>
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>



Re: // a kdump hang caused by PPC pci patch series

2022-11-24 Thread Pingfan Liu
On Mon, Nov 21, 2022 at 8:57 PM Cédric Le Goater  wrote:
>
> On 11/21/22 12:57, Pingfan Liu wrote:
> > Sorry that forget a subject.
> >
> > On Mon, Nov 21, 2022 at 7:54 PM Pingfan Liu  wrote:
> >>
> >> Hello Powerpc folks,
> >>
> >> I encounter an kdump bug, which I bisect and pin commit 174db9e7f775
> >> ("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")
> >> In that case, using Fedora 36 as host, the mentioned commit as the
> >> guest kernel, and virto-block disk, the kdump kernel will hang:
>
> The host kernel should be using the PowerNV platform and not pseries
> or are you running a nested L2 guest on KVM/pseries L1 ?
>
> And as far as I remember, the patch above only impacts the IBM PowerVM
> hypervisor, not KVM, and PHB hotplug, or kdump induces some hot-plugging
> I am not aware of.
>
> Also, if indeed, this is a L2 guest, the XIVE interrupt controller is
> emulated in QEMU, "info pic" should return:
>
>...
>irqchip: emulated
>
> >>
> >> [0.00] Kernel command line: elfcorehdr=0x22c0
> >> no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
> >> irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
> >>   numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
> >> kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
> >> hugetlb_cma=0
> >>  ...
> >>  [7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
> >>  [7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
> >> blocks (10.7 GB/10.0 GiB)
> >>  [   68.398234] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 is taking a long time
> >>  [  188.398258] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 killed
> >>
> >>
> >> During my test, I found that in very rare cases, the kdump can success
> >> (I guess it may be due to the cpu id).  And if using either maxcpus=2
> >> or using scsi-disk, then kdump can also success.  And before the
> >> mentioned commit, kdump can also success.
> >>
> >> The attachment contains the xml to reproduce that bug.
> >>
> >> Do you have any ideas?
>
> Most certainly an interrupt not being delivered. You can check the status
> on the host with :
>
>virsh qemu-monitor-command --hmp   "info pic"
>

Please pick it up from the attachment.

Thanks,

Pingfan
Script started on 2022-11-24 03:22:55-05:00 [TERM="xterm-256color" TTY="/dev/pts/0" COLUMNS="172" LINES="41"]
[root@ibm-p9wr-02 ~]# virsh qemu-monitor-command --hmp rhel9 "info pic"
CPU[]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2

CPU[]: USER00   00  0000   00  00  00   00  

CPU[]:   OS00   ff  0000   ff  00  ff   ff  8400

CPU[]: POOL00   00  0000   00  00  00   00  

CPU[]: PHYS00   00  0000   00  00  00   ff  

CPU[0001]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2

CPU[0001]: USER00   00  0000   00  00  00   00  

CPU[0001]:   OS00   ff  0000   ff  00  ff   ff  8401

CPU[0001]: POOL00   00  0000   00  00  00   00  

CPU[0001]: PHYS00   00  0000   00  00  00   ff  

CPU[0002]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2

CPU[0002]: USER00   00  0000   00  00  00   00  

CPU[0002]:   OS00   ff  0000   ff  00  ff   ff  8402

CPU[0002]: POOL00   00  0000   00  00  00   00  

CPU[0002]: PHYS00   00  0000   00  00  00   ff  

CPU[0003]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2

CPU[0003]: USER00   00  0000   00  00  00   00  

CPU[0003]:   OS00   ff  0000   ff  00  ff   ff  8403

CPU[0003]: POOL00   00  0000   00  00  00   00  

CPU[0003]: PHYS00   00  0000   00  00  00   ff  

CPU[0004]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2

CPU[0004]: USER00   00  0000   00  00  00   00  

CPU[0004]:   OS00   ff  0000   ff  00  ff   ff  8404

CPU[0004]: POOL00   00  0000   00  00  00   00  

CPU[0004]: PHYS00   00  0000   00  00  00   ff  

CPU[0005]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2

CPU[0005]: USER00   00  0000   00  00  00   00  

CPU[0005]:   OS00   ff  0000   ff  00  ff   ff  8405

CPU[0005]: POOL00   00  0000   00  00  00   00  

CPU[0005]: PHYS00   00  0000   00  00  00   ff  

CPU[0006]:   QW   NSR CPPR IPB LSM

Re: // a kdump hang caused by PPC pci patch series

2022-11-21 Thread Pingfan Liu
Hi Cédric,

Appreciate your insight. Please see the comment inline below.

On Mon, Nov 21, 2022 at 8:57 PM Cédric Le Goater  wrote:
>
> On 11/21/22 12:57, Pingfan Liu wrote:
> > Sorry that forget a subject.
> >
> > On Mon, Nov 21, 2022 at 7:54 PM Pingfan Liu  wrote:
> >>
> >> Hello Powerpc folks,
> >>
> >> I encounter an kdump bug, which I bisect and pin commit 174db9e7f775
> >> ("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")
> >> In that case, using Fedora 36 as host, the mentioned commit as the
> >> guest kernel, and virto-block disk, the kdump kernel will hang:
>
> The host kernel should be using the PowerNV platform and not pseries
> or are you running a nested L2 guest on KVM/pseries L1 ?
>

Host kernel ran on P9 bare metal. And here PowerKVM is used.

> And as far as I remember, the patch above only impacts the IBM PowerVM
> hypervisor, not KVM, and PHB hotplug, or kdump induces some hot-plugging
> I am not aware of.
>

Sorry that my information is not clear.
The suspect series is "[PATCH 00/31] powerpc: Modernize the PCI/MSI
support", and in the main line, beginning from commit 786e5b102a00
("powerpc/pseries/pci: Introduce __find_pe_total_msi()").

I tried to bisect, and the commit a5f3d2c17b07 ("powerpc/pseries/pci:
Add MSI domains") even hangs the first kernel. So I went ahead to find
the next functional change on pseries, which is commit 174db9e7f775
("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug").


> Also, if indeed, this is a L2 guest, the XIVE interrupt controller is
> emulated in QEMU, "info pic" should return:
>
>...
>irqchip: emulated
>
> >>
> >> [0.00] Kernel command line: elfcorehdr=0x22c0
> >> no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
> >> irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
> >>   numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
> >> kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
> >> hugetlb_cma=0
> >>  ...
> >>  [7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
> >>  [7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
> >> blocks (10.7 GB/10.0 GiB)
> >>  [   68.398234] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 is taking a long time
> >>  [  188.398258] systemd-udevd[187]: virtio2: Worker [190]
> >> processing SEQNUM=1193 killed
> >>
> >>
> >> During my test, I found that in very rare cases, the kdump can success
> >> (I guess it may be due to the cpu id).  And if using either maxcpus=2
> >> or using scsi-disk, then kdump can also success.  And before the
> >> mentioned commit, kdump can also success.
> >>
> >> The attachment contains the xml to reproduce that bug.
> >>
> >> Do you have any ideas?
>
> Most certainly an interrupt not being delivered. You can check the status
> on the host with :
>
>virsh qemu-monitor-command --hmp   "info pic"
>

OK, I will try to occupy a P9 machine and have a shot. I will update
the info later.


Thanks,

Pingfan
>
>
> Thanks,
>
> C.


Re: // a kdump hang caused by PPC pci patch series

2022-11-21 Thread Pingfan Liu
Sorry, I forgot to add a subject.

On Mon, Nov 21, 2022 at 7:54 PM Pingfan Liu  wrote:
>
> Hello Powerpc folks,
>
> I encounter an kdump bug, which I bisect and pin commit 174db9e7f775
> ("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug")
>
> In that case, using Fedora 36 as host, the mentioned commit as the
> guest kernel, and virto-block disk, the kdump kernel will hang:
>
> [0.00] Kernel command line: elfcorehdr=0x22c0
> no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
> irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
>  numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
> kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
> hugetlb_cma=0
> ...
> [7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
> [7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
> blocks (10.7 GB/10.0 GiB)
> [   68.398234] systemd-udevd[187]: virtio2: Worker [190]
> processing SEQNUM=1193 is taking a long time
> [  188.398258] systemd-udevd[187]: virtio2: Worker [190]
> processing SEQNUM=1193 killed
>
>
> During my test, I found that in very rare cases, the kdump can success
> (I guess it may be due to the cpu id).  And if using either maxcpus=2
> or using scsi-disk, then kdump can also success.  And before the
> mentioned commit, kdump can also success.
>
> The attachment contains the xml to reproduce that bug.
>
> Do you have any ideas?
>
> Thanks


[no subject]

2022-11-21 Thread Pingfan Liu
Hello Powerpc folks,

I encountered a kdump bug, which I bisected to commit 174db9e7f775
("powerpc/pseries/pci: Add support of MSI domains to PHB hotplug").

In that case, using Fedora 36 as host, the mentioned commit as the
guest kernel, and a virtio-block disk, the kdump kernel will hang:

[0.00] Kernel command line: elfcorehdr=0x22c0
no_timer_check net.ifnames=0 console=tty0 console=hvc0,115200n8
irqpoll maxcpus=1 noirqdistrib reset_devices cgroup_disable=memory
 numa=off udev.children-max=2 ehea.use_mcs=0 panic=10
kvm_cma_resv_ratio=0 transparent_hugepage=never novmcoredd
hugetlb_cma=0
...
[7.763260] virtio_blk virtio2: 32/0/0 default/read/poll queues
[7.771391] virtio_blk virtio2: [vda] 20971520 512-byte logical
blocks (10.7 GB/10.0 GiB)
[   68.398234] systemd-udevd[187]: virtio2: Worker [190]
processing SEQNUM=1193 is taking a long time
[  188.398258] systemd-udevd[187]: virtio2: Worker [190]
processing SEQNUM=1193 killed


During my test, I found that in very rare cases kdump can succeed
(I guess it may depend on the cpu id).  kdump also succeeds with
either maxcpus=2 or a scsi-disk, and it succeeds before the
mentioned commit.

The attachment contains the xml to reproduce that bug.

Do you have any ideas?

Thanks

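[Attachment: libvirt domain XML. The markup was lost in archiving; only
these values survive, in order of appearance:]
  rhel9
  6266c1c1-1e74-4046-b959-33d94877b387
  http://libosinfo.org/xmlns/libvirt/domain/1.0
  http://redhat.com/rhel/8-unknown
  16777216
  16777216
  16
  hvm
  POWER9
  destroy
  restart
  destroy
  /usr/libexec/qemu-kvm
  /dev/urandom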


[RFC 08/10] cpuhp: Replace cpumask_any_but(cpu_online_mask, cpu)

2022-08-21 Thread Pingfan Liu
In a kexec quick reboot path, the dying cpus are still on
cpu_online_mask. During the teardown of cpu, a subsystem needs to
migrate its broker to a real online cpu.

This patch replaces cpumask_any_but(cpu_online_mask, cpu) in a teardown
procedure with cpumask_not_dying_but(cpu_online_mask, cpu).
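
As a rough user-space illustration of the intended selection rule, the
sketch below models CPU masks as 64-bit words. It is only an assumption
about the semantics (pick a CPU from the mask that is neither the given
CPU nor marked dying); the real cpumask_not_dying_but() is defined
elsewhere in this series and may differ. MODEL_NR_CPUS and
model_not_dying_but() are hypothetical names.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 64u   /* hypothetical stand-in for nr_cpu_ids */

/*
 * Return the lowest CPU that is set in 'mask', is not 'cpu', and is not
 * set in 'dying'. Returns MODEL_NR_CPUS when no such CPU exists, which
 * mirrors the "target >= nr_cpu_ids" check used by the callers.
 */
unsigned int model_not_dying_but(uint64_t mask, uint64_t dying,
                                 unsigned int cpu)
{
    uint64_t candidates = mask & ~dying;
    unsigned int i;

    if (cpu < MODEL_NR_CPUS)
        candidates &= ~(UINT64_C(1) << cpu);    /* exclude the outgoing CPU */

    for (i = 0; i < MODEL_NR_CPUS; i++)
        if (candidates & (UINT64_C(1) << i))
            return i;

    return MODEL_NR_CPUS;
}
```

With CPUs 0-3 online and CPUs 1 and 2 dying, tearing down CPU 0 selects
CPU 3 in this model, whereas a plain "any but" choice could land on a
dying CPU.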

Signed-off-by: Pingfan Liu 
Cc: Russell King 
Cc: Shawn Guo 
Cc: Sascha Hauer 
Cc: Pengutronix Kernel Team 
Cc: Fabio Estevam 
Cc: NXP Linux Team 
Cc: Fenghua Yu 
Cc: Dave Jiang 
Cc: Vinod Koul 
Cc: Wu Hao 
Cc: Tom Rix 
Cc: Moritz Fischer 
Cc: Xu Yilun 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Frank Li 
Cc: Shaokun Zhang 
Cc: Qi Liu 
Cc: Andy Gross 
Cc: Bjorn Andersson 
Cc: Konrad Dybcio 
Cc: Khuong Dinh 
Cc: Li Yang 
Cc: Yury Norov 
To: linux-arm-ker...@lists.infradead.org
To: dmaeng...@vger.kernel.org
To: linux-f...@vger.kernel.org
To: intel-...@lists.freedesktop.org
To: dri-de...@lists.freedesktop.org
To: linux-arm-...@vger.kernel.org
To: linuxppc-dev@lists.ozlabs.org
To: linux-ker...@vger.kernel.org
---
 arch/arm/mach-imx/mmdc.c | 2 +-
 arch/arm/mm/cache-l2x0-pmu.c | 2 +-
 drivers/dma/idxd/perfmon.c   | 2 +-
 drivers/fpga/dfl-fme-perf.c  | 2 +-
 drivers/gpu/drm/i915/i915_pmu.c  | 2 +-
 drivers/perf/arm-cci.c   | 2 +-
 drivers/perf/arm-ccn.c   | 2 +-
 drivers/perf/arm-cmn.c   | 4 ++--
 drivers/perf/arm_dmc620_pmu.c| 2 +-
 drivers/perf/arm_dsu_pmu.c   | 2 +-
 drivers/perf/arm_smmuv3_pmu.c| 2 +-
 drivers/perf/fsl_imx8_ddr_perf.c | 2 +-
 drivers/perf/hisilicon/hisi_uncore_pmu.c | 2 +-
 drivers/perf/marvell_cn10k_tad_pmu.c | 2 +-
 drivers/perf/qcom_l2_pmu.c   | 2 +-
 drivers/perf/qcom_l3_pmu.c   | 2 +-
 drivers/perf/xgene_pmu.c | 2 +-
 drivers/soc/fsl/qbman/bman_portal.c  | 2 +-
 drivers/soc/fsl/qbman/qman_portal.c  | 2 +-
 19 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/arm/mach-imx/mmdc.c b/arch/arm/mach-imx/mmdc.c
index af12668d0bf5..a109a7ea8613 100644
--- a/arch/arm/mach-imx/mmdc.c
+++ b/arch/arm/mach-imx/mmdc.c
@@ -220,7 +220,7 @@ static int mmdc_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
if (!cpumask_test_and_clear_cpu(cpu, _mmdc->cpu))
return 0;
 
-   target = cpumask_any_but(cpu_online_mask, cpu);
+   target = cpumask_not_dying_but(cpu_online_mask, cpu);
if (target >= nr_cpu_ids)
return 0;
 
diff --git a/arch/arm/mm/cache-l2x0-pmu.c b/arch/arm/mm/cache-l2x0-pmu.c
index 993fefdc167a..1b0037ef7fa5 100644
--- a/arch/arm/mm/cache-l2x0-pmu.c
+++ b/arch/arm/mm/cache-l2x0-pmu.c
@@ -428,7 +428,7 @@ static int l2x0_pmu_offline_cpu(unsigned int cpu)
if (!cpumask_test_and_clear_cpu(cpu, _cpu))
return 0;
 
-   target = cpumask_any_but(cpu_online_mask, cpu);
+   target = cpumask_not_dying_but(cpu_online_mask, cpu);
if (target >= nr_cpu_ids)
return 0;
 
diff --git a/drivers/dma/idxd/perfmon.c b/drivers/dma/idxd/perfmon.c
index d73004f47cf4..f3f1ccb55f73 100644
--- a/drivers/dma/idxd/perfmon.c
+++ b/drivers/dma/idxd/perfmon.c
@@ -528,7 +528,7 @@ static int perf_event_cpu_offline(unsigned int cpu, struct hlist_node *node)
if (!cpumask_test_and_clear_cpu(cpu, _dsa_cpu_mask))
return 0;
 
-   target = cpumask_any_but(cpu_online_mask, cpu);
+   target = cpumask_not_dying_but(cpu_online_mask, cpu);
 
/* migrate events if there is a valid target */
if (target < nr_cpu_ids)
diff --git a/drivers/fpga/dfl-fme-perf.c b/drivers/fpga/dfl-fme-perf.c
index 587c82be12f7..57804f28357e 100644
--- a/drivers/fpga/dfl-fme-perf.c
+++ b/drivers/fpga/dfl-fme-perf.c
@@ -948,7 +948,7 @@ static int fme_perf_offline_cpu(unsigned int cpu, struct hlist_node *node)
if (cpu != priv->cpu)
return 0;
 
-   target = cpumask_any_but(cpu_online_mask, cpu);
+   target = cpumask_not_dying_but(cpu_online_mask, cpu);
if (target >= nr_cpu_ids)
return 0;
 
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 958b37123bf1..f866f9223492 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -1068,7 +1068,7 @@ static int i915_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
return 0;
 
if (cpumask_test_and_clear_cpu(cpu, _pmu_cpumask)) {
-   target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
+   target = cpumask_not_dying_but(topology_sibling_cpumask(cpu), cpu);
 
/* Migrate events if there is a valid target */
if (target < nr_cpu_ids) {
diff --git a/drivers/perf/arm-cci.c b/drivers/perf/arm-cci.c
index 03b1309875ae

[PATCHv4 1/2] cpu/hotplug: Keep cpu hotplug disabled until the rebooting cpu is stable

2022-05-11 Thread Pingfan Liu
smp_shutdown_nonboot_cpus() repeats the same code chunk as
migrate_to_reboot_cpu() to ensure that the rebooting happens on a valid
cpu.

if (!cpu_online(primary_cpu))
primary_cpu = cpumask_first(cpu_online_mask);

This is due to an unexpected cpu-down event like the following:
kernel_kexec()
   migrate_to_reboot_cpu();
   cpu_hotplug_enable();
                        ---> comes a cpu_down(this_cpu) on other cpu
   machine_shutdown();
     smp_shutdown_nonboot_cpus();  <-- needs to re-check "if (!cpu_online(primary_cpu))"

Although the kexec-reboot task can get through a cpu_down() on its cpu,
this code looks a little confusing.

Tracing down the git history, the cpu_hotplug_enable() called by
kernel_kexec() is introduced by commit 011e4b02f1da ("powerpc, kexec:
Fix "Processor X is stuck" issue during kexec from ST mode"), which
wakes up all offline cpu by cpu_up(cpu). Later, it is required by the
architectures(arm/arm64/ia64/riscv) which resort to cpu hot-removing to
achieve kexec-reboot by
smp_shutdown_nonboot_cpus()->cpu_down_maps_locked().

Hence, the cpu_hotplug_enable() in kernel_kexec() is an architecture
requirement.

By deferring the cpu hotplug enable to a more proper point, where
smp_shutdown_nonboot_cpus() holds cpu_add_remove_lock, the
unexpected cpu-down event is squashed out and the rebooting cpu can keep
unchanged. (For powerpc, no gains from this change.)

As a result, the repeated code chunk can be removed and in [2/2], the
callsites of smp_shutdown_nonboot_cpus() can be consistent.

Signed-off-by: Pingfan Liu 
Cc: Eric Biederman 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vincent Donnefort 
Cc: Ingo Molnar 
Cc: Michael Ellerman 
Cc: Mark Rutland 
Cc: YueHaibing 
Cc: Baokun Li 
Cc: Randy Dunlap 
Cc: Valentin Schneider 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
To: linux-ker...@vger.kernel.org
---
 arch/powerpc/kexec/core_64.c |  1 +
 kernel/cpu.c | 10 +-
 kernel/kexec_core.c  | 11 +--
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 6cc7793b8420..8ccf22197f08 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -224,6 +224,7 @@ static void wake_offline_cpus(void)
 
 static void kexec_prepare_cpus(void)
 {
+   cpu_hotplug_enable();
wake_offline_cpus();
smp_call_function(kexec_smp_down, NULL, /* wait */0);
local_irq_disable();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d0a9aa0b42e8..4415370f0e91 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1236,12 +1236,12 @@ void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
cpu_maps_update_begin();
 
/*
-* Make certain the cpu I'm about to reboot on is online.
-*
-* This is inline to what migrate_to_reboot_cpu() already do.
+* At this point, the cpu hotplug is still disabled by
+* migrate_to_reboot_cpu() to guarantee that the rebooting happens on
+* the selected CPU.  But cpu_down_maps_locked() returns -EBUSY, if
+* cpu_hotplug_disabled. So re-enable CPU hotplug here.
 */
-   if (!cpu_online(primary_cpu))
-   primary_cpu = cpumask_first(cpu_online_mask);
+   __cpu_hotplug_enable();
 
for_each_online_cpu(cpu) {
if (cpu == primary_cpu)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 68480f731192..1bd5a8c95a20 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1168,14 +1168,13 @@ int kernel_kexec(void)
kexec_in_progress = true;
kernel_restart_prepare("kexec reboot");
migrate_to_reboot_cpu();
-
/*
-* migrate_to_reboot_cpu() disables CPU hotplug assuming that
-* no further code needs to use CPU hotplug (which is true in
-* the reboot case). However, the kexec path depends on using
-* CPU hotplug again; so re-enable it here.
+* migrate_to_reboot_cpu() disables CPU hotplug and pin the
+* rebooting thread on the selected CPU. If an architecture
+* requires CPU hotplug to achieve kexec reboot, it should
+* enable the hotplug in the architecture specific code
 */
-   cpu_hotplug_enable();
+
pr_notice("Starting new kernel\n");
machine_shutdown();
}
-- 
2.31.1



Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Pingfan Liu
Correct mail address of Kazuhito

On Tue, Jun 8, 2021 at 6:34 PM Pingfan Liu  wrote:
>
> As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the
> formula:
> #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
>
> Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> PAGES_PER_SECTION in makedumpfile just like kernel.
>
> Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> info. Such info is impossible to be deduced from a crashdump vmcore.
> Hence append SECTION_SIZE_BITS to vmcoreinfo.
>
> Signed-off-by: Pingfan Liu 
> Cc: Bhupesh Sharma 
> Cc: Kazuhito Hagio 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Boris Petkov 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: James Morse 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Dave Anderson 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Cc: ke...@lists.infradead.org
> Cc: x...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> ---
>  kernel/crash_core.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..684a6061a13a 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
> VMCOREINFO_STRUCT_SIZE(mem_section);
> VMCOREINFO_OFFSET(mem_section, section_mem_map);
> +   VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
> VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>  #endif
> VMCOREINFO_STRUCT_SIZE(page);
> --
> 2.29.2
>


[PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Pingfan Liu
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the
formula:
#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
PAGES_PER_SECTION in makedumpfile just like kernel.

Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
SECTION_SIZE_BITS"). But user space wants a stable interface to get this
info. Such info is impossible to be deduced from a crashdump vmcore.
Hence append SECTION_SIZE_BITS to vmcoreinfo.
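
For readers unfamiliar with the sparsemem arithmetic, the following
stand-alone sketch shows how a dump tool could derive the section
quantities once both numbers are exported. The function names are
hypothetical and the inputs used in the note below are illustrative
values, not taken from any particular kernel configuration.

```c
/* SECTIONS_SHIFT = MAX_PHYSMEM_BITS - SECTION_SIZE_BITS */
unsigned int sections_shift(unsigned int max_physmem_bits,
                            unsigned int section_size_bits)
{
    return max_physmem_bits - section_size_bits;
}

/* Number of memory sections: 1 << SECTIONS_SHIFT */
unsigned long nr_mem_sections(unsigned int max_physmem_bits,
                              unsigned int section_size_bits)
{
    return 1UL << sections_shift(max_physmem_bits, section_size_bits);
}

/* PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT) */
unsigned long pages_per_section(unsigned int section_size_bits,
                                unsigned int page_shift)
{
    return 1UL << (section_size_bits - page_shift);
}
```

For example, with 48 physical address bits and 4 KiB pages (page shift
12), lowering the section size bits from 30 to 27 shrinks
pages_per_section() from 262144 to 32768, which is why a tool that
hard-codes the old value would misread the section layout.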

Signed-off-by: Pingfan Liu 
Cc: Bhupesh Sharma 
Cc: Kazuhito Hagio 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Boris Petkov 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: James Morse 
Cc: Mark Rutland 
Cc: Will Deacon 
Cc: Catalin Marinas 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Dave Anderson 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: ke...@lists.infradead.org
Cc: x...@kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
 kernel/crash_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 825284baaf46..684a6061a13a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
+   VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
 #endif
VMCOREINFO_STRUCT_SIZE(page);
-- 
2.29.2



Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2021-04-13 Thread Pingfan Liu
On Sat, Apr 10, 2021 at 12:33 AM Michal Suchánek  wrote:
>
> Hello,
>
> On Fri, Aug 28, 2020 at 04:10:09PM +0800, Pingfan Liu wrote:
> > On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour  
> > wrote:
> > >
> > > Le 10/08/2020 à 10:52, Pingfan Liu a écrit :
> > > > A bug is observed on pseries by taking the following steps on rhel:
> > > > -1. drmgr -c mem -r -q 5
> > > > -2. echo c > /proc/sysrq-trigger
> > > >
> > > > And then, the failure looks like:
> > > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > > > kdump: saving vmcore-dmesg.txt
> > > > kdump: saving vmcore-dmesg.txt complete
> > > > kdump: saving vmcore
> > > >   Checking for memory holes   : [  0.0 %] /
> > > >   Checking for memory holes   : [100.0 %] |
> > > >   Excluding unnecessary pages : [100.0 %] \
> > > >   Copying data                : [  0.3 %] -  eta: 38s
> > > > [   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > > > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > > > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > > > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > > > [   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > > > [   44.338548] Core dump to |/bin/false pipe failed
> > > > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > > > kdump: saving vmcore failed
> > > >
> > > > * Root cause *
> > > >After analyzing, it turns out that in the current implementation,
> > > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt 
> > > > updating as
> > > > the code __remove_memory() comes before drmem_update_dt().
> > > > So in kdump kernel, when read_from_oldmem() resorts to
> > > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
> > > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, 
> > > > as it
> > > > can be observed "Bus error"
> > > >
> > > >  From a viewpoint of listener and publisher, the publisher notifies the
> > > > listener before data is ready.  This introduces a problem where udev
> > > > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > > > updating. And in capture kernel, makedumpfile will access the memory 
> > > > based
> > > > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.
> > > >
> > > > * Fix *
> > > > This bug is introduced by commit 063b8b1251fd
> > > > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> > > > request"), which tried to combine all the dt updating into one.
> > > >
> > > > To fix this issue, meanwhile not to introduce a quadratic runtime
> > > > complexity by the model:
> > > >dlpar_memory_add_by_count
> > > >  for_each_drmem_lmb <--
> > > >dlpar_add_lmb
> > > >  drmem_update_dt(_v1|_v2)
> > > >for_each_drmem_lmb   <--
> > > > The dt should still be only updated once, and just before the last 
> > > > memory
> > > > online/offline event is ejected to user space. Achieve this by tracing 
> > > > the
> > > > num of lmb added or removed.
> > > >
> > > > Signed-off-by: Pingfan Liu 
> > > > Cc: Michael Ellerman 
> > > > Cc: Hari Bathini 
> > > > Cc: Nathan Lynch 
> > > > Cc: Nathan Fontenot 
> > > > Cc: Laurent Dufour 
> > > > To: linuxppc-dev@lists.ozlabs.org
> > > > Cc: ke...@lists.infradead.org
> > > > ---
> > > > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() pro

Re: [PATCH 0/3] warn and suppress irqflood

2020-10-25 Thread Pingfan Liu
On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner  wrote:
>
> On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > I hit a irqflood bug on powerpc platform, and two years ago, on a x86 
> > platform.
> > When the bug happens, the kernel is totally occupies by irq.  Currently, 
> > there
> > may be nothing or just soft lockup warning showed in console. It is better
> > to warn users with irq flood info.
> >
> > In the kdump case, the kernel can move on by suppressing the irq flood.
>
> You're curing the symptom not the cause and the cure is just magic and
> can't work reliably.
Yeah, it is magic. But at least it is better to printk something and
alert users about what is happening. With the current code, nothing may
be shown when the system hangs.
>
> Where is that irq flood originated from and why is none of the
> mechanisms we have in place to shut it up working?
The bug originates from the tpm_i2c_nuvoton driver, which calls the
i2c bus driver (i2c-opal.c). The bug is triggered after
i2c_opal_send_request().

Things are complicated by the firmware layer, Skiboot, which hides
the details of manipulating the hardware from Linux.

I guess the software logic cannot return to a sane state when the
kernel crashes.

Cc'ing the Skiboot and ppc64 communities to see whether anyone has an
idea about it.

Thanks,
Pingfan


Re: [PATCH] powerpc/time: enable sched clock for irqtime

2020-10-22 Thread Pingfan Liu
I encountered an irq flood on a Power9 machine, and tried to work
around it with https://www.spinics.net/lists/kernel/msg3705028.html

As irq time accounting is the foundation of that method, irq time
accounting needs to take effect on the powerpc platform.

On Thu, Oct 22, 2020 at 2:51 PM Pingfan Liu  wrote:
>
> When CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN, powerpc
> does not enable "sched_clock_irqtime" and can not utilize irq time
> accounting.
>
> Like x86, powerpc does not use the sched_clock_register() interface. So it
> needs an dedicated call to enable_sched_clock_irqtime() to enable irq time
> accounting.
>
> Signed-off-by: Pingfan Liu 
> Cc: Michael Ellerman 
> Cc: Christophe Leroy 
> Cc: Nicholas Piggin 
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/time.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f85539e..4083b59e 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -53,6 +53,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>
> @@ -1134,6 +1135,7 @@ void __init time_init(void)
> tick_setup_hrtimer_broadcast();
>
> of_clk_init(NULL);
> +   enable_sched_clock_irqtime();
>  }
>
>  /*
> --
> 2.7.5
>


[PATCH] powerpc/time: enable sched clock for irqtime

2020-10-22 Thread Pingfan Liu
When CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are
enabled, powerpc does not enable "sched_clock_irqtime" and cannot utilize
irq time accounting.

Like x86, powerpc does not use the sched_clock_register() interface, so it
needs a dedicated call to enable_sched_clock_irqtime() to enable irq time
accounting.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Nicholas Piggin 
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/time.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f85539e..4083b59e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -1134,6 +1135,7 @@ void __init time_init(void)
tick_setup_hrtimer_broadcast();
 
of_clk_init(NULL);
+   enable_sched_clock_irqtime();
 }
 
 /*
-- 
2.7.5



Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-28 Thread Pingfan Liu
On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour  wrote:
>
> Le 10/08/2020 à 10:52, Pingfan Liu a écrit :
> > A bug is observed on pseries by taking the following steps on rhel:
> > -1. drmgr -c mem -r -q 5
> > -2. echo c > /proc/sysrq-trigger
> >
> > And then, the failure looks like:
> > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > kdump: saving vmcore-dmesg.txt
> > kdump: saving vmcore-dmesg.txt complete
> > kdump: saving vmcore
> >   Checking for memory holes   : [  0.0 %] /
> >   Checking for memory holes   : [100.0 %] |
> >   Excluding unnecessary pages : [100.0 %] \
> >   Copying data                : [  0.3 %] -  eta: 38s
> > [   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > [   44.338548] Core dump to |/bin/false pipe failed
> > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > kdump: saving vmcore failed
> >
> > * Root cause *
> > After analysis, it turns out that in the current implementation, when
> > an lmb is hot-removed, the KOBJ_REMOVE event is emitted before the dt
> > is updated, because __remove_memory() is called before
> > drmem_update_dt(). As a result, in the kdump kernel,
> > read_from_oldmem() resorts to pSeries_lpar_hpte_insert() to install an
> > hpte, but this fails with -2 because the pfn does not exist. Finally,
> > low_hash_fault() raises SIGBUS to the process, which is observed as
> > "Bus error".
> >
> > From the viewpoint of listener and publisher, the publisher notifies
> > the listener before the data is ready. This introduces a problem where
> > udev launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt
> > before it is updated. Then, in the capture kernel, makedumpfile
> > accesses memory based on the stale dt info and hits a SIGBUS error due
> > to a non-existent lmb.
> >
> > * Fix *
> > This bug was introduced by commit 063b8b1251fd
> > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> > request"), which tried to combine all the dt updating into one.
> >
> > To fix this issue without introducing the quadratic runtime
> > complexity of the model:
> >   dlpar_memory_add_by_count
> >     for_each_drmem_lmb <--
> >       dlpar_add_lmb
> >         drmem_update_dt(_v1|_v2)
> >           for_each_drmem_lmb   <--
> > the dt should still be updated only once, just before the last memory
> > online/offline event is emitted to user space. Achieve this by
> > tracking the number of lmbs added or removed.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Michael Ellerman 
> > Cc: Hari Bathini 
> > Cc: Nathan Lynch 
> > Cc: Nathan Fontenot 
> > Cc: Laurent Dufour 
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: ke...@lists.infradead.org
> > ---
> > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
> >whether dt is updated successfully.
> >Fix a condition boundary check bug
> > v3 -> v4: resolve a quadratic runtime complexity issue.
> >This series is applied on next-test branch
> >   arch/powerpc/platforms/pseries/hotplug-memory.c | 102 +++-
> >   1 file changed, 80 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 46cbcd1..1567d9f 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
> >   return true;
> >   }
> >
> > -static int dlpar_add_lmb(struct drmem_lmb *);
> > +enum dt_update_status {
> > + DT_NOUPDATE,
> > + DT_TOUP

Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-27 Thread Pingfan Liu
Hello guys. Do you have further comments on this version?

Thanks,
Pingfan

On Mon, Aug 10, 2020 at 4:53 PM Pingfan Liu  wrote:
>
> A bug is observed on pseries by taking the following steps on rhel:
> -1. drmgr -c mem -r -q 5
> -2. echo c > /proc/sysrq-trigger
>
> And then, the failure looks like:
> kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> kdump: saving vmcore-dmesg.txt
> kdump: saving vmcore-dmesg.txt complete
> kdump: saving vmcore
> Checking for memory holes     : [  0.0 %] /
> Checking for memory holes     : [100.0 %] |
> Excluding unnecessary pages   : [100.0 %] \
> Copying data                  : [  0.3 %] -  eta: 38s
> [   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> [   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> [   44.338548] Core dump to |/bin/false pipe failed
> /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> kdump: saving vmcore failed
> kdump: saving vmcore failed
>
> * Root cause *
> After analysis, it turns out that in the current implementation, when
> an lmb is hot-removed, the KOBJ_REMOVE event is emitted before the dt
> is updated, because __remove_memory() is called before
> drmem_update_dt(). As a result, in the kdump kernel,
> read_from_oldmem() resorts to pSeries_lpar_hpte_insert() to install an
> hpte, but this fails with -2 because the pfn does not exist. Finally,
> low_hash_fault() raises SIGBUS to the process, which is observed as
> "Bus error".
>
> From the viewpoint of listener and publisher, the publisher notifies
> the listener before the data is ready. This introduces a problem where
> udev launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt
> before it is updated. Then, in the capture kernel, makedumpfile
> accesses memory based on the stale dt info and hits a SIGBUS error due
> to a non-existent lmb.
>
> * Fix *
> This bug was introduced by commit 063b8b1251fd
> ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> request"), which tried to combine all the dt updating into one.
>
> To fix this issue without introducing the quadratic runtime
> complexity of the model:
>   dlpar_memory_add_by_count
>     for_each_drmem_lmb <--
>       dlpar_add_lmb
>         drmem_update_dt(_v1|_v2)
>           for_each_drmem_lmb   <--
> the dt should still be updated only once, just before the last memory
> online/offline event is emitted to user space. Achieve this by
> tracking the number of lmbs added or removed.
>
> Signed-off-by: Pingfan Liu 
> Cc: Michael Ellerman 
> Cc: Hari Bathini 
> Cc: Nathan Lynch 
> Cc: Nathan Fontenot 
> Cc: Laurent Dufour 
> To: linuxppc-dev@lists.ozlabs.org
> Cc: ke...@lists.infradead.org
> ---
> v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
>   whether dt is updated successfully.
>   Fix a condition boundary check bug
> v3 -> v4: resolve a quadratic runtime complexity issue.
>   This series is applied on next-test branch
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 102 +++-
>  1 file changed, 80 insertions(+), 22 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 46cbcd1..1567d9f 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
> return true;
>  }
>
> -static int dlpar_add_lmb(struct drmem_lmb *);
> +enum dt_update_status {
> +   DT_NOUPDATE,
> +   DT_TOUPDATE,
> +   DT_UPDATED,
> +};
> +
> +/* "*dt_update" returns DT_UPDATED if updated */
> +static int dlpar_add_lmb(struct drmem_lmb *lmb,
> +   enum dt_update_status *dt_update);
>
> -static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> +static int dlpar_remove_lmb(struct drmem_lmb *lmb,
> +   enum dt_update_status *dt_update)
>  {
> unsigned long block_sz;

[PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-10 Thread Pingfan Liu
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Checking for memory holes     : [  0.0 %] /
Checking for memory holes     : [100.0 %] |
Excluding unnecessary pages   : [100.0 %] \
Copying data                  : [  0.3 %] -  eta: 38s
[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed
kdump: saving vmcore failed

* Root cause *
After analysis, it turns out that in the current implementation, when
an lmb is hot-removed, the KOBJ_REMOVE event is emitted before the dt
is updated, because __remove_memory() is called before
drmem_update_dt(). As a result, in the kdump kernel,
read_from_oldmem() resorts to pSeries_lpar_hpte_insert() to install an
hpte, but this fails with -2 because the pfn does not exist. Finally,
low_hash_fault() raises SIGBUS to the process, which is observed as
"Bus error".

From the viewpoint of listener and publisher, the publisher notifies
the listener before the data is ready. This introduces a problem where
udev launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt
before it is updated. Then, in the capture kernel, makedumpfile
accesses memory based on the stale dt info and hits a SIGBUS error due
to a non-existent lmb.

* Fix *
This bug was introduced by commit 063b8b1251fd
("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
request"), which tried to combine all the dt updating into one.

To fix this issue without introducing the quadratic runtime
complexity of the model:
  dlpar_memory_add_by_count
    for_each_drmem_lmb <--
      dlpar_add_lmb
        drmem_update_dt(_v1|_v2)
          for_each_drmem_lmb   <--
the dt should still be updated only once, just before the last memory
online/offline event is emitted to user space. Achieve this by
tracking the number of lmbs added or removed.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: Laurent Dufour 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
  whether dt is updated successfully.
  Fix a condition boundary check bug
v3 -> v4: resolve a quadratic runtime complexity issue.
  This series is applied on next-test branch
 arch/powerpc/platforms/pseries/hotplug-memory.c | 102 +++-
 1 file changed, 80 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 46cbcd1..1567d9f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
return true;
 }
 
-static int dlpar_add_lmb(struct drmem_lmb *);
+enum dt_update_status {
+   DT_NOUPDATE,
+   DT_TOUPDATE,
+   DT_UPDATED,
+};
+
+/* "*dt_update" returns DT_UPDATED if updated */
+static int dlpar_add_lmb(struct drmem_lmb *lmb,
+   enum dt_update_status *dt_update);
 
-static int dlpar_remove_lmb(struct drmem_lmb *lmb)
+static int dlpar_remove_lmb(struct drmem_lmb *lmb,
+   enum dt_update_status *dt_update)
 {
unsigned long block_sz;
phys_addr_t base_addr;
-   int rc, nid;
+   int rc, ret, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -372,6 +381,13 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   if (*dt_update) {
+   ret = drmem_update_dt();
+   if (ret)
+   pr_warn("%s fail to update dt, but continue\n", __func__);
+   else
+   *dt_update = DT_UPDATED;
+   }
 
__remove_m

[PATCHv5 1/2] powerpc/pseries: group lmb operation and memblock's

2020-08-10 Thread Pingfan Liu
This patch prepares for the following patch, which swaps the order of the
KOBJ_ADD/REMOVE uevent and the dt update.

The dt update should come after the lmb operations and before
__remove_memory()/__add_memory().  Accordingly, group all lmb operations
before the memblock operations.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: Laurent Dufour 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v4 -> v5: fix the missing clear of DRCONF_MEM_ASSIGNED in a failure path
 arch/powerpc/platforms/pseries/hotplug-memory.c | 28 +
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..46cbcd1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
unsigned long block_sz;
-   int rc;
+   phys_addr_t base_addr;
+   int rc, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
if (rc)
return rc;
 
+   base_addr = lmb->base_addr;
+   nid = lmb->nid;
block_sz = pseries_memory_block_size();
 
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-   /* Update memory regions for memory remove */
-   memblock_remove(lmb->base_addr, block_sz);
-
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+   __remove_memory(nid, base_addr, block_sz);
+
+   /* Update memory regions for memory remove */
+   memblock_remove(base_addr, block_sz);
+
return 0;
 }
 
@@ -603,22 +606,29 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
}
 
lmb_set_nid(lmb);
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();
 
/* Add the memory */
rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
if (rc) {
invalidate_lmb_associativity_index(lmb);
+   lmb_clear_nid(lmb);
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
return rc;
}
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}
 
return rc;
-- 
2.7.5



Re: [PATCHv4 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-04 Thread Pingfan Liu
On Tue, Aug 4, 2020 at 12:29 AM Laurent Dufour  wrote:
>
[...]
> >   lmb_set_nid(lmb);
> >   lmb->flags |= DRCONF_MEM_ASSIGNED;
> > + if (dt_update) {
> > + ret = drmem_update_dt();
> > + if (ret)
> > + pr_warn("%s fail to update dt, but continue\n", __func__);
> > + }
> >
> >   block_sz = memory_block_size_bytes();
>
> In the case the call to __add_memory is failing, the flag DRCONF_MEM_ASSIGNED
> should be cleared as I mentioned in your previous patch. In addition to this,
Yes.
> the DT should be updated, or the caller should manage that, but that will be
> hard since it doesn't know where the error happened in this function.
Yeah, it is hard for the caller to manage, so just updating the dt is the
easier method.
>
> >
> > @@ -625,7 +653,11 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > -
> > + if (dt_update) {
> > + ret = drmem_update_dt();
> > + if (ret)
> > + pr_warn("%s fail to update dt during rollback, but continue\n", __func__);
> > + }
> >   __remove_memory(nid, base_addr, block_sz);
> >   }
> >
> > @@ -638,6 +670,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >   int lmbs_available = 0;
> >   int lmbs_added = 0;
> >   int rc;
> > + bool dt_update = false;
> >
> >   pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);
> >
> > @@ -664,7 +697,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >   if (rc)
> >   continue;
> >
> > - rc = dlpar_add_lmb(lmb);
> > + rc = dlpar_add_lmb(lmb, dt_update);
> >   if (rc) {
> >   dlpar_release_drc(lmb->drc_index);
> >   continue;
> > @@ -678,16 +711,23 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >   lmbs_added++;
> >   if (lmbs_added == lmbs_to_add)
> >   break;
> > + else if (lmbs_added == lmbs_to_add - 1)
> > + dt_update = true;
>
> In the case the number of LMB to add is 1, dt_update is never set to true, and
> the device tree is never updated. You need to set dt_update to true earlier in
> the loop block.
Oh, I will fix it in V5
>
> >   }
> >
> >   if (lmbs_added != lmbs_to_add) {
> > + bool rollback_dt_update = false;
> > +
> >   pr_err("Memory hot-add failed, removing any added LMBs\n");
> >
> >   for_each_drmem_lmb(lmb) {
> >   if (!drmem_lmb_reserved(lmb))
> >   continue;
> >
> > - rc = dlpar_remove_lmb(lmb);
> > + if (--lmbs_added == 0 && dt_update)
> > + rollback_dt_update = true;
>
> That test may have to be reviewed to deal with an error during the last LMB
> addition: the DT may have been updated before __add_memory() fails in
> dlpar_add_lmb(). In that case, lmbs_added == 0 and that branch is not covered.
> That's not an issue if dlpar_add_lmb() handles that case (see my comment
> above).
I will fix it in next version.

Thanks for your review.

Regards,
Pingfan
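
The boundary condition raised in the review above (a request for a single
LMB never sets dt_update) can be sketched in isolation. This is only a
minimal model, not the kernel code: fake_add_lmb(), dt_updates_v4() and
dt_updates_fixed() are hypothetical names, every add is assumed to succeed,
and the "fixed" loop is just one possible way to decide the flag before the
add rather than the change that went into V5.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of device-tree updates requested during one simulated request. */
static int dt_updates;

/* Hypothetical stand-in for dlpar_add_lmb(): only counts DT updates. */
static void fake_add_lmb(bool dt_update)
{
	if (dt_update)
		dt_updates++;
}

/* v4-style loop: the flag is raised only after an add has completed. */
int dt_updates_v4(int lmbs_to_add)
{
	bool dt_update = false;
	int lmbs_added = 0;

	assert(lmbs_to_add >= 0);
	dt_updates = 0;
	while (lmbs_added < lmbs_to_add) {
		fake_add_lmb(dt_update);
		lmbs_added++;
		if (lmbs_added == lmbs_to_add)
			break;
		else if (lmbs_added == lmbs_to_add - 1)
			dt_update = true;
	}
	return dt_updates;
}

/* Possible fix: decide before each add whether it is the last one. */
int dt_updates_fixed(int lmbs_to_add)
{
	int lmbs_added = 0;

	assert(lmbs_to_add >= 0);
	dt_updates = 0;
	while (lmbs_added < lmbs_to_add) {
		bool dt_update = (lmbs_added == lmbs_to_add - 1);

		fake_add_lmb(dt_update);
		lmbs_added++;
	}
	return dt_updates;
}
```

In this model dt_updates_v4(1) requests no DT update at all, while
dt_updates_fixed() requests exactly one update for any positive count.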


Re: [PATCHv4 1/2] powerpc/pseries: group lmb operation and memblock's

2020-08-04 Thread Pingfan Liu
On Mon, Aug 3, 2020 at 9:52 PM Laurent Dufour  wrote:
>
> > @@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> > }
> >
> > lmb_set_nid(lmb);
> > +   lmb->flags |= DRCONF_MEM_ASSIGNED;
> > +
> > block_sz = memory_block_size_bytes();
> >
> > /* Add the memory */
>
> Since the lmb->flags is now set earlier, you should unset it in the case the
> call to __add_memory() fails, something like:
>
> @@ -614,6 +614,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
> if (rc) {
> invalidate_lmb_associativity_index(lmb);
> +   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
You are right. I will fix it in V5.

Thanks,
Pingfan


[PATCHv4 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-07-30 Thread Pingfan Liu
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Checking for memory holes     : [  0.0 %] /
Checking for memory holes     : [100.0 %] |
Excluding unnecessary pages   : [100.0 %] \
Copying data                  : [  0.3 %] -  eta: 38s
[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed
kdump: saving vmcore failed

* Root cause *
After analysis, it turns out that in the current implementation, when
an lmb is hot-removed, the KOBJ_REMOVE event is emitted before the dt
is updated, because __remove_memory() is called before
drmem_update_dt(). As a result, in the kdump kernel,
read_from_oldmem() resorts to pSeries_lpar_hpte_insert() to install an
hpte, but this fails with -2 because the pfn does not exist. Finally,
low_hash_fault() raises SIGBUS to the process, which is observed as
"Bus error".

From the viewpoint of listener and publisher, the publisher notifies
the listener before the data is ready. This introduces a problem where
udev launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt
before it is updated. Then, in the capture kernel, makedumpfile
accesses memory based on the stale dt info and hits a SIGBUS error due
to a non-existent lmb.

* Fix *
This bug was introduced by commit 063b8b1251fd
("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
request"), which tried to combine all the dt updating into one.

To fix this issue without introducing the quadratic runtime
complexity of the model:
  dlpar_memory_add_by_count
    for_each_drmem_lmb <--
      dlpar_add_lmb
        drmem_update_dt(_v1|_v2)
          for_each_drmem_lmb   <--
the dt should still be updated only once, just before the last memory
online/offline event is emitted to user space. Achieve this by
tracking the number of lmbs added or removed.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
v3 -> v4: resolve a quadratic runtime complexity issue.
  This series is applied on next-test branch
 arch/powerpc/platforms/pseries/hotplug-memory.c | 88 ++---
 1 file changed, 66 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1a3ac3b..e07d5b1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -350,13 +350,13 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
return true;
 }
 
-static int dlpar_add_lmb(struct drmem_lmb *);
+static int dlpar_add_lmb(struct drmem_lmb *lmb, bool dt_update);
 
-static int dlpar_remove_lmb(struct drmem_lmb *lmb)
+static int dlpar_remove_lmb(struct drmem_lmb *lmb, bool dt_update)
 {
unsigned long block_sz;
phys_addr_t base_addr;
-   int rc, nid;
+   int rc, ret, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -372,6 +372,11 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   if (dt_update) {
+   ret = drmem_update_dt();
+   if (ret)
+   pr_warn("%s fail to update dt, but continue\n", __func__);
+   }
 
__remove_memory(nid, base_addr, block_sz);
 
@@ -387,6 +392,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
int lmbs_removed = 0;
int lmbs_available = 0;
int rc;
+   bool dt_update = false;
 
pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove);
 
@@ -409,7 +415,7 @@ static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
}
 
for_each_drmem_lmb(lmb) {
-   rc = dlpar_remove_

[PATCHv4 1/2] powerpc/pseries: group lmb operation and memblock's

2020-07-30 Thread Pingfan Liu
This patch prepares for the following patch, which swaps the order of the
KOBJ_ADD/REMOVE uevent and the dt update.

The dt update should come after the lmb operations and before
__remove_memory()/__add_memory().  Accordingly, group all lmb operations
before the memblock operations.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
Cc: Nathan Fontenot 
Cc: ke...@lists.infradead.org
To: linuxppc-dev@lists.ozlabs.org
---
v3 -> v4: improve commit log
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..1a3ac3b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
unsigned long block_sz;
-   int rc;
+   phys_addr_t base_addr;
+   int rc, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
if (rc)
return rc;
 
+   base_addr = lmb->base_addr;
+   nid = lmb->nid;
block_sz = pseries_memory_block_size();
 
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-   /* Update memory regions for memory remove */
-   memblock_remove(lmb->base_addr, block_sz);
-
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+   __remove_memory(nid, base_addr, block_sz);
+
+   /* Update memory regions for memory remove */
+   memblock_remove(base_addr, block_sz);
+
return 0;
 }
 
@@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
}
 
lmb_set_nid(lmb);
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();
 
/* Add the memory */
@@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}
 
return rc;
-- 
2.7.5



Re: [PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's

2020-07-28 Thread Pingfan Liu
On Thu, Jul 23, 2020 at 10:41 PM Nathan Lynch  wrote:
>
> Pingfan Liu  writes:
> > This patch prepares for the incoming patch which swaps the order of KOBJ_
> > uevent and dt's updating.
> >
> > It has no functional effect, just groups lmb operation and memblock's in
> > order to insert dt updating operation easily, and makes it easier to
> > review.
>
> ...
>
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 5d545b7..1a3ac3b 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
> >  static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >  {
> >   unsigned long block_sz;
> > - int rc;
> > + phys_addr_t base_addr;
> > + int rc, nid;
> >
> >   if (!lmb_is_removable(lmb))
> >   return -EINVAL;
> > @@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >   if (rc)
> >   return rc;
> >
> > + base_addr = lmb->base_addr;
> > + nid = lmb->nid;
> >   block_sz = pseries_memory_block_size();
> >
> > - __remove_memory(lmb->nid, lmb->base_addr, block_sz);
> > -
> > - /* Update memory regions for memory remove */
> > - memblock_remove(lmb->base_addr, block_sz);
> > -
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> >
> > + __remove_memory(nid, base_addr, block_sz);
> > +
> > + /* Update memory regions for memory remove */
> > + memblock_remove(base_addr, block_sz);
> > +
> >   return 0;
> >  }
>
> I don't understand; the commit message should not claim this has no
> functional effect when it changes the order of operations like
> this. Maybe this is an improvement over the current behavior, but it's
> not explained why it would be.
The functions whose names contain "lmb" are powerpc specific and are used
to build the dt.

The other group, __remove_memory() and memblock_remove(), is integrated
with the Linux mm.

And [2/2] places the dt update just before __remove_memory().

Thanks,
Pingfan
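
The reordering discussed above can be illustrated with a small stand-alone
model. Everything here is hypothetical (struct fake_lmb, FAKE_MEM_ASSIGNED,
fake_remove_memory() and fake_remove_lmb() are stand-ins, not the kernel's
types or functions). The point it tries to show is why base_addr and nid
are copied into locals: once the lmb bookkeeping has been cleared, the mm
side can no longer read them from the lmb.

```c
#include <assert.h>

#define FAKE_MEM_ASSIGNED 0x1u	/* hypothetical flag bit */

/* Hypothetical, reduced view of an LMB descriptor. */
struct fake_lmb {
	unsigned long base_addr;
	int nid;
	unsigned int flags;
};

/* Records what the simulated mm layer was asked to remove. */
int last_removed_nid = -1;
unsigned long last_removed_base;

/* Hypothetical stand-in for __remove_memory() + memblock_remove(). */
static void fake_remove_memory(int nid, unsigned long base_addr)
{
	last_removed_nid = nid;
	last_removed_base = base_addr;
}

void fake_remove_lmb(struct fake_lmb *lmb)
{
	int nid;
	unsigned long base_addr;

	assert(lmb);

	/* Save what the mm side needs before the lmb state is cleared. */
	nid = lmb->nid;
	base_addr = lmb->base_addr;

	/* Group 1: lmb bookkeeping (the data the dt is built from). */
	lmb->nid = -1;
	lmb->flags &= ~FAKE_MEM_ASSIGNED;

	/* A dt update would fit here, before any uevent is emitted. */

	/* Group 2: mm operations, using the saved values. */
	fake_remove_memory(nid, base_addr);
}
```

Calling fake_remove_lmb() on an lmb with nid 3 leaves the lmb cleared while
the simulated mm layer still sees nid 3 and the original base address.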


Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-07-24 Thread Pingfan Liu
On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch  wrote:
>
> Pingfan Liu  writes:
> > A bug is observed on pseries by taking the following steps on rhel:
> > -1. drmgr -c mem -r -q 5
> > -2. echo c > /proc/sysrq-trigger
> >
> > And then, the failure looks like:
> > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > kdump: saving vmcore-dmesg.txt
> > kdump: saving vmcore-dmesg.txt complete
> > kdump: saving vmcore
> > Checking for memory holes     : [  0.0 %] /
> > Checking for memory holes     : [100.0 %] |
> > Excluding unnecessary pages   : [100.0 %] \
> > Copying data                  : [  0.3 %] -  eta: 38s
> > [   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > [   44.338548] Core dump to |/bin/false pipe failed
> > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > kdump: saving vmcore failed
> > kdump: saving vmcore failed
> >
> > * Root cause *
> > After analysis, it turns out that in the current implementation, when
> > an lmb is hot-removed, the KOBJ_REMOVE event is emitted before the dt
> > is updated, because __remove_memory() is called before
> > drmem_update_dt(). As a result, in the kdump kernel,
> > read_from_oldmem() resorts to pSeries_lpar_hpte_insert() to install an
> > hpte, but this fails with -2 because the pfn does not exist. Finally,
> > low_hash_fault() raises SIGBUS to the process, which is observed as
> > "Bus error".
> >
> > From the viewpoint of listener and publisher, the publisher notifies
> > the listener before the data is ready. This introduces a problem where
> > udev launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt
> > before it is updated. Then, in the capture kernel, makedumpfile
> > accesses memory based on the stale dt info and hits a SIGBUS error due
> > to a non-existent lmb.
> >
> > * Fix *
> > To fix this issue, update the dt before __remove_memory(), and apply
> > the same rule in the hot-add path.
> >
> > This introduces an extra dt update for each lmb involved in a hotplug
> > operation. But it should be fine, since drmem_update_dt() is a
> > memory-based operation and hotplug is not a hot path.
>
> This is great analysis but the performance implications of the change
> are grave. The add/remove paths here are already O(n) where n is the
> quantity of memory assigned to the LP, this change would make it O(n^2):
>
> dlpar_memory_add_by_count
>   for_each_drmem_lmb <--
> dlpar_add_lmb
>   drmem_update_dt(_v1|_v2)
> for_each_drmem_lmb   <--
>
> Memory add/remove isn't a hot path but quadratic runtime complexity
> isn't acceptable. Its current performance is bad enough that I have
Yes, the quadratic runtime complexity sounds terrible.
And I am curious about the bug. Does the system have thousands of lmbs?

> internal bugs open on it.
>
> Not to mention we leak memory every time drmem_update_dt is called
> because we can't safely free device tree properties :-(
Do you know what blocks us from freeing them?
>
> Also note that this sort of reverts (fixes?) 063b8b1251fd
> ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> request").
Yes. And now I think I need to come up with another method to fix it.

Thanks,
Pingfan
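
The complexity concern above can be made concrete with a rough cost model.
This is a sketch under stated assumptions, not a measurement: rebuild_dt()
is a hypothetical stand-in that charges one "visit" per LMB walked by a DT
rebuild, and the two helpers compare rebuilding after every added LMB with
rebuilding once, on the last LMB of the request.

```c
#include <assert.h>

/* Hypothetical cost model: a DT rebuild walks every LMB once. */
static unsigned long rebuild_dt(unsigned long total_lmbs)
{
	unsigned long visits = 0;
	unsigned long i;

	for (i = 0; i < total_lmbs; i++)
		visits++;
	return visits;
}

/* Rebuild the DT after every added LMB: O(n^2) when lmbs_to_add ~ n. */
unsigned long visits_update_per_lmb(unsigned long total_lmbs,
				    unsigned long lmbs_to_add)
{
	unsigned long visits = 0;
	unsigned long i;

	assert(lmbs_to_add <= total_lmbs);
	for (i = 0; i < lmbs_to_add; i++)
		visits += rebuild_dt(total_lmbs);
	return visits;
}

/* Rebuild the DT once, when handling the last LMB of the request: O(n). */
unsigned long visits_update_once(unsigned long total_lmbs,
				 unsigned long lmbs_to_add)
{
	unsigned long visits = 0;
	unsigned long i;

	assert(lmbs_to_add <= total_lmbs);
	for (i = 0; i < lmbs_to_add; i++) {
		if (i == lmbs_to_add - 1)
			visits += rebuild_dt(total_lmbs);
	}
	return visits;
}
```

For 1000 LMBs all added in one request, the per-LMB variant charges
1,000,000 visits in this model and the single-update variant charges 1000.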


Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-07-22 Thread Pingfan Liu
On Wed, Jul 22, 2020 at 12:57 PM Michael Ellerman  wrote:
>
> Pingfan Liu  writes:
> > A bug is observed on pseries by taking the following steps on rhel:
> ^
> RHEL
>
> I assume it happens on mainline too?
Yes, it does.
>
[...]
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 1a3ac3b..def8cb3f 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -372,6 +372,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > + drmem_update_dt();
>
> No error checking?
Hmm, this needs a more careful design. Please see my comment at the end.
>
> >   __remove_memory(nid, base_addr, block_sz);
> >
> > @@ -607,6 +608,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >
> >   lmb_set_nid(lmb);
> >   lmb->flags |= DRCONF_MEM_ASSIGNED;
> > + drmem_update_dt();
>
> And here ..
> >
> >   block_sz = memory_block_size_bytes();
> >
> > @@ -625,6 +627,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
> >   invalidate_lmb_associativity_index(lmb);
> >   lmb_clear_nid(lmb);
> >   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
> > + drmem_update_dt();
>
>
> And here ..
>
> >   __remove_memory(nid, base_addr, block_sz);
> >   }
> > @@ -877,9 +880,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
> >   break;
> >   }
> >
> > - if (!rc)
> > - rc = drmem_update_dt();
> > -
> >   unlock_device_hotplug();
> >   return rc;
>
> Whereas previously we did check it.

drmem_update_dt() fails if and only if memory allocation fails. And in
that case, even the original code does not roll back the effect of
__add_memory()/__remove_memory().

I plan to do the following in V4: if drmem_update_dt() fails in
dlpar_add_lmb(), bail out immediately.

Thanks,
Pingfan
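
A rough user-space sketch of the "bail out immediately" idea follows. It is
only a model under stated assumptions: struct lmb_model, update_dt_model() and
add_lmb_model() are hypothetical stand-ins, and the real V4 code may well
look different.

```c
#include <errno.h>
#include <stdbool.h>

#define LMB_ASSIGNED 0x1u

struct lmb_model {
	unsigned int flags;
	bool memory_added;
};

/* Stand-in for drmem_update_dt(): fails only when allocation fails. */
static int update_dt_model(bool alloc_fails)
{
	return alloc_fails ? -ENOMEM : 0;
}

int add_lmb_model(struct lmb_model *lmb, bool alloc_fails)
{
	int rc;

	lmb->flags |= LMB_ASSIGNED;

	rc = update_dt_model(alloc_fails);
	if (rc) {
		/* Undo the flag change and stop before any memory is added. */
		lmb->flags &= ~LMB_ASSIGNED;
		return rc;
	}

	lmb->memory_added = true;	/* models __add_memory() */
	return 0;
}
```

The point of the ordering is that a failed dt update leaves nothing to roll
back on the memory side.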


[PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-07-21 Thread Pingfan Liu
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [  0.0 %] /
 Checking for memory holes : [100.0 %] |
 Excluding unnecessary pages   : [100.0 %] \
 Copying data  : [  0.3 %] -  eta: 38s
[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
  Analysis shows that in the current implementation, when an lmb is
hot-removed, the KOBJ_REMOVE event is emitted before the dt is updated,
because __remove_memory() is called before drmem_update_dt().
So in the kdump kernel, read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install the hpte, which fails with -2 because
the pfn does not exist. Finally, low_hash_fault() raises SIGBUS to the
process, which is observed as the "Bus error" above.

From the viewpoint of listener and publisher, the publisher notifies the
listener before the data is ready.  As a result, udev launches kexec-tools
(due to KOBJ_REMOVE) and loads a stale dt before it is updated. In the
capture kernel, makedumpfile then accesses memory based on the stale dt
info and hits SIGBUS because the lmb no longer exists.

* Fix *
  To fix this issue, update the dt before __remove_memory(), and apply the
same rule in the hot-add path.

This introduces an extra dt update for each lmb involved in a hotplug
operation. That should be fine, since drmem_update_dt() is a memory-based
operation and hotplug is not a hot path.
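
The ordering problem can be modelled in a few lines of user-space C. This is
only an illustration: struct hotplug_model and the three functions are
hypothetical names, with the listener standing in for udev reloading the
kdump kernel and snapshotting whatever the dt says at that moment.

```c
struct hotplug_model {
	int dt_lmbs;		/* LMB count recorded in the device tree */
	int snapshot_lmbs;	/* what the listener captured on the event */
};

/* Listener: models the kexec reload that copies the current dt. */
static void on_remove_event(struct hotplug_model *m)
{
	m->snapshot_lmbs = m->dt_lmbs;
}

/* Old order: the event fires before the dt is updated. */
void remove_lmb_event_first(struct hotplug_model *m)
{
	on_remove_event(m);
	m->dt_lmbs--;
}

/* Proposed order: the dt is updated before the event fires. */
void remove_lmb_dt_first(struct hotplug_model *m)
{
	m->dt_lmbs--;
	on_remove_event(m);
}
```

With the old order the snapshot still contains the removed lmb; with the
proposed order the snapshot matches the dt.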

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v2 -> v3: rebase onto ppc next-test branch
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1a3ac3b..def8cb3f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -372,6 +372,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   drmem_update_dt();
 
__remove_memory(nid, base_addr, block_sz);
 
@@ -607,6 +608,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
lmb_set_nid(lmb);
lmb->flags |= DRCONF_MEM_ASSIGNED;
+   drmem_update_dt();
 
block_sz = memory_block_size_bytes();
 
@@ -625,6 +627,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   drmem_update_dt();
 
__remove_memory(nid, base_addr, block_sz);
}
@@ -877,9 +880,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
break;
}
 
-   if (!rc)
-   rc = drmem_update_dt();
-
unlock_device_hotplug();
return rc;
 }
-- 
2.7.5



[PATCHv3 1/2] powerpc/pseries: group lmb operation and memblock's

2020-07-21 Thread Pingfan Liu
This patch prepares for the following patch, which swaps the order of the
KOBJ_ uevent and the dt update.

It has no functional effect; it just groups the lmb operations and the
memblock operations so that the dt update can be inserted easily and the
change is easier to review.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Nathan Lynch 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v2 -> v3: rebase onto ppc next-test branch
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b7..1a3ac3b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -355,7 +355,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
unsigned long block_sz;
-   int rc;
+   phys_addr_t base_addr;
+   int rc, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -364,17 +365,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
if (rc)
return rc;
 
+   base_addr = lmb->base_addr;
+   nid = lmb->nid;
block_sz = pseries_memory_block_size();
 
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-   /* Update memory regions for memory remove */
-   memblock_remove(lmb->base_addr, block_sz);
-
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+   __remove_memory(nid, base_addr, block_sz);
+
+   /* Update memory regions for memory remove */
+   memblock_remove(base_addr, block_sz);
+
return 0;
 }
 
@@ -603,6 +606,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
}
 
lmb_set_nid(lmb);
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();
 
/* Add the memory */
@@ -614,11 +619,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}
 
return rc;
-- 
2.7.5



[PATCHv2 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-04-08 Thread Pingfan Liu
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
 Checking for memory holes : [  0.0 %] /
 Checking for memory holes : [100.0 %] |
 Excluding unnecessary pages   : [100.0 %] \
 Copying data  : [  0.3 %] -  eta: 38s
[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
  Analysis shows that in the current implementation, when an lmb is
hot-removed, the KOBJ_REMOVE event is emitted before the dt is updated,
because __remove_memory() is called before drmem_update_dt().
So in the kdump kernel, read_from_oldmem() resorts to
pSeries_lpar_hpte_insert() to install the hpte, which fails with -2 because
the pfn does not exist. Finally, low_hash_fault() raises SIGBUS to the
process, which is observed as the "Bus error" above.

From the viewpoint of listener and publisher, the publisher notifies the
listener before the data is ready.  As a result, udev launches kexec-tools
(due to KOBJ_REMOVE) and loads a stale dt before it is updated. In the
capture kernel, makedumpfile then accesses memory based on the stale dt
info and hits SIGBUS because the lmb no longer exists.

* Fix *
  To fix this issue, update the dt before __remove_memory(), and apply the
same rule in the hot-add path.

This introduces an extra dt update for each lmb involved in a hotplug
operation. That should be fine, since drmem_update_dt() is a memory-based
operation and hotplug is not a hot path.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Hari Bathini 
Cc: Leonardo Bras  
Cc: Libor Pechacek  
Cc: Nathan Fontenot  
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
v1 -> v2: improve commit message, and add more detail about the SIGBUS failure path
 arch/powerpc/platforms/pseries/hotplug-memory.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 4bd9004..72cd4a5 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -394,6 +394,9 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   rtas_hp_event = true;
+   drmem_update_dt();
+   rtas_hp_event = false;
 
__remove_memory(nid, base_addr, block_sz);
 
@@ -667,6 +670,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
lmb_set_nid(lmb);
lmb->flags |= DRCONF_MEM_ASSIGNED;
+   rtas_hp_event = true;
+   drmem_update_dt();
+   rtas_hp_event = false;
 
block_sz = memory_block_size_bytes();
 
@@ -685,6 +691,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   rtas_hp_event = true;
+   drmem_update_dt();
+   rtas_hp_event = false;
 
__remove_memory(nid, base_addr, block_sz);
}
@@ -941,12 +950,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
break;
}
 
-   if (!rc) {
-   rtas_hp_event = true;
-   rc = drmem_update_dt();
-   rtas_hp_event = false;
-   }
-
unlock_device_hotplug();
return rc;
 }
-- 
2.7.5



[PATCHv2 1/2] powerpc/pseries: group lmb operation and memblock's

2020-04-08 Thread Pingfan Liu
This patch prepares for the following patch, which swaps the order of the
KOBJ_ uevent and the dt update.

It has no functional effect; it just groups the lmb operations and the
memblock operations so that the dt update can be inserted easily and the
change is easier to review.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Hari Bathini 
Cc: Leonardo Bras 
Cc: Libor Pechacek 
Cc: Nathan Fontenot 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index b2cde17..4bd9004 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -377,7 +377,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
unsigned long block_sz;
-   int rc;
+   phys_addr_t base_addr;
+   int rc, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -386,17 +387,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
if (rc)
return rc;
 
+   base_addr = lmb->base_addr;
+   nid = lmb->nid;
block_sz = pseries_memory_block_size();
 
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-   /* Update memory regions for memory remove */
-   memblock_remove(lmb->base_addr, block_sz);
-
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+   __remove_memory(nid, base_addr, block_sz);
+
+   /* Update memory regions for memory remove */
+   memblock_remove(base_addr, block_sz);
+
return 0;
 }
 
@@ -663,6 +666,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
}
 
lmb_set_nid(lmb);
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();
 
/* Add the memory */
@@ -674,11 +679,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}
 
return rc;
-- 
2.7.5



[PATCHv4] powerpc/crashkernel: take "mem=" option into account

2020-04-01 Thread Pingfan Liu
The "mem=" option is an easy way to put high pressure on memory during
testing. Hence, after applying the memory limit, the actual usable memory,
rather than the total memory, should be considered when reserving memory for
crashkernel. Otherwise boot may run into an OOM.

E.g. it would reserve 4G prior to the change and 512M afterward, if passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
mem=5G on a 256G machine.

This issue is powerpc specific because powerpc puts a higher priority on
fadump and kdump reservation than on "mem=". See the following code:
if (fadump_reserve_mem() == 0)
reserve_crashkernel();
...
/* Ensure that total memory size is page-aligned. */
limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
memblock_enforce_memory_limit(limit);

On other arches, "mem=" takes a higher priority: its effect is already
reflected in memblock_phys_mem_size() before reserve_crashkernel() is called.
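
To illustrate why the reservation differs, here is a small user-space sketch
of the range-based selection. It is a simplified stand-in, not the kernel's
parse_crashkernel(): parse_size() and pick_crash_size() are hypothetical
helpers that only understand the "start-end:size" list form and the K/M/G
suffixes.

```c
#include <stdlib.h>

/* Parse "<number>[K|M|G]" and advance *s; a simplified memparse() stand-in. */
static unsigned long long parse_size(const char **s)
{
	char *end;
	unsigned long long v = strtoull(*s, &end, 10);

	switch (*end) {
	case 'G': v <<= 10; /* fall through */
	case 'M': v <<= 10; /* fall through */
	case 'K': v <<= 10; end++; break;
	default: break;
	}
	*s = end;
	return v;
}

/* Return the size of the first range [start, end) holding system_ram, or 0. */
unsigned long long pick_crash_size(const char *spec,
				   unsigned long long system_ram)
{
	const char *p = spec;

	while (*p) {
		unsigned long long start, end = ~0ULL, size;

		start = parse_size(&p);
		if (*p != '-')
			return 0;
		p++;
		if (*p != ':')
			end = parse_size(&p);	/* "128G-:" has no upper bound */
		if (*p != ':')
			return 0;
		p++;
		size = parse_size(&p);
		if (system_ram >= start && system_ram < end)
			return size;
		if (*p != ',')
			break;
		p++;
	}
	return 0;
}
```

Feeding it the total memory (256G) selects the 128G- range, while feeding it
the "mem=" limit (5G) selects the 4G-16G range.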

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Hari Bathini 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
---
v3 -> v4: fix total_mem_sz based on adjusted memory_limit

 arch/powerpc/kexec/core.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 078fe3d..56da5eb 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -115,11 +115,12 @@ void machine_kexec(struct kimage *image)

 void __init reserve_crashkernel(void)
 {
-   unsigned long long crash_size, crash_base;
+   unsigned long long crash_size, crash_base, total_mem_sz;
int ret;

+   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
-   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, total_mem_sz,
&crash_size, &crash_base);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
@@ -178,6 +179,7 @@ void __init reserve_crashkernel(void)
/* Crash kernel trumps memory limit */
if (memory_limit && memory_limit <= crashk_res.end) {
memory_limit = crashk_res.end + 1;
+   total_mem_sz = memory_limit;
printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
   memory_limit);
}
@@ -186,7 +188,7 @@ void __init reserve_crashkernel(void)
"for crashkernel (System RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crashk_res.start >> 20),
-   (unsigned long)(memblock_phys_mem_size() >> 20));
+   (unsigned long)(total_mem_sz >> 20));

if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
memblock_reserve(crashk_res.start, crash_size)) {
--
2.7.5



Re: [PATCHv3 2/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel

2020-03-16 Thread Pingfan Liu
On Mon, Mar 16, 2020 at 10:53 AM Aneesh Kumar K.V
 wrote:
>
> On 3/4/20 2:17 PM, Pingfan Liu wrote:
> > At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
> > if dumping to fsdax, it will take a very long time.
> >
>
>
> that should be fixed by
>
> faa6d21153fd11e139dd880044521389b34a24f2
> Author:   Aneesh Kumar K.V 
> AuthorDate:   Tue Sep 3 18:04:52 2019 +0530
> Commit:   Michael Ellerman 
> CommitDate:   Wed Sep 25 08:32:59 2019 +1000
>
> powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error
>
> Right now we force an unbind of SCM memory at drcindex on H_OVERLAP error.
> This really slows down operations like kexec where we get the H_OVERLAP
> error because we don't go through a full hypervisor re init.
>
> H_OVERLAP error for a H_SCM_BIND_MEM hcall indicates that SCM memory at
> drc index is already bound. Since we don't specify a logical memory
> address for bind hcall, we can use the H_SCM_QUERY hcall to query
> the already bound logical address.
Good to know it.

Thanks,
Pingfan
>
>
>
>
> > Take a closer look, during the papr_scm initialization, the only
> > configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM,
> > ...), which helps to set up the bound address.
> >
> > On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this
> > step can be stepped around to save times.  So the pmem bound address can be
> > passed to the 2nd kernel through a dynamic added property "bound-addr" in
> > dt node 'ibm,pmemory'.
> >
>
> -aneesh
>


Re: [PATCHv3 2/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel

2020-03-15 Thread Pingfan Liu
I appreciate your kind review. I have some comments below.

On Fri, Mar 13, 2020 at 11:18 AM Oliver O'Halloran  wrote:
>
> On Wed, Mar 4, 2020 at 7:50 PM Pingfan Liu  wrote:
> >
> > At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
> > if dumping to fsdax, it will take a very long time.
> >
> > Take a closer look, during the papr_scm initialization, the only
> > configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM,
> > ...), which helps to set up the bound address.
> >
> > On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this
> > step can be stepped around to save times.  So the pmem bound address can be
> > passed to the 2nd kernel through a dynamic added property "bound-addr" in
> > dt node 'ibm,pmemory'.
> >
> > Signed-off-by: Pingfan Liu 
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: Michael Ellerman 
> > Cc: Hari Bathini 
> > Cc: Aneesh Kumar K.V 
> > Cc: Oliver O'Halloran 
> > Cc: Dan Williams 
> > Cc: Andrew Donnellan 
> > Cc: Christophe Leroy 
> > Cc: Rob Herring 
> > Cc: Frank Rowand 
> > Cc: ke...@lists.infradead.org
> > ---
> > note: This patch has not been tested since I can not get such a pseries 
> > with pmem.
> >   Please kindly to give some suggestion, thanks.
>
> There was some qemu patches to implement the Hcall interface floating
> around a while ago. I'm not sure they ever made it into upstream qemu
> though.
Unfortunately, they do not appear in the latest qemu code. I think
virt-pmem probably provides the same feature.
>
> > ---
> >  arch/powerpc/platforms/pseries/of_helpers.c |  1 +
> >  arch/powerpc/platforms/pseries/papr_scm.c   | 33 
> > -
> >  drivers/of/base.c   |  1 +
> >  3 files changed, 25 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
> > b/arch/powerpc/platforms/pseries/of_helpers.c
> > index 1022e0f..2c7bab4 100644
> > --- a/arch/powerpc/platforms/pseries/of_helpers.c
> > +++ b/arch/powerpc/platforms/pseries/of_helpers.c
> > @@ -34,6 +34,7 @@ struct property *new_property(const char *name, const int 
> > length,
> > kfree(new);
> > return NULL;
> >  }
> > +EXPORT_SYMBOL(new_property);
> >
> >  /**
> >   * pseries_of_derive_parent - basically like dirname(1)
> > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> > b/arch/powerpc/platforms/pseries/papr_scm.c
> > index 0b4467e..54ae903 100644
> > --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > @@ -14,6 +14,7 @@
> >  #include 
> >
> >  #include 
> > +#include "of_helpers.h"
> >
> >  #define BIND_ANY_ADDR (~0ul)
> >
> > @@ -383,7 +384,7 @@ static int papr_scm_probe(struct platform_device *pdev)
> >  {
> > struct device_node *dn = pdev->dev.of_node;
> > u32 drc_index, metadata_size;
> > -   u64 blocks, block_size;
> > +   u64 blocks, block_size, bound_addr = 0;
> > struct papr_scm_priv *p;
> > const char *uuid_str;
> > u64 uuid[2];
> > @@ -440,17 +441,29 @@ static int papr_scm_probe(struct platform_device 
> > *pdev)
> > p->metadata_size = metadata_size;
> > p->pdev = pdev;
> >
> > -   /* request the hypervisor to bind this region to somewhere in 
> > memory */
> > -   rc = drc_pmem_bind(p);
> > +   of_property_read_u64(dn, "bound-addr", &bound_addr);
> > +   if (bound_addr) {
> > +   p->bound_addr = bound_addr;
> > +   } else {
> > +   struct property *property;
> > +   u64 big;
> >
> > -   /* If phyp says drc memory still bound then force unbound and retry 
> > */
> > -   if (rc == H_OVERLAP)
> > -   rc = drc_pmem_query_n_bind(p);
> > +   /* request the hypervisor to bind this region to somewhere 
> > in memory */
> > +   rc = drc_pmem_bind(p);
> >
> > -   if (rc != H_SUCCESS) {
> > -   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
> > -   rc = -ENXIO;
> > -   goto err;
> > +   /* If phyp says drc memory still bound then force unbound 
> > and retry */
> > +   if (rc == H_OVERLAP)
> > + 

Re: [PATCHv3 1/2] powerpc/of: split out new_property() for reusing

2020-03-08 Thread Pingfan Liu
On Sat, Mar 7, 2020 at 3:59 AM Nathan Lynch  wrote:
>
> Hi,
>
> Pingfan Liu  writes:
> > Splitting out new_property() for coming reusing and moving it to
> > of_helpers.c.
>
> [...]
>
> > +struct property *new_property(const char *name, const int length,
> > + const unsigned char *value, struct property *last)
> > +{
> > + struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
> > +
> > + if (!new)
> > + return NULL;
> > +
> > + new->name = kstrdup(name, GFP_KERNEL);
> > + if (!new->name)
> > + goto cleanup;
> > + new->value = kmalloc(length + 1, GFP_KERNEL);
> > + if (!new->value)
> > + goto cleanup;
> > +
> > + memcpy(new->value, value, length);
> > + *(((char *)new->value) + length) = 0;
> > + new->length = length;
> > + new->next = last;
> > + return new;
> > +
> > +cleanup:
> > + kfree(new->name);
> > + kfree(new->value);
> > + kfree(new);
> > + return NULL;
> > +}
>
> This function in its current form isn't suitable for more general use:
>
> * It appears to be tailored to string properties - note the char * value
>   parameter, the length + 1 allocation and nul termination.
>
> * Most code shouldn't need the 'last' argument. The code where this
>   currently resides builds a list of properties and attaches it to a new
>   node, bypassing of_add_property().
>
> Let's look at the call site you add in your next patch:
>
> +   big = cpu_to_be64(p->bound_addr);
> +   property = new_property("bound-addr", sizeof(u64), (const unsigned char *)&big,
> +   NULL);
> +   of_add_property(dn, property);
>
> So you have to use a cast, and this is going to allocate (sizeof(u64) + 1)
> for the value, is that what you want?
>
> I think you should leave that legacy pseries reconfig code undisturbed
> (frankly that stuff should get deprecated and removed) and if you want a
> generic helper it should look more like:
>
> struct property *of_property_new(const char *name, size_t length,
>  const void *value, gfp_t allocflags)
>
> __of_prop_dup() looks like a good model/guide here.

Thanks for the good suggestion.
I will rework it along those lines if [2/2] turns out to be acceptable.

Regards,
Pingfan
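
For reference, the shape of the suggested helper can be sketched in user
space as below. This is only a model: struct prop_model, prop_model_new() and
prop_model_free() are hypothetical names, malloc()/calloc() stand in for the
kernel allocators, and the gfp_t argument from the suggestion is dropped
because it has no user-space equivalent.

```c
#include <stdlib.h>
#include <string.h>

struct prop_model {
	char *name;
	size_t length;
	void *value;
};

/* Copy name and value verbatim: no NUL padding, no list chaining. */
struct prop_model *prop_model_new(const char *name, size_t length,
				  const void *value)
{
	struct prop_model *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;

	p->name = malloc(strlen(name) + 1);
	p->value = malloc(length ? length : 1);
	if (!p->name || !p->value) {
		free(p->name);
		free(p->value);
		free(p);
		return NULL;
	}

	strcpy(p->name, name);
	if (length)
		memcpy(p->value, value, length);
	p->length = length;
	return p;
}

void prop_model_free(struct prop_model *p)
{
	if (!p)
		return;
	free(p->name);
	free(p->value);
	free(p);
}
```

Taking a const void * value means a caller storing a u64 needs no cast, and
the value buffer is exactly length bytes.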


[PATCHv3 2/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel

2020-03-04 Thread Pingfan Liu
At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
dumping to fsdax also takes a very long time.

Looking closer, during papr_scm initialization the only configuration
step is drc_pmem_bind() -> plpar_hcall(H_SCM_BIND_MEM, ...), which sets up
the bound address.

On pseries, a kexec -l/-p kernel boots without a hardware reset, so this
step can be skipped to save time.  The pmem bound address can be passed to
the 2nd kernel through a dynamically added property "bound-addr" in the dt
node 'ibm,pmemory'.
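
The intended flow can be sketched in user space as follows. It is a model
under stated assumptions, not the driver code: struct pmem_model,
put_be64(), get_be64() and probe_model() are hypothetical, and an explicit
"property present" flag is used here instead of testing the address against
zero.

```c
#include <stdbool.h>
#include <stdint.h>

struct pmem_model {
	bool has_bound_prop;	/* "bound-addr" present in the dt node */
	uint8_t bound_prop[8];	/* property payload, big-endian like dt cells */
	unsigned int bind_calls;	/* how many slow bind hcalls were issued */
};

static void put_be64(uint8_t out[8], uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		out[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

static uint64_t get_be64(const uint8_t in[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/* Reuse the recorded address if present; otherwise bind and record it. */
uint64_t probe_model(struct pmem_model *m, uint64_t addr_from_bind)
{
	if (m->has_bound_prop)
		return get_be64(m->bound_prop);

	m->bind_calls++;	/* models the slow H_SCM_BIND_MEM path */
	put_be64(m->bound_prop, addr_from_bind);
	m->has_bound_prop = true;
	return addr_from_bind;
}
```

The first probe pays for the bind and records the address; a second probe,
standing in for the kexec'ed kernel, reads it back without binding again.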

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: Andrew Donnellan 
Cc: Christophe Leroy 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: ke...@lists.infradead.org
---
note: This patch has not been tested, since I cannot get access to a pseries
machine with pmem.
  Any suggestions are welcome, thanks.
---
 arch/powerpc/platforms/pseries/of_helpers.c |  1 +
 arch/powerpc/platforms/pseries/papr_scm.c   | 33 -
 drivers/of/base.c   |  1 +
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
b/arch/powerpc/platforms/pseries/of_helpers.c
index 1022e0f..2c7bab4 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -34,6 +34,7 @@ struct property *new_property(const char *name, const int 
length,
kfree(new);
return NULL;
 }
+EXPORT_SYMBOL(new_property);
 
 /**
  * pseries_of_derive_parent - basically like dirname(1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e..54ae903 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -14,6 +14,7 @@
 #include 
 
 #include 
+#include "of_helpers.h"
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -383,7 +384,7 @@ static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
u32 drc_index, metadata_size;
-   u64 blocks, block_size;
+   u64 blocks, block_size, bound_addr = 0;
struct papr_scm_priv *p;
const char *uuid_str;
u64 uuid[2];
@@ -440,17 +441,29 @@ static int papr_scm_probe(struct platform_device *pdev)
p->metadata_size = metadata_size;
p->pdev = pdev;
 
-   /* request the hypervisor to bind this region to somewhere in memory */
-   rc = drc_pmem_bind(p);
+   of_property_read_u64(dn, "bound-addr", &bound_addr);
+   if (bound_addr) {
+   p->bound_addr = bound_addr;
+   } else {
+   struct property *property;
+   u64 big;
 
-   /* If phyp says drc memory still bound then force unbound and retry */
-   if (rc == H_OVERLAP)
-   rc = drc_pmem_query_n_bind(p);
+   /* request the hypervisor to bind this region to somewhere in 
memory */
+   rc = drc_pmem_bind(p);
 
-   if (rc != H_SUCCESS) {
-   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
-   rc = -ENXIO;
-   goto err;
+   /* If phyp says drc memory still bound then force unbound and 
retry */
+   if (rc == H_OVERLAP)
+   rc = drc_pmem_query_n_bind(p);
+
+   if (rc != H_SUCCESS) {
+   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
+   rc = -ENXIO;
+   goto err;
+   }
+   big = cpu_to_be64(p->bound_addr);
+   property = new_property("bound-addr", sizeof(u64), (const unsigned char *)&big,
+   NULL);
+   of_add_property(dn, property);
}
 
/* setup the resource for the newly bound range */
diff --git a/drivers/of/base.c b/drivers/of/base.c
index ae03b12..602d2a9 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1817,6 +1817,7 @@ int of_add_property(struct device_node *np, struct 
property *prop)
 
return rc;
 }
+EXPORT_SYMBOL_GPL(of_add_property);
 
 int __of_remove_property(struct device_node *np, struct property *prop)
 {
-- 
2.7.5



[PATCHv3 1/2] powerpc/of: split out new_property() for reusing

2020-03-04 Thread Pingfan Liu
Split out new_property() for upcoming reuse and move it to
of_helpers.c.

Also do some coding style cleanup.

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: Andrew Donnellan 
Cc: Christophe Leroy 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/of_helpers.c | 28 
 arch/powerpc/platforms/pseries/of_helpers.h |  3 +++
 arch/powerpc/platforms/pseries/reconfig.c   | 26 --
 3 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
b/arch/powerpc/platforms/pseries/of_helpers.c
index 66dfd82..1022e0f 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -7,6 +7,34 @@
 
 #include "of_helpers.h"
 
+struct property *new_property(const char *name, const int length,
+   const unsigned char *value, struct property *last)
+{
+   struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
+
+   if (!new)
+   return NULL;
+
+   new->name = kstrdup(name, GFP_KERNEL);
+   if (!new->name)
+   goto cleanup;
+   new->value = kmalloc(length + 1, GFP_KERNEL);
+   if (!new->value)
+   goto cleanup;
+
+   memcpy(new->value, value, length);
+   *(((char *)new->value) + length) = 0;
+   new->length = length;
+   new->next = last;
+   return new;
+
+cleanup:
+   kfree(new->name);
+   kfree(new->value);
+   kfree(new);
+   return NULL;
+}
+
 /**
  * pseries_of_derive_parent - basically like dirname(1)
  * @path:  the full_name of a node to be added to the tree
diff --git a/arch/powerpc/platforms/pseries/of_helpers.h 
b/arch/powerpc/platforms/pseries/of_helpers.h
index decad65..34add82 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.h
+++ b/arch/powerpc/platforms/pseries/of_helpers.h
@@ -4,6 +4,9 @@
 
 #include 
 
+struct property *new_property(const char *name, const int length,
+   const unsigned char *value, struct property *last);
+
 struct device_node *pseries_of_derive_parent(const char *path);
 
 #endif /* _PSERIES_OF_HELPERS_H */
diff --git a/arch/powerpc/platforms/pseries/reconfig.c 
b/arch/powerpc/platforms/pseries/reconfig.c
index 7f7369f..8e5a2ba 100644
--- a/arch/powerpc/platforms/pseries/reconfig.c
+++ b/arch/powerpc/platforms/pseries/reconfig.c
@@ -165,32 +165,6 @@ static char * parse_next_property(char *buf, char *end, 
char **name, int *length
return tmp;
 }
 
-static struct property *new_property(const char *name, const int length,
-const unsigned char *value, struct 
property *last)
-{
-   struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
-
-   if (!new)
-   return NULL;
-
-   if (!(new->name = kstrdup(name, GFP_KERNEL)))
-   goto cleanup;
-   if (!(new->value = kmalloc(length + 1, GFP_KERNEL)))
-   goto cleanup;
-
-   memcpy(new->value, value, length);
-   *(((char *)new->value) + length) = 0;
-   new->length = length;
-   new->next = last;
-   return new;
-
-cleanup:
-   kfree(new->name);
-   kfree(new->value);
-   kfree(new);
-   return NULL;
-}
-
 static int do_add_node(char *buf, size_t bufsize)
 {
char *path, *end, *name;
-- 
2.7.5



[PATCHv3 0/2] pseries/scm: buffer pmem's bound addr in dt for kexec kernel

2020-03-04 Thread Pingfan Liu
V2 -> V3:
   in [2/2], EXPORT_SYMBOL(new_property) and EXPORT_SYMBOL_GPL(of_add_property)

To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: Andrew Donnellan 
Cc: Christophe Leroy 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: ke...@lists.infradead.org

Pingfan Liu (2):
  powerpc/of: split out new_property() for reusing
  pseries/scm: buffer pmem's bound addr in dt for kexec kernel

 arch/powerpc/platforms/pseries/of_helpers.c | 29 +
 arch/powerpc/platforms/pseries/of_helpers.h |  3 +++
 arch/powerpc/platforms/pseries/papr_scm.c   | 33 -
 arch/powerpc/platforms/pseries/reconfig.c   | 26 ---
 drivers/of/base.c   |  1 +
 5 files changed, 56 insertions(+), 36 deletions(-)

-- 
2.7.5



[PATCHv2 2/2] pSeries/papr_scm: buffer pmem's bound addr in dt for kexec kernel

2020-02-28 Thread Pingfan Liu
At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
dumping a vmcore to fsdax also takes a very long time.

Looking closer, the only configuration done during papr_scm
initialization is drc_pmem_bind() -> plpar_hcall(H_SCM_BIND_MEM, ...),
which sets up the bound address.

On pseries, the hardware is not reset for a kexec -l/-p kernel, so this
step can be skipped to save time.  The pmem bound address can instead be
passed to the 2nd kernel through a dynamically added property "bound-addr"
in the dt node 'ibm,pmemory'.

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: Andrew Donnellan 
Cc: Christophe Leroy 
Cc: ke...@lists.infradead.org
---
note: This patch has not been tested, since I cannot get a pseries machine with pmem.
  Any suggestions are welcome, thanks.

 arch/powerpc/platforms/pseries/papr_scm.c | 32 +--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e..40cd214 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -14,6 +14,7 @@
 #include 
 
 #include 
+#include "of_helpers.h"
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -383,7 +384,7 @@ static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
u32 drc_index, metadata_size;
-   u64 blocks, block_size;
+   u64 blocks, block_size, bound_addr = 0;
struct papr_scm_priv *p;
const char *uuid_str;
u64 uuid[2];
@@ -440,17 +441,28 @@ static int papr_scm_probe(struct platform_device *pdev)
p->metadata_size = metadata_size;
p->pdev = pdev;
 
-   /* request the hypervisor to bind this region to somewhere in memory */
-   rc = drc_pmem_bind(p);
+   of_property_read_u64(dn, "bound-addr", &bound_addr);
+   if (bound_addr) {
+   p->bound_addr = bound_addr;
+   } else {
+   struct property *property;
+   u64 big;
 
-   /* If phyp says drc memory still bound then force unbound and retry */
-   if (rc == H_OVERLAP)
-   rc = drc_pmem_query_n_bind(p);
+   /* request the hypervisor to bind this region to somewhere in memory */
+   rc = drc_pmem_bind(p);
 
-   if (rc != H_SUCCESS) {
-   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
-   rc = -ENXIO;
-   goto err;
+   /* If phyp says drc memory still bound then force unbound and retry */
+   if (rc == H_OVERLAP)
+   rc = drc_pmem_query_n_bind(p);
+
+   if (rc != H_SUCCESS) {
+   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
+   rc = -ENXIO;
+   goto err;
+   }
+   big = cpu_to_be64(p->bound_addr);
+   property = new_property("bound-addr", sizeof(u64), &big, NULL);
+   of_add_property(dn, property);
}
 
/* setup the resource for the newly bound range */
-- 
2.7.5
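
For illustration only, the idea of the patch above can be modelled in userspace C: reuse a cached bound address if the node already carries one, otherwise take the slow path and cache the result as a big-endian value (the patch itself uses cpu_to_be64() for this). This is a sketch, not the kernel code. `struct fake_node`, `slow_bind()`, `probe()`, the byte helpers and the address constant are all made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a device tree node with one optional property. */
struct fake_node {
	int has_bound_addr;
	unsigned char bound_addr_be[8];	/* stored big-endian, like DT cells */
};

static int bind_calls;	/* counts how often the slow path runs */

/* Encode a u64 as big-endian bytes, independent of host endianness. */
static void put_be64(unsigned char *dst, uint64_t v)
{
	assert(dst);
	for (int i = 7; i >= 0; i--) {
		dst[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/* Decode eight big-endian bytes back into a host u64. */
static uint64_t get_be64(const unsigned char *src)
{
	uint64_t v = 0;

	assert(src);
	for (int i = 0; i < 8; i++)
		v = (v << 8) | src[i];
	return v;
}

/* Hypothetical stand-in for the slow hypervisor bind call. */
static uint64_t slow_bind(void)
{
	bind_calls++;
	return 0x0000200000000000ULL;	/* made-up address */
}

/* Return the bound address, binding only when no cached value exists. */
static uint64_t probe(struct fake_node *dn)
{
	uint64_t addr;

	assert(dn);
	if (dn->has_bound_addr)
		return get_be64(dn->bound_addr_be);

	addr = slow_bind();
	put_be64(dn->bound_addr_be, addr);
	dn->has_bound_addr = 1;
	return addr;
}
```

Calling probe() twice on the same node models the first kernel followed by the kexec'ed kernel: only the first call reaches slow_bind(), and both return the same address.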



[PATCHv2 1/2] powerpc/of: split out new_property() for reusing

2020-02-28 Thread Pingfan Liu
Split out new_property() and move it to of_helpers.c so that it can be
reused.

Also do some coding style cleanup.

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: Andrew Donnellan 
Cc: Christophe Leroy 
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/of_helpers.c | 28 
 arch/powerpc/platforms/pseries/of_helpers.h |  3 +++
 arch/powerpc/platforms/pseries/reconfig.c   | 26 --
 3 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
b/arch/powerpc/platforms/pseries/of_helpers.c
index 66dfd82..1022e0f 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -7,6 +7,34 @@
 
 #include "of_helpers.h"
 
+struct property *new_property(const char *name, const int length,
+   const unsigned char *value, struct property *last)
+{
+   struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
+
+   if (!new)
+   return NULL;
+
+   new->name = kstrdup(name, GFP_KERNEL);
+   if (!new->name)
+   goto cleanup;
+   new->value = kmalloc(length + 1, GFP_KERNEL);
+   if (!new->value)
+   goto cleanup;
+
+   memcpy(new->value, value, length);
+   *(((char *)new->value) + length) = 0;
+   new->length = length;
+   new->next = last;
+   return new;
+
+cleanup:
+   kfree(new->name);
+   kfree(new->value);
+   kfree(new);
+   return NULL;
+}
+
 /**
  * pseries_of_derive_parent - basically like dirname(1)
  * @path:  the full_name of a node to be added to the tree
diff --git a/arch/powerpc/platforms/pseries/of_helpers.h 
b/arch/powerpc/platforms/pseries/of_helpers.h
index decad65..34add82 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.h
+++ b/arch/powerpc/platforms/pseries/of_helpers.h
@@ -4,6 +4,9 @@
 
 #include 
 
+struct property *new_property(const char *name, const int length,
+   const unsigned char *value, struct property *last);
+
 struct device_node *pseries_of_derive_parent(const char *path);
 
 #endif /* _PSERIES_OF_HELPERS_H */
diff --git a/arch/powerpc/platforms/pseries/reconfig.c 
b/arch/powerpc/platforms/pseries/reconfig.c
index 7f7369f..8e5a2ba 100644
--- a/arch/powerpc/platforms/pseries/reconfig.c
+++ b/arch/powerpc/platforms/pseries/reconfig.c
@@ -165,32 +165,6 @@ static char * parse_next_property(char *buf, char *end, 
char **name, int *length
return tmp;
 }
 
-static struct property *new_property(const char *name, const int length,
-const unsigned char *value, struct 
property *last)
-{
-   struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
-
-   if (!new)
-   return NULL;
-
-   if (!(new->name = kstrdup(name, GFP_KERNEL)))
-   goto cleanup;
-   if (!(new->value = kmalloc(length + 1, GFP_KERNEL)))
-   goto cleanup;
-
-   memcpy(new->value, value, length);
-   *(((char *)new->value) + length) = 0;
-   new->length = length;
-   new->next = last;
-   return new;
-
-cleanup:
-   kfree(new->name);
-   kfree(new->value);
-   kfree(new);
-   return NULL;
-}
-
 static int do_add_node(char *buf, size_t bufsize)
 {
char *path, *end, *name;
-- 
2.7.5
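
As a reading aid, here is a userspace analogue of the helper moved by this patch. It is a sketch under assumptions: `struct prop`, `prop_new()` and `prop_free()` are hypothetical names, and calloc()/malloc()/free() stand in for kzalloc()/kmalloc()/kfree(). It shows the two points worth noticing in the original: the value buffer gets one extra NUL byte that is not counted in length, and the single cleanup label works because the struct is zeroed and freeing NULL is a no-op.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct property. */
struct prop {
	char *name;
	int length;
	void *value;
	struct prop *next;
};

/* Userspace analogue of new_property(): deep-copies name and value. */
static struct prop *prop_new(const char *name, int length,
			     const unsigned char *value, struct prop *last)
{
	struct prop *new;

	assert(name && value && length >= 0);

	new = calloc(1, sizeof(*new));	/* zeroed, like kzalloc() */
	if (!new)
		return NULL;

	new->name = malloc(strlen(name) + 1);
	if (!new->name)
		goto cleanup;
	strcpy(new->name, name);

	new->value = malloc((size_t)length + 1);
	if (!new->value)
		goto cleanup;

	memcpy(new->value, value, (size_t)length);
	((char *)new->value)[length] = 0;	/* extra NUL, not counted in length */
	new->length = length;
	new->next = last;
	return new;

cleanup:
	/* Fields never allocated are still NULL; free(NULL) is a no-op. */
	free(new->name);
	free(new->value);
	free(new);
	return NULL;
}

/* Release a property created by prop_new(). */
static void prop_free(struct prop *p)
{
	if (!p)
		return;
	free(p->name);
	free(p->value);
	free(p);
}
```

Because the caller's buffers are copied, they may live on the stack and go out of scope after the call, which is how the papr_scm patch uses its local `big` variable.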



Re: [PATCH 3/3] pseries/scm: buffer pmem's bound addr in dt for kexec kernel

2020-02-28 Thread Pingfan Liu
On Fri, Feb 28, 2020 at 2:52 PM Christophe Leroy
 wrote:
>
>
>
> Le 28/02/2020 à 06:53, Pingfan Liu a écrit :
> > At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
> > if dumping to fsdax, it will take a very long time.
> >
> > Take a closer look, during the papr_scm initialization, the only
> > configuration is through drc_pmem_bind()-> plpar_hcall(H_SCM_BIND_MEM,
> > ...), which helps to set up the bound address.
> >
> > On pseries, for kexec -l/-p kernel, there is no reset of hardware, and this
> > step can be stepped around to save times.  So the pmem bound address can be
> > passed to the 2nd kernel through a dynamic added property "bound-addr" in
> > dt node 'ibm,pmemory'.
> >
> > Signed-off-by: Pingfan Liu 
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: Michael Ellerman 
> > Cc: Hari Bathini 
> > Cc: Aneesh Kumar K.V 
> > Cc: Oliver O'Halloran 
> > Cc: Dan Williams 
> > Cc: ke...@lists.infradead.org
> > ---
> > note: I can not find such a pseries machine, and not finish it yet.
> > ---
> >   arch/powerpc/platforms/pseries/papr_scm.c | 32 
> > +--
> >   1 file changed, 22 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> > b/arch/powerpc/platforms/pseries/papr_scm.c
> > index c2ef320..555e746 100644
> > --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > @@ -382,7 +382,7 @@ static int papr_scm_probe(struct platform_device *pdev)
> >   {
> >   struct device_node *dn = pdev->dev.of_node;
> >   u32 drc_index, metadata_size;
> > - u64 blocks, block_size;
> > + u64 blocks, block_size, bound_addr = 0;
> >   struct papr_scm_priv *p;
> >   const char *uuid_str;
> >   u64 uuid[2];
> > @@ -439,17 +439,29 @@ static int papr_scm_probe(struct platform_device 
> > *pdev)
> >   p->metadata_size = metadata_size;
> >   p->pdev = pdev;
> >
> > - /* request the hypervisor to bind this region to somewhere in memory */
> > - rc = drc_pmem_bind(p);
> > + of_property_read_u64(dn, "bound-addr", &bound_addr);
> > + if (bound_addr)
> > + p->bound_addr = bound_addr;
> > + else {
>
> All legs of an if/else must have { } when one leg needs them, see coding
> style.
OK,
>
> > + struct property *property;
> > + u64 big;
> >
> > - /* If phyp says drc memory still bound then force unbound and retry */
> > - if (rc == H_OVERLAP)
> > - rc = drc_pmem_query_n_bind(p);
> > + /* request the hypervisor to bind this region to somewhere in memory */
> > + rc = drc_pmem_bind(p);
> >
> > - if (rc != H_SUCCESS) {
> > - dev_err(&p->pdev->dev, "bind err: %d\n", rc);
> > - rc = -ENXIO;
> > - goto err;
> > + /* If phyp says drc memory still bound then force unbound and retry */
> > + if (rc == H_OVERLAP)
> > + rc = drc_pmem_query_n_bind(p);
> > +
> > + if (rc != H_SUCCESS) {
> > + dev_err(&p->pdev->dev, "bind err: %d\n", rc);
> > + rc = -ENXIO;
> > + goto err;
> > + }
> > + big = cpu_to_be64(p->bound_addr);
> > + property = new_property("bound-addr", sizeof(u64), &big,
> > + NULL);
>
> Why split this line into two parts? You have lines far longer above.
> In powerpc we allow lines 90 chars long.
OK, good to know it.

Thanks,
Pingfan


Re: [PATCH 1/3] powerpc/of: split out new_property() for reusing

2020-02-27 Thread Pingfan Liu
On Fri, Feb 28, 2020 at 2:03 PM Andrew Donnellan  wrote:
>
> On 28/2/20 4:53 pm, Pingfan Liu wrote:
> > Since new_property() is used in several calling sites, splitting it out for
> > reusing.
> >
> > To ease the review, although the split out part has coding style issue,
> > keeping it untouched and fixed in next patch.
> >
> > Signed-off-by: Pingfan Liu 
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: Michael Ellerman 
> > Cc: Hari Bathini 
> > Cc: Aneesh Kumar K.V 
> > Cc: Oliver O'Halloran 
> > Cc: Dan Williams 
> > Cc: ke...@lists.infradead.org
>
> Which tree does this apply to? I don't see a new_property() in mm/drmem.c...
Sorry, my git tree was dirty. I checked, and neither the linux git tree nor
the powerpc git tree has this function.

Self-NACK for this series; I will send out V2 for patch 3/3.

Thanks,
Pingfan
>
> --
> Andrew Donnellan  OzLabs, ADL Canberra
> a...@linux.ibm.com IBM Australia Limited
>


[PATCH 3/3] pseries/scm: buffer pmem's bound addr in dt for kexec kernel

2020-02-27 Thread Pingfan Liu
At present, plpar_hcall(H_SCM_BIND_MEM, ...) takes a very long time, so
dumping a vmcore to fsdax also takes a very long time.

Looking closer, the only configuration done during papr_scm
initialization is drc_pmem_bind() -> plpar_hcall(H_SCM_BIND_MEM, ...),
which sets up the bound address.

On pseries, the hardware is not reset for a kexec -l/-p kernel, so this
step can be skipped to save time.  The pmem bound address can instead be
passed to the 2nd kernel through a dynamically added property "bound-addr"
in the dt node 'ibm,pmemory'.

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: ke...@lists.infradead.org
---
note: I cannot find such a pseries machine, so this has not been finished yet.
---
 arch/powerpc/platforms/pseries/papr_scm.c | 32 +--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index c2ef320..555e746 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -382,7 +382,7 @@ static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
u32 drc_index, metadata_size;
-   u64 blocks, block_size;
+   u64 blocks, block_size, bound_addr = 0;
struct papr_scm_priv *p;
const char *uuid_str;
u64 uuid[2];
@@ -439,17 +439,29 @@ static int papr_scm_probe(struct platform_device *pdev)
p->metadata_size = metadata_size;
p->pdev = pdev;
 
-   /* request the hypervisor to bind this region to somewhere in memory */
-   rc = drc_pmem_bind(p);
+   of_property_read_u64(dn, "bound-addr", &bound_addr);
+   if (bound_addr)
+   p->bound_addr = bound_addr;
+   else {
+   struct property *property;
+   u64 big;
 
-   /* If phyp says drc memory still bound then force unbound and retry */
-   if (rc == H_OVERLAP)
-   rc = drc_pmem_query_n_bind(p);
+   /* request the hypervisor to bind this region to somewhere in memory */
+   rc = drc_pmem_bind(p);
 
-   if (rc != H_SUCCESS) {
-   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
-   rc = -ENXIO;
-   goto err;
+   /* If phyp says drc memory still bound then force unbound and retry */
+   if (rc == H_OVERLAP)
+   rc = drc_pmem_query_n_bind(p);
+
+   if (rc != H_SUCCESS) {
+   dev_err(&p->pdev->dev, "bind err: %d\n", rc);
+   rc = -ENXIO;
+   goto err;
+   }
+   big = cpu_to_be64(p->bound_addr);
+   property = new_property("bound-addr", sizeof(u64), &big,
+   NULL);
+   of_add_property(dn, property);
}
 
/* setup the resource for the newly bound range */
-- 
2.7.5



[PATCH 2/3] powerpc/of: coding style cleanup

2020-02-27 Thread Pingfan Liu
Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: ke...@lists.infradead.org
---
 arch/powerpc/kernel/of_property.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/of_property.c 
b/arch/powerpc/kernel/of_property.c
index e56c832..c6abf7e 100644
--- a/arch/powerpc/kernel/of_property.c
+++ b/arch/powerpc/kernel/of_property.c
@@ -5,16 +5,18 @@
 #include 
 
 struct property *new_property(const char *name, const int length,
-const unsigned char *value, struct 
property *last)
+   const unsigned char *value, struct property *last)
 {
struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
 
if (!new)
return NULL;
 
-   if (!(new->name = kstrdup(name, GFP_KERNEL)))
+   new->name = kstrdup(name, GFP_KERNEL);
+   if (!new->name)
goto cleanup;
-   if (!(new->value = kmalloc(length + 1, GFP_KERNEL)))
+   new->value = kmalloc(length + 1, GFP_KERNEL);
+   if (!new->value)
goto cleanup;
 
memcpy(new->value, value, length);
-- 
2.7.5



[PATCH 1/3] powerpc/of: split out new_property() for reusing

2020-02-27 Thread Pingfan Liu
Since new_property() is used at several call sites, split it out for
reuse.

To ease review, the moved code is left untouched although it has coding
style issues; those are fixed in the next patch.

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Hari Bathini 
Cc: Aneesh Kumar K.V 
Cc: Oliver O'Halloran 
Cc: Dan Williams 
Cc: ke...@lists.infradead.org
---
 arch/powerpc/include/asm/prom.h   |  2 ++
 arch/powerpc/kernel/Makefile  |  2 +-
 arch/powerpc/kernel/of_property.c | 32 +++
 arch/powerpc/mm/drmem.c   | 26 -
 arch/powerpc/platforms/pseries/reconfig.c | 26 -
 5 files changed, 35 insertions(+), 53 deletions(-)
 create mode 100644 arch/powerpc/kernel/of_property.c

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 94e3fd5..02f7b1b 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -90,6 +90,8 @@ struct of_drc_info {
 extern int of_read_drc_info_cell(struct property **prop,
const __be32 **curval, struct of_drc_info *data);
 
+extern struct property *new_property(const char *name, const int length,
+   const unsigned char *value, struct property *last);
 
 /*
  * There are two methods for telling firmware what our capabilities are.
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 157b014..23375fd 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -47,7 +47,7 @@ obj-y := cputable.o ptrace.o 
syscalls.o \
   signal.o sysfs.o cacheinfo.o time.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
-  of_platform.o prom_parse.o
+  of_platform.o prom_parse.o of_property.o
 obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \
   signal_64.o ptrace32.o \
   paca.o nvram_64.o firmware.o note.o
diff --git a/arch/powerpc/kernel/of_property.c 
b/arch/powerpc/kernel/of_property.c
new file mode 100644
index 000..e56c832
--- /dev/null
+++ b/arch/powerpc/kernel/of_property.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+
+struct property *new_property(const char *name, const int length,
+const unsigned char *value, struct 
property *last)
+{
+   struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
+
+   if (!new)
+   return NULL;
+
+   if (!(new->name = kstrdup(name, GFP_KERNEL)))
+   goto cleanup;
+   if (!(new->value = kmalloc(length + 1, GFP_KERNEL)))
+   goto cleanup;
+
+   memcpy(new->value, value, length);
+   *(((char *)new->value) + length) = 0;
+   new->length = length;
+   new->next = last;
+   return new;
+
+cleanup:
+   kfree(new->name);
+   kfree(new->value);
+   kfree(new);
+   return NULL;
+}
+
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 85b088a..888227e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -99,32 +99,6 @@ static void init_drconf_v2_cell(struct of_drconf_cell_v2 
*dr_cell,
 
 extern int test_hotplug;
 
-static struct property *new_property(const char *name, const int length,
-const unsigned char *value, struct 
property *last)
-{
-   struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
-
-   if (!new)
-   return NULL;
-
-   if (!(new->name = kstrdup(name, GFP_KERNEL)))
-   goto cleanup;
-   if (!(new->value = kmalloc(length + 1, GFP_KERNEL)))
-   goto cleanup;
-
-   memcpy(new->value, value, length);
-   *(((char *)new->value) + length) = 0;
-   new->length = length;
-   new->next = last;
-   return new;
-
-cleanup:
-   kfree(new->name);
-   kfree(new->value);
-   kfree(new);
-   return NULL;
-}
-
 static int drmem_update_dt_v2(struct device_node *memory,
  struct property *prop)
 {
diff --git a/arch/powerpc/platforms/pseries/reconfig.c 
b/arch/powerpc/platforms/pseries/reconfig.c
index 7f7369f..8e5a2ba 100644
--- a/arch/powerpc/platforms/pseries/reconfig.c
+++ b/arch/powerpc/platforms/pseries/reconfig.c
@@ -165,32 +165,6 @@ static char * parse_next_property(char *buf, char *end, 
char **name, int *length
return tmp;
 }
 
-static struct property *new_property(const char *name, const int length,
-const unsigned char *value, struct 
property *last)
-{
-   struct 

[PATCHv3] powerpc/crashkernel: take "mem=" option into account

2020-02-19 Thread Pingfan Liu
The "mem=" option is an easy way to put high pressure on memory during
testing. Hence, after the memory limit is applied, the actually usable
memory rather than the total memory should be considered when reserving
memory for crashkernel. Otherwise the boot may hit an OOM.

E.g. when passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G" and
mem=5G on a 256G machine, 4G is reserved prior to this change and 512M
afterward.

This issue is powerpc specific because powerpc gives the fadump and kdump
reservations a higher priority than "mem=". See the following code:
	if (fadump_reserve_mem() == 0)
		reserve_crashkernel();
	...
	/* Ensure that total memory size is page-aligned. */
	limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
	memblock_enforce_memory_limit(limit);

On other arches, "mem=" takes a higher priority and its effect is already
reflected in memblock_phys_mem_size() before reserve_crashkernel() is called.

Signed-off-by: Pingfan Liu 
To: linuxppc-dev@lists.ozlabs.org
Cc: Hari Bathini 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
---
v2 -> v3: improve commit log
 arch/powerpc/kernel/machine_kexec.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/machine_kexec.c
index c4ed328..eec96dc 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
 
 void __init reserve_crashkernel(void)
 {
-   unsigned long long crash_size, crash_base;
+   unsigned long long crash_size, crash_base, total_mem_sz;
int ret;
 
+   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
-   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, total_mem_sz,
&crash_size, &crash_base);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
@@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
"for crashkernel (System RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crashk_res.start >> 20),
-   (unsigned long)(memblock_phys_mem_size() >> 20));
+   (unsigned long)(total_mem_sz >> 20));
 
if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
memblock_reserve(crashk_res.start, crash_size)) {
-- 
2.7.5
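
To make the 4G versus 512M example in the commit log concrete, here is a simplified userspace sketch of range-based crashkernel selection. It is not the kernel's parse_crashkernel(): `parse_size()`, `pick_crash_size()` and `usable_mem()` are hypothetical helpers, only the K/M/G suffixes and the "start-[end]:size" list form are handled, and offsets ("@...") are ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse "<number>[K|M|G]" into bytes; *end is left just after the suffix. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 10);

	if (**end == 'G') {
		v <<= 30;
		(*end)++;
	} else if (**end == 'M') {
		v <<= 20;
		(*end)++;
	} else if (**end == 'K') {
		v <<= 10;
		(*end)++;
	}
	return v;
}

/* Mirrors the patch: prefer the "mem=" limit when one is set. */
static unsigned long long usable_mem(unsigned long long memory_limit,
				     unsigned long long phys_mem)
{
	return memory_limit ? memory_limit : phys_mem;
}

/*
 * Pick the reservation for system_ram from a spec such as
 * "2G-4G:384M,4G-16G:512M,128G-:4G". A range matches when
 * start <= system_ram < end; an empty end means "no upper bound".
 * Returns 0 if nothing matches or the spec is malformed.
 */
static unsigned long long pick_crash_size(const char *spec,
					  unsigned long long system_ram)
{
	const char *cur = spec;
	char *end;

	assert(spec);
	while (*cur) {
		unsigned long long start, stop = ~0ULL, size;

		start = parse_size(cur, &end);
		if (end == cur || *end != '-')
			return 0;
		cur = end + 1;

		if (*cur != ':') {	/* explicit upper bound */
			stop = parse_size(cur, &end);
			if (end == cur)
				return 0;
			cur = end;
		}
		if (*cur != ':')
			return 0;
		cur++;

		size = parse_size(cur, &end);
		if (end == cur)
			return 0;
		cur = end;

		if (system_ram >= start && system_ram < stop)
			return size;

		if (*cur == ',')
			cur++;
		else if (*cur)
			return 0;
	}
	return 0;
}
```

With the spec from the commit log, 256G of RAM falls into the "128G-" range and yields 4G, while usable_mem(5G, 256G) falls into "4G-16G" and yields 512M.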



[PATCH 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-02-10 Thread Pingfan Liu
A bug is observed on pseries by taking the following steps on rhel:
-1. drmgr -c mem -r -q 5
-2. echo c > /proc/sysrq-trigger

And then, the failure looks like:
kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Checking for memory holes     : [  0.0 %] /
Checking for memory holes     : [100.0 %] |
Excluding unnecessary pages   : [100.0 %] \
Copying data                  : [  0.3 %] -  eta: 38s
[   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
[   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
[   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
[   44.338548] Core dump to |/bin/false pipe failed
/lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
kdump: saving vmcore failed

* Root cause *
  In the current implementation, when an lmb is hot-removed, the KOBJ_REMOVE
event is emitted before the dt is updated, because __remove_memory() is
called before drmem_update_dt().

From the viewpoint of listener and publisher, the publisher notifies the
listener before the data is ready.  As a result, udev launches kexec-tools
(due to KOBJ_REMOVE), which loads a stale dt before it is updated. In the
capture kernel, makedumpfile then accesses memory based on the stale dt
info and hits a SIGBUS error due to a non-existent lmb.

* Fix *
  To fix this issue, update the dt before __remove_memory(), and apply the
same rule in the hot-add path.

This introduces an extra dt update for each involved lmb during hotplug.
But it should be fine, since drmem_update_dt() is a memory-based operation
and hotplug is not a hot path.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Hari Bathini 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index a3a9353..1f623c3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -392,6 +392,9 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   rtas_hp_event = true;
+   drmem_update_dt();
+   rtas_hp_event = false;
 
__remove_memory(nid, base_addr, block_sz);
 
@@ -665,6 +668,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
lmb_set_nid(lmb);
lmb->flags |= DRCONF_MEM_ASSIGNED;
+   rtas_hp_event = true;
+   drmem_update_dt();
+   rtas_hp_event = false;
 
block_sz = memory_block_size_bytes();
 
@@ -683,6 +689,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   rtas_hp_event = true;
+   drmem_update_dt();
+   rtas_hp_event = false;
 
__remove_memory(nid, base_addr, block_sz);
}
@@ -939,12 +948,6 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
break;
}
 
-   if (!rc) {
-   rtas_hp_event = true;
-   rc = drmem_update_dt();
-   rtas_hp_event = false;
-   }
-
unlock_device_hotplug();
return rc;
 }
-- 
2.7.5
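
The publisher/listener ordering described in the commit log can be modelled with a toy in userspace C. This is only a sketch: `dt_lmbs`, `snapshot_lmbs`, `on_uevent()` and the two remove functions are hypothetical, and the "device tree" is reduced to a single counter. The listener copies the current state when notified, the way udev triggers a reload of the kdump kernel and its dt on a memory uevent.

```c
#include <assert.h>

/* Toy "device tree": the number of LMBs it currently advertises. */
static int dt_lmbs = 8;

/* What the listener captured the last time it was notified. */
static int snapshot_lmbs;

/* Listener: re-reads the device tree state on every event. */
static void on_uevent(void)
{
	snapshot_lmbs = dt_lmbs;
}

/* Old order: notify first, update the device tree afterwards. */
static void remove_lmb_notify_first(void)
{
	assert(dt_lmbs > 0);
	on_uevent();	/* listener sees the state before the removal */
	dt_lmbs--;
}

/* Patched order: update the device tree, then notify. */
static void remove_lmb_update_first(void)
{
	assert(dt_lmbs > 0);
	dt_lmbs--;
	on_uevent();	/* listener sees the state after the removal */
}
```

After remove_lmb_notify_first() the snapshot still counts the removed LMB, which corresponds to the stale dt that makedumpfile trips over. After remove_lmb_update_first() the snapshot and the dt agree.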



[PATCH 1/2] powerpc/pseries: group lmb operation and memblock's

2020-02-10 Thread Pingfan Liu
This patch prepares for the following patch, which swaps the order of the
KOBJ_ uevent and the dt update.

It has no functional effect. It only groups the lmb operations and the
memblock operations, so that the dt update can be inserted easily and the
change is easier to review.

Signed-off-by: Pingfan Liu 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Hari Bathini 
To: linuxppc-dev@lists.ozlabs.org
Cc: ke...@lists.infradead.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 26 -
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94..a3a9353 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -375,7 +375,8 @@ static int dlpar_add_lmb(struct drmem_lmb *);
 static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 {
unsigned long block_sz;
-   int rc;
+   phys_addr_t base_addr;
+   int rc, nid;
 
if (!lmb_is_removable(lmb))
return -EINVAL;
@@ -384,17 +385,19 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
if (rc)
return rc;
 
+   base_addr = lmb->base_addr;
+   nid = lmb->nid;
block_sz = pseries_memory_block_size();
 
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
-
-   /* Update memory regions for memory remove */
-   memblock_remove(lmb->base_addr, block_sz);
-
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
lmb->flags &= ~DRCONF_MEM_ASSIGNED;
 
+   __remove_memory(nid, base_addr, block_sz);
+
+   /* Update memory regions for memory remove */
+   memblock_remove(base_addr, block_sz);
+
return 0;
 }
 
@@ -661,6 +664,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
}
 
lmb_set_nid(lmb);
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
+
block_sz = memory_block_size_bytes();
 
/* Add the memory */
@@ -672,11 +677,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
rc = dlpar_online_lmb(lmb);
if (rc) {
-   __remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   int nid = lmb->nid;
+   phys_addr_t base_addr = lmb->base_addr;
+
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+
+   __remove_memory(nid, base_addr, block_sz);
}
 
return rc;
-- 
2.7.5



[PATCH] powerpc/pseries: in lmb_is_removable(), advance pfn if section is not present

2020-01-09 Thread Pingfan Liu
In lmb_is_removable(), if a section is not present, the loop should continue
with the remaining sections in the block. But the current code fails to do
so, because it does not advance phys_addr before the continue.

Signed-off-by: Pingfan Liu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94..a4d40a3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -360,8 +360,10 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 
for (i = 0; i < scns_per_block; i++) {
pfn = PFN_DOWN(phys_addr);
-   if (!pfn_present(pfn))
+   if (!pfn_present(pfn)) {
+   phys_addr += MIN_MEMORY_BLOCK_SIZE;
continue;
+   }
 
rc = rc && is_mem_section_removable(pfn, PAGES_PER_SECTION);
phys_addr += MIN_MEMORY_BLOCK_SIZE;
-- 
2.7.5
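
The effect of the missing increment can be reproduced with a small userspace model. This is a sketch with hypothetical names (`struct section`, `block_removable_buggy()`, `block_removable_fixed()`); an index into an array stands in for phys_addr, and the `removable` flag stands in for is_mem_section_removable().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct section {
	bool present;
	bool removable;
};

/*
 * Old behaviour: 'continue' skips the increment, so an absent section is
 * re-examined on every remaining iteration and later sections are never
 * tested.
 */
static bool block_removable_buggy(const struct section *s, size_t n)
{
	bool rc = true;
	size_t idx = 0;

	assert(s);
	for (size_t i = 0; i < n; i++) {
		if (!s[idx].present)
			continue;	/* idx is not advanced */
		rc = rc && s[idx].removable;
		idx++;
	}
	return rc;
}

/* Patched behaviour: step past the absent section before continuing. */
static bool block_removable_fixed(const struct section *s, size_t n)
{
	bool rc = true;
	size_t idx = 0;

	assert(s);
	for (size_t i = 0; i < n; i++) {
		if (!s[idx].present) {
			idx++;
			continue;
		}
		rc = rc && s[idx].removable;
		idx++;
	}
	return rc;
}
```

For a block whose first section is absent and whose second section is not removable, the buggy loop reports the block as removable, while the fixed loop does not.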



Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Mon, Oct 14, 2019 at 10:03:03PM +0200, Jan Kara wrote:
> On Mon 14-10-19 08:23:39, Eric Sandeen wrote:
> > On 10/14/19 4:43 AM, Jan Kara wrote:
> > > On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> > > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > > > > When using fadump (firmware assisted dump) mode on powerpc, a mismatch
> > > > > > between grub xfs driver and kernel xfs driver has been observed.
> > > > > > Note: fadump boots up in the following sequence: firmware -> grub
> > > > > > reads kernel and initramfs -> kernel boots.
> > > > > > 
> > > > > > The process to reproduce this mismatch:
> > > > > >    - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> > > > > >    - Replacing "path /var/crash" with "path /var/crashnew", then,
> > > > > >      "kdumpctl restart" to rebuild the initramfs. Detail about the
> > > > > >      rebuilding looks like:
> > > > > >        mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > > > > >        mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
> > > > > >        sync
> > > > > >    - "echo c >/proc/sysrq-trigger".
> > > > > > 
> > > > > > The result:
> > > > > > The dump image will not be saved under /var/crashnew/* as expected,
> > > > > > but still saved under /var/crash.
> > > > > > 
> > > > > > The root cause:
> > > > > > As Eric pointed out, on xfs, 'sync' ensures the consistency by
> > > > > > writing back metadata to xlog, but not necessarily to fsblock. This
> > > > > > raises an issue if grub can not replay the xlog before accessing the
> > > > > > xfs files. Since the above dir entry of initramfs should be saved as
> > > > > > inline data with xfs_inode, xfs_fs_sync_fs() does not guarantee it
> > > > > > is written to fsblock.
> > > > > > 
> > > > > > umount can be used to write metadata to fsblock, but the filesystem
> > > > > > can not be umounted if still in use.
> > > > > > 
> > > > > > There are two ways to fix this mismatch, either grub or xfs. It may
> > > > > > be easier to do this in xfs side by introducing an interface to flush
> > > > > > metadata to fsblock explicitly.
> > > > > > 
> > > > > > With this patch, metadata can be written to fsblock by:
> > > > > >    # update AIL
> > > > > >    sync
> > > > > >    # new introduced interface to flush metadata to fsblock
> > > > > >    mount -o remount,metasync mountpoint
> > > > > 
> > > > > > I think this ought to be an ioctl or some sort of generic call since the
> > > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > > > > is too dumb to recover logs but still wants to write to the fs"
> > > > > checkpointing problem.
> > > > Yes, a syscall sounds more reasonable.
> > > > > 
> > > > > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > > > > don't know...)
> > > > > I think it is unavoidable to involve each fs's implementation. What
> > > > > about introducing an interface sync_to_fsblock(struct super_block *sb)
> > > > > in struct super_operations, and then letting each fs handle its own case?
> > > 
> > > Well, we already have a way to achieve what you need: fsfreeze.
> > > Traditionally, that is guaranteed to put fs into a "clean" state very much
> > > equivalent to the fs being unmounted and that seems to be what the
> > > bootloader wants so that it can access the filesystem without worrying
> > > about some recovery details. So do you see any problem with replacing
> > > 'sync

Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Mon, Oct 14, 2019 at 08:23:39AM -0500, Eric Sandeen wrote:
> On 10/14/19 4:43 AM, Jan Kara wrote:
> > On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > > > > When using fadump (firmware-assisted dump) mode on powerpc, a mismatch
> > > > > > between the grub xfs driver and the kernel xfs driver has been observed.
> > > > > > Note: fadump boots up in the following sequence: firmware -> grub reads
> > > > > > kernel and initramfs -> kernel boots.
> > > > > > 
> > > > > > The process to reproduce this mismatch:
> > > > > >   - On powerpc, boot the kernel with fadump=on and edit /etc/kdump.conf.
> > > > > >   - Replace "path /var/crash" with "path /var/crashnew", then run "kdumpctl
> > > > > >     restart" to rebuild the initramfs. The rebuild looks roughly like:
> > > > > >       mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > > > > >       mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
> > > > > >       sync
> > > > > >   - "echo c >/proc/sysrq-trigger".
> > > > > > 
> > > > > > The result:
> > > > > > The dump image is not saved under /var/crashnew/* as expected, but is
> > > > > > still saved under /var/crash.
> > > > > > 
> > > > > > The root cause:
> > > > > > As Eric pointed out, on xfs 'sync' ensures consistency by writing
> > > > > > metadata back to the xlog, but not necessarily to the fsblock. This is a
> > > > > > problem if grub cannot replay the xlog before accessing the xfs files.
> > > > > > Since the dir entry of the initramfs above should be saved as inline data
> > > > > > within the xfs_inode, xfs_fs_sync_fs() does not guarantee that it is
> > > > > > written to the fsblock.
> > > > > > 
> > > > > > umount can be used to write metadata to the fsblock, but the filesystem
> > > > > > cannot be unmounted while it is still in use.
> > > > > > 
> > > > > > There are two ways to fix this mismatch: in grub or in xfs. It may be
> > > > > > easier to do this on the xfs side by introducing an interface to flush
> > > > > > metadata to the fsblock explicitly.
> > > > > > 
> > > > > > With this patch, metadata can be written to the fsblock by:
> > > > > >   # update AIL
> > > > > >   sync
> > > > > >   # newly introduced interface to flush metadata to the fsblock
> > > > > >   mount -o remount,metasync mountpoint
> > > > 
> > > > I think this ought to be an ioctl or some sort of generic call since the
> > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > > > is too dumb to recover logs but still wants to write to the fs"
> > > > checkpointing problem.
> > > Yes, a syscall sounds more reasonable.
> > > > 
> > > > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > > > don't know...)
> > > > I think it is unavoidable to involve each fs's implementation. What
> > > > about introducing an interface sync_to_fsblock(struct super_block *sb) in
> > > > struct super_operations, and then letting each fs handle its own case?
> > 
> > Well, we already have a way to achieve what you need: fsfreeze.
> > Traditionally, that is guaranteed to put fs into a "clean" state very much
> > equivalent to the fs being unmounted and that seems to be what the
> > bootloader wants so that it can access the filesystem without worrying
> > about some recovery details. So do you see any problem with replacing
> > 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
> > 
> > Honza
> 
> The problem with fsfreeze is that if the device you want to quiesce is, say,
> the root fs, freeze isn't really a good option.
Yes, that is the difference between my patch and fsfreeze.  But
honestly, it is rare for a system to have no separate /boot partition.
Since activity on /boot is very low, fsfreeze may meet the need, or it
can be retried repeatedly until it succeeds.
> 
> But the other thing I want to highlight about this approach is that it does not
> solve the root problem: something is trying to read the block device without
> first replaying the log.
> 
> A call such as the proposal here is only going to leave consistent metadata at
> the time the call returns; at any time after that, all guarantees are off again,
My patch assumes that grub only accesses a limited set of files (kernel,
initramfs) and ensures consistency only for those files.
> so the problem hasn't been solved.
Agreed. The ideal solution would be a log-aware bootloader.

Thanks and regards,
Pingfan
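
For illustration, the freeze/thaw alternative discussed in this thread can be
sketched as a small userspace C helper built on the FIFREEZE and FITHAW ioctls
from <linux/fs.h>. This is only a sketch under stated assumptions: the helper
name freeze_then_thaw() and its return convention are made up for this example
and do not come from the thread, and the caller needs sufficient privileges
for the freeze to succeed.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: freeze the filesystem mounted at `mountpoint`
 * and thaw it again right away.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int freeze_then_thaw(const char *mountpoint)
{
	int saved_errno;
	int fd = open(mountpoint, O_RDONLY);

	if (fd < 0)
		return -1;

	/* Freezing asks the filesystem to reach a quiesced on-disk state. */
	if (ioctl(fd, FIFREEZE, 0) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	/* Thaw immediately so that writers are blocked only briefly. */
	if (ioctl(fd, FITHAW, 0) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	close(fd);
	return 0;
}
```

A call such as freeze_then_thaw("/boot") would stand in for the fsfreeze
command pair Jan suggested. As Eric notes, the result only holds at the moment
the call returns, so this does not replace log recovery in the bootloader.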


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Mon, Oct 14, 2019 at 01:40:27AM -0700, Christoph Hellwig wrote:
> On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > When using fadump (firmware-assisted dump) mode on powerpc, a mismatch
> > between the grub xfs driver and the kernel xfs driver has been observed.  Note:
> > fadump boots up in the following sequence: firmware -> grub reads kernel
> > and initramfs -> kernel boots.
> 
> This isn't something new.  To fundamentally fix this you need to
> implement (in-memory) log recovery in grub.  That is the only really safe
> long-term solution.  But the equivalent of your patch you can already
Agreed. For consistency of the whole fs, grub needs to be aware of the
log. This patch, however, just assumes that the files accessed by grub
are known, and forces consistency only on those files.
> get by freezing and unfreezing the file system using the FIFREEZE and
> FITHAW ioctls.  And if my memory is serving me correctly Dave has been
Freezing blocks any further modification to the fs. That is different
from my patch, which has no such limitation.
> preaching that to the bootloader folks for a long time, but apparently
> without visible results.
Yes, it is a pity. And maybe it is not easy to do.

Thanks and regards,
Pingfan


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > When using fadump (firmware-assisted dump) mode on powerpc, a mismatch
> > between the grub xfs driver and the kernel xfs driver has been observed.  Note:
> > fadump boots up in the following sequence: firmware -> grub reads kernel
> > and initramfs -> kernel boots.
> > 
> > The process to reproduce this mismatch:
> >   - On powerpc, boot the kernel with fadump=on and edit /etc/kdump.conf.
> >   - Replace "path /var/crash" with "path /var/crashnew", then run "kdumpctl
> >     restart" to rebuild the initramfs. The rebuild looks roughly like:
> >       mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> >       mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
> >       sync
> >   - "echo c >/proc/sysrq-trigger".
> > 
> > The result:
> > The dump image is not saved under /var/crashnew/* as expected, but is
> > still saved under /var/crash.
> > 
> > The root cause:
> > As Eric pointed out, on xfs 'sync' ensures consistency by writing
> > metadata back to the xlog, but not necessarily to the fsblock. This is a
> > problem if grub cannot replay the xlog before accessing the xfs files. Since
> > the dir entry of the initramfs above should be saved as inline data within the
> > xfs_inode, xfs_fs_sync_fs() does not guarantee that it is written to the fsblock.
> > 
> > umount can be used to write metadata to the fsblock, but the filesystem cannot
> > be unmounted while it is still in use.
> > 
> > There are two ways to fix this mismatch: in grub or in xfs. It may be
> > easier to do this on the xfs side by introducing an interface to flush metadata
> > to the fsblock explicitly.
> > 
> > With this patch, metadata can be written to the fsblock by:
> >   # update AIL
> >   sync
> >   # newly introduced interface to flush metadata to the fsblock
> >   mount -o remount,metasync mountpoint
> 
> I think this ought to be an ioctl or some sort of generic call since the
> jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> is too dumb to recover logs but still wants to write to the fs"
> checkpointing problem.
Yes, a syscall sounds more reasonable.
> 
> (Or maybe we should just put all that stuff in a vfat filesystem, I
> don't know...)
I think it is unavoidable to involve each fs's implementation. What
about introducing an interface sync_to_fsblock(struct super_block *sb) in
struct super_operations, and then letting each fs handle its own case?
> 
> --D
> 
> > Signed-off-by: Pingfan Liu 
> > Cc: "Darrick J. Wong" 
> > Cc: Dave Chinner 
> > Cc: Eric Sandeen 
> > Cc: Hari Bathini 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > To: linux-...@vger.kernel.org
> > ---
> >  fs/xfs/xfs_mount.h  |  1 +
> >  fs/xfs/xfs_super.c  | 15 ++-
> >  fs/xfs/xfs_trans.h  |  2 ++
> >  fs/xfs/xfs_trans_ail.c  | 26 +-
> >  fs/xfs/xfs_trans_priv.h |  1 +
> >  5 files changed, 43 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index fdb60e0..85f32e6 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -243,6 +243,7 @@ typedef struct xfs_mount {
> > >  #define XFS_MOUNT_FILESTREAMS  (1ULL << 24)/* enable the filestreams
> > >    allocator */
> > >  #define XFS_MOUNT_NOATTR2  (1ULL << 25)/* disable use of attr2 format */
> > +#define XFS_MOUNT_METASYNC (1ull << 26)/* write meta to fsblock */
> >  
> >  #define XFS_MOUNT_DAX  (1ULL << 62)/* TEST ONLY! */
> >  
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 8d1df9f..41df810 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -59,7 +59,7 @@ enum {
> > Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota,
> > Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
> > Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
> > -   Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
> > +   Opt_discard, Opt_nodiscard, Opt_dax, Opt_metasync, Opt_err
> >  };
> >  
> >  static const match_table_t tokens = {
> > @@ -106,6 +106,7 @@ static const match_table_t tokens = {
> > {Opt_discard,   "discard"}, /* Discard unused blocks */
> > {Opt_nodiscard, "nodiscard"},   /* Do not dis

[PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-13 Thread Pingfan Liu
When using fadump (firmware-assisted dump) mode on powerpc, a mismatch
between the grub xfs driver and the kernel xfs driver has been observed.  Note:
fadump boots up in the following sequence: firmware -> grub reads kernel
and initramfs -> kernel boots.

The process to reproduce this mismatch:
  - On powerpc, boot the kernel with fadump=on and edit /etc/kdump.conf.
  - Replace "path /var/crash" with "path /var/crashnew", then run "kdumpctl
    restart" to rebuild the initramfs. The rebuild looks roughly like:
      mkdumprd /boot/initramfs-`uname -r`.img.tmp;
      mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
      sync
  - "echo c >/proc/sysrq-trigger".

The result:
The dump image is not saved under /var/crashnew/* as expected, but is
still saved under /var/crash.

The root cause:
As Eric pointed out, on xfs 'sync' ensures consistency by writing
metadata back to the xlog, but not necessarily to the fsblock. This is a
problem if grub cannot replay the xlog before accessing the xfs files. Since
the dir entry of the initramfs above should be saved as inline data within the
xfs_inode, xfs_fs_sync_fs() does not guarantee that it is written to the fsblock.

umount can be used to write metadata to the fsblock, but the filesystem cannot
be unmounted while it is still in use.

There are two ways to fix this mismatch: in grub or in xfs. It may be
easier to do this on the xfs side by introducing an interface to flush metadata
to the fsblock explicitly.

With this patch, metadata can be written to the fsblock by:
  # update AIL
  sync
  # newly introduced interface to flush metadata to the fsblock
  mount -o remount,metasync mountpoint

Signed-off-by: Pingfan Liu 
Cc: "Darrick J. Wong" 
Cc: Dave Chinner 
Cc: Eric Sandeen 
Cc: Hari Bathini 
Cc: linuxppc-dev@lists.ozlabs.org
To: linux-...@vger.kernel.org
---
 fs/xfs/xfs_mount.h      |  1 +
 fs/xfs/xfs_super.c      | 15 ++++++++++++++-
 fs/xfs/xfs_trans.h      |  2 ++
 fs/xfs/xfs_trans_ail.c  | 26 +++++++++++++++++++++++++-
 fs/xfs/xfs_trans_priv.h |  1 +
 5 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index fdb60e0..85f32e6 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -243,6 +243,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_FILESTREAMS  (1ULL << 24)/* enable the filestreams
   allocator */
 #define XFS_MOUNT_NOATTR2  (1ULL << 25)/* disable use of attr2 format */
+#define XFS_MOUNT_METASYNC (1ull << 26)/* write meta to fsblock */
 
 #define XFS_MOUNT_DAX  (1ULL << 62)/* TEST ONLY! */
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8d1df9f..41df810 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -59,7 +59,7 @@ enum {
Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota,
Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
-   Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
+   Opt_discard, Opt_nodiscard, Opt_dax, Opt_metasync, Opt_err
 };
 
 static const match_table_t tokens = {
@@ -106,6 +106,7 @@ static const match_table_t tokens = {
{Opt_discard,   "discard"}, /* Discard unused blocks */
{Opt_nodiscard, "nodiscard"},   /* Do not discard unused blocks */
{Opt_dax,   "dax"}, /* Enable direct access to bdev pages */
+   {Opt_metasync,  "metasync"},/* one shot to write meta to fsblock */
{Opt_err,   NULL},
 };
 
@@ -338,6 +339,9 @@ xfs_parseargs(
mp->m_flags |= XFS_MOUNT_DAX;
break;
 #endif
+   case Opt_metasync:
+   mp->m_flags |= XFS_MOUNT_METASYNC;
+   break;
default:
xfs_warn(mp, "unknown mount option [%s].", p);
return -EINVAL;
@@ -1259,6 +1263,9 @@ xfs_fs_remount(
mp->m_flags |= XFS_MOUNT_SMALL_INUMS;
mp->m_maxagi = xfs_set_inode_alloc(mp, sbp->sb_agcount);
break;
+   case Opt_metasync:
+   mp->m_flags |= XFS_MOUNT_METASYNC;
+   break;
default:
/*
 * Logically we would return an error here to prevent
@@ -1286,6 +1293,12 @@ xfs_fs_remount(
}
}
 
+   if (mp->m_flags & XFS_MOUNT_METASYNC) {
+   xfs_ail_push_sync(mp->m_ail);
+   /* one shot flag */
+   mp->m_flags &= ~XFS_MOUNT_METASYNC;
+   }
+
/* ro -> rw */
if ((mp->m_flags & XFS_MOUNT_RDONLY) && !(*flags & SB_RDONLY)) {
if (mp-&g

Re: [PATCH] powerpc/crashkernel: take mem option into account

2019-09-22 Thread Pingfan Liu
On Wed, Sep 18, 2019 at 7:23 PM Michael Ellerman  wrote:
>
> Pingfan Liu  writes:
> > Cc Kexec list. And keep the original content.
> >
> > On Thu, Sep 12, 2019 at 10:50 AM Pingfan Liu  wrote:
> >>
> >> 'mem=" option is an easy way to put high pressure on memory during some
> >> test. Hence in stead of total mem, the effective usable memory size
>                ^                          ^
>                instead                    "actual" would be clearer
>
> I think adding: "after applying the memory limit"
>
> would help here.
>
> >> should be considered when reserving mem for crashkernel. Otherwise
> >> the boot up may experience oom issue.
>                               ^
>                               OOM
> >>
> >> E.g passing
> >> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> >> mem=5G on a 256G machine.
>
> Spelling out the behaviour before and after would help here, eg:
>
> .. "would reserve 4G prior to the change and 512M afterward."
>
Thanks for your kind review. I will update the commit message based on
your suggestions.
>
> >> Signed-off-by: Pingfan Liu 
> >> Cc: Hari Bathini 
> >> Cc: Michael Ellerman 
> >> To: linuxppc-dev@lists.ozlabs.org
> >> ---
> >> v1 -> v2: fix the printk info about the total mem
> >>  arch/powerpc/kernel/machine_kexec.c | 7 ++++---
> >>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> >> index c4ed328..eec96dc 100644
> >> --- a/arch/powerpc/kernel/machine_kexec.c
> >> +++ b/arch/powerpc/kernel/machine_kexec.c
> >> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
> >>
> >>  void __init reserve_crashkernel(void)
> >>  {
> >> -   unsigned long long crash_size, crash_base;
> >> +   unsigned long long crash_size, crash_base, total_mem_sz;
> >> int ret;
> >>
> >> +   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
> >> /* use common parsing */
> >> -   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> >> +   ret = parse_crashkernel(boot_command_line, total_mem_sz,
> >> &crash_size, &crash_base);
>
> I think this change makes sense. But we have multiple arches that
> implement similar logic, and I wonder if we should keep them all the
> same.
>
> eg:
>
>   arch/arm/kernel/setup.c: ret = parse_crashkernel(boot_command_line, total_mem,
>   arch/arm64/mm/init.c: ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>   arch/ia64/kernel/setup.c: ret = parse_crashkernel(boot_command_line, total,
>   arch/mips/kernel/setup.c: ret = parse_crashkernel(boot_command_line, total_mem,
>   arch/powerpc/kernel/fadump.c: ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>   arch/powerpc/kernel/machine_kexec.c: ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>   arch/s390/kernel/setup.c: rc = parse_crashkernel(boot_command_line, memory_end, &crash_size,
>   arch/sh/kernel/machine_kexec.c: ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>   arch/x86/kernel/setup.c: ret = parse_crashkernel(boot_command_line, total_mem, &crash_size, &crash_base);
>
>
> From a quick glance most of them don't seem to take the memory limit
> into account.
>
> So I guess the question is do we want all arches to implement the same
> behaviour or do we think it doesn't matter if they differ in details
> like this?

On powerpc, the current code gives fadump/kdump a higher priority than
the "mem=" option, as the comment in fadump_reserve_mem() says:
"
/*
 * Calculate the memory boundary.
 * If memory_limit is less than actual memory boundary then reserve
 * the memory for fadump beyond the memory_limit and adjust the
 * memory_limit accordingly, so that the running kernel can run with
 * specified memory_limit.
 */
"

Other archs, by contrast, pack the "mem=" info into memblock before
memblock_phys_mem_size() is called. So by the time the result of
memblock_phys_mem_size() is passed to parse_crashkernel(), "mem=" has
already taken effect.

E.g. for x86, in arch/x86/kernel/e820.c:
static int __init parse_memopt(char *p)
{
...
    e820__range_remove(mem_size, ULLONG_MAX - mem_size, E820_TYPE_RAM, 1);
    /* this packs the "mem=" info into e820, which is finally fed to memblock */
}

Thanks,
Pingfan
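
To make the effect of the patch concrete, the range selection can be sketched
in userspace C. This is a simplified stand-in for the kernel's crashkernel=
range parsing, not the kernel code itself: parse_size(), crashkernel_size_for()
and effective_mem() are hypothetical names chosen for this example, only the
K/M/G suffixes are handled, and an optional @offset part is ignored.

```c
#include <limits.h>
#include <stdlib.h>

/* Parse a size such as "384M" or "4G" and advance *s past it. */
static unsigned long long parse_size(const char **s)
{
	char *end;
	unsigned long long v = strtoull(*s, &end, 10);
	unsigned int shift = 0;

	if (*end == 'K')
		shift = 10;
	else if (*end == 'M')
		shift = 20;
	else if (*end == 'G')
		shift = 30;
	if (shift)
		end++;
	*s = end;
	return v << shift;
}

/*
 * Walk a "start-[end]:size,..." list and return the size of the first
 * range that contains system_ram, or 0 if none matches or on bad syntax.
 */
unsigned long long crashkernel_size_for(const char *spec,
					unsigned long long system_ram)
{
	const char *cur = spec;

	while (*cur) {
		unsigned long long start, end = ULLONG_MAX, size;

		start = parse_size(&cur);
		if (*cur != '-')
			return 0;
		cur++;
		if (*cur != ':')	/* an open-ended range has no end value */
			end = parse_size(&cur);
		if (*cur != ':')
			return 0;
		cur++;
		size = parse_size(&cur);

		if (system_ram >= start && system_ram < end)
			return size;
		if (*cur != ',')
			break;
		cur++;
	}
	return 0;
}

/* Mirrors the patch: prefer the "mem=" limit when one is set. */
unsigned long long effective_mem(unsigned long long memory_limit,
				 unsigned long long phys_mem)
{
	return memory_limit ? memory_limit : phys_mem;
}
```

With the crashkernel= string from the commit message on a 256G machine,
passing the physical size selects the open-ended 128G- range, while passing
the mem=5G limit selects the 4G-16G range. That matches the "4G prior to the
change and 512M afterward" wording Michael suggested for the commit message.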


Re: [PATCH] powerpc/crashkernel: take mem option into account

2019-09-16 Thread Pingfan Liu
Cc Kexec list. And keep the original content.

On Thu, Sep 12, 2019 at 10:50 AM Pingfan Liu  wrote:
>
> 'mem=" option is an easy way to put high pressure on memory during some
> test. Hence in stead of total mem, the effective usable memory size should
> be considered when reserving mem for crashkernel. Otherwise the boot up may
> experience oom issue.
>
> E.g passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G on a 256G machine.
>
> Signed-off-by: Pingfan Liu 
> Cc: Hari Bathini 
> Cc: Michael Ellerman 
> To: linuxppc-dev@lists.ozlabs.org
> ---
> v1 -> v2: fix the printk info about the total mem
>  arch/powerpc/kernel/machine_kexec.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..eec96dc 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>
>  void __init reserve_crashkernel(void)
>  {
> -   unsigned long long crash_size, crash_base;
> +   unsigned long long crash_size, crash_base, total_mem_sz;
> int ret;
>
> +   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
> /* use common parsing */
> -   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +   ret = parse_crashkernel(boot_command_line, total_mem_sz,
> &crash_size, &crash_base);
> if (ret == 0 && crash_size > 0) {
> crashk_res.start = crash_base;
> @@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
> "for crashkernel (System RAM: %ldMB)\n",
> (unsigned long)(crash_size >> 20),
> (unsigned long)(crashk_res.start >> 20),
> -   (unsigned long)(memblock_phys_mem_size() >> 20));
> +   (unsigned long)(total_mem_sz >> 20));
>
> if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
> memblock_reserve(crashk_res.start, crash_size)) {
> --
> 2.7.5
>


[PATCH] powerpc/crashkernel: take mem option into account

2019-09-11 Thread Pingfan Liu
The "mem=" option is an easy way to put high pressure on memory during some
test. Hence in stead of total mem, the effective usable memory size should
be considered when reserving mem for crashkernel. Otherwise the boot up may
experience oom issue.

E.g passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
mem=5G on a 256G machine.

Signed-off-by: Pingfan Liu 
Cc: Hari Bathini 
Cc: Michael Ellerman 
To: linuxppc-dev@lists.ozlabs.org
---
v1 -> v2: fix the printk info about the total mem
 arch/powerpc/kernel/machine_kexec.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
index c4ed328..eec96dc 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
 
 void __init reserve_crashkernel(void)
 {
-   unsigned long long crash_size, crash_base;
+   unsigned long long crash_size, crash_base, total_mem_sz;
int ret;
 
+   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
-   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, total_mem_sz,
&crash_size, &crash_base);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
@@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
"for crashkernel (System RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crashk_res.start >> 20),
-   (unsigned long)(memblock_phys_mem_size() >> 20));
+   (unsigned long)(total_mem_sz >> 20));
 
if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
memblock_reserve(crashk_res.start, crash_size)) {
-- 
2.7.5



Re: [PATCH] powerpc/crashkernel: take mem option into account

2019-09-11 Thread Pingfan Liu
NACK. I missed updating the printk info. I will send out V2.

On Mon, Sep 9, 2019 at 12:05 PM Pingfan Liu  wrote:
>
> 'mem=" option is an easy way to put high pressure on memory during some
> test. Hence in stead of total mem, the effective usable memory size should
> be considered when reserving mem for crashkernel. Otherwise the boot up may
> experience oom issue.
>
> E.g passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G.
>
> Signed-off-by: Pingfan Liu 
> Cc: Hari Bathini 
> Cc: Michael Ellerman 
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/machine_kexec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..714b733 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>
>  void __init reserve_crashkernel(void)
>  {
> -   unsigned long long crash_size, crash_base;
> +   unsigned long long crash_size, crash_base, total_mem_sz;
> int ret;
>
> +   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
> /* use common parsing */
> -   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +   ret = parse_crashkernel(boot_command_line, total_mem_sz,
> &crash_size, &crash_base);
> if (ret == 0 && crash_size > 0) {
> crashk_res.start = crash_base;
> --
> 2.7.5
>


Re: [PATCH] powerpc/crashkernel: take mem option into account

2019-09-09 Thread Pingfan Liu
On Mon, Sep 9, 2019 at 12:05 PM Pingfan Liu  wrote:
>
> 'mem=" option is an easy way to put high pressure on memory during some
> test. Hence in stead of total mem, the effective usable memory size should
> be considered when reserving mem for crashkernel. Otherwise the boot up may
> experience oom issue.
>
> E.g passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G.
>
> Signed-off-by: Pingfan Liu 
> Cc: Hari Bathini 
> Cc: Michael Ellerman 
> To: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/machine_kexec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..714b733 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>
>  void __init reserve_crashkernel(void)
>  {
> -   unsigned long long crash_size, crash_base;
> +   unsigned long long crash_size, crash_base, total_mem_sz;
> int ret;
>
> +   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
Here memory_limit is only used as an estimate and may still be changed.
So I think it is better to use memory_limit here than to move
memblock_enforce_memory_limit() before the call to
reserve_crashkernel().

Thanks,
Pingfan
> /* use common parsing */
> -   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> +   ret = parse_crashkernel(boot_command_line, total_mem_sz,
> &crash_size, &crash_base);
> if (ret == 0 && crash_size > 0) {
> crashk_res.start = crash_base;
> --
> 2.7.5
>


[PATCH] powerpc/crashkernel: take mem option into account

2019-09-08 Thread Pingfan Liu
The "mem=" option is an easy way to put high pressure on memory during some
test. Hence in stead of total mem, the effective usable memory size should
be considered when reserving mem for crashkernel. Otherwise the boot up may
experience oom issue.

E.g passing
crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
mem=5G.

Signed-off-by: Pingfan Liu 
Cc: Hari Bathini 
Cc: Michael Ellerman 
To: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/machine_kexec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
index c4ed328..714b733 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
 
 void __init reserve_crashkernel(void)
 {
-   unsigned long long crash_size, crash_base;
+   unsigned long long crash_size, crash_base, total_mem_sz;
int ret;
 
+   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
-   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, total_mem_sz,
&crash_size, &crash_base);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
-- 
2.7.5



Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant

2019-05-23 Thread Pingfan Liu
Matthias, ping? Any suggestions?

Thanks,
Pingfan


On Thu, May 2, 2019 at 2:22 PM Pingfan Liu  wrote:
>
> On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu  wrote:
> >
> > On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger  wrote:
> > >
> > >
> > [...]
> > > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char 
> > > > *cmdline,
> > > >   pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> > > >   return -EINVAL;
> > > >   }
> > > > + if (*crash_size == 0)
> > > > + return -EINVAL;
> > >
> > > > This covers the case where I pass an argument like "crashkernel=0M" ?
> > > > Can't we fix that by using kstrtoull() in memparse and check if the return value
> > > > is < 0? In that case we could return without updating the retptr and we will be
> > > > fine.
> After working on this for a while, I realized that it cannot be done
> this way. "0M" causes kstrtoull() to return -EINVAL, but that is
> caused by the "M", not the "0". If "0" is passed to kstrtoull(), it
> returns 0 for success.
>
> > >
> > It seems that kstrtoull() treats 0M as invalid parameter, while
> > simple_strtoull() does not.
> >
> That came from my careless reading of the code. I also tested a valid
> value, "256M", with kstrtoull(), and it returned -EINVAL as well.
>
> So I think there is no way to distinguish 0 from a positive value
> inside this basic math function.
> Am I missing anything?
>
> Thanks and regards,
> Pingfan


Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant

2019-05-02 Thread Pingfan Liu
On Thu, Apr 25, 2019 at 4:20 PM Pingfan Liu  wrote:
>
> On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger  wrote:
> >
> >
> [...]
> > > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char 
> > > *cmdline,
> > >   pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> > >   return -EINVAL;
> > >   }
> > > + if (*crash_size == 0)
> > > + return -EINVAL;
> >
> > > This covers the case where I pass an argument like "crashkernel=0M" ?
> > > Can't we fix that by using kstrtoull() in memparse and check if the return value
> > > is < 0? In that case we could return without updating the retptr and we will be
> > > fine.
After working on this for a while, I realized that it cannot be done
this way. "0M" causes kstrtoull() to return -EINVAL, but that is
caused by the "M", not the "0". If "0" is passed to kstrtoull(), it
returns 0 for success.

> >
> It seems that kstrtoull() treats 0M as invalid parameter, while
> simple_strtoull() does not.
>
That came from my careless reading of the code. I also tested a valid
value, "256M", with kstrtoull(), and it returned -EINVAL as well.

So I think there is no way to distinguish 0 from a positive value
inside this basic math function.
Am I missing anything?

Thanks and regards,
Pingfan
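
The difference between the two parsing styles discussed here can be
illustrated in userspace C. Both helpers below are hypothetical approximations
built on strtoull() and are not the kernel implementations: strict_parse_ull()
imitates the kstrtoull() rule that the whole string must be a number, and
suffix_parse() imitates a memparse()-style parser that accepts a K/M/G suffix
and reports where parsing stopped.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Strict parser in the spirit of kstrtoull(): the entire string must be
 * a decimal number (a single trailing newline is tolerated).
 * Returns 0 on success, -EINVAL or -ERANGE on failure.
 * `res` may be NULL if the caller only wants the status.
 */
int strict_parse_ull(const char *s, unsigned long long *res)
{
	char *end;
	unsigned long long v;

	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')
		end++;
	if (*end != '\0')	/* any suffix such as "M" is rejected */
		return -EINVAL;
	if (res)
		*res = v;
	return 0;
}

/*
 * Suffix-aware parser in the spirit of memparse(): parses a leading
 * number, scales it by an optional K/M/G suffix, and stores the first
 * unparsed character in *retptr when retptr is not NULL.
 */
unsigned long long suffix_parse(const char *s, char **retptr)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 10);
	unsigned int shift = 0;

	if (*end == 'K')
		shift = 10;
	else if (*end == 'M')
		shift = 20;
	else if (*end == 'G')
		shift = 30;
	if (shift)
		end++;
	if (retptr)
		*retptr = end;
	return v << shift;
}
```

In this sketch the strict parser rejects "0M" and "256M" alike, so its error
code cannot be used to single out a zero size, whereas the suffix-aware parser
simply returns 0 for "0M". That is consistent with checking *crash_size == 0
in the caller, as the quoted hunk does.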

