Re: [PATCH] MAINTAINERS: powerpc: Remove Aneesh

2024-04-29 Thread Aneesh Kumar K . V
Michael Ellerman  writes:

> Aneesh is stepping down from powerpc maintenance.
>
> Signed-off-by: Michael Ellerman 

Acked-by: Aneesh Kumar K.V (Arm) 

> ---
>  MAINTAINERS | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7c121493f43d..93af33ee8541 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -12478,7 +12478,6 @@ LINUX FOR POWERPC (32-BIT AND 64-BIT)
>  M:   Michael Ellerman 
>  R:   Nicholas Piggin 
>  R:   Christophe Leroy 
> -R:   Aneesh Kumar K.V 
>  R:   Naveen N. Rao 
>  L:   linuxppc-dev@lists.ozlabs.org
>  S:   Supported
> -- 
> 2.44.0


Re: [PATCH] MAINTAINERS: MMU GATHER: Update Aneesh's address

2024-04-29 Thread Aneesh Kumar K . V
Michael Ellerman  writes:

> Aneesh's IBM address no longer works, switch to his preferred kernel.org
> address.
>
> Signed-off-by: Michael Ellerman 
 
Acked-by: Aneesh Kumar K.V (Arm) 
 
> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 93af33ee8541..f096c9fff5b3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14898,7 +14898,7 @@ F:drivers/phy/marvell/phy-pxa-usb.c
>  
>  MMU GATHER AND TLB INVALIDATION
>  M:   Will Deacon 
> -M:   "Aneesh Kumar K.V" 
> +M:   "Aneesh Kumar K.V" 
>  M:   Andrew Morton 
>  M:   Nick Piggin 
>  M:   Peter Zijlstra 
> -- 
> 2.44.0


Re: [PATCH] powerpc: align memory_limit to 16MB in early_parse_mem

2024-03-08 Thread Aneesh Kumar K . V
Joel Savitz  writes:

> On 64-bit powerpc, usage of a non-16MB-aligned value for the mem= kernel
> cmdline parameter results in a system hang at boot.
>
> For example, using 'mem=4198400K' will always reproduce this issue.
>
> This patch fixes the problem by aligning any argument to mem= to 16MB
> corresponding with the large page size on powerpc.
>
> Fixes: 2babf5c2ec2f ("[PATCH] powerpc: Unify mem= handling")
> Co-developed-by: Gonzalo Siero 
> Signed-off-by: Gonzalo Siero 
> Signed-off-by: Joel Savitz 
> ---
>  arch/powerpc/kernel/prom.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 0b5878c3125b..8cd3e2445d8a 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -82,8 +82,12 @@ static int __init early_parse_mem(char *p)
>  {
>   if (!p)
>   return 1;
> -
> +#ifdef CONFIG_PPC64
> + /* Align to 16 MB == size of ppc64 large page */
> + memory_limit = ALIGN(memparse(p, &p), 0x1000000);
> +#else
> memory_limit = PAGE_ALIGN(memparse(p, &p));
> +#endif
>   DBG("memory limit = 0x%llx\n", memory_limit);
>  
>   return 0;
> -- 
> 2.43.0

Can you try this change?

commit bc55e1aa71f545cff31e1eccdb4a2e39df84
Author: Aneesh Kumar K.V (IBM) 
Date:   Fri Mar 8 14:45:26 2024 +0530

powerpc/mm: Align memory_limit value specified using mem= kernel parameter

The value specified for the memory limit is used to set a restriction on
memory usage. It is important to ensure that this restriction is within
the linear map kernel address space range. The hash page table
translation uses a 16MB page size to map the kernel linear map address
space. htab_bolt_mapping() function aligns down the size of the range
while mapping kernel linear address space. Since the memblock limit is
enforced very early during boot, before we can detect the type of memory
translation (radix vs hash), we align the memory limit value specified
as a kernel parameter to 16MB. This alignment value will work for both
hash and radix translations.

Signed-off-by: Aneesh Kumar K.V (IBM) 

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 0b5878c3125b..9bd965d35352 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -824,8 +824,11 @@ void __init early_init_devtree(void *params)
reserve_crashkernel();
early_reserve_mem();
 
-   /* Ensure that total memory size is page-aligned. */
-   limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
+   if (memory_limit > memblock_phys_mem_size())
+   memory_limit = 0;
+
+   /* Align down to 16 MB which is large page size with hash page translation */
+   limit = ALIGN_DOWN(memory_limit ?: memblock_phys_mem_size(), SZ_16M);
memblock_enforce_memory_limit(limit);
 
 #if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_4K_PAGES)
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index e67effdba85c..d6410549e141 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -817,8 +817,8 @@ static void __init early_cmdline_parse(void)
opt += 4;
prom_memory_limit = prom_memparse(opt, (const char **));
 #ifdef CONFIG_PPC64
-   /* Align to 16 MB == size of ppc64 large page */
-   prom_memory_limit = ALIGN(prom_memory_limit, 0x100);
+   /* Align down to 16 MB which is large page size with hash page translation */
+   prom_memory_limit = ALIGN_DOWN(prom_memory_limit, SZ_16M);
 #endif
}
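
For reference, with the reported mem=4198400K (4100 MB) the ALIGN_DOWN()
above clamps the limit down to 4096 MB. A quick standalone check of the
arithmetic (plain userspace C, not kernel code; SZ_16M/ALIGN_DOWN are
redefined here only so it builds on its own):

#include <stdio.h>

#define SZ_16M			(16ULL * 1024 * 1024)
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))	/* 'a' must be a power of two */

int main(void)
{
	/* mem=4198400K from the bug report */
	unsigned long long limit = 4198400ULL * 1024;

	printf("requested: %llu MB, aligned down: %llu MB\n",
	       limit >> 20, ALIGN_DOWN(limit, SZ_16M) >> 20);
	/* prints: requested: 4100 MB, aligned down: 4096 MB */
	return 0;
}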
 


Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-03-07 Thread Aneesh Kumar K V
On 3/7/24 5:13 PM, Michael Ellerman wrote:
> Hi Mahesh,
> 
> Mahesh Salgaonkar  writes:
>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
>> interrupt handler) if percpu allocation comes from vmalloc area.
>>
>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
>> percpu allocation is from the embedded first chunk. However with
>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
>> allocation can come from the vmalloc area.
>>
>> With kernel command line "percpu_alloc=page" we can force percpu allocation
>> to come from vmalloc area and can see kernel crash in machine_check_early:
>>
>> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
>> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
>> [1.215719] --- interrupt: 200
>> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
>> [1.215722] [c00fffd731b0] [] 0x0
>> [1.215724] [c00fffd73210] [c0008364] 
>> machine_check_early_common+0x134/0x1f8
>>
>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
>> first chunk is not embedded.
> 
> My system (powernv) doesn't even boot with percpu_alloc=page.
> 


Can you share the crash details? 


> AFAIK the only reason we added support for it was to handle 4K kernels
> with HPT. See commit eb553f16973a ("powerpc/64/mm: implement page
> mapping percpu first chunk allocator").
> 
> So I wonder if we should change the Kconfig to only offer it as an
> option in that case, and change the logic in setup_per_cpu_areas() to
> only use it as a last resort.
> 
> I guess we probably still need this commit though, even if just for 4K HPT.
> 
>
We have also observed errors when there is a large gap between the start
memory of NUMA nodes. That made the percpu offset really large, causing
boot failures even on 64K page size.

-aneesh


Re: [PATCH] powerpc/mm: Code cleanup for __hash_page_thp

2024-02-28 Thread Aneesh Kumar K . V
Michael Ellerman  writes:

> Kunwu Chan  writes:
>> Thanks for the reply.
>>
>> On 2024/2/26 18:49, Michael Ellerman wrote:
>>> Kunwu Chan  writes:
 This part was commented from commit 6d492ecc6489
 ("powerpc/THP: Add code to handle HPTE faults for hugepages")
 in about 11 years before.

 If there are no plans to enable this part code in the future,
 we can remove this dead code.
>>> 
>>> I agree the code can go. But I'd like it to be replaced with a comment
>>> explaining what the dead code was trying to say.
>
>> Thanks, i'll update a new patch with the following comment:
>>  /*
>>  * No CPU has hugepages but lacks no execute, so we
>>  * don't need to worry about cpu no CPU_FTR_COHERENT_ICACHE feature case
>>  */
>
> Maybe wait until we can get some input from Aneesh. I'm not sure the
> code/comment are really up to date.
>
> cheers

How about this?

modified   arch/powerpc/mm/book3s64/hash_hugepage.c
@@ -58,17 +58,13 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
return 0;
 
rflags = htab_convert_pte_flags(new_pmd, flags);
+   /*
+* THPs are only supported on platforms that can do mixed page size
+* segments (MPSS) and all such platforms have coherent icache. Hence we
+* don't need to do lazy icache flush (hash_page_do_lazy_icache()) on
+* noexecute fault.
+*/
 
-#if 0
-   if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
-
-   /*
-* No CPU has hugepages but lacks no execute, so we
-* don't need to worry about that case
-*/
-   rflags = hash_page_do_lazy_icache(rflags, __pte(old_pte), trap);
-   }
-#endif
/*
 * Find the slot index details for this ea, using base page size.
 */



Re: [PATCH v2] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-05 Thread Aneesh Kumar K . V
Amit Machhiwal  writes:

> Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
> below error as L1 qemu sends PVR value 'arch_compat' == 0 via
> ppc_set_compat ioctl. This triggers a condition failure in
> kvmppc_set_arch_compat() resulting in an EINVAL.
>
> qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
> argument
>
> Also, a value of 0 for arch_compat generally refers the default
> compatibility of the host. But, arch_compat, being a Guest Wide Element
> in nested API v2, cannot be set to 0 in GSB as PowerVM (L0) expects a
> non-zero value. A value of 0 triggers a kernel trap during a reboot and
> consequently causes it to fail:
>
> [   22.106360] reboot: Restarting system
> KVM: unknown exit, hardware reason ffea
> NIP 0100   LR fe44 CTR  XER 
> 20040092 CPU#0
> MSR 1000 HID0   HF 6c00 iidx 3 didx 3
> TB   DECR 0
> GPR00   c2a8c300 7fe0
> GPR04   1002 82803033
> GPR08 0a00  0004 2fff
> GPR12  c2e1 000105639200 0004
> GPR16  00010563a090  
> GPR20 000105639e20 0001056399c8 7fffe54abab0 000105639288
> GPR24  0001 0001 
> GPR28   c2b30840 
> CR   [ -  -  -  -  -  -  -  -  ] RES 000@
>  SRR0   SRR1 PVR 00800200 VRSAVE 
> 
> SPRG0  SPRG1   SPRG2   SPRG3 
> 
> SPRG4  SPRG5   SPRG6   SPRG7 
> 
> HSRR0  HSRR1 
>  CFAR 
>  LPCR 00020400
>  PTCR    DAR   DSISR 
>
>  kernel:trap=0xffea | pc=0x100 | msr=0x1000
>
> This patch updates kvmppc_set_arch_compat() to use the host PVR value if
> 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
> specific PVR compat mode.
>

Reviewed-by: Aneesh Kumar K.V (IBM) 

>
> Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
> Signed-off-by: Amit Machhiwal 
> ---
>
> Changes v1 -> v2:
> - Added descriptive error log in the patch description when
>   `arch_compat == 0` passed in GSB
> - Added a helper function for PCR to capabilities mapping
> - Added relevant comments around the changes being made
>
> v1: 
> https://lore.kernel.org/lkml/20240118095653.2588129-1-amach...@linux.ibm.com/
>
>  arch/powerpc/kvm/book3s_hv.c  | 25 +++--
>  arch/powerpc/kvm/book3s_hv_nestedv2.c | 23 +--
>  2 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 52427fc2a33f..270ab9cf9a54 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -391,6 +391,23 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 
> pvr)
>  /* Dummy value used in computing PCR value below */
>  #define PCR_ARCH_31 (PCR_ARCH_300 << 1)
>  
> +static inline unsigned long map_pcr_to_cap(unsigned long pcr)
> +{
> + unsigned long cap = 0;
> +
> + switch (pcr) {
> + case PCR_ARCH_300:
> + cap = H_GUEST_CAP_POWER9;
> + break;
> + case PCR_ARCH_31:
> + cap = H_GUEST_CAP_POWER10;
> + default:
> + break;
> + }
> +
> + return cap;
> +}
> +
>  static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
>  {
>   unsigned long host_pcr_bit = 0, guest_pcr_bit = 0, cap = 0;
> @@ -424,11 +441,9 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
> u32 arch_compat)
>   break;
>   case PVR_ARCH_300:
>   guest_pcr_bit = PCR_ARCH_300;
> - cap = H_GUEST_CAP_POWER9;
>   break;
>   case PVR_ARCH_31:
>   guest_pcr_bit = PCR_ARCH_31;
> - cap = H_GUEST_CAP_POWER10;
>   break;
>   default:
>   return -EINVAL;
> @@ -440,6 +455,12 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
> u32 arch_compat)
>   return -EINVAL;
>  
>   if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
> + /*
> +  * 'arch_compat == 0' would mean the guest should default to
> +  * L1's compatibility. In this case, the guest would pick
> +  * host's PCR and evaluate the corresponding capabilities.
> +  */
> + cap = map_pcr_to_cap(guest_pcr_bit);
>  

Re: [PATCH v2 1/2] powerpc: Add Power11 architected and raw mode

2024-02-05 Thread Aneesh Kumar K . V
Madhavan Srinivasan  writes:

> reg.h is updated with Power11 pvr. pvr_mask value of 0x0F000007
> means we are arch v3.1 compliant.
>

If it is called arch v3.1, it will conflict with:


#define PVR_ARCH_31 0x0f000006

>This is used by phyp and
> kvm when booting as a pseries guest to detect and enable
> the appropriate hwcap, facility bits and PMU related fields.
> Copied most of fields from Power10 table entry and added relevant
> Power11 setup/restore and device tree routines.
>
> Signed-off-by: Madhavan Srinivasan 
> ---
> Changelog v1:
> - no change in this patch.
>
>  arch/powerpc/include/asm/cpu_setup.h  |  2 ++
>  arch/powerpc/include/asm/cputable.h   |  3 ++
>  arch/powerpc/include/asm/mce.h|  1 +
>  arch/powerpc/include/asm/mmu.h|  1 +
>  arch/powerpc/include/asm/reg.h|  1 +
>  arch/powerpc/kernel/cpu_setup_power.c | 10 +++
>  arch/powerpc/kernel/cpu_specs_book3s_64.h | 34 +++
>  arch/powerpc/kernel/dt_cpu_ftrs.c | 15 ++
>  arch/powerpc/kernel/mce_power.c   |  5 
>  arch/powerpc/kernel/prom_init.c   | 10 ++-
>  10 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/cpu_setup.h 
> b/arch/powerpc/include/asm/cpu_setup.h
> index 30e2fe389502..ce800650bb8b 100644
> --- a/arch/powerpc/include/asm/cpu_setup.h
> +++ b/arch/powerpc/include/asm/cpu_setup.h
> @@ -9,10 +9,12 @@ void __setup_cpu_power7(unsigned long offset, struct 
> cpu_spec *spec);
>  void __setup_cpu_power8(unsigned long offset, struct cpu_spec *spec);
>  void __setup_cpu_power9(unsigned long offset, struct cpu_spec *spec);
>  void __setup_cpu_power10(unsigned long offset, struct cpu_spec *spec);
> +void __setup_cpu_power11(unsigned long offset, struct cpu_spec *spec);
>  void __restore_cpu_power7(void);
>  void __restore_cpu_power8(void);
>  void __restore_cpu_power9(void);
>  void __restore_cpu_power10(void);
> +void __restore_cpu_power11(void);
>  
>  void __setup_cpu_e500v1(unsigned long offset, struct cpu_spec *spec);
>  void __setup_cpu_e500v2(unsigned long offset, struct cpu_spec *spec);
> diff --git a/arch/powerpc/include/asm/cputable.h 
> b/arch/powerpc/include/asm/cputable.h
> index 8765d5158324..3bd6e6e0224c 100644
> --- a/arch/powerpc/include/asm/cputable.h
> +++ b/arch/powerpc/include/asm/cputable.h
> @@ -454,6 +454,9 @@ static inline void cpu_feature_keys_init(void) { }
>   CPU_FTR_ARCH_300 | CPU_FTR_ARCH_31 | \
>   CPU_FTR_DAWR | CPU_FTR_DAWR1 | \
>   CPU_FTR_DEXCR_NPHIE)
> +
> +#define CPU_FTRS_POWER11 CPU_FTRS_POWER10
>

One of the problems with that is we have code that does the below in KVM:

if (cpu_has_feature(CPU_FTR_ARCH_31))
host_pcr_bit = PCR_ARCH_31;


How should we handle that?

> +
>  #define CPU_FTRS_CELL(CPU_FTR_LWSYNC | \
>   CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
>   CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index c9f0936bd3c9..241eee743fc5 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -257,6 +257,7 @@ long __machine_check_early_realmode_p7(struct pt_regs 
> *regs);
>  long __machine_check_early_realmode_p8(struct pt_regs *regs);
>  long __machine_check_early_realmode_p9(struct pt_regs *regs);
>  long __machine_check_early_realmode_p10(struct pt_regs *regs);
> +long __machine_check_early_realmode_p11(struct pt_regs *regs);
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
> diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
> index d8b7e246a32f..61ebe5eff2c9 100644
> --- a/arch/powerpc/include/asm/mmu.h
> +++ b/arch/powerpc/include/asm/mmu.h
> @@ -133,6 +133,7 @@
>  #define MMU_FTRS_POWER8  MMU_FTRS_POWER6
>  #define MMU_FTRS_POWER9  MMU_FTRS_POWER6
>  #define MMU_FTRS_POWER10 MMU_FTRS_POWER6
> +#define MMU_FTRS_POWER11 MMU_FTRS_POWER6
>  #define MMU_FTRS_CELLMMU_FTRS_DEFAULT_HPTE_ARCH_V2 | \
>   MMU_FTR_CI_LARGE_PAGE
>  #define MMU_FTRS_PA6TMMU_FTRS_DEFAULT_HPTE_ARCH_V2 | \
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 7fd09f25452d..7a7aa24bf57a 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -1364,6 +1364,7 @@
>  #define PVR_HX_C2000 0x0066
>  #define PVR_POWER9   0x004E
>  #define PVR_POWER10  0x0080
> +#define PVR_POWER11  0x0082
>  #define PVR_BE   0x0070
>  #define PVR_PA6T 0x0090
>  
> diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
> b/arch/powerpc/kernel/cpu_setup_power.c
> index 98bd4e6c1770..8c24fc67d90f 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.c
> +++ b/arch/powerpc/kernel/cpu_setup_power.c
> @@ -286,3 +286,13 @@ void __restore_cpu_power10(void)
>   init_HFSCR();
>   

Re: [PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-01-23 Thread Aneesh Kumar K . V
Amit Machhiwal  writes:

> Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
> below error as L1 qemu sends PVR value 'arch_compat' == 0 via
> ppc_set_compat ioctl. This triggers a condition failure in
> kvmppc_set_arch_compat() resulting in an EINVAL.
>
> qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
>
> This patch updates kvmppc_set_arch_compat() to use the host PVR value if
> 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
> specific PVR compat mode.
>
> Signed-off-by: Amit Machhiwal 
> ---
>  arch/powerpc/kvm/book3s_hv.c  |  2 +-
>  arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 1ed6ec140701..9573d7f4764a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
> u32 arch_compat)
>   if (guest_pcr_bit > host_pcr_bit)
>   return -EINVAL;
>  
> - if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
> + if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) {
>   if (!(cap & nested_capabilities))
>   return -EINVAL;
>   }
>

Instead of that arch_compat check, would it be better to do:

if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
if (cap && !(cap & nested_capabilities))
return -EINVAL;
}

i.e., if a capability is requested, then check it against nested_capabilities
to see if the capability exists.



> diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
> b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> index fd3c4f2d9480..069a1fcfd782 100644
> --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
> +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> @@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> kvmppc_gs_buff *gsb,
>   vector128 v;
>   int rc, i;
>   u16 iden;
> + u32 arch_compat = 0;
>  
>   vcpu = gsm->data;
>  
> @@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> kvmppc_gs_buff *gsb,
>   break;
>   }
>   case KVMPPC_GSID_LOGICAL_PVR:
> - rc = kvmppc_gse_put_u32(gsb, iden,
> - vcpu->arch.vcore->arch_compat);
> + if (!vcpu->arch.vcore->arch_compat) {
> + if (cpu_has_feature(CPU_FTR_ARCH_31))
> + arch_compat = PVR_ARCH_31;
> + else if (cpu_has_feature(CPU_FTR_ARCH_300))
> + arch_compat = PVR_ARCH_300;
> + } else {
> + arch_compat = vcpu->arch.vcore->arch_compat;
> + }
> + rc = kvmppc_gse_put_u32(gsb, iden, arch_compat);
>

Won't arch_compat = 0 work here? i.e., were you observing the -EINVAL from
the first hunk, or does this hunk have an impact?

>   break;
>   }
>  
> -- 
> 2.43.0

-aneesh


Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64

2024-01-23 Thread Aneesh Kumar K . V
David Hildenbrand  writes:

>>>
 If high bits are used for
 something else, then we might produce a garbage PTE on overflow, but that
 shouldn't really matter I concluded for folio_pte_batch() purposes, we'd 
 not
 detect "belongs to this folio batch" either way.
>>>
>>> Exactly.
>>>

 Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I 
 just
 hope that we don't lose any other arbitrary PTE bits by doing the 
 pte_pgprot().
>>>
>>> I don't see the need for ppc to implement pte_next_pfn().
>> 
>> Agreed.
>
> So likely we should then do on top for powerpc (whitespace damage):
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index a04ae4449a025..549a440ed7f65 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, 
> pte_t *ptep,
>  break;
>  ptep++;
>  addr += PAGE_SIZE;
> -   /*
> -* increment the pfn.
> -*/
> -   pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
> +   pte = pte_next_pfn(pte);
>  }
>   }

Agreed.

-aneesh


Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64

2024-01-23 Thread Aneesh Kumar K . V
David Hildenbrand  writes:

> On 23.01.24 12:38, Ryan Roberts wrote:
>> On 23/01/2024 11:31, David Hildenbrand wrote:
>
>> If high bits are used for
>> something else, then we might produce a garbage PTE on overflow, but that
>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd 
>> not
>> detect "belongs to this folio batch" either way.
>
> Exactly.
>
>>
>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I 
>> just
>> hope that we don't lose any other arbitrary PTE bits by doing the 
>> pte_pgprot().
>
> I don't see the need for ppc to implement pte_next_pfn().

 Agreed.
>>>
>>> So likely we should then do on top for powerpc (whitespace damage):
>>>
>>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>>> index a04ae4449a025..549a440ed7f65 100644
>>> --- a/arch/powerpc/mm/pgtable.c
>>> +++ b/arch/powerpc/mm/pgtable.c
>>> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep,
>>>      break;
>>>      ptep++;
>>>      addr += PAGE_SIZE;
>>> -   /*
>>> -    * increment the pfn.
>>> -    */
>>> -   pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>> +   pte = pte_next_pfn(pte);
>>>      }
>>>   }
>> 
>> Looks like commit 47b8def9358c ("powerpc/mm: Avoid calling
>> arch_enter/leave_lazy_mmu() in set_ptes") changed from doing the simple
>> increment to this more complex approach, but the log doesn't say why.
>
> @Aneesh, was that change on purpose?
>

Because we had a bug with the patch that introduced the change, and that
line was confusing. The right thing would have been to add
pte_next_pfn() to make it clear. It was confusing because not all pte
formats have the pfn at the PAGE_SHIFT offset (even though we did use the
correct PTE_RPN_SHIFT in this specific case). To make it simpler I ended
up switching that line to pte_pfn(pte) + 1.
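
In other words, that open-coded line is equivalent to the helper below
(illustration only; as concluded above, ppc can simply use the generic
pte_next_pfn() rather than carrying its own):

static inline pte_t pte_next_pfn(pte_t pte)
{
	/* bump the pfn by one while keeping the protection bits */
	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
}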

-aneesh


Re: [RFC PATCH 5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids

2024-01-01 Thread Aneesh Kumar K . V
Michael Ellerman  writes:



>  #ifdef CONFIG_PPC64
>  int boot_cpu_hwid = -1;
> @@ -492,12 +493,26 @@ void __init smp_setup_cpu_maps(void)
>   avail = !of_property_match_string(dn,
>   "enable-method", "spin-table");
>  
> - cpu = assign_threads(cpu, nthreads, avail, intserv);
> + if (boot_core_hwid >= 0) {
> + if (cpu == 0) {
> + pr_info("Skipping CPU node %pOF to allow for 
> boot core.\n", dn);
> + cpu = nthreads;
> + continue;
> + }
>  
> - if (cpu >= nr_cpu_ids) {
> + if (be32_to_cpu(intserv[0]) == boot_core_hwid) {
> + pr_info("Renumbered boot core %pOF to logical 
> 0\n", dn);
> + assign_threads(0, nthreads, avail, intserv);
> + of_node_put(dn);
> + break;
>

I was expecting a 'continue' here. Why 'break' out of the loop? The
condition that should break the loop is cpu >= nr_cpu_ids.


> + }
> + } else if (cpu >= nr_cpu_ids) {
>   of_node_put(dn);
>   break;
>   }
> +
> + if (cpu < nr_cpu_ids)
> + cpu = assign_threads(cpu, nthreads, avail, intserv);
>   }
>  
>   /* If no SMT supported, nthreads is forced to 1 */
> -- 
> 2.43.0

-aneesh


Re: [RFC PATCH 4/5] powerpc/smp: Factor out assign_threads()

2024-01-01 Thread Aneesh Kumar K . V
Michael Ellerman  writes:


  
> +static int assign_threads(unsigned cpu, unsigned int nthreads, bool avail,
>

Maybe rename 'avail' to 'present'?

> +  const __be32 *hw_ids)
> +{
> + for (int i = 0; i < nthreads && cpu < nr_cpu_ids; i++) {
> + __be32 hwid;
> +
> + hwid = be32_to_cpu(hw_ids[i]);
> +
> + DBG("thread %d -> cpu %d (hard id %d)\n", i, cpu, hwid);
> +
> + set_cpu_present(cpu, avail);
> + set_cpu_possible(cpu, true);
> + cpu_to_phys_id[cpu] = hwid;
> + cpu++;
> + }
> +

-aneesh


Re: [PATCH v4] powerpc/pseries/vas: Use usleep_range() to support HCALL delay

2023-12-31 Thread Aneesh Kumar K . V
Haren Myneni  writes:

> VAS allocate, modify and deallocate HCALLs returns
> H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy
> delay and expects OS to reissue HCALL after that delay. But using
> msleep() will often sleep at least 20 msecs even though the
> hypervisor suggests OS reissue these HCALLs after 1 or 10msecs.
>
> The open and close VAS window functions hold mutex and then issue
> these HCALLs. So these operations can take longer than the
> necessary when multiple threads issue open or close window APIs
> simultaneously, especially might affect the performance in the
> case of repeat open/close APIs for each compression request.
> On the large machine configuration which allows more simultaneous
> open/close windows (Ex: 240 cores provides 4800 VAS credits), the
> user can observe hung task traces in dmesg due to mutex contention
> around open/close HCAlls.
>
> So instead of msleep(), use usleep_range() to ensure sleep with
> the expected value before issuing HCALL again.
>
> Signed-off-by: Haren Myneni 
> Suggested-by: Nathan Lynch 
>
> ---
> v1 -> v2:
> - Use usleep_range instead of using RTAS sleep routine as
>   suggested by Nathan
> v2 -> v3:
> - Sleep 10MSecs even for HCALL delay > 10MSecs and the other
>   commit / comemnt changes as suggested by Nathan and Ellerman.
> v4 -> v3:
> - More description in the commit log with the visible impact for
>   the current code as suggested by Aneesh
> ---
>  arch/powerpc/platforms/pseries/vas.c | 25 -
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/vas.c 
> b/arch/powerpc/platforms/pseries/vas.c
> index 71d52a670d95..5cf81c564d4b 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -38,7 +38,30 @@ static long hcall_return_busy_check(long rc)
>  {
>   /* Check if we are stalled for some time */
>   if (H_IS_LONG_BUSY(rc)) {
> - msleep(get_longbusy_msecs(rc));
> + unsigned int ms;
> + /*
> +  * Allocate, Modify and Deallocate HCALLs returns
> +  * H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC
> +  * for the long delay. So the sleep time should always
> +  * be either 1 or 10msecs, but in case if the HCALL
> +  * returns the long delay > 10 msecs, clamp the sleep
> +  * time to 10msecs.
> +  */
> + ms = clamp(get_longbusy_msecs(rc), 1, 10);
> +
> + /*
> +  * msleep() will often sleep at least 20 msecs even
> +  * though the hypervisor suggests that the OS reissue
> +  * HCALLs after 1 or 10msecs. Also the delay hint from
> +  * the HCALL is just a suggestion. So OK to pause for
> +  * less time than the hinted delay. Use usleep_range()
> +  * to ensure we don't sleep much longer than actually
> +  * needed.
> +  *
> +  * See Documentation/timers/timers-howto.rst for
> +  * explanation of the range used here.
> +  */
> + usleep_range(ms * 100, ms * 1000);
>

Are there more details on this range? (ms * 100, ms * 1000)

Can we use USEC_PER_MSEC instead of 1000?



>   rc = H_BUSY;
>   } else if (rc == H_BUSY) {
>   cond_resched();


It would be good to convert this to a helper and switch rtas_busy_delay()
to use the new helper. One question though is w.r.t. the clamp values:
do they need to be specific to each hcall, or can we make them generic?

rtas_busy_delay() explicitly checks for 20 msecs. Any reason to do that?
timers-howto.rst suggests using msleep() for delays > 10 msecs.

if (ms <= 20)
usleep_range(ms * 100, ms * 1000);
else
msleep(ms);
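
i.e. something along these lines (name, placement and exact bounds are
illustrative only; assumes the usual linux/delay.h helpers):

/* hypothetical shared helper for hypervisor-suggested busy delays */
static void hcall_busy_delay(unsigned int ms)
{
	/* short hints are better served by usleep_range(), longer ones by msleep() */
	if (ms <= 20)
		usleep_range(ms * 100, ms * USEC_PER_MSEC);
	else
		msleep(ms);
}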


Re: [PATCH v1 2/2] powerpc/debug: hook to user return notifier infrastructure

2023-12-18 Thread Aneesh Kumar K . V
Luming Yu  writes:

> Before we have powerpc to use the generic entry infrastructure,
> the call to fire user return notifier is made temporarily in powerpc
> entry code.
>

It is still not clear what will be registered as a user return notifier.
Can you summarize that here?

>
> Signed-off-by: Luming Yu 
> ---
>  arch/powerpc/kernel/interrupt.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index c4f6d3c69ba9..7fe704946e96 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #if defined(CONFIG_PPC_ADV_DEBUG_REGS) && defined(CONFIG_PPC32)
>  unsigned long global_dbcr0[NR_CPUS];
> @@ -245,6 +246,8 @@ interrupt_exit_user_prepare_main(unsigned long ret, 
> struct pt_regs *regs)
>   /* Restore user access locks last */
>   kuap_user_restore(regs);
>  
> + arch_exit_to_user_mode_prepare(regs, ti_flags);
> +
>

That will run the notifier with user AMR/IAMR values. 

>   return ret;
>  }
>  
> -- 
> 2.42.0.windows.2


Re: [PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST

2023-12-17 Thread Aneesh Kumar K . V
Vaibhav Jain  writes:

> Hi Aneesh,
>
> "Aneesh Kumar K.V"  writes:
>
> 
>>> Yes, Agreed and thats a nice suggestion. However ATM the hypervisor 
>>> supporting Nestedv2 doesnt have support for this hcall. In future
>>> once we have support for this hcall for nestedv2 from the hypervisor
>>> we can replace this branch with a firmware_has_feature() test.
>>> 
>>
>> What I am suggesting is we convert that conditional to firmware_has_feature 
>> so that
>> later when hypervisor supports this hcall all older kernel can make
>> use of the copy_tofrom_guest without any code change.
>
> AFAIK for firmware_has_feature to work we either need:
> - A way to call this hcall with some invalid args. However lpid/pid for
> guest arent allocated during boot.
>
> - A way for hypervisor to advertise support for this hcall before the L1
> kernel boots.
>
> ATM L0 dosent support for any of these two ways. I can do a follow up
> patch later when we have a clarity on how we want to advertise support
> for this hcall. For now current kernel supporting nestedv2 wont be
> using this hcall assuming its not supported. Future kernels can use one
> of the two ways above to set the firmware_has_feature flag to take
> advantage of this hcall.
>

We can use the second option and have L0 publish the firmware feature
when it adds the new hcall. The good part about this is that all
existing L1 kernels will automatically use the new hcall. Something
like:

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 69ae9cf57d50..0ef97b56f999 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -57,6 +57,7 @@
 #define FW_FEATURE_ENERGY_SCALE_INFO ASM_CONST(0x0400)
 #define FW_FEATURE_WATCHDOG    ASM_CONST(0x0800)
 #define FW_FEATURE_PLPKS   ASM_CONST(0x1000)
+#define FW_FEATURE_H_COPY_TOFROM_GUEST ASM_CONST(0x2000)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 282d1b54b073..8fc598b4767a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -39,6 +39,9 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
unsigned long quadrant, ret = n;
bool is_load = !!to;
 
+   if (!firmware_has_feature(FW_FEATURE_H_COPY_TOFROM_GUEST))
+   return H_UNSUPPORTED;
+
/* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
if (kvmhv_on_pseries())
return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 18447e5fa17d..d49b5c52e7b8 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -69,6 +69,8 @@ hypertas_fw_features_table[] = {
{FW_FEATURE_ENERGY_SCALE_INFO,  "hcall-energy-scale-info"},
{FW_FEATURE_WATCHDOG,   "hcall-watchdog"},
{FW_FEATURE_PLPKS,  "hcall-pks"},
+   {FW_FEATURE_H_COPY_TOFROM_GUEST,
+   "hcall-h-copy_tofrom-guest"},
 };
 
 /* Build up the firmware features bitmask using the contents of



Re: [PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest

2023-12-15 Thread Aneesh Kumar K . V
Vaibhav Jain  writes:

> Hi Aneesh,
>
> Thanks for looking into this patch. My responses inline below:
>
> "Aneesh Kumar K.V (IBM)"  writes:
>
>> Vaibhav Jain  writes:
>>
>>> From: Jordan Niethe 
>>>
>>> An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not
>>> already been done. This is a slow operation that means H_GUEST_DELETE
>>> must return H_BUSY multiple times before completing. Invalidating the
>>> tables before deleting the guest so there is less work for the L0 to do.
>>>
>>> Signed-off-by: Jordan Niethe 
>>> ---
>>>  arch/powerpc/include/asm/kvm_book3s.h | 1 +
>>>  arch/powerpc/kvm/book3s_hv.c  | 6 --
>>>  arch/powerpc/kvm/book3s_hv_nested.c   | 2 +-
>>>  3 files changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
>>> b/arch/powerpc/include/asm/kvm_book3s.h
>>> index 4f527d09c92b..a37736ed3728 100644
>>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>>> @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void);
>>>  void kvmhv_vm_nested_init(struct kvm *kvm);
>>>  long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
>>>  long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
>>> +void kvmhv_flush_lpid(u64 lpid);
>>>  void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
>>>  void kvmhv_release_all_nested(struct kvm *kvm);
>>>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 1ed6ec140701..5543e8490cd9 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm 
>>> *kvm)
>>> kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
>>> }
>>>  
>>> -   if (kvmhv_is_nestedv2())
>>> +   if (kvmhv_is_nestedv2()) {
>>> +   kvmhv_flush_lpid(kvm->arch.lpid);
>>> plpar_guest_delete(0, kvm->arch.lpid);
>>>
>>
>> I am not sure I follow the optimization here. I would expect the
>> hypervisor to kill all the translation caches as part of guest_delete.
>> What is the benefit of doing a lpid flush outside the guest delete?
>>
> Thats right. However without this optimization the H_GUEST_DELETE hcall
> in plpar_guest_delete() returns H_BUSY multiple times resulting in
> multiple hcalls to the hypervisor until it finishes. Flushing the guest
> translation cache upfront reduces the number of HCALLs L1 guests has to
> make to delete a L2 guest via H_GUEST_DELETE.
>

Can we add that as a comment above the kvmhv_flush_lpid() call?
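
e.g. something like (comment wording illustrative only):

	if (kvmhv_is_nestedv2()) {
		/*
		 * Invalidate the L2's RPT up front so the L0 has less work
		 * left to do in H_GUEST_DELETE and returns H_BUSY fewer
		 * times before the delete completes.
		 */
		kvmhv_flush_lpid(kvm->arch.lpid);
		plpar_guest_delete(0, kvm->arch.lpid);
	}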

-aneesh


Re: [PATCH 10/13] powerpc: Define KMSAN metadata address ranges for vmalloc and ioremap

2023-12-15 Thread Aneesh Kumar K . V
Nicholas Miehlbradt  writes:

> Splits the vmalloc region into four. The first quarter is the new
> vmalloc region, the second is used to store shadow metadata and the
> third is used to store origin metadata. The fourth quarter is unused.
>

Do we support KMSAN for both hash and radix? If hash is not supported,
can we then use radix.h for these changes?

> Do the same for the ioremap region.
>
> Module data is stored in the vmalloc region so alias the modules
> metadata addresses to the respective vmalloc metadata addresses. Define
> MODULES_VADDR and MODULES_END to the start and end of the vmalloc
> region.
>
> Since MODULES_VADDR was previously only defined on ppc32 targets checks
> for if this macro is defined need to be updated to include
> defined(CONFIG_PPC32).

-aneesh


Re: [PATCH 09/13] powerpc: Disable KMSAN checks on functions which walk the stack

2023-12-15 Thread Aneesh Kumar K . V
Nicholas Miehlbradt  writes:

> Functions which walk the stack read parts of the stack which cannot be
> instrumented by KMSAN e.g. the backchain. Disable KMSAN sanitization of
> these functions to prevent false positives.
>

Is the annotation needed to avoid uninitialized-access checks when
reading parts of the stack? Can you provide a false-positive example for
the commit message?

-aneesh


Re: [PATCH v5 1/5] powerpc/smp: Enable Asym packing for cores on shared processor

2023-12-14 Thread Aneesh Kumar K . V
Srikar Dronamraju  writes:

> If there are shared processor LPARs, underlying Hypervisor can have more
> virtual cores to handle than actual physical cores.
>
> Starting with Power 9, a big core (aka SMT8 core) has 2 nearly
> independent thread groups. On a shared processors LPARs, it helps to
> pack threads to lesser number of cores so that the overall system
> performance and utilization improves. PowerVM schedules at a big core
> level. Hence packing to fewer cores helps.
>



> +/*
> + * On shared processor LPARs scheduled on a big core (which has two or more
> + * independent thread groups per core), prefer lower numbered CPUs, so
> + * that workload consolidates to lesser number of cores.
> + */
> +static __ro_after_init DEFINE_STATIC_KEY_FALSE(splpar_asym_pack);


DEFINE_STATIC_KEY_FALSE_RO ?
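
i.e. (assuming the generic jump_label helper fits here):

static DEFINE_STATIC_KEY_FALSE_RO(splpar_asym_pack);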

> +
>  /*
>   * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
>   * This topology makes it *much* cheaper to migrate tasks between adjacent 
> cores
> @@ -1011,9 +1018,20 @@ static int powerpc_smt_flags(void)
>   */

-aneesh


Re: [PATCH] powerpc/pseries/iommu: IOMMU table is not initialized for kdump over SR-IOV

2023-12-14 Thread Aneesh Kumar K . V
Gaurav Batra  writes:

> When kdump kernel tries to copy dump data over SR-IOV, LPAR panics due to
> NULL pointer execption.
>
> Here is the complete stack
>
> [   19.944378] Kernel attempted to read user page (0) - exploit attempt? 
> (uid: 0)^M
> [   19.944388] BUG: Kernel NULL pointer dereference on read at 0x^M
> [   19.944394] Faulting instruction address: 0xc00020847ad4^M
> [   19.944400] Oops: Kernel access of bad area, sig: 11 [#1]^M
> [   19.944406] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries^M
> [   19.944413] Modules linked in: mlx5_core(+) vmx_crypto pseries_wdt 
> papr_scm libnvdimm mlxfw tls psample sunrpc fuse overlay squashfs loop^M
> [   19.944435] CPU: 12 PID: 315 Comm: systemd-udevd Not tainted 
> 6.4.0-Test102+ #12^M
> [   19.92] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries^M
> [   19.944450] NIP:  c00020847ad4 LR: c0002083b2dc CTR: 
> 006cd18c^M
> [   19.944457] REGS: c00029162ca0 TRAP: 0300   Not tainted  
> (6.4.0-Test102+)^M
> [   19.944463] MSR:  8280b033   CR: 
> 48288244  XER: 0008^M
> [   19.944480] CFAR: c0002083b2d8 DAR:  DSISR: 4000 
> IRQMASK: 1 ^M
> [   19.944480] GPR00: c0002088ecd4 c00029162f40 c00021542900 
>  ^M
> [   19.944480] GPR04:    
>  ^M
> [   19.944480] GPR08:    
> 0001 ^M
> [   19.944480] GPR12:  c00022ec1000 4000 
>  ^M
> [   19.944480] GPR16: c0002b8b3900 c0002b8b3918 c0002b8b3800 
>  ^M
> [   19.944480] GPR20:    
>  ^M
> [   19.944480] GPR24:  0001  
>  ^M
> [   19.944480] GPR28:    
>  ^M
> [   19.944553] NIP [c00020847ad4] _find_next_zero_bit+0x24/0x110^M
> [   19.944564] LR [c0002083b2dc] 
> bitmap_find_next_zero_area_off+0x5c/0xe0^M
> [   19.944572] Call Trace:^M
> [   19.944576] [c00029162f40] [c000209fec70] 
> dev_printk_emit+0x38/0x48 (unreliable)^M
> [   19.944587] [c00029162fa0] [c0002088ecd4] 
> iommu_area_alloc+0xc4/0x180^M
> [   19.944596] [c00029163020] [c0002005a3d8] 
> iommu_range_alloc+0x1e8/0x580^M
> [   19.944606] [c00029163150] [c0002005a7d0] iommu_alloc+0x60/0x130^M
> [   19.944613] [c000291631a0] [c0002005b9f8] 
> iommu_alloc_coherent+0x158/0x2b0^M
> [   19.944621] [c00029163260] [c00020058fdc] 
> dma_iommu_alloc_coherent+0x3c/0x50^M
> [   19.944629] [c00029163280] [c00020235260] 
> dma_alloc_attrs+0x170/0x1f0^M
> [   19.944637] [c000291632f0] [c0080140c058] mlx5_cmd_init+0xc0/0x760 
> [mlx5_core]^M
> [   19.944745] [c000291633c0] [c00801403448] 
> mlx5_function_setup+0xf0/0x510 [mlx5_core]^M
> [   19.944845] [c00029163490] [c008014039bc] mlx5_init_one+0x84/0x210 
> [mlx5_core]^M
> [   19.944943] [c00029163520] [c00801404c00] probe_one+0x118/0x2c0 
> [mlx5_core]^M
> [   19.945041] [c000291635b0] [c000208ddce8] 
> local_pci_probe+0x68/0x110^M
> [   19.945049] [c00029163630] [c000208deb98] 
> pci_call_probe+0x68/0x200^M
> [   19.945057] [c00029163790] [c000208dfecc] 
> pci_device_probe+0xbc/0x1a0^M
> [   19.945065] [c000291637d0] [c00020a032f4] 
> really_probe+0x104/0x540^M
> [   19.945072] [c00029163850] [c00020a037e4] 
> __driver_probe_device+0xb4/0x230^M
> [   19.945080] [c000291638d0] [c00020a039b4] 
> driver_probe_device+0x54/0x130^M
> [   19.945088] [c00029163910] [c00020a03da8] 
> __driver_attach+0x158/0x2b0^M
> [   19.945096] [c00029163990] [c000209ffa78] 
> bus_for_each_dev+0xa8/0x130^M
> [   19.945103] [c000291639f0] [c00020a026c4] driver_attach+0x34/0x50^M
> [   19.945110] [c00029163a10] [c00020a01aac] 
> bus_add_driver+0x16c/0x300^M
> [   19.945118] [c00029163aa0] [c00020a05944] 
> driver_register+0xa4/0x1b0^M
> [   19.945126] [c00029163b10] [c000208dd428] 
> __pci_register_driver+0x68/0x80^M
> [   19.945133] [c00029163b30] [c008015414cc] mlx5_init+0xb8/0x100 
> [mlx5_core]^M
> [   19.945247] [c00029163ba0] [c00020012f60] 
> do_one_initcall+0x60/0x300^M
> [   19.945255] [c00029163c80] [c00020241cbc] 
> do_init_module+0x7c/0x2b0^M
>
> At the time of LPAR dump, before kexec hands over control to kdump kernel,
> DDWs are scanned and added in the FDT. For the SR-IOV case, default DMA
> window "ibm,dma-window" is removed from the FDT and DDW added, for the
> device.
>
> Now, kexec hands over control to the kdump kernel.
>
> When the kdump kernel initializes, PCI busses are scanned and IOMMU
> group/tables created, in pci_dma_bus_setup_pSeriesLP(). For the SR-IOV

Re: [PATCH v4 0/5] powerpc/smp: Topology and shared processor optimizations

2023-12-13 Thread Aneesh Kumar K . V
Srikar Dronamraju  writes:

> * Srikar Dronamraju  [2023-11-09 11:19:28]:
>
> Hi Michael,
>
>> PowerVM systems configured in shared processors mode have some unique
>> challenges. Some device-tree properties will be missing on a shared
>> processor. Hence some sched domains may not make sense for shared processor
>> systems.
>> 
>> 
>
> Did you get a chance to look at this patchset?
> Do you see this getting pulled into your merge branch?
> I havent seen any comments that requires a change from the current patchset.
>

It would be helpful if you could include the details mentioned in your
reply in the commit message; specifically, provide information about the
over-provisioned config. Also, if you plan to send another update, please
remove the additional changes in the printk_once section.

Reviewed-by: Aneesh Kumar K.V (IBM) 

Thank you.
-aneesh


Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-22 Thread Aneesh Kumar K V
On 11/22/23 4:05 PM, Sourabh Jain wrote:
> Hello Michael,
> 
> 
> On 22/11/23 10:47, Michael Ellerman wrote:
>> Aneesh Kumar K V  writes:
>> ...
>>> I am not sure whether we need to add all the complexity to enable 
>>> supporting different fadump kernel
>>> version. Is that even a possible use case with fadump? Can't we always 
>>> assume that with fadump the
>>> crash kernel and fadump kernel will be same version?
>> How sure are we of that?
>>
>> Don't we go through grub when we boot into the 2nd kernel. And so
>> couldn't it choose to boot a different kernel, for whatever reason.
>>
>> I don't think we need to support different pt_reg / cpumask sizes, but
>> requiring the exact same kernel version is too strict I think.
> Agree.
>>
>> But maybe I'm wrong. Would be good to hear what distro folks think.
> 
> How about checking fadump crash info header compatibility in the following 
> way?
> 
> static bool is_fadump_header_compatible(struct fadump_crash_info_header *fdh)
> {
>     if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>     pr_err("Old magic number, can't process the dump.");
>     return false;
>     }
> 
>     if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>     pr_err("Fadump header is corrupted.");
>     return false;
>     }
> 
>     /*
>      * If the kernel version of the first/crashed kernel and the second/fadump
>      * kernel is not same, then only collect the dump if the size of all
>      * non-primitive type members of the fadump header is the same across 
> kernels.
>      */
>     if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
>     if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != 
> sizeof(struct cpumask)) {
>         pr_err("Fadump header size mismatch.\n")
>         return false;
>     } else
>         pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>     }
> 

Do you also want an fdh->version check? I am still not sure you need the
kernel_version check. IMHO that is too strict and can trip when the
crashing kernel is a kexec-loaded kernel that is not installed in /boot.


>     return true;
> }
> 
> And the new fadump crash info header will be: As suggested by Hari.
> 
> /* fadump crash info structure */
> struct fadump_crash_info_header {
>     u64        magic_number;
> +  u32        version;
>     u32        crashing_cpu;
>     u64        elfcorehdr_addr;
> +  u64        elfcorehdr_size;
> +  u64        vmcoreinfo_raddr;
> +  u64        vmcoreinfo_size;
> +  u8          kernel_version[__NEW_UTS_LEN + 1];
> +  u32        pt_regs_sz;
>     struct pt_regs    regs;
> +  u32        cpu_mask_sz;
>     struct cpumask    cpu_mask;
> };
> 
> Thanks,
> Sourabh Jain



Re: [PATCH] powerpc: Adjust config HOTPLUG_CPU dependency

2023-11-22 Thread Aneesh Kumar K V
On 11/22/23 2:53 PM, Vishal Chourasia wrote:
> Changed HOTPLUG_CPU dependency to SMP and either ARCH_HIBERNATION_POSSIBLE or
> ARCH_SUSPEND_POSSIBLE, aligning with systems' suspend/hibernation 
> capabilities.
> This update links CPU hotplugging more logically with platforms' capabilities.
> 
> Other changes include
> 
> 1. configs ARCH_HIBERNATION_POSSIBLE and ARCH_SUSPEND_POSSIBLE are now 
> selected
>only if required platform dependencies are met.
> 
> 2. Defaults for configs ARCH_HIBERNATION_POSSIBLE and
>ARCH_SUSPEND_POSSIBLE has been set to 'N'
> 
> Reported-by: Srikar Dronamraju 
> Suggested-by: Aneesh Kumar K V 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Vishal Chourasia 
> 
> v1: https://lore.kernel.org/all/20231114082046.6018-1-vish...@linux.ibm.com
> ---
> During the configuration process with 'make randconfig' followed by
> 'make olddefconfig', we observed a warning indicating an unmet direct
> dependency for the HOTPLUG_CPU option. The dependency in question relates to
> various PowerPC configurations (PPC_PSERIES, PPC_PMAC, PPC_POWERNV,
> FSL_SOC_BOOKE) which were not enabled, yet the HOTPLUG_CPU was being
> erroneously selected due to an implicit assumption by the PM_SLEEP_SMP option.
> This misalignment in dependencies could potentially lead to inconsistent 
> kernel
> configuration states, especially when considering the necessary hardware
> support for CPU hot-plugging on PowerPC platforms. The patch aims to correct
> this by ensuring that ARCH_HIBERNATION_POSSIBLE is contingent upon the
> appropriate PowerPC configurations being active.
> 
> steps to reproduce (before applying the patch):
> 
> Run 'make pseries_le_defconfig'
> Run 'make menuconfig'
> Enable hibernation [ Kernel options -> Hibernation (aka 'suspend to disk') ] 
> Disable [ Platform support -> IBM PowerNV (Non-Virtualized) platform support ]
> Disable [ Platform support -> IBM pSeries & new (POWER5-based) iSeries ]
> Enable SMP [ Processor support -> Symmetric multi-processing support ]
> Save the config
> Run 'make olddefconfig'
> 
>  arch/powerpc/Kconfig | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 6f105ee4f3cf..6e7e9af2f0c1 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -155,6 +155,8 @@ config PPC
>   select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
>   select ARCH_HAS_UACCESS_FLUSHCACHE
>   select ARCH_HAS_UBSAN_SANITIZE_ALL
> + select ARCH_HIBERNATION_POSSIBLEif (PPC_PSERIES || PPC_PMAC || 
> PPC_POWERNV || FSL_SOC_BOOKE)
> + select ARCH_SUSPEND_POSSIBLEif (ADB_PMU || PPC_EFIKA || 
> PPC_LITE5200 || PPC_83xx || (PPC_85xx && !PPC_E500MC) || PPC_86xx || 
> PPC_PSERIES || 44x || 40x)

We should keep that list sorted, as explained in the comment around it.

>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select ARCH_KEEP_MEMBLOCK
>   select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU
> @@ -380,14 +382,10 @@ config DEFAULT_UIMAGE
> Used to allow a board to specify it wants a uImage built by default
>  
>  config ARCH_HIBERNATION_POSSIBLE
> - bool
> - default y
> + def_bool n


We should be able to keep this 
config ARCH_HIBERNATION_POSSIBLE
bool


That is how we have the rest of the ARCH_* configs. I am not sure we need to
explicitly say "def_bool n".

>  
>  config ARCH_SUSPEND_POSSIBLE
> - def_bool y
> - depends on ADB_PMU || PPC_EFIKA || PPC_LITE5200 || PPC_83xx || \
> -(PPC_85xx && !PPC_E500MC) || PPC_86xx || PPC_PSERIES \
> -|| 44x || 40x
> + def_bool n
>  
>  config ARCH_SUSPEND_NONZERO_CPU
>   def_bool y
> @@ -568,8 +566,7 @@ config ARCH_USING_PATCHABLE_FUNCTION_ENTRY
>  
>  config HOTPLUG_CPU
>   bool "Support for enabling/disabling CPUs"
> - depends on SMP && (PPC_PSERIES || \
> - PPC_PMAC || PPC_POWERNV || FSL_SOC_BOOKE)
> + depends on SMP && (ARCH_HIBERNATION_POSSIBLE || ARCH_SUSPEND_POSSIBLE)
>   help
> Say Y here to be able to disable and re-enable individual
> CPUs at runtime on SMP machines.



Re: [PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

2023-11-16 Thread Aneesh Kumar K V
On 11/17/23 10:03 AM, Sourabh Jain wrote:
> Hi Aneesh,
> 
> Thanks for reviewing the patch.
> 
> On 15/11/23 10:14, Aneesh Kumar K.V wrote:
>> Sourabh Jain  writes:
>>
>> 
>>
>>> diff --git a/arch/powerpc/include/asm/fadump-internal.h 
>>> b/arch/powerpc/include/asm/fadump-internal.h
>>> index 27f9e11eda28..7be3d8894520 100644
>>> --- a/arch/powerpc/include/asm/fadump-internal.h
>>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>>     #define FADUMP_CPU_UNKNOWN    (~((u32)0))
>>>   -#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPINF")
>>> +/*
>>> + * The introduction of new fields in the fadump crash info header has
>>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>>> + * This alteration ensures backward compatibility, enabling the kernel
>>> + * with the updated fadump crash info to handle kernel dumps from older
>>> + * kernels.
>>> + *
>>> + * To prevent the need for further changes to the magic number in the
>>> + * event of future modifications to the fadump header, a version field
>>> + * has been introduced to track the fadump crash info header version.
>>> + *
>>> + * Historically, there was no connection between the magic number and
>>> + * the fadump crash info header version. However, moving forward, the
>>> + * `FADMPINF` magic number in header will be treated as version 0, while
>>> + * the `FADMPSIG` magic number in header will include a version field to
>>> + * determine its version.
>>> + */
>>> +#define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPSIG")
>>> +#define FADUMP_VERSION    1
>>>
>> Can we keep the old magic details as
>>
>> #define FADUMP_CRASH_INFO_MAGIC_OLD    fadump_str_to_u64("FADMPINF")
>> #define FADUMP_CRASH_INFO_MAGIC    fadump_str_to_u64("FADMPSIG")
> 
> Sure.
> 
>> Also considering the struct need not be backward compatible, can we just
>> do
>>
>> struct fadump_crash_info_header {
>> u64    magic_number;
>> u32    crashing_cpu;
>> u64    elfcorehdr_addr;
>> u64    elfcorehdr_size;
>> u64    vmcoreinfo_raddr;
>> u64    vmcoreinfo_size;
>> struct pt_regs    regs;
>> struct cpumask    cpu_mask;
>> };
>> static inline bool fadump_compatible(struct fadump_crash_info_header
>> *fdh)
>> {
>> return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
>> }
>>
>> and fail fadump if we find it not compatible?
> 
> Agree that it is unsafe to collect a dump with an incompatible fadump crash 
> info header.
> 
> Given that I am updating the fadump crash info header, we can make a few 
> arrangements
> like adding a size filed for the dynamic size attribute like pt_regs and 
> cpumask to ensure
> better compatibility in the future.
> 
> Additionally, let's introduce a version field to the fadump crash info header 
> to avoid changing
> the magic number in the future.
>

I am not sure whether we need to add all the complexity to support a
different fadump kernel version. Is that even a possible use case with
fadump? Can't we always assume that with fadump the crash kernel and the
fadump kernel will be the same version? If yes, we can simply fail on a
magic number mismatch because that indicates a user config error.

-aneesh



Re: [PATCH] powerpc: Restrict ARCH_HIBERNATION_POSSIBLE to supported configurations

2023-11-15 Thread Aneesh Kumar K V
On 11/15/23 5:23 PM, Vishal Chourasia wrote:
> 
> On 15/11/23 1:39 pm, Aneesh Kumar K.V wrote:
>> Vishal Chourasia  writes:
>>
>>> This patch modifies the ARCH_HIBERNATION_POSSIBLE option to ensure that it
>>> correctly depends on these PowerPC configurations being enabled. As a 
>>> result,
>>> it prevents the HOTPLUG_CPU from being selected when the required 
>>> dependencies
>>> are not satisfied.
>>>
>>> This change aligns the dependency tree with the expected hardware support 
>>> for
>>> CPU hot-plugging under PowerPC architectures, ensuring that the kernel
>>> configuration steps do not lead to inconsistent states.
>>>
>>> Signed-off-by: Vishal Chourasia 
>>> ---
>>> During the configuration process with 'make randconfig' followed by
>>> 'make olddefconfig', we observed a warning indicating an unmet direct
>>> dependency for the HOTPLUG_CPU option. The dependency in question relates to
>>> various PowerPC configurations (PPC_PSERIES, PPC_PMAC, PPC_POWERNV,
>>> FSL_SOC_BOOKE) which were not enabled, yet the HOTPLUG_CPU was being
>>> erroneously selected due to an implicit assumption by the PM_SLEEP_SMP 
>>> option.
>>> This misalignment in dependencies could potentially lead to inconsistent 
>>> kernel
>>> configuration states, especially when considering the necessary hardware
>>> support for CPU hot-plugging on PowerPC platforms. The patch aims to correct
>>> this by ensuring that ARCH_HIBERNATION_POSSIBLE is contingent upon the
>>> appropriate PowerPC configurations being active.
>>>
>>> steps to reproduce (before applying the patch):
>>>
>>> Run 'make pseries_le_defconfig'
>>> Run 'make menuconfig'
>>> Enable hibernation [ Kernel options -> Hibernation (aka 'suspend to disk') 
>>> ] 
>>> Disable [ Platform support -> IBM PowerNV (Non-Virtualized) platform 
>>> support ]
>>> Disable [ Platform support -> IBM pSeries & new (POWER5-based) iSeries ]
>>> Enable SMP [ Processor support -> Symmetric multi-processing support ]
>>> Save the config
>>> Run 'make olddefconfig'
>>>
>>>  arch/powerpc/Kconfig | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>>> index 6f105ee4f3cf..bf99ff9869f6 100644
>>> --- a/arch/powerpc/Kconfig
>>> +++ b/arch/powerpc/Kconfig
>>> @@ -380,8 +380,9 @@ config DEFAULT_UIMAGE
>>>   Used to allow a board to specify it wants a uImage built by default
>>>  
>>>  config ARCH_HIBERNATION_POSSIBLE
>>> -   bool
>>> -   default y
>>> +   def_bool y
>>> +   depends on PPC_PSERIES || \
>>> +   PPC_PMAC || PPC_POWERNV || FSL_SOC_BOOKE
>>>  
>>>  config ARCH_SUSPEND_POSSIBLE
>>> def_bool y
>>>
>> I am wondering whether it should be switched to using select from
>> config PPC? 
> 
> selecting ARCH_HIBERNATION_POSSIBLE based on value of config PPC
> will not guarantee config PPC_PSERIES being set
> 
> PPC_PSERIES can be set to N, even when config PPC is set.
> 
> grep -A 5 -i "config ppc_pseries" arch/powerpc/platforms/pseries/Kconfig
> config PPC_PSERIES
>     depends on PPC64 && PPC_BOOK3S
>     bool "IBM pSeries & new (POWER5-based) iSeries"
>     select HAVE_PCSPKR_PLATFORM
>     select MPIC
>     select OF_DYNAMIC
> 

modified   arch/powerpc/Kconfig
@@ -156,6 +156,7 @@ config PPC
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+   select ARCH_HIBERNATION_POSSIBLE   if (PPC_PSERIES || PPC_PMAC || PPC_POWERNV || FSL_SOC_BOOKE)
select ARCH_KEEP_MEMBLOCK
select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU
select ARCH_MIGHT_HAVE_PC_PARPORT


-aneesh


Re: [PATCH] powerpc/sched: Cleanup vcpu_is_preempted()

2023-11-14 Thread Aneesh Kumar K V
On 11/14/23 3:16 PM, Srikar Dronamraju wrote:
> * Aneesh Kumar K.V  [2023-11-14 12:42:19]:
> 
>> No functional change in this patch. A helper is added to find if
>> vcpu is dispatched by hypervisor. Use that instead of opencoding.
>> Also clarify some of the comments.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/include/asm/paravirt.h | 33 ++---
>>  1 file changed, 25 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/paravirt.h 
>> b/arch/powerpc/include/asm/paravirt.h
>> index ac4279208d63..b78b82d66057 100644
>> --- a/arch/powerpc/include/asm/paravirt.h
>> +++ b/arch/powerpc/include/asm/paravirt.h
>> @@ -76,6 +76,17 @@ static inline bool is_vcpu_idle(int vcpu)
>>  {
>>  return lppaca_of(vcpu).idle;
>>  }
>> +
>> +static inline bool vcpu_is_dispatched(int vcpu)
>> +{
>> +/*
>> + * This is the yield_count.  An "odd" value (low bit on) means that
>> + * the processor is yielded (either because of an OS yield or a
>> + * hypervisor preempt).  An even value implies that the processor is
>> + * currently executing.
>> + */
>> +return (!(yield_count_of(vcpu) & 1));
>> +}
>>  #else
>>  static inline bool is_shared_processor(void)
>>  {
>> @@ -109,6 +120,10 @@ static inline bool is_vcpu_idle(int vcpu)
>>  {
>>  return false;
>>  }
>> +static inline bool vcpu_is_dispatched(int vcpu)
>> +{
>> +return true;
>> +}
>>  #endif
> 
> If we are introducing vcpu_is_dispatched, we should remove 
> yield_count_of() and use vcpu_is_dispatched everwhere
> 
> No point in having yield_count_of() and vcpu_is_dispatched, since
> yield_count_of() is only used to check if we are running in OS or not.
> 

We do

yield_count = yield_count_of(owner);
yield_to_preempted(owner, yield_count);

-aneesh


Re: [PATCH] powerpc/sched: Cleanup vcpu_is_preempted()

2023-11-14 Thread Aneesh Kumar K V
On 11/14/23 2:53 PM, Shrikanth Hegde wrote:
> 
> 
> On 11/14/23 12:42 PM, Aneesh Kumar K.V wrote:
>> No functional change in this patch. A helper is added to find if
>> vcpu is dispatched by hypervisor. Use that instead of opencoding.
>> Also clarify some of the comments.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/include/asm/paravirt.h | 33 ++---
>>  1 file changed, 25 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/paravirt.h 
>> b/arch/powerpc/include/asm/paravirt.h
>> index ac4279208d63..b78b82d66057 100644
>> --- a/arch/powerpc/include/asm/paravirt.h
>> +++ b/arch/powerpc/include/asm/paravirt.h
>> @@ -76,6 +76,17 @@ static inline bool is_vcpu_idle(int vcpu)
>>  {
>>  return lppaca_of(vcpu).idle;
>>  }
>> +
>> +static inline bool vcpu_is_dispatched(int vcpu)
>> +{
>> +/*
>> + * This is the yield_count.  An "odd" value (low bit on) means that
>> + * the processor is yielded (either because of an OS yield or a
>> + * hypervisor preempt).  An even value implies that the processor is
>> + * currently executing.
>> + */
>> +return (!(yield_count_of(vcpu) & 1));
>> +}
>>  #else
>>  static inline bool is_shared_processor(void)
>>  {
>> @@ -109,6 +120,10 @@ static inline bool is_vcpu_idle(int vcpu)
>>  {
>>  return false;
>>  }
>> +static inline bool vcpu_is_dispatched(int vcpu)
>> +{
>> +return true;
>> +}
>>  #endif
>>
> 
> Similar code can be changed in lib/qspinlock.c and lib/locks.c as well. 

I avoided doing that because those call sites use the fetched yield_count value:

yield_to_preempted(owner, yield_count);


and the check already comes with a comment:

if ((yield_count & 1) == 0)
goto relax; /* owner vcpu is running */


-aneesh



Re: [RFC PATCH] powerpc/book3s/hash: Drop _PAGE_PRIVILEGED from PAGE_NONE

2023-11-13 Thread Aneesh Kumar K V
On 11/13/23 5:17 PM, Nicholas Piggin wrote:
> On Mon Nov 13, 2023 at 8:45 PM AEST, Aneesh Kumar K V wrote:



>>>> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
>>>> b/arch/powerpc/mm/book3s64/hash_utils.c
>>>> index ad2afa08e62e..b2eda22195f0 100644
>>>> --- a/arch/powerpc/mm/book3s64/hash_utils.c
>>>> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
>>>> @@ -310,9 +310,26 @@ unsigned long htab_convert_pte_flags(unsigned long 
>>>> pteflags, unsigned long flags
>>>>else
>>>>rflags |= 0x3;
>>>>}
>>>> +  WARN_ON(!(pteflags & _PAGE_RWX));
>>>>} else {
>>>>if (pteflags & _PAGE_RWX)
>>>>rflags |= 0x2;
>>>> +  else {
>>>> +  /*
>>>> +   * PAGE_NONE will get mapped to 0b110 (slb key 1 no 
>>>> access)
>>>> +   * We picked 0b110 instead of 0b000 so that slb key 0 
>>>> will
>>>> +   * get only read only access for the same rflags.
>>>> +   */
>>>> +  if (mmu_has_feature(MMU_FTR_KERNEL_RO))
>>>> +  rflags |= (HPTE_R_PP0 | 0x2);
>>>> +  /*
>>>> +   * rflags = HPTE_R_N
>>>> +   * Without KERNEL_RO feature this will result in slb
>>>> +   * key 0 with read/write. But ISA only supports that.
>>>> +   * There is no key 1 no-access and key 0 read-only
>>>> +   * pp bit support.
>>>> +   */
>>>> +  }
>>>>if (!((pteflags & _PAGE_WRITE) && (pteflags & _PAGE_DIRTY)))
>>>>rflags |= 0x1;
>>>>}
>>>
>>
>> V2 is also dropping the above change, because we will never have hash table 
>> entries inserted. 
>>
>> This is added to commit message.
>>
>> Hash fault handling code did get some WARN_ON added because those
>> functions are not expected to get called with _PAGE_READ cleared.
>> commit 18061c17c8ec ("powerpc/mm: Update PROTFAULT handling in the page 
>> fault path")
>> explains the details.
> 
> Should it be a WARN_ON_ONCE? Any useful way to recover from it? Could
> the added WARN come with some comments from that commit that explain
> it?
> 

This should never happen unless someone messed up check_pte_access(). The
WARN_ON() is a way to remind me not to add a no-access PP mapping in
htab_convert_pte_flags() as I did above. Initially I was planning to add only a
comment, but in the rare case that we get it wrong it is nice to get a console
log.

-aneesh



Re: [RFC PATCH] powerpc/book3s/hash: Drop _PAGE_PRIVILEGED from PAGE_NONE

2023-11-13 Thread Aneesh Kumar K V
On 11/13/23 3:46 PM, Nicholas Piggin wrote:
> On Thu Nov 2, 2023 at 11:23 PM AEST, Aneesh Kumar K.V wrote:
>> There used to be a dependency on _PAGE_PRIVILEGED with pte_savedwrite.
>> But that got dropped by
>> commit 6a56ccbcf6c6 ("mm/autonuma: use can_change_(pte|pmd)_writable() to 
>> replace savedwrite")
>>
>> With this change numa fault pte (pte_protnone()) gets mapped as regular user 
>> pte
>> with RWX cleared (no-access).
> 
> You mean "that" above change (not *this* change), right?
> 

With the change in this patch, a NUMA fault pte will not have _PAGE_PRIVILEGED
set because PAGE_NONE now maps to just _PAGE_BASE.


>> This also remove pte_user() from
>> book3s/64.
> 
> Nice cleanup. That was an annoying hack.
> 
>> pte_access_permitted() now checks for _PAGE_EXEC because we now support
>> EXECONLY mappings.
> 
> AFAIKS pte_exec() is not required, GUP is really only for read or
> write access. It should be a separate patch if you think it's needed.
> 

I have a v2 dropping that, based on
https://lore.kernel.org/linux-mm/87bkc1oe8c@linux.ibm.com
I kept pte_user() with pte_access_permitted() as the only user. I can open-code
that if needed.

>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h | 23 +---
>>  arch/powerpc/mm/book3s64/hash_utils.c| 17 +++
>>  2 files changed, 23 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index cb77eddca54b..7c7de7b56df0 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -17,12 +17,6 @@
>>  #define _PAGE_EXEC  0x1 /* execute permission */
>>  #define _PAGE_WRITE 0x2 /* write access allowed */
>>  #define _PAGE_READ  0x4 /* read access allowed */
>> -#define _PAGE_NA_PAGE_PRIVILEGED
>> -#define _PAGE_NAX   _PAGE_EXEC
>> -#define _PAGE_RO_PAGE_READ
>> -#define _PAGE_ROX   (_PAGE_READ | _PAGE_EXEC)
>> -#define _PAGE_RW(_PAGE_READ | _PAGE_WRITE)
>> -#define _PAGE_RWX   (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)
>>  #define _PAGE_PRIVILEGED0x8 /* kernel access only */
>>  #define _PAGE_SAO   0x00010 /* Strong access order */
>>  #define _PAGE_NON_IDEMPOTENT0x00020 /* non idempotent memory */
> 
> Did you leave PAGE_NONE as _PAGE_BASE | _PAGE_PRIVILEGED below?
> Shouldn't that be changed too? Then this patch is not only hash
> but also radix.
> 

A recent patch moved PAGE_NONE to pgtable-masks.h:
a5a08dc90f4513d1a78582ec24b687fad01cc843

> Why is the hash change required? Previously PAGE_NONE relied on
> privileged bit to prevent access, now you need to handle a PTE
> without that bit? In that case could that be patch 1, then the
> rest patch 2?
> 

Looking at older kernels, I guess check_pte_access() used _PAGE_PRIVILEGED as a
way to prevent access to PAGE_NONE ptes. We now depend on _PAGE_READ.

> __pte_flags_need_flush() should be updated after this too,
> basically revert commit 1abce0580b894.
> 

Will update the patch to include the revert.

>> @@ -119,9 +113,9 @@
>>  /*
>>   * user access blocked by key
>>   */
>> -#define _PAGE_KERNEL_RW (_PAGE_PRIVILEGED | _PAGE_RW | 
>> _PAGE_DIRTY)
>>  #define _PAGE_KERNEL_RO  (_PAGE_PRIVILEGED | _PAGE_READ)
>>  #define _PAGE_KERNEL_ROX (_PAGE_PRIVILEGED | _PAGE_READ | _PAGE_EXEC)
>> +#define _PAGE_KERNEL_RW (_PAGE_PRIVILEGED | _PAGE_RW | 
>> _PAGE_DIRTY)
>>  #define _PAGE_KERNEL_RWX(_PAGE_PRIVILEGED | _PAGE_DIRTY | _PAGE_RW | 
>> _PAGE_EXEC)
>>  /*
>>   * _PAGE_CHG_MASK masks of bits that are to be preserved across
> 
> No need to reorder defines.
> 

ok


> Thanks,
> Nick
> 
>> @@ -523,19 +517,14 @@ static inline bool arch_pte_access_permitted(u64 pte, 
>> bool write, bool execute)
>>  }
>>  #endif /* CONFIG_PPC_MEM_KEYS */
>>  
>> -static inline bool pte_user(pte_t pte)
>> -{
>> -return !(pte_raw(pte) & cpu_to_be64(_PAGE_PRIVILEGED));
>> -}
>> -
>>  #define pte_access_permitted pte_access_permitted
>>  static inline bool pte_access_permitted(pte_t pte, bool write)
>>  {
>> -/*
>> - * _PAGE_READ is needed for any access and will be
>> - * cleared for PROT_NONE
>> - */
>> -if (!pte_present(pte) || !pte_user(pte) || !pte_read(pte))
>> +
>> +if (!pte_present(pte))
>> +return false;
>> +
>> +if (!(pte_read(pte) || pte_exec(pte)))
>>  return false;
>>  
>>  if (write && !pte_write(pte))
>> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
>> b/arch/powerpc/mm/book3s64/hash_utils.c
>> index ad2afa08e62e..b2eda22195f0 100644
>> --- a/arch/powerpc/mm/book3s64/hash_utils.c
>> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
>> @@ -310,9 +310,26 @@ unsigned long htab_convert_pte_flags(unsigned long 
>> pteflags, unsigned long flags
>>  else
>>

Re: get_user_pages() and EXEC_ONLY mapping.

2023-11-10 Thread Aneesh Kumar K V
On 11/10/23 8:23 PM, Jason Gunthorpe wrote:
> On Fri, Nov 10, 2023 at 08:19:23PM +0530, Aneesh Kumar K.V wrote:
>>
>> Hello,
>>
>> Some architectures can now support EXEC_ONLY mappings and I am wondering
>> what get_user_pages() on those addresses should return. 
> 
> -EPERM
> 
>> Earlier PROT_EXEC implied PROT_READ and pte_access_permitted()
>> returned true for that. But arm64 does have this explicit comment
>> that says
>>
>>  /*
>>  * p??_access_permitted() is true for valid user mappings (PTE_USER
>>  * bit set, subject to the write permission check). For execute-only
>>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
>>  * not set) must return false. PROT_NONE mappings do not have the
>>  * PTE_VALID bit set.
>>  */
>>
>> Is that correct? We should be able to get struct page for PROT_EXEC
>> mappings?
> 
> If the memory is unreadable then providing a back door through
> O_DIRECT and everthing else to read it sounds wrong to me.
> 
> If there is some case where a get_user_pages caller is exec-only
> compatible then a new FOLL_EXEC flag to permit it would make sense.
> 

I was expecting pin_user_pages() to return -EPERM and get_user_pages() to
return the struct page. This was based on Documentation/core-api/pin_user_pages.rst:
 

"Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. "

Maybe we can clarify the PROT_EXEC details in the documentation?
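
To make the expectation concrete, this is the kind of distinction I had in mind
(illustrative only, not taken from any existing caller):

	struct page *page;
	int ret;

	/* struct page manipulation only, no data access intended */
	ret = get_user_pages_fast(addr, 1, 0, &page);		/* FOLL_GET */

	/* page data *will* be accessed, so this should fail on exec-only */
	ret = pin_user_pages_fast(addr, 1, FOLL_WRITE, &page);	/* FOLL_PIN */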

-aneesh


Re: [PATCH v2 37/37] powerpc: Support execute-only on all powerpc

2023-11-06 Thread Aneesh Kumar K V
On 11/6/23 6:53 PM, Christophe Leroy wrote:
> 
> 
> Le 02/11/2023 à 06:39, Aneesh Kumar K.V a écrit :
>> Christophe Leroy  writes:
>>
>>> Introduce PAGE_EXECONLY_X macro which provides exec-only rights.
>>> The _X may be seen as redundant with the EXECONLY but it helps
>>> keep consistancy, all macros having the EXEC right have _X.
>>>
>>> And put it next to PAGE_NONE as PAGE_EXECONLY_X is
>>> somehow PAGE_NONE + EXEC just like all other SOMETHING_X are
>>> just SOMETHING + EXEC.
>>>
>>> On book3s/64 PAGE_EXECONLY becomes PAGE_READONLY_X.
>>>
>>> On book3s/64, as PAGE_EXECONLY is only valid for Radix add
>>> VM_READ flag in vm_get_page_prot() for non-Radix.
>>>
>>> And update access_error() so that a non exec fault on a VM_EXEC only
>>> mapping is always invalid, even when the underlying layer don't
>>> always generate a fault for that.
>>>
>>> For 8xx, set PAGE_EXECONLY_X as _PAGE_NA | _PAGE_EXEC.
>>> For others, only set it as just _PAGE_EXEC
>>>
>>> With that change, 8xx, e500 and 44x fully honor execute-only
>>> protection.
>>>
>>> On 40x that is a partial implementation of execute-only. The
>>> implementation won't be complete because once a TLB has been loaded
>>> via the Instruction TLB miss handler, it will be possible to read
>>> the page. But at least it can't be read unless it is executed first.
>>>
>>> On 603 MMU, TLB missed are handled by SW and there are separate
>>> DTLB and ITLB. Execute-only is therefore now supported by not loading
>>> DTLB when read access is not permitted.
>>>
>>> On hash (604) MMU it is more tricky because hash table is common to
>>> load/store and execute. Nevertheless it is still possible to check
>>> whether _PAGE_READ is set before loading hash table for a load/store
>>> access. At least it can't be read unless it is executed first.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> Cc: Russell Currey 
>>> Cc: Kees Cook 
>>> ---
>>>   arch/powerpc/include/asm/book3s/32/pgtable.h |  2 +-
>>>   arch/powerpc/include/asm/book3s/64/pgtable.h |  4 +---
>>>   arch/powerpc/include/asm/nohash/32/pte-8xx.h |  1 +
>>>   arch/powerpc/include/asm/nohash/pgtable.h|  2 +-
>>>   arch/powerpc/include/asm/nohash/pte-e500.h   |  1 +
>>>   arch/powerpc/include/asm/pgtable-masks.h |  2 ++
>>>   arch/powerpc/mm/book3s64/pgtable.c   | 10 --
>>>   arch/powerpc/mm/fault.c  |  9 +
>>>   arch/powerpc/mm/pgtable.c|  4 ++--
>>>   9 files changed, 18 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
>>> b/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> index 244621c88510..52971ee30717 100644
>>> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> @@ -425,7 +425,7 @@ static inline bool pte_access_permitted(pte_t pte, bool 
>>> write)
>>>   {
>>> /*
>>>  * A read-only access is controlled by _PAGE_READ bit.
>>> -* We have _PAGE_READ set for WRITE and EXECUTE
>>> +* We have _PAGE_READ set for WRITE
>>>  */
>>> if (!pte_present(pte) || !pte_read(pte))
>>> return false;
>>>
>>
>> Should this now be updated to check for EXEC bit ?
> 
> I don't think so based on what I see in arm64: 
> https://elixir.bootlin.com/linux/v6.6/source/arch/arm64/include/asm/pgtable.h#L146
> 

But then there can be a get_user_pages() (FOLL_GET) on an exec-only pte, right?
If we are going to access the page data (FOLL_PIN), then yes, an exec-only
mapping should fail for that. But if we are using it for struct page
manipulation, we need pte_access_permitted() to return true for an exec-only
mapping?
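
i.e. something like the below for pte_access_permitted() (a sketch matching the
hunk in my RFC patch earlier in this thread):

	static inline bool pte_access_permitted(pte_t pte, bool write)
	{
		if (!pte_present(pte))
			return false;
		/* allow exec-only ptes for FOLL_GET style struct page lookups */
		if (!(pte_read(pte) || pte_exec(pte)))
			return false;
		if (write && !pte_write(pte))
			return false;
		return true;
	}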


-aneesh




Re: [PATCH] powerpc/mm: Update set_ptes to call pte_filter for all the ptes

2023-10-18 Thread Aneesh Kumar K V
On 10/18/23 11:25 AM, Christophe Leroy wrote:
> 
> 
> Le 18/10/2023 à 06:55, Aneesh Kumar K.V a écrit :
>> With commit 9fee28baa601 ("powerpc: implement the new page table range
>> API") we added set_ptes to powerpc architecture but the implementation
>> missed calling the pte filter for all the ptes we are setting in the
>> range. set_pte_filter can be used for filter pte values and on some
>> platforms which don't support coherent icache it clears the exec bit so
>> that we can flush the icache on exec fault
>>
>> The patch also removes the usage of arch_enter/leave_lazy_mmu() because
>> set_pte is not supposed to be used when updating a pte entry. Powerpc
>> architecture uses this rule to skip the expensive tlb invalidate which
>> is not needed when you are setting up the pte for the first time. See
>> commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for updating
>> _PAGE_NUMA bit") for more details
>>
>> Fixes: 9fee28baa601 ("powerpc: implement the new page table range API")
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   arch/powerpc/mm/pgtable.c | 33 -
>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index 3ba9fe411604..95ab20cca2da 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -191,28 +191,35 @@ void set_ptes(struct mm_struct *mm, unsigned long 
>> addr, pte_t *ptep,
>>  pte_t pte, unsigned int nr)
>>   {
>>  /*
>> - * Make sure hardware valid bit is not set. We don't do
>> - * tlb flush for this update.
>> + * We don't need to call arch_enter/leave_lazy_mmu_mode()
>> + * because we expect set_ptes to be only be used on not present
>> + * and not hw_valid ptes. Hence there is not translation cache flush
>> + * involved that need to be batched.
>>   */
>> -VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
>> +for (;;) {
>>   
>> -/* Note: mm->context.id might not yet have been assigned as
>> - * this context might not have been activated yet when this
>> - * is called.
>> - */
>> -pte = set_pte_filter(pte);
>> +/*
>> + * Make sure hardware valid bit is not set. We don't do
>> + * tlb flush for this update.
>> + */
>> +VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
>>   
>> -/* Perform the setting of the PTE */
>> -arch_enter_lazy_mmu_mode();
>> -for (;;) {
>> +/* Note: mm->context.id might not yet have been assigned as
>> + * this context might not have been activated yet when this
>> + * is called.
>> + */
>> +pte = set_pte_filter(pte);
> 
> Why do you need to call set_pte_filter() inside the loop ?
> The only difference between previous pte and next pte is the RPN, other 
> flags remain untouched so I can't see why you need to call 
> set_pte_filter() again.
> 

I missed the fact that we use the filtered pte for all the ptes in the range.
One other detail that made me call the filter inside the loop was that we clear
bits in struct page->flags. The only flag we care about right now is
PG_dcache_clean, and that moved to the folio, so we should be good here. Maybe
we add a comment in set_pte_filter() saying it can operate only on
folio->flags?
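
Something along these lines at the top of set_pte_filter() (wording is only a
suggestion):

	/*
	 * set_pte_filter() may be called once for a whole set_ptes() range,
	 * so it must only inspect and update per-folio state (folio->flags,
	 * e.g. PG_dcache_clean) and the pte value itself, never per-page
	 * state.
	 */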

>> +
>> +/* Perform the setting of the PTE */
>>  __set_pte_at(mm, addr, ptep, pte, 0);
>>  if (--nr == 0)
>>  break;
>>  ptep++;
>> -pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>  addr += PAGE_SIZE;
>> +/* increment the pfn */
>> +pte = __pte(pte_val(pte) + PAGE_SIZE);
> 
> PAGE_SIZE doesn't work on all platforms, see for instance e500.
> 
> see comment at 
> https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/powerpc/include/asm/nohash/32/pgtable.h#L147
> 
> And then you see 
> https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/powerpc/include/asm/nohash/pte-e500.h#L63
> 

Didn't know that. I actually wanted to do pfn_pte(pte_pfn(pte) + 1), but that
needs a pgprot_t. I can move it back to PTE_RPN_SHIFT with the above details
documented.
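
So v2 would look roughly like the below (sketch: filter applied once before the
loop, only the RPN bumped inside it):

	pte = set_pte_filter(pte);
	for (;;) {
		VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
		__set_pte_at(mm, addr, ptep, pte, 0);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
		/* advance the RPN; adding PAGE_SIZE directly breaks on e500 */
		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
	}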

>> +
>>  }
>> -arch_leave_lazy_mmu_mode();
>>   }
>>   
>>   void unmap_kernel_page(unsigned long va)
> 
> Christophe

-aneesh


Re: [PATCH 1/2] powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled

2023-08-28 Thread Aneesh Kumar K V
On 8/28/23 1:16 PM, Aneesh Kumar K.V wrote:
> With CONFIG_SPARSEMEM disabled the below kernel build error is observed.
> 
>  arch/powerpc/mm/init_64.c:477:38: error: use of undeclared identifier 
> 'SECTION_SIZE_BITS'
> 
> CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM and it is more clear
> to describe the code dependency in terms of MEMORY_HOTPLUG. Outside
> memory hotplug the kernel uses memory_block_size for kernel directmap.
> Instead of depending on SECTION_SIZE_BITS to compute the direct map
> page size, add a new #define which defaults to 16M(same as existing
> SECTION_SIZE)
> 

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202308251532.k9ppwead-...@intel.com/

> Fixes: 4d15721177d5 ("powerpc/mm: Cleanup memory block size probing")
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/init_64.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index fcda46c2b8df..e3d7379ef480 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -472,12 +472,23 @@ static int __init dt_scan_mmu_pid_width(unsigned long 
> node,
>   return 1;
>  }
>  
> +/*
> + * Outside hotplug the kernel uses this value to map the kernel direct map
> + * with radix. To be compatible with older kernels, let's keep this value
> + * as 16M which is also SECTION_SIZE with SPARSEMEM. We can ideally map
> + * things with 1GB size in the case where we don't support hotplug.
> + */
> +#ifndef CONFIG_MEMORY_HOTPLUG
> +#define DEFAULT_MEMORY_BLOCK_SIZESZ_16M
> +#else
> +#define DEFAULT_MEMORY_BLOCK_SIZEMIN_MEMORY_BLOCK_SIZE
> +#endif
> +
>  static void update_memory_block_size(unsigned long *block_size, unsigned 
> long mem_size)
>  {
> - unsigned long section_size = 1UL << SECTION_SIZE_BITS;
> -
> - for (; *block_size > section_size; *block_size >>= 2) {
> + unsigned long min_memory_block_size = DEFAULT_MEMORY_BLOCK_SIZE;
>  
> + for (; *block_size > min_memory_block_size; *block_size >>= 2) {
>   if ((mem_size & *block_size) == 0)
>   break;
>   }
> @@ -507,7 +518,7 @@ static int __init probe_memory_block_size(unsigned long 
> node, const char *uname,
>   /*
>* Nothing in the device tree
>*/
> - *block_size = MIN_MEMORY_BLOCK_SIZE;
> + *block_size = DEFAULT_MEMORY_BLOCK_SIZE;
>   else
>   *block_size = of_read_number(prop, dt_root_size_cells);
>   /*



Re: [powerpc:next 84/128] arch/powerpc/mm/init_64.c:477:38: error: use of undeclared identifier 'SECTION_SIZE_BITS'

2023-08-25 Thread Aneesh Kumar K V
On 8/25/23 12:39 PM, kernel test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> head:   b9bbbf4979073d5536b7650decd37fcb901e6556
> commit: 4d15721177d539d743fcf31d7bb376fb3b81aeb6 [84/128] powerpc/mm: Cleanup 
> memory block size probing
> config: powerpc64-randconfig-r005-20230825 
> (https://download.01.org/0day-ci/archive/20230825/202308251532.k9ppwead-...@intel.com/config)
> compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 
> 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
> reproduce: 
> (https://download.01.org/0day-ci/archive/20230825/202308251532.k9ppwead-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202308251532.k9ppwead-...@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>>> arch/powerpc/mm/init_64.c:477:38: error: use of undeclared identifier 
>>> 'SECTION_SIZE_BITS'
>  477 | unsigned long section_size = 1UL << SECTION_SIZE_BITS;
>  | ^
>arch/powerpc/mm/init_64.c:510:18: error: use of undeclared identifier 
> 'SECTION_SIZE_BITS'
>  510 | *block_size = MIN_MEMORY_BLOCK_SIZE;
>  |   ^
>include/linux/memory.h:23:43: note: expanded from macro 
> 'MIN_MEMORY_BLOCK_SIZE'
>   23 | #define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
>  |   ^
>2 errors generated.
> 
> 
> vim +/SECTION_SIZE_BITS +477 arch/powerpc/mm/init_64.c
> 
>474
>475static void update_memory_block_size(unsigned long *block_size, 
> unsigned long mem_size)
>476{
>  > 477unsigned long section_size = 1UL << SECTION_SIZE_BITS;
>478
>479for (; *block_size > section_size; *block_size >>= 2) {
>480
>481if ((mem_size & *block_size) == 0)
>482break;
>483}
>484}
>485
> 

Not a nice fix: 

modified   arch/powerpc/mm/init_64.c
@@ -472,11 +472,17 @@ static int __init dt_scan_mmu_pid_width(unsigned long 
node,
return 1;
 }
 
+#ifndef CONFIG_MEMORY_HOTPLUG
+#define MEMORY_BLOCK_SIZE SZ_16M
+#else
+#define MEMORY_BLOCK_SIZE MIN_MEMORY_BLOCK_SIZE
+#endif
+
 static void update_memory_block_size(unsigned long *block_size, unsigned long 
mem_size)
 {
-   unsigned long section_size = 1UL << SECTION_SIZE_BITS;
+   unsigned long min_memory_block_size = MEMORY_BLOCK_SIZE;
 
-   for (; *block_size > section_size; *block_size >>= 2) {
+   for (; *block_size > min_memory_block_size; *block_size >>= 2) {
 
if ((mem_size & *block_size) == 0)
break;
@@ -507,7 +513,7 @@ static int __init probe_memory_block_size(unsigned long 
node, const char *uname,
/*
 * Nothing in the device tree
 */
-   *block_size = MIN_MEMORY_BLOCK_SIZE;
+   *block_size = MEMORY_BLOCK_SIZE;
else
*block_size = of_read_number(prop, dt_root_size_cells);
/*




Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-08 Thread Aneesh Kumar K V
On 8/8/23 12:05 AM, David Hildenbrand wrote:
> On 07.08.23 14:41, David Hildenbrand wrote:
>> On 07.08.23 14:27, Michal Hocko wrote:
>>> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
>>> [...]
>>>> Do you see a need for firmware-managed memory to be hotplugged in with
>>>> different memory block sizes?
>>>
>>> In short. Yes. Slightly longer, a fixed size memory block semantic is
>>> just standing in the way and I would even argue it is actively harmful.
>>> Just have a look at ridicously small memory blocks on ppc. I do
>>> understand that it makes some sense to be aligned to the memory model
>>> (so sparsmem section aligned). In an ideal world, memory hotplug v2
>>> interface (if we ever go that path) should be physical memory range based.
>>
>> Yes, we discussed that a couple of times already (and so far nobody
>> cared to implement any of that).
>>
>> Small memory block sizes are very beneficial for use cases like PPC
>> dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual
>> environments where you might want to add/remove memory in very small
>> granularity. I don't see that changing any time soon. Rather the opposite.
>>
>> Small memory block sizes are suboptimal for large machines where you
>> might never end up removing such memory (boot memory), or when dealing
>> with devices that can only be removed in one piece (DIMM/kmem). We
>> already have memory groups in place to model that.
>>
>> For the latter it might be beneficial to have memory blocks of larger
>> size that correspond to the physical memory ranges. That might also make
>> a memmap (re-)configuration easier.
>>
>> Not sure if that is standing in any way or is harmful, though.
>>
> 
> Just because I thought of something right now, I'll share it, maybe it makes 
> sense.
> 
> Assume when we get add_memory*(MHP_MEMMAP_ON_MEMORY) and it is enabled by the 
> admin:
> 
> 1) We create a single altmap at the beginning of the memory
> 
> 2) We create the existing fixed-size memory block devices, but flag them
>    to be part of a single "altmap" unit.
> 
> 3) Whenever we trigger offlining of a single such memory block, we
>    offline *all* memory blocks belonging to that altmap, essentially
>    using a single offline_pages() call and updating all memory block
>    states accordingly.
> 
> 4) Whenever we trigger onlining of a single such memory block, we
>    online *all* memory blocks belonging to that altmap, using a single
>    online_pages() call.
> 
> 5) We fail remove_memory() if it doesn't cover the same (altmap) range.
> 
> So we can avoid having a memory block v2 (and all that comes with that ...) 
> for now and still get that altmap stuff sorted out. As that altmap behavior 
> can be controlled by the admin, we should be fine for now.
> 
> I think all memory notifiers should already be able to handle bigger 
> granularity, but it would be easy to check. Some internal things might 
> require a bit of tweaking.
> 

We can look at the possibility of using the altmap space reserved for a 
namespace (via option -M dev) for allocating struct page memory even with 
dax/kmem. 

-aneesh



Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-08 Thread Aneesh Kumar K V
On 8/8/23 12:05 AM, David Hildenbrand wrote:
> On 07.08.23 14:41, David Hildenbrand wrote:
>> On 07.08.23 14:27, Michal Hocko wrote:
>>> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
>>> [...]
>>>> Do you see a need for firmware-managed memory to be hotplugged in with
>>>> different memory block sizes?
>>>
>>> In short. Yes. Slightly longer, a fixed size memory block semantic is
>>> just standing in the way and I would even argue it is actively harmful.
>>> Just have a look at ridicously small memory blocks on ppc. I do
>>> understand that it makes some sense to be aligned to the memory model
>>> (so sparsmem section aligned). In an ideal world, memory hotplug v2
>>> interface (if we ever go that path) should be physical memory range based.
>>
>> Yes, we discussed that a couple of times already (and so far nobody
>> cared to implement any of that).
>>
>> Small memory block sizes are very beneficial for use cases like PPC
>> dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual
>> environments where you might want to add/remove memory in very small
>> granularity. I don't see that changing any time soon. Rather the opposite.
>>
>> Small memory block sizes are suboptimal for large machines where you
>> might never end up removing such memory (boot memory), or when dealing
>> with devices that can only be removed in one piece (DIMM/kmem). We
>> already have memory groups in place to model that.
>>
>> For the latter it might be beneficial to have memory blocks of larger
>> size that correspond to the physical memory ranges. That might also make
>> a memmap (re-)configuration easier.
>>
>> Not sure if that is standing in any way or is harmful, though.
>>
> 
> Just because I thought of something right now, I'll share it, maybe it makes 
> sense.
> 
> Assume when we get add_memory*(MHP_MEMMAP_ON_MEMORY) and it is enabled by the 
> admin:
> 
> 1) We create a single altmap at the beginning of the memory
> 
> 2) We create the existing fixed-size memory block devices, but flag them
>    to be part of a single "altmap" unit.
> 
> 3) Whenever we trigger offlining of a single such memory block, we
>    offline *all* memory blocks belonging to that altmap, essentially
>    using a single offline_pages() call and updating all memory block
>    states accordingly.
> 
> 4) Whenever we trigger onlining of a single such memory block, we
>    online *all* memory blocks belonging to that altmap, using a single
>    online_pages() call.
> 
> 5) We fail remove_memory() if it doesn't cover the same (altmap) range.
> 
> So we can avoid having a memory block v2 (and all that comes with that ...) 
> for now and still get that altmap stuff sorted out. As that altmap behavior 
> can be controlled by the admin, we should be fine for now.
> 
> I think all memory notifiers should already be able to handle bigger 
> granularity, but it would be easy to check. Some internal things might 
> require a bit of tweaking.
> 
> Just a thought.
> 


W.r.t. enabling memmap_on_memory for ppc64, I guess I will drop patch 7 and
resend the series so that Andrew can pick up the rest of the patches?

-aneesh





Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-05 Thread Aneesh Kumar K V
On 8/3/23 5:00 PM, Michal Hocko wrote:
> On Thu 03-08-23 11:24:08, David Hildenbrand wrote:
> [...]
>>> would be readable only when the block is offline and it would reallocate
>>> vmemmap on the change. Makes sense? Are there any risks? Maybe pfn
>>> walkers?
>>
>> The question is: is it of any real value such that it would be worth the
>> cost and risk?
>>
>>
>> One of the primary reasons for memmap_on_memory is that *memory hotplug*
>> succeeds even in low-memory situations (including, low on ZONE_NORMAL
>> situations).
> 
> One usecase I would have in mind is a mix of smaller and larger memory
> blocks. For larger ones you want to have memmap_on_memory in general
> because they do not eat memory from outside but small(er) ones might be
> more tricky because now you can add a lot of blocks that would be
> internally fragmented to prevent larger allocations to form.
> 


I guess that closely aligns with device memory and being able to add
device memory via dax/kmem using a larger memory block size.
We can then make sure we enable the memmap_on_memory feature
at the device level for this device memory. Do you see a need for
firmware-managed memory to be hotplugged in with different memory block sizes?



>> So you want that behavior already when hotplugging such
>> devices. While there might be value to relocate it later, I'm not sure if
>> that is really worth it, and it does not solve the main use case.
> 
> Is it worth it? TBH I am not sure same as I am not sure the global
> default should be writable after boot. If we want to make it more
> dynamic we should however talk about the proper layer this is
> implemented on.

-aneesh


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-02 Thread Aneesh Kumar K V
On 8/2/23 9:24 PM, David Hildenbrand wrote:
> On 02.08.23 17:50, Michal Hocko wrote:
>> On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
>>> On 8/1/23 4:20 PM, Michal Hocko wrote:
>>>> On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
>>>>> On 8/1/23 2:28 PM, Michal Hocko wrote:
>>>>>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>>>>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>>>>> value.
>>>>>>
>>>>>> Well, this is a user space kABI extension and as such you should spend
>>>>>> more words about the usecase. Why we could live with this static and now
>>>>>> need dynamic?
>>>>>>
>>>>>
>>>>> This enables easy testing of memmap_on_memory feature without a kernel 
>>>>> reboot.
>>>>
>>>> Testing alone is rather weak argument to be honest.
>>>>
>>>>> I also expect people wanting to use that when they find dax kmem memory 
>>>>> online
>>>>> failing because of struct page allocation failures[1]. User could reboot 
>>>>> back with
>>>>> memmap_on_memory=y kernel parameter. But being able to enable it via 
>>>>> sysfs makes
>>>>> the feature much more useful.
>>>>
>>>> Sure it can be useful but that holds for any feature, right. The main
>>>> question is whether this is worth maintaing. The current implementation
>>>> seems rather trivial which is an argument to have it but are there any
>>>> risks long term? Have you evaluated a potential long term maintenance
>>>> cost? There is no easy way to go back and disable it later on without
>>>> breaking some userspace.
>>>>
>>>> All that should be in the changelog!
>>>
>>> I updated it as below.
>>>
>>> mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
>>>
>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>> hotplug done after the mode update will use the new mmemap_on_memory
>>> value.
>>>
>>> It is now possible to test the memmap_on_memory feature easily without
>>> the need for a kernel reboot. Additionally, for those encountering
>>> struct page allocation failures while using dax kmem memory online, this
>>> feature may prove useful. Instead of rebooting with the
>>> memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
>>> which greatly enhances its usefulness.
>>
>>
>> I do not really see a solid argument why rebooting is really a problem
>> TBH. Also is the global policy knob really the right fit for existing
>> hotplug usecases? In other words, if we really want to make
>> memmap_on_memory more flexible would it make more sense to have it per
>> memory block property instead (the global knob being the default if
>> admin doesn't specify it differently).
> 
> Per memory block isn't possible, due to the creation order. Also, I think 
> it's not the right approach.
> 
> I thought about driver toggles. At least for dax/kmem people are looking into 
> that:
> 
> https://lkml.kernel.org/r/20230801-vv-kmem_memmap-v3-2-406e9aaf5...@intel.com
> 
> Where that can also be toggled per device.
> 

That still depends on the global memmap_on_memory toggle being enabled, right?
Could we make the dax facility independent of the global feature toggle?

-aneesh


Re: [PATCH v7 6/7] mm/memory_hotplug: Embed vmem_altmap details in memory block

2023-08-01 Thread Aneesh Kumar K V
On 8/2/23 4:40 AM, Verma, Vishal L wrote:
> On Tue, 2023-08-01 at 10:11 +0530, Aneesh Kumar K.V wrote:
>> With memmap on memory, some architecture needs more details w.r.t altmap
>> such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of
>> computing them again when we remove a memory block, embed vmem_altmap
>> details in struct memory_block if we are using memmap on memory block
>> feature.
>>
>> Acked-by: David Hildenbrand 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  drivers/base/memory.c  | 27 +
>>  include/linux/memory.h |  8 ++
>>  mm/memory_hotplug.c    | 55 ++
>>  3 files changed, 53 insertions(+), 37 deletions(-)
>>
> 
> 
>> @@ -2136,10 +2148,10 @@ EXPORT_SYMBOL(try_offline_node);
>>  
>>  static int __ref try_remove_memory(u64 start, u64 size)
>>  {
>> -   struct vmem_altmap mhp_altmap = {};
>> -   struct vmem_altmap *altmap = NULL;
>> -   unsigned long nr_vmemmap_pages;
>> +   int ret;
> 
> Minor nit - there is already an 'int rc' below - just use that, or
> rename it to 'ret' if that's better for consistency.
> 


I reused the existing rc variable. 

>> +   struct memory_block *mem;
>> int rc = 0, nid = NUMA_NO_NODE;
>> +   struct vmem_altmap *altmap = NULL;
>>  
>> BUG_ON(check_hotplug_memory_range(start, size));
>>  
>> @@ -2161,25 +2173,20 @@ static int __ref try_remove_memory(u64 start, u64 
>> size)
>>  * the same granularity it was added - a single memory block.
>>  */
>> if (mhp_memmap_on_memory()) {
>> -   nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
>> - 
>> get_nr_vmemmap_pages_cb);
>> -   if (nr_vmemmap_pages) {
>> +   ret = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
>> +   if (ret) {
>> if (size != memory_block_size_bytes()) {
>> pr_warn("Refuse to remove %#llx - %#llx,"
>> "wrong granularity\n",
>> start, start + size);
>> return -EINVAL;
>> }
>> -
>> +   altmap = mem->altmap;
>> /*
>> -    * Let remove_pmd_table->free_hugepage_table do the
>> -    * right thing if we used vmem_altmap when hot-adding
>> -    * the range.
>> +    * Mark altmap NULL so that we can add a debug
>> +    * check on memblock free.
>>  */
>> -   mhp_altmap.base_pfn = PHYS_PFN(start);
>> -   mhp_altmap.free = nr_vmemmap_pages;
>> -   mhp_altmap.alloc = nr_vmemmap_pages;
>> -   altmap = &mhp_altmap;
>> +   mem->altmap = NULL;
>> }
>> }
>>  


Thank you for taking the time to review the patch.

-aneesh


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-01 Thread Aneesh Kumar K V
On 8/1/23 4:20 PM, Michal Hocko wrote:
> On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
>> On 8/1/23 2:28 PM, Michal Hocko wrote:
>>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>> value.
>>>
>>> Well, this is a user space kABI extension and as such you should spend
>>> more words about the usecase. Why we could live with this static and now
>>> need dynamic?
>>>
>>
>> This enables easy testing of memmap_on_memory feature without a kernel 
>> reboot.
> 
> Testing alone is rather weak argument to be honest.
> 
>> I also expect people wanting to use that when they find dax kmem memory 
>> online
>> failing because of struct page allocation failures[1]. User could reboot 
>> back with
>> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs 
>> makes
>> the feature much more useful.
> 
> Sure it can be useful but that holds for any feature, right. The main
> question is whether this is worth maintaing. The current implementation
> seems rather trivial which is an argument to have it but are there any
> risks long term? Have you evaluated a potential long term maintenance
> cost? There is no easy way to go back and disable it later on without
> breaking some userspace.
> 
> All that should be in the changelog!

I updated it as below. 

mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

Allow updating memmap_on_memory mode after the kernel boot. Memory
hotplug done after the mode update will use the new memmap_on_memory value.

It is now possible to test the memmap_on_memory feature easily without
the need for a kernel reboot. Additionally, for those encountering
struct page allocation failures while using dax kmem memory online, this
feature may prove useful. Instead of rebooting with the
memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
which greatly enhances its usefulness. Making the necessary changes to
support runtime updates is a simple change that should not affect the
addition of new features or result in long-term maintenance problems.




Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-01 Thread Aneesh Kumar K V
On 8/1/23 2:28 PM, Michal Hocko wrote:
> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>> hotplug done after the mode update will use the new mmemap_on_memory
>> value.
> 
> Well, this is a user space kABI extension and as such you should spend
> more words about the usecase. Why we could live with this static and now
> need dynamic?
> 

This enables easy testing of the memmap_on_memory feature without a kernel
reboot. I also expect people to want it when they find dax kmem memory onlining
failing because of struct page allocation failures[1]. A user could reboot with
the memmap_on_memory=y kernel parameter, but being able to enable it via sysfs
makes the feature much more useful.

-aneesh


Re: [PATCH v3 1/2] powerpc/mm: Cleanup memory block size probing

2023-07-29 Thread Aneesh Kumar K V
On 7/29/23 3:25 AM, Reza Arbab wrote:
> On Fri, Jul 28, 2023 at 04:05:55PM +0530, Aneesh Kumar K.V wrote:
>> --- a/arch/powerpc/mm/init_64.c
>> +++ b/arch/powerpc/mm/init_64.c
> [snip]
>> +    /*
>> + * "ibm,coherent-device-memory with linux,usable-memory = 0
>> + * Force 256MiB block size. Work around for GPUs on P9 PowerNV
>> + * linux,usable-memory == 0 implies driver managed memory and
>> + * we can't use large memory block size due to hotplug/unplug
>> + * limitations.
>> + */
>> +    compatible = of_get_flat_dt_prop(node, "compatible", NULL);
>> +    if (compatible && !strcmp(compatible, "ibm,coherent-device-memory")) {
>> +    int len = 0;
>> +    const __be32 *usm;
>> +
>> +    usm = of_get_flat_dt_prop(node, "linux,drconf-usable-memory", &len);
> 
> I think this should be "linux,usable-memory".
> 
>> +    if (usm && !len) {
>> +    *block_size = SZ_256M;
>> +    return 1;
>> +    }
> 
> This isn't quite right. The criteria is not that the property itself has no 
> registers, it's that the base/size combo has size zero.
> 
> If you fold in the patch appended to the end of this mail, things worked for 
> me.
> 
>> +    }
>> +
>> +    reg = of_get_flat_dt_prop(node, "reg", &l);
>> +    endp = reg + (l / sizeof(__be32));
>> +
>> +    while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
>> +    u64 base, size;
>> +
>> +    base = dt_mem_next_cell(dt_root_addr_cells, &reg);
>> +    size = dt_mem_next_cell(dt_root_size_cells, &reg);
>> +
>> +    if (size == 0)
>> +    continue;
>> +
>> +    update_memory_block_size(block_size, size);
>> +    }
>> +    /* continue looking for other memory device types */
>> +    return 0;
>> +}
>> +
>> +/*
>> + * start with 1G memory block size. Early init will
>> + * fix this with correct value.
>> + */
>> +unsigned long memory_block_size __ro_after_init = 1UL << 30;
> 
> Could use SZ_1G here.
> 
> With the following fixup, I got 256MiB blocks on a system with
> "ibm,coherent-device-memory" nodes.
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index dbed37d6cffb..1ac58e72a885 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -487,7 +487,6 @@ static int __init probe_memory_block_size(unsigned long 
> node, const char *uname,
>    depth, void *data)
>  {
>  const char *type;
> -    const char *compatible;
>  unsigned long *block_size = (unsigned long *)data;
>  const __be32 *reg, *endp;
>  int l;
> @@ -532,38 +531,38 @@ static int __init probe_memory_block_size(unsigned long 
> node, const char *uname,
>  if (type == NULL || strcmp(type, "memory") != 0)
>  return 0;
>  
> -    /*
> - * "ibm,coherent-device-memory with linux,usable-memory = 0
> - * Force 256MiB block size. Work around for GPUs on P9 PowerNV
> - * linux,usable-memory == 0 implies driver managed memory and
> - * we can't use large memory block size due to hotplug/unplug
> - * limitations.
> - */
> -    compatible = of_get_flat_dt_prop(node, "compatible", NULL);
> -    if (compatible && !strcmp(compatible, "ibm,coherent-device-memory")) {
> -    int len = 0;
> -    const __be32 *usm;
> -
> -    usm = of_get_flat_dt_prop(node, "linux,drconf-usable-memory", &len);
> -    if (usm && !len) {
> -    *block_size = SZ_256M;
> -    return 1;
> -    }
> -    }
> +    reg = of_get_flat_dt_prop(node, "linux,usable-memory", &l);
> +    if (!reg)
> +    reg = of_get_flat_dt_prop(node, "reg", &l);
> +    if (!reg)
> +    return 0;
>  
> -    reg = of_get_flat_dt_prop(node, "reg", &l);
>  endp = reg + (l / sizeof(__be32));
>  
>  while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> +    const char *compatible;
>  u64 base, size;
>  
>  base = dt_mem_next_cell(dt_root_addr_cells, &reg);
>  size = dt_mem_next_cell(dt_root_size_cells, &reg);
>  
> -    if (size == 0)
> +    if (size) {
> +    update_memory_block_size(block_size, size);
>  continue;
> +    }
>  
> -    update_memory_block_size(block_size, size);
> +    /*
> + * ibm,coherent-device-memory with linux,usable-memory = 0
> + * Force 256MiB block size. Work around for GPUs on P9 PowerNV
> + * linux,usable-memory == 0 implies driver managed memory and
> + * we can't use large memory block size due to hotplug/unplug
> + * limitations.
> + */
> +    compatible = of_get_flat_dt_prop(node, "compatible", NULL);
> +    if (compatible && !strcmp(compatible, "ibm,coherent-device-memory")) 
> {
> +    *block_size = SZ_256M;
> +    return 1;
> +    }
>  }
>  /* continue looking for other memory device types */
>  return 0;

Thanks for pointing out the right device tree node and testing the changes. Can
I add

Co-authored-by: Reza Arbab 

-aneesh


Re: [PATCH v6 6/7] mm/memory_hotplug: Embed vmem_altmap details in memory block

2023-07-27 Thread Aneesh Kumar K V
On 7/27/23 2:55 PM, Michal Hocko wrote:
> On Thu 27-07-23 13:32:31, Aneesh Kumar K.V wrote:
>> With memmap on memory, some architecture needs more details w.r.t altmap
>> such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of
>> computing them again when we remove a memory block, embed vmem_altmap
>> details in struct memory_block if we are using memmap on memory block
>> feature.
>>
>> No functional change in this patch
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  drivers/base/memory.c  | 25 +++---
>>  include/linux/memory.h |  8 ++
>>  mm/memory_hotplug.c| 58 +++---
>>  3 files changed, 55 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index b456ac213610..57ed61212277 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev)
>>  {
>>  struct memory_block *mem = to_memory_block(dev);
>>  
>> +WARN_ON(mem->altmap);
> 
> What is this supposed to catch? A comment would be handy so that we know
> what to look at should it ever trigger.
> 

I did add a comment where we clear the altmap in try_remove_memory(). I will 
also add
more details here.

+* Mark altmap NULL so that we can add a debug
+* check on memblock free.
 */

The WARN_ON is an indication of a memory leak: if mem->altmap != NULL here,
then the allocated altmap was not freed. It also indicates that the memory
block got freed without going through try_remove_memory().

>>  kfree(mem);
>>  }


-aneesh


Re: [PATCH v6 4/7] mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

2023-07-27 Thread Aneesh Kumar K V
On 7/27/23 2:53 PM, Michal Hocko wrote:
> On Thu 27-07-23 13:32:29, Aneesh Kumar K.V wrote:
> [...]
>> +if (mode == MEMMAP_ON_MEMORY_FORCE) {
>> +unsigned long memmap_pages = 
>> memory_block_memmap_on_memory_pages();
>> +
>> +pr_info_once("Memory hotplug will reserve %ld pages in each 
>> memory block\n",
>> + memmap_pages - PFN_UP(memory_block_memmap_size()));
>> +}
>> +return 0;
>> +}
> 
> Why should we print this only for the forced case? Isn't that
> interesting for any on memory memmap? Also is this the above sufficient
> on its own? the size depends on the block size and that can vary.
> I think it would make more sense to print the block size and the vmemmap
> reservation and for the force case also any wasted amount on top (if
> any).
> 

For the other cases the space is completely used for struct page allocation.
What the message indicates here is that for each memory block we add, we are
losing/wasting that many pages. Maybe I should have used the term "waste"
instead of "reserve"?
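
i.e. just rewording the message, something like:

	pr_info_once("Memory hotplug will waste %ld pages in each memory block\n",
		     memmap_pages - PFN_UP(memory_block_memmap_size()));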

-aneesh


Re: [PATCH v5 4/7] mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

2023-07-26 Thread Aneesh Kumar K V
On 7/26/23 2:34 PM, David Hildenbrand wrote:
> 
    /*
 @@ -1310,7 +1400,10 @@ int __ref add_memory_resource(int nid, struct 
 resource *res, mhp_t mhp_flags)
    {
    struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
    enum memblock_flags memblock_flags = MEMBLOCK_NONE;
 -    struct vmem_altmap mhp_altmap = {};
 +    struct vmem_altmap mhp_altmap = {
 +    .base_pfn =  PHYS_PFN(res->start),
 +    .end_pfn  =  PHYS_PFN(res->end),
>>>
>>> Is it required to set .end_pfn, and if so, shouldn't we also set it to
>>> base_pfn + memory_block_memmap_on_memory_pages()) ?
>>>
>>
>> We use that in ppc64 for checking altmap boundary condition. As we
>> discussed earlier, ppc64 due to vmemmap mapping size restrictions can't
>> always allocate vmemmap pages from altmap area even if requested. We
>> fallback to regular memory alocation in that case (only used now with
>> pmem). We use altmap.end_pfn for that boundary check. You can refer to
>> altmap_cross_boundary() for more details.
> 
> But even then, setting the end to the end of the resource size is wrong, no? 
> We don't want anybody to allocate beyond base_pfn + 
> memory_block_memmap_on_memory_pages().
> 

altmap.end_pfn is the end pfn of the resource:

__nvdimm_setup_pfn()
...
resource_size_t end = nsio->res.end - end_trunc;
struct vmem_altmap __altmap = {
.base_pfn = init_altmap_base(base),
.reserve = init_altmap_reserve(base),
.end_pfn = PHYS_PFN(end),
};


And we use it to check that the page_to_pfn mapping of a page allocated from
the altmap doesn't point to a pfn outside the resource range.
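
For reference, the boundary check I am referring to looks roughly like this
(paraphrased from altmap_cross_boundary() in arch/powerpc/mm/init_64.c, not a
verbatim copy):

	static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
					  unsigned long page_size)
	{
		unsigned long nr_pfn = page_size / sizeof(struct page);
		unsigned long start_pfn = page_to_pfn((struct page *)start);

		/* the backing vmemmap range must map pfns inside the resource */
		if (start_pfn < altmap->base_pfn)
			return true;
		if (start_pfn + nr_pfn - 1 > altmap->end_pfn)
			return true;

		return false;
	}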

-aneesh


Re: [PATCH v6 00/13] Add support for DAX vmemmap optimization for ppc64

2023-07-25 Thread Aneesh Kumar K V
On 7/26/23 12:59 AM, Andrew Morton wrote:
> On Tue, 25 Jul 2023 00:37:46 +0530 "Aneesh Kumar K.V" 
>  wrote:
> 
>> This patch series implements changes required to support DAX vmemmap
>> optimization for ppc64.
> 
> Do we have any measurements to help us understand the magnitude
> of this optimization?
> 
> And any documentation which helps users understand whether and
> why they should enable this feature?

That is a memory space optimization due to the kernel reusing the tail page
struct pages. The details of the optimization are documented in patch 11, where
we also document the impact with both 4K and 64K page sizes.

-aneesh


Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation

2023-07-24 Thread Aneesh Kumar K V
On 7/24/23 9:11 PM, David Hildenbrand wrote:
> On 24.07.23 17:16, Aneesh Kumar K V wrote:
> 
>>>
>>> /*
>>>   * In "forced" memmap_on_memory mode, we always align the vmemmap size up 
>>> to cover
>>>   * full pageblocks. That way, we can add memory even if the vmemmap size 
>>> is not properly
>>>   * aligned, however, we might waste memory.
>>>   */
>>
>> I am finding that confusing. We do want things to be pageblock_nr_pages 
>> aligned both ways.
>> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for memmap 
>> and
>> in the default case we do that by making sure only memory blocks of specific 
>> size supporting
>> that alignment can use MEMMAP_ON_MEMORY feature.
> 
> See the usage inm hp_supports_memmap_on_memory(), I guess that makes sense 
> then.
> 
> But if you have any ideas on how to clarify that (terminology), I'm all ears!
> 


I updated the commit message 

mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

Currently, memmap_on_memory feature is only supported with memory block
sizes that result in vmemmap pages covering full page blocks. This is
because memory onlining/offlining code requires applicable ranges to be
pageblock-aligned, for example, to set the migratetypes properly.

This patch helps to lift that restriction by reserving more pages than
required for vmemmap space. This helps to align the start addr to be
page block aligned with different memory block sizes. This implies the
kernel will be reserving some pages for every memoryblock. This also
allows the memmap on memory feature to be widely useful with different
memory block size values.

For example, with a 64K page size and 256MiB memory block size, we require 4
pages to map the vmemmap pages. To align things correctly we end up adding a
reserve of 28 pages, i.e., for every 4096 pages, 28 pages get reserved.


Also, while implementing your suggestion to use
memory_block_memmap_on_memory_size(), I am finding it not really useful because
in mhp_supports_memmap_on_memory() we are checking whether remaining_size is
pageblock_nr_pages aligned (dax_kmem may want to use that helper later). Also,
I still think altmap.reserve is easier because of the start_pfn calculation
(more on this below).



> [...]
> 
>>>> +    return arch_supports_memmap_on_memory(size);
>>>>    }
>>>>      /*
>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct 
>>>> resource *res, mhp_t mhp_flags)
>>>>    {
>>>>    struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>    enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>> -    struct vmem_altmap mhp_altmap = {};
>>>> +    struct vmem_altmap mhp_altmap = {
>>>> +    .base_pfn =  PHYS_PFN(res->start),
>>>> +    .end_pfn  =  PHYS_PFN(res->end),
>>>> +    .reserve  = memory_block_align_base(resource_size(res)),
>>>
>>> Can you remind me why we have to set reserve here at all?
>>>
>>> IOW, can't we simply set
>>>
>>> .free = memory_block_memmap_on_memory_size();
>>>
>>> end then pass
>>>
>>> mhp_altmap.alloc + mhp_altmap.free
>>>
>>> to create_memory_block_devices() instead?
>>>
>>
>> But with the dax usage of altmap, altmap->reserve is what we use to reserve 
>> things to get
>> the required alignment. One difference is where we allocate the struct page 
>> at. For this specific
>> case it should not matter.
>>
>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap 
>> *altmap)
>> {
>> return altmap->base_pfn + altmap->reserve + altmap->alloc
>>     + altmap->align;
>> }
>>
>> And other is where we online a memory block
>>
>> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
>>
>> Considering altmap->reserve is what dax pfn_dev use, is there a reason you 
>> want to use altmap->free for this?
> 
> "Reserve" is all about "reserving that much memory for driver usage".
> 
> We don't care about that. We simply want vmemmap allocations coming from the 
> pageblock(s) we set aside. Where exactly, we don't care.
> 
>> I find it confusing to update free when we haven't allocated any altmap 
>> blocks yet.
> 
> "
> @reserve: pages mapped, but reserved for driver use (relative to @base)"
> @free: free pages set aside in the mapping for memmap storage
> @alloc: track pages consumed, private to vmemmap_populate(

Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation

2023-07-24 Thread Aneesh Kumar K V
On 7/24/23 8:03 PM, David Hildenbrand wrote:
> On 18.07.23 04:44, Aneesh Kumar K.V wrote:
>> Add a new kconfig option that can be selected if we want to allow
> 
> That description seems outdated.
> 


Will update

>> pageblock alignment by reserving pages in the vmemmap altmap area.
>> This implies we will be reserving some pages for every memoryblock
>> This also allows the memmap on memory feature to be widely useful
>> with different memory block size values.
> 
> Can you add some more meat to the description, and especially, in
> which cases this might be desired and in which cases it might be
> completely undesired?
> 
> 
> Let's assume we hotplug a 1 GiB DIMM on arm64/64k. With 512 MiB pageblocks,
> we'd waste 50% of the hotplugged memory.
> 
> Also, see below on the case where we could end up with 100% wasted memory,
> which we want to block compeltely.
> 
> 
> Also, I wonder if we can avoid talking about "page reservation" or "altmap 
> reservation",
> that's rather an implementation detail.
> 
> For example, I'd call this patch
> 
> "mm/hotplug: Support memmap_on_memory when memmap is not aligned to 
> pageblocks"
> 
> 

Ok will update

> 
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   mm/memory_hotplug.c | 109 ++--
>>   1 file changed, 96 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 5921c81fcb70..c409f5ff6a59 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -41,17 +41,85 @@
>>   #include "internal.h"
>>   #include "shuffle.h"
>>   +enum {
>> +    MEMMAP_ON_MEMORY_DISABLE = 0,
>> +    MEMMAP_ON_MEMORY_ENABLE,
>> +    MEMMAP_ON_MEMORY_FORCE,
>> +};
>> +
>> +static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE;
>> +
>> +static inline unsigned long memory_block_align_base(unsigned long size)
>> +{
> 
> Can we start with something like this instead?
> 
> memory_block_memmap_size() might be reasonable to put into the previous patch.
> 
> 
> static inline unsigned long memory_block_memmap_size(void)
> {
> return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page);
> }
> 
> /*
>  * In "forced" memmap_on_memory mode, we always align the vmemmap size up to 
> cover
>  * full pageblocks. That way, we can add memory even if the vmemmap size is 
> not properly
>  * aligned, however, we might waste memory.
>  */

I find that comment confusing. We do want things to be pageblock_nr_pages
aligned either way: with MEMMAP_ON_MEMORY_FORCE we achieve that by allocating
more space for the memmap, and in the default case by making sure only memory
blocks of a size that already provides that alignment can use the
MEMMAP_ON_MEMORY feature.

> static inline unsigned long memory_block_memmap_on_memory_size(void)
> {
> unsigned long size = memory_block_memmap_size();
> 
> if (memmap_mode != MEMMAP_ON_MEMORY_FORCE)
>     return size;
> return ALIGN(size, PFN_PHYS(pageblock_nr_pages));
> }
> 
> 
> 
>> +    if (memmap_mode == MEMMAP_ON_MEMORY_FORCE) {
>> +    unsigned long align;
>> +    unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>> +    unsigned long vmemmap_size;
>> +
>> +    vmemmap_size = DIV_ROUND_UP(nr_vmemmap_pages * sizeof(struct page), 
>> PAGE_SIZE);
>> +    align = pageblock_align(vmemmap_size) - vmemmap_size;
>> +    return align;
>> +    } else
>> +    return 0;
>> +}
>> +
>>   #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
>>   /*
>>    * memory_hotplug.memmap_on_memory parameter
>>    */
>> -static bool memmap_on_memory __ro_after_init;
>> -module_param(memmap_on_memory, bool, 0444);
>> -MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory 
>> hotplug");
>> +static int set_memmap_mode(const char *val, const struct kernel_param *kp)
>> +{
>> +    int ret, mode;
>> +    bool enabled;
>> +
>> +    if (sysfs_streq(val, "force") ||  sysfs_streq(val, "FORCE")) {
>> +    mode =  MEMMAP_ON_MEMORY_FORCE;
>> +    goto matched;
>> +    }
>> +
>> +    ret = kstrtobool(val, );
>> +    if (ret < 0)
>> +    return ret;
>> +    if (enabled)
>> +    mode =  MEMMAP_ON_MEMORY_ENABLE;
>> +    else
>> +    mode =  MEMMAP_ON_MEMORY_DISABLE;
>> +
>> +matched:
>> +    *((int *)kp->arg) =  mode;
>> +    if (mode == MEMMAP_ON_MEMORY_FORCE) {
>> +    pr_info("Memory hotplug will reserve %ld pages in each memory 
>> block\n",
>> +    memory_block_align_base(memory_block_size_bytes()));
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int get_memmap_mode(char *buffer, const struct kernel_param *kp)
>> +{
>> +    if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE)
>> +    return sprintf(buffer,  "force\n");
>> +    if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_ENABLE)
>> +    return sprintf(buffer,  "y\n");
>> +
>> +    return sprintf(buffer,  "n\n");
> 
> param_get_bool() uses uppercase Y / N. Maybe just return the uppercase 
> variants here as well.
> 
>> +}
>> +
>> +static const struct kernel_param_ops 

Re: [PATCH v4 5/6] powerpc/book3s64/memhotplug: Enable memmap on memory for radix

2023-07-24 Thread Aneesh Kumar K V
On 7/24/23 8:04 PM, David Hildenbrand wrote:
> On 18.07.23 04:44, Aneesh Kumar K.V wrote:
>> Radix vmemmap mapping can map things correctly at the PMD level or PTE
>> level based on different device boundary checks. Hence we skip the
>> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
>> makes the feature widely useful because to use PMD_SIZE vmemmap area we
>> require a memory block size of 2GiB
>>
>> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature
>> can work with a memory block size of 256MB. Using altmap.reserve feature
>> to align things correctly at pageblock granularity. We can end up
>> losing some pages in memory with this. For ex: with a 256MiB memory block
>> size, we require 4 pages to map vmemmap pages, In order to align things
>> correctly we end up adding a reserve of 28 pages. ie, for every 4096
>> pages 28 pages get reserved.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   arch/powerpc/Kconfig  |  1 +
>>   arch/powerpc/include/asm/pgtable.h    | 24 +++
>>   .../platforms/pseries/hotplug-memory.c    |  3 ++-
>>   mm/memory_hotplug.c   |  2 ++
>>   4 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 116d6add0bb0..f890907e5bbf 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -157,6 +157,7 @@ config PPC
>>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>   select ARCH_KEEP_MEMBLOCK
>> +    select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE    if PPC_RADIX_MMU
>>   select ARCH_MIGHT_HAVE_PC_PARPORT
>>   select ARCH_MIGHT_HAVE_PC_SERIO
>>   select ARCH_OPTIONAL_KERNEL_RWX    if ARCH_HAS_STRICT_KERNEL_RWX
>> diff --git a/arch/powerpc/include/asm/pgtable.h 
>> b/arch/powerpc/include/asm/pgtable.h
>> index 68817ea7f994..3d35371395a9 100644
>> --- a/arch/powerpc/include/asm/pgtable.h
>> +++ b/arch/powerpc/include/asm/pgtable.h
>> @@ -169,6 +169,30 @@ static inline bool is_ioremap_addr(const void *x)
>>   int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
>> vmemmap_map_size);
>>   bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
>>  unsigned long page_size);
>> +/*
>> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
>> + * some of the restrictions. We don't check for PMD_SIZE because our
>> + * vmemmap allocation code can fallback correctly. The pageblock
>> + * alignment requirement is met using altmap->reserve blocks.
>> + */
>> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
>> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
>> +{
>> +    unsigned long nr_pages = size >> PAGE_SHIFT;
>> +    unsigned long vmemmap_size = nr_pages * sizeof(struct page);
>> +
>> +    if (!radix_enabled())
>> +    return false;
>> +
>> +    if (IS_ENABLED(CONFIG_PPC_4K_PAGES))
>> +    return IS_ALIGNED(vmemmap_size, PMD_SIZE);
> 
> Can you add a comment why we care about that in the 4K case only?


Sure. We keep the PMD_SIZE alignment for the same reason we have it for x86:
with a 4K page size and a 2M hugepage size, things get properly aligned and
we can make this feature useful even with that alignment restriction. With a
64K page size and a 2M hugepage size, keeping the restriction makes the
feature more or less useless for a large number of the memory block sizes we
support. I will add that comment here.
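
The arithmetic I have in mind (my numbers, assuming a 64-byte struct page and
PMD_SIZE == 2M):

	/*
	 * 4K base pages: a 128MiB block has 32768 struct pages, i.e.
	 * 32768 * 64 = 2MiB of vmemmap -- exactly one PMD, so the
	 * IS_ALIGNED(vmemmap_size, PMD_SIZE) check passes for 128MiB
	 * multiples.
	 *
	 * 64K base pages: the same 128MiB block only needs 2048 * 64 =
	 * 128KiB of vmemmap, far below PMD_SIZE, so keeping the check
	 * would reject most of the memory block sizes we support.
	 */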

> 
>> +    /*
>> + * The pageblock alignment requirement is met by using
>> + * reserve blocks in altmap.
>> + */
> 
> Just drop that comment, that's handled by common code now.
> 

Ok. 

>> +    return true;
>> +}
>> +
>>   #endif /* CONFIG_PPC64 */
>>     #endif /* __ASSEMBLY__ */
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index 9c62c2c3b3d0..1447509357a7 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
>> u32 drc_index)
>>     static int dlpar_add_lmb(struct drmem_lmb *lmb)
>>   {
>> +    mhp_t mhp_flags = MHP_NONE | MHP_MEMMAP_ON_MEMORY;
>>   unsigned long block_sz;
>>   int nid, rc;
>>   @@ -637,7 +638,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>>   nid = first_online_node;
>>     /* Add the memory */
>> -    rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
>> +    rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
>>   if (rc) {
>>   invalidate_lmb_associativity_index(lmb);
>>   return rc;
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index c409f5ff6a59..6da063c80733 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -2174,6 +2174,8 @@ static int __ref try_remove_memory(u64 

Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

2023-07-18 Thread Aneesh Kumar K V
On 7/19/23 10:34 AM, Hugh Dickins wrote:
> On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
>> Hugh Dickins  writes:
>>
>>> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
>>> in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
>>> stricter than the previous implementation, which skipped when pmd_none()
>>> (with a comment on khugepaged collapse transitions): but wouldn't we want
>>> to know, if an assert_pte_locked() caller can be racing such transitions?
>>>
>>
>> The reason we had that pmd_none check there was to handle khugpaged. In
>> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
>> ppc64 had the assert_pte_locked check inside that ptep_clear.
>>
>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>> ..
>> ptep_clear()
>> -> asset_ptep_locked()
>> ---> pmd_none
>> -> BUG
>>
>>
>> The problem is how assert_pte_locked() verify whether we are holding
>> ptl. It does that by walking the page table again and in this specific
>> case by the time we call the function we already had cleared pmd .
> 
> Aneesh, please clarify, I've spent hours on this.
> 
> From all your use of past tense ("had"), I thought you were Acking my
> patch; but now, after looking again at v3.11 source and today's,
> I think you are NAKing my patch in its present form.
> 

Sorry for the confusion my reply created. 

> You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
> uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
> *pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
> Is that your point?
> 

Yes. I haven't tested this yet to verify that it is indeed hitting that BUG.
But a code inspection tells me we will hit that BUG on powerpc because of
the above details.

> I can easily restore that khugepaged comment (which had appeared to me
> out of date at the time, but now looks still relevant) and pmd_none(*pmd)
> check: but please clarify.
> 

That is correct. If we add that pmd_none check back, we should be good here.
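
Something along these lines is what I have in mind -- only a sketch of the
shape on top of the pte_offset_map_nolock() conversion, not the exact hunk:

	void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
	{
		pgd_t *pgd;
		p4d_t *p4d;
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;
		spinlock_t *ptl;

		if (mm == &init_mm)
			return;
		pgd = mm->pgd + pgd_index(addr);
		BUG_ON(pgd_none(*pgd));
		p4d = p4d_offset(pgd, addr);
		BUG_ON(p4d_none(*p4d));
		pud = pud_offset(p4d, addr);
		BUG_ON(pud_none(*pud));
		pmd = pmd_offset(pud, addr);
		/*
		 * khugepaged clears *pmd with pmdp_collapse_flush() before it
		 * calls ptep_clear(), so a racing assert_pte_locked() can
		 * legitimately observe pmd_none() here; don't BUG in that case.
		 */
		if (pmd_none(*pmd))
			return;
		pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
		BUG_ON(!pte);
		assert_spin_locked(ptl);
		pte_unmap(pte);
	}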


-aneesh


Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation

2023-07-11 Thread Aneesh Kumar K V
On 7/11/23 10:49 PM, David Hildenbrand wrote:
> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>> Add a new kconfig option that can be selected if we want to allow
>> pageblock alignment by reserving pages in the vmemmap altmap area.
>> This implies we will be reserving some pages for every memoryblock
>> This also allows the memmap on memory feature to be widely useful
>> with different memory block size values.
> 
> "reserving pages" is a nice way of saying "wasting memory". :) Let's spell 
> that out.
> 
> I think we have to find a better name for this, and I think we should have a 
> toggle similar to memory_hotplug.memmap_on_memory. This should be an admin 
> decision, not some kernel config option.
> 
> 
> memory_hotplug.force_memmap_on_memory
> 
> "Enable the memmap on memory feature even if it could result in memory waste 
> due to memmap size limitations. For example, if the memmap for a memory block 
> requires 1 MiB, but the pageblock size is 2 MiB, 1 MiB
> of hotplugged memory will be wasted. Note that there are still cases where 
> the feature cannot be enforced: for example, if the memmap is smaller than a 
> single page, or if the architecture does not support the forced mode in all 
> configurations."
> 
> Thoughts?
> 

With the module parameter, do we still need the Kconfig option?

-aneesh



Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check

2023-07-11 Thread Aneesh Kumar K V
On 7/11/23 4:06 PM, David Hildenbrand wrote:
> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>> Some architectures would want different restrictions. Hence add an
>> architecture-specific override.
>>
>> Both the PMD_SIZE check and pageblock alignment check are moved there.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   mm/memory_hotplug.c | 17 -
>>   1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 1b19462f4e72..07c99b0cc371 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1247,12 +1247,20 @@ static int online_memory_block(struct memory_block 
>> *mem, void *arg)
>>   return device_online(>dev);
>>   }
>>   -static bool mhp_supports_memmap_on_memory(unsigned long size)
>> +#ifndef arch_supports_memmap_on_memory
> 
> Can we make that a __weak function instead?


We can. It is confusing because we have these two override patterns within
the kernel:

#ifndef x
#endif

vs

__weak x

What is the recommended way to override? I have mostly been using #ifndef for
arch overrides so far.
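
For reference, the two patterns look roughly like this (placeholder bodies,
only to illustrate the mechanics):

	/* Pattern 1: compile-time override, resolved by the preprocessor. */

	/* arch header: */
	#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
	static inline bool arch_supports_memmap_on_memory(unsigned long size)
	{
		return radix_enabled();		/* arch-specific policy */
	}

	/* generic code: */
	#ifndef arch_supports_memmap_on_memory
	static inline bool arch_supports_memmap_on_memory(unsigned long size)
	{
		return true;			/* generic default */
	}
	#endif

	/* Pattern 2: link-time override; an arch provides a strong definition
	 * that replaces the generic __weak one. */
	bool __weak arch_supports_memmap_on_memory(unsigned long size)
	{
		return true;
	}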


> 
>> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
>>   {
>> -    unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>> +    unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>>   unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>>   unsigned long remaining_size = size - vmemmap_size;
>>   +    return IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>> +    IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> 
> You're moving that check back to mhp_supports_memmap_on_memory() in the 
> following patch, where it actually belongs. So this check should stay in 
> mhp_supports_memmap_on_memory(). Might be reasonable to factor out the 
> vmemmap_size calculation.
> 
> 
> Also, let's a comment
> 
> /*
>  * As default, we want the vmemmap to span a complete PMD such that we
>  * can map the vmemmap using a single PMD if supported by the
>  * architecture.
>  */
> return IS_ALIGNED(vmemmap_size, PMD_SIZE);
> 
>> +}
>> +#endif
>> +
>> +static bool mhp_supports_memmap_on_memory(unsigned long size)
>> +{
>>   /*
>>    * Besides having arch support and the feature enabled at runtime, we
>>    * need a few more assumptions to hold true:
>> @@ -1280,9 +1288,8 @@ static bool mhp_supports_memmap_on_memory(unsigned 
>> long size)
>>    *   populate a single PMD.
>>    */
>>   return mhp_memmap_on_memory() &&
>> -   size == memory_block_size_bytes() &&
>> -   IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>> -   IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>> +    size == memory_block_size_bytes() &&
> 
> If you keep the properly aligned indentation, this will not be detected as a 
> change by git.
> 
>> +    arch_supports_memmap_on_memory(size);
>>   }
>>     /*
> 

Will update the code based on the above feedback.

-aneesh


Re: [PATCH v3 2/7] mm/hotplug: Allow memmap on memory hotplug request to fallback

2023-07-11 Thread Aneesh Kumar K V
On 7/11/23 3:53 PM, David Hildenbrand wrote:
>> -bool mhp_supports_memmap_on_memory(unsigned long size)
>> +static bool mhp_supports_memmap_on_memory(unsigned long size)
>>   {
>>   unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>>   unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>> @@ -1339,13 +1339,12 @@ int __ref add_memory_resource(int nid, struct 
>> resource *res, mhp_t mhp_flags)
>>    * Self hosted memmap array
>>    */
>>   if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> -    if (!mhp_supports_memmap_on_memory(size)) {
>> -    ret = -EINVAL;
>> -    goto error;
>> +    if (mhp_supports_memmap_on_memory(size)) {
>> +    mhp_altmap.free = PHYS_PFN(size);
>> +    mhp_altmap.base_pfn = PHYS_PFN(start);
>> +    params.altmap = _altmap;
>>   }
>> -    mhp_altmap.free = PHYS_PFN(size);
>> -    mhp_altmap.base_pfn = PHYS_PFN(start);
>> -    params.altmap = _altmap;
>> +    /* fallback to not using altmap  */
>>   }
>>     /* call arch's memory hotadd */
> 
> In general, LGTM, but please extend the documentation of the parameter in 
> memory_hotplug.h, stating that this is just a hint and that the core can 
> decide to no do that.
> 

will update

modified   include/linux/memory_hotplug.h
@@ -97,6 +97,8 @@ typedef int __bitwise mhp_t;
  * To do so, we will use the beginning of the hot-added range to build
  * the page tables for the memmap array that describes the entire range.
  * Only selected architectures support it with SPARSE_VMEMMAP.
+ * This is only a hint; the core kernel can decide not to do this based on
+ * various alignment checks.
  */
 #define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
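
Callers would then look something like this (sketch only, not from the
series):

	/* Request memmap-on-memory, but tolerate the core falling back. */
	mhp_t mhp_flags = MHP_NONE | MHP_MEMMAP_ON_MEMORY;

	rc = __add_memory(nid, start, size, mhp_flags);
	/*
	 * rc == 0 does not imply the memmap was actually placed on the
	 * hotplugged range; the core may ignore the hint when the
	 * alignment checks are not met.
	 */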



Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix

2023-07-11 Thread Aneesh Kumar K V
On 7/11/23 9:14 PM, David Hildenbrand wrote:
> On 11.07.23 17:40, Aneesh Kumar K V wrote:
>> On 7/11/23 8:56 PM, David Hildenbrand wrote:
>>> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>>>> Radix vmemmap mapping can map things correctly at the PMD level or PTE
>>>> level based on different device boundary checks. Hence we skip the
>>>> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
>>>> makes the feature widely useful because to use PMD_SIZE vmemmap area we
>>>> require a memory block size of 2GiB
>>>>
>>>> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature
>>>> can work with a memory block size of 256MB. Using altmap.reserve feature
>>>> to align things correctly at pageblock granularity. We can end up
>>>> losing some pages in memory with this. For ex: with a 256MiB memory block
>>>> size, we require 4 pages to map vmemmap pages, In order to align things
>>>> correctly we end up adding a reserve of 28 pages. ie, for every 4096
>>>> pages 28 pages get reserved.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V 
>>>> ---
>>>>    arch/powerpc/Kconfig  |  1 +
>>>>    arch/powerpc/include/asm/pgtable.h    | 28 +++
>>>>    .../platforms/pseries/hotplug-memory.c    |  3 +-
>>>>    mm/memory_hotplug.c   |  2 ++
>>>>    4 files changed, 33 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>>>> index 116d6add0bb0..f890907e5bbf 100644
>>>> --- a/arch/powerpc/Kconfig
>>>> +++ b/arch/powerpc/Kconfig
>>>> @@ -157,6 +157,7 @@ config PPC
>>>>    select ARCH_HAS_UBSAN_SANITIZE_ALL
>>>>    select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>>>    select ARCH_KEEP_MEMBLOCK
>>>> +    select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE    if PPC_RADIX_MMU
>>>>    select ARCH_MIGHT_HAVE_PC_PARPORT
>>>>    select ARCH_MIGHT_HAVE_PC_SERIO
>>>>    select ARCH_OPTIONAL_KERNEL_RWX    if ARCH_HAS_STRICT_KERNEL_RWX
>>>> diff --git a/arch/powerpc/include/asm/pgtable.h 
>>>> b/arch/powerpc/include/asm/pgtable.h
>>>> index 68817ea7f994..8e6c92dde6ad 100644
>>>> --- a/arch/powerpc/include/asm/pgtable.h
>>>> +++ b/arch/powerpc/include/asm/pgtable.h
>>>> @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x)
>>>>    int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
>>>> vmemmap_map_size);
>>>>    bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long 
>>>> start,
>>>>   unsigned long page_size);
>>>> +/*
>>>> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
>>>> + * some of the restrictions. We don't check for PMD_SIZE because our
>>>> + * vmemmap allocation code can fallback correctly. The pageblock
>>>> + * alignment requirement is met using altmap->reserve blocks.
>>>> + */
>>>> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
>>>> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
>>>> +{
>>>> +    unsigned long nr_pages = size >> PAGE_SHIFT;
>>>> +    unsigned long vmemmap_size = nr_pages * sizeof(struct page);
>>>> +
>>>> +    if (!radix_enabled())
>>>> +    return false;
>>>> +
>>>> +#ifdef CONFIG_PPC_4K_PAGES
>>>> +    return IS_ALIGNED(vmemmap_size, PMD_SIZE);
>>>> +#else
>>>> +    /*
>>>> + * Make sure the vmemmap allocation is fully contianed
>>>> + * so that we always allocate vmemmap memory from altmap area.
>>>> + * The pageblock alignment requirement is met by using
>>>> + * reserve blocks in altmap.
>>>> + */
>>>> +    return IS_ALIGNED(vmemmap_size,  PAGE_SIZE);
>>>
>>> Can we move that check into common code as well?
>>>
>>> If our (original) vmemmap size would not fit into a single page, we would 
>>> be in trouble on any architecture. Did not check if it would be an issue 
>>> for arm64 as well in case we would allow eventually wasting memory.
>>>
>>
>>
>> For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in 
>> arch_supports_memmap_on_memory(). That should i

Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix

2023-07-11 Thread Aneesh Kumar K V
On 7/11/23 8:56 PM, David Hildenbrand wrote:
> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>> Radix vmemmap mapping can map things correctly at the PMD level or PTE
>> level based on different device boundary checks. Hence we skip the
>> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
>> makes the feature widely useful because to use PMD_SIZE vmemmap area we
>> require a memory block size of 2GiB
>>
>> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature
>> can work with a memory block size of 256MB. Using altmap.reserve feature
>> to align things correctly at pageblock granularity. We can end up
>> losing some pages in memory with this. For ex: with a 256MiB memory block
>> size, we require 4 pages to map vmemmap pages, In order to align things
>> correctly we end up adding a reserve of 28 pages. ie, for every 4096
>> pages 28 pages get reserved.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   arch/powerpc/Kconfig  |  1 +
>>   arch/powerpc/include/asm/pgtable.h    | 28 +++
>>   .../platforms/pseries/hotplug-memory.c    |  3 +-
>>   mm/memory_hotplug.c   |  2 ++
>>   4 files changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 116d6add0bb0..f890907e5bbf 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -157,6 +157,7 @@ config PPC
>>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>   select ARCH_KEEP_MEMBLOCK
>> +    select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE    if PPC_RADIX_MMU
>>   select ARCH_MIGHT_HAVE_PC_PARPORT
>>   select ARCH_MIGHT_HAVE_PC_SERIO
>>   select ARCH_OPTIONAL_KERNEL_RWX    if ARCH_HAS_STRICT_KERNEL_RWX
>> diff --git a/arch/powerpc/include/asm/pgtable.h 
>> b/arch/powerpc/include/asm/pgtable.h
>> index 68817ea7f994..8e6c92dde6ad 100644
>> --- a/arch/powerpc/include/asm/pgtable.h
>> +++ b/arch/powerpc/include/asm/pgtable.h
>> @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x)
>>   int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
>> vmemmap_map_size);
>>   bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
>>  unsigned long page_size);
>> +/*
>> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
>> + * some of the restrictions. We don't check for PMD_SIZE because our
>> + * vmemmap allocation code can fallback correctly. The pageblock
>> + * alignment requirement is met using altmap->reserve blocks.
>> + */
>> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
>> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
>> +{
>> +    unsigned long nr_pages = size >> PAGE_SHIFT;
>> +    unsigned long vmemmap_size = nr_pages * sizeof(struct page);
>> +
>> +    if (!radix_enabled())
>> +    return false;
>> +
>> +#ifdef CONFIG_PPC_4K_PAGES
>> +    return IS_ALIGNED(vmemmap_size, PMD_SIZE);
>> +#else
>> +    /*
>> + * Make sure the vmemmap allocation is fully contianed
>> + * so that we always allocate vmemmap memory from altmap area.
>> + * The pageblock alignment requirement is met by using
>> + * reserve blocks in altmap.
>> + */
>> +    return IS_ALIGNED(vmemmap_size,  PAGE_SIZE);
> 
> Can we move that check into common code as well?
> 
> If our (original) vmemmap size would not fit into a single page, we would be 
> in trouble on any architecture. Did not check if it would be an issue for 
> arm64 as well in case we would allow eventually wasting memory.
> 


For x86 and arm64 we already do IS_ALIGNED(vmemmap_size, PMD_SIZE) in
arch_supports_memmap_on_memory(), and that should imply PAGE_SIZE alignment.
If arm64 allows the usage of altmap.reserve, I would expect its
arch_supports_memmap_on_memory() to gain the PAGE_SIZE check.

Doesn't adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() make it
a redundant check for x86 and arm64 currently?

modified   mm/memory_hotplug.c
@@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 	 */
 	if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
 		return false;
+
+	/*
+	 * Make sure the vmemmap allocation is fully contained
+	 * so that we always allocate vmemmap memory from the altmap area.
+	 */
+	if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE))
+		return false;
 	/*
 	 * Without page reservation remaining pages should be pageblock aligned.
 	 */
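
As a concrete example of what that extra check rejects (my numbers, assuming
a 64-byte struct page):

	/*
	 * 64K base pages, 16MB memory block: 256 struct pages * 64 = 16K
	 * of vmemmap, smaller than PAGE_SIZE (64K), so the vmemmap cannot
	 * be fully contained in the altmap and we fall back. A 256MB block
	 * gives 4096 * 64 = 256K = 4 full 64K pages, so it passes.
	 */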




Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation

2023-07-11 Thread Aneesh Kumar K V
On 7/11/23 11:51 AM, Huang, Ying wrote:
> "Aneesh Kumar K.V"  writes:
> 
>> Add a new kconfig option that can be selected if we want to allow
>> pageblock alignment by reserving pages in the vmemmap altmap area.
>> This implies we will be reserving some pages for every memoryblock
>> This also allows the memmap on memory feature to be widely useful
>> with different memory block size values.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  mm/Kconfig  |  9 +++
>>  mm/memory_hotplug.c | 59 +
>>  2 files changed, 58 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 932349271e28..88a1472b2086 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -570,6 +570,15 @@ config MHP_MEMMAP_ON_MEMORY
>>  depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
>>  depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>>  
>> +config MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY
>> +   bool "Allow Reserving pages for page block aligment"
>> +   depends on MHP_MEMMAP_ON_MEMORY
>> +   help
>> +This option allows memmap on memory feature to be more useful
>> +with different memory block sizes. This is achieved by marking some 
>> pages
>> +in each memory block as reserved so that we can get page-block alignment
>> +for the remaining pages.
>> +
>>  endif # MEMORY_HOTPLUG
>>  
>>  config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 07c99b0cc371..f36aec1f7626 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1252,15 +1252,17 @@ static inline bool 
>> arch_supports_memmap_on_memory(unsigned long size)
>>  {
>>  unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>>  unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>> -unsigned long remaining_size = size - vmemmap_size;
>>  
>> -return IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>> -IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>> +return IS_ALIGNED(vmemmap_size, PMD_SIZE);
>>  }
>>  #endif
>>  
>>  static bool mhp_supports_memmap_on_memory(unsigned long size)
>>  {
>> +unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>> +unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>> +unsigned long remaining_size = size - vmemmap_size;
>> +
>>  /*
>>   * Besides having arch support and the feature enabled at runtime, we
>>   * need a few more assumptions to hold true:
>> @@ -1287,9 +1289,30 @@ static bool mhp_supports_memmap_on_memory(unsigned 
>> long size)
>>   *   altmap as an alternative source of memory, and we do not 
>> exactly
>>   *   populate a single PMD.
>>   */
>> -return mhp_memmap_on_memory() &&
>> -size == memory_block_size_bytes() &&
>> -arch_supports_memmap_on_memory(size);
>> +if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
>> +return false;
>> + /*
>> +  * Without page reservation remaining pages should be pageblock 
>> aligned.
>> +  */
>> +if (!IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY) &&
>> +!IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)))
>> +return false;
>> +
>> +return arch_supports_memmap_on_memory(size);
>> +}
>> +
>> +static inline unsigned long memory_block_align_base(unsigned long size)
>> +{
>> +if (IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY)) {
>> +unsigned long align;
>> +unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>> +unsigned long vmemmap_size;
>> +
>> +vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> 
>> PAGE_SHIFT;
> 
> DIV_ROUND_UP()?
> 
>

Yes. Will update.

	vmemmap_size = DIV_ROUND_UP(nr_vmemmap_pages * sizeof(struct page), PAGE_SIZE);
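
The difference matters for the small block sizes (illustrative numbers,
assuming a 64-byte struct page and a 64K PAGE_SIZE):

	/*
	 * 16MB block: nr_vmemmap_pages = 256, so
	 * 256 * sizeof(struct page) = 16384 bytes.
	 *
	 *   16384 >> PAGE_SHIFT            == 0  (rounds down, wrong)
	 *   DIV_ROUND_UP(16384, PAGE_SIZE) == 1  (one vmemmap page, right)
	 */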

-aneesh



Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block

2023-07-07 Thread Aneesh Kumar K V
On 7/7/23 9:12 PM, David Hildenbrand wrote:
> On 07.07.23 15:30, Aneesh Kumar K V wrote:
>> On 7/7/23 5:47 PM, David Hildenbrand wrote:
>>> On 06.07.23 18:06, Aneesh Kumar K V wrote:
>>>> On 7/6/23 6:29 PM, David Hildenbrand wrote:
>>>>> On 06.07.23 14:32, Aneesh Kumar K V wrote:
>>>>>> On 7/6/23 4:44 PM, David Hildenbrand wrote:
>>>>>>> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>>>>>>>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>>>>>>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>>>>>>>> With memmap on memory, some architecture needs more details w.r.t 
>>>>>>>>>> altmap
>>>>>>>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>>>>>>>
>>>>>>>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>>>>>>>
>>>>>>>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of 
>>>>>>>>> the memblock (-> base_pfn) and use the stored number of vmemmap pages 
>>>>>>>>> to calculate the end_pfn?
>>>>>>>>>
>>>>>>>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover 
>>>>>>>>> full apgeblocks, memory onlining/offlining would be broken.
>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>
>>>>>>>>
>>>>>>>> With ppc64 and 64K pagesize and different memory block sizes, we can 
>>>>>>>> end up allocating vmemmap backing memory from outside altmap because
>>>>>>>> a single page vmemmap can cover 1024 pages (64 *1024/sizeof(struct 
>>>>>>>> page)). and that can point to pages outside the dev_pagemap range.
>>>>>>>> So on free we  check
>>>>>>>
>>>>>>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap 
>>>>>>> pages? That sound wrong (and is counter-intuitive to the feature in 
>>>>>>> general, where we *don't* want to allocate the vmemmap from outside the 
>>>>>>> altmap).
>>>>>>>
>>>>>>> (64 * 1024) / sizeof(struct page) -> 1024 pages
>>>>>>>
>>>>>>> 1024 pages * 64k = 64 MiB.
>>>>>>>
>>>>>>> What's the memory block size on these systems? If it's >= 64 MiB the 
>>>>>>> vmemmap of a single memory block fits into a single page and we should 
>>>>>>> be fine.
>>>>>>>
>>>>>>> Smells like you want to disable the feature on a 64k system.
>>>>>>>
>>>>>>
>>>>>> But that part of vmemmap_free is common for both dax,dax kmem and the 
>>>>>> new memmap on memory feature. ie, ppc64 vmemmap_free have checks which 
>>>>>> require
>>>>>> a full altmap structure with all the details in. So for memmap on 
>>>>>> memmory to work on ppc64 we do require similar altmap struct. Hence the 
>>>>>> idea
>>>>>> of adding vmemmap_altmap to  struct memory_block
>>>>>
>>>>> I'd suggest making sure that for the memmap_on_memory case your really 
>>>>> *always* allocate from the altmap (that's what the feature is about after 
>>>>> all), and otherwise block the feature (i.e., arch_mhp_supports_... should 
>>>>> reject it).
>>>>>
>>>>
>>>> Sure. How about?
>>>>
>>>> bool mhp_supports_memmap_on_memory(unsigned long size)
>>>> {
>>>>
>>>>  unsigned long nr_pages = size >> PAGE_SHIFT;
>>>>  unsigned long vmemmap_size = nr_pages * sizeof(struct page);
>>>>
>>>>  if (!radix_enabled())
>>>>  return false;
>>>>  /*
>>>>   * memmap on memory only supported with memory block size add/remove
>>>>   */
>>>>  if (size != memory_block_size_bytes())
>>>>  return false;
>>>>  /*
>>>>   * Also make sure the vmemmap allocation is fully contianed
>>>>   * so that we always allocate vmemmap memory from altmap area.
>>>>   */
>>>>  if (!IS_ALIG

Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block

2023-07-07 Thread Aneesh Kumar K V
On 7/7/23 5:47 PM, David Hildenbrand wrote:
> On 06.07.23 18:06, Aneesh Kumar K V wrote:
>> On 7/6/23 6:29 PM, David Hildenbrand wrote:
>>> On 06.07.23 14:32, Aneesh Kumar K V wrote:
>>>> On 7/6/23 4:44 PM, David Hildenbrand wrote:
>>>>> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>>>>>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>>>>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>>>>>> With memmap on memory, some architecture needs more details w.r.t 
>>>>>>>> altmap
>>>>>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>>>>>
>>>>>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>>>>>
>>>>>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the 
>>>>>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to 
>>>>>>> calculate the end_pfn?
>>>>>>>
>>>>>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover 
>>>>>>> full apgeblocks, memory onlining/offlining would be broken.
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>>
>>>>>> With ppc64 and 64K pagesize and different memory block sizes, we can end 
>>>>>> up allocating vmemmap backing memory from outside altmap because
>>>>>> a single page vmemmap can cover 1024 pages (64 *1024/sizeof(struct 
>>>>>> page)). and that can point to pages outside the dev_pagemap range.
>>>>>> So on free we  check
>>>>>
>>>>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap 
>>>>> pages? That sound wrong (and is counter-intuitive to the feature in 
>>>>> general, where we *don't* want to allocate the vmemmap from outside the 
>>>>> altmap).
>>>>>
>>>>> (64 * 1024) / sizeof(struct page) -> 1024 pages
>>>>>
>>>>> 1024 pages * 64k = 64 MiB.
>>>>>
>>>>> What's the memory block size on these systems? If it's >= 64 MiB the 
>>>>> vmemmap of a single memory block fits into a single page and we should be 
>>>>> fine.
>>>>>
>>>>> Smells like you want to disable the feature on a 64k system.
>>>>>
>>>>
>>>> But that part of vmemmap_free is common for both dax,dax kmem and the new 
>>>> memmap on memory feature. ie, ppc64 vmemmap_free have checks which require
>>>> a full altmap structure with all the details in. So for memmap on memmory 
>>>> to work on ppc64 we do require similar altmap struct. Hence the idea
>>>> of adding vmemmap_altmap to  struct memory_block
>>>
>>> I'd suggest making sure that for the memmap_on_memory case your really 
>>> *always* allocate from the altmap (that's what the feature is about after 
>>> all), and otherwise block the feature (i.e., arch_mhp_supports_... should 
>>> reject it).
>>>
>>
>> Sure. How about?
>>
>> bool mhp_supports_memmap_on_memory(unsigned long size)
>> {
>>
>> unsigned long nr_pages = size >> PAGE_SHIFT;
>> unsigned long vmemmap_size = nr_pages * sizeof(struct page);
>>
>> if (!radix_enabled())
>>     return false;
>> /*
>>  * memmap on memory only supported with memory block size add/remove
>>  */
>> if (size != memory_block_size_bytes())
>>     return false;
>> /*
>>  * Also make sure the vmemmap allocation is fully contianed
>>  * so that we always allocate vmemmap memory from altmap area.
>>  */
>> if (!IS_ALIGNED(vmemmap_size,  PAGE_SIZE))
>>     return false;
>> /*
>>  * The pageblock alignment requirement is met by using
>>  * reserve blocks in altmap.
>>  */
>> return true;
>> }
> 
> Better, but the PAGE_SIZE that could be added to common code as well.
> 
> ... but, the pageblock check in common code implies a PAGE_SIZE check, so why 
> do we need any other check besides the radix_enabled() check for arm64 and 
> just keep all the other checks in common code as they are?
> 
> If your vmemmap does not cover full pageblocks (which implies full pages), 
> the feature cannot be used *unless* we'd waste altmap space in the vmemmap to 
> cover one pageblock.
> 
> Wasting hotplugged memory certainly sounds wrong?
> 
> 
> So I appreciate if you could explain why the pageblock check should not be 
> had for ppc64?
> 

If we want things to be aligned to a pageblock (2M), we would have to use 2M
of vmemmap space, and that implies a memory block size of 2G with a 64K page
size. That requirement makes the feature not useful at all on power. The
compromise I came to is what I mentioned in the commit message for enabling
the feature on ppc64.

We use the altmap.reserve feature to align things correctly at pageblock
granularity. We can end up losing some pages in memory with this. For
example: with a 256MB memory block size, we require 4 pages to map the
vmemmap pages; in order to align things correctly we end up adding a reserve
of 28 pages, i.e., for every 4096 pages, 28 pages get reserved.
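
The 2G figure falls out of the same arithmetic (assuming a 64-byte struct
page):

	/*
	 * With 64K base pages, vmemmap_size = (block_size / 64K) * 64 =
	 * block_size / 1024. For that to be a multiple of a 2MiB pageblock,
	 * block_size must be a multiple of 2MiB * 1024 = 2GiB -- hence the
	 * 2G memory block requirement if we insisted on a pageblock-aligned
	 * vmemmap instead of using the altmap reserve.
	 */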


-aneesh







Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block

2023-07-06 Thread Aneesh Kumar K V
On 7/6/23 6:29 PM, David Hildenbrand wrote:
> On 06.07.23 14:32, Aneesh Kumar K V wrote:
>> On 7/6/23 4:44 PM, David Hildenbrand wrote:
>>> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>>>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>>>
>>>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>>>
>>>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the 
>>>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to 
>>>>> calculate the end_pfn?
>>>>>
>>>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover 
>>>>> full apgeblocks, memory onlining/offlining would be broken.
>>>>>
>>>>> [...]
>>>>
>>>>
>>>> With ppc64 and 64K pagesize and different memory block sizes, we can end 
>>>> up allocating vmemmap backing memory from outside altmap because
>>>> a single page vmemmap can cover 1024 pages (64 *1024/sizeof(struct page)). 
>>>> and that can point to pages outside the dev_pagemap range.
>>>> So on free we  check
>>>
>>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap 
>>> pages? That sound wrong (and is counter-intuitive to the feature in 
>>> general, where we *don't* want to allocate the vmemmap from outside the 
>>> altmap).
>>>
>>> (64 * 1024) / sizeof(struct page) -> 1024 pages
>>>
>>> 1024 pages * 64k = 64 MiB.
>>>
>>> What's the memory block size on these systems? If it's >= 64 MiB the 
>>> vmemmap of a single memory block fits into a single page and we should be 
>>> fine.
>>>
>>> Smells like you want to disable the feature on a 64k system.
>>>
>>
>> But that part of vmemmap_free is common for both dax,dax kmem and the new 
>> memmap on memory feature. ie, ppc64 vmemmap_free have checks which require
>> a full altmap structure with all the details in. So for memmap on memmory to 
>> work on ppc64 we do require similar altmap struct. Hence the idea
>> of adding vmemmap_altmap to  struct memory_block
> 
> I'd suggest making sure that for the memmap_on_memory case your really 
> *always* allocate from the altmap (that's what the feature is about after 
> all), and otherwise block the feature (i.e., arch_mhp_supports_... should 
> reject it).
>

Sure. How about?

bool mhp_supports_memmap_on_memory(unsigned long size)
{

unsigned long nr_pages = size >> PAGE_SHIFT;
unsigned long vmemmap_size = nr_pages * sizeof(struct page);

if (!radix_enabled())
return false;
/*
 * memmap on memory only supported with memory block size add/remove
 */
if (size != memory_block_size_bytes())
return false;
/*
 * Also make sure the vmemmap allocation is fully contained
 * so that we always allocate vmemmap memory from altmap area.
 */
if (!IS_ALIGNED(vmemmap_size,  PAGE_SIZE))
return false;
/*
 * The pageblock alignment requirement is met by using
 * reserve blocks in altmap.
 */
return true;
}



 
> Then, you can reconstruct the altmap layout trivially
> 
> base_pfn: start of the range to unplug
> end_pfn: base_pfn + nr_vmemmap_pages
> 
> and pass that to the removal code, which will do the right thing, no?
> 
> 
> Sure, remembering the altmap might be a potential cleanup (eventually?), but 
> the basic reasoning why this is required as patch #1 IMHO is wrong: if you 
> say you support memmap_on_memory for a configuration, then you should also 
> properly support it (allocate from the hotplugged memory), not silently fall 
> back to something else.

I guess you want to keep the altmap introduction as a later patch in the
series and not as the preparatory patch? Or are you OK with just adding the
additional check I mentioned above w.r.t. the size value and keeping this
patch as patch 1, as a generic cleanup (avoiding the recomputation of
altmap->alloc/base_pfn/end_pfn)?

-aneesh
 



Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block

2023-07-06 Thread Aneesh Kumar K V
On 7/6/23 4:44 PM, David Hildenbrand wrote:
> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>
>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>
>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the 
>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to 
>>> calculate the end_pfn?
>>>
>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover full 
>>> apgeblocks, memory onlining/offlining would be broken.
>>>
>>> [...]
>>
>>
>> With ppc64 and 64K pagesize and different memory block sizes, we can end up 
>> allocating vmemmap backing memory from outside altmap because
>> a single page vmemmap can cover 1024 pages (64 *1024/sizeof(struct page)). 
>> and that can point to pages outside the dev_pagemap range.
>> So on free we  check
> 
> So you end up with a mixture of altmap and ordinarily-allocated vmemmap 
> pages? That sound wrong (and is counter-intuitive to the feature in general, 
> where we *don't* want to allocate the vmemmap from outside the altmap).
> 
> (64 * 1024) / sizeof(struct page) -> 1024 pages
> 
> 1024 pages * 64k = 64 MiB.
> 
> What's the memory block size on these systems? If it's >= 64 MiB the vmemmap 
> of a single memory block fits into a single page and we should be fine.
> 
> Smells like you want to disable the feature on a 64k system.
>

But that part of vmemmap_free() is common to dax, dax kmem and the new memmap
on memory feature, i.e., the ppc64 vmemmap_free() has checks which require a
full altmap structure with all the details in it. So for memmap on memory to
work on ppc64 we require a similar altmap struct. Hence the idea of adding
vmemmap_altmap to struct memory_block.

 
>>
>> vmemmap_free() {
>> ...
>> if (altmap) {
>>     alt_start = altmap->base_pfn;
>>     alt_end = altmap->base_pfn + altmap->reserve +
>>   altmap->free + altmap->alloc + altmap->align;
>> }
>>
>> ...
>>     if (base_pfn >= alt_start && base_pfn < alt_end) {
>>     vmem_altmap_free(altmap, nr_pages);
>>
>> to see whether we did use altmap for the vmemmap allocation.
>>


-aneesh




Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag

2023-07-06 Thread Aneesh Kumar K V
On 7/6/23 2:54 PM, David Hildenbrand wrote:
> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>> Instead of checking for memmap on memory feature enablement within the
>> functions checking for alignment, use the kernel parameter to control the
>> memory hotplug flags. The generic kernel now enables memmap on memory
>> feature if the hotplug flag request for the same.
>>
>> The ACPI code now can pass the flag unconditionally because the kernel will
>> fallback to not using the feature if the alignment rules are not met.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   drivers/acpi/acpi_memhotplug.c |  3 +--
>>   include/linux/memory_hotplug.h | 14 ++
>>   mm/memory_hotplug.c    | 35 +++---
>>   3 files changed, 26 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
>> index 24f662d8bd39..4d0096fc4cc2 100644
>> --- a/drivers/acpi/acpi_memhotplug.c
>> +++ b/drivers/acpi/acpi_memhotplug.c
>> @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct 
>> acpi_memory_device *mem_device)
>>   if (!info->length)
>>   continue;
>>   -    if (mhp_supports_memmap_on_memory(info->length))
>> -    mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>> +    mhp_flags |= get_memmap_on_memory_flags();
>>   result = __add_memory(mgid, info->start_addr, info->length,
>>     mhp_flags);
>>   diff --git a/include/linux/memory_hotplug.h 
>> b/include/linux/memory_hotplug.h
>> index a769f44b8368..af7017122506 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size);
>>   bool __mhp_supports_memmap_on_memory(unsigned long size);
>>   #endif /* CONFIG_MEMORY_HOTPLUG */
>>   +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
>> +extern bool memmap_on_memory;
>> +static inline unsigned long get_memmap_on_memory_flags(void)
>> +{
>> +    if (memmap_on_memory)
>> +    return MHP_MEMMAP_ON_MEMORY;
>> +    return 0;
>> +}
>> +#else
>> +static inline unsigned long get_memmap_on_memory_flags(void)
>> +{
>> +    return 0;
>> +}
>> +#endif
> 
> That's kind-of ugly TBH.
> 
> 
> Why do we need this change?
> 

I was trying to avoid the rest of the kernel doing:

if (mhp_supports_memmap_on_memory(info->length))
 mhp_flags |= MHP_MEMMAP_ON_MEMORY;

I was also following a pattern similar to:

static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
{
return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
}


The overall goal of the patch is to handle the fallback of not using the
altmap/memmap-on-memory inside add_memory_resource(), with the callsites only
expressing the desire to use memmap on memory if it is possible/enabled.

-aneesh


Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block

2023-07-06 Thread Aneesh Kumar K V
On 7/6/23 2:48 PM, David Hildenbrand wrote:
> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>> With memmap on memory, some architecture needs more details w.r.t altmap
>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
> 
> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
> 
> IOW, why can't ppc64 simply allocate the vmemmap from the start of the 
> memblock (-> base_pfn) and use the stored number of vmemmap pages to 
> calculate the end_pfn?
> 
> To rephrase: if the vmemmap is not at the beginning and doesn't cover full 
> apgeblocks, memory onlining/offlining would be broken.
> 
> [...]


With ppc64, a 64K page size and different memory block sizes, we can end up
allocating vmemmap backing memory from outside the altmap, because a single
vmemmap page can cover 1024 pages (64 * 1024 / sizeof(struct page)), and
those can point to pages outside the dev_pagemap range.
So on free we check:

vmemmap_free() {
...
if (altmap) {
alt_start = altmap->base_pfn;
alt_end = altmap->base_pfn + altmap->reserve +
  altmap->free + altmap->alloc + altmap->align;
}

...
if (base_pfn >= alt_start && base_pfn < alt_end) {
vmem_altmap_free(altmap, nr_pages);

to see whether we did use altmap for the vmemmap allocation. 
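
To spell out the 1024 figure (assuming a 64-byte struct page):

	/*
	 * One 64K vmemmap page holds 64K / sizeof(struct page) = 1024
	 * struct pages, which describe 1024 * 64K = 64MiB of memory. With
	 * memory blocks smaller than 64MiB, a single vmemmap page therefore
	 * spans several blocks, and part of its backing can come from
	 * outside the altmap of the range being added -- which is why
	 * vmemmap_free() needs the full altmap range to decide how to free
	 * it.
	 */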

> 
>>   +/**
>> + * struct vmem_altmap - pre-allocated storage for vmemmap_populate
>> + * @base_pfn: base of the entire dev_pagemap mapping
>> + * @reserve: pages mapped, but reserved for driver use (relative to @base)
>> + * @free: free pages set aside in the mapping for memmap storage
>> + * @align: pages reserved to meet allocation alignments
>> + * @alloc: track pages consumed, private to vmemmap_populate()
>> + */
>> +struct vmem_altmap {
>> +    unsigned long base_pfn;
>> +    const unsigned long end_pfn;
>> +    const unsigned long reserve;
>> +    unsigned long free;
>> +    unsigned long align;
>> +    unsigned long alloc;
>> +};
> 
> Instead of embedding that, what about conditionally allocating it and store a 
> pointer to it in the "struct memory_block"?
> 
> In the general case as of today, we don't have an altmap.
> 

Sure, but with the memmap on memory option we are essentially adding that,
right? Is the concern related to the increase in the size of
struct memory_block?

>> +
>>   struct memory_block {
>>   unsigned long start_section_nr;
>>   unsigned long state;    /* serialized by the dev->lock */
>> @@ -77,11 +94,7 @@ struct memory_block {
>>    */
>>   struct zone *zone;
>>   struct device dev;
>> -    /*
>> - * Number of vmemmap pages. These pages
>> - * lay at the beginning of the memory block.
>> - */
>> -    unsigned long nr_vmemmap_pages;
>> +    struct vmem_altmap altmap;
>>   struct memory_group *group;    /* group (if any) for this block */
>>   struct list_head group_next;    /* next block inside memory group */
>>   #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG)
>> @@ -147,7 +160,7 @@ static inline int hotplug_memory_notifier(notifier_fn_t 
>> fn, int pri)
>>   extern int register_memory_notifier(struct notifier_block *nb);
>>   extern void unregister_memory_notifier(struct notifier_block *nb);
>>   int create_memory_block_devices(unsigned long start, unsigned long size,
> 
> [...]
> 
>>   static int check_cpu_on_node(int nid)
>> @@ -2036,9 +2042,8 @@ EXPORT_SYMBOL(try_offline_node);
>>     static int __ref try_remove_memory(u64 start, u64 size)
>>   {
>> -    struct vmem_altmap mhp_altmap = {};
>> +    int ret;
>>   struct vmem_altmap *altmap = NULL;
>> -    unsigned long nr_vmemmap_pages;
>>   int rc = 0, nid = NUMA_NO_NODE;
>>     BUG_ON(check_hotplug_memory_range(start, size));
>> @@ -2060,24 +2065,16 @@ static int __ref try_remove_memory(u64 start, u64 
>> size)
>>    * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>>    * the same granularity it was added - a single memory block.
>>    */
>> +
> 
> ^ unrealted change?
> 



Re: [PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix

2023-07-06 Thread Aneesh Kumar K V
On 7/6/23 2:37 PM, David Hildenbrand wrote:
> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>> Radix vmemmap mapping can map things correctly at the PMD level or PTE
>> level based on different device boundary checks. We also use altmap.reserve
>> feature to align things correctly at pageblock granularity. We can end up
>> loosing some pages in memory with this. For ex: with 256MB memory block
>> size, we require 4 pages to map vmemmap pages, In order to align things
>> correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages
>> 28 pages get reserved.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   arch/powerpc/Kconfig  |  1 +
>>   arch/powerpc/mm/book3s64/radix_pgtable.c  | 28 +++
>>   .../platforms/pseries/hotplug-memory.c    |  4 ++-
>>   3 files changed, 32 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 116d6add0bb0..f890907e5bbf 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -157,6 +157,7 @@ config PPC
>>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>   select ARCH_KEEP_MEMBLOCK
>> +    select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE    if PPC_RADIX_MMU
>>   select ARCH_MIGHT_HAVE_PC_PARPORT
>>   select ARCH_MIGHT_HAVE_PC_SERIO
>>   select ARCH_OPTIONAL_KERNEL_RWX    if ARCH_HAS_STRICT_KERNEL_RWX
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
>> b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index a62729f70f2a..c0bd60b5fb64 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
>>     return 1;
>>   }
>> +
>> +/*
>> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
>> + * some of the restrictions. We don't check for PMD_SIZE because our
>> + * vmemmap allocation code can fallback correctly. The pageblock
> 
> x86 also has the fallback IIRC; the concern is rather that you end up with a 
> PTE-mapped vmemmap, which is slower at runtime than a PMD-mapped vmemmap.


The memory block size depends on a config option at the hypervisor for power,
and we can have values ranging from 16MB to 512MB. With these different
memory block sizes, I was thinking that relaxing these checks makes the
feature more useful. Also, with a page size of 64K, using a 2M vmemmap
requires a larger memory block size.

> 
>> + * alignment requirement is met using altmap->reserve blocks.
>> + */
>> +bool mhp_supports_memmap_on_memory(unsigned long size)
>> +{
>> +    if (!radix_enabled())
>> +    return false;
>> +    /*
>> + * The pageblock alignment requirement is met by using
>> + * reserve blocks in altmap.
>> + */
>> +    return size == memory_block_size_bytes();
>> +}
> 
> I really dislike such arch overrides.
> 
> I think the flow should be something like that, having a generic:
> 
> bool mhp_supports_memmap_on_memory(unsigned long size)
> {
> ...
> return arch_mhp_supports_memmap_on_memory(size)) &&
>    size == memory_block_size_bytes() &&
>    ...
> }
> 

Sure we can do that. But for ppc64 we also want to skip the PMD_SIZE and 
pageblock
alignment restrictions. 

> where we'll also keep the pageblock check here.

For ppc64, I converted that pageblock rule into a reserve-block allocation
via altmap.
I am not sure whether we want to do that outside ppc64? 

> 
> And a ppc specific
> 
> bool arch_mhp_supports_memmap_on_memory(unsigned long size)
> {
> /*
>  * Don't check for the vmemmap covering PMD_SIZE, we accept that
>  * we might get a PTE-mapped vmemmap.
>  */
> return radix_enabled();
> }
> 
> We can then move the PMD_SIZE check into arch specific code (x86-aarch64).
> 

Sure, will rework the changes accordingly. 
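Roughly along the lines of your sketch (untested; names and placement here are
just illustrative, the final split may differ):

/* generic side, mm/memory_hotplug.c */
bool mhp_supports_memmap_on_memory(unsigned long size)
{
	/*
	 * Keep the common restriction here: memory is only added and
	 * removed at single memory block granularity.
	 */
	return arch_mhp_supports_memmap_on_memory(size) &&
	       size == memory_block_size_bytes();
}

/*
 * ppc64 side: pageblock alignment is met via altmap->reserve and the
 * vmemmap code can fall back to PTE mappings, so no PMD_SIZE check.
 */
bool arch_mhp_supports_memmap_on_memory(unsigned long size)
{
	return radix_enabled();
}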

-aneesh


Re: [PATCH v2 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-06-27 Thread Aneesh Kumar K V
On 6/28/23 7:03 AM, Ritesh Harjani (IBM) wrote:
> "Aneesh Kumar K.V"  writes:



>> +int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
>> +unsigned long addr, unsigned long next)
>> +{
>> +int large = pmd_large(*pmd);
>> +
>> +if (pmd_large(*pmd))
> 
> we already got the value of pmd_large into "large" variable.
> we can use just if (large) right?
> 
>> +vmemmap_verify((pte_t *)pmd, node, addr, next);
> 
> maybe we can use pmdp_ptep() function here which we used in the 1st patch?
> also shouldn't this be pmdp in the function argument instead of pmd?
> 

updated

>> +
>> +return large;
>> +}
>> +
>> +void __meminit vmemmap_set_pmd(pmd_t *pmdp, void *p, int node,
>> +   unsigned long addr, unsigned long next)
>> +{
>> +pte_t entry;
>> +pte_t *ptep = pmdp_ptep(pmdp);
>> +
>> +VM_BUG_ON(!IS_ALIGNED(addr, PMD_SIZE));
>> +entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>> +set_pte_at(_mm, addr, ptep, entry);
>> +asm volatile("ptesync": : :"memory");
>> +
>> +vmemmap_verify(ptep, node, addr, next);
>> +}
>> +
>> +static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmd, unsigned 
>> long addr, int node,
>> + struct vmem_altmap *altmap,
>> + struct page *reuse)
>> +{
>> +pte_t *pte = pte_offset_kernel(pmd, addr);
>> +
>> +if (pte_none(*pte)) {
>> +pte_t entry;
>> +void *p;
>> +
>> +if (!reuse) {
>> +/*
>> + * make sure we don't create altmap mappings
>> + * covering things outside the device.
>> + */
>> +if (altmap && altmap_cross_boundary(altmap, addr, 
>> PAGE_SIZE))
>> +altmap = NULL;
>> +
>> +p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> +if (!p) {
>> +if (altmap)
>> +p = vmemmap_alloc_block_buf(PAGE_SIZE, 
>> node, NULL);
>> +if (!p)
>> +return NULL;
>> +}
> 
> Above if conditions are quite confusing when looking for the 1st time?
> Can we do this? Did I get it right?
> 
> if (!p && altmap)
>   p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
> 
> if (!p)
>   return NULL;
> 

updated


-aneesh


Re: [PATCH v2 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage

2023-06-27 Thread Aneesh Kumar K V
On 6/28/23 6:53 AM, Ritesh Harjani (IBM) wrote:
> "Aneesh Kumar K.V"  writes:



>>  
>>  static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
>> @@ -292,9 +320,18 @@ static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
>>  return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
>>  }
>>  
>> +static inline pud_t radix__pud_mkdevmap(pud_t pud)
>> +{
>> +return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP));
>> +}
>> +
>> +struct vmem_altmap;
>> +struct dev_pagemap;
> 
> Minor nit.
> I guess this struct dev_pagemap is meant to be added in a later patch.

Moved that change to a later patch.

> 
>>  extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
>>   unsigned long page_size,
>>   unsigned long phys);
>> +int __meminit radix__vmemmap_populate(unsigned long start, unsigned long 
>> end,
>> +  int node, struct vmem_altmap *altmap);
>>  extern void radix__vmemmap_remove_mapping(unsigned long start,
>>  unsigned long page_size);
>>  
>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
>> b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
>> index 77797a2a82eb..a38542259fab 100644
>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
>> @@ -68,6 +68,8 @@ void radix__flush_tlb_pwc_range_psize(struct mm_struct 
>> *mm, unsigned long start,
>>unsigned long end, int psize);
>>  extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end);
>> +extern void radix__flush_pud_tlb_range(struct vm_area_struct *vma,
>> +   unsigned long start, unsigned long end);
>>  extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned 
>> long start,
>>  unsigned long end);
>>  extern void radix__flush_tlb_kernel_range(unsigned long start, unsigned 
>> long end);
>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
>> b/arch/powerpc/include/asm/book3s/64/tlbflush.h
>> index 0d0c1447ecf0..a01c20a8fbf7 100644
>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
>> @@ -50,6 +50,14 @@ static inline void flush_pmd_tlb_range(struct 
>> vm_area_struct *vma,
>>  radix__flush_pmd_tlb_range(vma, start, end);
>>  }
>>  
>> +#define __HAVE_ARCH_FLUSH_PUD_TLB_RANGE
>> +static inline void flush_pud_tlb_range(struct vm_area_struct *vma,
>> +   unsigned long start, unsigned long end)
>> +{
>> +if (radix_enabled())
>> +radix__flush_pud_tlb_range(vma, start, end);
>> +}
>> +
>>  #define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
>>  static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
>> unsigned long start,
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
>> b/arch/powerpc/mm/book3s64/pgtable.c
>> index 85c84e89e3ea..9e5f01a1738c 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -64,11 +64,39 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, 
>> unsigned long address,
>>  return changed;
>>  }
>>  
>> +int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
>> +  pud_t *pudp, pud_t entry, int dirty)
>> +{
>> +int changed;
>> +#ifdef CONFIG_DEBUG_VM
>> +WARN_ON(!pud_devmap(*pudp));
> 
> just a query -
> for pmdp_set_access_flags() we had
> WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> 
> so why don't we require the same check here?
> 

Because we don't support generic anon PUD-level THP, and we want to catch
wrong usage.

>> +assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
>> +#endif
>> +changed = !pud_same(*(pudp), entry);
>> +if (changed) {
>> +/*
>> + * We can use MMU_PAGE_2M here, because only radix
> 
> s/MMU_PAGE_2M/MMY_PAGE_1G
> 

updated

>> + * path look at the psize.
>> + */
>> +__ptep_set_access_flags(vma, pudp_ptep(pudp),
>> +pud_pte(entry), address, MMU_PAGE_1G);
>> +}
>> +return changed;
>> +}
>> +
>> +
>>  int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>>unsigned long address, pmd_t *pmdp)
>>  {
>>  return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
>>  }
>> +
>> +int pudp_test_and_clear_young(struct vm_area_struct *vma,
>> +  unsigned long address, pud_t *pudp)
>> +{
>> +return __pudp_test_and_clear_young(vma->vm_mm, address, pudp);
>> +}
>> +
>>  /*
>>   * set a new huge pmd. We should not be called for updating
>>   * an existing pmd entry. That should go via pmd_hugepage_update.
>> @@ 

Re: [PATCH v2 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization

2023-06-27 Thread Aneesh Kumar K V
On 6/28/23 6:39 AM, Ritesh Harjani (IBM) wrote:
> "Aneesh Kumar K.V"  writes:
> 
>> Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
>> optimization includes an update of both the permissions (writeable to
>> read-only) and the output address (pfn) of the vmemmap ptes. That is not
>> supported without unmapping of pte(marking it invalid) by some
>> architectures.
>>
>> With DAX vmemmap optimization we don't require such pte updates and
>> architectures can enable DAX vmemmap optimization while having hugetlb
>> vmemmap optimization disabled. Hence split DAX optimization support into a
>> different config.
>>
>> loongarch and riscv don't have devdax support. So the DAX config is not
>> enabled for them. With this change, arm64 should be able to select DAX
>> optimization
>>
>> [1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable 
>> HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/loongarch/Kconfig | 2 +-
>>  arch/riscv/Kconfig | 2 +-
>>  arch/x86/Kconfig   | 3 ++-
>>  fs/Kconfig | 2 +-
>>  include/linux/mm.h | 2 +-
>>  mm/Kconfig | 5 -
>>  6 files changed, 10 insertions(+), 6 deletions(-)
> 
> what about s390?
> 
> git grep "ARCH_WANT_OPTIMIZE_VMEMMAP" .
> arch/s390/Kconfig:  select ARCH_WANT_OPTIMIZE_VMEMMAP
> 

Thanks for catching that. Updated

-aneesh



Re: [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix

2023-06-21 Thread Aneesh Kumar K V
On 6/21/23 9:38 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V"  writes:
>> This should not be within CONFIG_PPC_64S_HASHS_MMU. We use mmu_vmemmap_psize
>> on radix while mapping the vmemmap area.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
>>  1 file changed, 2 deletions(-)
> 
> This breaks microwatt_defconfig, which does not enable 
> CONFIG_PPC_64S_HASH_MMU:
> 
>   ../arch/powerpc/mm/book3s64/radix_pgtable.c: In function 
> ‘radix__early_init_mmu’:
>   ../arch/powerpc/mm/book3s64/radix_pgtable.c:601:27: error: lvalue required 
> as left operand of assignment
> 601 | mmu_virtual_psize = MMU_PAGE_4K;
> |   ^
>   make[5]: *** [../scripts/Makefile.build:252: 
> arch/powerpc/mm/book3s64/radix_pgtable.o] Error 1
>   make[4]: *** [../scripts/Makefile.build:494: arch/powerpc/mm/book3s64] 
> Error 2
>   make[3]: *** [../scripts/Makefile.build:494: arch/powerpc/mm] Error 2
>   make[2]: *** [../scripts/Makefile.build:494: arch/powerpc] Error 2
>   make[2]: *** Waiting for unfinished jobs
>   make[1]: *** [/home/michael/linux/Makefile:2026: .] Error 2
>   make: *** [Makefile:226: __sub-make] Error 2
> 
> Because mmu_virtual_psize is defined in hash_utils.c, which isn't built.
> 

Ok, I missed the mmu_virtual_psize dependency there. Will add
microwatt_defconfig to the build configs. 


modified   arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -594,12 +594,14 @@ void __init radix__early_init_mmu(void)
 {
unsigned long lpcr;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
 #ifdef CONFIG_PPC_64K_PAGES
/* PAGE_SIZE mappings */
mmu_virtual_psize = MMU_PAGE_64K;
 #else
mmu_virtual_psize = MMU_PAGE_4K;
 #endif
+#endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
/* vmemmap mapping */



> cheers



Re: [RFC PATCH] fs/hugetlb: Fix UBSAN warning reported on hugetlb

2022-09-08 Thread Aneesh Kumar K V
On 9/8/22 10:23 PM, Matthew Wilcox wrote:
> On Thu, Sep 08, 2022 at 12:56:59PM +0530, Aneesh Kumar K.V wrote:
>> +++ b/fs/dax.c
>> @@ -1304,7 +1304,7 @@ EXPORT_SYMBOL_GPL(dax_zero_range);
>>  int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>>  const struct iomap_ops *ops)
>>  {
>> -unsigned int blocksize = i_blocksize(inode);
>> +size_t blocksize = i_blocksize(inode);
>>  unsigned int off = pos & (blocksize - 1);
> 
> If blocksize is larger than 4GB, then off also needs to be size_t.
> 
>> +++ b/fs/iomap/buffered-io.c
>> @@ -955,7 +955,7 @@ int
>>  iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>>  const struct iomap_ops *ops)
>>  {
>> -unsigned int blocksize = i_blocksize(inode);
>> +size_t blocksize = i_blocksize(inode);
>>  unsigned int off = pos & (blocksize - 1);
> 
> Ditto.
> 
> (maybe there are others; I didn't check closely)

Thanks, will check those. 

Any feedback on statx? Should we really fix that?

I am still not clear why we chose to set blocksize = pagesize for hugetlbfs.
Was that done to enable applications to find the hugetlb page size via stat()? 

-aneesh


Re: [PATCH] mm: gup: fix the fast GUP race against THP collapse

2022-09-06 Thread Aneesh Kumar K V
On 9/7/22 12:37 AM, Yang Shi wrote:
> On Mon, Sep 5, 2022 at 1:56 AM Aneesh Kumar K.V
>  wrote:
>>
>> Yang Shi  writes:
>>
>>>
>>> On Fri, Sep 2, 2022 at 9:00 AM Peter Xu  wrote:

 On Thu, Sep 01, 2022 at 04:50:45PM -0700, Yang Shi wrote:
> On Thu, Sep 1, 2022 at 4:26 PM Peter Xu  wrote:
>>
>> Hi, Yang,
>>
>> On Thu, Sep 01, 2022 at 03:27:07PM -0700, Yang Shi wrote:
>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no 
>>> longer
>>> sufficient to handle concurrent GUP-fast in all cases, it only handles
>>> traditional IPI-based GUP-fast correctly.
>>
>> If TLB flush (or, IPI broadcasts) used to work to protect against 
>> gup-fast,
>> I'm kind of confused why it's not sufficient even if with RCU gup?  Isn't
>> that'll keep working as long as interrupt disabled (which current 
>> fast-gup
>> will still do)?
>
> Actually the wording was copied from David's commit log for his
> PageAnonExclusive fix. My understanding is the IPI broadcast still
> works, but it may not be supported by all architectures and not
> preferred anymore. So we should avoid depending on IPI broadcast IIUC.
>
>>
>> IIUC the issue is you suspect not all archs correctly implemented
>> pmdp_collapse_flush(), or am I wrong?
>
> This is a possible fix, please see below for details.
>
>>
>>> On architectures that send
>>> an IPI broadcast on TLB flush, it works as expected.  But on the
>>> architectures that do not use IPI to broadcast TLB flush, it may have
>>> the below race:
>>>
>>>CPU A  CPU B
>>> THP collapse fast GUP
>>>   gup_pmd_range() <-- see 
>>> valid pmd
>>>   gup_pte_range() <-- 
>>> work on pte
>>> pmdp_collapse_flush() <-- clear pmd and flush
>>> __collapse_huge_page_isolate()
>>> check page pinned <-- before GUP bump refcount
>>>   pin the page
>>>   check PTE <-- no 
>>> change
>>> __collapse_huge_page_copy()
>>> copy data to huge page
>>> ptep_clear()
>>> install huge pmd for the huge page
>>>   return the stale 
>>> page
>>> discard the stale page
>>>
>>> The race could be fixed by checking whether PMD is changed or not after
>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>> should back off.
>>
>> Could the race also be fixed by impl pmdp_collapse_flush() correctly for
>> the archs that are missing? Do you know which arch(s) is broken with it?
>
> Yes, and this was suggested by me in the first place, but per the
> suggestion from John and David, this is not the preferred way. I think
> it is because:
>
> Firstly, using IPI to serialize against fast GUP is not recommended
> anymore since fast GUP does check PTE then back off so we should avoid
> it.
> Secondly, if checking PMD then backing off could solve the problem,
> why do we still need broadcast IPI? It doesn't sound performant.
>
>>
>> It's just not clear to me whether this patch is an optimization or a fix,
>> if it's a fix whether the IPI broadcast in ppc pmdp_collapse_flush() 
>> would
>> still be needed.
>
> It is a fix and the fix will make IPI broadcast not useful anymore.

 How about another patch to remove the ppc impl too?  Then it can be a two
 patches series.
>>>
>>> BTW, I don't think we could remove the ppc implementation since it is
>>> different from the generic pmdp_collapse_flush(), particularly for the
>>> hash part IIUC.
>>>
>>> The generic version calls flush_tlb_range() -> hash__flush_tlb_range()
>>> for hash, but the hash call is actually no-op. The ppc version calls
>>> hash__pmdp_collapse_flush() -> flush_tlb_pmd_range(), which does
>>> something useful.
>>>
>>
>> We should actually rename flush_tlb_pmd_range(). It actually flush the
>> hash page table entries.
>>
>> I will do the below patch for ppc64 to clarify this better
> 
> Thanks, Aneesh. It looks more readable. A follow-up question, I think
> we could remove serialize_against_pte_lookup(), which just issues IPI
> broadcast to run a dummy function. This IPI should not be needed
> anymore with my patch. Of course, we need to keep the memory barrier.
> 


For radix translation, yes. For hash we still need that. W.r.t the memory
barrier, radix does use radix__flush_tlb_collapsed_pmd() which does a TLB
invalidate. IIUC that will enforce 

Re: [PATCH v3] powerpc/mm: Support execute-only memory on the Radix MMU

2022-08-09 Thread Aneesh Kumar K V
On 8/9/22 11:21 AM, Christophe Leroy wrote:
> Le 09/08/2022 à 04:44, Russell Currey a écrit :
>> The Hash MMU already supports XOM (i.e. mmap with PROT_EXEC only)
>> through the execute-only pkey.  A PROT_EXEC-only mapping will actually
>> map to RX, and then the pkey will be applied on top of it.
> 
> I don't think XOM is a commonly understood accronym. Maybe the first 
> time you use it it'd be better to say something like:
> 
> The Hash MMU already supports execute-only memory (XOM)
> 
> When you say that Hash MMU supports it through the execute-only pkey, 
> does it mean that it is taken into account automatically at mmap time, 
> or does the userspace app has to do something special to use the key ? 
> If it is the second, it means that depending on whether you are radix or 
> not, you must do something different ? Is that expected ?
> 
>>
>> Radix doesn't have pkeys, but it does have execute permissions built-in
>> to the MMU, so all we have to do to support XOM is expose it.
>>
>> Signed-off-by: Russell Currey 
>> ---
>> v3: Incorporate Aneesh's suggestions, leave protection_map untouched
>> Basic test: https://github.com/ruscur/junkcode/blob/main/mmap_test.c
>>
>>   arch/powerpc/include/asm/book3s/64/pgtable.h |  2 ++
>>   arch/powerpc/mm/book3s64/pgtable.c   | 11 +--
>>   arch/powerpc/mm/fault.c  |  6 +-
>>   3 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 392ff48f77df..486902aff040 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -151,6 +151,8 @@
>>   #define PAGE_COPY_X__pgprot(_PAGE_BASE | _PAGE_READ | _PAGE_EXEC)
>>   #define PAGE_READONLY  __pgprot(_PAGE_BASE | _PAGE_READ)
>>   #define PAGE_READONLY_X__pgprot(_PAGE_BASE | _PAGE_READ | _PAGE_EXEC)
>> +/* Radix only, Hash uses PAGE_READONLY_X + execute-only pkey instead */
>> +#define PAGE_EXECONLY   __pgprot(_PAGE_BASE | _PAGE_EXEC)
>>   
>>   /* Permission masks used for kernel mappings */
>>   #define PAGE_KERNEL__pgprot(_PAGE_BASE | _PAGE_KERNEL_RW)
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
>> b/arch/powerpc/mm/book3s64/pgtable.c
>> index 7b9966402b25..62f63d344596 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -553,8 +553,15 @@ EXPORT_SYMBOL_GPL(memremap_compat_align);
>>   
>>   pgprot_t vm_get_page_prot(unsigned long vm_flags)
>>   {
>> -unsigned long prot = pgprot_val(protection_map[vm_flags &
>> -(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]);
>> +unsigned long prot;
>> +
>> +/* Radix supports execute-only, but protection_map maps X -> RX */
>> +if (radix_enabled() && ((vm_flags & (VM_READ|VM_WRITE|VM_EXEC)) == 
>> VM_EXEC)) {
> 
> Maybe use VM_ACCESS_FLAGS ?
> 
>> +prot = pgprot_val(PAGE_EXECONLY);
>> +} else {
>> +prot = pgprot_val(protection_map[vm_flags &
>> +  (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]);
>> +}
>>   
>>  if (vm_flags & VM_SAO)
>>  prot |= _PAGE_SAO;
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 014005428687..59e4cbcf3109 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -270,7 +270,11 @@ static bool access_error(bool is_write, bool is_exec, 
>> struct vm_area_struct *vma
>>  return false;
>>  }
>>   
>> -if (unlikely(!vma_is_accessible(vma)))
>> +/* On Radix, a read fault could be from PROT_NONE or PROT_EXEC */
>> +if (unlikely(radix_enabled() && !(vma->vm_flags & VM_READ)))
>> +return true;
> 
> Why do you need the radix_enabled() here ?
> Even if it doesn't fault directly, reading a non readable area is still 
> an error and should be handled as such, even on hardware that will not 
> generate a fault for it at the first place. So I'd just do:
> 
>   if (!(vma->vm_flags & VM_READ)))
>   return true;
> 
>> +/* Check for a PROT_NONE fault on other MMUs */
>> +else if (unlikely(!vma_is_accessible(vma)))
>>  return true;
>>  /*
>>   * We should ideally do the vma pkey access check here. But in the
> 
> Don't use an if/else construct, there is no other 'else' in that 
> function, or in similar functions like bad_kernel_fault() for instance.
> 
> So leave the !vma_is_accessible(vma) untouched and add your check as a 
> standalone check before or after it.


What does the vma_is_accessible() check bring if we make the VM_READ check
unconditional?
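
To make the question concrete, for a read fault the relevant part of
access_error() would then be (simplified sketch, other cases elided):

	/* read fault */
	if (!(vma->vm_flags & VM_READ))
		return true;	/* already covers PROT_NONE and exec-only */

	if (unlikely(!vma_is_accessible(vma)))
		return true;	/* what does this still catch for a read? */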

-aneesh




Re: [PATCH v2 2/2] powerpc/mm: Support execute-only memory on the Radix MMU

2022-08-08 Thread Aneesh Kumar K V
On 8/8/22 6:31 PM, Russell Currey wrote:
> The Hash MMU already supports XOM (i.e. mmap with PROT_EXEC only)
> through the execute-only pkey.  A PROT_EXEC-only mapping will actually
> map to RX, and then the pkey will be applied on top of it.
> 
> Radix doesn't have pkeys, but it does have execute permissions built-in
> to the MMU, so all we have to do to support XOM is expose it.
> 
> That's not possible with protection_map being const, so make it RO after
> init instead.
> 
> Signed-off-by: Russell Currey 
> ---
> v2: Make protection_map __ro_after_init and set it in an initcall
> (v1 didn't work, I tested before rebasing on Anshuman's patches)
> 
> basic test: https://raw.githubusercontent.com/ruscur/junkcode/main/mmap_test.c
> 
>  arch/powerpc/include/asm/book3s/64/radix.h |  3 +++
>  arch/powerpc/include/asm/pgtable.h |  1 -
>  arch/powerpc/mm/fault.c| 10 ++
>  arch/powerpc/mm/pgtable.c  | 16 +++-
>  4 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
> b/arch/powerpc/include/asm/book3s/64/radix.h
> index 686001eda936..bf316b773d73 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -19,6 +19,9 @@
>  #include 
>  #endif
>  
> +/* Execute-only page protections, Hash can use RX + execute-only pkey */
> +#define PAGE_EXECONLY__pgprot(_PAGE_BASE | _PAGE_EXEC)
> +
>  /* An empty PTE can still have a R or C writeback */
>  #define RADIX_PTE_NONE_MASK  (_PAGE_DIRTY | _PAGE_ACCESSED)
>  
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 33f4bf8d22b0..3cbb6de20f9d 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -60,7 +60,6 @@ extern void paging_init(void);
>  void poking_init(void);
>  
>  extern unsigned long ioremap_bot;
> -extern const pgprot_t protection_map[16];
>  
>  /*
>   * kern_addr_valid is intended to indicate whether an address is a valid
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 014005428687..887c0cc45ca6 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -270,6 +270,16 @@ static bool access_error(bool is_write, bool is_exec, 
> struct vm_area_struct *vma
>   return false;
>   }
>  
> + if (unlikely(!(vma->vm_flags & VM_READ))) {
> + /*
> +  * If we're on Radix, then this could be a read attempt on
> +  * execute-only memory.  On other MMUs, an "exec-only" page
> +  * will be given RX flags, so this might be redundant.
> +  */
> + if (radix_enabled())
> + return true;
> + }
> +
>   if (unlikely(!vma_is_accessible(vma)))
>   return true;
>   /*
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 0b2bbde5fb65..6e1a6a999c3c 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -475,7 +475,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>  EXPORT_SYMBOL_GPL(__find_linux_pte);
>  
>  /* Note due to the way vm flags are laid out, the bits are XWR */
> -const pgprot_t protection_map[16] = {
> +static pgprot_t protection_map[16] __ro_after_init = {
>   [VM_NONE]   = PAGE_NONE,
>   [VM_READ]   = PAGE_READONLY,
>   [VM_WRITE]  = PAGE_COPY,
> @@ -494,6 +494,20 @@ const pgprot_t protection_map[16] = {
>   [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]  = PAGE_SHARED_X
>  };
>  
> +#ifdef CONFIG_PPC_RADIX_MMU
> +static int __init radix_update_protection_map(void)
> +{
> + if (early_radix_enabled()) {
> + /* Radix directly supports execute-only page protections */
> + protection_map[VM_EXEC] = PAGE_EXECONLY;
> + protection_map[VM_EXEC | VM_SHARED] = PAGE_EXECONLY;
> + }
> +
> + return 0;
> +}
> +arch_initcall(radix_update_protection_map);

Instead of this, can we do it in vm_get_page_prot()?

	/* EXEC only, shared or non-shared mapping? */
	if (radix_enabled() &&
	    ((vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) == VM_EXEC))
		prot = pgprot_val(PAGE_EXECONLY);


> +#endif /* CONFIG_PPC_RADIX_MMU */
> +
>  #ifdef CONFIG_PPC_BOOK3S_64
>  pgprot_t vm_get_page_prot(unsigned long vm_flags)
>  {



Re: [PATCH] powerpc/mm: Support execute-only memory on the Radix MMU

2022-08-08 Thread Aneesh Kumar K V
On 8/8/22 5:28 PM, Russell Currey wrote:
> The Hash MMU already supports XOM (i.e. mmap with PROT_EXEC only)
> through the execute-only pkey.  A PROT_ONLY mapping will actually map to
> RX, and then the pkey will be applied on top of it.
> 
> Radix doesn't have pkeys, but it does have execute permissions built-in
> to the MMU, so all we have to do to support XOM is expose it.
> 
> Signed-off-by: Russell Currey 
> ---
> quick test: https://raw.githubusercontent.com/ruscur/junkcode/main/mmap_test.c
> I can make it a selftest.
> 
>  arch/powerpc/include/asm/book3s/64/radix.h |  3 +++
>  arch/powerpc/mm/book3s64/radix_pgtable.c   |  4 
>  arch/powerpc/mm/fault.c| 10 ++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
> b/arch/powerpc/include/asm/book3s/64/radix.h
> index 686001eda936..bf316b773d73 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -19,6 +19,9 @@
>  #include 
>  #endif
>  
> +/* Execute-only page protections, Hash can use RX + execute-only pkey */
> +#define PAGE_EXECONLY__pgprot(_PAGE_BASE | _PAGE_EXEC)
> +
>  /* An empty PTE can still have a R or C writeback */
>  #define RADIX_PTE_NONE_MASK  (_PAGE_DIRTY | _PAGE_ACCESSED)
>  
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 698274109c91..2edb56169805 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -617,6 +617,10 @@ void __init radix__early_init_mmu(void)
>   __pmd_frag_nr = RADIX_PMD_FRAG_NR;
>   __pmd_frag_size_shift = RADIX_PMD_FRAG_SIZE_SHIFT;
>  
> + /* Radix directly supports execute-only page protections */
> + protection_map[VM_EXEC] = PAGE_EXECONLY;
> + protection_map[VM_EXEC | VM_SHARED] = PAGE_EXECONLY;
> +
>   radix_init_pgtable();
>  
>   if (!firmware_has_feature(FW_FEATURE_LPAR)) {
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 014005428687..887c0cc45ca6 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -270,6 +270,16 @@ static bool access_error(bool is_write, bool is_exec, 
> struct vm_area_struct *vma
>   return false;
>   }
>  
> + if (unlikely(!(vma->vm_flags & VM_READ))) {
> + /*
> +  * If we're on Radix, then this could be a read attempt on
> +  * execute-only memory.  On other MMUs, an "exec-only" page
> +  * will be given RX flags, so this might be redundant.
> +  */
> + if (radix_enabled())
> + return true;
> + }
> +


Should we do 

	/* This covers both PROT_NONE (due to the check above) and exec-only mappings */
	if (radix_enabled() && !(vma->vm_flags & VM_READ))
		return true;
	/* PROT_NONE check */
	else if (!vma_is_accessible(vma))
		return true;

	return false;



>   if (unlikely(!vma_is_accessible(vma)))
>   return true;
>   /*

-aneesh


Re: [PATCH v2] powerpc/mm: Use is_vmalloc_addr to validate addr

2022-07-04 Thread Aneesh Kumar K V
On 7/4/22 12:43 PM, Christophe Leroy wrote:
> 
> 
> Le 04/07/2022 à 08:39, Aneesh Kumar K.V a écrit :
>> Instead of high_memory use is_vmalloc_addr to validate that the address is
>> not in the vmalloc range.
> 
> 
> Do we really need even more extra checks, and a function that is not 
> inlined anymore ?
> 
> virt_addr_valid() used to be pretty simple. Some extra tests were added 
> by commit ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit 
> Book3E & 32-bit") in order to work around some corner cases, and the 
> commit message say they are temporary.
> 
> virt_addr_valid() is there to check that an address is a valid linear 
> mapping, not that an address IS NOT a vmalloc address. What will happen 
> with your check if you pass an address that is from an ioremap done 
> prior to the start of the vmalloc system ?
> 

I was expecting the IO range to be handled by pfn_valid(). Is there a memory
layout ASCII diagram for book3e/64 like the one at asm/book3s/64/radix.h:51?
My goal with the change was to make it more explicit what is being validated. 

> WIth the series I send last week to add KASAN to book3e/64, we now have 
> VMALLOC above PAGE_OFFSET on all platforms so we should be able to come 
> back to the original virt_addr_valid(), based exclusively on pfn_valid() 
> for PPC64, and pfn_valid() && high_memory for PPC32 (Or maybe only for 
> PPC32 having HIGHMEM)
> 
> 
> Christophe
> 
> 
>>
>> Cc: Kefeng Wang 
>> Cc: Christophe Leroy 
>> Signed-off-by: Aneesh Kumar K.V 
>> -
>> ---
>>   arch/powerpc/include/asm/page.h | 10 --
>>   arch/powerpc/mm/mem.c   | 11 +++
>>   2 files changed, 15 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/page.h 
>> b/arch/powerpc/include/asm/page.h
>> index e5f75c70eda8..977835570db3 100644
>> --- a/arch/powerpc/include/asm/page.h
>> +++ b/arch/powerpc/include/asm/page.h
>> @@ -131,12 +131,10 @@ static inline bool pfn_valid(unsigned long pfn)
>>   #define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
>>   #define virt_to_page(kaddr)pfn_to_page(virt_to_pfn(kaddr))
>>   #define pfn_to_kaddr(pfn)  __va((pfn) << PAGE_SHIFT)
>> -
>> -#define virt_addr_valid(vaddr)  ({  
>> \
>> -unsigned long _addr = (unsigned long)vaddr; \
>> -_addr >= PAGE_OFFSET && _addr < (unsigned long)high_memory &&   \
>> -pfn_valid(virt_to_pfn(_addr));  \
>> -})
>> +#ifndef __ASSEMBLY__
>> +extern bool __virt_addr_valid(unsigned long kaddr);
>> +#endif
>> +#define virt_addr_valid(kaddr)  __virt_addr_valid((unsigned long) 
>> (kaddr))
>>   
>>   /*
>>* On Book-E parts we need __va to parse the device tree and we can't
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index 7b0d286bf9ba..622f8bac808b 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -406,3 +406,14 @@ int devmem_is_allowed(unsigned long pfn)
>>* the EHEA driver. Drop this when drivers/net/ethernet/ibm/ehea is 
>> removed.
>>*/
>>   EXPORT_SYMBOL_GPL(walk_system_ram_range);
>> +
>> +bool __virt_addr_valid(unsigned long kaddr)
>> +{
>> +if (kaddr < PAGE_OFFSET)
>> +return false;
>> +if (is_vmalloc_addr((void *) kaddr))
>> +return false;
>> +return pfn_valid(virt_to_pfn(kaddr));
>> +}
>> +EXPORT_SYMBOL(__virt_addr_valid);
>> +



Re: [PATCH v3] powerpc/memhotplug: Add add_pages override for PPC

2022-06-29 Thread Aneesh Kumar K V
On 6/29/22 12:00 PM, Kefeng Wang wrote:
> Hi,
> 
> On 2022/6/29 13:09, Aneesh Kumar K.V wrote:
>> With commit ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E 
>> & 32-bit")
>> the kernel now validate the addr against high_memory value. This results
>> in the below BUG_ON with dax pfns.
>>
>> [  635.798741][T26531] kernel BUG at mm/page_alloc.c:5521!
>> 1:mon> e
>> cpu 0x1: Vector: 700 (Program Check) at [c7287630]
>>  pc: c055ed48: free_pages.part.0+0x48/0x110
>>  lr: c053ca70: tlb_finish_mmu+0x80/0xd0
>>  sp: c72878d0
>>     msr: 8282b033
>>    current = 0xcafabe00
>>    paca    = 0xc0037300   irqmask: 0x03   irq_happened: 0x05
>>  pid   = 26531, comm = 50-landscape-sy
>> kernel BUG at :5521!
>> Linux version 5.19.0-rc3-14659-g4ec05be7c2e1 (kvaneesh@ltc-boston8) (gcc 
>> (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 
>> 2.34) #625 SMP Thu Jun 23 00:35:43 CDT 2022
>> 1:mon> t
>> [link register   ] c053ca70 tlb_finish_mmu+0x80/0xd0
>> [c72878d0] c053ca54 tlb_finish_mmu+0x64/0xd0 (unreliable)
>> [c7287900] c0539424 exit_mmap+0xe4/0x2a0
>> [c72879e0] c019fc1c mmput+0xcc/0x210
>> [c7287a20] c0629230 begin_new_exec+0x5e0/0xf40
>> [c7287ae0] c070b3cc load_elf_binary+0x3ac/0x1e00
>> [c7287c10] c0627af0 bprm_execve+0x3b0/0xaf0
>> [c7287cd0] c0628414 do_execveat_common.isra.0+0x1e4/0x310
>> [c7287d80] c062858c sys_execve+0x4c/0x60
>> [c7287db0] c002c1b0 system_call_exception+0x160/0x2c0
>> [c7287e10] c000c53c system_call_common+0xec/0x250
>>
>> The fix is to make sure we update high_memory on memory hotplug.
>> This is similar to what x86 does in commit 3072e413e305 ("mm/memory_hotplug: 
>> introduce add_pages")
>>
>> Fixes: ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 
>> 32-bit")
>> Cc: Kefeng Wang 
>> Cc: Christophe Leroy 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>> Changes from v2:
>> * drop WARN_ON_ONCE
>> * check for error from __add_pages
>>
>>   arch/powerpc/Kconfig  |  4 
>>   arch/powerpc/mm/mem.c | 33 -
>>   2 files changed, 36 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index c2ce2e60c8f0..7aa12e88c580 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -358,6 +358,10 @@ config ARCH_SUSPEND_NONZERO_CPU
>>   def_bool y
>>   depends on PPC_POWERNV || PPC_PSERIES
>>   +config ARCH_HAS_ADD_PAGES
>> +    def_bool y
>> +    depends on ARCH_ENABLE_MEMORY_HOTPLUG
>> +
>>   config PPC_DCR_NATIVE
>>   bool
>>   diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index 52b77684acda..a97128a48817 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -105,6 +105,37 @@ void __ref arch_remove_linear_mapping(u64 start, u64 
>> size)
>>   vm_unmap_aliases();
>>   }
>>   +/*
>> + * After memory hotplug the variables max_pfn, max_low_pfn and high_memory 
>> need
>> + * updating.
>> + */
>> +static void update_end_of_memory_vars(u64 start, u64 size)
>> +{
>> +    unsigned long end_pfn = PFN_UP(start + size);
>> +
>> +    if (end_pfn > max_pfn) {
>> +    max_pfn = end_pfn;
>> +    max_low_pfn = end_pfn;
>> +    high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
>> +    }
>> +}
>> +
>> +int __ref add_pages(int nid, unsigned long start_pfn, unsigned long 
>> nr_pages,
>> +    struct mhp_params *params)
>> +{
>> +    int ret;
>                 int ret = -EINVAL;
>> +
>> +    ret = __add_pages(nid, start_pfn, nr_pages, params);
>> +    if (ret)
>> +    return ret;
>> +

Considering we are updating ret immediately, why should we initialize it to
-EINVAL?

int ret = -EINVAL;
ret = __add_pages(nid, start_pfn, nr_pages, params);



>> +    /* update max_pfn, max_low_pfn and high_memory */
>> +    update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> +  nr_pages << PAGE_SHIFT);
>> +
>> +    return ret;
>> +}
>> +
> and could we only call update_end_of_memory_vars() in arch_add_memory()?
>>   int __ref arch_add_memory(int nid, u64 start, u64 size,
>>     struct mhp_params *params)
>>   {
>> @@ -115,7 +146,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>>   rc = arch_create_linear_mapping(nid, start, size, params);
>>   if (rc)
>>   return rc;
>> -    rc = __add_pages(nid, start_pfn, nr_pages, params);
>> +    rc = add_pages(nid, start_pfn, nr_pages, params);
>>   if (rc)
>>   arch_remove_linear_mapping(start, size);
> 
> if (!rc)
> 
>     update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << 
> PAGE_SHIFT);
> 
> else
> 
>     arch_remove_linear_mapping(start, size);
> 
> Thanks
> 

commit 3072e413e305 goes into the details of why it is done in 

Re: [PATCH V2] powerpc/memhotplug: Add add_pages override for PPC

2022-06-28 Thread Aneesh Kumar K V
On 6/28/22 6:26 PM, Michael Ellerman wrote:
> "Aneesh Kumar K.V"  writes:
>> With commit ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E 
>> & 32-bit")
>> the kernel now validate the addr against high_memory value. This results
>> in the below BUG_ON with dax pfns.
>>
>> [  635.798741][T26531] kernel BUG at mm/page_alloc.c:5521!
>> 1:mon> e
>> cpu 0x1: Vector: 700 (Program Check) at [c7287630]
>> pc: c055ed48: free_pages.part.0+0x48/0x110
>> lr: c053ca70: tlb_finish_mmu+0x80/0xd0
>> sp: c72878d0
>>msr: 8282b033
>>   current = 0xcafabe00
>>   paca= 0xc0037300   irqmask: 0x03   irq_happened: 0x05
>> pid   = 26531, comm = 50-landscape-sy
>> kernel BUG at :5521!
>> Linux version 5.19.0-rc3-14659-g4ec05be7c2e1 (kvaneesh@ltc-boston8) (gcc 
>> (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 
>> 2.34) #625 SMP Thu Jun 23 00:35:43 CDT 2022
>> 1:mon> t
>> [link register   ] c053ca70 tlb_finish_mmu+0x80/0xd0
>> [c72878d0] c053ca54 tlb_finish_mmu+0x64/0xd0 (unreliable)
>> [c7287900] c0539424 exit_mmap+0xe4/0x2a0
>> [c72879e0] c019fc1c mmput+0xcc/0x210
>> [c7287a20] c0629230 begin_new_exec+0x5e0/0xf40
>> [c7287ae0] c070b3cc load_elf_binary+0x3ac/0x1e00
>> [c7287c10] c0627af0 bprm_execve+0x3b0/0xaf0
>> [c7287cd0] c0628414 do_execveat_common.isra.0+0x1e4/0x310
>> [c7287d80] c062858c sys_execve+0x4c/0x60
>> [c7287db0] c002c1b0 system_call_exception+0x160/0x2c0
>> [c7287e10] c000c53c system_call_common+0xec/0x250
>>
>> The fix is to make sure we update high_memory on memory hotplug.
>> This is similar to what x86 does in commit 3072e413e305 ("mm/memory_hotplug: 
>> introduce add_pages")
>>
>> Fixes: ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 
>> 32-bit")
>> Cc: Kefeng Wang 
>> Cc: Christophe Leroy 
>> Signed-off-by: Aneesh Kumar K.V 
>  
> ...
> 
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index 52b77684acda..2a63920c369d 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -105,6 +105,36 @@ void __ref arch_remove_linear_mapping(u64 start, u64 
>> size)
>>  vm_unmap_aliases();
>>  }
>>  
>> +/*
>> + * After memory hotplug the variables max_pfn, max_low_pfn and high_memory 
>> need
>> + * updating.
>> + */
>> +static void update_end_of_memory_vars(u64 start, u64 size)
>> +{
>> +unsigned long end_pfn = PFN_UP(start + size);
>> +
>> +if (end_pfn > max_pfn) {
>> +max_pfn = end_pfn;
>> +max_low_pfn = end_pfn;
>> +high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
>> +}
>> +}
>> +
>> +int __ref add_pages(int nid, unsigned long start_pfn, unsigned long 
>> nr_pages,
>> +struct mhp_params *params)
>> +{
>> +int ret;
>> +
>> +ret = __add_pages(nid, start_pfn, nr_pages, params);
>> +WARN_ON_ONCE(ret);
> 
> What's the justification for making this a WARN_ON_ONCE(), and then
> continuing to update the variables anyway?
> 
> I realise that's what x86 does, but it seems kind of wrong.
> 
>

That is a copy-paste bug from my side. I guess we should skip updating the
vars on __add_pages failure.
WARN_ON_ONCE() was done as part of fe8b868eccb9f85a0e231e35f0abac5b39bac801. 
The original code 
did print a WARN all the time. So the above commit made it conditional. Later 
in ea0854170c95245a258b386c7a9314399c949fe0
the return value was never checked. So the error path is not handled there at 
all. 

we should do 

if (!ret) update_end_of_memory_vars(...); 
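
i.e. something like the below (untested):

int __ref add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
		    struct mhp_params *params)
{
	int ret;

	ret = __add_pages(nid, start_pfn, nr_pages, params);
	if (ret)
		return ret;

	/* update max_pfn, max_low_pfn and high_memory */
	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
				  nr_pages << PAGE_SHIFT);
	return ret;
}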

Do you want a v3 with the changes? 

-aneesh




Re: [PATCH 2/2] powerpc/papr_scm: Fix build failure when CONFIG_PERF_EVENTS is not set

2022-03-23 Thread Aneesh Kumar K V

On 3/22/22 10:02 PM, Dan Williams wrote:

On Tue, Mar 22, 2022 at 7:30 AM kajoljain  wrote:




On 3/22/22 03:09, Dan Williams wrote:

On Fri, Mar 18, 2022 at 4:42 AM Kajol Jain  wrote:


The following build failure occures when CONFIG_PERF_EVENTS is not set
as generic pmu functions are not visible in that scenario.

arch/powerpc/platforms/pseries/papr_scm.c:372:35: error: ‘struct perf_event’ 
has no member named ‘attr’
  p->nvdimm_events_map[event->attr.config],
^~
In file included from ./include/linux/list.h:5,
  from ./include/linux/kobject.h:19,
  from ./include/linux/of.h:17,
  from arch/powerpc/platforms/pseries/papr_scm.c:5:
arch/powerpc/platforms/pseries/papr_scm.c: In function 
‘papr_scm_pmu_event_init’:
arch/powerpc/platforms/pseries/papr_scm.c:389:49: error: ‘struct perf_event’ 
has no member named ‘pmu’
   struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
  ^~
./include/linux/container_of.h:18:26: note: in definition of macro 
‘container_of’
   void *__mptr = (void *)(ptr); \
   ^~~
arch/powerpc/platforms/pseries/papr_scm.c:389:30: note: in expansion of macro 
‘to_nvdimm_pmu’
   struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
   ^
In file included from ./include/linux/bits.h:22,
  from ./include/linux/bitops.h:6,
  from ./include/linux/of.h:15,
  from arch/powerpc/platforms/pseries/papr_scm.c:5:

Fix the build issue by adding check for CONFIG_PERF_EVENTS config option
and disabling the papr_scm perf interface support incase this config
is not set

Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support") (Commit id
based on linux-next tree)
Signed-off-by: Kajol Jain 
---
  arch/powerpc/platforms/pseries/papr_scm.c | 15 +++


This is a bit messier than I would have liked mainly because it dumps
a bunch of ifdefery into a C file contrary to coding style, "Wherever
possible, don't use preprocessor conditionals (#if, #ifdef) in .c
files". I would expect this all to move to an organization like:


Hi Dan,
   Thanks for reviewing the patches. Inorder to avoid the multiple
ifdefs checks, we can also add stub function for papr_scm_pmu_register.
With that change we will just have one ifdef check for
CONFIG_PERF_EVENTS config in both papr_scm.c and nd.h file. Hence we can
avoid adding new files specific for papr_scm perf interface.

Below is the code snippet for that change, let me know if looks fine to
you. I tested it
with set/unset PAPR_SCM config value and set/unset PERF_EVENTS config
value combinations.

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
b/arch/powerpc/platforms/pseries/papr_scm.c
index 4dd513d7c029..38fabb44d3c3 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -69,8 +69,6 @@
  #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
  #define PAPR_SCM_PERF_STATS_VERSION 0x1

-#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu)
-
  /* Struct holding a single performance metric */
  struct papr_scm_perf_stat {
 u8 stat_id[8];
@@ -346,6 +344,9 @@ static ssize_t drc_pmem_query_stats(struct
papr_scm_priv *p,
 return 0;
  }

+#ifdef CONFIG_PERF_EVENTS
+#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu)
+
  static int papr_scm_pmu_get_value(struct perf_event *event, struct
device *dev, u64 *count)
  {
 struct papr_scm_perf_stat *stat;
@@ -558,6 +559,10 @@ static void papr_scm_pmu_register(struct
papr_scm_priv *p)
 dev_info(>pdev->dev, "nvdimm pmu didn't register rc=%d\n", rc);
  }

+#else
+static inline void papr_scm_pmu_register(struct papr_scm_priv *p) { }


Since this isn't in a header file, it does not need to be marked
"inline" the compiler will figure it out.


+#endif


It might be time to create:

arch/powerpc/platforms/pseries/papr_scm.h

...there is quite a bit of header material accrued in papr_scm.c and
once the ifdefs start landing in it then it becomes a nominal coding
style issue. That said, this is certainly more palatable than the
previous version. So if you don't want to create papr_scm.h yet for
this, at least make a note in the changelog that the first portion of
arch/powerpc/platforms/pseries/papr_scm.c is effectively papr_scm.h
content and may move there in the future, or something like that.


There was an attempt to do that earlier. We should get that reposted 
with further changes to move more details to the header.


https://lore.kernel.org/nvdimm/163454440296.431294.2368481747380790011.st...@lep8c.aus.stglabs.ibm.com 





Re: [PATCH v6 0/4] Add perf interface to expose nvdimm

2022-02-24 Thread Aneesh Kumar K V
On Fri, 2022-02-25 at 12:08 +0530, kajoljain wrote:
> 
> 
> On 2/25/22 11:25, Nageswara Sastry wrote:
> > 
> > 
> > On 17/02/22 10:03 pm, Kajol Jain wrote:
> > > 

> > > 
> > > Changelog
> > 
> > Tested these patches with the automated tests at
> > avocado-misc-tests/perf/perf_nmem.py
> > URL:
> > https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py
> > 
> > 
> > 1. On the system where target id and online id were different then
> > not
> > seeing value in 'cpumask' and those tests failed.
> > 
> > Example:
> > Log from dmesg
> > ...
> > papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region
> > registered
> > with target node 1 and online node 0
> > ...
> 
> Hi Nageswara Sastry,
>    Thanks for testing the patch set. Yes you right, incase target
> node id and online node id is different, it can happen when target
> node is not online and hence can cause this issue, thanks for
> pointing
> it.
> 
> Function dev_to_node will return node id for a given nvdimm device
> which
> can be offline in some scenarios. We should use numa node id return
> by
> numa_map_to_online_node function in that scenario. This function
> incase
> given node is offline, it will lookup for next closest online node
> and
> return that nodeid.
> 
> Can you try with below change and see, if you are still getting this
> issue. Please let me know.
> 
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index bdf2620db461..4dd513d7c029 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct
> papr_scm_priv *p)
>     PERF_PMU_CAP_NO_EXCLUDE;
> 
>     /*updating the cpumask variable */
> -   nodeid = dev_to_node(>pdev->dev);
> +   nodeid = numa_map_to_online_node(dev_to_node(>pdev->dev));
>     nd_pmu->arch_cpumask = *cpumask_of_node(nodeid);
> 
> > 

Can you use p->region->numa_node? 

-aneesh




Re: [PATCH v2 1/2] selftest/vm: Use correct PAGE_SHIFT value for ppc64

2022-02-11 Thread Aneesh Kumar K V

On 2/11/22 18:58, Mike Rapoport wrote:

Hi Aneesh,

On Fri, Feb 11, 2022 at 05:22:13PM +0530, Aneesh Kumar K V wrote:

On 2/11/22 16:03, Mike Rapoport wrote:

On Fri, Feb 11, 2022 at 12:03:28PM +0530, Aneesh Kumar K.V wrote:

Keep it simple by using a #define and limiting hugepage size to 2M.
This keeps the test simpler instead of dynamically finding the page size
and huge page size.

Without this tests are broken w.r.t reading /proc/self/pagemap

if (pread(pagemap_fd, ent, sizeof(ent),
(uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
err(2, "read pagemap");

Cc: Shuah Khan 
Cc: linux-kselft...@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V 
---
   tools/testing/selftests/vm/ksm_tests.c| 9 -
   tools/testing/selftests/vm/transhuge-stress.c | 9 -
   2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/ksm_tests.c 
b/tools/testing/selftests/vm/ksm_tests.c
index 1436e1a9a3d3..cae72872152b 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -22,7 +22,14 @@
   #define KSM_MERGE_ACROSS_NODES_DEFAULT true
   #define MB (1ul << 20)
-#define PAGE_SHIFT 12
+#ifdef __powerpc64__
+#define PAGE_SHIFT 16
+#else
+#define PAGE_SHIFT 12
+#endif


Page size can be other than 4096 for other configurations as well. And even
on ppc64 it's not necessarily 64k.



But most common test config is with 64K page size.


Ideally page size in selftests/vm should be sysconf(_SC_PAGESIZE)



yes. As explained in commit message, the idea was to keep it simpler.


I think it's simple enough (compile tested on x86 only):

 From 219577d87041f19f2c00dc7c23e0fd5aad8b02d5 Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Fri, 11 Feb 2022 15:24:13 +0200
Subject: [PATCH] selftest/vm: add helpers to detect PAGE_SIZE and PAGE_SHIFT

PAGE_SIZE is not 4096 in many configurations, particularily ppc64 uses
64K pages in majority of cases.

Add helpers to detect PAGE_SIZE and PAGE_SHIFT dynamically.

Signed-off-by: Mike Rapoport 
---
  tools/testing/selftests/vm/gup_test.c |  3 +-
  tools/testing/selftests/vm/ksm_tests.c|  8 +
  tools/testing/selftests/vm/transhuge-stress.c |  9 ++
  tools/testing/selftests/vm/util.h | 29 +++
  4 files changed, 34 insertions(+), 15 deletions(-)
  create mode 100644 tools/testing/selftests/vm/util.h

diff --git a/tools/testing/selftests/vm/gup_test.c 
b/tools/testing/selftests/vm/gup_test.c
index fe043f67798b..cda837a14736 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -10,8 +10,9 @@
  #include 
  #include "../../../../mm/gup_test.h"
  
+#include "util.h"

+
  #define MB (1UL << 20)
-#define PAGE_SIZE sysconf(_SC_PAGESIZE)
  
  /* Just the flags we need, copied from mm.h: */

  #define FOLL_WRITE0x01/* check pte is writable */
diff --git a/tools/testing/selftests/vm/ksm_tests.c 
b/tools/testing/selftests/vm/ksm_tests.c
index cae72872152b..7faafd24446f 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -12,6 +12,7 @@
  
  #include "../kselftest.h"

  #include "../../../../include/vdso/time64.h"
+#include "util.h"
  
  #define KSM_SYSFS_PATH "/sys/kernel/mm/ksm/"

  #define KSM_FP(s) (KSM_SYSFS_PATH s)
@@ -22,17 +23,10 @@
  #define KSM_MERGE_ACROSS_NODES_DEFAULT true
  #define MB (1ul << 20)
  
-#ifdef __powerpc64__

-#define PAGE_SHIFT 16
-#else
-#define PAGE_SHIFT 12
-#endif
  /*
   * On ppc64 this will only work with radix 2M hugepage size
   */
  #define HPAGE_SHIFT 21
-
-#define PAGE_SIZE (1 << PAGE_SHIFT)
  #define HPAGE_SIZE (1 << HPAGE_SHIFT)
  
  #define PAGEMAP_PRESENT(ent)	(((ent) & (1ull << 63)) != 0)

diff --git a/tools/testing/selftests/vm/transhuge-stress.c 
b/tools/testing/selftests/vm/transhuge-stress.c
index b1f8d98355c5..baf90a745d28 100644
--- a/tools/testing/selftests/vm/transhuge-stress.c
+++ b/tools/testing/selftests/vm/transhuge-stress.c
@@ -16,17 +16,12 @@
  #include 
  #include 
  
-#ifdef __powerpc64__

-#define PAGE_SHIFT 16
-#else
-#define PAGE_SHIFT 12
-#endif
+#include "util.h"
+
  /*
   * On ppc64 this will only work with radix 2M hugepage size
   */
  #define HPAGE_SHIFT 21
-
-#define PAGE_SIZE (1 << PAGE_SHIFT)
  #define HPAGE_SIZE (1 << HPAGE_SHIFT)
  
  #define PAGEMAP_PRESENT(ent)	(((ent) & (1ull << 63)) != 0)

diff --git a/tools/testing/selftests/vm/util.h 
b/tools/testing/selftests/vm/util.h
new file mode 100644
index ..1c85d7583bac
--- /dev/null
+++ b/tools/testing/selftests/vm/util.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __KSELFTEST_VM_UTIL_H
+#define __KSELFTEST_VM_UTIL_H
+
+#include  /* ffsl() */
+#include  /* _SC_PAGESIZE */
+
+static unsigned __pa

Re: [PATCH v2 1/2] selftest/vm: Use correct PAGE_SHIFT value for ppc64

2022-02-11 Thread Aneesh Kumar K V

On 2/11/22 16:03, Mike Rapoport wrote:

On Fri, Feb 11, 2022 at 12:03:28PM +0530, Aneesh Kumar K.V wrote:

Keep it simple by using a #define and limiting hugepage size to 2M.
This keeps the test simpler instead of dynamically finding the page size
and huge page size.

Without this tests are broken w.r.t reading /proc/self/pagemap

if (pread(pagemap_fd, ent, sizeof(ent),
(uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
err(2, "read pagemap");

Cc: Shuah Khan 
Cc: linux-kselft...@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V 
---
  tools/testing/selftests/vm/ksm_tests.c| 9 -
  tools/testing/selftests/vm/transhuge-stress.c | 9 -
  2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/ksm_tests.c 
b/tools/testing/selftests/vm/ksm_tests.c
index 1436e1a9a3d3..cae72872152b 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -22,7 +22,14 @@
  #define KSM_MERGE_ACROSS_NODES_DEFAULT true
  #define MB (1ul << 20)
  
-#define PAGE_SHIFT 12

+#ifdef __powerpc64__
+#define PAGE_SHIFT 16
+#else
+#define PAGE_SHIFT 12
+#endif


Page size can be other than 4096 for other configurations as well. And even
on ppc64 it's not necessarily 64k.



But most common test config is with 64K page size.


Ideally page size in selftests/vm should be sysconf(_SC_PAGESIZE)



yes. As explained in commit message, the idea was to keep it simpler.

"Keep it simple by using a #define and limiting hugepage size to 2M.
This keeps the test simpler instead of dynamically finding the page size
and huge page size.

Without this tests are broken w.r.t reading /proc/self/pagemap"

We can definitely look at updating multiple tests in selftest/vm to work 
with dynamic values of page size and hugepage size. But that can be 
outside this patch?
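
For reference, deriving the values at run time would be something like the
below (untested sketch in the spirit of the util.h above; the helper names
are just illustrative):

#define _GNU_SOURCE
#include <string.h>		/* ffsl() */
#include <unistd.h>		/* sysconf(), _SC_PAGESIZE */

static inline unsigned long sys_page_size(void)
{
	return (unsigned long)sysconf(_SC_PAGESIZE);
}

static inline unsigned int sys_page_shift(void)
{
	/* 4K -> 12, 64K -> 16, etc. */
	return ffsl((long)sys_page_size()) - 1;
}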





+/*
+ * On ppc64 this will only work with radix 2M hugepage size
+ */
  #define HPAGE_SHIFT 21
  
  #define PAGE_SIZE (1 << PAGE_SHIFT)

diff --git a/tools/testing/selftests/vm/transhuge-stress.c 
b/tools/testing/selftests/vm/transhuge-stress.c
index 5e4c036f6ad3..b1f8d98355c5 100644
--- a/tools/testing/selftests/vm/transhuge-stress.c
+++ b/tools/testing/selftests/vm/transhuge-stress.c
@@ -16,7 +16,14 @@
  #include 
  #include 
  
-#define PAGE_SHIFT 12

+#ifdef __powerpc64__
+#define PAGE_SHIFT 16
+#else
+#define PAGE_SHIFT 12
+#endif
+/*
+ * On ppc64 this will only work with radix 2M hugepage size
+ */
  #define HPAGE_SHIFT 21
  
  #define PAGE_SIZE (1 << PAGE_SHIFT)

--
2.34.1








Re: [PATCH v2] powerpc/mm: Update default hugetlb size early

2022-02-11 Thread Aneesh Kumar K V

On 2/11/22 14:00, David Hildenbrand wrote:

On 11.02.22 07:52, Aneesh Kumar K.V wrote:

commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by 
mobility")
introduced pageblock_order which will be used to group pages better.
The kernel now groups pages based on the value of HPAGE_SHIFT. Hence HPAGE_SHIFT
should be set before we call set_pageblock_order.

set_pageblock_order happens early in the boot and default hugetlb page size
should be initialized before that to compute the right pageblock_order value.

Currently, default hugetlbe page size is set via arch_initcalls which happens
late in the boot as shown via the below callstack:

[c7383b10] [c1289328] hugetlbpage_init+0x2b8/0x2f8
[c7383bc0] [c12749e4] do_one_initcall+0x14c/0x320
[c7383c90] [c127505c] kernel_init_freeable+0x410/0x4e8
[c7383da0] [c0012664] kernel_init+0x30/0x15c
[c7383e10] [c000cf14] ret_from_kernel_thread+0x5c/0x64

and the pageblock_order initialization is done early during the boot.

[c18bfc80] [c12ae120] set_pageblock_order+0x50/0x64
[c18bfca0] [c12b3d94] sparse_init+0x188/0x268
[c18bfd60] [c1288bfc] initmem_init+0x28c/0x328
[c18bfe50] [c127b370] setup_arch+0x410/0x480
[c18bfed0] [c127401c] start_kernel+0xb8/0x934
[c18bff90] [c000d984] start_here_common+0x1c/0x98

delaying default hugetlb page size initialization implies the kernel will
initialize pageblock_order to (MAX_ORDER - 1) which is not an optimal
value for mobility grouping. IIUC we always had this issue. But it was not
a problem for hash translation mode because (MAX_ORDER - 1) is the same as
HUGETLB_PAGE_ORDER (8) in the case of hash (16MB). With radix,
HUGETLB_PAGE_ORDER will be 5 (2M size) and hence pageblock_order should be
5 instead of 8.



A related question: Can we on ppc still have pageblock_order > MAX_ORDER
- 1? We have some code for that and I am not so sure if we really need that.



I have also been wondering about the same. On book3s64 I don't think we 
need that support for either 64K or 4K page size, because with hash the 
hugetlb order is MAX_ORDER - 1 (16MB hugepage size).


I am not sure about the 256K page support. Christophe may be able to 
answer that.


For gigantic hugepage support we depend on CMA-based allocation or
firmware reservation. So I am not sure why we ever considered the
pageblock > MAX_ORDER - 1 scenario. If you have pointers w.r.t. why that was
ever needed, I could double-check whether ppc64 is still dependent on that.


-aneesh


Re: [PATCH] selftest/vm: Use correct PAGE_SHIFT value for ppc64

2022-02-10 Thread Aneesh Kumar K V

On 2/10/22 20:09, Shuah Khan wrote:

On 2/9/22 9:12 PM, Aneesh Kumar K.V wrote:

Shuah Khan  writes:


On 2/9/22 8:43 AM, Aneesh Kumar K.V wrote:

Keep it simple by using a #define and limiting hugepage size to 2M.
This keeps the test simpler instead of dynamically finding the page 
size

and huge page size.

Without this tests are broken w.r.t reading /proc/self/pagemap

if (pread(pagemap_fd, ent, sizeof(ent),
    (uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
    err(2, "read pagemap");

Cc: Shuah Khan 
Signed-off-by: Aneesh Kumar K.V 
---
   tools/testing/selftests/vm/ksm_tests.c    | 8 
   tools/testing/selftests/vm/transhuge-stress.c | 8 
   2 files changed, 16 insertions(+)

diff --git a/tools/testing/selftests/vm/ksm_tests.c 
b/tools/testing/selftests/vm/ksm_tests.c

index 1436e1a9a3d3..8200328ff018 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -22,8 +22,16 @@
   #define KSM_MERGE_ACROSS_NODES_DEFAULT true
   #define MB (1ul << 20)
+#ifdef __powerpc64__
+#define PAGE_SHIFT    16
+/*
+ * This will only work with radix 2M hugepage size
+ */
+#define HPAGE_SHIFT 21
+#else
   #define PAGE_SHIFT 12
   #define HPAGE_SHIFT 21
+#endif
   #define PAGE_SIZE (1 << PAGE_SHIFT)
   #define HPAGE_SIZE (1 << HPAGE_SHIFT)
diff --git a/tools/testing/selftests/vm/transhuge-stress.c 
b/tools/testing/selftests/vm/transhuge-stress.c

index 5e4c036f6ad3..f04c8aa4bcf6 100644
--- a/tools/testing/selftests/vm/transhuge-stress.c
+++ b/tools/testing/selftests/vm/transhuge-stress.c
@@ -16,8 +16,16 @@
   #include 
   #include 
+#ifdef __powerpc64__
+#define PAGE_SHIFT    16
+/*
+ * This will only work with radix 2M hugepage size
+ */
+#define HPAGE_SHIFT 21


Why not have this is in common code?


Can you suggest where I can move that. We also have helper functions
like allocate_transhuge() duplicated between tests. I didn't find
libutil.a or anything similar supported by the selftets build.





I noticed that HPAGE_SHIFT is defined in #ifdef __powerpc64__ block
as well as #else. I am asking is it necessary to be part of both
blocks.

+#ifdef __powerpc64__
+#define PAGE_SHIFT    16
+/*
+ * This will only work with radix 2M hugepage size
+ */
+#define HPAGE_SHIFT 21  --- this one
+#else
   #define PAGE_SHIFT 12
   #define HPAGE_SHIFT 21   --- this one
+#endif




The reason I did that was to add the comment which is relevant only for 
ppc64. ppc64 supports two hugepage sizes, 2M and 16M. The test won't 
work correctly with 16M hugepage size. We do have other tests in 
selftest/vm/ with similar restrictions.



-aneesh