from:"Baoquan He"

Re: [PATCH 1/3] crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE

2016-11-13 Thread Baoquan He

On 11/10/16 at 05:27pm, Hari Bathini wrote:
> Traditionally, kdump is used to save vmcore in case of a crash. Some
> architectures like powerpc can save vmcore using architecture specific
> support instead of kexec/kdump mechanism. Such architecture specific
> support also needs to reserve memory, to be used by dump capture kernel.
> crashkernel parameter can be reused, for memory reservation, by such
> architecture specific infrastructure.
> 
> But currently, code related to vmcoreinfo and parsing of crashkernel
> parameter is built under CONFIG_KEXEC_CORE. This patch introduces
> CONFIG_CRASH_CORE and moves the above mentioned code under this config,
> allowing code reuse without dependency on CONFIG_KEXEC. While here,
> removing the multiple definitions of append_elf_note() and final_note()
> for one defined under CONFIG_CRASH_CORE. There is no functional change
> with this patch.

Can't think of a reason to object.

Could this patch do only the move from kexec_core.c to crash_core.c, and
leave the arch specific clean up to another patch?

Besides, there's already a file crash_dump.h; can we reuse that?
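
For reference while reviewing the consolidation, here is a minimal user-space sketch of what the duplicated append_elf_note()/final_note() helpers do: each note is a three-word header, then the name, then the descriptor, with every part padded to a 4-byte boundary. The names struct note_hdr, words() and append_note() are stand-ins I made up for this sketch, it does no bounds checking, and it is not the code that ends up in crash_core.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the kernel's struct elf_note (three 32-bit words). */
struct note_hdr {
	uint32_t n_namesz;	/* length of the name, including the trailing NUL */
	uint32_t n_descsz;	/* length of the descriptor payload */
	uint32_t n_type;	/* note type */
};

/* Round a byte count up to a whole number of 32-bit words. */
static size_t words(size_t bytes)
{
	return (bytes + 3) / 4;
}

/*
 * Append one note at buf and return the position for the next note.
 * The caller provides a zeroed buffer that is large enough.
 */
static uint32_t *append_note(uint32_t *buf, const char *name, uint32_t type,
			     const void *data, size_t data_len)
{
	struct note_hdr *note = (struct note_hdr *)buf;

	note->n_namesz = (uint32_t)strlen(name) + 1;
	note->n_descsz = (uint32_t)data_len;
	note->n_type = type;
	buf += words(sizeof(*note));
	memcpy(buf, name, note->n_namesz);
	buf += words(note->n_namesz);
	memcpy(buf, data, data_len);
	buf += words(data_len);
	return buf;
}

/* Terminate the note list with an all-zero header. */
static void final_note(uint32_t *buf)
{
	memset(buf, 0, sizeof(struct note_hdr));
}
```

With a 5-byte name ("CORE" plus NUL) and a 4-byte descriptor, the note takes 3 + 2 + 1 = 6 words, which is the kind of arithmetic both arch copies repeat today.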

> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/Kconfig   |4 
>  arch/ia64/kernel/crash.c   |   22 --
>  arch/powerpc/Kconfig   |   10 -
>  arch/powerpc/include/asm/fadump.h  |2 
>  arch/powerpc/kernel/crash.c|2 
>  arch/powerpc/kernel/fadump.c   |   34 ---
>  arch/powerpc/kernel/setup-common.c |5 
>  include/linux/crash_core.h |   75 ++
>  include/linux/kexec.h  |   63 -
>  kernel/Makefile|1 
>  kernel/crash_core.c|  450 
>  kernel/kexec_core.c|  435 ---
>  12 files changed, 550 insertions(+), 553 deletions(-)
>  create mode 100644 include/linux/crash_core.h
>  create mode 100644 kernel/crash_core.c
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 659bdd0..4ad34b9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -2,7 +2,11 @@
>  # General architecture dependent options
>  #
>  
> +config CRASH_CORE
> + bool
> +
>  config KEXEC_CORE
> + select CRASH_CORE
>   bool
>  
>  config OPROFILE
> diff --git a/arch/ia64/kernel/crash.c b/arch/ia64/kernel/crash.c
> index 2955f35..75859a0 100644
> --- a/arch/ia64/kernel/crash.c
> +++ b/arch/ia64/kernel/crash.c
> @@ -27,28 +27,6 @@ static int kdump_freeze_monarch;
>  static int kdump_on_init = 1;
>  static int kdump_on_fatal_mca = 1;
>  
> -static inline Elf64_Word
> -*append_elf_note(Elf64_Word *buf, char *name, unsigned type, void *data,
> - size_t data_len)
> -{
> - struct elf_note *note = (struct elf_note *)buf;
> - note->n_namesz = strlen(name) + 1;
> - note->n_descsz = data_len;
> - note->n_type   = type;
> - buf += (sizeof(*note) + 3)/4;
> - memcpy(buf, name, note->n_namesz);
> - buf += (note->n_namesz + 3)/4;
> - memcpy(buf, data, data_len);
> - buf += (data_len + 3)/4;
> - return buf;
> -}
> -
> -static void
> -final_note(void *buf)
> -{
> - memset(buf, 0, sizeof(struct elf_note));
> -}
> -
>  extern void ia64_dump_cpu_regs(void *);
>  
>  static DEFINE_PER_CPU(struct elf_prstatus, elf_prstatus);
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 65fba4c..644703f 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -479,21 +479,23 @@ config RELOCATABLE
> load address of the kernel (eg. u-boot/mkimage).
>  
>  config CRASH_DUMP
> - bool "Build a kdump crash kernel"
> + bool "Build a dump capture kernel"
>   depends on PPC64 || 6xx || FSL_BOOKE || (44x && !SMP)
>   select RELOCATABLE if (PPC64 && !COMPILE_TEST) || 44x || FSL_BOOKE
>   help
> -   Build a kernel suitable for use as a kdump capture kernel.
> +   Build a kernel suitable for use as a dump capture kernel.
> The same kernel binary can be used as production kernel and dump
> capture kernel.
>  
>  config FA_DUMP
>   bool "Firmware-assisted dump"
> - depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC
> + depends on PPC64 && PPC_RTAS
> + select CRASH_CORE
> + select CRASH_DUMP
>   help
> A robust mechanism to get reliable kernel crash dump with
> assistance from firmware. This approach does not use kexec,
> -   instead firmware assists in booting the kdump kernel
> +   instead firmware assists in booting the capture kernel
> while preserving memory contents. Firmware-assisted dump
> is meant to be a kdump replacement offering robustness and
> speed not possible without system firmware assistance.
> diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
> index 0031806..60b9108 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -73,6 +73,8 @@
>

Re: [RFC] kexec_file: Add support for purgatory built as PIE

2016-11-04 Thread Baoquan He

On 11/02/16 at 04:00am, Thiago Jung Bauermann wrote:
> Hello,
> 
> The kexec_file code currently builds the purgatory as a partially linked
> object (using ld -r). Is there a particular reason to use that instead of a
> position independent executable (PIE)?

It was originally built with "-r" (relocatable) in the user space kexec-tools
too. I think Vivek just kept it the same when moving the code into the kernel.

> 
> I found a discussion from 2013 in the archives but from what I understood it 
> was about the purgatory as a separate object vs having it linked into the 
> kernel, which is different from what I'm asking:
> 
> http://lists.infradead.org/pipermail/kexec/2013-December/010535.html
> 
> Here is my motivation for this question:
> 
>  On ppc64 purgatory.ro has 12 relocation types when built as a partially 
> linked object. This makes arch_kexec_apply_relocations_add duplicate a lot of 
> code with module_64.c:apply_relocate_add to implement these relocations. The 
> alternative is to do some refactoring so that both functions can share the 
> implementation of the relocations. This is done in patches 5 and 6 of the 
> kexec_file_load implementation for powerpc:

Do you also hit this problem in the user space kexec-tools utility?

> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-October/149984.html
> 
> Michael Ellerman would prefer if module_64.c didn't need to be changed, and 
> suggested that the purgatory could be a position independent executable. 
> Indeed, in that case there are only 4 relocation types in purgatory.ro (which 
> aren't even implemented in module_64.c:apply_relocate_add), so the relocation 
> code for the purgatory can leave that file alone and have its own relocation 
> implementation.
> 
> Also, the purgatory is an executable and not an intermediary output from the
> compiler, so in my mind it makes sense conceptually that it is easier to
> build it as a PIE than as a partially linked object.
> 
> The patch below adds the support needed in kexec_file.c to allow
> powerpc-specific code to load and relocate a purgatory binary built as PIE.
> This is WIP and can probably be refined a bit. Would you accept a change
> along these lines?
> 
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  arch/Kconfig|   3 +
>  kernel/kexec_file.c | 159 ++--
>  kernel/kexec_internal.h |  26 
>  3 files changed, 183 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 659bdd079277..7fd6879be222 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -5,6 +5,9 @@
>  config KEXEC_CORE
>   bool
>  
> +config HAVE_KEXEC_FILE_PIE_PURGATORY
> + bool
> +
>  config OPROFILE
>   tristate "OProfile system profiling"
>   depends on PROFILING
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 0c2df7f73792..dfc3e015160d 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -633,7 +633,149 @@ static int kexec_calculate_store_digests(struct kimage *image)
>   return ret;
>  }
>  
> -/* Actually load purgatory. Lot of code taken from kexec-tools */
> +#ifdef CONFIG_HAVE_KEXEC_FILE_PIE_PURGATORY
> +/* Load PIE purgatory using the program header information. */
> +static int __kexec_load_purgatory(struct kimage *image, unsigned long min,
> +   unsigned long max, int top_down)
> +{
> + struct purgatory_info *pi = &image->purgatory_info;
> + unsigned long first_offset;
> + unsigned long orig_load_addr = 0;
> + const void *src;
> + int i, ret;
> + const Elf_Phdr *phdrs = (const void *) pi->ehdr + pi->ehdr->e_phoff;
> + const Elf_Phdr *phdr;
> + const Elf_Shdr *sechdrs_c;
> + Elf_Shdr *sechdr;
> + Elf_Shdr *sechdrs = NULL;
> + struct kexec_buf kbuf = { .image = image, .bufsz = 0, .buf_align = 1,
> +   .buf_min = min, .buf_max = max,
> +   .top_down = top_down };
> +
> + /*
> +  * sechdrs_c points to section headers in purgatory and are read
> +  * only. No modifications allowed.
> +  */
> + sechdrs_c = (void *) pi->ehdr + pi->ehdr->e_shoff;
> +
> + /*
> +  * We can not modify sechdrs_c[] and its fields. It is read only.
> +  * Copy it over to a local copy where one can store some temporary
> +  * data and free it at the end. We need to modify ->sh_addr and
> +  * ->sh_offset fields to keep track of permanent and temporary
> +  * locations of sections.
> +  */
> + sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> + if (!sechdrs)
> + return -ENOMEM;
> +
> + memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
> +
> + /*
> +  * We seem to have multiple copies of sections. First copy is which
> +  * is embedded in kernel in read only section. Some of these sections
> +  * will be copied to a temporary buffer and

Re: [PATCH v7 3/4] lib/cmdline.c Remove quotes symmetrically.

2017-08-21 Thread Baoquan He

On 08/17/17 at 10:14pm, Michal Suchanek wrote:
> Remove quotes from argument value only if there is a quote on both sides.
> 
> Signed-off-by: Michal Suchanek 

Sounds reasonable. Just out of curiosity, is there any chance that an
option gets passed in with a single '"'?
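
To make the question concrete, here is a small user-space sketch of the symmetric rule as I read the changelog: quotes are dropped only when the value both starts and ends with '"', and a lone quote on either side is left in place. strip_symmetric_quotes() is a hypothetical helper for illustration only, not the lib/cmdline.c code, and the length check of at least two characters is my own assumption.

```c
#include <stddef.h>
#include <string.h>

/*
 * Strip one leading and one trailing '"' from val, but only when both
 * are present. Returns the start of the (possibly shortened) value.
 * The string is modified in place when the quotes are removed.
 */
static char *strip_symmetric_quotes(char *val)
{
	size_t len = strlen(val);

	if (len >= 2 && val[0] == '"' && val[len - 1] == '"') {
		val[len - 1] = '\0';	/* drop the trailing quote */
		return val + 1;		/* skip the leading quote */
	}
	return val;			/* unbalanced or unquoted: keep as is */
}
```

With this rule a value such as `"foo` would keep its single leading quote, which is the case I am asking about.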

> ---
>  arch/powerpc/kernel/fadump.c | 6 ++
>  lib/cmdline.c| 7 ++-
>  2 files changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index a1614d9b8a21..d7da4ce9f7ae 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -489,10 +489,8 @@ static void __init fadump_update_params(struct param_info *param_info,
>   *tgt++ = ' ';
>  
>   /* next_arg removes one leading and one trailing '"' */
> - if (*tgt == '"')
> - shortening += 1;
> - if (*(tgt + vallen + shortening) == '"')
> - shortening += 1;
> + if ((*tgt == '"') && (*(tgt + vallen + shortening) == '"'))
> + shortening += 2;
>  
>   /* remove one leading and one trailing quote if both are present */
>   if ((val[0] == '"') && (val[vallen - 1] == '"')) {
> diff --git a/lib/cmdline.c b/lib/cmdline.c
> index 4c0888c4a68d..01e701b2afe8 100644
> --- a/lib/cmdline.c
> +++ b/lib/cmdline.c
> @@ -227,14 +227,11 @@ char *next_arg(char *args, char **param, char **val)
>   *val = args + equals + 1;
>  
>   /* Don't include quotes in value. */
> - if (**val == '"') {
> + if ((**val == '"') && (args[i-1] == '"')) {
>   (*val)++;
> - if (args[i-1] == '"')
> - args[i-1] = '\0';
> + args[i-1] = '\0';
>   }
>   }
> - if (quoted && args[i-1] == '"')
> - args[i-1] = '\0';
>  
>   if (args[i]) {
>   args[i] = '\0';
> -- 
> 2.10.2
>

Re: [PATCH v5 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-06-11 Thread Baoquan He

On 06/12/18 at 11:28am, Baoquan He wrote:
> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> so that it's shared. Later its code also needs to be updated to use
> list_head in place of the singly linked list.
> 
> Signed-off-by: Baoquan He 
> Cc: Michal Simek 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> ---
> v4->v5:
>   Fix several code bugs reported by test robot on ARCH powerpc and
>   microblaze.

Oops, I mistakenly added the patch change log of the current patch 0002
here. This patch is a newly added one.

> 
> v3->v4:
>   Fix several bugs test robot reported. And change patch log.
> 
> v2->v3:
>   Rename resource functions first_child() and sibling() to
>   resource_first_child() and resource_sibling(). Dan suggested this.
> 
>   Move resource_first_child() and resource_sibling() to linux/ioport.h
>   and make them inline functions. Rob suggested this. Accordingly, include
>   linux/list.h in linux/ioport.h; please help review whether this
>   brings efficiency degradation or code redundancy.
> 
>   The change to struct resource {} increases its size by two pointers;
>   mention this in the git log to be more specific. Rob suggested
>   this.
> 
>  arch/microblaze/pci/pci-common.c | 37 -
>  arch/powerpc/kernel/pci-common.c | 35 ---
>  include/linux/ioport.h   |  1 +
>  kernel/resource.c| 36 
>  4 files changed, 37 insertions(+), 72 deletions(-)
> 
> diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
> index f34346d56095..7899bafab064 100644
> --- a/arch/microblaze/pci/pci-common.c
> +++ b/arch/microblaze/pci/pci-common.c
> @@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
>  EXPORT_SYMBOL(pcibios_add_device);
>  
>  /*
> - * Reparent resource children of pr that conflict with res
> - * under res, and make res replace those children.
> - */
> -static int __init reparent_resources(struct resource *parent,
> -  struct resource *res)
> -{
> - struct resource *p, **pp;
> - struct resource **firstpp = NULL;
> -
> - for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> - if (p->end < res->start)
> - continue;
> - if (res->end < p->start)
> - break;
> - if (p->start < res->start || p->end > res->end)
> - return -1;  /* not completely contained */
> - if (firstpp == NULL)
> - firstpp = pp;
> - }
> - if (firstpp == NULL)
> - return -1;  /* didn't find any conflicting entries? */
> - res->parent = parent;
> - res->child = *firstpp;
> - res->sibling = *pp;
> - *firstpp = res;
> - *pp = NULL;
> - for (p = res->child; p != NULL; p = p->sibling) {
> - p->parent = res;
> - pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
> -  p->name,
> -  (unsigned long long)p->start,
> -  (unsigned long long)p->end, res->name);
> - }
> - return 0;
> -}
> -
> -/*
>   *  Handle resources of PCI devices.  If the world were perfect, we could
>   *  just allocate all the resource regions and do nothing more.  It isn't.
>   *  On the other hand, we cannot just re-allocate all devices, as it would
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index fe9733aa..926035bb378d 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
>  EXPORT_SYMBOL(pcibios_align_resource);
>  
>  /*
> - * Reparent resource children of pr that conflict with res
> - * under res, and make res replace those children.
> - */
> -static int reparent_resources(struct resource *parent,
> -  struct resource *res)
> -{
> - struct resource *p, **pp;
> - struct resource **firstpp = NULL;
> -
> - for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> - if (p->end < res->start)
> - continue;
> - if (res->end < p->start)
> - break;
> - if (p->start < res->start || p->end > res->end)
> -

[PATCH v5 4/4] kexec_file: Load kernel at top of system RAM if required

2018-06-11 Thread Baoquan He

For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
is used to load kernel/initrd/purgatory is supposed to be allocated from
top to down. This is what we have been doing all along in the old kexec
loading interface, and the old kexec loading is still the default setting in
some distributions. However, the current kexec_file loading interface doesn't
behave like this. The function arch_kexec_walk_mem() it calls does not check
kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
all resources of System RAM from bottom to up, trying to find a memory region
which can contain the specific kexec buffer, then calls
locate_mem_hole_callback() to allocate memory in that found memory region
from top to down. This brings confusion, especially now that KASLR is widely
supported: users have to figure out why the kexec/kdump kernel loading
position differs between these two interfaces in order to exclude
unnecessary noise. Hence the behaviour of these two interfaces needs to be
unified.

Here, add a check in arch_kexec_walk_mem() for whether kexec_buf.top_down is
'true'; if it is, call the newly added walk_system_ram_res_rev() to search
memory regions from top to down for loading the kernel.
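
The difference between the two behaviours can be shown with a minimal user-space sketch. It assumes a sorted array of free RAM ranges instead of the iomem resource tree, and find_load_addr() and struct ram_range are hypothetical names made up for this illustration. Inside the chosen range the buffer is always placed at the top; the top_down flag only decides which range is tried first, which is exactly the part the current kexec_file code ignores.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A free System RAM range, [start, end] inclusive. */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Pick a load address for a buffer of `size` bytes. Ranges must be sorted
 * by ascending address. Returns 0 and stores the address in *addr, or -1
 * if no range is large enough.
 */
static int find_load_addr(const struct ram_range *r, size_t n, uint64_t size,
			  bool top_down, uint64_t *addr)
{
	for (size_t k = 0; k < n; k++) {
		/* Walk from the highest range when top_down is set. */
		size_t i = top_down ? n - 1 - k : k;
		uint64_t len = r[i].end - r[i].start + 1;

		if (len < size)
			continue;
		/* Allocate from the top of the range that fits. */
		*addr = r[i].end - size + 1;
		return 0;
	}
	return -1;
}
```

With a low range and a high range that both fit, the bottom-up walk lands at the top of the low range while the top-down walk lands at the top of the high range, so the same buffer gets two different addresses depending only on the walk direction.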

Signed-off-by: Baoquan He 
Cc: Eric Biederman 
Cc: Vivek Goyal 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Yinghai Lu 
Cc: ke...@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 75d8e7cf040e..7a66d9d5a534 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
   crashk_res.start, crashk_res.end,
   kbuf, func);
+   else if (kbuf->top_down)
+   return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
else
return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
-- 
2.13.6

[PATCH v5 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-06-11 Thread Baoquan He

reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
so that it's shared. Later its code also needs to be updated to use
list_head in place of the singly linked list.

Signed-off-by: Baoquan He 
Cc: Michal Simek 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
---
v4->v5:
  Fix several code bugs reported by test robot on ARCH powerpc and
  microblaze.

v3->v4:
  Fix several bugs test robot reported. And change patch log.

v2->v3:
  Rename resource functions first_child() and sibling() to
  resource_first_child() and resource_sibling(). Dan suggested this.

  Move resource_first_child() and resource_sibling() to linux/ioport.h
  and make them inline functions. Rob suggested this. Accordingly, include
  linux/list.h in linux/ioport.h; please help review whether this
  brings efficiency degradation or code redundancy.

  The change to struct resource {} increases its size by two pointers;
  mention this in the git log to be more specific. Rob suggested
  this.

 arch/microblaze/pci/pci-common.c | 37 -
 arch/powerpc/kernel/pci-common.c | 35 ---
 include/linux/ioport.h   |  1 +
 kernel/resource.c| 36 
 4 files changed, 37 insertions(+), 72 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index f34346d56095..7899bafab064 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
 EXPORT_SYMBOL(pcibios_add_device);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int __init reparent_resources(struct resource *parent,
-struct resource *res)
-{
-   struct resource *p, **pp;
-   struct resource **firstpp = NULL;
-
-   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-   if (p->end < res->start)
-   continue;
-   if (res->end < p->start)
-   break;
-   if (p->start < res->start || p->end > res->end)
-   return -1;  /* not completely contained */
-   if (firstpp == NULL)
-   firstpp = pp;
-   }
-   if (firstpp == NULL)
-   return -1;  /* didn't find any conflicting entries? */
-   res->parent = parent;
-   res->child = *firstpp;
-   res->sibling = *pp;
-   *firstpp = res;
-   *pp = NULL;
-   for (p = res->child; p != NULL; p = p->sibling) {
-   p->parent = res;
-   pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
-p->name,
-(unsigned long long)p->start,
-(unsigned long long)p->end, res->name);
-   }
-   return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733aa..926035bb378d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 EXPORT_SYMBOL(pcibios_align_resource);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int reparent_resources(struct resource *parent,
-struct resource *res)
-{
-   struct resource *p, **pp;
-   struct resource **firstpp = NULL;
-
-   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-   if (p->end < res->start)
-   continue;
-   if (res->end < p->start)
-   break;
-   if (p->start < res->start || p->end > res->end)
-   return -1;  /* not completely contained */
-   if (firstpp == NULL)
-   firstpp = pp;
-   }
-   if (firstpp == NULL)
-   return -1;  /* didn't find any conflicting entries? */
-   res->parent = parent;
-   res->child = *firstpp;
-   res->sibling = *pp;
-   *firstpp = res;
-   *pp = NULL;
-   for (p = res->child; p != NULL; p = p->sibling) {
-   p->parent = res;
-   pr_debug("PCI: Reparented %s %pR under %s\n",
-p->name, p, res->name);
-   }
-   r

[PATCH v5 2/4] resource: Use list_head to link sibling resource

2018-06-11 Thread Baoquan He

The struct resource uses a singly linked list to link siblings, implemented
by pointer operations. Replace it with list_head for better code readability.

Based on this list_head replacement, it will be very easy to do reverse
iteration on iomem_resource's sibling list in a later patch.

Besides, the types of struct resource's member variables sibling and child
are changed from 'struct resource *' to 'struct list_head'. This increases
the size of struct resource by two pointers.
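
As an illustration of why the list_head form makes reverse iteration cheap, here is a minimal user-space sketch. It re-implements a tiny list_head locally, and struct res, res_of(), res_add_child(), res_last_child() and res_prev_sibling() are hypothetical names for this sketch only; the real helpers in the patch are resource_first_child() and resource_sibling().

```c
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert `entry` just before `head`, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Simplified resource: `child` heads the list, `sibling` links into it. */
struct res {
	unsigned long start, end;
	struct res *parent;
	struct list_head sibling, child;
};

/* Map a pointer to the embedded `sibling` member back to its struct res. */
#define res_of(ptr) \
	((struct res *)((char *)(ptr) - offsetof(struct res, sibling)))

static void res_init(struct res *r, unsigned long start, unsigned long end)
{
	r->start = start;
	r->end = end;
	r->parent = NULL;
	init_list_head(&r->sibling);
	init_list_head(&r->child);
}

static void res_add_child(struct res *parent, struct res *r)
{
	r->parent = parent;
	list_add_tail(&r->sibling, &parent->child);
}

/* Last child of `parent`, or NULL; O(1) thanks to the prev pointer. */
static struct res *res_last_child(struct res *parent)
{
	if (parent->child.next == &parent->child)
		return NULL;
	return res_of(parent->child.prev);
}

/* Previous sibling of `r`, or NULL when `r` is the first child. */
static struct res *res_prev_sibling(struct res *r)
{
	if (!r->parent || r->sibling.prev == &r->parent->child)
		return NULL;
	return res_of(r->sibling.prev);
}
```

With the singly linked list, finding the last child or the previous sibling means walking from the first child every time; here both are a single pointer dereference, at the cost of one extra pointer in each of the two members.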

Suggested-by: Andrew Morton 
Signed-off-by: Baoquan He 
Cc: Patrik Jakobsson 
Cc: David Airlie 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Dmitry Torokhov 
Cc: Dan Williams 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: Keith Busch 
Cc: Jonathan Derrick 
Cc: Lorenzo Pieralisi 
Cc: Bjorn Helgaas 
Cc: Thomas Gleixner 
Cc: Brijesh Singh 
Cc: "Jérôme Glisse" 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Yaowei Bai 
Cc: Wei Yang 
Cc: de...@linuxdriverproject.org
Cc: linux-in...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: devicet...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
 arch/arm/plat-samsung/pm-check.c|   6 +-
 arch/microblaze/pci/pci-common.c|   4 +-
 arch/powerpc/kernel/pci-common.c|   4 +-
 arch/sparc/kernel/ioport.c  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h|   4 +-
 drivers/eisa/eisa-bus.c |   2 +
 drivers/gpu/drm/drm_memory.c|   3 +-
 drivers/gpu/drm/gma500/gtt.c|   5 +-
 drivers/hv/vmbus_drv.c  |  52 +++
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c |   6 +-
 drivers/nvdimm/nd.h |   5 +-
 drivers/of/address.c|   4 +-
 drivers/parisc/lba_pci.c|   4 +-
 drivers/pci/host/vmd.c  |   8 +-
 drivers/pci/probe.c |   2 +
 drivers/pci/setup-bus.c |   2 +-
 include/linux/ioport.h  |  17 ++-
 kernel/resource.c   | 211 ++--
 19 files changed, 176 insertions(+), 169 deletions(-)

diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
while (ptr != NULL) {
-   if (ptr->child != NULL)
-   s3c_pm_run_res(ptr->child, fn, arg);
+   if (!list_empty(&ptr->child))
+   s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
arg = (fn)(ptr, arg);
}
 
-   ptr = ptr->sibling;
+   ptr = resource_sibling(ptr);
}
 }
 
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 7899bafab064..2bf73e27e231 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -533,7 +533,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
res->flags = range.flags;
res->start = range.cpu_addr;
res->end = range.cpu_addr + range.size - 1;
-   res->parent = res->child = res->sibling = NULL;
+   res->parent = NULL;
+   INIT_LIST_HEAD(&res->child);
+   INIT_LIST_HEAD(&res->sibling);
}
}
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 926035bb378d..28fbe83c9daf 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -761,7 +761,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
res->flags = range.flags;
res->start = range.cpu_addr;
res->end = range.cpu_addr + range.size - 1;
-   res->parent = res->child = res->sibling = NULL;
+   res->parent = NULL;
+   INIT_LIST_HEAD(&res->child);
+   INIT_LIST_HEAD(&res->sibling);
}
}
 }
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index cca9134cfa7d..99efe4e98b16 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -669,7 +669,7 @@ static int sparc_io_proc_show(struct seq_file *m, void *v)
struct resource *root = m->private, *r;
const c

[PATCH v5 0/4] resource: Use list_head to link sibling resource

2018-06-11 Thread Baoquan He

This patchset does the following:
1) Replace struct resource's sibling list, changing it from a singly linked
list to list_head, and clear out the pointer operations the singly linked
list required, for better code readability.
2) Based on the list_head replacement, add a new function
walk_system_ram_res_rev() which can do reverse iteration on
iomem_resource's siblings.
3) Change kexec_file loading to search system RAM top down for kernel
loading, using walk_system_ram_res_rev().

Note:
This patchset passed testing on my kvm guest, x86_64 arch with network
enabled. The thing we need to pay attention to is that a root resource's
child member needs to be initialized explicitly, with LIST_HEAD_INIT() if
statically defined or INIT_LIST_HEAD() if dynamically defined, just like
we do for iomem_resource/ioport_resource, or the change in
get_pci_domain_busn_res().
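
A minimal user-space sketch of the two initialization styles follows. The struct list_head, LIST_HEAD_INIT(), INIT_LIST_HEAD() and list_empty() below are local stand-ins written for this sketch that mirror the kernel names, and struct res, demo_root and res_alloc() are hypothetical. The point is that a head which is merely zeroed is not an empty list.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Local stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Static initialiser: the head points at itself, i.e. an empty list. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Run-time initialiser for heads that are allocated dynamically. */
static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

struct res {
	const char *name;
	struct list_head sibling, child;
};

/* Statically defined root: both heads initialised at compile time. */
static struct res demo_root = {
	.name = "demo root",
	.sibling = LIST_HEAD_INIT(demo_root.sibling),
	.child = LIST_HEAD_INIT(demo_root.child),
};

/* Dynamically allocated resource: heads initialised after allocation. */
static struct res *res_alloc(const char *name)
{
	struct res *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->name = name;
	INIT_LIST_HEAD(&r->sibling);
	INIT_LIST_HEAD(&r->child);
	return r;
}
```

If res_alloc() skipped the two INIT_LIST_HEAD() calls, the zeroed heads would have NULL next pointers, list_empty() would report a non-empty list, and the first iteration over child would dereference NULL.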


Links to the old posts (Boris pointed out that we should use
https://lkml.kernel.org/r/Message-ID, but those can't be opened from
my side, so both forms are pasted here):
v4:
https://lkml.kernel.org/r/20180507063224.24229-1-...@redhat.com
https://lkml.org/lkml/2018/5/7/36

v3:
https://lkml.kernel.org/r/20180419001848.3041-1-...@redhat.com
https://lkml.org/lkml/2018/4/18/767

v2:
https://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com
https://lkml.org/lkml/2018/4/7/169

v1:
https://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com
https://lkml.org/lkml/2018/3/21/952

Changelog:
v4->v5:
  Add new patch 0001 to move duplicated reparent_resources() to
  kernel/resource.c to make it be shared by different ARCH-es.

  Fix several code bugs reported by test robot on ARCH powerpc and
  microblaze.
v3->v4:
  Fix several bugs test robot reported. Rewrite cover letter and patch
  log according to reviewer's comment.

v2->v3:
  Rename resource functions first_child() and sibling() to
  resource_first_child() and resource_sibling(). Dan suggested this.

  Move resource_first_child() and resource_sibling() to linux/ioport.h
  and make them inline functions. Rob suggested this. Accordingly, include
  linux/list.h in linux/ioport.h; please help review whether this
  brings efficiency degradation or code redundancy.

  The change to struct resource {} increases its size by two pointers;
  mention this in the git log to be more specific. Rob suggested
  this.

v1->v2:
  Use list_head instead to link resource siblings. This is suggested by
  Andrew.

  Rewrite walk_system_ram_res_rev() after list_head is taken to link
  resource siblings.

Baoquan He (4):
  resource: Move reparent_resources() to kernel/resource.c and make it
public
  resource: Use list_head to link sibling resource
  resource: add walk_system_ram_res_rev()
  kexec_file: Load kernel at top of system RAM if required

 arch/arm/plat-samsung/pm-check.c|   6 +-
 arch/microblaze/pci/pci-common.c|  41 +
 arch/powerpc/kernel/pci-common.c|  39 +
 arch/sparc/kernel/ioport.c  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h|   4 +-
 drivers/eisa/eisa-bus.c |   2 +
 drivers/gpu/drm/drm_memory.c|   3 +-
 drivers/gpu/drm/gma500/gtt.c|   5 +-
 drivers/hv/vmbus_drv.c  |  52 +++---
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c |   6 +-
 drivers/nvdimm/nd.h |   5 +-
 drivers/of/address.c|   4 +-
 drivers/parisc/lba_pci.c|   4 +-
 drivers/pci/host/vmd.c  |   8 +-
 drivers/pci/probe.c |   2 +
 drivers/pci/setup-bus.c |   2 +-
 include/linux/ioport.h  |  21 ++-
 kernel/kexec_file.c |   2 +
 kernel/resource.c   | 259 ++--
 20 files changed, 244 insertions(+), 227 deletions(-)

-- 
2.13.6

[PATCH v5 3/4] resource: add walk_system_ram_res_rev()

2018-06-11 Thread Baoquan He

This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM
in reversed order, i.e., from higher to lower.

It will be used in kexec_file code.
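
The intended calling pattern can be sketched in user space as below. This walks a sorted array rather than iomem_resource's sibling list, the flag values RES_SYSRAM and RES_BUSY are made up, and walk_ram_rev(), struct range, struct visit_log and log_visit() are hypothetical names for this illustration, so it shows the contract (highest range first, stop on the first non-zero callback result) and not the function body in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define RES_SYSRAM 0x1u	/* hypothetical flag bits for this sketch */
#define RES_BUSY   0x2u

struct range {
	uint64_t start, end;	/* inclusive */
	unsigned int flags;
};

typedef int (*walk_fn)(const struct range *r, void *arg);

/*
 * Call func on every busy System RAM range overlapping [start, end],
 * highest address first. ranges[] must be sorted by ascending address.
 * Stops and returns func's value as soon as it is non-zero; returns -1
 * when no range matched.
 */
static int walk_ram_rev(const struct range *ranges, size_t n,
			uint64_t start, uint64_t end,
			walk_fn func, void *arg)
{
	const unsigned int want = RES_SYSRAM | RES_BUSY;
	int ret = -1;

	for (size_t i = n; i-- > 0; ) {
		const struct range *r = &ranges[i];

		if ((r->flags & want) != want)
			continue;
		if (r->end < start)
			break;	/* all remaining ranges are lower still */
		if (r->start > end)
			continue;
		ret = func(r, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: record the start address of each visited range. */
struct visit_log {
	uint64_t starts[8];
	size_t count;
};

static int log_visit(const struct range *r, void *arg)
{
	struct visit_log *log = arg;

	if (log->count == 8)
		return 1;	/* log is full: ask the walker to stop */
	log->starts[log->count++] = r->start;
	return 0;
}
```

A kexec_file style caller would pass a callback that returns non-zero once it has found a hole for the buffer, so the highest suitable range wins.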

Signed-off-by: Baoquan He 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Brijesh Singh 
Cc: "Jérôme Glisse" 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: Wei Yang 
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c  | 40 
 2 files changed, 43 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b7456ae889dd..066cc263e2cc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -279,6 +279,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
int (*func)(struct resource *, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(struct resource *, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
void *arg, int (*func)(struct resource *, void *));
 
diff --git a/kernel/resource.c b/kernel/resource.c
index ef9a20b75234..3128ac938f38 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 
@@ -443,6 +445,44 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 }
 
 /*
+ * This function, being a variant of walk_system_ram_res(), calls the @func
+ * callback against all memory ranges of type System RAM which are marked as
+ * IORESOURCE_SYSTEM_RAM and IORESOURCE_BUSY in reversed order, i.e., from
+ * higher to lower.
+ */
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(struct resource *, void *))
+{
+   unsigned long flags;
+   struct resource *res;
+   int ret = -1;
+
+   flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+   read_lock(&resource_lock);
+   list_for_each_entry_reverse(res, &iomem_resource.child, sibling) {
+   if (start >= end)
+   break;
+   if ((res->flags & flags) != flags)
+   continue;
+   if (res->desc != IORES_DESC_NONE)
+   continue;
+   if (res->end < start)
+   break;
+
+   if ((res->end >= start) && (res->start < end)) {
+   ret = (*func)(res, arg);
+   if (ret)
+   break;
+   }
+   end = res->start - 1;
+
+   }
+   read_unlock(&resource_lock);
+   return ret;
+}
+
+/*
  * This function calls the @func callback against all memory ranges, which
  * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  */
-- 
2.13.6

Re: [PATCH v5 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-06-12 Thread Baoquan He

On 06/12/18 at 11:29am, Andy Shevchenko wrote:
> On Tue, Jun 12, 2018 at 6:28 AM, Baoquan He  wrote:
> > reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> > and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> > so that it's shared. Later its code also need be updated using list_head
> > to replace singly linked list.
> 
> While this is a good deduplication of the code, some requirements for
> public functions would be good to satisfy.
> 
> > +/*
> > + * Reparent resource children of pr that conflict with res
> > + * under res, and make res replace those children.
> > + */
> 
> kernel doc format, though...

Will rewrite it, thanks.

> 
> > +static int reparent_resources(struct resource *parent,
> > +struct resource *res)
> 
> ...is it really public with static keyword?!
> 
> 
> 
> > +{
> 
> > +   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> > +   if (p->end < res->start)
> > +   continue;
> > +   if (res->end < p->start)
> > +   break;
> 
> > +   if (p->start < res->start || p->end > res->end)
> > +   return -1;  /* not completely contained */
> 
> Usually we are expecting real error codes.
> 
> > +   if (firstpp == NULL)
> > +   firstpp = pp;
> > +   }
> 
> > +   if (firstpp == NULL)
> > +   return -1;  /* didn't find any conflicting entries? */
> 
> Ditto.
> 
> > +}
> > +EXPORT_SYMBOL(reparent_resources);
> 
> -- 
> With Best Regards,
> Andy Shevchenko

Re: [PATCH v5 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-06-12 Thread Baoquan He

On 06/12/18 at 11:29am, Andy Shevchenko wrote:
> On Tue, Jun 12, 2018 at 6:28 AM, Baoquan He  wrote:
> > reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> > and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> > so that it's shared. Later its code also need be updated using list_head
> > to replace singly linked list.
> 
> While this is a good deduplication of the code, some requirements for
> public functions would be good to satisfy.
> 
> > +/*
> > + * Reparent resource children of pr that conflict with res
> > + * under res, and make res replace those children.
> > + */
> 
> kernel doc format, though...

> 
> > +static int reparent_resources(struct resource *parent,
> > +struct resource *res)
> 
> ...is it really public with static keyword?!

Thanks for looking into this. This is a code bug: I copied the function
and changed it, but forgot to merge the change into my local commit. The
error reported by the test robot in patch 2 was also fixed locally, but I
forgot to merge it into the patch. Will repost to address this.

> 
> 
> 
> > +{
> 
> > +   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> > +   if (p->end < res->start)
> > +   continue;
> > +   if (res->end < p->start)
> > +   break;
> 
> > +   if (p->start < res->start || p->end > res->end)
> > +   return -1;  /* not completely contained */
> 
> Usually we are expecting real error codes.

Hmm, I just copied it from arch/powerpc/kernel/pci-common.c. The
function interface expects an integer return value, and I am not sure
what a real error code looks like here. Could you give more hints? I
will change it accordingly.

> 
> > +   if (firstpp == NULL)
> > +   firstpp = pp;
> > +   }
> 
> > +   if (firstpp == NULL)
> > +   return -1;  /* didn't find any conflicting entries? */
> 
> Ditto.
> 
> > +}
> > +EXPORT_SYMBOL(reparent_resources);
> 
> -- 
> With Best Regards,
> Andy Shevchenko

Re: [PATCH v5 2/4] resource: Use list_head to link sibling resource

2018-06-13 Thread Baoquan He

On 06/12/18 at 05:10pm, Julia Lawall wrote:
> This looks wrong.  After a list iterator, the index variable points to a
> dummy structure.
> 
> julia
> 
> url:
> https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180612-113600
> :: branch date: 7 hours ago
> :: commit date: 7 hours ago
> 
> >> kernel/resource.c:265:17-20: ERROR: invalid reference to the index variable of the iterator on line 253
> 
> # https://github.com/0day-ci/linux/commit/e906f15906750a86913ba2b1f08bad99129d3dfc
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout e906f15906750a86913ba2b1f08bad99129d3dfc
> vim +265 kernel/resource.c
> 
> ^1da177e4 Linus Torvalds 2005-04-16  247
> 5eeec0ec9 Yinghai Lu 2009-12-22  248  static void __release_child_resources(struct resource *r)
> 5eeec0ec9 Yinghai Lu 2009-12-22  249  {
> e906f1590 Baoquan He 2018-06-12  250      struct resource *tmp, *next;
> 5eeec0ec9 Yinghai Lu 2009-12-22  251      resource_size_t size;
> 5eeec0ec9 Yinghai Lu 2009-12-22  252
> e906f1590 Baoquan He 2018-06-12 @253      list_for_each_entry_safe(tmp, next, &r->child, sibling) {
> 5eeec0ec9 Yinghai Lu 2009-12-22  254          tmp->parent = NULL;
> e906f1590 Baoquan He 2018-06-12  255          INIT_LIST_HEAD(&tmp->sibling);


list_del_init(&tmp->sibling);

Thanks, Julia. Here I should use list_del_init(&tmp->sibling) to
replace INIT_LIST_HEAD(&tmp->sibling).

> 5eeec0ec9 Yinghai Lu 2009-12-22  256          __release_child_resources(tmp);
> 5eeec0ec9 Yinghai Lu 2009-12-22  257
> 5eeec0ec9 Yinghai Lu 2009-12-22  258          printk(KERN_DEBUG "release child resource %pR\n", tmp);
> 5eeec0ec9 Yinghai Lu 2009-12-22  259          /* need to restore size, and keep flags */
> 5eeec0ec9 Yinghai Lu 2009-12-22  260          size = resource_size(tmp);
> 5eeec0ec9 Yinghai Lu 2009-12-22  261          tmp->start = 0;
> 5eeec0ec9 Yinghai Lu 2009-12-22  262          tmp->end = size - 1;
> 5eeec0ec9 Yinghai Lu 2009-12-22  263      }
> e906f1590 Baoquan He 2018-06-12  264
> e906f1590 Baoquan He 2018-06-12 @265      INIT_LIST_HEAD(&r->child);
> 5eeec0ec9 Yinghai Lu 2009-12-22  266  }
> 5eeec0ec9 Yinghai Lu 2009-12-22  267
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
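
For readers following the report above, here is a minimal userspace sketch
of the pattern under discussion: unlinking each child with list_del_init()
while iterating safely, so that the parent's child list is already empty
when the loop ends. The list helpers are small re-implementations with the
same semantics as the kernel's, and struct node, node_init(), node_add()
and release_children() are hypothetical names for illustration only; this
is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-implementation of the kernel's list_head basics. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink @entry and leave it as an empty, self-linked list. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical cut-down stand-in for struct resource. */
struct node {
	int id;
	struct node *parent;
	struct list_head child;
	struct list_head sibling;
};

static void node_init(struct node *n, int id)
{
	n->id = id;
	n->parent = NULL;
	INIT_LIST_HEAD(&n->child);
	INIT_LIST_HEAD(&n->sibling);
}

static void node_add(struct node *parent, struct node *child)
{
	child->parent = parent;
	list_add_tail(&child->sibling, &parent->child);
}

/* Detach all descendants of @r; returns how many nodes were detached. */
static int release_children(struct node *r)
{
	struct list_head *pos, *next;
	int released = 0;

	/* "next" is saved before the body runs, so unlinking pos is safe. */
	for (pos = r->child.next, next = pos->next; pos != &r->child;
	     pos = next, next = pos->next) {
		struct node *tmp = container_of(pos, struct node, sibling);

		tmp->parent = NULL;
		list_del_init(&tmp->sibling);
		released += 1 + release_children(tmp);
	}
	/*
	 * Here pos == &r->child, the list head itself. Converting it with
	 * container_of() would not yield a real node, which is the kind of
	 * access the checker warns about.
	 */
	return released;
}
```

Because every child is unlinked inside the loop, the head ends up
self-linked without any extra re-initialisation after the loop.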

[PATCH v6 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-03 Thread Baoquan He

reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
so that it's shared.

Signed-off-by: Baoquan He 
---
 arch/microblaze/pci/pci-common.c | 37 -
 arch/powerpc/kernel/pci-common.c | 35 ---
 include/linux/ioport.h   |  1 +
 kernel/resource.c| 39 +++
 4 files changed, 40 insertions(+), 72 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index f34346d56095..7899bafab064 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
 EXPORT_SYMBOL(pcibios_add_device);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int __init reparent_resources(struct resource *parent,
-struct resource *res)
-{
-   struct resource *p, **pp;
-   struct resource **firstpp = NULL;
-
-   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-   if (p->end < res->start)
-   continue;
-   if (res->end < p->start)
-   break;
-   if (p->start < res->start || p->end > res->end)
-   return -1;  /* not completely contained */
-   if (firstpp == NULL)
-   firstpp = pp;
-   }
-   if (firstpp == NULL)
-   return -1;  /* didn't find any conflicting entries? */
-   res->parent = parent;
-   res->child = *firstpp;
-   res->sibling = *pp;
-   *firstpp = res;
-   *pp = NULL;
-   for (p = res->child; p != NULL; p = p->sibling) {
-   p->parent = res;
-   pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
-p->name,
-(unsigned long long)p->start,
-(unsigned long long)p->end, res->name);
-   }
-   return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733aa..926035bb378d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 EXPORT_SYMBOL(pcibios_align_resource);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int reparent_resources(struct resource *parent,
-struct resource *res)
-{
-   struct resource *p, **pp;
-   struct resource **firstpp = NULL;
-
-   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-   if (p->end < res->start)
-   continue;
-   if (res->end < p->start)
-   break;
-   if (p->start < res->start || p->end > res->end)
-   return -1;  /* not completely contained */
-   if (firstpp == NULL)
-   firstpp = pp;
-   }
-   if (firstpp == NULL)
-   return -1;  /* didn't find any conflicting entries? */
-   res->parent = parent;
-   res->child = *firstpp;
-   res->sibling = *pp;
-   *firstpp = res;
-   *pp = NULL;
-   for (p = res->child; p != NULL; p = p->sibling) {
-   p->parent = res;
-   pr_debug("PCI: Reparented %s %pR under %s\n",
-p->name, p, res->name);
-   }
-   return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..dfdcd0bfe54e 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -192,6 +192,7 @@ extern int allocate_resource(struct resource *root, struct resource *new,
 struct resource *lookup_resource(struct resource *root, resource_size_t start);
 int adjust_resource(struct resource *res, resource_size_t start,
resource_size_t size);
+int reparent_resources(struct resource *parent, struct resource *res);
 resource_size_t resource_alignment(struct resource *res);
 static inline resource_size_t resource_size(const struct resource *res)
 {
diff

[PATCH v6 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-03 Thread Baoquan He

For kexec_file loading, if kexec_buf.top_down is 'true', the memory used
to load kernel/initrd/purgatory is supposed to be allocated from top to
down. This is what the old kexec loading interface has always done, and
the old kexec loading is still the default in some distributions.
However, the current kexec_file loading interface does not behave like
this. The function arch_kexec_walk_mem() it calls does not check
kexec_buf.top_down; it calls walk_system_ram_res() directly to go through
all System RAM resources from bottom to up to find a memory region that
can contain the specific kexec buffer, and then calls
locate_mem_hole_callback() to allocate memory in that region from top to
down. This causes confusion, especially now that KASLR is widely
supported: users have to work out why the kexec/kdump kernel loading
position differs between the two interfaces in order to rule out
unnecessary noise. Hence the two interfaces need to behave the same way.

Here, check whether kexec_buf.top_down is 'true' in arch_kexec_walk_mem();
if so, call the newly added walk_system_ram_res_rev() to search for a
memory region from top to down to load the kernel.

Signed-off-by: Baoquan He 
Cc: Eric Biederman 
Cc: Vivek Goyal 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Yinghai Lu 
Cc: ke...@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c6a3b6851372..75226c1d08ce 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
                                          IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
   crashk_res.start, crashk_res.end,
   kbuf, func);
+   else if (kbuf->top_down)
+   return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
else
return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
-- 
2.13.6
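
As a rough illustration of the behaviour difference described in the
commit message above, the following userspace sketch picks a System RAM
range for a buffer either bottom-up or top-down and then places the buffer
at the top of the chosen range. struct ram_range, pick_range() and
place_at_top() are hypothetical names invented for this example; they are
not kernel interfaces, and the real code walks struct resource entries
rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one System RAM resource: [start, end], inclusive. */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return the index of the first range that can hold @size bytes. The scan
 * goes from the highest range down when @top_down is true, and from the
 * lowest range up otherwise. @ranges must be sorted by ascending address.
 * Returns -1 if no range is large enough.
 */
static int pick_range(const struct ram_range *ranges, size_t n,
		      uint64_t size, bool top_down)
{
	assert(size > 0);

	for (size_t i = 0; i < n; i++) {
		size_t idx = top_down ? n - 1 - i : i;
		uint64_t len = ranges[idx].end - ranges[idx].start + 1;

		if (len >= size)
			return (int)idx;
	}
	return -1;
}

/*
 * Inside the chosen range the buffer goes at the top in both modes, which
 * is how the commit message describes the hole-finding callback.
 */
static uint64_t place_at_top(const struct ram_range *r, uint64_t size)
{
	return r->end - size + 1;
}
```

With three ranges and a 1 MiB buffer, the bottom-up scan settles on the
first range that is large enough, while the top-down scan settles on the
highest one, so the same buffer lands at very different addresses.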

[PATCH v6 3/4] resource: add walk_system_ram_res_rev()

2018-07-03 Thread Baoquan He

This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM
in reversed order, i.e., from higher to lower.

It will be used in kexec_file code.

Signed-off-by: Baoquan He 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Brijesh Singh 
Cc: "Jérôme Glisse" 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: Wei Yang 
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c  | 40 
 2 files changed, 43 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b7456ae889dd..066cc263e2cc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -279,6 +279,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
int (*func)(struct resource *, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(struct resource *, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
void *arg, int (*func)(struct resource *, void *));
 
diff --git a/kernel/resource.c b/kernel/resource.c
index 6d647a3824b1..4c5fbef4ea24 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 
@@ -443,6 +445,44 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 }
 
 /*
+ * This function, being a variant of walk_system_ram_res(), calls the @func
+ * callback against all memory ranges of type System RAM which are marked as
+ * IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY in reversed order, i.e., from
+ * higher to lower.
+ */
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(struct resource *, void *))
+{
+   unsigned long flags;
+   struct resource *res;
+   int ret = -1;
+
+   flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+   read_lock(&resource_lock);
+   list_for_each_entry_reverse(res, &iomem_resource.child, sibling) {
+   if (start >= end)
+   break;
+   if ((res->flags & flags) != flags)
+   continue;
+   if (res->desc != IORES_DESC_NONE)
+   continue;
+   if (res->end < start)
+   break;
+
+   if ((res->end >= start) && (res->start < end)) {
+   ret = (*func)(res, arg);
+   if (ret)
+   break;
+   }
+   end = res->start - 1;
+
+   }
+   read_unlock(&resource_lock);
+   return ret;
+}
+
+/*
  * This function calls the @func callback against all memory ranges, which
  * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  */
-- 
2.13.6
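
The reverse walk added by this patch can be pictured with a small
userspace sketch. It is an approximation only: it walks a sorted array
from the highest entry down instead of a list_head chain, takes no lock,
and uses made-up flag values. struct range, walk_ram_rev(), struct
visit_log and log_visit() are hypothetical names for this example. Like
the patch, it returns -1 when no range matched and otherwise the last
callback result, stopping early on a non-zero result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the IORESOURCE_* bits; values are made up. */
#define FLAG_SYSTEM_RAM 0x1u
#define FLAG_BUSY       0x2u

struct range {
	uint64_t start, end;	/* inclusive bounds */
	unsigned int flags;
};

typedef int (*range_fn)(const struct range *r, void *arg);

/*
 * Call @func on every busy System RAM range overlapping [start, end],
 * from the highest address to the lowest. @tbl is sorted ascending.
 */
static int walk_ram_rev(const struct range *tbl, size_t n,
			uint64_t start, uint64_t end,
			void *arg, range_fn func)
{
	const unsigned int want = FLAG_SYSTEM_RAM | FLAG_BUSY;
	int ret = -1;

	for (size_t i = n; i-- > 0; ) {
		const struct range *r = &tbl[i];

		if ((r->flags & want) != want)
			continue;	/* not busy System RAM */
		if (r->end < start)
			break;		/* this and all lower ranges are below the window */
		if (r->start > end)
			continue;	/* still above the window */

		ret = func(r, arg);
		if (ret)
			break;		/* callback asked to stop */
	}
	return ret;
}

/* Example callback: record the order in which ranges are visited. */
struct visit_log {
	uint64_t starts[8];
	size_t count;
};

static int log_visit(const struct range *r, void *arg)
{
	struct visit_log *seen = arg;

	assert(seen->count < 8);
	seen->starts[seen->count++] = r->start;
	return 0;
}
```

A caller that wants the highest suitable range returns non-zero from the
callback as soon as one range fits, which ends the walk at that range.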

[PATCH v6 0/4] resource: Use list_head to link sibling resource

2018-07-03 Thread Baoquan He

This patchset does the following:
1) Replace struct resource's sibling list, currently a singly linked
list, with list_head. This removes the pointer operations the singly
linked list needs, for better code readability.
2) Based on the list_head replacement, add a new function
walk_system_ram_res_rev() which iterates over iomem_resource's
siblings in reverse order.
3) Change kexec_file loading to search system RAM top down for kernel
loading, using walk_system_ram_res_rev().

Note:
This patchset has only been tested on the x86_64 arch with networking
enabled. One thing to pay attention to is that a root resource's
child member needs to be initialized explicitly, with LIST_HEAD_INIT()
if it is statically defined or INIT_LIST_HEAD() if it is dynamically
defined, just like we do for iomem_resource/ioport_resource, or the
change in get_pci_domain_busn_res().

v5:
http://lkml.kernel.org/r/20180612032831.29747-1-...@redhat.com

v4:
http://lkml.kernel.org/r/20180507063224.24229-1-...@redhat.com

v3:
http://lkml.kernel.org/r/20180419001848.3041-1-...@redhat.com

v2:
http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com

v1:
http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com

Changelog:
v5->v6:
  Fix code style problems in reparent_resources() and use existing
  error codes, according to Andy's suggestion.

  Fix bugs test robot reported.
  
v4->v5:
  Add new patch 0001 to move duplicated reparent_resources() to
  kernel/resource.c to make it be shared by different ARCH-es.

  Fix several code bugs reported by test robot on ARCH powerpc and
  microblaze.
v3->v4:
  Fix several bugs test robot reported. Rewrite cover letter and patch
  log according to reviewer's comment.

v2->v3:
  Rename resource functions first_child() and sibling() to
  resource_first_child() and resource_sibling(). Dan suggested this.

  Move resource_first_child() and resource_sibling() to linux/ioport.h
  and make them inline functions. Rob suggested this. Accordingly, include
  linux/list.h in linux/ioport.h; please help review whether this
  brings efficiency degradation or code redundancy.

  The change to struct resource {} increases its size by two pointers;
  mention this in the git log to be more specific. Rob suggested
  this.

v1->v2:
  Use list_head instead to link resource siblings. This is suggested by
  Andrew.

  Rewrite walk_system_ram_res_rev() after list_head is taken to link
  resource siblings.



Baoquan He (4):
  resource: Move reparent_resources() to kernel/resource.c and make it
public
  resource: Use list_head to link sibling resource
  resource: add walk_system_ram_res_rev()
  kexec_file: Load kernel at top of system RAM if required

 arch/arm/plat-samsung/pm-check.c|   6 +-
 arch/microblaze/pci/pci-common.c|  41 +
 arch/powerpc/kernel/pci-common.c|  39 +
 arch/sparc/kernel/ioport.c  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h|   4 +-
 drivers/eisa/eisa-bus.c |   2 +
 drivers/gpu/drm/drm_memory.c|   3 +-
 drivers/gpu/drm/gma500/gtt.c|   5 +-
 drivers/hv/vmbus_drv.c  |  52 +++---
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c |   6 +-
 drivers/nvdimm/nd.h |   5 +-
 drivers/of/address.c|   4 +-
 drivers/parisc/lba_pci.c|   4 +-
 drivers/pci/controller/vmd.c|   8 +-
 drivers/pci/probe.c |   2 +
 drivers/pci/setup-bus.c |   2 +-
 include/linux/ioport.h  |  21 ++-
 kernel/kexec_file.c |   2 +
 kernel/resource.c   | 263 ++--
 20 files changed, 248 insertions(+), 227 deletions(-)

-- 
2.13.6
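
The note in the cover letter about initializing a root resource's child
member can be sketched in userspace as follows. The list helpers are
minimal re-implementations with the same semantics as the kernel's
LIST_HEAD_INIT(), INIT_LIST_HEAD(), list_empty() and list_add_tail();
struct res, root, res_init() and res_add_child() are hypothetical names
for illustration, not the kernel's struct resource API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace re-implementation of the kernel's list_head basics. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical cut-down resource: only the linkage members. */
struct res {
	const char *name;
	struct res *parent;
	struct list_head child;
	struct list_head sibling;
};

/* Statically defined root: both heads must point at themselves. */
static struct res root = {
	.name = "root",
	.child = LIST_HEAD_INIT(root.child),
	.sibling = LIST_HEAD_INIT(root.sibling),
};

/* Dynamically set up resource: initialise the heads at run time. */
static void res_init(struct res *r, const char *name)
{
	r->name = name;
	r->parent = NULL;
	INIT_LIST_HEAD(&r->child);
	INIT_LIST_HEAD(&r->sibling);
}

static void res_add_child(struct res *parent, struct res *child)
{
	assert(list_empty(&child->sibling));	/* must not already be linked */
	child->parent = parent;
	list_add_tail(&child->sibling, &parent->child);
}
```

A zero-filled list_head has NULL next/prev pointers, so list_empty()
would report "not empty" and list_add_tail() would dereference NULL. That
is why a plain NULL assignment no longer suffices once the members are
list_head.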

Re: [PATCH v5 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-03 Thread Baoquan He

On 07/03/18 at 11:57pm, Andy Shevchenko wrote:
> On Tue, Jul 3, 2018 at 5:55 PM, Baoquan He  wrote:
> > On 06/12/18 at 05:24pm, Andy Shevchenko wrote:
> >> On Tue, Jun 12, 2018 at 5:20 PM, Andy Shevchenko
> >>  wrote:
> 
> >> > I briefly looked at the code and error codes we have, so, my proposal
> >> > is one of the following
> >>
> >> >  - use -ECANCELED (not the best choice for first occurrence here,
> >> > though I can't find better)
> >>
> >> Actually -ENOTSUPP might suit the first case (although the actual
> >> would be something like -EOVERLAP, which we don't have)
> >
> > Sorry for late reply, and many thanks for your great suggestion.
> >
> 
> > I am fine to use -ENOTSUPP as the first returned value, and -ECANCELED
> > for the 2nd one.
> 
> I have no strong opinion, but I like (slightly better) this approach ^^^

Done, posted v6 this way, many thanks.

> 
> > Or define an enum as you suggested inside the function
> > or in header file.
> 
> >
> > Or use -EBUSY for the first case because existing resource is
> > overlapping but not fully contained by 'res'; and -EINVAL for
> > the 2nd case since didn't find any one resources which is contained by
> > 'res', means we passed in a invalid resource.
> >
> > All is fine to me, I can repost with each of them.
> 
> >> >  - use positive integers (or enum), like
> >> >   #define RES_REPARENTED 0
> >> >   #define RES_OVERLAPPED 1
> >> >   #define RES_NOCONFLICT 2
> 
> -- 
> With Best Regards,
> Andy Shevchenko
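
To make the error-code discussion above concrete, here is a userspace
sketch of the two failure cases with the codes that were agreed on:
-ENOTSUPP when an existing child overlaps 'res' without being fully
contained, and -ECANCELED when no child conflicts at all. ENOTSUPP is a
kernel-internal code, so the sketch defines it locally as an assumption;
struct span and check_reparent() are hypothetical names, and the sketch
checks an array rather than relinking a resource tree.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ENOTSUPP
#define ENOTSUPP 524	/* kernel-internal errno, absent from userspace headers */
#endif

/* Hypothetical stand-in for a resource's address range, inclusive. */
struct span {
	uint64_t start, end;
};

/*
 * Find the run of @kids (sorted ascending, non-overlapping) that lies
 * fully inside @res. On success, store the index of the first such child
 * in *first and the number of children in *count, and return 0.
 */
static int check_reparent(const struct span *kids, size_t n,
			  const struct span *res,
			  size_t *first, size_t *count)
{
	size_t found = 0;

	assert(res->start <= res->end);

	for (size_t i = 0; i < n; i++) {
		const struct span *p = &kids[i];

		if (p->end < res->start)
			continue;		/* entirely below res */
		if (res->end < p->start)
			break;			/* entirely above res */
		if (p->start < res->start || p->end > res->end)
			return -ENOTSUPP;	/* overlaps but not contained */
		if (found == 0)
			*first = i;
		found++;
	}
	if (found == 0)
		return -ECANCELED;		/* no conflicting entries */

	*count = found;
	return 0;
}
```

Distinct codes let a caller tell a partial overlap, which is likely a
layout problem, apart from the case where there was simply nothing to
reparent.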

[PATCH v6 2/4] resource: Use list_head to link sibling resource

2018-07-03 Thread Baoquan He

The struct resource uses singly linked list to link siblings, implemented
by pointer operation. Replace it with list_head for better code readability.

Based on this list_head replacement, it will be very easy to do reverse
iteration on iomem_resource's sibling list in later patch.

Besides, the type of struct resource's member variables sibling and child
is changed from 'struct resource *' to 'struct list_head'. This increases
the size of struct resource by two pointers.

Suggested-by: Andrew Morton 
Signed-off-by: Baoquan He 
Cc: Patrik Jakobsson 
Cc: David Airlie 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Dmitry Torokhov 
Cc: Dan Williams 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: Keith Busch 
Cc: Jonathan Derrick 
Cc: Lorenzo Pieralisi 
Cc: Bjorn Helgaas 
Cc: Thomas Gleixner 
Cc: Brijesh Singh 
Cc: "Jérôme Glisse" 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Yaowei Bai 
Cc: Wei Yang 
Cc: de...@linuxdriverproject.org
Cc: linux-in...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: devicet...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
 arch/arm/plat-samsung/pm-check.c|   6 +-
 arch/microblaze/pci/pci-common.c|   4 +-
 arch/powerpc/kernel/pci-common.c|   4 +-
 arch/sparc/kernel/ioport.c  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h|   4 +-
 drivers/eisa/eisa-bus.c |   2 +
 drivers/gpu/drm/drm_memory.c|   3 +-
 drivers/gpu/drm/gma500/gtt.c|   5 +-
 drivers/hv/vmbus_drv.c  |  52 +++
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c |   6 +-
 drivers/nvdimm/nd.h |   5 +-
 drivers/of/address.c|   4 +-
 drivers/parisc/lba_pci.c|   4 +-
 drivers/pci/controller/vmd.c|   8 +-
 drivers/pci/probe.c |   2 +
 drivers/pci/setup-bus.c |   2 +-
 include/linux/ioport.h  |  17 ++-
 kernel/resource.c   | 208 ++--
 19 files changed, 175 insertions(+), 167 deletions(-)

diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
while (ptr != NULL) {
-   if (ptr->child != NULL)
-   s3c_pm_run_res(ptr->child, fn, arg);
+   if (!list_empty(&ptr->child))
+   s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
arg = (fn)(ptr, arg);
}
 
-   ptr = ptr->sibling;
+   ptr = resource_sibling(ptr);
}
 }
 
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 7899bafab064..2bf73e27e231 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -533,7 +533,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
res->flags = range.flags;
res->start = range.cpu_addr;
res->end = range.cpu_addr + range.size - 1;
-   res->parent = res->child = res->sibling = NULL;
+   res->parent = NULL;
+   INIT_LIST_HEAD(&res->child);
+   INIT_LIST_HEAD(&res->sibling);
}
}
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 926035bb378d..28fbe83c9daf 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -761,7 +761,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
res->flags = range.flags;
res->start = range.cpu_addr;
res->end = range.cpu_addr + range.size - 1;
-   res->parent = res->child = res->sibling = NULL;
+   res->parent = NULL;
+   INIT_LIST_HEAD(&res->child);
+   INIT_LIST_HEAD(&res->sibling);
}
}
 }
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index cca9134cfa7d..99efe4e98b16 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -669,7 +669,7 @@ static int sparc_io_proc_show(struct seq_file *m, void *v)
struct resource *root = m->private, *r;
const c

Re: [PATCH v5 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-03 Thread Baoquan He

Hi Andy,

On 06/12/18 at 05:24pm, Andy Shevchenko wrote:
> On Tue, Jun 12, 2018 at 5:20 PM, Andy Shevchenko
>  wrote:
> >> Hmm, I just copied it from arch/powerpc/kernel/pci-common.c. The
> >> function interface expects an integer returned value, not sure what a
> >> real error codes look like, could you give more hints? Will change
> >> accordingly.
> >
> > I briefly looked at the code and error codes we have, so, my proposal
> > is one of the following
> 
> >  - use -ECANCELED (not the best choice for first occurrence here,
> > though I can't find better)
> 
> Actually -ENOTSUPP might suit the first case (although the actual
> would be something like -EOVERLAP, which we don't have)

Sorry for the late reply, and many thanks for your great suggestion.

I am fine with using -ENOTSUPP as the first return value and -ECANCELED
for the 2nd one. Or we could define an enum, as you suggested, inside the
function or in a header file.

Or use -EBUSY for the first case, because an existing resource overlaps
'res' but is not fully contained by it, and -EINVAL for the 2nd case,
since finding no resource contained by 'res' means we passed in an
invalid resource.

Any of these is fine with me; I can repost with whichever you prefer.

Thanks
Baoquan

> 
> >  - use positive integers (or enum), like
> >   #define RES_REPARENTED 0
> >   #define RES_OVERLAPPED 1
> >   #define RES_NOCONFLICT 2
> >
> >
> >>> > +   if (firstpp == NULL)
> >>> > +   firstpp = pp;
> >>> > +   }
> >>>
> >>> > +   if (firstpp == NULL)
> >>> > +   return -1;  /* didn't find any conflicting entries? */
> >>>
> >>> Ditto.
> >
> > Ditto.
> >
> >>>
> >>> > +}
> >>> > +EXPORT_SYMBOL(reparent_resources);
> >
> > --
> > With Best Regards,
> > Andy Shevchenko
> 
> 
> 
> -- 
> With Best Regards,
> Andy Shevchenko

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Baoquan He

On 07/26/18 at 02:59pm, Michal Hocko wrote:
> On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > Kexec has been a formal feature in our distro, and customers owning
> > > > those kind of very large machine can make use of this feature to speed
> > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > search place to put kernel under 4G from top to down. As we know, the
> > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > it. It may have possibility to not be able to find a usable space for
> > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > have this worry. 
> > > 
> > > I do not have the full context here but let me note that you should be
> > > careful when doing top-down reservation because you can easily get into
> > > hotplugable memory and break the hotremove usecase. We even warn when
> > > this is done. See memblock_find_in_range_node
> > 
> > Kexec read kernel/initrd file into buffer, just search usable positions
> > for them to do the later copying. You can see below struct kexec_segment, 
> > for the old kexec_load, kernel/initrd are read into user space buffer,
> > the @buf stores the user space buffer address, @mem stores the position
> > where kernel/initrd will be put. In kernel, it calls
> > kimage_load_normal_segment() to copy user space buffer to intermediate
> > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > are recorded as entries, later when user execute "kexec -e" to trigger
> > kexec jumping, it will do the final copying from the intermediate pages
> > to the real destination pages which @mem pointed. Because we can't touch
> > the existed data in 1st kernel when do kexec kernel loading. With my
> > understanding, GFP_KERNEL will make those intermediate pages be
> > allocated inside immovable area, it won't impact hotplugging. But the
> > @mem we searched in the whole system RAM might be lost along with
> > hotplug. Hence we need do kexec kernel again when hotplug event is
> > detected.
> 
> I am not sure I am following. If @mem is placed at movable node then the
> memory hotremove simply won't work, because we are seeing reserved pages
> and do not know what to do about them. They are not migrateable.
> Allocating intermediate pages from other nodes doesn't really help.

OK, I forgot about the 2nd kernel, the one kexec jumps into. It won't
impact hotremove in the 1st kernel, but it does impact the kernel that
kexec jumps into if that kernel is at the top of system RAM and the top
RAM is in a movable node.

> 
> The memblock code warns exactly for that reason.
> -- 
> Michal Hocko
> SUSE Labs

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Baoquan He

On 07/26/18 at 03:14pm, Michal Hocko wrote:
> On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > > > those kind of very large machine can make use of this feature to speed
> > > > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > > > search place to put kernel under 4G from top to down. As we know, the
> > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > > > > > it. It may have possibility to not be able to find a usable space for
> > > > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > > > have this worry. 
> > > > > > 
> > > > > > I do not have the full context here but let me note that you should be
> > > > > > careful when doing top-down reservation because you can easily get into
> > > > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > > > this is done. See memblock_find_in_range_node
> > > > > 
> > > > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > > > for them to do the later copying. You can see below struct kexec_segment, 
> > > > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > > > the @buf stores the user space buffer address, @mem stores the position
> > > > > where kernel/initrd will be put. In kernel, it calls
> > > > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > > > kexec jumping, it will do the final copying from the intermediate pages
> > > > > to the real destination pages which @mem pointed. Because we can't touch
> > > > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > > allocated inside immovable area, it won't impact hotplugging. But the
> > > > > @mem we searched in the whole system RAM might be lost along with
> > > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > > detected.
> > > > 
> > > > I am not sure I am following. If @mem is placed at movable node then the
> > > > memory hotremove simply won't work, because we are seeing reserved pages
> > > > and do not know what to do about them. They are not migrateable.
> > > > Allocating intermediate pages from other nodes doesn't really help.
> > > 
> > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove
> > > in 1st kernel, it does impact the kernel which kexec jump into if kernel
> > > is at top of system RAM and the top RAM is in movable node.
> > 
> > It will affect the 1st kernel (which does the memblock allocation
> > top-down) as well. For reasons mentioned above.
> 
> And btw. in the ideal world, we would restrict the memblock allocation
> top-down from the non-movable nodes. But I do not think we have that
> information ready at the time when the reservation is done.

Oh, you may be mixing up kexec loading with kdump kernel loading. For
the kdump kernel, we need to reserve a memory region during bootup with
the memblock allocator. For kexec loading, we only operate after the
system is up and do not need to reserve any memory region. The two get
the memory they are loaded into in quite different ways.

Thanks
Baoquan

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Baoquan He

On 07/26/18 at 04:01pm, Michal Hocko wrote:
> On Thu 26-07-18 21:37:05, Baoquan He wrote:
> > On 07/26/18 at 03:14pm, Michal Hocko wrote:
> > > On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > > > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > > > Kexec has been a formal feature in our distro, and customers 
> > > > > > > > > owning
> > > > > > > > > those kind of very large machine can make use of this feature 
> > > > > > > > > to speed
> > > > > > > > > up the reboot process. On uefi machine, the kexec_file 
> > > > > > > > > loading will
> > > > > > > > > search place to put kernel under 4G from top to down. As we 
> > > > > > > > > know, the
> > > > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try 
> > > > > > > > > to consume
> > > > > > > > > it. It may have possibility to not be able to find a usable 
> > > > > > > > > space for
> > > > > > > > > kernel/initrd. From the top down of the whole memory space, 
> > > > > > > > > we don't
> > > > > > > > > have this worry. 
> > > > > > > > 
> > > > > > > > I do not have the full context here but let me note that you 
> > > > > > > > should be
> > > > > > > > careful when doing top-down reservation because you can easily 
> > > > > > > > get into
> > > > > > > > hotplugable memory and break the hotremove usecase. We even 
> > > > > > > > warn when
> > > > > > > > this is done. See memblock_find_in_range_node
> > > > > > > 
> > > > > > > Kexec read kernel/initrd file into buffer, just search usable 
> > > > > > > positions
> > > > > > > for them to do the later copying. You can see below struct 
> > > > > > > kexec_segment, 
> > > > > > > for the old kexec_load, kernel/initrd are read into user space 
> > > > > > > buffer,
> > > > > > > the @buf stores the user space buffer address, @mem stores the 
> > > > > > > position
> > > > > > > where kernel/initrd will be put. In kernel, it calls
> > > > > > > kimage_load_normal_segment() to copy user space buffer to 
> > > > > > > intermediate
> > > > > > > pages which are allocated with flag GFP_KERNEL. These 
> > > > > > > intermediate pages
> > > > > > > are recorded as entries, later when user execute "kexec -e" to 
> > > > > > > trigger
> > > > > > > kexec jumping, it will do the final copying from the intermediate 
> > > > > > > pages
> > > > > > > to the real destination pages which @mem pointed. Because we 
> > > > > > > can't touch
> > > > > > > the existed data in 1st kernel when do kexec kernel loading. With 
> > > > > > > my
> > > > > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > > > > allocated inside immovable area, it won't impact hotplugging. But 
> > > > > > > the
> > > > > > > @mem we searched in the whole system RAM might be lost along with
> > > > > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > > > > detected.
> > > > > > 
> > > > > > I am not sure I am following. If @mem is placed at movable node 
> > > > > > then the
> > > > > > memory hotremove simply won't work, because we are seeing reserved 
> > > > > > pages
> > > > > > and do not know what to do about them. They are not migrateable.
> > > > > > Allocating intermediate pages from other nodes doesn't really help.
> > > > > 
> > > > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact 
> > > > > hotremove
> > > > > in 1st kernel, it does impact the kernel which kexec jump into if 
> > > > > kernel
> > > > > is at top of system RAM and the top RAM is in movable node.
> > > > 
> > > > It will affect the 1st kernel (which does the memblock allocation
> > > > top-down) as well. For reasons mentioned above.
> > > 
> > > And btw. in the ideal world, we would restrict the memblock allocation
> > > top-down from the non-movable nodes. But I do not think we have that
> > > information ready at the time when the reservation is done.
> > 
> > Oh, you could mix kexec loading up with kdump kernel loading. For kdump
> > kernel, we need reserve memory region during bootup with memblock
> > allocator. For kexec loading, we just operate after system up, and do
> > not need to reserve any memmory region. About memory used to load them,
> > it's quite different way.
> 
> I didn't know about that. I thought both use the same underlying
> reservation mechanism. My bad and sorry for the noise.

Not at all. It's truly confusing. I often need to take time to recall
those details.

[PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-17 Thread Baoquan He

For kexec_file loading, if kexec_buf.top_down is 'true', the memory used
to load kernel/initrd/purgatory is supposed to be allocated from the top
down. This is what the old kexec loading interface has always done, and
kexec loading is still the default in some distributions. However, the
current kexec_file loading interface does not behave this way. The
function arch_kexec_walk_mem() that it calls does not check
kexec_buf.top_down; it calls walk_system_ram_res() directly to go through
all System RAM resources from the bottom up, looking for a memory region
that can contain the kexec buffer, and then calls
locate_mem_hole_callback() to allocate memory in that region from the top
down. This causes confusion, especially now that KASLR is widely
supported: users have to work out why the kexec/kdump kernel load
position differs between the two interfaces before they can rule it out
as noise. Hence the two interfaces need to behave the same way.

Add a check in arch_kexec_walk_mem(): if kexec_buf.top_down is 'true',
call the newly added walk_system_ram_res_rev() to search for a memory
region from the top down to load the kernel.

Signed-off-by: Baoquan He 
Cc: Eric Biederman 
Cc: Vivek Goyal 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Yinghai Lu 
Cc: ke...@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c6a3b6851372..75226c1d08ce 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 					   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
 					   crashk_res.start, crashk_res.end,
 					   kbuf, func);
+	else if (kbuf->top_down)
+		return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
 	else
 		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
-- 
2.13.6
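
For readers following along, the mismatch described in the patch log can be reproduced with a minimal userspace sketch. This is not kernel code: `struct region`, `place_top_of_region()` and `find_load_addr()` are hypothetical stand-ins for the System RAM resource list, locate_mem_hole_callback() and arch_kexec_walk_mem(), and alignment, reserved ranges and the crash kernel case are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one System RAM resource (end is inclusive). */
struct region {
	uint64_t start;
	uint64_t end;
};

/*
 * Place a buffer of `size` bytes at the top of one region.
 * Returns the start address, or 0 if the region is too small
 * (0 is used as a "not found" marker in this sketch only).
 */
static uint64_t place_top_of_region(const struct region *r, uint64_t size)
{
	assert(r->end >= r->start);
	if (r->end - r->start + 1 < size)
		return 0;
	return r->end - size + 1;
}

/*
 * Walk the regions (sorted by ascending address) in the requested
 * direction and return the first placement that fits, or 0 if none does.
 */
static uint64_t find_load_addr(const struct region *ram, size_t n,
			       uint64_t size, bool top_down)
{
	for (size_t i = 0; i < n; i++) {
		size_t idx = top_down ? n - 1 - i : i;
		uint64_t addr = place_top_of_region(&ram[idx], size);

		if (addr)
			return addr;
	}
	return 0;
}
```

With RAM at 1M-2G and 4G-10G, a 16M buffer lands just below 2G when the regions are walked bottom up, and just below 10G when top_down selects the reverse walk, even though the placement inside a region is top down in both cases.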

[PATCH v7 0/4] resource: Use list_head to link sibling resource

2018-07-17 Thread Baoquan He

This patchset does the following:
1) Move reparent_resources() to kernel/resource.c to clean up duplicated
   code in arch/microblaze/pci/pci-common.c and
   arch/powerpc/kernel/pci-common.c.
2) Replace struct resource's singly linked sibling list with list_head,
   removing the pointer operations that the singly linked list needs,
   for better code readability.
3) Based on the list_head replacement, add a new function
   walk_system_ram_res_rev() which does reverse iteration over
   iomem_resource's siblings.
4) Change kexec_file loading to search system RAM top down for kernel
   loading, using walk_system_ram_res_rev().

Note:
This patchset has only been tested on x86_64 with networking enabled.
One thing to pay attention to is that a root resource's child member
needs to be initialized explicitly, with LIST_HEAD_INIT() if it is
statically defined or INIT_LIST_HEAD() if it is set up dynamically, just
as is done for iomem_resource/ioport_resource, or in the change to
get_pci_domain_busn_res().
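
The note above can be illustrated with a small userspace sketch. The list helpers below are re-implemented for illustration and only mirror the names of the kernel's list.h macros; `struct res`, `root_res` and `res_init()` are hypothetical. The point is that a zero-filled list_head is not a valid empty list, so both static and dynamic resources need explicit initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace re-implementation of the kernel's list_head basics. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* An empty list is one whose head points back at itself. */
static inline bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical cut-down resource: only the linkage fields. */
struct res {
	struct res *parent;
	struct list_head sibling;
	struct list_head child;
};

/* Statically defined root: initialize in the definition. */
static struct res root_res = {
	.sibling = LIST_HEAD_INIT(root_res.sibling),
	.child   = LIST_HEAD_INIT(root_res.child),
};

/* Dynamically set up resource: initialize at run time. */
static void res_init(struct res *r)
{
	assert(r != NULL);
	r->parent = NULL;
	INIT_LIST_HEAD(&r->sibling);
	INIT_LIST_HEAD(&r->child);
}
```

A resource that is merely zeroed has NULL next/prev pointers, so list_empty() reports it as non-empty and any list walk starting from it would dereference NULL.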

v6:
http://lkml.kernel.org/r/20180704041038.8190-1-...@redhat.com

v5:
http://lkml.kernel.org/r/20180612032831.29747-1-...@redhat.com

v4:
http://lkml.kernel.org/r/20180507063224.24229-1-...@redhat.com

v3:
http://lkml.kernel.org/r/20180419001848.3041-1-...@redhat.com

v2:
http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com

v1:
http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com

Changelog:
v6->v7:
  Fix code bugs that test robot reported on mips and ia64.

  Add error code description in reparent_resources() according to
  Andy's comment, and fix minor log typo.
v5->v6:
  Fix code style problems in reparent_resources() and use existing
  error codes, according to Andy's suggestion.

  Fix bugs test robot reported.

v4->v5:
  Add new patch 0001 to move duplicated reparent_resources() to
  kernel/resource.c to make it be shared by different ARCH-es.

  Fix several code bugs reported by test robot on ARCH powerpc and
  microblaze.
v3->v4:
  Fix several bugs test robot reported. Rewrite cover letter and patch
  log according to reviewer's comment.

v2->v3:
  Rename resource functions first_child() and sibling() to
  resource_first_child() and resource_sibling(). Dan suggested this.

  Move resource_first_child() and resource_sibling() to linux/ioport.h
  and make them inline functions. Rob suggested this. Accordingly,
  include linux/list.h in linux/ioport.h; please help review whether
  this brings efficiency degradation or code redundancy.

  The change to struct resource {} increases its size by two pointers;
  mention this in the git log to be more specific. Rob suggested
  this.

v1->v2:
  Use list_head instead to link resource siblings. This is suggested by
  Andrew.

  Rewrite walk_system_ram_res_rev() after list_head is used to link
  resource siblings.

Baoquan He (4):
  resource: Move reparent_resources() to kernel/resource.c and make it
public
  resource: Use list_head to link sibling resource
  resource: add walk_system_ram_res_rev()
  kexec_file: Load kernel at top of system RAM if required

 arch/arm/plat-samsung/pm-check.c|   6 +-
 arch/ia64/sn/kernel/io_init.c   |   2 +-
 arch/microblaze/pci/pci-common.c|  41 +
 arch/mips/pci/pci-rc32434.c |  12 +-
 arch/powerpc/kernel/pci-common.c|  39 +---
 arch/sparc/kernel/ioport.c  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h|   4 +-
 drivers/eisa/eisa-bus.c |   2 +
 drivers/gpu/drm/drm_memory.c|   3 +-
 drivers/gpu/drm/gma500/gtt.c|   5 +-
 drivers/hv/vmbus_drv.c  |  52 +++---
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c |   6 +-
 drivers/nvdimm/nd.h |   5 +-
 drivers/of/address.c|   4 +-
 drivers/parisc/lba_pci.c|   4 +-
 drivers/pci/controller/vmd.c|   8 +-
 drivers/pci/probe.c |   2 +
 drivers/pci/setup-bus.c |   2 +-
 include/linux/ioport.h  |  21 ++-
 kernel/kexec_file.c |   2 +
 kernel/resource.c   | 266 ++--
 22 files changed, 260 insertions(+), 232 deletions(-)

-- 
2.13.6

[PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-17 Thread Baoquan He

reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
so that it's shared.

Reviewed-by: Andy Shevchenko 
Signed-off-by: Baoquan He 
Cc: Michal Simek 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/microblaze/pci/pci-common.c | 37 ---
 arch/powerpc/kernel/pci-common.c | 35 -
 include/linux/ioport.h   |  1 +
 kernel/resource.c| 42 
 4 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index f34346d56095..7899bafab064 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
 EXPORT_SYMBOL(pcibios_add_device);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int __init reparent_resources(struct resource *parent,
-struct resource *res)
-{
-   struct resource *p, **pp;
-   struct resource **firstpp = NULL;
-
-   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-   if (p->end < res->start)
-   continue;
-   if (res->end < p->start)
-   break;
-   if (p->start < res->start || p->end > res->end)
-   return -1;  /* not completely contained */
-   if (firstpp == NULL)
-   firstpp = pp;
-   }
-   if (firstpp == NULL)
-   return -1;  /* didn't find any conflicting entries? */
-   res->parent = parent;
-   res->child = *firstpp;
-   res->sibling = *pp;
-   *firstpp = res;
-   *pp = NULL;
-   for (p = res->child; p != NULL; p = p->sibling) {
-   p->parent = res;
-   pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
-p->name,
-(unsigned long long)p->start,
-(unsigned long long)p->end, res->name);
-   }
-   return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733aa..926035bb378d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 EXPORT_SYMBOL(pcibios_align_resource);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int reparent_resources(struct resource *parent,
-struct resource *res)
-{
-   struct resource *p, **pp;
-   struct resource **firstpp = NULL;
-
-   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-   if (p->end < res->start)
-   continue;
-   if (res->end < p->start)
-   break;
-   if (p->start < res->start || p->end > res->end)
-   return -1;  /* not completely contained */
-   if (firstpp == NULL)
-   firstpp = pp;
-   }
-   if (firstpp == NULL)
-   return -1;  /* didn't find any conflicting entries? */
-   res->parent = parent;
-   res->child = *firstpp;
-   res->sibling = *pp;
-   *firstpp = res;
-   *pp = NULL;
-   for (p = res->child; p != NULL; p = p->sibling) {
-   p->parent = res;
-   pr_debug("PCI: Reparented %s %pR under %s\n",
-p->name, p, res->name);
-   }
-   return 0;
-}
-
-/*
  *  Handle resources of PCI devices.  If the world were perfect, we could
  *  just allocate all the resource regions and do nothing more.  It isn't.
  *  On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..dfdcd0bfe54e 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -192,6 +192,7 @@ extern int allocate_resource(struct resource *root, struct resource *new,
 struct resource *lookup_resource(struct resource *root, resource_size_t start);
 int adjust_resource(struct resource *res, resource_size_t start,
resource_size_t size);
+int reparent_resources(struct resource *parent, struct res

[PATCH v7 2/4] resource: Use list_head to link sibling resource

2018-07-17 Thread Baoquan He

struct resource uses a singly linked list to link siblings, implemented
with open-coded pointer operations. Replace it with list_head for better
code readability.

With this list_head replacement, it becomes very easy to iterate over
iomem_resource's sibling list in reverse in a later patch.

Besides, the type of struct resource's member variables sibling and
child is changed from 'struct resource *' to 'struct list_head'. This
increases the size of struct resource by two pointers.
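
The size increase can be checked with a small userspace sketch. The two structs below are hypothetical and keep only the linkage fields of struct resource; the list_head definition is re-implemented for illustration. It assumes a common ABI where all object pointers have the same size.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/* Linkage fields before the change: three plain pointers. */
struct res_links_old {
	struct res_links_old *parent, *sibling, *child;
};

/* Linkage fields after the change: one pointer plus two list_heads. */
struct res_links_new {
	struct res_links_new *parent;
	struct list_head sibling, child;
};

/* Growth in bytes caused by switching sibling/child to list_head. */
static size_t res_links_growth(void)
{
	return sizeof(struct res_links_new) - sizeof(struct res_links_old);
}
```

On an LP64 system such as x86_64 this comes to 16 bytes per struct resource.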

Suggested-by: Andrew Morton 
Signed-off-by: Baoquan He 
Cc: Patrik Jakobsson 
Cc: David Airlie 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Dmitry Torokhov 
Cc: Dan Williams 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: Keith Busch 
Cc: Jonathan Derrick 
Cc: Lorenzo Pieralisi 
Cc: Bjorn Helgaas 
Cc: Thomas Gleixner 
Cc: Brijesh Singh 
Cc: "Jérôme Glisse" 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Yaowei Bai 
Cc: Wei Yang 
Cc: de...@linuxdriverproject.org
Cc: linux-in...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: devicet...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: Michal Simek 
Cc: Benjamin Herrenschmidt
  
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linux-m...@linux-mips.org
---
 arch/arm/plat-samsung/pm-check.c|   6 +-
 arch/ia64/sn/kernel/io_init.c   |   2 +-
 arch/microblaze/pci/pci-common.c|   4 +-
 arch/mips/pci/pci-rc32434.c |  12 +-
 arch/powerpc/kernel/pci-common.c|   4 +-
 arch/sparc/kernel/ioport.c  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h|   4 +-
 drivers/eisa/eisa-bus.c |   2 +
 drivers/gpu/drm/drm_memory.c|   3 +-
 drivers/gpu/drm/gma500/gtt.c|   5 +-
 drivers/hv/vmbus_drv.c  |  52 +++
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c |   6 +-
 drivers/nvdimm/nd.h |   5 +-
 drivers/of/address.c|   4 +-
 drivers/parisc/lba_pci.c|   4 +-
 drivers/pci/controller/vmd.c|   8 +-
 drivers/pci/probe.c |   2 +
 drivers/pci/setup-bus.c |   2 +-
 include/linux/ioport.h  |  17 ++-
 kernel/resource.c   | 206 ++--
 21 files changed, 183 insertions(+), 171 deletions(-)

diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
while (ptr != NULL) {
-   if (ptr->child != NULL)
-   s3c_pm_run_res(ptr->child, fn, arg);
+   if (!list_empty(&ptr->child))
+   s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
arg = (fn)(ptr, arg);
}
 
-   ptr = ptr->sibling;
+   ptr = resource_sibling(ptr);
}
 }
 
diff --git a/arch/ia64/sn/kernel/io_init.c b/arch/ia64/sn/kernel/io_init.c
index d63809a6adfa..338a7b7f194d 100644
--- a/arch/ia64/sn/kernel/io_init.c
+++ b/arch/ia64/sn/kernel/io_init.c
@@ -192,7 +192,7 @@ sn_io_slot_fixup(struct pci_dev *dev)
 * if it's already in the device structure, remove it before
 * inserting
 */
-   if (res->parent && res->parent->child)
+   if (res->parent && !list_empty(&res->parent->child))
release_resource(res);
 
if (res->flags & IORESOURCE_IO)
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 7899bafab064..2bf73e27e231 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -533,7 +533,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
res->flags = range.flags;
res->start = range.cpu_addr;
res->end = range.cpu_addr + range.size - 1;
-   res->parent = res->child = res->sibling = NULL;
+   res->parent = NULL;
+   INIT_LIST_HEAD(&res->child);
+   INIT_LIST_HEAD(&res->sibling);
}
}
 
diff --git a/arch/mips/pci/pci-rc32434.c b/arch/mips/pci/pci-rc32434.c
index 7f6ce6d734c0..e80283df7

[PATCH v7 3/4] resource: add walk_system_ram_res_rev()

2018-07-17 Thread Baoquan He

This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM
in reversed order, i.e., from higher to lower.

It will be used in kexec_file code.

Signed-off-by: Baoquan He 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Brijesh Singh 
Cc: "Jérôme Glisse" 
Cc: Borislav Petkov 
Cc: Tom Lendacky 
Cc: Wei Yang 
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c  | 40 
 2 files changed, 43 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b7456ae889dd..066cc263e2cc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -279,6 +279,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
int (*func)(struct resource *, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(struct resource *, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
void *arg, int (*func)(struct resource *, void *));
 
diff --git a/kernel/resource.c b/kernel/resource.c
index c96e58d3d2f8..3e18f24b90c4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 
@@ -443,6 +445,44 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 }
 
 /*
+ * This function, being a variant of walk_system_ram_res(), calls the @func
+ * callback against all memory ranges of type System RAM which are marked as
+ * IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY in reversed order, i.e., from
+ * higher to lower.
+ */
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(struct resource *, void *))
+{
+   unsigned long flags;
+   struct resource *res;
+   int ret = -1;
+
+   flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+   read_lock(&resource_lock);
+   list_for_each_entry_reverse(res, &iomem_resource.child, sibling) {
+   if (start >= end)
+   break;
+   if ((res->flags & flags) != flags)
+   continue;
+   if (res->desc != IORES_DESC_NONE)
+   continue;
+   if (res->end < start)
+   break;
+
+   if ((res->end >= start) && (res->start < end)) {
+   ret = (*func)(res, arg);
+   if (ret)
+   break;
+   }
+   end = res->start - 1;
+
+   }
+   read_unlock(&resource_lock);
+   return ret;
+}
+
+/*
  * This function calls the @func callback against all memory ranges, which
  * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  */
-- 
2.13.6
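
The reverse walk can be tried outside the kernel with a minimal userspace sketch. It is a simplification, not the function from the patch: there is no locking and no flags/desc filtering, and `struct ram_res`, `walk_ram_rev()`, `struct fit_req` and `first_fit()` are hypothetical names. The list helpers are re-implemented here and only mirror the kernel's list.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Append `entry` at the tail of the list headed by `head`. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical RAM range, linked in ascending address order. */
struct ram_res {
	uint64_t start, end;		/* inclusive */
	struct list_head sibling;
};

typedef int (*walk_fn)(struct ram_res *res, void *arg);

/*
 * Call func on every range overlapping [start, end], from the highest
 * address to the lowest. Stops when func returns nonzero and returns
 * that value; returns -1 if func was never called.
 */
static int walk_ram_rev(struct list_head *head, uint64_t start,
			uint64_t end, void *arg, walk_fn func)
{
	int ret = -1;

	assert(head != NULL && func != NULL);
	for (struct list_head *p = head->prev; p != head; p = p->prev) {
		struct ram_res *res = container_of(p, struct ram_res, sibling);

		if (res->end < start)
			break;		/* all remaining ranges are lower */
		if (res->start > end)
			continue;	/* still above the window */
		ret = func(res, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: remember the first range of at least `size` bytes. */
struct fit_req {
	uint64_t size;
	struct ram_res *found;
};

static int first_fit(struct ram_res *res, void *arg)
{
	struct fit_req *req = arg;

	if (res->end - res->start + 1 < req->size)
		return 0;		/* too small, keep walking */
	req->found = res;
	return 1;			/* stop the walk */
}
```

Because the list is doubly linked, the walk starts at head->prev and follows prev pointers; with a singly linked sibling list the same traversal would need either recursion or repeated scans from the front.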

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-24 Thread Baoquan He

Hi Andrew,

On 07/19/18 at 12:44pm, Andrew Morton wrote:
> On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He  wrote:
> > > As far as I can tell, the above is the whole reason for the patchset,
> > > yes?  To avoid confusing users.
> > 
> > 
> > In fact, it's not just trying to avoid confusing users. Kexec loading
> > and kexec_file loading are just do the same thing in essence. Just we
> > need do kernel image verification on uefi system, have to port kexec
> > loading code to kernel. 
> > 
> > Kexec has been a formal feature in our distro, and customers owning
> > those kind of very large machine can make use of this feature to speed
> > up the reboot process. On uefi machine, the kexec_file loading will
> > search place to put kernel under 4G from top to down. As we know, the
> > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > it. It may have possibility to not be able to find a usable space for
> > kernel/initrd. From the top down of the whole memory space, we don't
> > have this worry. 
> > 
> > And at the first post, I just posted below with AKASHI's
> > walk_system_ram_res_rev() version. Later you suggested to use
> > list_head to link child sibling of resource, see what the code change
> > looks like.
> > http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com
> > 
> > Then I posted v2
> > http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com
> > Rob Herring mentioned that other components which has this tree struct
> > have planned to do the same thing, replacing the singly linked list with
> > list_head to link resource child sibling. Just quote Rob's words as
> > below. I think this could be another reason.
> > 
> > ~ From Rob
> > The DT struct device_node also has the same tree structure with
> > parent, child, sibling pointers and converting to list_head had been
> > on the todo list for a while. ACPI also has some tree walking
> > functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
> > common tree struct and helpers defined either on top of list_head or a
> > ~
> > new struct if that saves some size.
> 
> Please let's get all this into the changelogs?

Sorry for the late reply; I was busy with some urgent customer hotplug
issues.

I am rewriting all the change logs and the cover letter, and found I was
wrong about the 2nd reason. The current kexec_file_load calls
kexec_locate_mem_hole() to go through all system RAM regions; if a
region is larger than the size of the kernel or initrd, it searches for
a position in that region from the top down. Since kexec will jump to
the 2nd kernel and doesn't need to care about the 1st kernel's data, we
can always find a usable space under 4G to load the kexec kernel/initrd.

So the only reason for this patch is to stay consistent with kexec_load
and avoid confusion.

And since x86 5-level paging mode has been added, we have another issue
with top-down searching in the whole system RAM: we support switching
dynamically between 4-level and 5-level. Namely, for a kernel compiled
with 5-level support, we can add 'no5lvl' to force 4-level. Consider
jumping from a 5-level kernel to a 4-level kernel: we load the kernel at
the top of system RAM in 5-level paging mode, which might be above 64TB,
and then try to jump to a 4-level kernel with an upper limit of 64TB.
For this case, we need to add a limit for kexec kernel loading when
running a 5-level kernel.
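
To make the idea of such a limit concrete, here is a minimal userspace sketch of a top-down placement clamped to an upper bound. It is only an illustration, not proposed kernel code: `place_below_limit()` and the `TIB` macro are hypothetical, and the bound is passed in by the caller rather than derived from the paging mode.

```c
#include <assert.h>
#include <stdint.h>

#define TIB (1ULL << 40)

/*
 * Return the highest start address for a `size`-byte buffer that lies
 * inside [start, end] (inclusive) and ends below `limit` (exclusive).
 * Returns 0 if the buffer does not fit (0 is a "not found" marker in
 * this sketch only).
 */
static uint64_t place_below_limit(uint64_t start, uint64_t end,
				  uint64_t size, uint64_t limit)
{
	assert(size > 0 && start <= end);

	if (limit == 0 || start >= limit)
		return 0;		/* region is entirely above the limit */
	if (end > limit - 1)
		end = limit - 1;	/* clamp the region to the limit */
	if (end - start < size - 1)
		return 0;		/* clamped region is too small */
	return end - size + 1;
}
```

For a region spanning 60T-70T and a 1G buffer, a 64T limit moves the placement from just below 70T to just below 64T; a region that starts above the limit yields no placement at all.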

All this mess makes me hesitate to choose such a delicate method. Maybe
I should drop this patchset.

> 
> > > 
> > > Is that sufficient?  Can we instead simplify their lives by providing
> > > better documentation or informative printks or better Kconfig text,
> > > etc?
> > > 
> > > And who *are* the people who are performing this configuration?  Random
> > > system administrators?  Linux distro engineers?  If the latter then
> > > they presumably aren't easily confused!
> > 
> > Kexec was invented for kernel developer to speed up their kernel
> > rebooting. Now high end sever admin, kernel developer and QE are also
> > keen to use it to reboot large box for faster feature testing, bug
> > debugging. Kernel dev could know this well, about kernel loading
> > position, admin or QE might not be aware of it very well. 
> > 
> > > 
> > > In other words, I'm trying to understand how much benefit this patchset
> > > will provide to our users as a whole.
> > 
> > Understood. The list_head replacing patch truly involes too many code
> > changes, it's risky. I am willing to try any idea from reviewers, won't
> > persuit they have to be accepted finally. If don't have a try, we don't
> > know what it looks like, and what impact it may have. I am fine to take
> > AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
> > though it could be a little bit low efficient.
> 
> The larger patch produces a better result.  We can handle it ;)

As for this, if we stop changing the kexec top-down searching code, I am
not sure whether we should post the list_head replacement patches
separately.

Thanks
Baoquan

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-19 Thread Baoquan He

Hi Andrew,

On 07/18/18 at 03:33pm, Andrew Morton wrote:
> On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He  wrote:
> 
> > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > top to down. This is what we have been doing all along in the old kexec
> > loading interface and the kexec loading is still default setting in some
> > distributions. However, the current kexec_file loading interface doesn't
> > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > all resources of System RAM from bottom to up, to try to find memory region
> > which can contain the specific kexec buffer, then call 
> > locate_mem_hole_callback()
> > to allocate memory in that found memory region from top to down. This brings
> > confusion especially when KASLR is widely supported , users have to make 
> > clear
> > why kexec/kdump kernel loading position is different between these two
> > interfaces in order to exclude unnecessary noises. Hence these two 
> > interfaces
> > need be unified on behaviour.
> 
> As far as I can tell, the above is the whole reason for the patchset,
> yes?  To avoid confusing users.

In fact, it's not just about avoiding user confusion. Kexec loading
and kexec_file loading do the same thing in essence. It's just that we
need to do kernel image verification on UEFI systems, so the kexec
loading code had to be ported into the kernel.

Kexec has been a formal feature in our distro, and customers owning
this kind of very large machine can use it to speed up the reboot
process. On a UEFI machine, kexec_file loading searches for a place to
put the kernel under 4G, from the top down. As we know, the first 4G is
the DMA32 zone; DMA, PCI mmcfg, BIOS etc. all try to consume it. It may
not be possible to find a usable space there for kernel/initrd.
Searching the whole memory space from the top down, we don't have this
worry.

In the first post, I used AKASHI's version of
walk_system_ram_res_rev(). Later you suggested using list_head to link
a resource's children and siblings, to see what the code change looks
like.
http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com

Then I posted v2:
http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com
Rob Herring mentioned that other components which have this tree
structure had planned to do the same thing, replacing the singly linked
list with list_head to link child and sibling resources. I quote Rob's
words below. I think this could be another reason.

~ From Rob
The DT struct device_node also has the same tree structure with
parent, child, sibling pointers and converting to list_head had been
on the todo list for a while. ACPI also has some tree walking
functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
common tree struct and helpers defined either on top of list_head or a
new struct if that saves some size.
~

> 
> Is that sufficient?  Can we instead simplify their lives by providing
> better documentation or informative printks or better Kconfig text,
> etc?
> 
> And who *are* the people who are performing this configuration?  Random
> system administrators?  Linux distro engineers?  If the latter then
> they presumably aren't easily confused!

Kexec was invented for kernel developers to speed up their kernel
reboots. Now high-end server admins, kernel developers and QE are also
keen to use it to reboot large boxes for faster feature testing and bug
debugging. Kernel developers may know well where the kernel gets
loaded; admins or QE might not be so aware of it.

> 
> In other words, I'm trying to understand how much benefit this patchset
> will provide to our users as a whole.

Understood. The list_head replacement patch truly involves too many
code changes; it's risky. I am willing to try any idea from reviewers
and won't insist that the result be accepted in the end. If we don't
try, we won't know what it looks like and what impact it may have. I am
fine with taking AKASHI's simple version of walk_system_ram_res_rev()
to lower the risk, even though it could be a little less efficient.

Thanks
Baoquan

Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-19 Thread Baoquan He

On 07/18/18 at 07:37pm, Andy Shevchenko wrote:
> On Wed, Jul 18, 2018 at 7:36 PM, Andy Shevchenko
>  wrote:
> > On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He  wrote:
> >> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> >> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> >> so that it's shared.
> 
> >> + * Returns 0 on success, -ENOTSUPP if child resource is not completely
> >> + * contained by 'res', -ECANCELED if no any conflicting entry found.
> 
> You also can refer to constants by prefixing them with %, e.g. %-ENOTSUPP.
> But this is up to you completely.

Thanks, will fix when repost.
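
For reference, a userspace sketch of the return convention discussed above (0 on success, -ENOTSUPP when a child is not completely contained, -ECANCELED when no conflicting entry is found). It assumes the pointer-linked form of struct resource and a cut-down struct with only the fields the loop needs; ENOTSUPP is defined locally because userspace <errno.h> does not provide the kernel-internal value. It is an illustration, not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* ENOTSUPP is a kernel-internal errno (524); userspace headers lack it. */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif

/* Cut-down stand-in for struct resource (pointer-linked form). */
struct resource {
	unsigned long start, end;
	struct resource *parent, *sibling, *child;
};

/*
 * Move the children of 'parent' that overlap 'res' underneath 'res',
 * and put 'res' in their place in the parent's child list.
 */
static int reparent_resources(struct resource *parent, struct resource *res)
{
	struct resource *p, **pp;
	struct resource **firstpp = NULL;

	assert(parent != NULL && res != NULL);

	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
		if (p->end < res->start)
			continue;		/* entirely below 'res' */
		if (res->end < p->start)
			break;			/* entirely above 'res' */
		if (p->start < res->start || p->end > res->end)
			return -ENOTSUPP;	/* not completely contained */
		if (firstpp == NULL)
			firstpp = pp;		/* first conflicting child */
	}
	if (firstpp == NULL)
		return -ECANCELED;		/* no conflicting entry found */

	res->parent = parent;
	res->child = *firstpp;
	res->sibling = *pp;
	*firstpp = res;
	*pp = NULL;
	for (p = res->child; p != NULL; p = p->sibling)
		p->parent = res;
	return 0;
}
```

The two error codes let a caller tell "the window only partially covers a child" apart from "nothing to reparent", which the old -1 return could not.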

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-25 Thread Baoquan He

On 07/23/18 at 04:34pm, Michal Hocko wrote:
> On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > Kexec has been a formal feature in our distro, and customers owning
> > those kind of very large machine can make use of this feature to speed
> > up the reboot process. On uefi machine, the kexec_file loading will
> > search place to put kernel under 4G from top to down. As we know, the
> > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > it. It may have possibility to not be able to find a usable space for
> > kernel/initrd. From the top down of the whole memory space, we don't
> > have this worry. 
> 
> I do not have the full context here but let me note that you should be
> careful when doing top-down reservation because you can easily get into
> hotplugable memory and break the hotremove usecase. We even warn when
> this is done. See memblock_find_in_range_node

Kexec reads the kernel/initrd files into buffers and only searches for
usable positions to copy them to later. See struct kexec_segment below.
For the old kexec_load, kernel/initrd are read into a user space buffer;
@buf stores the user space buffer address and @mem stores the position
where the kernel/initrd will be put. In the kernel,
kimage_load_normal_segment() copies the user space buffer to intermediate
pages, which are allocated with GFP_KERNEL. These intermediate pages
are recorded as entries. Later, when the user executes "kexec -e" to
trigger the kexec jump, the final copy is done from the intermediate
pages to the real destination pages that @mem points to. This is because
we can't touch the existing data of the 1st kernel while loading the
kexec kernel. As I understand it, GFP_KERNEL makes those intermediate
pages be allocated inside the immovable area, so it won't impact
hotplugging. But the @mem we found by searching the whole system RAM
might be lost along with hotplug. Hence we need to load the kexec kernel
again when a hotplug event is detected.

#define KEXEC_CONTROL_MEMORY_GFP (GFP_KERNEL | __GFP_NORETRY)

struct kexec_segment {
	/*
	 * This pointer can point to user memory if kexec_load() system
	 * call is used or will point to kernel memory if
	 * kexec_file_load() system call is used.
	 *
	 * Use ->buf when expecting to deal with user memory and use ->kbuf
	 * when expecting to deal with kernel memory.
	 */
	union {
		void __user *buf;
		void *kbuf;
	};
	size_t bufsz;

	unsigned long mem;
	size_t memsz;
};
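
The two-stage copy described above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: seg_model, seg_load() and seg_exec() are hypothetical names, a flat byte array stands in for system RAM, and calloc() stands in for the GFP_KERNEL intermediate pages. It is not how the kernel code is written.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical model of one segment; not the kernel struct. */
struct seg_model {
	const void *buf;	/* source buffer filled by the loader */
	size_t bufsz;		/* bytes valid in buf */
	unsigned long mem;	/* destination offset inside the "RAM" array */
	size_t memsz;		/* bytes reserved at the destination */
	void *staging;		/* intermediate copy made at load time */
};

/* Load time: copy into staging memory; the destination is not touched. */
static int seg_load(struct seg_model *s)
{
	if (s->memsz == 0 || s->bufsz > s->memsz)
		return -1;
	s->staging = calloc(1, s->memsz);	/* tail beyond bufsz stays zero */
	if (s->staging == NULL)
		return -1;
	memcpy(s->staging, s->buf, s->bufsz);
	return 0;
}

/* Exec time ("kexec -e"): final copy from staging to the destination. */
static void seg_exec(struct seg_model *s, unsigned char *ram)
{
	assert(s->staging != NULL);
	memcpy(ram + s->mem, s->staging, s->memsz);
	free(s->staging);
	s->staging = NULL;
}
```

In this model the destination range is only written in seg_exec(), which mirrors why the data of the running kernel stays intact during loading, and why a destination chosen at load time can become stale if memory is hot-removed before the jump.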

Thanks
Baoquan

Re: [PATCH v6 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-05 Thread Baoquan He

On 07/04/18 at 07:46pm, Andy Shevchenko wrote:
> On Wed, Jul 4, 2018 at 7:10 AM, Baoquan He  wrote:
> > reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> > and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> > so that it's shared.
> 
> With couple of comments below,

Sure, I will fix these and repost with your Reviewed-by, thanks. That
should be after I reproduce and fix the issues reported by the test robot.

> 
> Reviewed-by: Andy Shevchenko 
> 
> P.S. In some commit message in this series you used 'likt' instead of 'like'.
> 
> >
> > Signed-off-by: Baoquan He 
> > ---
> >  arch/microblaze/pci/pci-common.c | 37 -
> >  arch/powerpc/kernel/pci-common.c | 35 ---
> >  include/linux/ioport.h   |  1 +
> >  kernel/resource.c| 39 
> > +++
> >  4 files changed, 40 insertions(+), 72 deletions(-)
> >
> > diff --git a/arch/microblaze/pci/pci-common.c 
> > b/arch/microblaze/pci/pci-common.c
> > index f34346d56095..7899bafab064 100644
> > --- a/arch/microblaze/pci/pci-common.c
> > +++ b/arch/microblaze/pci/pci-common.c
> > @@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
> >  EXPORT_SYMBOL(pcibios_add_device);
> >
> >  /*
> > - * Reparent resource children of pr that conflict with res
> > - * under res, and make res replace those children.
> > - */
> > -static int __init reparent_resources(struct resource *parent,
> > -struct resource *res)
> > -{
> > -   struct resource *p, **pp;
> > -   struct resource **firstpp = NULL;
> > -
> > -   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> > -   if (p->end < res->start)
> > -   continue;
> > -   if (res->end < p->start)
> > -   break;
> > -   if (p->start < res->start || p->end > res->end)
> > -   return -1;  /* not completely contained */
> > -   if (firstpp == NULL)
> > -   firstpp = pp;
> > -   }
> > -   if (firstpp == NULL)
> > -   return -1;  /* didn't find any conflicting entries? */
> > -   res->parent = parent;
> > -   res->child = *firstpp;
> > -   res->sibling = *pp;
> > -   *firstpp = res;
> > -   *pp = NULL;
> > -   for (p = res->child; p != NULL; p = p->sibling) {
> > -   p->parent = res;
> > -   pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
> > -p->name,
> > -(unsigned long long)p->start,
> > -(unsigned long long)p->end, res->name);
> > -   }
> > -   return 0;
> > -}
> > -
> > -/*
> >   *  Handle resources of PCI devices.  If the world were perfect, we could
> >   *  just allocate all the resource regions and do nothing more.  It isn't.
> >   *  On the other hand, we cannot just re-allocate all devices, as it would
> > diff --git a/arch/powerpc/kernel/pci-common.c 
> > b/arch/powerpc/kernel/pci-common.c
> > index fe9733aa..926035bb378d 100644
> > --- a/arch/powerpc/kernel/pci-common.c
> > +++ b/arch/powerpc/kernel/pci-common.c
> > @@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, 
> > const struct resource *res,
> >  EXPORT_SYMBOL(pcibios_align_resource);
> >
> >  /*
> > - * Reparent resource children of pr that conflict with res
> > - * under res, and make res replace those children.
> > - */
> > -static int reparent_resources(struct resource *parent,
> > -struct resource *res)
> > -{
> > -   struct resource *p, **pp;
> > -   struct resource **firstpp = NULL;
> > -
> > -   for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
> > -   if (p->end < res->start)
> > -   continue;
> > -   if (res->end < p->start)
> > -   break;
> > -   if (p->start < res->start || p->end > res->end)
> > -   return -1;  /* not completely contained */
> > -   if (firstpp == NULL)
> > -   firstpp = pp;
> > -   }
> > -   if (firstpp == NULL)
> > -

Re: [PATCH v6 2/4] resource: Use list_head to link sibling resource

2018-07-08 Thread Baoquan He

On 07/08/18 at 08:48pm, Andy Shevchenko wrote:
> On Sun, Jul 8, 2018 at 5:59 AM, Baoquan He  wrote:
> > On 07/05/18 at 01:00am, kbuild test robot wrote:
> 
> > However, I didn't find the branch below. I also tried to open it in a web
> > browser, which failed as well.
> 
> While this is kinda valid point...
> 
> > Could you help take a look at this?
> 
> ...isn't obvious that you didn't change the file mentioned in a report?
> Just take latest linux-next and you will see.

Yes, that's clear to me. I just wanted to use that way to cross compile
them on ia64 and mips, hoping to find all the places I missed on these
ARCHes. Now I think I can apply the patches on linux-next and use the
attached config to compile. Thanks.

> 
> 
> >> All error/warnings (new ones prefixed by >>):
> >>
> >> >> arch/mips/pci/pci-rc32434.c:57:11: error: initialization from 
> >> >> incompatible pointer type [-Werror=incompatible-pointer-types]
> >>  .child = &rc32434_res_pci_mem2
> >>   ^
> >>arch/mips/pci/pci-rc32434.c:57:11: note: (near initialization for 
> >> 'rc32434_res_pci_mem1.child.next')
> >> >> arch/mips/pci/pci-rc32434.c:51:47: warning: missing braces around 
> >> >> initializer [-Wmissing-braces]
> >> static struct resource rc32434_res_pci_mem1 = {
> >>   ^
> >>arch/mips/pci/pci-rc32434.c:60:47: warning: missing braces around 
> >> initializer [-Wmissing-braces]
> >> static struct resource rc32434_res_pci_mem2 = {
> >>   ^
> >>cc1: some warnings being treated as errors
> >>
> >> vim +57 arch/mips/pci/pci-rc32434.c
> >>
> >> 73b4390f Ralf Baechle 2008-07-16  50
> >> 73b4390f Ralf Baechle 2008-07-16 @51  static struct resource 
> >> rc32434_res_pci_mem1 = {
> >> 73b4390f Ralf Baechle 2008-07-16  52  .name = "PCI MEM1",
> >> 73b4390f Ralf Baechle 2008-07-16  53  .start = 0x5000,
> >> 73b4390f Ralf Baechle 2008-07-16  54  .end = 0x5FFF,
> >> 73b4390f Ralf Baechle 2008-07-16  55  .flags = IORESOURCE_MEM,
> >> 73b4390f Ralf Baechle 2008-07-16  56  .sibling = NULL,
> >> 73b4390f Ralf Baechle 2008-07-16 @57  .child = &rc32434_res_pci_mem2
> >> 73b4390f Ralf Baechle 2008-07-16  58  };
> >> 73b4390f Ralf Baechle 2008-07-16  59
> >>
> >> :: The code at line 57 was first introduced by commit
> >> :: 73b4390fb23456964201abda79f1210fe337d01a [MIPS] Routerboard 532: 
> >> Support for base system
> >>
> >> :: TO: Ralf Baechle 
> >> :: CC: Ralf Baechle 
> >>
> >> ---
> >> 0-DAY kernel test infrastructureOpen Source Technology 
> >> Center
> >> https://lists.01.org/pipermail/kbuild-all   Intel 
> >> Corporation
> >
> >
> 
> 
> 
> -- 
> With Best Regards,
> Andy Shevchenko

Re: [kbuild-all] [PATCH v6 2/4] resource: Use list_head to link sibling resource

2018-07-09 Thread Baoquan He

On 07/10/18 at 08:59am, Ye Xiaolong wrote:
> Hi,
> 
> On 07/08, Baoquan He wrote:
> >Hi,
> >
> >On 07/05/18 at 01:00am, kbuild test robot wrote:
> >> Hi Baoquan,
> >> 
> >> I love your patch! Yet something to improve:
> >> 
> >> [auto build test ERROR on linus/master]
> >> [also build test ERROR on v4.18-rc3 next-20180704]
> >> [if your patch is applied to the wrong git tree, please drop us a note to 
> >> help improve the system]
> >
> >Thanks for letting me know.
> >
> >I cloned 0day-ci/linux to my local PC.
> >https://github.com/0day-ci/linux.git
> >
> >However, I didn't find the branch below. I also tried to open it in a web
> >browser, which failed as well.
> >
> 
> Sorry for the inconvenience, 0day bot didn't push the branch to github 
> successfully,
> Just push it manually, you can have a try again.

Thanks, Xiaolong. I have applied them on top of linux-next/master, copied
the attached config file, and run the command to reproduce as suggested.
I have now fixed all the reported issues and will repost.
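
For context, the reported error arises because '.child' stops being a pointer once siblings are linked through list_head. A static parent/child pair can then be initialized along the following lines. This is a userspace sketch with a locally defined list_head, hypothetical demo_* names and made-up address ranges; it is not the fix that was posted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal local copy of the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Cut-down struct resource with list_head based child/sibling links. */
struct resource {
	const char *name;
	unsigned long start, end;
	struct resource *parent;
	struct list_head child;		/* head of the list of children */
	struct list_head sibling;	/* node in the parent's child list */
};

static struct resource demo_mem2;	/* tentative definition, see below */

static struct resource demo_mem1 = {
	.name = "DEMO MEM1",
	.start = 0x1000,
	.end = 0x1fff,
	/* the child list holds exactly one entry: demo_mem2 */
	.child = { &demo_mem2.sibling, &demo_mem2.sibling },
	.sibling = LIST_HEAD_INIT(demo_mem1.sibling),
};

static struct resource demo_mem2 = {
	.name = "DEMO MEM2",
	.start = 0x1000,
	.end = 0x17ff,
	.parent = &demo_mem1,
	.child = LIST_HEAD_INIT(demo_mem2.child),
	/* both neighbours are the parent's list head */
	.sibling = { &demo_mem1.child, &demo_mem1.child },
};

/* Return the first child of 'res', or NULL if its child list is empty. */
static struct resource *first_child(struct resource *res)
{
	assert(res != NULL);
	if (res->child.next == &res->child)
		return NULL;
	return (struct resource *)((char *)res->child.next -
				   offsetof(struct resource, sibling));
}
```

An empty list is a head that points to itself, so an initializer such as '.sibling = NULL' no longer has a meaningful equivalent and has to become a self-referencing LIST_HEAD_INIT().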

> 
> >
> >> url:
> >> https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180704-121402
> >> config: mips-rb532_defconfig (attached as .config)
> >> compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> >> reproduce:
> >> wget 
> >> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross 
> >> -O ~/bin/make.cross
> >> chmod +x ~/bin/make.cross
> >> # save the attached .config to linux build tree
> >> GCC_VERSION=7.2.0 make.cross ARCH=mips 
> >
> >I did find an old one, which is for the earlier version 5 post.
> >
> >[bhe@linux]$ git remote -v
> >0day-ci  https://github.com/0day-ci/linux.git (fetch)
> >0day-ci  https://github.com/0day-ci/linux.git (push)
> >[bhe@dhcp-128-28 linux]$ git branch -a| grep Baoquan| grep resource
> >  
> > remotes/0day-ci/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180612-113600
> >
> >Could you help take a look at this?
> >
> >Thanks
> >Baoquan
> >
> >> 
> >> All error/warnings (new ones prefixed by >>):
> >> 
> >> >> arch/mips/pci/pci-rc32434.c:57:11: error: initialization from 
> >> >> incompatible pointer type [-Werror=incompatible-pointer-types]
> >>  .child = &rc32434_res_pci_mem2
> >>   ^
> >>arch/mips/pci/pci-rc32434.c:57:11: note: (near initialization for 
> >> 'rc32434_res_pci_mem1.child.next')
> >> >> arch/mips/pci/pci-rc32434.c:51:47: warning: missing braces around 
> >> >> initializer [-Wmissing-braces]
> >> static struct resource rc32434_res_pci_mem1 = {
> >>   ^
> >>arch/mips/pci/pci-rc32434.c:60:47: warning: missing braces around 
> >> initializer [-Wmissing-braces]
> >> static struct resource rc32434_res_pci_mem2 = {
> >>   ^
> >>cc1: some warnings being treated as errors
> >> 
> >> vim +57 arch/mips/pci/pci-rc32434.c
> >> 
> >> 73b4390f Ralf Baechle 2008-07-16  50  
> >> 73b4390f Ralf Baechle 2008-07-16 @51  static struct resource 
> >> rc32434_res_pci_mem1 = {
> >> 73b4390f Ralf Baechle 2008-07-16  52   .name = "PCI MEM1",
> >> 73b4390f Ralf Baechle 2008-07-16  53   .start = 0x5000,
> >> 73b4390f Ralf Baechle 2008-07-16  54   .end = 0x5FFF,
> >> 73b4390f Ralf Baechle 2008-07-16  55   .flags = IORESOURCE_MEM,
> >> 73b4390f Ralf Baechle 2008-07-16  56   .sibling = NULL,
> >> 73b4390f Ralf Baechle 2008-07-16 @57   .child = &rc32434_res_pci_mem2
> >> 73b4390f Ralf Baechle 2008-07-16  58  };
> >> 73b4390f Ralf Baechle 2008-07-16  59  
> >> 
> >> :: The code at line 57 was first introduced by commit
> >> :: 73b4390fb23456964201abda79f1210fe337d01a [MIPS] Routerboard 532: 
> >> Support for base system
> >> 
> >> :: TO: Ralf Baechle 
> >> :: CC: Ralf Baechle 
> >> 
> >> ---
> >> 0-DAY kernel test infrastructureOpen Source Technology 
> >> Center
> >> https://lists.01.org/pipermail/kbuild-all   Intel 
> >> Corporation
> >
> >
> >___
> >kbuild-all mailing list
> >kbuild-...@lists.01.org
> >https://lists.01.org/mailman/listinfo/kbuild-all

Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-11 Thread Baoquan He

Hi Michael,

On 07/11/18 at 10:49pm, Michael Ellerman wrote:
> a...@linux-foundation.org writes:
> > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> >
> >http://www.ozlabs.org/~akpm/mmotm/
> ...
> 
> > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > * 
> > mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > * 
> > mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> 
> > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> 
> This seems to be breaking my powerpc pseries qemu boots.
> 
> The boot log with some extra debug shows eg:
> 
>   $ make pseries_le_defconfig
>   $ qemu-system-ppc64 -nographic -vga none -M pseries -m 2G -kernel vmlinux 
>   vmemmap_populate f000..f0024000, node 0
> * f000..f100 allocated at c0007600
>   hash__vmemmap_create_mapping: start 0xf000 size 0x100 phys 
> 0x7600
>   hash__vmemmap_create_mapping: failed -1
> 
>   
> 
> Then there's lots of other warnings about bad page states and eventually
> a NULL deref and we panic().
> 
> 
> The problem seems to be that we're calling down into
> hash__vmemmap_create_mapping() for every call to vmemmap_populate(),
> whereas previously we would only call hash__vmemmap_create_mapping()
> once because our vmemmap_populated() would return true.
> 
> There's actually a comment in sparse_init() that says:
> 
>* powerpc need to call sparse_init_one_section right after each
>* sparse_early_mem_map_alloc, so allocate usemap_map at first.
> 
> So changing that behaviour does seem to be the problem.
> 
> I assume that comment is talking about the fact that we use pfn_valid()
> in vmemmap_populated().
> 
> I'm not clear on how to fix it though.

Have you tried reverting that patch and building the kernel to test again?
Does it work?

Re: [PATCH v6 2/4] resource: Use list_head to link sibling resource

2018-07-07 Thread Baoquan He

Hi,

On 07/05/18 at 01:00am, kbuild test robot wrote:
> Hi Baoquan,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18-rc3 next-20180704]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]

Thanks for letting me know.

I cloned 0day-ci/linux to my local PC.
https://github.com/0day-ci/linux.git

However, I didn't find the branch below. I also tried to open it in a web
browser, which failed as well.


> url:
> https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180704-121402
> config: mips-rb532_defconfig (attached as .config)
> compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=mips 

I did find an old one, which is for the earlier version 5 post.

[bhe@linux]$ git remote -v
0day-ci https://github.com/0day-ci/linux.git (fetch)
0day-ci https://github.com/0day-ci/linux.git (push)
[bhe@dhcp-128-28 linux]$ git branch -a| grep Baoquan| grep resource
  
remotes/0day-ci/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180612-113600

Could you help take a look at this?

Thanks
Baoquan

> 
> All error/warnings (new ones prefixed by >>):
> 
> >> arch/mips/pci/pci-rc32434.c:57:11: error: initialization from incompatible 
> >> pointer type [-Werror=incompatible-pointer-types]
>  .child = &rc32434_res_pci_mem2
>   ^
>arch/mips/pci/pci-rc32434.c:57:11: note: (near initialization for 
> 'rc32434_res_pci_mem1.child.next')
> >> arch/mips/pci/pci-rc32434.c:51:47: warning: missing braces around 
> >> initializer [-Wmissing-braces]
> static struct resource rc32434_res_pci_mem1 = {
>   ^
>arch/mips/pci/pci-rc32434.c:60:47: warning: missing braces around 
> initializer [-Wmissing-braces]
> static struct resource rc32434_res_pci_mem2 = {
>   ^
>cc1: some warnings being treated as errors
> 
> vim +57 arch/mips/pci/pci-rc32434.c
> 
> 73b4390f Ralf Baechle 2008-07-16  50  
> 73b4390f Ralf Baechle 2008-07-16 @51  static struct resource 
> rc32434_res_pci_mem1 = {
> 73b4390f Ralf Baechle 2008-07-16  52  .name = "PCI MEM1",
> 73b4390f Ralf Baechle 2008-07-16  53  .start = 0x5000,
> 73b4390f Ralf Baechle 2008-07-16  54  .end = 0x5FFF,
> 73b4390f Ralf Baechle 2008-07-16  55  .flags = IORESOURCE_MEM,
> 73b4390f Ralf Baechle 2008-07-16  56  .sibling = NULL,
> 73b4390f Ralf Baechle 2008-07-16 @57  .child = &rc32434_res_pci_mem2
> 73b4390f Ralf Baechle 2008-07-16  58  };
> 73b4390f Ralf Baechle 2008-07-16  59  
> 
> :: The code at line 57 was first introduced by commit
> :: 73b4390fb23456964201abda79f1210fe337d01a [MIPS] Routerboard 532: 
> Support for base system
> 
> :: TO: Ralf Baechle 
> :: CC: Ralf Baechle 
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation

Re: Is it worth to fix the crashkernel reserved memory blocks the hotplug issue?

2018-12-10 Thread Baoquan He

On 12/10/18 at 12:08pm, Pingfan Liu wrote:
> Hi,
> I found in powerpc code, it is doable to reserve memory region in
> movable zone, such as crashkernel does. But in x86 code, it checks the
> hotpluggable attribute of memory, hence if manually specifying a
> region in hotpluggable region, it will fail.

Yes, it is a problem. On x86, in the crashkernel=xx@yy case, which
specifies the base address of the crashkernel reservation, the code first
searches for the region. If the base of the region found is not equal to
the specified base 'yy', the region is considered occupied and the
reservation fails. The memblock search only iterates over
non-hotpluggable areas, which prevents reserving crashkernel memory in a
hotpluggable region.

In my opinion, it's worth fixing.
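
The behaviour can be illustrated with a small userspace model. This is a sketch only: mem_region, find_in_range() and reserve_crashkernel_at() are hypothetical stand-ins for the memblock search and the x86 reservation path, every region is treated as free, and hotpluggable regions are always skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry of a simplified memory map (hypothetical model type). */
struct mem_region {
	unsigned long long base, size;
	bool hotpluggable;
};

/* Round x up to a multiple of align, which must be a power of two. */
static unsigned long long align_up(unsigned long long x,
				   unsigned long long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Lowest aligned address in [start, end) where 'size' bytes fit inside
 * a single non-hotpluggable region, or 0 if there is none.
 */
static unsigned long long find_in_range(const struct mem_region *r, size_t n,
					unsigned long long start,
					unsigned long long end,
					unsigned long long size,
					unsigned long long align)
{
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long long lo, hi;

		if (r[i].hotpluggable)
			continue;	/* never place the reservation here */
		lo = r[i].base > start ? r[i].base : start;
		hi = r[i].base + r[i].size < end ? r[i].base + r[i].size : end;
		lo = align_up(lo, align);
		if (lo < hi && hi - lo >= size)
			return lo;
	}
	return 0;
}

/* crashkernel=size@base: succeed only if the search returns 'base'. */
static int reserve_crashkernel_at(const struct mem_region *r, size_t n,
				  unsigned long long base,
				  unsigned long long size)
{
	unsigned long long start;

	start = find_in_range(r, n, base, base + size, size, 1ULL << 20);
	return start == base ? 0 : -1;
}
```

With a map whose upper half is hotpluggable, a request whose base lies in that half fails the 'start != base' check even though the memory is present, which is the case reported above.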

Thanks
Baoquan


> The x86 code:
> /* 0 means: find the address automatically */
> if (crash_base <= 0) {
> /*
> * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
> * as old kexec-tools loads bzImage below that, unless
> * "crashkernel=size[KMG],high" is specified.
> */
> crash_base = memblock_find_in_range(CRASH_ALIGN,
>high ? CRASH_ADDR_HIGH_MAX
> : CRASH_ADDR_LOW_MAX,
>crash_size, CRASH_ALIGN);
> if (!crash_base) {
> pr_info("crashkernel reservation failed - No suitable area found.\n");
> return;
> }
> 
> } else {
> unsigned long long start;
> 
> start = memblock_find_in_range(crash_base,  --> this func will check
> the hotpluggable attribute of memory and return failure if the
> specifying region intersects with it.
>   crash_base + crash_size,
>   crash_size, 1 << 20);
> if (start != crash_base) {
> pr_info("crashkernel reservation failed - memory is in use.\n");
> return;
> }
> }
> 
> Thanks,
> Pingfan

Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-26 Thread Baoquan He

On 02/21/20 at 02:26pm, Alastair D'Silva wrote:
> From: Alastair D'Silva 
> 
> Function declarations don't need externs, remove the existing ones
> so they are consistent with newer code
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  arch/powerpc/include/asm/pnv-ocxl.h | 32 ++---
>  include/misc/ocxl.h |  6 +++---
>  2 files changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> b/arch/powerpc/include/asm/pnv-ocxl.h
> index 0b2a6707e555..b23c99bc0c84 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -9,29 +9,27 @@
>  #define PNV_OCXL_TL_BITS_PER_RATE   4
>  #define PNV_OCXL_TL_RATE_BUF_SIZE   ((PNV_OCXL_TL_MAX_TEMPLATE+1) * 
> PNV_OCXL_TL_BITS_PER_RATE / 8)
>  
> -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled,
> - u16 *supported);

It works with or without extern when declaring functions. Searching for
'extern' under include/ finds many functions declared with 'extern'. Do
we have an explicit standard on whether to add or remove 'extern' in
function declarations?

I have no objection to this patch, I just want to make this clear so that
I can handle it without confusion.
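
To make the point about equivalence concrete: in C, a function declaration at file scope has external linkage by default, so the two prototypes below declare the same thing and can coexist. demo_get_actag() is a hypothetical name used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* With 'extern': explicit, but redundant for a function prototype. */
extern int demo_get_actag(unsigned short *base);

/* Without 'extern': same linkage, compatible with the line above. */
int demo_get_actag(unsigned short *base);

/* A single definition satisfies both declarations. */
int demo_get_actag(unsigned short *base)
{
	assert(base != NULL);
	*base = 42;
	return 0;
}
```

The keyword only changes meaning for objects, where it separates a declaration from a tentative definition; for functions it is purely a style choice.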

Thanks
Baoquan

Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-26 Thread 'Baoquan He'

On 02/26/20 at 03:20pm, Greg Kurz wrote:
> On Wed, 26 Feb 2020 22:15:23 +0800
> 'Baoquan He'  wrote:
> 
> > On 02/26/20 at 10:01am, Greg Kurz wrote:
> > > On Wed, 26 Feb 2020 19:26:34 +1100
> > > "Alastair D'Silva"  wrote:
> > > 
> > > > > -Original Message-
> > > > > From: Baoquan He 
> > > > > Sent: Wednesday, 26 February 2020 7:15 PM
> > > > > To: Alastair D'Silva 
> > > > > Cc: alast...@d-silva.org; Aneesh Kumar K . V
> > > > > ; Oliver O'Halloran ;
> > > > > Benjamin Herrenschmidt ; Paul Mackerras
> > > > > ; Michael Ellerman ; Frederic
> > > > > Barrat ; Andrew Donnellan ;
> > > > > Arnd Bergmann ; Greg Kroah-Hartman
> > > > > ; Dan Williams ;
> > > > > Vishal Verma ; Dave Jiang
> > > > > ; Ira Weiny ; Andrew Morton
> > > > > ; Mauro Carvalho Chehab
> > > > > ; David S. Miller ;
> > > > > Rob Herring ; Anton Blanchard ;
> > > > > Krzysztof Kozlowski ; Mahesh Salgaonkar
> > > > > ; Madhavan Srinivasan
> > > > > ; Cédric Le Goater ; Anju T
> > > > > Sudhakar ; Hari Bathini
> > > > > ; Thomas Gleixner ; Greg
> > > > > Kurz ; Nicholas Piggin ; Masahiro
> > > > > Yamada ; Alexey Kardashevskiy
> > > > > ; linux-ker...@vger.kernel.org; linuxppc-
> > > > > d...@lists.ozlabs.org; linux-nvd...@lists.01.org; linux...@kvack.org
> > > > > Subject: Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs
> > > > > 
> > > > > On 02/21/20 at 02:26pm, Alastair D'Silva wrote:
> > > > > > From: Alastair D'Silva 
> > > > > >
> > > > > > Function declarations don't need externs, remove the existing ones 
> > > > > > so
> > > > > > they are consistent with newer code
> > > > > >
> > > > > > Signed-off-by: Alastair D'Silva 
> > > > > > ---
> > > > > >  arch/powerpc/include/asm/pnv-ocxl.h | 32 
> > > > > > ++---
> > > > > >  include/misc/ocxl.h |  6 +++---
> > > > > >  2 files changed, 18 insertions(+), 20 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > > b/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > > index 0b2a6707e555..b23c99bc0c84 100644
> > > > > > --- a/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > > +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> > > > > > @@ -9,29 +9,27 @@
> > > > > >  #define PNV_OCXL_TL_BITS_PER_RATE   4
> > > > > >  #define PNV_OCXL_TL_RATE_BUF_SIZE
> > > > > ((PNV_OCXL_TL_MAX_TEMPLATE+1) * PNV_OCXL_TL_BITS_PER_RATE / 8)
> > > > > >
> > > > > > -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16
> > > > > *enabled,
> > > > > > -   u16 *supported);
> > > > > 
> > > > > It works w or w/o extern when declare functions. Searching 'extern'
> > > > > under include can find so many functions with 'extern' adding. Do we 
> > > > > have
> > > > a
> > > > > explicit standard if we should add or remove 'exter' in function
> > > > declaration?
> > > > > 
> > > > > I have no objection to this patch, just want to make clear so that I 
> > > > > can
> > > > handle
> > > > > it w/o confusion.
> > > > > 
> > > > > Thanks
> > > > > Baoquan
> > > > > 
> > > > 
> > > > For the OpenCAPI driver, we have settled on not having 'extern' on
> > > > functions.
> > > > 
> > > > I don't think I've seen a standard that supports or refutes this, but it
> > > > does not value add.
> > > > 
> > > 
> > > FWIW this is a warning condition for checkpatch:
> > > 
> > > $ ./scripts/checkpatch.pl --strict -f include/misc/ocxl.h
> > 
> > Good to know, thanks.
> > 
> > I didn't know checkpatch.pl can run on header file directly. Tried to
> > check patch with '--strict -f', the below info doesn't appear. But it
> 
> Hmm... -f is to check a source file, not a patch... What did you try
> exactly ?

OK, that's it. I can see the 'CHECK' line when running checkpatch.pl on
the patch with '--strict' only. I think this is a good reason not to add
extern when adding function declarations to header files.
Thanks.

> 
> > does give out below information when run on header file.
> > 
> > > 
> > > [...]
> > > 
> > > CHECK: extern prototypes should be avoided in .h files
> > > #176: FILE: include/misc/ocxl.h:176:
> > > +extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
> > > 
> > > [...]
> > > 
> > 
> 
>

Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs

2020-02-26 Thread 'Baoquan He'

On 02/26/20 at 10:01am, Greg Kurz wrote:
> On Wed, 26 Feb 2020 19:26:34 +1100
> "Alastair D'Silva"  wrote:
> 
> > > -Original Message-
> > > From: Baoquan He 
> > > Sent: Wednesday, 26 February 2020 7:15 PM
> > > To: Alastair D'Silva 
> > > Cc: alast...@d-silva.org; Aneesh Kumar K . V
> > > ; Oliver O'Halloran ;
> > > Benjamin Herrenschmidt ; Paul Mackerras
> > > ; Michael Ellerman ; Frederic
> > > Barrat ; Andrew Donnellan ;
> > > Arnd Bergmann ; Greg Kroah-Hartman
> > > ; Dan Williams ;
> > > Vishal Verma ; Dave Jiang
> > > ; Ira Weiny ; Andrew Morton
> > > ; Mauro Carvalho Chehab
> > > ; David S. Miller ;
> > > Rob Herring ; Anton Blanchard ;
> > > Krzysztof Kozlowski ; Mahesh Salgaonkar
> > > ; Madhavan Srinivasan
> > > ; Cédric Le Goater ; Anju T
> > > Sudhakar ; Hari Bathini
> > > ; Thomas Gleixner ; Greg
> > > Kurz ; Nicholas Piggin ; Masahiro
> > > Yamada ; Alexey Kardashevskiy
> > > ; linux-ker...@vger.kernel.org; linuxppc-
> > > d...@lists.ozlabs.org; linux-nvd...@lists.01.org; linux...@kvack.org
> > > Subject: Re: [PATCH v3 04/27] ocxl: Remove unnecessary externs
> > > 
> > > On 02/21/20 at 02:26pm, Alastair D'Silva wrote:
> > > > From: Alastair D'Silva 
> > > >
> > > > Function declarations don't need externs, remove the existing ones so
> > > > they are consistent with newer code
> > > >
> > > > Signed-off-by: Alastair D'Silva 
> > > > ---
> > > >  arch/powerpc/include/asm/pnv-ocxl.h | 32 ++---
> > > >  include/misc/ocxl.h |  6 +++---
> > > >  2 files changed, 18 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/include/asm/pnv-ocxl.h
> > > > b/arch/powerpc/include/asm/pnv-ocxl.h
> > > > index 0b2a6707e555..b23c99bc0c84 100644
> > > > --- a/arch/powerpc/include/asm/pnv-ocxl.h
> > > > +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> > > > @@ -9,29 +9,27 @@
> > > >  #define PNV_OCXL_TL_BITS_PER_RATE   4
> > > >  #define PNV_OCXL_TL_RATE_BUF_SIZE
> > > ((PNV_OCXL_TL_MAX_TEMPLATE+1) * PNV_OCXL_TL_BITS_PER_RATE / 8)
> > > >
> > > > -extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16
> > > *enabled,
> > > > -   u16 *supported);
> > > 
> > > It works w or w/o extern when declare functions. Searching 'extern'
> > > under include can find so many functions with 'extern' adding. Do we have
> > a
> > > explicit standard if we should add or remove 'exter' in function
> > declaration?
> > > 
> > > I have no objection to this patch, just want to make clear so that I can
> > handle
> > > it w/o confusion.
> > > 
> > > Thanks
> > > Baoquan
> > > 
> > 
> > For the OpenCAPI driver, we have settled on not having 'extern' on
> > functions.
> > 
> > I don't think I've seen a standard that supports or refutes this, but it
> > does not value add.
> > 
> 
> FWIW this is a warning condition for checkpatch:
> 
> $ ./scripts/checkpatch.pl --strict -f include/misc/ocxl.h

Good to know, thanks.

I didn't know checkpatch.pl could run on a header file directly. I tried
to check the patch with '--strict -f', and the info below doesn't appear.
But it does give the information below when run on the header file.

> 
> [...]
> 
> CHECK: extern prototypes should be avoided in .h files
> #176: FILE: include/misc/ocxl.h:176:
> +extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id);
> 
> [...]
>

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-04 Thread Baoquan He

On 10/06/19 at 10:56am, David Hildenbrand wrote:
> If we have holes, the holes will automatically get detected and removed
> once we remove the next bigger/smaller section. The extra checks can
> go.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 34 +++---
>  1 file changed, 7 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f294918f7211..8dafa1ba8d9f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned 
> long start_pfn,
>   if (pfn) {
>   zone->zone_start_pfn = pfn;
>   zone->spanned_pages = zone_end_pfn - pfn;
> + } else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;
>   }
>   } else if (zone_end_pfn == end_pfn) {
>   /*
> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> unsigned long start_pfn,
>  start_pfn);
>   if (pfn)
>   zone->spanned_pages = pfn - zone_start_pfn + 1;
> + else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;

I'm wondering in which case zone_start_pfn != start_pfn holds and we
still end up here.

> + }
>   }
> -
> - /*
> -  * The section is not biggest or smallest mem_section in the zone, it
> -  * only creates a hole in the zone. So in this case, we need not
> -  * change the zone. But perhaps, the zone has only hole data. Thus
> -  * it check the zone has only hole or not.
> -  */
> - pfn = zone_start_pfn;
> - for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
> - if (unlikely(!pfn_to_online_page(pfn)))
> - continue;
> -
> - if (page_zone(pfn_to_page(pfn)) != zone)
> - continue;
> -
> - /* Skip range to be removed */
> - if (pfn >= start_pfn && pfn < end_pfn)
> - continue;
> -
> - /* If we find valid section, we have nothing to do */
> - zone_span_writeunlock(zone);
> - return;
> - }
> -
> - /* The zone has no valid section */
> - zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
>   zone_span_writeunlock(zone);
>  }
>  
> -- 
> 2.21.0
> 
>

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

On 02/06/20 at 07:26am, Wei Yang wrote:
> On Thu, Feb 06, 2020 at 07:08:26AM +0800, Baoquan He wrote:
> >On 02/06/20 at 06:56am, Wei Yang wrote:
> >> On Wed, Feb 05, 2020 at 10:48:11PM +0800, Baoquan He wrote:
> >> >Hi Wei Yang,
> >> >
> >> >On 02/05/20 at 05:59pm, Wei Yang wrote:
> >> >> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >> >> >index f294918f7211..8dafa1ba8d9f 100644
> >> >> >--- a/mm/memory_hotplug.c
> >> >> >+++ b/mm/memory_hotplug.c
> >> >> >@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, 
> >> >> >unsigned long start_pfn,
> >> >> >   if (pfn) {
> >> >> >   zone->zone_start_pfn = pfn;
> >> >> >   zone->spanned_pages = zone_end_pfn - pfn;
> >> >> >+  } else {
> >> >> >+  zone->zone_start_pfn = 0;
> >> >> >+  zone->spanned_pages = 0;
> >> >> >   }
> >> >> >   } else if (zone_end_pfn == end_pfn) {
> >> >> >   /*
> >> >> >@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> >> >> >unsigned long start_pfn,
> >> >> >  start_pfn);
> >> >> >   if (pfn)
> >> >> >   zone->spanned_pages = pfn - zone_start_pfn + 1;
> >> >> >+  else {
> >> >> >+  zone->zone_start_pfn = 0;
> >> >> >+  zone->spanned_pages = 0;
> >> >> >+  }
> >> >> >   }
> >> >> 
> >> >> If it is me, I would like to take out these two similar logic out.
> >> >
> >> >I also like this style. 
> >> >> 
> >> >> For example:
> >> >> 
> >> >> if () {
> >> >> } else if () {
> >> >> } else {
> >> >> goto out;
> >> >Here the last else is unnecessary, right?
> >> >
> >> 
> >> I am afraid not.
> >> 
> >> If the range is not the first or last, we would leave pfn not initialized.
> >
> >Ah, you are right. I forgot that one. Then pfn can be assigned the
> >zone_start_pfn as the old code. Then the following logic is the same
> >as the original code, find_smallest_section_pfn()/find_biggest_section_pfn() 
> >have done the iteration the old for loop was doing.
> >
> > unsigned long pfn = zone_start_pfn; 
> > if () {
> > } else if () {
> > } 
> >
> > /* The zone has no valid section */
> > if (!pfn) {
> > zone->zone_start_pfn = 0;
> > zone->spanned_pages = 0;
> > }
> 
> This one look better :-)

Thanks for confirming. I will prepare a patch along these lines and post it.

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

On 02/06/20 at 06:56am, Wei Yang wrote:
> On Wed, Feb 05, 2020 at 10:48:11PM +0800, Baoquan He wrote:
> >Hi Wei Yang,
> >
> >On 02/05/20 at 05:59pm, Wei Yang wrote:
> >> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >> >index f294918f7211..8dafa1ba8d9f 100644
> >> >--- a/mm/memory_hotplug.c
> >> >+++ b/mm/memory_hotplug.c
> >> >@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, 
> >> >unsigned long start_pfn,
> >> >  if (pfn) {
> >> >  zone->zone_start_pfn = pfn;
> >> >  zone->spanned_pages = zone_end_pfn - pfn;
> >> >+ } else {
> >> >+ zone->zone_start_pfn = 0;
> >> >+ zone->spanned_pages = 0;
> >> >  }
> >> >  } else if (zone_end_pfn == end_pfn) {
> >> >  /*
> >> >@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> >> >unsigned long start_pfn,
> >> > start_pfn);
> >> >  if (pfn)
> >> >  zone->spanned_pages = pfn - zone_start_pfn + 1;
> >> >+ else {
> >> >+ zone->zone_start_pfn = 0;
> >> >+ zone->spanned_pages = 0;
> >> >+ }
> >> >  }
> >> 
> >> If it is me, I would like to take out these two similar logic out.
> >
> >I also like this style. 
> >> 
> >> For example:
> >> 
> >>if () {
> >>} else if () {
> >>} else {
> >>goto out;
> >Here the last else is unnecessary, right?
> >
> 
> I am afraid not.
> 
> If the range is not the first or last, we would leave pfn not initialized.

Ah, you are right, I forgot that one. Then pfn can be initialized to
zone_start_pfn, as the old code did. The logic that follows is then the
same as in the original code, since find_smallest_section_pfn() and
find_biggest_section_pfn() already do the iteration that the old for loop
was doing.

unsigned long pfn = zone_start_pfn; 
if () {
} else if () {
} 

/* The zone has no valid section */
if (!pfn) {
zone->zone_start_pfn = 0;
zone->spanned_pages = 0;
}
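
To make the outline above concrete, here is a minimal user-space sketch of
the flow. It is not the kernel code: struct zone_model, find_smallest() and
find_biggest() are hypothetical stand-ins for struct zone and the
find_*_section_pfn() helpers, locking is left out, and it assumes that a
zone never starts at pfn 0 (otherwise the non-zero default for pfn would
not hold and the final check would wrongly reset the zone).

```c
#include <stdbool.h>

/* Simplified stand-in for struct zone; only the two span fields matter. */
struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/*
 * Hypothetical stand-ins for find_smallest_section_pfn() and
 * find_biggest_section_pfn(): scan a per-pfn "online" map over
 * [start, end) and return 0 when nothing is found.
 */
static unsigned long find_smallest(const bool *online, unsigned long start,
				   unsigned long end)
{
	for (unsigned long pfn = start; pfn < end; pfn++)
		if (online[pfn])
			return pfn;
	return 0;
}

static unsigned long find_biggest(const bool *online, unsigned long start,
				  unsigned long end)
{
	for (unsigned long pfn = end; pfn > start; pfn--)
		if (online[pfn - 1])
			return pfn - 1;
	return 0;
}

/* Shrink the zone span after [start_pfn, end_pfn) has gone offline. */
static void shrink_zone_span_model(struct zone_model *zone, const bool *online,
				   unsigned long start_pfn,
				   unsigned long end_pfn)
{
	unsigned long zone_start_pfn = zone->zone_start_pfn;
	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
	/* Assumed non-zero: a range in the middle of the zone changes nothing. */
	unsigned long pfn = zone_start_pfn;

	if (zone_start_pfn == start_pfn) {
		/* Removed range was the head of the zone: find the new start. */
		pfn = find_smallest(online, end_pfn, zone_end_pfn);
		if (pfn) {
			zone->zone_start_pfn = pfn;
			zone->spanned_pages = zone_end_pfn - pfn;
		}
	} else if (zone_end_pfn == end_pfn) {
		/* Removed range was the tail of the zone: find the new end. */
		pfn = find_biggest(online, zone_start_pfn, start_pfn);
		if (pfn)
			zone->spanned_pages = pfn - zone_start_pfn + 1;
	}

	/* The zone has no valid section */
	if (!pfn) {
		zone->zone_start_pfn = 0;
		zone->spanned_pages = 0;
	}
}
```

With this shape the two duplicated reset blocks collapse into the single
"if (!pfn)" at the end, which is the point of the proposal.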

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

On 02/04/20 at 03:42pm, David Hildenbrand wrote:
> On 04.02.20 15:25, Baoquan He wrote:
> > On 10/06/19 at 10:56am, David Hildenbrand wrote:
> >> If we have holes, the holes will automatically get detected and removed
> >> once we remove the next bigger/smaller section. The extra checks can
> >> go.
> >>
> >> Cc: Andrew Morton 
> >> Cc: Oscar Salvador 
> >> Cc: Michal Hocko 
> >> Cc: David Hildenbrand 
> >> Cc: Pavel Tatashin 
> >> Cc: Dan Williams 
> >> Cc: Wei Yang 
> >> Signed-off-by: David Hildenbrand 
> >> ---
> >>  mm/memory_hotplug.c | 34 +++---
> >>  1 file changed, 7 insertions(+), 27 deletions(-)
> >>
> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >> index f294918f7211..8dafa1ba8d9f 100644
> >> --- a/mm/memory_hotplug.c
> >> +++ b/mm/memory_hotplug.c
> >> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, 
> >> unsigned long start_pfn,
> >>if (pfn) {
> >>zone->zone_start_pfn = pfn;
> >>zone->spanned_pages = zone_end_pfn - pfn;
> >> +  } else {
> >> +  zone->zone_start_pfn = 0;
> >> +  zone->spanned_pages = 0;
> >>}
> >>} else if (zone_end_pfn == end_pfn) {
> >>/*
> >> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> >> unsigned long start_pfn,
> >>   start_pfn);
> >>if (pfn)
> >>zone->spanned_pages = pfn - zone_start_pfn + 1;
> >> +  else {
> >> +  zone->zone_start_pfn = 0;
> >> +  zone->spanned_pages = 0;
> > 
> > Thinking in which case (zone_start_pfn != start_pfn) and it comes here.
> 
> Could only happen in case the zone_start_pfn would have been "out of the
> zone already". If you ask me: unlikely :)

Yeah, I also think it's unlikely that we get here.

The 'if (zone_start_pfn == start_pfn)' check also covers the case
(zone_start_pfn == start_pfn && zone_end_pfn == end_pfn), so the
zone_start_pfn/spanned_pages resetting here can be removed to avoid
confusion.

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

On 02/05/20 at 03:16pm, David Hildenbrand wrote:
> >>>> Anyhow, that patch is already upstream and I don't consider this high
> >>>> priority. Thanks :)
> >>>
> >>> Yeah, noticed you told Wei the status in another patch thread, I am fine
> >>> with it, just leave it to you to decide. Thanks.
> >>
> >> I am fairly busy right now. Can you send a patch (double-checking and
> >> making this eventually unconditional?). Thanks!
> > 
> > Understood, sorry about the noise, David. I will think about this.
> > 
> 
> No need to excuse, really, I'm very happy about review feedback!
> 

Glad to hear it, thanks.

> The review of this series happened fairly late. Bad, because it's not
> perfect, but good, because no serious stuff was found (so far :) ). If
> you also don't have time to look into this, I can put it onto my todo
> list, just let me know.

Either is OK with me, as long as things are clear to us. I will discuss
it with Wei Yang for now. You can post a patch anytime if you make one.

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

On 02/05/20 at 02:20pm, David Hildenbrand wrote:
> On 05.02.20 13:43, Baoquan He wrote:
> > On 02/04/20 at 03:42pm, David Hildenbrand wrote:
> >> On 04.02.20 15:25, Baoquan He wrote:
> >>> On 10/06/19 at 10:56am, David Hildenbrand wrote:
> >>>> If we have holes, the holes will automatically get detected and removed
> >>>> once we remove the next bigger/smaller section. The extra checks can
> >>>> go.
> >>>>
> >>>> Cc: Andrew Morton 
> >>>> Cc: Oscar Salvador 
> >>>> Cc: Michal Hocko 
> >>>> Cc: David Hildenbrand 
> >>>> Cc: Pavel Tatashin 
> >>>> Cc: Dan Williams 
> >>>> Cc: Wei Yang 
> >>>> Signed-off-by: David Hildenbrand 
> >>>> ---
> >>>>  mm/memory_hotplug.c | 34 +++---
> >>>>  1 file changed, 7 insertions(+), 27 deletions(-)
> >>>>
> >>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>>> index f294918f7211..8dafa1ba8d9f 100644
> >>>> --- a/mm/memory_hotplug.c
> >>>> +++ b/mm/memory_hotplug.c
> >>>> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, 
> >>>> unsigned long start_pfn,
> >>>>  if (pfn) {
> >>>>  zone->zone_start_pfn = pfn;
> >>>>  zone->spanned_pages = zone_end_pfn - pfn;
> >>>> +} else {
> >>>> +zone->zone_start_pfn = 0;
> >>>> +zone->spanned_pages = 0;
> >>>>  }
> >>>>  } else if (zone_end_pfn == end_pfn) {
> >>>>  /*
> >>>> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> >>>> unsigned long start_pfn,
> >>>> start_pfn);
> >>>>  if (pfn)
> >>>>  zone->spanned_pages = pfn - zone_start_pfn + 1;
> >>>> +else {
> >>>> +zone->zone_start_pfn = 0;
> >>>> +zone->spanned_pages = 0;
> >>>
> >>> Thinking in which case (zone_start_pfn != start_pfn) and it comes here.
> >>
> >> Could only happen in case the zone_start_pfn would have been "out of the
> >> zone already". If you ask me: unlikely :)
> > 
> > Yeah, I also think it's unlikely to come here.
> > 
> > The 'if (zone_start_pfn == start_pfn)' checking also covers the case
> > (zone_start_pfn == start_pfn && zone_end_pfn == end_pfn). So this
> > zone_start_pfn/spanned_pages resetting can be removed to avoid
> > confusion.
> 
> At least I would find it more confusing without it (or want a comment
> explaining why this does not have to be handled and why the !pfn case is
> not possible).

I don't get why it would be more confusing without it, but it's OK since
it doesn't affect anything.

> 
> Anyhow, that patch is already upstream and I don't consider this high
> priority. Thanks :)

Yeah, I noticed you told Wei the status in another patch thread. I am fine
with it and will leave it to you to decide. Thanks.

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

Hi Wei Yang,

On 02/05/20 at 05:59pm, Wei Yang wrote:
> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >index f294918f7211..8dafa1ba8d9f 100644
> >--- a/mm/memory_hotplug.c
> >+++ b/mm/memory_hotplug.c
> >@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned 
> >long start_pfn,
> > if (pfn) {
> > zone->zone_start_pfn = pfn;
> > zone->spanned_pages = zone_end_pfn - pfn;
> >+} else {
> >+zone->zone_start_pfn = 0;
> >+zone->spanned_pages = 0;
> > }
> > } else if (zone_end_pfn == end_pfn) {
> > /*
> >@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> >unsigned long start_pfn,
> >start_pfn);
> > if (pfn)
> > zone->spanned_pages = pfn - zone_start_pfn + 1;
> >+else {
> >+zone->zone_start_pfn = 0;
> >+zone->spanned_pages = 0;
> >+}
> > }
> 
> If it is me, I would like to take out these two similar logic out.

I also like this style. 
> 
> For example:
> 
>   if () {
>   } else if () {
>   } else {
>   goto out;
Here the last else is unnecessary, right?

>   }
> 
> 

Like this, I believe both David and I will be satisfied, even though
I still think his second reset is not needed :-)

>   /* The zone has no valid section */
>   if (!pfn) {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
>   }
> 
> out:
>   zone_span_writeunlock(zone);
> 
> Well, this is just my personal taste :-)
> 
> >-
> >-/*
> >- * The section is not biggest or smallest mem_section in the zone, it
> >- * only creates a hole in the zone. So in this case, we need not
> >- * change the zone. But perhaps, the zone has only hole data. Thus
> >- * it check the zone has only hole or not.
> >- */
> >-pfn = zone_start_pfn;
> >-for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
> >-if (unlikely(!pfn_to_online_page(pfn)))
> >-continue;
> >-
> >-if (page_zone(pfn_to_page(pfn)) != zone)
> >-continue;
> >-
> >-/* Skip range to be removed */
> >-if (pfn >= start_pfn && pfn < end_pfn)
> >-continue;
> >-
> >-/* If we find valid section, we have nothing to do */
> >-zone_span_writeunlock(zone);
> >-return;
> >-}
> >-
> >-/* The zone has no valid section */
> >-zone->zone_start_pfn = 0;
> >-zone->spanned_pages = 0;
> > zone_span_writeunlock(zone);
> > }
> > 
> >-- 
> >2.21.0
> 
> -- 
> Wei Yang
> Help you, Help me
>

Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread Baoquan He

On 02/05/20 at 02:38pm, David Hildenbrand wrote:
> On 05.02.20 14:34, Baoquan He wrote:
> > On 02/05/20 at 02:20pm, David Hildenbrand wrote:
> >> On 05.02.20 13:43, Baoquan He wrote:
> >>> On 02/04/20 at 03:42pm, David Hildenbrand wrote:
> >>>> On 04.02.20 15:25, Baoquan He wrote:
> >>>>> On 10/06/19 at 10:56am, David Hildenbrand wrote:
> >>>>>> If we have holes, the holes will automatically get detected and removed
> >>>>>> once we remove the next bigger/smaller section. The extra checks can
> >>>>>> go.
> >>>>>>
> >>>>>> Cc: Andrew Morton 
> >>>>>> Cc: Oscar Salvador 
> >>>>>> Cc: Michal Hocko 
> >>>>>> Cc: David Hildenbrand 
> >>>>>> Cc: Pavel Tatashin 
> >>>>>> Cc: Dan Williams 
> >>>>>> Cc: Wei Yang 
> >>>>>> Signed-off-by: David Hildenbrand 
> >>>>>> ---
> >>>>>>  mm/memory_hotplug.c | 34 +++---
> >>>>>>  1 file changed, 7 insertions(+), 27 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>>>>> index f294918f7211..8dafa1ba8d9f 100644
> >>>>>> --- a/mm/memory_hotplug.c
> >>>>>> +++ b/mm/memory_hotplug.c
> >>>>>> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, 
> >>>>>> unsigned long start_pfn,
> >>>>>>if (pfn) {
> >>>>>>zone->zone_start_pfn = pfn;
> >>>>>>zone->spanned_pages = zone_end_pfn - pfn;
> >>>>>> +  } else {
> >>>>>> +  zone->zone_start_pfn = 0;
> >>>>>> +  zone->spanned_pages = 0;
> >>>>>>}
> >>>>>>} else if (zone_end_pfn == end_pfn) {
> >>>>>>/*
> >>>>>> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> >>>>>> unsigned long start_pfn,
> >>>>>>   start_pfn);
> >>>>>>if (pfn)
> >>>>>>zone->spanned_pages = pfn - zone_start_pfn + 1;
> >>>>>> +  else {
> >>>>>> +  zone->zone_start_pfn = 0;
> >>>>>> +  zone->spanned_pages = 0;
> >>>>>
> >>>>> Thinking in which case (zone_start_pfn != start_pfn) and it comes here.
> >>>>
> >>>> Could only happen in case the zone_start_pfn would have been "out of the
> >>>> zone already". If you ask me: unlikely :)
> >>>
> >>> Yeah, I also think it's unlikely to come here.
> >>>
> >>> The 'if (zone_start_pfn == start_pfn)' checking also covers the case
> >>> (zone_start_pfn == start_pfn && zone_end_pfn == end_pfn). So this
> >>> zone_start_pfn/spanned_pages resetting can be removed to avoid
> >>> confusion.
> >>
> >> At least I would find it more confusing without it (or want a comment
> >> explaining why this does not have to be handled and why the !pfn case is
> >> not possible).
> > 
> > I don't get why being w/o it will be more confusing, but it's OK since
> > it doesn't impact anything. 
> 
> Because we could actually BUG_ON(!pfn) here, right? Only having a "if
> (pfn)" leaves the reader wondering "why is the other case not handled".
> 
> > 
> >>
> >> Anyhow, that patch is already upstream and I don't consider this high
> >> priority. Thanks :)
> > 
> > Yeah, noticed you told Wei the status in another patch thread, I am fine
> > with it, just leave it to you to decide. Thanks.
> 
> I am fairly busy right now. Can you send a patch (double-checking and
> making this eventually unconditional?). Thanks!

Understood, sorry about the noise, David. I will think about this.

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-04-10 Thread Baoquan He

On 04/09/20 at 07:27pm, Mike Rapoport wrote:
> On Tue, Mar 31, 2020 at 04:21:38PM +0200, Michal Hocko wrote:
> > On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > > Hi Michal,
> > > 
> > > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > > Maybe I mis-read the code, but I don't see how this could happen. In 
> > > > > the
> > > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > > calculate_node_totalpages() that ensures that node->node_zones are 
> > > > > entirely
> > > > > within the node because this is checked in 
> > > > > zone_spanned_pages_in_node().
> > > > 
> > > > zone_spanned_pages_in_node does check the zone boundaries are within the
> > > > node boundaries. But that doesn't really tell anything about other
> > > > potential zones interleaving with the physical memory range.
> > > > zone->spanned_pages simply gives the physical range for the zone
> > > > including holes. Interleaving nodes are essentially a hole
> > > > (__absent_pages_in_range is going to skip those).
> > > > 
> > > > That means that when free_area_init_core simply goes over the whole
> > > > physical zone range including holes and that is why we need to check
> > > > both for physical and logical holes (aka other nodes).
> > > > 
> > > > The life would be so much easier if the whole thing would simply iterate
> > > > over memblocks...
> > > 
> > > The memblock iterating sounds a great idea. I tried with putting the
> > > memblock iterating in the upper layer, memmap_init(), which is used for
> > > boot mem only anyway. Do you think it's doable and OK? It yes, I can
> > > work out a formal patch to make this simpler as you said. The draft code
> > > is as below. Like this it uses the existing code and involves little 
> > > change.
> > 
> > Doing this would be a step in the right direction! I haven't checked the
> > code very closely though. The below sounds way too simple to be truth I
> > am afraid. First for_each_mem_pfn_range is available only for
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> > saying that I really hate that being conditional). Also I haven't really
> > checked the deferred initialization path - I have a very vague
> > recollection that it has been converted to the memblock api but I have
> > happily dropped all that memory.
> 
> The Baoquan's patch almost did it, at least for simple case of qemu with 2
> nodes. It's only missing the adjustment to the size passed to
> memmap_init_zone() as it may change because of clamping.

Right, the size needs to be adjusted after clamping start and end.
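
As a rough illustration of that adjustment (the draft patch is cut off in
this archive, so this is not its code): once a memory range is clamped to
the zone boundaries, the size has to be recomputed from the clamped values
rather than taken from the zone. clamp_ul() and clamp_range_to_zone() are
hypothetical helper names used only for this sketch.

```c
/* Clamp val into [lo, hi], in the spirit of the kernel's clamp() macro. */
static unsigned long clamp_ul(unsigned long val, unsigned long lo,
			      unsigned long hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Intersect the memory range [*start_pfn, *end_pfn) with the zone range
 * [zone_start, zone_end). The range is updated in place and the number of
 * pages left to initialize is returned, or 0 if there is no overlap.
 */
static unsigned long clamp_range_to_zone(unsigned long *start_pfn,
					 unsigned long *end_pfn,
					 unsigned long zone_start,
					 unsigned long zone_end)
{
	*start_pfn = clamp_ul(*start_pfn, zone_start, zone_end);
	*end_pfn = clamp_ul(*end_pfn, zone_start, zone_end);

	/* The size must follow the clamped range, not the original one. */
	return *end_pfn > *start_pfn ? *end_pfn - *start_pfn : 0;
}
```

A caller would pass the returned size on only when it is non-zero, so
ranges that fall completely outside the zone are skipped.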

> 
> I've drafted something that removes HAVE_MEMBLOCK_NODE_MAP and added this
> patch there [1]. For several memory configurations I could emulate with
> qemu it worked.
> I'm going to wait a bit to see if kbuild is happy and then I'll send the
> patches.
> 
> Baoquan, I took liberty to add your SoB, hope you don't mind.
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memblock/all-have-node-map
>  

Of course not. Thanks for doing this. I look forward to seeing your
formal patchset when it's ready.

>   
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 138a56c0f48f..558d421f294b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long 
> > > size, int nid, unsigned long zone,
> > >* function.  They do not exist on hotplugged memory.
> > >*/
> > >   if (context == MEMMAP_EARLY) {
> > > - if (!early_pfn_valid(pfn)) {
> > > - pfn = next_pfn(pfn);
> > > - continue;
> > > - }
> > > - if (!early_pfn_in_nid(pfn, nid)) {
> > > - pfn++;
> > > - continue;
> > > - }
> > >   if (overlap_memmap_init(zone, ))
> > >   continue;
> > >   if (defer_init(nid, pfn, end_pfn))
> > > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct 
> > > zone *zone)
> > >  }
> > >  
> > >  void __meminit __weak memmap_init(u

Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2020-04-10 Thread Baoquan He

On 04/09/20 at 05:33pm, Michal Hocko wrote:
> On Thu 09-04-20 22:41:19, Baoquan He wrote:
> > On 04/02/20 at 10:01am, Michal Hocko wrote:
> > > On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> > > > Hi,
> > > > 
> > > > On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> > > [...]
> > > > > From above information, we can remove HAVE_MEMBLOCK_NODE_MAP, and
> > > > > replace it with CONFIG_NUMA. That sounds more sensible to store nid 
> > > > > into
> > > > > memblock when NUMA support is enabled.
> > > >  
> > > > Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> > > > this will not help cleaning up the whole node/zone initialization mess 
> > > > and
> > > > we'll be stuck with two implementations.
> > > 
> > > Yeah, this is far from optimal.
> > > 
> > > > The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only for init time as
> > > > most architectures will anyway discard the entire memblock, so having 
> > > > it in
> > > > a UMA arch won't be a problem. The only exception is arm that uses
> > > > memblock for pfn_valid(), here we may also think about a solution to
> > > > compensate the addition of nid to the memblock structures. 
> > > 
> > > Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
> > > memblock_get_region_node would then unconditionally return 0 on UMA.
> > > Essentially the same way we do NUMA for other MM code. I only see few
> > > direct usage of region->nid.
> > 
> > Checked code again, seems HAVE_MEMBLOCK_NODE_MAP is selected directly in
> > all ARCHes which support it. Means HAVE_MEMBLOCK_NODE_MAP is enabled by
> > default on those ARCHes, and has no dependency on CONFIG_NUMA at all.
> > E.g on x86, it just calls free_area_init_nodes() in generic code path,
> > while free_area_init_nodes() is defined in CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > ifdeffery scope. So I tend to agree with Mike to remove
> > HAVE_MEMBLOCK_NODE_MAP firstly on all ARCHes. We can check if it's worth
> > only defining memblock_region->nid for CONFIG_NUMA case after
> > HAVE_MEMBLOCK_NODE_MAP is removed.
> 
> This can surely go in separate patches. What I meant to say is the
> region->nid is by definition 0 on !CONFIG_NUMA.

I see, thanks.
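
For the record, my understanding of that suggestion as a small
stand-alone sketch. struct memblock_region_model and region_node() are
made-up names modelling memblock_region and memblock_get_region_node();
the real definitions may differ.

```c
/* Hypothetical model of a memblock region; nid exists only with NUMA. */
struct memblock_region_model {
	unsigned long base;
	unsigned long size;
#ifdef CONFIG_NUMA
	int nid;
#endif
};

/* On !CONFIG_NUMA every region belongs to node 0 by definition. */
static inline int region_node(const struct memblock_region_model *r)
{
#ifdef CONFIG_NUMA
	return r->nid;
#else
	(void)r;	/* unused on UMA builds */
	return 0;
#endif
}
```

Callers always go through the accessor, so UMA builds carry no nid field
and no per-call ifdeffery.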

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread Baoquan He

On 04/14/20 at 04:49pm, David Hildenbrand wrote:
> >>>>> The root cause is kexec-ed kernel is targeted at hotpluggable memory
> >>>>> region. Just avoiding the movable area can fix it. In kexec_file_load(),
> >>>>> just checking or picking those unmovable region to put kernel/initrd in
> >>>>> function locate_mem_hole_callback() can fix it. The page or pageblock's
> >>>>> zone is movable or not, it's easy to know. This fix doesn't need to
> >>>>> bother other component.
> >>>>
> >>>> I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
> >>>> does not imply that it cannot get offlined and removed e.g., this is
> >>>> heavily used on ppc64, with 16MB sections.
> >>>
> >>> Really? I just know there are two kinds of mem hoplug in ppc, but don't
> >>> know the details. So in this case, is there any flag or a way to know
> >>> those memory block are hotpluggable? I am curious how those kernel data
> >>> is avoided to be put in this area. Or ppc just freely uses it for kernel
> >>> data or user space data, then try to migrate when hot remove?
> >>
> >> See
> >> arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_memory_remove_by_count()
> >>
> >> Under DLPAR, it can remove memory in LMB granularity, which is usually
> >> 16MB (== single section on ppc64). DLPAR will directly online all
> >> hotplugged memory (LMBs) from the kernel using device_online(), which
> >> will go to ZONE_NORMAL.
> >>
> >> When trying to remove memory, it simply scans for offlineable 16MB
> >> memory blocks (==section == LMB), offlines and removes them. No need for
> >> the movable zone and all the involved issues.
> > 
> > Yes, this is a different one, thanks for pointing it out. It sounds like
> > balloon driver in virt platform, doesn't it?
> 
> With DLPAR there is a hypervisor involved (which manages the actual HW
> DIMMs), so yes.
> 
> > 
> > Avoiding to put kexec kernel into movable zone can't solve this DLPAR
> > case as you said.
> > 
> >>
> >> Now, the interesting question is, can we have LMBs added during boot
> >> (not via add_memory()), that will later be removed via remove_memory().
> >> IIRC, we had BUGs related to that, so I think yes. If a section contains
> >> no unmovable allocations (after boot), it can get removed.
> > 
> > I do want to ask this question. If we can add LMB into system RAM, then
> > reload kexec can solve it. 
> > 
> > Another, better way is adding a common function to filter out the
> > movable zone when searching a position for the kexec kernel, and using
> > an arch specific function to filter out DLPAR memory blocks for ppc only.
> > Over there, we can simply use for_each_drmem_lmb() to do that.
> 
> I was thinking about something similar. Maybe something like a notifier
> that can be used to test if selected memory can be used for kexec

I'm not sure I fully get the notifier idea. If you mean

1) Add a common function to pick memory from an unmovable zone;
2) Let DLPAR, balloon, etc. register with a notifier;
3) In the common function, ask the registered parties to check whether
   the picked unmovable memory can be used for placing the kexec kernel;

then it sounds doable to me, and not complicated.
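
A rough user-space model of steps 2) and 3), just to check that we mean
the same thing. Nothing here is an existing kernel interface:
kexec_mem_check_fn, register_kexec_mem_checker(), kexec_mem_usable() and
demo_removable_check() are hypothetical names, and a real version would
more likely sit on top of the kernel's notifier chains.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical veto callback: false means [start, end) must not be used. */
typedef bool (*kexec_mem_check_fn)(unsigned long start, unsigned long end);

#define MAX_CHECKERS 8
static kexec_mem_check_fn checkers[MAX_CHECKERS];
static size_t nr_checkers;

/* Step 2: a memory owner (DLPAR, balloon, ...) registers its check. */
static bool register_kexec_mem_checker(kexec_mem_check_fn fn)
{
	if (nr_checkers == MAX_CHECKERS)
		return false;
	checkers[nr_checkers++] = fn;
	return true;
}

/* Step 3: the common function asks every registered party about a range. */
static bool kexec_mem_usable(unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < nr_checkers; i++)
		if (!checkers[i](start, end))
			return false;
	return true;
}

/* Example owner: pretends that [0x4000, 0x8000) may be removed later. */
static bool demo_removable_check(unsigned long start, unsigned long end)
{
	return end <= 0x4000 || start >= 0x8000;
}
```

Any single registered party can veto a candidate range, so the common
picking function needs no knowledge of DLPAR, virtio-mem and friends.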

> images. It would apply to
> 
> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>   be used).

Do you mean hot added memory after boot can't be recognized and added
into system RAM on arm64?


> - powerpc to filter out all LMBs that can be removed (assuming not all
>   memory corresponds to LMBs that can be removed, otherwise we're in
>   trouble ... :) )
> - virtio-mem to filter out all memory it added.
> - hyper-v to filter out partially backed memory blocks (esp. the last
>   memory block it added and only partially backed it by memory).
> 
> This would make it work for kexec_file_load(), however, I do wonder how
> we would want to approach that from userspace kexec-tools when handling
> it from kexec_load().

Let's make kexec_file_load work first, since this work is only the first
step towards making the kexec-ed kernel not break memory hotplug. After a
kexec reboot, KASLR may place the kernel in a hotpluggable area too.

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread Baoquan He

On 04/16/20 at 04:09pm, David Hildenbrand wrote:
> >>> Sounds doable to me, and not complicated.
> >>>
> >>>> images. It would apply to
> >>>>
> >>>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
> >>>>   be used).
> >>>
> >>> Do you mean hot added memory after boot can't be recognized and added
> >>> into system RAM on arm64?
> >>
> >> See patch #3 of this patch set, which wants to avoid placing kexec
> >> binaries on hotplugged memory. But I have no idea what the current plan
> >> regarding arm64 is (this thread exploded :) ).
> >>
> >> I would assume that we don't want to place kexec images on any
> >> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.
> > 
> > Yes, noticed that and James replied to DaveY.
> > 
> > Later, when I was considering to make a draft patch to do the picking of
> > memory from normal zone, and add a notifier, as we discussed at above, I
> > suddenly realized that kexec_file_load doesn't have this issue. It
> > traverse system RAM bottom up to get an available region to put
> > kernel/initrd/boot_param, etc. I can't think of a system where its
> > low memory could be unavailable.
> 
> kexec_walk_memblock() has the option for "kbuf->top_down". Only
> kexec_walk_resources() seems to ignore it.

Yeah, but that top-down placement happens inside a region that was found
bottom up. That is, we first search bottom up for an available region,
then place the kernel top down within that region. The reason is that our
iomem resources are linked in a singly linked list, so we can only search
bottom up efficiently.
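
To illustrate the difference, a simplified stand-alone sketch. It is not
the kexec_file code: struct mem_region and locate_hole() are hypothetical,
the region array is assumed to be sorted by address, and align is assumed
to be a power of two.

```c
#include <stddef.h>

/* One free RAM range [start, end); the array is sorted by address. */
struct mem_region {
	unsigned long start;
	unsigned long end;
};

/*
 * Walk the regions bottom up and stop at the first one that can hold
 * size bytes at the requested alignment. Inside that region, place the
 * buffer at the top if top_down is set, otherwise at the bottom.
 * Returns 0 and stores the address in *addr, or -1 if nothing fits.
 */
static int locate_hole(const struct mem_region *regions, size_t nr,
		       unsigned long size, unsigned long align,
		       int top_down, unsigned long *addr)
{
	for (size_t i = 0; i < nr; i++) {
		/* Lowest aligned address inside this region. */
		unsigned long lo = (regions[i].start + align - 1) & ~(align - 1);

		if (lo >= regions[i].end || regions[i].end - lo < size)
			continue;

		if (top_down)
			*addr = (regions[i].end - size) & ~(align - 1);
		else
			*addr = lo;
		return 0;
	}
	return -1;
}
```

Even with top_down set, the result stays in the lowest region that fits,
which is why kexec_file_load still ends up in low memory.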

kexec_load does a real top-down search, so the kernel is put at the top
of system RAM. I once tried to change kexec_file_load to support top-down
searching too, with patches, since QE and customers are often confused by
this difference when debugging.

Andrew may remember this. He suggested that I change the iomem resource
list from a singly linked list to a doubly linked list, then do the
top-down search for kexec_file_load. I put some effort into it, but it
introduced too much code change, so I finally gave up.

http://archive.lwn.net:8080/devicetree/20180718024944.577-1-...@redhat.com/

I can see that top-down searching for kexec avoids the heavily used low
memory region, especially under 4G, which is needed for DMA, various
firmware reservations, etc. And customers/QE of kexec have got used to it.
I can change kexec_file_load to top down too in a simple way if people
really complain about it. But for now, bottom up doesn't seem bad either.

> 
> So I think in case of memblocks (e.g., arm64), this still applies?

Yeah, but aren't you trying to remove it? I haven't read your patches
carefully, so maybe I got it wrong. Also, arm64 can't even have hot added
memory recorded in firmware, so it doesn't seem quite ready. Won't they
change that design in the future?
> 
> >>
> >>>
> >>>
> >>>> - powerpc to filter out all LMBs that can be removed (assuming not all
> >>>>   memory corresponds to LMBs that can be removed, otherwise we're in
> >>>>   trouble ... :) )
> >>>> - virtio-mem to filter out all memory it added.
> >>>> - hyper-v to filter out partially backed memory blocks (esp. the last
> >>>>   memory block it added and only partially backed it by memory).
> >>>>
> >>>> This would make it work for kexec_file_load(), however, I do wonder how
> >>>> we would want to approach that from userspace kexec-tools when handling
> >>>> it from kexec_load().
> >>>
> >>> Let's make kexec_file_load work firstly. Since this work is only first
> >>> step to make kexec-ed kernel not break memory hotplug. After kexec
> >>> rebooting, the KASLR may locate kernel into hotpluggable area too.
> >>
> >> Can you elaborate how that would work?
> > 
> > Well, boot memory can be hotplugged or not after boot, they are marked
> > in uefi tables, the current kexec doesn't save and pass them into 2nd
> > kenrel, when kexec kernel bootup, it need read them and avoid them to
> > randomize kernel into.
> 
> What about e.g., memory hotplugged by ACPI? I would assume, that the
> kexec kernel will not make use of that (IOW detected that) until the
> ACPI driver comes up and re-detects + adds that memory.
> 
> Or how would that machinery work in case we have a DIMM hotplugged via ACPI?

The ACPI SRAT is embedded in EFI; we need to read out the RSDP pointer.
If we don't pass the EFI info, the kexec kernel won't get the SRAT table
correctly, if I remember correctly. Yeah, I remember a KVM guest can get
memory hotplugged with ACPI only; this won't happen on bare metal though.
This needs to be checked carefully. I have been using KVM guests with
UEFI firmware recently.

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-16 Thread Baoquan He

On 04/16/20 at 03:31pm, David Hildenbrand wrote:
> > Not sure if I get the notifier idea clearly. If you mean 
> > 
> > 1) Add a common function to pick memory in unmovable zone;
> 
> Not strictly required IMHO. But, minor detail.
> 
> > 2) Let DLPAR, balloon register with notifier;
> 
> Yeah, or virtio-mem, or any other technology that adds/removes memory
> dynamically.
> 
> > 3) In the common function, ask notified part to check if the picked
> >unmovable memory is available for locating kexec kernel;
> 
> Yeah.

These may not be needed, please see the comment below.

> 
> > 
> > Sounds doable to me, and not complicated.
> > 
> >> images. It would apply to
> >>
> >> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
> >>   be used).
> > 
> > Do you mean hot added memory after boot can't be recognized and added
> > into system RAM on arm64?
> 
> See patch #3 of this patch set, which wants to avoid placing kexec
> binaries on hotplugged memory. But I have no idea what the current plan
> regarding arm64 is (this thread exploded :) ).
> 
> I would assume that we don't want to place kexec images on any
> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.

Yes, noticed that and James replied to DaveY.

Later, when I was thinking about drafting a patch to pick memory from the
normal zone and add a notifier, as we discussed above, I suddenly realized
that kexec_file_load doesn't have this issue. It traverses system RAM
bottom-up to find an available region for the kernel, initrd, boot_params,
etc. I can't think of a system whose low memory could be unavailable.
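
As a rough illustration of the bottom-up walk described above, here is a
simplified user-space model. It is not the kernel's kexec_file_load code;
struct ram_range, find_hole_bottom_up() and the idea of returning 0 on
failure are assumptions made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one free System RAM range: [start, end). */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Walk the ranges from the lowest address upwards and return the first
 * aligned address where 'size' bytes fit, or 0 if nothing fits.
 * 'ranges' must be sorted by ascending start address.
 */
static uint64_t find_hole_bottom_up(const struct ram_range *ranges, size_t n,
				    uint64_t size, uint64_t align)
{
	size_t i;

	/* align must be a non-zero power of two */
	assert(align != 0 && (align & (align - 1)) == 0);

	for (i = 0; i < n; i++) {
		/* Round the start of this range up to the alignment. */
		uint64_t base = (ranges[i].start + align - 1) & ~(align - 1);

		if (base < ranges[i].end && ranges[i].end - base >= size)
			return base;
	}
	return 0;
}
```

Because the walk starts at the lowest range, the result lands in low
memory whenever a large enough hole exists there.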
> 
> > 
> > 
> >> - powerpc to filter out all LMBs that can be removed (assuming not all
> >>   memory corresponds to LMBs that can be removed, otherwise we're in
> >>   trouble ... :) )
> >> - virtio-mem to filter out all memory it added.
> >> - hyper-v to filter out partially backed memory blocks (esp. the last
> >>   memory block it added and only partially backed it by memory).
> >>
> >> This would make it work for kexec_file_load(), however, I do wonder how
> >> we would want to approach that from userspace kexec-tools when handling
> >> it from kexec_load().
> > 
> > Let's make kexec_file_load work firstly. Since this work is only first
> > step to make kexec-ed kernel not break memory hotplug. After kexec
> > rebooting, the KASLR may locate kernel into hotpluggable area too.
> 
> Can you elaborate how that would work?

Well, boot memory may or may not be hotpluggable after boot; this is marked
in the UEFI tables. The current kexec doesn't save those tables and pass
them to the 2nd kernel. When the kexec-ed kernel boots up, it needs to read
them so that KASLR avoids randomizing the kernel into hotpluggable regions.
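
To make the idea concrete, below is a small user-space sketch of picking a
random load slot while skipping regions that firmware marked hotpluggable.
It is only a model under my own assumptions: struct mem_desc, slots_in()
and pick_kernel_slot() are hypothetical names, not the x86 KASLR code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical firmware-provided memory descriptor. */
struct mem_desc {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
	bool hotpluggable;	/* marked removable by firmware */
};

/* Number of 'align'-spaced slots in one region that can hold 'size' bytes. */
static uint64_t slots_in(const struct mem_desc *d, uint64_t size,
			 uint64_t align)
{
	uint64_t base = (d->start + align - 1) & ~(align - 1);

	if (d->hotpluggable || base >= d->end || d->end - base < size)
		return 0;
	return (d->end - base - size) / align + 1;
}

/*
 * Map a random number onto the slots of all non-hotpluggable regions.
 * Returns the chosen load address, or 0 if no slot exists.
 */
static uint64_t pick_kernel_slot(const struct mem_desc *map, size_t n,
				 uint64_t size, uint64_t align, uint64_t rnd)
{
	uint64_t total = 0, idx;
	size_t i;

	assert(align != 0 && (align & (align - 1)) == 0);

	for (i = 0; i < n; i++)
		total += slots_in(&map[i], size, align);
	if (total == 0)
		return 0;

	idx = rnd % total;
	for (i = 0; i < n; i++) {
		uint64_t cnt = slots_in(&map[i], size, align);

		if (idx < cnt)
			return ((map[i].start + align - 1) & ~(align - 1)) +
			       idx * align;
		idx -= cnt;
	}
	return 0;	/* not reached when total != 0 */
}
```

The hotpluggable flag has to survive the kexec handover for this to work,
which is exactly the part the current kexec does not do.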

Re: [PATCH v1 1/2] powerpc/pseries/hotplug-memory: stop checking is_mem_section_removable()

2020-04-07 Thread Baoquan He

Adding Pingfan to CC since he usually handles ppc related bugs for RHEL.

On 04/07/20 at 03:54pm, David Hildenbrand wrote:
> In commit 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory
> blocks as removable"), the user space interface to compute whether a memory
> block can be offlined (exposed via
> /sys/devices/system/memory/memoryX/removable) has effectively been
> deprecated. We want to remove the leftovers of the kernel implementation.

Pingfan, can you have a look at this change on PPC? Please feel free to
comment if you have any concerns, or offer an ack if it looks OK to you.

> 
> When offlining a memory block (mm/memory_hotplug.c:__offline_pages()),
> we'll start by:
> 1. Testing if it contains any holes, and reject if so
> 2. Testing if pages belong to different zones, and reject if so
> 3. Isolating the page range, checking if it contains any unmovable pages
> 
> Using is_mem_section_removable() before trying to offline is not only racy,
> it can easily result in false positives/negatives. Let's stop manually
> checking is_mem_section_removable(), and let device_offline() handle it
> completely instead. We can remove the racy is_mem_section_removable()
> implementation next.
> 
> We now take more locks (e.g., memory hotplug lock when offlining and the
> zone lock when isolating), but maybe we should optimize that
> implementation instead if this ever becomes a real problem (after all,
> memory unplug is already an expensive operation). We started using
> is_mem_section_removable() in commit 51925fb3c5c9 ("powerpc/pseries:
> Implement memory hotplug remove in the kernel"), with the initial
> hotremove support of lmbs.
> 
> Cc: Nathan Fontenot 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  .../platforms/pseries/hotplug-memory.c| 26 +++
>  1 file changed, 3 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index b2cde1732301..5ace2f9a277e 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -337,39 +337,19 @@ static int pseries_remove_mem_node(struct device_node 
> *np)
>  
>  static bool lmb_is_removable(struct drmem_lmb *lmb)
>  {
> - int i, scns_per_block;
> - bool rc = true;
> - unsigned long pfn, block_sz;
> - u64 phys_addr;
> -
>   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
>   return false;
>  
> - block_sz = memory_block_size_bytes();
> - scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> - phys_addr = lmb->base_addr;
> -
>  #ifdef CONFIG_FA_DUMP
>   /*
>* Don't hot-remove memory that falls in fadump boot memory area
>* and memory that is reserved for capturing old kernel memory.
>*/
> - if (is_fadump_memory_area(phys_addr, block_sz))
> + if (is_fadump_memory_area(lmb->base_addr, memory_block_size_bytes()))
>   return false;
>  #endif
> -
> - for (i = 0; i < scns_per_block; i++) {
> - pfn = PFN_DOWN(phys_addr);
> - if (!pfn_in_present_section(pfn)) {
> - phys_addr += MIN_MEMORY_BLOCK_SIZE;
> - continue;
> - }
> -
> - rc = rc && is_mem_section_removable(pfn, PAGES_PER_SECTION);
> - phys_addr += MIN_MEMORY_BLOCK_SIZE;
> - }
> -
> - return rc;
> + /* device_offline() will determine if we can actually remove this lmb */
> + return true;
>  }
>  
>  static int dlpar_add_lmb(struct drmem_lmb *);
> -- 
> 2.25.1
>
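
For reference, the order of the three checks that the commit message lists
for __offline_pages() can be modelled in a few lines of user-space C. This
is only a toy sketch: struct model_page, enum offline_result and
model_offline() are invented names, and the real code isolates and
migrates pages under locks rather than reading flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for the model; not the kernel's struct page. */
struct model_page {
	bool present;	/* false means the pfn is a hole */
	int zone;	/* index of the zone the page belongs to */
	bool movable;	/* false means the page cannot be migrated */
};

enum offline_result {
	OFFLINE_OK,
	OFFLINE_HAS_HOLE,
	OFFLINE_MULTI_ZONE,
	OFFLINE_UNMOVABLE,
};

/* Apply the three checks in the order listed in the commit message. */
static enum offline_result model_offline(const struct model_page *pages,
					 size_t n)
{
	size_t i;

	assert(pages != NULL);

	/* 1. Reject ranges that contain holes. */
	for (i = 0; i < n; i++)
		if (!pages[i].present)
			return OFFLINE_HAS_HOLE;

	/* 2. Reject ranges that span more than one zone. */
	for (i = 1; i < n; i++)
		if (pages[i].zone != pages[0].zone)
			return OFFLINE_MULTI_ZONE;

	/* 3. "Isolate" the range: fail if any page is unmovable. */
	for (i = 0; i < n; i++)
		if (!pages[i].movable)
			return OFFLINE_UNMOVABLE;

	return OFFLINE_OK;
}
```

The caller simply attempts the offline and looks at the result; there is
no separate "is it removable?" query whose answer could go stale.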

Re: [PATCH v1 2/2] mm/memory_hotplug: remove is_mem_section_removable()

2020-04-07 Thread Baoquan He

On 04/07/20 at 03:54pm, David Hildenbrand wrote:
> Fortunately, all users of is_mem_section_removable() are gone. Get rid of
> it, including some now unnecessary functions.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Assuming there is no issue with patch 1, this one looks good.

Reviewed-by: Baoquan He 

> ---
>  include/linux/memory_hotplug.h |  7 
>  mm/memory_hotplug.c| 75 --
>  2 files changed, 82 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 93d9ada74ddd..7dca9cd6076b 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -314,19 +314,12 @@ static inline void pgdat_resize_init(struct pglist_data 
> *pgdat) {}
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  
> -extern bool is_mem_section_removable(unsigned long pfn, unsigned long 
> nr_pages);
>  extern void try_offline_node(int nid);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
>  
>  #else
> -static inline bool is_mem_section_removable(unsigned long pfn,
> - unsigned long nr_pages)
> -{
> - return false;
> -}
> -
>  static inline void try_offline_node(int nid) {}
>  
>  static inline int offline_pages(unsigned long start_pfn, unsigned long 
> nr_pages)
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 47cf6036eb31..4d338d546d52 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1112,81 +1112,6 @@ int add_memory(int nid, u64 start, u64 size)
>  EXPORT_SYMBOL_GPL(add_memory);
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -/*
> - * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
> - * set and the size of the free page is given by page_order(). Using this,
> - * the function determines if the pageblock contains only free pages.
> - * Due to buddy contraints, a free page at least the size of a pageblock will
> - * be located at the start of the pageblock
> - */
> -static inline int pageblock_free(struct page *page)
> -{
> - return PageBuddy(page) && page_order(page) >= pageblock_order;
> -}
> -
> -/* Return the pfn of the start of the next active pageblock after a given 
> pfn */
> -static unsigned long next_active_pageblock(unsigned long pfn)
> -{
> - struct page *page = pfn_to_page(pfn);
> -
> - /* Ensure the starting page is pageblock-aligned */
> - BUG_ON(pfn & (pageblock_nr_pages - 1));
> -
> - /* If the entire pageblock is free, move to the end of free page */
> - if (pageblock_free(page)) {
> - int order;
> - /* be careful. we don't have locks, page_order can be changed.*/
> - order = page_order(page);
> - if ((order < MAX_ORDER) && (order >= pageblock_order))
> - return pfn + (1 << order);
> - }
> -
> - return pfn + pageblock_nr_pages;
> -}
> -
> -static bool is_pageblock_removable_nolock(unsigned long pfn)
> -{
> - struct page *page = pfn_to_page(pfn);
> - struct zone *zone;
> -
> - /*
> -  * We have to be careful here because we are iterating over memory
> -  * sections which are not zone aware so we might end up outside of
> -  * the zone but still within the section.
> -  * We have to take care about the node as well. If the node is offline
> -  * its NODE_DATA will be NULL - see page_zone.
> -  */
> - if (!node_online(page_to_nid(page)))
> - return false;
> -
> - zone = page_zone(page);
> - pfn = page_to_pfn(page);
> - if (!zone_spans_pfn(zone, pfn))
> - return false;
> -
> - return !has_unmovable_pages(zone, page, MIGRATE_MOVABLE,
> - MEMORY_OFFLINE);
> -}
> -
> -/* Checks if this range of memory is likely to be hot-removable. */
> -bool is_mem_section_removable(unsigned long start_pfn, unsigned long 
> nr_pages)
> -{
> - unsigned long end_pfn, pfn;
> -
> - end_pfn = min(start_pfn + nr_pages,
> - zone_end_pfn(page_zone(pfn_to_page(start_pfn))));
> -
> - /* Check the starting page of each pageblock within the range */
> - for (pfn = start_pfn; pfn < end_pfn; pfn = next_active_pageblock(pfn)) {
> - if (!is_pageblock_removable_nolock(pfn))
> - return false;
> - cond_resched();
> - }
> -
> - /* All pageblocks in the memory block are likely to be hot-removable */
> - return true;
> -}
> -
>  /*
>   * Confirm all pages in a range [start, end) belong to the same zone 
> (skipping
>   * memory holes). When true, return the zone.
> -- 
> 2.25.1
>

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread Baoquan He

On 04/14/20 at 10:00am, David Hildenbrand wrote:
> On 14.04.20 08:40, Baoquan He wrote:
> > On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> >> Baoquan He  writes:
> >>
> >>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >>>>
> >>>> The only benefit of kexec_file_load is that it is simple enough from a
> >>>> kernel perspective that signatures can be checked.
> >>>
> >>> We don't have this restriction any more with below commit:
> >>>
> >>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> >>> and KEXEC_SIG_FORCE")
> >>>
> >>> With KEXEC_SIG_FORCE not set, we can use kexec_file_load to cover both
> >>> secure boot or legacy system for kexec/kdump. Being simple enough is
> >>> enough to attract and convince us to use it instead. And kexec_file_load
> >>> has been in use for several years on systems with secure boot, since
> >>> added in 2014, on x86_64.
> >>
> >> No.  Actually kexec_file_load is the less capable interface, and less
> >> flexible interface.  Which is why it is appropriate for signature
> >> verification.
> > 
> > Well, everyone has a stance and the corresponding view. You could have
> > wider view from long time maintenance and in upstream position, and think
> > kexec_file_load is horrible. But I can only see from our work as a front
> > line engineer to maintain/develop kexec/kdump in RHEL, and think
> > kexec_file_load is easier to maintain.
> > 
> > Surely except of multiple kernel image format support. No matter it is
> > kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> > This is produced from kernel building by default. We have no way to
> > support it in our distros and add it into kexec_file_load.
> > 
> > [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> > https://lkml.org/lkml/2017/2/15/654
> > 
> >>
> >>>> kexec_load in every other respect is the more capable and functional
> >>>> interface.  It makes no sense to get rid of it.
> >>>>
> >>>> It does make sense to reload with a loaded kernel on memory hotplug.
> >>>> That is simple and easy.  If we are going to handle something in the
> >>>> kernel it should simple an automated unloading of the kernel on memory
> >>>> hotplug.
> >>>>
> >>>>
> >>>> I think it would be irresponsible to deprecate kexec_load on any
> >>>> platform.
> >>>>
> >>>> I also suspect that kexec_file_load could be taught to copy the dtb
> >>>> on arm32 if someone wants to deal with signatures.
> >>>>
> >>>> We definitely can not even think of deprecating kexec_load until
> >>>> architecture that supports it also supports kexec_file_load and everyone
> >>>> is happy with that interface.  That is Linus's no regression rule.
> >>>
> >>> I should pick a milder word to express our tendency and tell our plan
> >>> than 'obsolete'. Even though I added 'gradually', seems it doesn't help
> >>> much. I didn't mean to say 'deprecate' at all when replied.
> >>>
> >>> The situation and trend I understand about kexec_load and kexec_file_load
> >>> are:
> >>>
> >>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> >>> have yet, just as x86_64, arm64 and s390 have done;
> >>>  
> >>> 2) kexec_file_load is suggested to use, and take precedence over
> >>> kexec_load in the future, if both are supported in one ARCH.
> >>
> >> The deep problem is that kexec_file_load is distinctly less expressive
> >> than kexec_load.
> >>
> >>> 3) Kexec_load is kept being used by ARCHes w/o kexec_file_load support,
> >>> and by ARCHes for back compatibility w/ kexec_file_load support.
> >>>
> >>> For 1) and 2), I think the reason is obvious as Eric said,
> >>> kexec_file_load is simple enough. And currently, whenever we got a bug
> >>> report, we may need fix them twice, for kexec_load and kexec_file_load.
> >>> If kexec_file_load is made by default, e.g on x86_64, we will change it
> >>> in kernel space only, for kexec_file_load. This is what I meant about
> >>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> >>> Unless there's some critical/blocker bug in kex

Re: [PATCH 01/21] mm: memblock: replace dereferences of memblock_region.nid with API calls

2020-04-20 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> There are several places in the code that directly dereference
> memblock_region.nid despite this field being defined only when
> CONFIG_HAVE_MEMBLOCK_NODE_MAP=y.
> 
> Replace these with calls to memblock_get_region_nid() to improve code
> robustness and to avoid possible breakage when
> CONFIG_HAVE_MEMBLOCK_NODE_MAP will be removed.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/arm64/mm/numa.c | 9 ++---
>  arch/x86/mm/numa.c   | 6 --
>  mm/memblock.c| 8 +---
>  mm/page_alloc.c  | 4 ++--
>  4 files changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index 4decf1659700..aafcee3e3f7e 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -350,13 +350,16 @@ static int __init numa_register_nodes(void)
>   struct memblock_region *mblk;
>  
>   /* Check that valid nid is set to memblks */
> - for_each_memblock(memory, mblk)
> - if (mblk->nid == NUMA_NO_NODE || mblk->nid >= MAX_NUMNODES) {
> + for_each_memblock(memory, mblk) {
> + int mblk_nid = memblock_get_region_node(mblk);
> +
> + if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) {
>   pr_warn("Warning: invalid memblk node %d [mem 
> %#010Lx-%#010Lx]\n",
> - mblk->nid, mblk->base,
> + mblk_nid, mblk->base,
>   mblk->base + mblk->size - 1);
>   return -EINVAL;
>   }
> + }
>  
>   /* Finally register nodes. */
>   for_each_node_mask(nid, numa_nodes_parsed) {
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 59ba008504dc..fe024b2ac796 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -517,8 +517,10 @@ static void __init numa_clear_kernel_node_hotplug(void)
>*   reserve specific pages for Sandy Bridge graphics. ]
>*/
>   for_each_memblock(reserved, mb_region) {
> - if (mb_region->nid != MAX_NUMNODES)
> - node_set(mb_region->nid, reserved_nodemask);
> + int nid = memblock_get_region_node(mb_region);
> +
> + if (nid != MAX_NUMNODES)
> + node_set(nid, reserved_nodemask);
>   }
>  
>   /*
> diff --git a/mm/memblock.c b/mm/memblock.c
> index c79ba6f9920c..43e2fd3006c1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1207,13 +1207,15 @@ void __init_memblock __next_mem_pfn_range(int *idx, 
> int nid,
>  {
>   struct memblock_type *type = &memblock.memory;
>   struct memblock_region *r;
> + int r_nid;
>  
>   while (++*idx < type->cnt) {
>   r = &type->regions[*idx];
> + r_nid = memblock_get_region_node(r);
>  
>   if (PFN_UP(r->base) >= PFN_DOWN(r->base + r->size))
>   continue;
> - if (nid == MAX_NUMNODES || nid == r->nid)
> + if (nid == MAX_NUMNODES || nid == r_nid)
>   break;
>   }
>   if (*idx >= type->cnt) {
> @@ -1226,7 +1228,7 @@ void __init_memblock __next_mem_pfn_range(int *idx, int 
> nid,
>   if (out_end_pfn)
>   *out_end_pfn = PFN_DOWN(r->base + r->size);
>   if (out_nid)
> - *out_nid = r->nid;
> + *out_nid = r_nid;
>  }
>  
>  /**
> @@ -1810,7 +1812,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned 
> long pfn,
>   *start_pfn = PFN_DOWN(type->regions[mid].base);
>   *end_pfn = PFN_DOWN(type->regions[mid].base + type->regions[mid].size);
>  
> - return type->regions[mid].nid;
> + return memblock_get_region_node(&type->regions[mid]);
>  }
>  #endif
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 69827d4fa052..0d012eda1694 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7208,7 +7208,7 @@ static void __init 
> find_zone_movable_pfns_for_nodes(void)
>   if (!memblock_is_hotpluggable(r))
>   continue;
>  
> - nid = r->nid;
> + nid = memblock_get_region_node(r);
>  
>   usable_startpfn = PFN_DOWN(r->base);
>       zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> @@ -7229,7 +7229,7 @@ static void __init 
> find_zone_movable_pfns_for_nodes(void)
>   if (memblock_is_mirror(r))
>   continue;
>  
> - nid = r->nid;
> + nid = memblock_get_region_node(r);
>  
>   usable_startpfn = memblock_region_memory_base_pfn(r);

Looks good to me.

Reviewed-by: Baoquan He
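
The accessor pattern this patch switches to can be sketched in isolation
as below. DEMO_NODE_MAP and the demo_* names are made up for the example
and only stand in for the kernel's config option and the
memblock_{set,get}_region_node() helpers; the point is that callers keep
compiling whether or not the nid field exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * DEMO_NODE_MAP is a hypothetical macro standing in for the config option
 * that guards the nid field. Build with -DDEMO_NODE_MAP to store a node
 * id per region.
 */
struct demo_region {
	uint64_t base;
	uint64_t size;
#ifdef DEMO_NODE_MAP
	int nid;
#endif
};

#ifdef DEMO_NODE_MAP
static inline void demo_set_region_node(struct demo_region *r, int nid)
{
	r->nid = nid;
}

static inline int demo_get_region_node(const struct demo_region *r)
{
	return r->nid;
}
#else
static inline void demo_set_region_node(struct demo_region *r, int nid)
{
	(void)r;
	(void)nid;	/* nothing to store without the nid field */
}

static inline int demo_get_region_node(const struct demo_region *r)
{
	(void)r;
	return 0;	/* everything is treated as node 0 */
}
#endif

/* Callers never touch r->nid directly, so they build in both cases. */
static int demo_region_on_node(const struct demo_region *r, int nid)
{
	assert(r != NULL);
	return demo_get_region_node(r) == nid;
}
```

A direct r->nid dereference in demo_region_on_node() would fail to compile
as soon as the field is configured out, which is the breakage the patch
avoids.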

Re: [PATCH 02/21] mm: make early_pfn_to_nid() and related defintions close to each other

2020-04-20 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The early_pfn_to_nid() and it's helper __early_pfn_to_nid() are spread
> around include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c.
> 
> Drop unused stub for __early_pfn_to_nid() and move its actual generic
> implementation close to its users.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  include/linux/mm.h |  4 ++--
>  include/linux/mmzone.h |  9 
>  mm/page_alloc.c| 51 +-
>  3 files changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..a404026d14d4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2388,9 +2388,9 @@ extern void 
> sparse_memory_present_with_active_regions(int nid);
>  
>  #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
>  !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
> -static inline int __early_pfn_to_nid(unsigned long pfn,
> - struct mminit_pfnnid_cache *state)
> +static inline int early_pfn_to_nid(unsigned long pfn)
>  {
> + BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
>   return 0;
>  }
>  #else
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1b9de7d220fb..7b5b6eba402f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1078,15 +1078,6 @@ static inline struct zoneref 
> *first_zones_zonelist(struct zonelist *zonelist,
>  #include 
>  #endif
>  
> -#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
> - !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> -static inline unsigned long early_pfn_to_nid(unsigned long pfn)
> -{
> - BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
> - return 0;
> -}
> -#endif
> -
>  #ifdef CONFIG_FLATMEM
>  #define pfn_to_nid(pfn)  (0)
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d012eda1694..1ac775bfc9cf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1504,6 +1504,31 @@ void __free_pages_core(struct page *page, unsigned int 
> order)
>  
>  static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
>  
> +#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
> +
> +/*
> + * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
> + */
> +int __meminit __early_pfn_to_nid(unsigned long pfn,
> + struct mminit_pfnnid_cache *state)
> +{
> + unsigned long start_pfn, end_pfn;
> + int nid;
> +
> + if (state->last_start <= pfn && pfn < state->last_end)
> + return state->last_nid;
> +
> + nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
> + if (nid != NUMA_NO_NODE) {
> + state->last_start = start_pfn;
> + state->last_end = end_pfn;
> + state->last_nid = nid;
> + }
> +
> + return nid;
> +}
> +#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
> +
>  int __meminit early_pfn_to_nid(unsigned long pfn)
>  {
>   static DEFINE_SPINLOCK(early_pfn_lock);
> @@ -6298,32 +6323,6 @@ void __meminit init_currently_empty_zone(struct zone 
> *zone,
>   zone->initialized = 1;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP

Here the #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP is apparently removed too
early. That should be done in patch 3; its matching #endif is still kept
there. I only found this when I almost became dizzy reviewing patch 3.

> -#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
> -
> -/*
> - * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
> - */
> -int __meminit __early_pfn_to_nid(unsigned long pfn,
> - struct mminit_pfnnid_cache *state)
> -{
> - unsigned long start_pfn, end_pfn;
> - int nid;
> -
> - if (state->last_start <= pfn && pfn < state->last_end)
> - return state->last_nid;
> -
> - nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
> - if (nid != NUMA_NO_NODE) {
> - state->last_start = start_pfn;
> - state->last_end = end_pfn;
> - state->last_nid = nid;
> - }
> -
> - return nid;
> -}
> -#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
> -
>  /**
>   * free_bootmem_with_active_regions - Call memblock_free_early_nid for each 
> active range
>   * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
> -- 
> 2.25.1
>
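
For readers following the move, the logic of __early_pfn_to_nid() is a
range lookup with a one-entry cache of the last hit. A stand-alone sketch
is below; the demo_* types and the sorted table are assumptions of mine
that stand in for memblock and memblock_search_pfn_nid(), and DEMO_NO_NODE
stands in for NUMA_NO_NODE.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

/* Hypothetical table entry: pfns in [start_pfn, end_pfn) are on node nid. */
struct demo_pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/* Last successful lookup, mirroring struct mminit_pfnnid_cache. */
struct demo_pfnnid_cache {
	unsigned long last_start;
	unsigned long last_end;
	int last_nid;
};

/* Binary search over a table sorted by start_pfn. */
static int demo_search_pfn_nid(const struct demo_pfn_range *tbl, size_t n,
			       unsigned long pfn,
			       unsigned long *start_pfn,
			       unsigned long *end_pfn)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < tbl[mid].start_pfn) {
			hi = mid;
		} else if (pfn >= tbl[mid].end_pfn) {
			lo = mid + 1;
		} else {
			*start_pfn = tbl[mid].start_pfn;
			*end_pfn = tbl[mid].end_pfn;
			return tbl[mid].nid;
		}
	}
	return DEMO_NO_NODE;
}

/* Same shape as __early_pfn_to_nid(): consult the cache, then search. */
static int demo_early_pfn_to_nid(const struct demo_pfn_range *tbl, size_t n,
				 unsigned long pfn,
				 struct demo_pfnnid_cache *state)
{
	unsigned long start_pfn, end_pfn;
	int nid;

	assert(state != NULL);

	if (state->last_start <= pfn && pfn < state->last_end)
		return state->last_nid;

	nid = demo_search_pfn_nid(tbl, n, pfn, &start_pfn, &end_pfn);
	if (nid != DEMO_NO_NODE) {
		/* Only successful lookups update the cache. */
		state->last_start = start_pfn;
		state->last_end = end_pfn;
		state->last_nid = nid;
	}
	return nid;
}
```

Early init tends to walk pfns sequentially, so the cached range answers
most calls without a search.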

Re: [PATCH 02/21] mm: make early_pfn_to_nid() and related defintions close to each other

2020-04-20 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The early_pfn_to_nid() and it's helper __early_pfn_to_nid() are spread
> around include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c.
> 
> Drop unused stub for __early_pfn_to_nid() and move its actual generic
> implementation close to its users.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  include/linux/mm.h |  4 ++--
>  include/linux/mmzone.h |  9 
>  mm/page_alloc.c| 51 +-
>  3 files changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..a404026d14d4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2388,9 +2388,9 @@ extern void 
> sparse_memory_present_with_active_regions(int nid);
>  
>  #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
>  !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
> -static inline int __early_pfn_to_nid(unsigned long pfn,
> - struct mminit_pfnnid_cache *state)
> +static inline int early_pfn_to_nid(unsigned long pfn)
>  {
> + BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
>   return 0;
>  }

It would be better to drop __early_pfn_to_nid() here in a separate patch.

>  #else
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1b9de7d220fb..7b5b6eba402f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1078,15 +1078,6 @@ static inline struct zoneref 
> *first_zones_zonelist(struct zonelist *zonelist,
>  #include 
>  #endif
>  
> -#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
> - !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> -static inline unsigned long early_pfn_to_nid(unsigned long pfn)
> -{
> - BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
> - return 0;
> -}
> -#endif
> -
>  #ifdef CONFIG_FLATMEM
>  #define pfn_to_nid(pfn)  (0)
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d012eda1694..1ac775bfc9cf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1504,6 +1504,31 @@ void __free_pages_core(struct page *page, unsigned int 
> order)

#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)

This is the enclosing (outer) #ifdef scope.
>  
>  static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
>  
> +#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID

Moving __early_pfn_to_nid() here makes the outer #ifdef scope look a
little weird, but there seems to be no better way to arrange it.

Otherwise, this patch looks good to me.

Reviewed-by: Baoquan He 

> +
> +/*
> + * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
> + */
> +int __meminit __early_pfn_to_nid(unsigned long pfn,
> + struct mminit_pfnnid_cache *state)
> +{
> + unsigned long start_pfn, end_pfn;
> + int nid;
> +
> + if (state->last_start <= pfn && pfn < state->last_end)
> + return state->last_nid;
> +
> + nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
> + if (nid != NUMA_NO_NODE) {
> + state->last_start = start_pfn;
> + state->last_end = end_pfn;
> + state->last_nid = nid;
> + }
> +
> + return nid;
> +}
> +#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
> +
>  int __meminit early_pfn_to_nid(unsigned long pfn)
>  {
>   static DEFINE_SPINLOCK(early_pfn_lock);
> @@ -6298,32 +6323,6 @@ void __meminit init_currently_empty_zone(struct zone 
> *zone,
>   zone->initialized = 1;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> -#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
> -
> -/*
> - * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
> - */
> -int __meminit __early_pfn_to_nid(unsigned long pfn,
> - struct mminit_pfnnid_cache *state)
> -{
> - unsigned long start_pfn, end_pfn;
> - int nid;
> -
> - if (state->last_start <= pfn && pfn < state->last_end)
> - return state->last_nid;
> -
> - nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
> - if (nid != NUMA_NO_NODE) {
> - state->last_start = start_pfn;
> - state->last_end = end_pfn;
> - state->last_nid = nid;
> - }
> -
> - return nid;
> -}
> -#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
> -
>  /**
>   * free_bootmem_with_active_regions - Call memblock_free_early_nid for each 
> active range
>   * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
> -- 
> 2.25.1
>

Re: [PATCH 03/21] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option

2020-04-20 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization
> of nodes and zones structures between the systems that have region to node
> mapping in memblock and those that don't.
> 
> Currently all the NUMA architectures enable this option and for the
> non-NUMA systems we can presume that all the memory belongs to node 0 and
> therefore the compile time configuration option is not required.
> 
> The remaining few architectures that use DISCONTIGMEM without NUMA are
> easily updated to use memblock_add_node() instead of memblock_add() and
> thus have proper correspondence of memblock regions to NUMA nodes.
> 
> Still, free_area_init_node() must have a backward compatible version
> because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
> different. Once all the architectures will use the new semantics, the
> entire compatibility layer can be dropped.
> 
> To avoid addition of extra run time memory to store node id for
> architectures that keep memblock but have only a single node, the node id
> field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
> the corresponding accessors presume that in those cases it is always 0.
> 
> Signed-off-by: Mike Rapoport 
> ---
...
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 6bc37a731d27..45abfc54da37 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -50,7 +50,7 @@ struct memblock_region {
>   phys_addr_t base;
>   phys_addr_t size;
>   enum memblock_flags flags;
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> +#ifdef CONFIG_NEED_MULTIPLE_NODES
>   int nid;
>  #endif
>  };
> @@ -215,7 +215,6 @@ static inline bool memblock_is_nomap(struct 
> memblock_region *m)
>   return m->flags & MEMBLOCK_NOMAP;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>   unsigned long  *end_pfn);
>  void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> @@ -234,7 +233,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned 
> long *out_start_pfn,
>  #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid) \
>   for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
>    i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> @@ -310,10 +308,10 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone 
> *zone,
>   for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
>  nid, flags, p_start, p_end, p_nid)
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_set_node(phys_addr_t base, phys_addr_t size,
> struct memblock_type *type, int nid);
>  
> +#ifdef CONFIG_NEED_MULTIPLE_NODES
>  static inline void memblock_set_region_node(struct memblock_region *r, int 
> nid)
>  {
>   r->nid = nid;
> @@ -332,7 +330,7 @@ static inline int memblock_get_region_node(const struct 
> memblock_region *r)
>  {
>   return 0;
>  }
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +#endif /* CONFIG_NEED_MULTIPLE_NODES */
>  
>  /* Flags for memblock allocation APIs */
>  #define MEMBLOCK_ALLOC_ANYWHERE  (~(phys_addr_t)0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a404026d14d4..5903bbbdb336 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2344,9 +2344,8 @@ static inline unsigned long get_num_physpages(void)
>   return phys_pages;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  /*
> - * With CONFIG_HAVE_MEMBLOCK_NODE_MAP set, an architecture may initialise its
> + * Using memblock node mappings, an architecture may initialise its
>   * zones, allocate the backing mem_map and account for memory holes in a more
>   * architecture independent manner. This is a substitute for creating the
>   * zone_sizes[] and zholes_size[] arrays and passing them to
> @@ -2367,9 +2366,6 @@ static inline unsigned long get_num_physpages(void)
>   * registered physical page range.  Similarly
>   * sparse_memory_present_with_active_regions() calls memory_present() for
>   * each range when SPARSEMEM is enabled.
> - *
> - * See mm/page_alloc.c for more information on each function exposed by
> - * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
>   */
>  extern void free_area_init_nodes(unsigned long *max_zone_pfn);
>  unsigned long node_map_pfn_alignment(void);
> @@ -2384,13 +2380,9 @@ extern void free_bootmem_with_active_regions(int nid,
>   unsigned long max_low_pfn);
>  extern void sparse_memory_present_with_active_regions(int nid);
>  
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> -
> -#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
> -!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
>

Re: [PATCH 02/21] mm: make early_pfn_to_nid() and related defintions close to each other

2020-04-21 Thread Baoquan He

On 04/21/20 at 11:49am, Mike Rapoport wrote:
> On Tue, Apr 21, 2020 at 10:24:35AM +0800, Baoquan He wrote:
> > On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> > > From: Mike Rapoport 
> > > 
> > > The early_pfn_to_nid() and it's helper __early_pfn_to_nid() are spread
> > > around include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c.
> > > 
> > > Drop unused stub for __early_pfn_to_nid() and move its actual generic
> > > implementation close to its users.
> > > 
> > > Signed-off-by: Mike Rapoport 
> > > ---
> > >  include/linux/mm.h |  4 ++--
> > >  include/linux/mmzone.h |  9 
> > >  mm/page_alloc.c| 51 +-
> > >  3 files changed, 27 insertions(+), 37 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 5a323422d783..a404026d14d4 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2388,9 +2388,9 @@ extern void 
> > > sparse_memory_present_with_active_regions(int nid);
> > >  
> > >  #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
> > >  !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
> > > -static inline int __early_pfn_to_nid(unsigned long pfn,
> > > - struct mminit_pfnnid_cache *state)
> > > +static inline int early_pfn_to_nid(unsigned long pfn)
> > >  {
> > > + BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA));
> > >   return 0;
> > >  }
> > 
> > It's better to make a separate patch to drop __early_pfn_to_nid() here.
> 
> Not sure it's really worth it.
> This patch anyway only moves the code around without any actual changes.

OK, that's fine with me.

Re: [PATCH 03/21] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option

2020-04-21 Thread Baoquan He

On 04/21/20 at 12:09pm, Mike Rapoport wrote:
> > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > index fc0aad0bc1f5..e67dc501576a 100644
> > > --- a/mm/memory_hotplug.c
> > > +++ b/mm/memory_hotplug.c
> > > @@ -1372,11 +1372,7 @@ check_pages_isolated_cb(unsigned long start_pfn, 
> > > unsigned long nr_pages,
> > >  
> > >  static int __init cmdline_parse_movable_node(char *p)
> > >  {
> > > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > >   movable_node_enabled = true;
> > > -#else
> > > - pr_warn("movable_node parameter depends on 
> > > CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
> > > -#endif
> > 
> > Wondering if this change will impact anything. Before, those ARCHes with
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP support movable_node. With this patch
> > applied, those ARCHes which don't support CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > can also have 'movable_node' specified in kernel cmdline.
> > 
> > >   return 0;
> > >  }
> > >  early_param("movable_node", cmdline_parse_movable_node);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 1ac775bfc9cf..4530e9cfd9f7 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -335,7 +335,6 @@ static unsigned long nr_kernel_pages __initdata;
> > >  static unsigned long nr_all_pages __initdata;
> > >  static unsigned long dma_reserve __initdata;
> > >  
> > > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > >  static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] 
> > > __initdata;
> > >  static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] 
> > > __initdata;
> > >  static unsigned long required_kernelcore __initdata;
> > 
> > Does it mean those ARCHes which don't support
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP before, will have 'kernelcore=' and
> > 'movablecore=' now, and will have MOVABLE zone?
> 
> I hesitated a lot about whether to hide the kernelcore/movablecore and
> related code behind an #ifdef.
> In the end I've decided to keep the code compiled unconditionally as it
> is anyway __init and no sane person would pass "kernelcore=" to the
> kernel on a UMA system.

I see. Then maybe we can do something if someone complains about it
in the future, e.g. print a warning message in
cmdline_parse_movable_node() and cmdline_parse_kernelcore().

> 
> > > @@ -348,7 +347,6 @@ static bool mirrored_kernelcore __meminitdata;
> > >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from 
> > > */
> > >  int movable_zone;
> > >  EXPORT_SYMBOL(movable_zone);
> > > -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > >  
> > >  #if MAX_NUMNODES > 1
> > >  unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
> > > @@ -1499,8 +1497,7 @@ void __free_pages_core(struct page *page, unsigned 
> > > int order)
> > >   __free_pages(page, order);
> > >  }
> > >  
> > > -#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
> > > - defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> > > +#ifdef CONFIG_NEED_MULTIPLE_NODES
> > >  
> > >  static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
> > >  
> > > @@ -1542,7 +1539,7 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
> > >  
> > >   return nid;
> > >  }
> > > -#endif
> > > +#endif /* CONFIG_NEED_MULTIPLE_NODES */
> > >  
> > >  #ifdef CONFIG_NODES_SPAN_OTHER_NODES
> > >  /* Only safe to use early in boot when initialisation is single-threaded 
> > > */
> > > @@ -5924,7 +5921,6 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> > >  static bool __meminit
> > >  overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> > >  {
> > > -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > >   static struct memblock_region *r;
> > >  
> > >   if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
> > > @@ -5940,7 +5936,6 @@ overlap_memmap_init(unsigned long zone, unsigned 
> > > long *pfn)
> > >   return true;
> > >   }
> > >   }
> > > -#endif
> > >   return false;
> > >  }
> > >  
> > > @@ -6573,8 +6568,7 @@ static unsigned long __init 
> > > zone_absent_pages_in_node(int nid,
> > >   return nr_absent;
> > >  }
> > >  
> > > -#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > > -static inline unsigned long __init zone_spanned_pages_in_node(int nid,
> > > +static inline unsigned long __init compat_zone_spanned_pages_in_node(int 
> > > nid,
> > 
> > Is it compact zone which has continuous memory region, and the
> > compat here is typo? Or it's compatible zone? The name seems a little
> > confusing, or I miss something.
> 
> It's 'compat' from 'compatibility'. This is kinda "the old way" and the
> version that was defined when CONFIG_HAVE_MEMBLOCK_NODE_MAP=y is the
> "new way", so I picked 'compat' for backwards compatibility. 
> > Anyway, it will go away later in patch 19.

Got it, thanks for explaining.

> 
> > >   unsigned long zone_type,
> > >   unsigned long node_start_pfn,
> > >   unsigned long node_end_pfn,
> > > @@ -6593,7 +6587,7 @@ static inline unsigned long __init 
> > >

Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2020-04-09 Thread Baoquan He

On 04/02/20 at 10:01am, Michal Hocko wrote:
> On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> > Hi,
> > 
> > On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> [...]
> > > From above information, we can remove HAVE_MEMBLOCK_NODE_MAP, and
> > > replace it with CONFIG_NUMA. That sounds more sensible to store nid into
> > > memblock when NUMA support is enabled.
> >  
> > Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> > this will not help cleaning up the whole node/zone initialization mess and
> > we'll be stuck with two implementations.
> 
> Yeah, this is far from optimal.
> 
> > The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only for init time as
> > most architectures will anyway discard the entire memblock, so having it in
> > a UMA arch won't be a problem. The only exception is arm that uses
> > memblock for pfn_valid(), here we may also think about a solution to
> > compensate the addition of nid to the memblock structures. 
> 
> Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
> memblock_get_region_node would then unconditionally return 0 on UMA.
> Essentially the same way we do NUMA for other MM code. I only see few
> direct usage of region->nid.

I checked the code again. It seems HAVE_MEMBLOCK_NODE_MAP is selected
directly in all ARCHes which support it. That means HAVE_MEMBLOCK_NODE_MAP
is enabled by default on those ARCHes and has no dependency on CONFIG_NUMA
at all. E.g. on x86, free_area_init_nodes() is simply called in the generic
code path, while free_area_init_nodes() is defined inside the
CONFIG_HAVE_MEMBLOCK_NODE_MAP ifdeffery scope. So I tend to agree with Mike
that we should first remove HAVE_MEMBLOCK_NODE_MAP on all ARCHes. After
HAVE_MEMBLOCK_NODE_MAP is removed, we can check whether it's worth
defining memblock_region->nid only for the CONFIG_NUMA case.

config X86
	def_bool y
	...
	select HAVE_MEMBLOCK_NODE_MAP
	...
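
For illustration only, Michal's suggestion quoted above (define memblock_region->nid only for NUMA and let the accessor return 0 on UMA) could look roughly like the user-space sketch below. SKETCH_NUMA is a hypothetical macro standing in for CONFIG_NUMA, and the struct is trimmed to the fields needed here; this is not the kernel's actual header.

```c
#include <stdint.h>

/*
 * SKETCH_NUMA is a hypothetical stand-in for CONFIG_NUMA.
 * Build with -DSKETCH_NUMA to get the NUMA variant.
 */
typedef uint64_t phys_addr_t;

/* Trimmed-down region descriptor; the real one has more fields. */
struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
#ifdef SKETCH_NUMA
	int nid;		/* only stored when NUMA is enabled */
#endif
};

static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
#ifdef SKETCH_NUMA
	r->nid = nid;
#else
	/* UMA: nothing to store */
	(void)r;
	(void)nid;
#endif
}

static inline int memblock_get_region_node(const struct memblock_region *r)
{
#ifdef SKETCH_NUMA
	return r->nid;
#else
	/* UMA: every region belongs to node 0 */
	(void)r;
	return 0;
#endif
}
```

With this shape, callers never need their own #ifdef: on a UMA build, setting a node id is a no-op and reading it always yields 0.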

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-04-03 Thread Baoquan He

On 04/02/20 at 09:46pm, Hoan Tran wrote:
> Hi All,
> 
> On 3/31/20 7:31 AM, Baoquan He wrote:
> > On 03/31/20 at 04:21pm, Michal Hocko wrote:
> > > On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > > > Hi Michal,
> > > > 
> > > > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > > > Maybe I mis-read the code, but I don't see how this could happen. 
> > > > > > In the
> > > > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > > > calculate_node_totalpages() that ensures that node->node_zones are 
> > > > > > entirely
> > > > > > within the node because this is checked in 
> > > > > > zone_spanned_pages_in_node().
> > > > > 
> > > > > > zone_spanned_pages_in_node does check the zone boundaries are within 
> > > > > the
> > > > > node boundaries. But that doesn't really tell anything about other
> > > > > potential zones interleaving with the physical memory range.
> > > > > zone->spanned_pages simply gives the physical range for the zone
> > > > > including holes. Interleaving nodes are essentially a hole
> > > > > (__absent_pages_in_range is going to skip those).
> > > > > 
> > > > > That means that when free_area_init_core simply goes over the whole
> > > > > physical zone range including holes and that is why we need to check
> > > > > both for physical and logical holes (aka other nodes).
> > > > > 
> > > > > The life would be so much easier if the whole thing would simply 
> > > > > iterate
> > > > > over memblocks...
> > > > 
> > > > The memblock iterating sounds a great idea. I tried with putting the
> > > > memblock iterating in the upper layer, memmap_init(), which is used for
> > > > > boot mem only anyway. Do you think it's doable and OK? If yes, I can
> > > > work out a formal patch to make this simpler as you said. The draft code
> > > > is as below. Like this it uses the existing code and involves little 
> > > > change.
> > > 
> > > Doing this would be a step in the right direction! I haven't checked the
> > > code very closely though. The below sounds way too simple to be truth I
> > > am afraid. First for_each_mem_pfn_range is available only for
> > > CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> > > saying that I really hate that being conditional). Also I haven't really
> > > checked the deferred initialization path - I have a very vague
> > > recollection that it has been converted to the memblock api but I have
> > > > happily dropped all that memory.
> > 
> > Thanks for your quick response and pointing out the rest suspect aspects,
> > I will investigate what you mentioned, see if they impact.
> 
> I would like to check if we still move on with my patch to remove
> CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top it?

I think we would like to replace CONFIG_NODES_SPAN_OTHER_NODES with
CONFIG_NUMA, and just let UMA return 0 as the node id, as Michal replied in
another mail. Anyway, your patches 2~5 are still needed on top of the
change for this new plan.

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-28 Thread Baoquan He

On 03/28/20 at 11:31am, Hoan Tran wrote:
> In NUMA layout which nodes have memory ranges that span across other nodes,
> the mm driver can detect the memory node id incorrectly.
> 
> For example, with layout below
> Node 0 address:    
> Node 1 address:    

Sorry, I read this example several times, but still don't get what it
means. Can it be given with real hex addresses as an example? I
mean just using a memory layout you have seen on some system. The
change looks interesting though.

> 
> Note:
>  - Memory from low to high
>  - 0/1: Node id
>  - x: Invalid memory of a node
> 
> When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> config, mm only checks the memory validity but not the node id.
> Because of that, Node 1 also detects the memory from node 0 as below
> when it scans from the start address to the end address of node 1.
> 
> Node 0 address:    
> Node 1 address:    
> 
> This layout could occur on any architecture. Most of them enables
> this config by default with CONFIG_NUMA. This patch, by default, enables
> CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> 
> v3:
>  * Revise the patch description
> 
> V2:
>  * Revise the patch description
> 
> Hoan Tran (5):
>   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
>   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> 
>  arch/powerpc/Kconfig | 9 -
>  arch/s390/Kconfig| 8 
>  arch/sparc/Kconfig   | 9 -
>  arch/x86/Kconfig | 9 -
>  mm/page_alloc.c  | 2 +-
>  5 files changed, 1 insertion(+), 36 deletions(-)
> 
> -- 
> 1.8.3.1
> 
>

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-31 Thread Baoquan He

On 03/31/20 at 04:21pm, Michal Hocko wrote:
> On Tue 31-03-20 22:03:32, Baoquan He wrote:
> > Hi Michal,
> > 
> > On 03/31/20 at 10:55am, Michal Hocko wrote:
> > > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > > Maybe I mis-read the code, but I don't see how this could happen. In the
> > > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > > calculate_node_totalpages() that ensures that node->node_zones are 
> > > > entirely
> > > > within the node because this is checked in zone_spanned_pages_in_node().
> > > 
> > > > zone_spanned_pages_in_node does check the zone boundaries are within the
> > > node boundaries. But that doesn't really tell anything about other
> > > potential zones interleaving with the physical memory range.
> > > zone->spanned_pages simply gives the physical range for the zone
> > > including holes. Interleaving nodes are essentially a hole
> > > (__absent_pages_in_range is going to skip those).
> > > 
> > > That means that when free_area_init_core simply goes over the whole
> > > physical zone range including holes and that is why we need to check
> > > both for physical and logical holes (aka other nodes).
> > > 
> > > The life would be so much easier if the whole thing would simply iterate
> > > over memblocks...
> > 
> > The memblock iterating sounds a great idea. I tried with putting the
> > memblock iterating in the upper layer, memmap_init(), which is used for
> > > boot mem only anyway. Do you think it's doable and OK? If yes, I can
> > work out a formal patch to make this simpler as you said. The draft code
> > is as below. Like this it uses the existing code and involves little change.
> 
> Doing this would be a step in the right direction! I haven't checked the
> code very closely though. The below sounds way too simple to be truth I
> am afraid. First for_each_mem_pfn_range is available only for
> CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
> saying that I really hate that being conditional). Also I haven't really
> checked the deferred initialization path - I have a very vague
> recollection that it has been converted to the memblock api but I have
> happily dropped all that memory.

Thanks for your quick response and for pointing out the remaining suspect
aspects. I will investigate what you mentioned and see whether they have
an impact.

>  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 138a56c0f48f..558d421f294b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, 
> > int nid, unsigned long zone,
> >  * function.  They do not exist on hotplugged memory.
> >  */
> > if (context == MEMMAP_EARLY) {
> > -   if (!early_pfn_valid(pfn)) {
> > -   pfn = next_pfn(pfn);
> > -   continue;
> > -   }
> > -   if (!early_pfn_in_nid(pfn, nid)) {
> > -   pfn++;
> > -   continue;
> > -   }
> > if (overlap_memmap_init(zone, &pfn))
> > continue;
> > if (defer_init(nid, pfn, end_pfn))
> > @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct 
> > zone *zone)
> >  }
> >  
> >  void __meminit __weak memmap_init(unsigned long size, int nid,
> > - unsigned long zone, unsigned long start_pfn)
> > + unsigned long zone, unsigned long 
> > range_start_pfn)
> >  {
> > -   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> > +   unsigned long start_pfn, end_pfn;
> > +   unsigned long range_end_pfn = range_start_pfn + size;
> > +   int i;
> > +   for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > +   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +   if (end_pfn > start_pfn)
> > +   memmap_init_zone(size, nid, zone, start_pfn, 
> > MEMMAP_EARLY, NULL);
> > +   }
> >  }
> >  
> >  static int zone_batchsize(struct zone *zone)
> 
> -- 
> Michal Hocko
> SUSE Labs
>

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-31 Thread Baoquan He

Hi Michal,

On 03/31/20 at 10:55am, Michal Hocko wrote:
> On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > Maybe I mis-read the code, but I don't see how this could happen. In the
> > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > calculate_node_totalpages() that ensures that node->node_zones are entirely
> > within the node because this is checked in zone_spanned_pages_in_node().
> 
> > zone_spanned_pages_in_node does check the zone boundaries are within the
> node boundaries. But that doesn't really tell anything about other
> potential zones interleaving with the physical memory range.
> zone->spanned_pages simply gives the physical range for the zone
> including holes. Interleaving nodes are essentially a hole
> (__absent_pages_in_range is going to skip those).
> 
> That means that when free_area_init_core simply goes over the whole
> physical zone range including holes and that is why we need to check
> both for physical and logical holes (aka other nodes).
> 
> The life would be so much easier if the whole thing would simply iterate
> over memblocks...

Iterating over memblocks sounds like a great idea. I tried putting the
memblock iteration in the upper layer, memmap_init(), which is used for
boot memory only anyway. Do you think it's doable and OK? If yes, I can
work out a formal patch to make this simpler as you said. The draft code
is below. This way it uses the existing code and involves little change.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 138a56c0f48f..558d421f294b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * function.  They do not exist on hotplugged memory.
 		 */
 		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn)) {
-				pfn = next_pfn(pfn);
-				continue;
-			}
-			if (!early_pfn_in_nid(pfn, nid)) {
-				pfn++;
-				continue;
-			}
 			if (overlap_memmap_init(zone, &pfn))
 				continue;
 			if (defer_init(nid, pfn, end_pfn))
@@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 }
 
 void __meminit __weak memmap_init(unsigned long size, int nid,
-				  unsigned long zone, unsigned long start_pfn)
+				  unsigned long zone, unsigned long range_start_pfn)
 {
-	memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+	unsigned long start_pfn, end_pfn;
+	unsigned long range_end_pfn = range_start_pfn + size;
+	int i;
+	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+		if (end_pfn > start_pfn)
+			memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+	}
 }
 
 static int zone_batchsize(struct zone *zone)
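
To make the clamping idea in the draft easier to follow outside the kernel, here is a minimal user-space sketch. The names pfn_range, init_range_fn, clamp_pfn() and memmap_init_sketch() are hypothetical stand-ins, not kernel APIs. One assumption differs from the draft: the sketch passes the length of each clamped sub-range to the callback instead of the full zone size, which is presumably what a formal patch would want.

```c
#include <stddef.h>

/* Hypothetical stand-in for one memblock.memory entry, in PFN units. */
struct pfn_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	int nid;
};

/* Called once per clamped sub-range; stands in for memmap_init_zone(). */
typedef void (*init_range_fn)(unsigned long start_pfn, unsigned long nr_pages);

static unsigned long clamp_pfn(unsigned long v, unsigned long lo,
			       unsigned long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Walk the memory ranges of node @nid and initialise only the parts that
 * fall inside the zone [range_start_pfn, range_start_pfn + size).
 * Holes and memory of other nodes are never visited, so no per-PFN
 * validity or node-id check is needed. Returns the number of pages covered.
 */
static unsigned long memmap_init_sketch(const struct pfn_range *ranges,
					size_t nr_ranges, int nid,
					unsigned long range_start_pfn,
					unsigned long size,
					init_range_fn init)
{
	unsigned long range_end_pfn = range_start_pfn + size;
	unsigned long total = 0;

	for (size_t i = 0; i < nr_ranges; i++) {
		unsigned long start_pfn, end_pfn;

		if (ranges[i].nid != nid)
			continue;

		start_pfn = clamp_pfn(ranges[i].start_pfn,
				      range_start_pfn, range_end_pfn);
		end_pfn = clamp_pfn(ranges[i].end_pfn,
				    range_start_pfn, range_end_pfn);

		/* A range entirely outside the zone collapses to empty. */
		if (end_pfn > start_pfn) {
			if (init)
				init(start_pfn, end_pfn - start_pfn);
			total += end_pfn - start_pfn;
		}
	}
	return total;
}
```

With two nodes interleaved in 100-page blocks and a zone spanning PFNs 50 to 350, each node gets exactly its own 150 pages, without checking any individual PFN.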

Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2020-03-31 Thread Baoquan He

On 04/01/20 at 12:56am, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> > 
> > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> > really love to see that thing go away. It is causing problems when
> > people try to use memblock api.
> 
> Well, it's a small patch in the end :)
> 
> Currently all NUMA architectures enable
> CONFIG_HAVE_MEMBLOCK_NODE_MAP and use free_area_init_nodes() to initialize
> nodes and zones structures.

I did some investigation: there are nine ARCHes that have a NUMA config.
Among them, alpha doesn't have HAVE_MEMBLOCK_NODE_MAP support. The
interesting thing is that there are two ARCHes, microblaze and riscv,
which have the HAVE_MEMBLOCK_NODE_MAP config but no NUMA config.
Apparently, adding the HAVE_MEMBLOCK_NODE_MAP config to riscv and
microblaze was not carefully considered.

arch/alpha/Kconfig:config NUMA
arch/arm64/Kconfig:config NUMA
arch/ia64/Kconfig:config NUMA
arch/mips/Kconfig:config NUMA
arch/powerpc/Kconfig:config NUMA
arch/s390/Kconfig:config NUMA
arch/sh/mm/Kconfig:config NUMA
arch/sparc/Kconfig:config NUMA
arch/x86/Kconfig:config NUMA

From the above information, we can remove HAVE_MEMBLOCK_NODE_MAP and
replace it with CONFIG_NUMA. It sounds more sensible to store the nid in
memblock only when NUMA support is enabled.


> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 079d17d96410..9de81112447e 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -50,9 +50,7 @@ struct memblock_region {
>   phys_addr_t base;
>   phys_addr_t size;
>   enum memblock_flags flags;
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>   int nid;
> -#endif

I didn't look into the other changes very carefully, but enabling the
memblock node map for all ARCHes feels a little radical. After all, many
ARCHes don't even have NUMA support.

>  };
>  
>  /**
> @@ -215,7 +213,6 @@ static inline bool memblock_is_nomap(struct 
> memblock_region *m)
>   return m->flags & MEMBLOCK_NOMAP;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>   unsigned long  *end_pfn);
>  void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> @@ -234,7 +231,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned 
> long *out_start_pfn,
>  #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
>  	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
>  	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> @@ -310,7 +306,6 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone 
> *zone,
>  	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
>  			       nid, flags, p_start, p_end, p_nid)
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_set_node(phys_addr_t base, phys_addr_t size,
> struct memblock_type *type, int nid);
>  
> @@ -323,16 +318,6 @@ static inline int memblock_get_region_node(const struct 
> memblock_region *r)
>  {
>   return r->nid;
>  }
> -#else
> -static inline void memblock_set_region_node(struct memblock_region *r, int 
> nid)
> -{
> -}
> -
> -static inline int memblock_get_region_node(const struct memblock_region *r)
> -{
> - return 0;
> -}
> -#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  /* Flags for memblock allocation APIs */
>  #define MEMBLOCK_ALLOC_ANYWHERE  (~(phys_addr_t)0)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c54fb96cb1e6..368a45d4696a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2125,9 +2125,8 @@ static inline unsigned long get_num_physpages(void)
>   return phys_pages;
>  }
>  
> -#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  /*
> - * With CONFIG_HAVE_MEMBLOCK_NODE_MAP set, an architecture may initialise its
> + * Using memblock node mappings, an architecture may initialise its
>   * zones, allocate the backing mem_map and account for memory holes in a more
>   * architecture independent manner. This is a substitute for creating the
>   * zone_sizes[] and zholes_size[] arrays and passing them to
> @@ -2148,9 +2147,6 @@ static inline unsigned long get_num_physpages(void)
>   * registered physical page range.  Similarly
>   * sparse_memory_present_with_active_regions() calls memory_present() for
>   * each range when SPARSEMEM is enabled.
> - *
> - * See mm/page_alloc.c for more information on each function exposed by
> - * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
>   */
>  extern void free_area_init_nodes(unsigned long *max_zone_pfn);
>  unsigned long node_map_pfn_alignment(void);
> @@ -2165,22 +2161,12 @@ extern void free_bootmem_with_active_regions(int nid,
>   unsigned long max_low_pfn);

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Baoquan He

On 03/30/20 at 09:42am, Michal Hocko wrote:
> On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > In NUMA layout which nodes have memory ranges that span across other nodes,
> > the mm driver can detect the memory node id incorrectly.
> > 
> > For example, with layout below
> > Node 0 address:    
> > Node 1 address:    
> > 
> > Note:
> >  - Memory from low to high
> >  - 0/1: Node id
> >  - x: Invalid memory of a node
> > 
> > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > config, mm only checks the memory validity but not the node id.
> > Because of that, Node 1 also detects the memory from node 0 as below
> > when it scans from the start address to the end address of node 1.
> > 
> > Node 0 address:    
> > Node 1 address:    
> > 
> > This layout could occur on any architecture. Most of them enables
> > this config by default with CONFIG_NUMA. This patch, by default, enables
> > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> 
> I am not opposed to this at all. It reduces the config space and that is
> a good thing on its own. The history has shown that memory layout might
> be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> which is clearly an overkill.
> 
> Your description doesn't really explain why this is safe though. The
> history of this config is somehow messy, though. Mike has tried
> to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> reasoning whatsoever. This doesn't make it really easy to see whether
> reasons for reintroduction are still there. Maybe there are some subtle
> dependencies. I do not see any TBH but that might be buried deep in an
> arch specific code.

Yeah, since early_pfnnid_cache was added, we do not need to worry about the
performance. But when I read the mem init code on x86 again, I do see there
is code to handle node overlapping, e.g. in numa_cleanup_meminfo(),
when storing the node id into memblock. But the thing is, if we
encounter node overlapping, we just return ahead of time and leave
something uninitialized. I am wondering if a system with node
overlapping can still run healthily.
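
As a rough illustration of why the cache removes the performance concern, the sketch below models the lookup in user space: a binary search over sorted ranges, fronted by a one-entry cache of the last range hit. The names node_range, pfnnid_cache, search_pfn_nid() and early_pfn_to_nid_sketch() are hypothetical, and the nr_searches counter exists only to make the effect observable; the kernel's real code differs in detail.

```c
#include <stddef.h>

/* Hypothetical memory range in PFN units; array must be sorted by start. */
struct node_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	int nid;
};

/* One-entry cache of the last range that was looked up successfully. */
struct pfnnid_cache {
	unsigned long last_start;
	unsigned long last_end;
	int last_nid;
};

/* O(log N) binary search; returns the node id, or -1 for a hole. */
static int search_pfn_nid(const struct node_range *r, size_t nr,
			  unsigned long pfn,
			  unsigned long *start, unsigned long *end)
{
	size_t left = 0, right = nr;

	while (left < right) {
		size_t mid = left + (right - left) / 2;

		if (pfn < r[mid].start_pfn) {
			right = mid;
		} else if (pfn >= r[mid].end_pfn) {
			left = mid + 1;
		} else {
			*start = r[mid].start_pfn;
			*end = r[mid].end_pfn;
			return r[mid].nid;
		}
	}
	return -1;
}

/*
 * Consecutive PFNs usually fall in the same range, so most calls are
 * answered from the cache and the search runs only on a range change.
 * A zero-initialised cache is empty because no pfn satisfies pfn < 0.
 */
static int early_pfn_to_nid_sketch(struct pfnnid_cache *c,
				   const struct node_range *r, size_t nr,
				   unsigned long pfn,
				   unsigned int *nr_searches)
{
	unsigned long start, end;
	int nid;

	if (c->last_start <= pfn && pfn < c->last_end)
		return c->last_nid;

	if (nr_searches)
		(*nr_searches)++;

	nid = search_pfn_nid(r, nr, pfn, &start, &end);
	if (nid >= 0) {
		c->last_start = start;
		c->last_end = end;
		c->last_nid = nid;
	}
	return nid;
}
```

When a node's PFNs are scanned in order, the search runs only when the scan crosses into a different range, so the per-PFN cost is essentially the two comparisons of the cache check.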

> 
> > v3:
> >  * Revise the patch description
> > 
> > V2:
> >  * Revise the patch description
> > 
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> > 
> >  arch/powerpc/Kconfig | 9 -
> >  arch/s390/Kconfig| 8 
> >  arch/sparc/Kconfig   | 9 -
> >  arch/x86/Kconfig | 9 -
> >  mm/page_alloc.c  | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
>

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Baoquan He

On 03/30/20 at 04:16pm, Baoquan He wrote:
> On 03/30/20 at 09:42am, Michal Hocko wrote:
> > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > In NUMA layout which nodes have memory ranges that span across other 
> > > nodes,
> > > the mm driver can detect the memory node id incorrectly.
> > > 
> > > For example, with layout below
> > > Node 0 address:    
> > > Node 1 address:    
> > > 
> > > Note:
> > >  - Memory from low to high
> > >  - 0/1: Node id
> > >  - x: Invalid memory of a node
> > > 
> > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > config, mm only checks the memory validity but not the node id.
> > > Because of that, Node 1 also detects the memory from node 0 as below
> > > when it scans from the start address to the end address of node 1.
> > > 
> > > Node 0 address:    
> > > Node 1 address:    
> > > 
> > > This layout could occur on any architecture. Most of them enables
> > > this config by default with CONFIG_NUMA. This patch, by default, enables
> > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > 
> > I am not opposed to this at all. It reduces the config space and that is
> > > a good thing on its own. The history has shown that memory layout might
> > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > which is clearly an overkill.
> > 
> > Your description doesn't really explain why this is safe though. The
> > history of this config is somehow messy, though. Mike has tried
> > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > reasoning whatsoever. This doesn't make it really easy to see whether
> > > reasons for reintroduction are still there. Maybe there are some subtle
> > > dependencies. I do not see any TBH but that might be buried deep in an
> > arch specific code.
> 
> Yeah, since early_pfnnid_cache was added, we do not need worry about the
> performance. But when I read the mem init code on x86 again, I do see there
> are codes to handle the node overlapping, e.g in numa_cleanup_meminfo(),
> when store node id into memblock. But the thing is if we have
> encountered the node overlapping, we just return ahead of time, leave
> something uninitialized. I am wondering if the system with node
> overlapping can still run heathily.

OK, I didn't read the code carefully. That is handling the case where
memblocks with different node ids overlap, and there it needs to return.
In the example Hoan gave, there is no problem and the system can run well.
Please ignore the above comment.
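
For readers who want the interleaved case spelled out, here is a small user-space sketch with a purely hypothetical layout (not taken from any real system). It scans a node's whole span, as the early memmap init does, and counts the pages attributed to that node with and without a node-id check. The names mem_range, pfn_owner() and scan_node() are invented for this illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical memory range in PFN units. */
struct mem_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	int nid;
};

/* Return the node owning @pfn, or -1 if the pfn is a hole. */
static int pfn_owner(const struct mem_range *r, size_t nr, unsigned long pfn)
{
	for (size_t i = 0; i < nr; i++)
		if (pfn >= r[i].start_pfn && pfn < r[i].end_pfn)
			return r[i].nid;
	return -1;
}

/*
 * Scan the span of node @nid (lowest start to highest end of its ranges)
 * and count the pages it would claim. With @check_nid == 0 only validity
 * is tested, so memory of other nodes inside the span is claimed as well.
 */
static unsigned long scan_node(const struct mem_range *r, size_t nr,
			       int nid, int check_nid)
{
	unsigned long lo = ULONG_MAX, hi = 0, count = 0;

	for (size_t i = 0; i < nr; i++) {
		if (r[i].nid != nid)
			continue;
		if (r[i].start_pfn < lo)
			lo = r[i].start_pfn;
		if (r[i].end_pfn > hi)
			hi = r[i].end_pfn;
	}

	/* If the node has no ranges, lo > hi and the loop does not run. */
	for (unsigned long pfn = lo; pfn < hi; pfn++) {
		int owner = pfn_owner(r, nr, pfn);

		if (owner < 0)
			continue;		/* physical hole */
		if (check_nid && owner != nid)
			continue;		/* belongs to another node */
		count++;
	}
	return count;
}
```

Take node 0 owning PFNs 0-3 and 8-11 and node 1 owning PFNs 4-7 and 12-15. Scanning node 1's span (4 to 16) without the node-id check claims 12 pages, 4 of which belong to node 0; with the check it claims the correct 8.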

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Baoquan He

On 03/30/20 at 01:26pm, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 11:58:43AM +0200, Michal Hocko wrote:
> > On Mon 30-03-20 12:21:27, Mike Rapoport wrote:
> > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > > In NUMA layout which nodes have memory ranges that span across other 
> > > > > nodes,
> > > > > the mm driver can detect the memory node id incorrectly.
> > > > > 
> > > > > For example, with layout below
> > > > > Node 0 address:    
> > > > > Node 1 address:    
> > > > > 
> > > > > Note:
> > > > >  - Memory from low to high
> > > > >  - 0/1: Node id
> > > > >  - x: Invalid memory of a node
> > > > > 
> > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > > > config, mm only checks the memory validity but not the node id.
> > > > > Because of that, Node 1 also detects the memory from node 0 as below
> > > > > when it scans from the start address to the end address of node 1.
> > > > > 
> > > > > Node 0 address:    
> > > > > Node 1 address:    
> > > > > 
> > > > > This layout could occur on any architecture. Most of them enables
> > > > > this config by default with CONFIG_NUMA. This patch, by default, 
> > > > > enables
> > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > > > 
> > > > I am not opposed to this at all. It reduces the config space and that is
> > > > a good thing on its own. The history has shown that meory layout might
> > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > > > which is clearly an overkill.
> > > > 
> > > > Your description doesn't really explain why this is safe though. The
> > > > history of this config is somehow messy, though. Mike has tried
> > > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > > reasoning what so ever. This doesn't make it really easy see whether
> > > > reasons for reintroduction are still there. Maybe there are some subtle
> > > > dependencies. I do not see any TBH but that might be burried deep in an
> > > > arch specific code.
> > > 
> > > Well, back then early_pfn_in_nid() was arch-dependant, today everyone
> > > except ia64 rely on HAVE_MEMBLOCK_NODE_MAP.
> > 
> > What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
> > really love to see that thing go away. It is causing problems when
> > people try to use memblock api.
> 
> Sorry, my bad, ia64 does not have NODES_SPAN_OTHER_NODES, but it does have
> HAVE_MEMBLOCK_NODE_MAP.
> 
> I remember I've tried killing HAVE_MEMBLOCK_NODE_MAP, but I've run into
> some problems and then I've got distracted. I too would like to have
> HAVE_MEMBLOCK_NODE_MAP go away, maybe I'll take another look at it.
>  
> > > So, if the memblock node map
> > > is correct, that using CONFIG_NUMA instead of 
> > > CONFIG_NODES_SPAN_OTHER_NODES
> > > would only mean that early_pfn_in_nid() will cost several cycles more on
> > > architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64
> > > and sh).
> > 
> > Do we have any idea on how much of an overhead that is? Because this is
> > per each pfn so it can accumulate a lot! 
> 
> It's O(log(N)) where N is the amount of the memory banks (ie. 
> memblock.memory.cnt)

That is the cost of a single node id lookup. But early_pfn_in_nid() is
called for each pfn, and that is the big cost, I think. Without
CONFIG_NODES_SPAN_OTHER_NODES, it can be optimized into a no-op.
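
To make the cost argument concrete, here is a minimal user-space sketch,
not kernel code. struct mem_region and the *_model() helpers are
hypothetical stand-ins for memblock.memory and early_pfn_in_nid(). Each
lookup is a binary search over the regions, but the caller pays for one
lookup per pfn in the span.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one memblock.memory region (PFN units). */
struct mem_region {
	unsigned long start_pfn;	/* first PFN in the region */
	unsigned long end_pfn;		/* one past the last PFN */
	int nid;			/* owning node id */
};

/* Binary search over regions sorted by start_pfn: O(log N) per lookup. */
static int pfn_to_nid_model(const struct mem_region *r, size_t cnt,
			    unsigned long pfn)
{
	size_t lo = 0, hi = cnt;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < r[mid].start_pfn)
			hi = mid;
		else if (pfn >= r[mid].end_pfn)
			lo = mid + 1;
		else
			return r[mid].nid;
	}
	return -1;	/* PFN falls into a hole */
}

/* Same shape as early_pfn_in_nid(): PFNs with unknown nid are accepted. */
static bool pfn_in_nid_model(const struct mem_region *r, size_t cnt,
			     unsigned long pfn, int node)
{
	int nid = pfn_to_nid_model(r, cnt, pfn);

	return nid < 0 || nid == node;
}

/* Walk every PFN of a span; *lookups counts the per-PFN calls made. */
static unsigned long count_node_pfns(const struct mem_region *r, size_t cnt,
				     unsigned long start, unsigned long end,
				     int node, unsigned long *lookups)
{
	unsigned long pfn, n = 0;

	assert(start <= end);
	*lookups = 0;
	for (pfn = start; pfn < end; pfn++) {
		(*lookups)++;
		if (pfn_in_nid_model(r, cnt, pfn, node))
			n++;
	}
	return n;
}
```

With N regions and P pfns in the span, the walk costs roughly
P * log(N) comparisons, so the per-pfn call count dominates.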

>  
> > > Agian, ia64 is an exception here.
> > 
> > Thanks for the clarification!
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Sincerely yours,
> Mike.
> 
>

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Baoquan He

On 03/30/20 at 09:44am, Michal Hocko wrote:
> On Sun 29-03-20 08:19:24, Baoquan He wrote:
> > On 03/28/20 at 11:31am, Hoan Tran wrote:
> > > In NUMA layout which nodes have memory ranges that span across other 
> > > nodes,
> > > the mm driver can detect the memory node id incorrectly.
> > > 
> > > For example, with layout below
> > > Node 0 address:    
> > > Node 1 address:    
> > 
> > Sorry, I read this example several times, but still don't get what it
> > means. Can it be given with real hex number address as an exmaple? I
> > mean just using the memory layout you have seen from some systems. The
> > change looks interesting though.
> 
> Does this make it more clear?
>physical address range and its node associaion
>  [0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1]

I read it again later and got what Hoan is trying to say, thanks.

I think the change in this patchset makes sense. I still have some
concerns, though; let me comment in the other thread.
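
For reference, Michal's interleaved layout can be modelled with a small
user-space sketch. The layout[] array and both helpers are hypothetical
illustrations, not kernel code. It shows that a node's span covers
blocks of the other node, so a scan that does not check the node id
over-counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry per equally sized memory block; the value is the owning node. */
static const int layout[] = {
	0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
	1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
};
#define NR_BLOCKS (sizeof(layout) / sizeof(layout[0]))

/* Find [*start, *end) covering every block of @nid; false if there is none. */
static bool node_span(int nid, size_t *start, size_t *end)
{
	bool found = false;
	size_t i;

	assert(start != NULL && end != NULL);
	for (i = 0; i < NR_BLOCKS; i++) {
		if (layout[i] != nid)
			continue;
		if (!found)
			*start = i;
		*end = i + 1;
		found = true;
	}
	return found;
}

/* Count blocks in the node's span; with check_nid, skip foreign blocks. */
static size_t blocks_seen(int nid, bool check_nid)
{
	size_t start = 0, end = 0, i, n = 0;

	if (!node_span(nid, &start, &end))
		return 0;
	for (i = start; i < end; i++)
		if (!check_nid || layout[i] == nid)
			n++;
	return n;
}
```

In this model node 1 spans blocks 4 to 23, so a scan without the node
id check sees 20 blocks although only 12 of them belong to node 1.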

Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Baoquan He

On 03/30/20 at 09:42am, Michal Hocko wrote:
> On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > In NUMA layout which nodes have memory ranges that span across other nodes,
> > the mm driver can detect the memory node id incorrectly.
> > 
> > For example, with layout below
> > Node 0 address:    
> > Node 1 address:    
> > 
> > Note:
> >  - Memory from low to high
> >  - 0/1: Node id
> >  - x: Invalid memory of a node
> > 
> > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > config, mm only checks the memory validity but not the node id.
> > Because of that, Node 1 also detects the memory from node 0 as below
> > when it scans from the start address to the end address of node 1.
> > 
> > Node 0 address:    
> > Node 1 address:    
> > 
> > This layout could occur on any architecture. Most of them enables
> > this config by default with CONFIG_NUMA. This patch, by default, enables
> > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> 
> I am not opposed to this at all. It reduces the config space and that is
> a good thing on its own. The history has shown that meory layout might
> be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> which is clearly an overkill.
> 
> Your description doesn't really explain why this is safe though. The
> history of this config is somehow messy, though. Mike has tried
> to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> reasoning what so ever. This doesn't make it really easy see whether
> reasons for reintroduction are still there. Maybe there are some subtle
> dependencies. I do not see any TBH but that might be burried deep in an
> arch specific code.

Since NODES_SPAN_OTHER_NODES depends on NUMA on all ARCHes, replacing
it with CONFIG_NUMA seems to carry no risk. Only those ARCHes which
didn't have CONFIG_NODES_SPAN_OTHER_NODES before will see a tiny
performance degradation. Besides, s390 has already removed support for
NODES_SPAN_OTHER_NODES.

commit 701dc81e7412daaf3c5bf4bc55d35c8b1525112a
Author: Heiko Carstens 
Date:   Wed Feb 19 13:29:15 2020 +0100

s390/mm: remove fake numa support

> 
> > v3:
> >  * Revise the patch description
> > 
> > V2:
> >  * Revise the patch description
> > 
> > Hoan Tran (5):
> >   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
> >   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> >   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> > 
> >  arch/powerpc/Kconfig | 9 -
> >  arch/s390/Kconfig| 8 
> >  arch/sparc/Kconfig   | 9 -
> >  arch/x86/Kconfig | 9 -
> >  mm/page_alloc.c  | 2 +-
> >  5 files changed, 1 insertion(+), 36 deletions(-)
> > 
> > -- 
> > 1.8.3.1
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
>

Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-25 Thread Baoquan He

On 03/25/20 at 01:42pm, Aneesh Kumar K.V wrote:
> On 3/25/20 1:07 PM, Baoquan He wrote:
> > On 03/25/20 at 03:06pm, Baoquan He wrote:
> > > On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:
> > 
> > > >   mm/sparse.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > index aadb7298dcef..3012d1f3771a 100644
> > > > --- a/mm/sparse.c
> > > > +++ b/mm/sparse.c
> > > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> > > > unsigned long nr_pages,
> > > > ms->usage = NULL;
> > > > }
> > > > memmap = sparse_decode_mem_map(ms->section_mem_map, 
> > > > section_nr);
> > > > +   /* Mark the section invalid */
> > > > +   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > > 
> > > Not sure if we should add checking in valid_section() or pfn_valid(),
> > > e.g check ms->usage validation too. Otherwise, this fix looks good to
> > > me.
> > 
> > With SPASEMEM_VMEMAP enabled, we should do validation check on ms->usage
> > before checking any subsection is valid. Since now we do have case
> > in which ms->usage is released, people still try to check it.
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index f0a2c184eb9a..d79bd938852e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1306,6 +1306,8 @@ static inline int pfn_section_valid(struct 
> > mem_section *ms, unsigned long pfn)
> >   {
> > int idx = subsection_map_index(pfn);
> > +   if (!ms->usage)
> > +   return 0;
> > return test_bit(idx, ms->usage->subsection_map);
> >   }
> >   #else
> > 
> 
> We always check for section valid, before we check if pfn_section_valid().
> 
> static inline int pfn_valid(unsigned long pfn)
> 
>   struct mem_section *ms;
> 
>   if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
>   return 0;
>   ms = __nr_to_section(pfn_to_section_nr(pfn));
>   if (!valid_section(ms))
>   return 0;
>   /*
>* Traditionally early sections always returned pfn_valid() for
>* the entire section-sized span.
>*/
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> 
> IMHO adding that if (!ms->usage) is redundant.

Yeah, I tend to agree. This can only happen in the small window
between releasing ms->usage and releasing ms->section_mem_map when
removing a section. I just thought about adding this check as extra
hardening on top of your fix, because valid_section() only checks
ms->section_mem_map. Anyway, your fix looks good to me; let's see if
other people have any comments.
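
To illustrate why the extra check is redundant once the flag is
cleared, here is a simplified user-space model. The *_model names, the
flag value and the 64-bit subsection map are assumptions made for this
sketch, not the kernel definitions, and the early_section() shortcut is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION_HAS_MEM_MAP	(1UL << 1)	/* assumed flag bit for the model */

struct usage_model {
	uint64_t subsection_map;	/* one bit per subsection */
};

struct section_model {
	unsigned long section_mem_map;	/* encoded memmap pointer plus flags */
	struct usage_model *usage;	/* NULL once fully deactivated */
};

static bool valid_section_model(const struct section_model *ms)
{
	return ms && (ms->section_mem_map & SECTION_HAS_MEM_MAP);
}

/* Unguarded form: relies on the caller having checked the section. */
static bool subsection_valid_model(const struct section_model *ms,
				   unsigned int idx)
{
	assert(idx < 64);
	assert(ms->usage != NULL);	/* a NULL dereference in the kernel */
	return (ms->usage->subsection_map >> idx) & 1u;
}

/* pfn_valid()-like path: the section check gates the usage access. */
static bool pfn_valid_model(const struct section_model *ms, unsigned int idx)
{
	if (!valid_section_model(ms))
		return false;
	return subsection_valid_model(ms, idx);
}

/* Deactivation with the fix: drop usage and clear the flag together. */
static void section_deactivate_model(struct section_model *ms)
{
	ms->usage = NULL;
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
}
```

Once the flag is cleared, pfn_valid_model() returns false before it
ever touches ms->usage, which is the behaviour the fix relies on.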

Thanks
Baoquan

Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-25 Thread Baoquan He

On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_mem(). This
> was done so that pfn_page() can work correctly with kernel config that 
> disables
> SPARSEMEM_VMEMMAP. With that config pfn_to_page does
> 
>   __section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
> }
> 
> Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is used 
> to
> check the pfn validity (pfn_valid()). Since section_deactivate release
> mem_section->usage if a section is fully deactivated, pfn_valid() check after
> a subsection_deactivate cause a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
>   int idx = subsection_map_index(pfn);
> 
>   return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> 
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Cc: Baoquan He 
> Reported-by: Sachin Sant 
> Signed-off-by: Aneesh Kumar K.V 

Maybe add Sachin's Tested-by; Sachin has tested and confirmed that this
fix works.

> ---
>  mm/sparse.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..3012d1f3771a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
>   ms->usage = NULL;
>   }
>   memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> + /* Mark the section invalid */
> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;

Not sure if we should also add a check in valid_section() or pfn_valid(),
e.g. checking that ms->usage is valid too. Otherwise, this fix looks good
to me.

Reviewed-by: Baoquan He

Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-25 Thread Baoquan He

On 03/25/20 at 03:06pm, Baoquan He wrote:
> On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:

> >  mm/sparse.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index aadb7298dcef..3012d1f3771a 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> > unsigned long nr_pages,
> > ms->usage = NULL;
> > }
> > memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > +   /* Mark the section invalid */
> > +   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> 
> Not sure if we should add checking in valid_section() or pfn_valid(),
> e.g check ms->usage validation too. Otherwise, this fix looks good to
> me.

With SPARSEMEM_VMEMMAP enabled, we should check that ms->usage is valid
before checking whether any subsection is valid, since we now do have a
case in which ms->usage has been released while people still try to
check it.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f0a2c184eb9a..d79bd938852e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1306,6 +1306,8 @@ static inline int pfn_section_valid(struct mem_section 
*ms, unsigned long pfn)
 {
int idx = subsection_map_index(pfn);
 
+   if (!ms->usage)
+   return 0;
return test_bit(idx, ms->usage->subsection_map);
 }
 #else

Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Baoquan He

On 03/26/20 at 07:02pm, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> The crash is due to NULL dereference at
> 
> test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL; in 
> pfn_section_valid()
> 
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_mem(). This
> was done so that pfn_page() can work correctly with kernel config that 
> disables
> SPARSEMEM_VMEMMAP. With that config pfn_to_page does
> 
>   __section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
> }
> 
> Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is used 
> to
> check the pfn validity (pfn_valid()). Since section_deactivate release
> mem_section->usage if a section is fully deactivated, pfn_valid() check after
> a subsection_deactivate cause a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
>   int idx = subsection_map_index(pfn);
> 
>   return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> For architectures like ppc64 where large pages are used for vmmemap mapping 
> (16MB),
> a specific vmemmap mapping can cover multiple sections. Hence before a vmemmap
> mapping page can be freed, the kernel needs to make sure there are no valid 
> sections
> within that mapping. Clearing the section valid bit before
> depopulate_section_memap enables this.
> 
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Cc: Baoquan He 
> Cc: Michael Ellerman 
> Cc: Dan Williams 
> Cc: Pankaj Gupta 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/sparse.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..65599e8bd636 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,12 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
>   ms->usage = NULL;
>   }
>       memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> + /*
> +  * Mark the section invalid so that valid_section()
> +  * return false. This prevents code from dereferencing
> +  * ms->usage array.
> +  */
> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>   }

Reviewed-by: Baoquan He

Re: [PATCH 17/21] mm: free_area_init: allow defining max_zone_pfn in descending order

2020-04-23 Thread Baoquan He

On 04/23/20 at 08:55am, Mike Rapoport wrote:
> On Thu, Apr 23, 2020 at 10:57:20AM +0800, Baoquan He wrote:
> > On 04/23/20 at 10:53am, Baoquan He wrote:
> > > On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> > > > From: Mike Rapoport 
> > > > 
> > > > Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the
> > > > ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it 
> > > > is
> > > > sorted in descending order allows using free_area_init() on such
> > > > architectures.
> > > > 
> > > > Add top -> down traversal of max_zone_pfn array in free_area_init() and 
> > > > use
> > > > the latter in ARC node/zone initialization.
> > > 
> > > Or maybe leave ARC as is. The change in this patchset doesn't impact
> > > ARC's handling about zone initialization, leaving it as is can reduce
> > > the complication in implementation of free_area_init(), which is a
> > > common function. So I personally don't see a strong motivation to have
> > > this patch.
> > 
> > OK, seems this patch is prepared to simplify free_area_init_node(), so
> > take back what I said at above.
> > 
> > Then this looks necessary, even though it introduces special case into
> > common function free_area_init().
> 
> The idea is to have a single free_area_init() for all architectures
> without keeping two completely different ways of calculating the zone
> extents.
> Another thing, is that with this we could eventually switch ARC from
> DISCONTIGMEM.

Yeah, I think uniting them into a single free_area_init() is a great
idea. Even though I had been through this patchset, when looking into
each patch I may still forget the details of later patches :)

Re: [PATCH 05/21] mm: use free_area_init() instead of free_area_init_nodes()

2020-04-22 Thread Baoquan He

 *
>   * This will call free_area_init_node() for each active node in the system.
   It's __free_area_init_node() being called here, but it
doesn't matter much because it's updated in a later patch.
> @@ -7440,7 +7440,7 @@ static void check_for_memory(pg_data_t *pgdat, int nid)
>   * starts where the previous one ended. For example, ZONE_DMA32 starts
>   * at arch_max_dma_pfn.
>   */
> -void __init free_area_init_nodes(unsigned long *max_zone_pfn)
> +void __init free_area_init(unsigned long *max_zone_pfn)
>  {
>   unsigned long start_pfn, end_pfn;
>   int i, nid;
> @@ -7700,12 +7700,6 @@ void __init set_dma_reserve(unsigned long 
> new_dma_reserve)
>   dma_reserve = new_dma_reserve;
>  }
>  
> -void __init free_area_init(unsigned long *max_zone_pfn)
> -{
> - init_unavailable_mem();
> - free_area_init_nodes(max_zone_pfn);
> -}
> -
>  static int page_alloc_cpu_dead(unsigned int cpu)
>  {

Reviewed-by: Baoquan He 

>  
> -- 
> 2.25.1
>

Re: [PATCH 18/21] mm: rename free_area_init_node() to free_area_init_memoryless_node()

2020-04-22 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The free_area_init_node() is only used by x86 to initialize a memory-less
> nodes.
> Make its name reflect this and drop all the function parameters except node
> ID as they are anyway zero.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/x86/mm/numa.c | 5 +
>  include/linux/mm.h | 9 +++--
>  mm/page_alloc.c| 7 ++-
>  3 files changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index fe024b2ac796..8ee952038c80 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -737,12 +737,9 @@ void __init x86_numa_init(void)
>  
>  static void __init init_memory_less_node(int nid)
>  {
> - unsigned long zones_size[MAX_NR_ZONES] = {0};
> - unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
>   /* Allocate and initialize node data. Memory-less node is now online.*/
>   alloc_node_data(nid);
> - free_area_init_node(nid, zones_size, 0, zholes_size);
> + free_area_init_memoryless_node(nid);
>  
>   /*
>* All zonelists will be built later in start_kernel() after per cpu
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1c2ecb42e043..27660f6cf26e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2272,8 +2272,7 @@ static inline spinlock_t *pud_lock(struct mm_struct 
> *mm, pud_t *pud)
>  }
>  
>  extern void __init pagecache_init(void);
> -extern void __init free_area_init_node(int nid, unsigned long * zones_size,
> - unsigned long zone_start_pfn, unsigned long *zholes_size);
> +extern void __init free_area_init_memoryless_node(int nid);
>  extern void free_initmem(void);
>  
>  /*
> @@ -2345,10 +2344,8 @@ static inline unsigned long get_num_physpages(void)
>  
>  /*
>   * Using memblock node mappings, an architecture may initialise its
> - * zones, allocate the backing mem_map and account for memory holes in a more
> - * architecture independent manner. This is a substitute for creating the
> - * zone_sizes[] and zholes_size[] arrays and passing them to
> - * free_area_init_node()
> + * zones, allocate the backing mem_map and account for memory holes in an
> + * architecture independent manner.
>   *
>   * An architecture is expected to register range of page frames backed by
>   * physical memory with memblock_add[_node]() before calling
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 376434c7a78b..e46232ec4849 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6979,12 +6979,9 @@ static void __init __free_area_init_node(int nid, 
> unsigned long *zones_size,
>   free_area_init_core(pgdat);
>  }
>  
> -void __init free_area_init_node(int nid, unsigned long *zones_size,
> - unsigned long node_start_pfn,
> - unsigned long *zholes_size)
> +void __init free_area_init_memoryless_node(int nid)
>  {
> - __free_area_init_node(nid, zones_size, node_start_pfn, zholes_size,
> -   true);
> + __free_area_init_node(nid, NULL, 0, NULL, false);

Can we move the definition of free_area_init_memoryless_node() into
arch/x86/mm/numa.c, since there's only one caller there?

And I am also wondering whether adding the wrapper
free_area_init_memoryless_node() is necessary if all it does is call
free_area_init_node().

>  }
>  
>  #if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> -- 
> 2.25.1
>

Re: [PATCH 17/21] mm: free_area_init: allow defining max_zone_pfn in descending order

2020-04-22 Thread Baoquan He

On 04/23/20 at 10:53am, Baoquan He wrote:
> On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the
> > ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it is
> > sorted in descending order allows using free_area_init() on such
> > architectures.
> > 
> > Add top -> down traversal of max_zone_pfn array in free_area_init() and use
> > the latter in ARC node/zone initialization.
> 
> Or maybe leave ARC as is. The change in this patchset doesn't impact
> ARC's handling about zone initialization, leaving it as is can reduce
> the complication in implementation of free_area_init(), which is a
> common function. So I personally don't see a strong motivation to have
> this patch.

OK, it seems this patch prepares for simplifying free_area_init_node(),
so I take back what I said above.

Then this looks necessary, even though it introduces a special case into
the common function free_area_init().
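
The special case can be pictured with a small user-space sketch. It
assumes a hypothetical three-zone setup and a made-up helper name, so
it is only a model of the traversal, not the code in the patch. The
same loop turns maximal zone PFNs into zone ranges, and only the
visiting order changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 3	/* hypothetical: ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM */

struct zone_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* one past the last PFN of the zone */
};

/*
 * Turn per-zone maximal PFNs into [start, end) ranges. Zones are visited
 * from index 0 upward, or from the last index downward when @descending.
 */
static void zone_ranges_from_max(const unsigned long max_zone_pfn[NR_ZONES],
				 unsigned long lowest_pfn, bool descending,
				 struct zone_range out[NR_ZONES])
{
	unsigned long start_pfn = lowest_pfn;
	int i;

	assert(max_zone_pfn != NULL && out != NULL);
	for (i = 0; i < NR_ZONES; i++) {
		int zone = descending ? NR_ZONES - 1 - i : i;
		unsigned long end_pfn = max_zone_pfn[zone];

		if (end_pfn < start_pfn)
			end_pfn = start_pfn;	/* unset or empty zone */

		out[zone].start_pfn = start_pfn;
		out[zone].end_pfn = end_pfn;
		start_pfn = end_pfn;
	}
}
```

With maximal PFNs { 0, 300, 100 } and descending order, the highest
zone index gets the low range and the middle zone the range above it,
which matches the ARC-style layout where ZONE_HIGHMEM sits below
ZONE_NORMAL.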

Reviewed-by: Baoquan He 

> 
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/arc/mm/init.c | 36 +++-
> >  mm/page_alloc.c| 24 +++-
> >  2 files changed, 26 insertions(+), 34 deletions(-)
> > 
> > diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> > index 0920c969c466..41eb9be1653c 100644
> > --- a/arch/arc/mm/init.c
> > +++ b/arch/arc/mm/init.c
> > @@ -63,11 +63,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 
> > size)
> >  
> > low_mem_sz = size;
> > in_use = 1;
> > +   memblock_add_node(base, size, 0);
> > } else {
> >  #ifdef CONFIG_HIGHMEM
> > high_mem_start = base;
> > high_mem_sz = size;
> > in_use = 1;
> > +   memblock_add_node(base, size, 1);
> >  #endif
> > }
> >  
> > @@ -83,8 +85,7 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 
> > size)
> >   */
> >  void __init setup_arch_memory(void)
> >  {
> > -   unsigned long zones_size[MAX_NR_ZONES];
> > -   unsigned long zones_holes[MAX_NR_ZONES];
> > +   unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };
> >  
> > init_mm.start_code = (unsigned long)_text;
> > init_mm.end_code = (unsigned long)_etext;
> > @@ -115,7 +116,6 @@ void __init setup_arch_memory(void)
> >  * the crash
> >  */
> >  
> > -   memblock_add_node(low_mem_start, low_mem_sz, 0);
> > memblock_reserve(CONFIG_LINUX_LINK_BASE,
> >  __pa(_end) - CONFIG_LINUX_LINK_BASE);
> >  
> > @@ -133,22 +133,7 @@ void __init setup_arch_memory(void)
> > memblock_dump_all();
> >  
> > /*- node/zones setup --*/
> > -   memset(zones_size, 0, sizeof(zones_size));
> > -   memset(zones_holes, 0, sizeof(zones_holes));
> > -
> > -   zones_size[ZONE_NORMAL] = max_low_pfn - min_low_pfn;
> > -   zones_holes[ZONE_NORMAL] = 0;
> > -
> > -   /*
> > -* We can't use the helper free_area_init(zones[]) because it uses
> > -* PAGE_OFFSET to compute the @min_low_pfn which would be wrong
> > -* when our kernel doesn't start at PAGE_OFFSET, i.e.
> > -* PAGE_OFFSET != CONFIG_LINUX_RAM_BASE
> > -*/
> > -   free_area_init_node(0,  /* node-id */
> > -   zones_size, /* num pages per zone */
> > -   min_low_pfn,/* first pfn of node */
> > -   zones_holes);   /* holes */
> > +   max_zone_pfn[ZONE_NORMAL] = max_low_pfn;
> >  
> >  #ifdef CONFIG_HIGHMEM
> > /*
> > @@ -168,20 +153,13 @@ void __init setup_arch_memory(void)
> > min_high_pfn = PFN_DOWN(high_mem_start);
> > max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
> >  
> > -   zones_size[ZONE_NORMAL] = 0;
> > -   zones_holes[ZONE_NORMAL] = 0;
> > -
> > -   zones_size[ZONE_HIGHMEM] = max_high_pfn - min_high_pfn;
> > -   zones_holes[ZONE_HIGHMEM] = 0;
> > -
> > -   free_area_init_node(1,  /* node-id */
> > -   zones_size, /* num pages per zone */
> > -   min_high_pfn,   /* first pfn of node */
> > -   zones_holes);   /* holes */
> > +   max_zone_pfn[ZONE_HIGHMEM] = max_high_pfn;
> >  
> > high_memory = (void *)(min_high_pfn << PAGE_SHIFT);
> > kmap_init();
> >  #endif
> > +
> > +   free_ar

Re: [PATCH 04/21] mm: free_area_init: use maximal zone PFNs rather than zone sizes

2020-04-22 Thread Baoquan He

zones_size[MAX_NR_ZONES], vaddr;
> - int i;
> + unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };
> + unsigned long vaddr;
>  
>   empty_zero_page = (unsigned long *) memblock_alloc_low(PAGE_SIZE,
>  PAGE_SIZE);
> @@ -167,12 +167,8 @@ void __init paging_init(void)
>   panic("%s: Failed to allocate %lu bytes align=%lx\n",
> __func__, PAGE_SIZE, PAGE_SIZE);
>  
> - for (i = 0; i < ARRAY_SIZE(zones_size); i++)
> - zones_size[i] = 0;
> -
> - zones_size[ZONE_NORMAL] = (end_iomem >> PAGE_SHIFT) -
> - (uml_physmem >> PAGE_SHIFT);
> - free_area_init(zones_size);
> + max_zone_pfn[ZONE_NORMAL] = end_iomem >> PAGE_SHIFT;
> + free_area_init(max_zone_pfn);
>  
>   /*
>* Fixed mappings, only the page table structure has to be
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5903bbbdb336..d9a256a97ac5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2272,7 +2272,7 @@ static inline spinlock_t *pud_lock(struct mm_struct 
> *mm, pud_t *pud)
>  }
>  
>  extern void __init pagecache_init(void);
> -extern void free_area_init(unsigned long * zones_size);
> +extern void free_area_init(unsigned long * max_zone_pfn);
>  extern void __init free_area_init_node(int nid, unsigned long * zones_size,
>   unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4530e9cfd9f7..530701b38bc7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7700,11 +7700,10 @@ void __init set_dma_reserve(unsigned long 
> new_dma_reserve)
>   dma_reserve = new_dma_reserve;
>  }
>  
> -void __init free_area_init(unsigned long *zones_size)
> +void __init free_area_init(unsigned long *max_zone_pfn)
>  {
>   init_unavailable_mem();
> - free_area_init_node(0, zones_size,
> - __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
> + free_area_init_nodes(max_zone_pfn);

Reviewed-by: Baoquan He 

>  }
>  
>  static int page_alloc_cpu_dead(unsigned int cpu)
> -- 
> 2.25.1
>

Re: [PATCH 16/21] mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES

2020-04-22 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The commit f47ac088c406 ("mm: memmap_init: iterate over memblock regions

This commit id is probably a temporary one and will change when the
patch is merged into the maintainer's tree and Linus's tree. Would it be
OK to refer to it only as the last patch plus the patch subject?

> rather that check each PFN") made early_pfn_in_nid() obsolete and since
> CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real
> implementation of early_pfn_in_nid() it is also not needed anymore.
> 
> Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES.
> 
> Co-developed-by: Hoan Tran 
> Signed-off-by: Hoan Tran 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/powerpc/Kconfig |  9 -
>  arch/sparc/Kconfig   |  9 -
>  arch/x86/Kconfig |  9 -
>  mm/page_alloc.c  | 20 
>  4 files changed, 47 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 5f86b22b7d2c..74f316deeae1 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -685,15 +685,6 @@ config ARCH_MEMORY_PROBE
>   def_bool y
>   depends on MEMORY_HOTPLUG
>  
> -# Some NUMA nodes have memory ranges that span
> -# other nodes.  Even though a pfn is valid and
> -# between a node's start and end pfns, it may not
> -# reside on that node.  See memmap_init_zone()
> -# for details.
> -config NODES_SPAN_OTHER_NODES
> - def_bool y
> - depends on NEED_MULTIPLE_NODES
> -
>  config STDBINUTILS
>   bool "Using standard binutils settings"
>   depends on 44x
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index 795206b7b552..0e4f3891b904 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -286,15 +286,6 @@ config NODES_SHIFT
> Specify the maximum number of NUMA Nodes available on the target
> system.  Increases memory reserved to accommodate various tables.
>  
> -# Some NUMA nodes have memory ranges that span
> -# other nodes.  Even though a pfn is valid and
> -# between a node's start and end pfns, it may not
> -# reside on that node.  See memmap_init_zone()
> -# for details.
> -config NODES_SPAN_OTHER_NODES
> - def_bool y
> - depends on NEED_MULTIPLE_NODES
> -
>  config ARCH_SPARSEMEM_ENABLE
>   def_bool y if SPARC64
>   select SPARSEMEM_VMEMMAP_ENABLE
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9d3e95b4fb85..37dac095659e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1581,15 +1581,6 @@ config X86_64_ACPI_NUMA
>   ---help---
> Enable ACPI SRAT based node topology detection.
>  
> -# Some NUMA nodes have memory ranges that span
> -# other nodes.  Even though a pfn is valid and
> -# between a node's start and end pfns, it may not
> -# reside on that node.  See memmap_init_zone()
> -# for details.
> -config NODES_SPAN_OTHER_NODES
> - def_bool y
> - depends on X86_64_ACPI_NUMA
> -
>  config NUMA_EMU
>   bool "NUMA emulation"
>   depends on NUMA
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c43ce8709457..343d87b8697d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1541,26 +1541,6 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
>  }
>  #endif /* CONFIG_NEED_MULTIPLE_NODES */
>  
> -#ifdef CONFIG_NODES_SPAN_OTHER_NODES
> -/* Only safe to use early in boot when initialisation is single-threaded */
> -static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> -{
> - int nid;
> -
> - nid = __early_pfn_to_nid(pfn, _pfnnid_cache);
> - if (nid >= 0 && nid != node)
> - return false;
> - return true;
> -}
> -
> -#else
> -static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> -{
> - return true;
> -}
> -#endif

The early_pfn_valid() macro is not needed any more either; we may want to
remove it too.

Otherwise, removing NODES_SPAN_OTHER_NODES in this patch looks good.

Reviewed-by: Baoquan He 

> -
> -
>  void __init memblock_free_pages(struct page *page, unsigned long pfn,
>   unsigned int order)
>  {
> -- 
> 2.25.1
>

Re: [PATCH 17/21] mm: free_area_init: allow defining max_zone_pfn in descending order

2020-04-22 Thread Baoquan He

On 04/12/20 at 10:48pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the
> ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it is
> sorted in descending order allows using free_area_init() on such
> architectures.
> 
> Add top -> down traversal of max_zone_pfn array in free_area_init() and use
> the latter in ARC node/zone initialization.

Or maybe leave ARC as is. The changes in this patchset don't affect how
ARC handles zone initialization, and leaving ARC alone keeps the
implementation of free_area_init(), which is a common function, simpler.
So I personally don't see a strong motivation for this patch.
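
For illustration only, here is a minimal user-space sketch of what handling
both orders means for the zone-boundary loop. It is not the code from the
patch; NR_ZONES, zone_bounds() and the two-zone layout are assumptions made
up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ZONES 2	/* hypothetical: index 0 = NORMAL, 1 = HIGHMEM */

/*
 * Fill lowest[]/highest[] with each zone's PFN range, walking the zones
 * from the lowest addresses upwards. If max_zone_pfn[] is sorted in
 * descending order, the zone indexes are visited in reverse.
 */
static void zone_bounds(const unsigned long *max_zone_pfn,
			unsigned long start_pfn,
			unsigned long *lowest, unsigned long *highest)
{
	bool descending = NR_ZONES > 1 && max_zone_pfn[0] > max_zone_pfn[1];
	int i;

	for (i = 0; i < NR_ZONES; i++) {
		/* This index flip is the extra branch the common code gains. */
		int zone = descending ? NR_ZONES - 1 - i : i;
		unsigned long end_pfn = max_zone_pfn[zone];

		if (end_pfn < start_pfn)
			end_pfn = start_pfn;
		lowest[zone] = start_pfn;
		highest[zone] = end_pfn;
		start_pfn = end_pfn;
	}
}
```

With { 100, 300 } zone 0 covers the low PFNs; with { 300, 100 } zone 1 does.
The only difference is the index flip, which every other architecture would
then carry in the common path as well.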

> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/arc/mm/init.c | 36 +++-
>  mm/page_alloc.c| 24 +++-
>  2 files changed, 26 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index 0920c969c466..41eb9be1653c 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -63,11 +63,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
>  
>   low_mem_sz = size;
>   in_use = 1;
> + memblock_add_node(base, size, 0);
>   } else {
>  #ifdef CONFIG_HIGHMEM
>   high_mem_start = base;
>   high_mem_sz = size;
>   in_use = 1;
> + memblock_add_node(base, size, 1);
>  #endif
>   }
>  
> @@ -83,8 +85,7 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
>   */
>  void __init setup_arch_memory(void)
>  {
> - unsigned long zones_size[MAX_NR_ZONES];
> - unsigned long zones_holes[MAX_NR_ZONES];
> + unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };
>  
>   init_mm.start_code = (unsigned long)_text;
>   init_mm.end_code = (unsigned long)_etext;
> @@ -115,7 +116,6 @@ void __init setup_arch_memory(void)
>* the crash
>*/
>  
> - memblock_add_node(low_mem_start, low_mem_sz, 0);
>   memblock_reserve(CONFIG_LINUX_LINK_BASE,
>__pa(_end) - CONFIG_LINUX_LINK_BASE);
>  
> @@ -133,22 +133,7 @@ void __init setup_arch_memory(void)
>   memblock_dump_all();
>  
>   /*- node/zones setup --*/
> - memset(zones_size, 0, sizeof(zones_size));
> - memset(zones_holes, 0, sizeof(zones_holes));
> -
> - zones_size[ZONE_NORMAL] = max_low_pfn - min_low_pfn;
> - zones_holes[ZONE_NORMAL] = 0;
> -
> - /*
> -  * We can't use the helper free_area_init(zones[]) because it uses
> -  * PAGE_OFFSET to compute the @min_low_pfn which would be wrong
> -  * when our kernel doesn't start at PAGE_OFFSET, i.e.
> -  * PAGE_OFFSET != CONFIG_LINUX_RAM_BASE
> -  */
> - free_area_init_node(0,  /* node-id */
> - zones_size, /* num pages per zone */
> - min_low_pfn,/* first pfn of node */
> - zones_holes);   /* holes */
> + max_zone_pfn[ZONE_NORMAL] = max_low_pfn;
>  
>  #ifdef CONFIG_HIGHMEM
>   /*
> @@ -168,20 +153,13 @@ void __init setup_arch_memory(void)
>   min_high_pfn = PFN_DOWN(high_mem_start);
>   max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
>  
> - zones_size[ZONE_NORMAL] = 0;
> - zones_holes[ZONE_NORMAL] = 0;
> -
> - zones_size[ZONE_HIGHMEM] = max_high_pfn - min_high_pfn;
> - zones_holes[ZONE_HIGHMEM] = 0;
> -
> - free_area_init_node(1,  /* node-id */
> - zones_size, /* num pages per zone */
> - min_high_pfn,   /* first pfn of node */
> - zones_holes);   /* holes */
> + max_zone_pfn[ZONE_HIGHMEM] = max_high_pfn;
>  
>   high_memory = (void *)(min_high_pfn << PAGE_SHIFT);
>   kmap_init();
>  #endif
> +
> + free_area_init(max_zone_pfn);
>  }
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 343d87b8697d..376434c7a78b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7429,7 +7429,8 @@ static void check_for_memory(pg_data_t *pgdat, int nid)
>  void __init free_area_init(unsigned long *max_zone_pfn)
>  {
>   unsigned long start_pfn, end_pfn;
> - int i, nid;
> + int i, nid, zone;
> + bool descending = false;
>  
>   /* Record where the zone boundaries are */
>   memset(arch_zone_lowest_possible_pfn, 0,
> @@ -7439,13 +7440,26 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  
>   start_pfn = find_min_pfn_with_active_regions();
>  
> + /*
> +  * Some architecturs, e.g. ARC may have ZONE_HIGHMEM below
> +  * ZONE_NORMAL. For such cases we allow max_zone_pfn sorted in the
> +  * descending order
> +  */
> + if (MAX_NR_ZONES > 1 && max_zone_pfn[0] > max_zone_pfn[1])
> + descending = true;
> +
>   for (i = 0; i <

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Baoquan He

On 04/22/20 at 12:05pm, David Hildenbrand wrote:
> On 22.04.20 11:57, Baoquan He wrote:
> > On 04/22/20 at 11:24am, David Hildenbrand wrote:
> >> On 22.04.20 11:17, Baoquan He wrote:
> >>> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >>>>>> ACPI SRAT is embedded into efi, need read out the rsdp pointer. If we
> >>>>>> don't pass the efi, it won't get the SRAT table correctly, if I remember
> >>>>>> correctly. Yeah, I remember kvm guest can get memory hotplugged with
> >>>>>> ACPI only, this won't happen on bare metal though. Need check carefully.
> >>>>>> I have been using kvm guest with uefi firmware recently.
> >>>>>
> >>>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> >>>>>
> >>>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> >>>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
> >>>>> will not detect it automatically (good!), instead load the virtio-mem
> >>>>> driver and let it add memory back to the system.
> >>>>>
> >>>>> I should probably play with kexec and virtio-mem once I have some spare
> >>>>> cycles ... to find out what's broken and needs to be addressed :)
> >>>>
> >>>> FWIW, I just gave virtio-mem and kexec/kdump a try.
> >>>>
> >>>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> >>>> The kexec kernel only uses memory in the crash region. The virtio-mem
> >>>> driver properly bails out due to is_kdump_kernel().
> >>>
> >>> Right, kdump is not impacted by later added memory.
> >>>
> >>>>
> >>>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> >>>> to get placed on virtio-mem memory (pure luck due to the left-to-right
> >>>> search). Memory added by virtio-mem is not getting added to the e820
> >>>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> >>>> right memory is readded.
> >>>
> >>> kexec_file_load just behaves as you tested. It doesn't collect later
> >>> added memory to e820 because it uses e820_table_kexec directly to pass
> >>> e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
> >>> during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
> >>> doesn't have it in e820 during bootup, but it's recognized and added
> >>> when ACPI scanning. I think we should update e820_table_kexec when hot
> >>> add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
> >>> balloon will need be added into e820_table_kexec too, and if this is
> >>> expected behaviour.
> >>>
> >>> But whatever we do, it won't impact the kexec file_loading, because of
> >>> the searching strategy bottom up. Just adding them into e820_table_kexec
> >>> will make it consistent with cold reboot which get recognizes and get
> >>> them into e820 during bootup.
> >>
> >> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
> >> kernel should see. Not more, not less.
> >>
> >> Regarding virtio-mem: Not in e820 on cold-boot.
> >> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
> >> IIRC. I think on real HW it can be different.
> > 
> > Yeah, DIMMs under KVM won't show up in e820 map. While this is not feature
> > of QEMU/KVM, but a defect of it. I ever asked Igor who is developer of
> > QEMU/KVM guest in this area, why we don't make kvm guest recognize
> > hotpluggable DIMM and add it into e820 map, he said he had tried to make
> > it, but this will corrupt guest on HyperV. So he had to revert the
> 
> Yeah, I remember that this had to be reverted due to something breaking.
> But OTOH, it allows us to online coldplugged DIMMs online_movable
> easily, so I'd say it's even a feature (although, does not behave like
> real HW we have).
> 
> I use this extensively when testing memory hot(un)plug via coldplugged
> DIMMs.
> 
> I do wonder if there is real HW, where this is also the case.

None that I know of. Hotplug on real HW includes two parts; the boot
memory being hotpluggable is the more flexible one. It allows people to
replace a bad DIMM. You can see that the boot-stage code has been
adjusted a lot for this purpose; at that time, people hadn't thought
about KVM guests.

> 
> > commit on qemu. So I think we can leave it for now for both real HW and
> > kvm, or update the e820_table_kexec to include added DIMM for both real
> > HW and KVM. I hope one day KVM dev will find a way to conquer the defect
> > on HyperV and make the e820map consistent with bare metal. After all,
> > kvm guest is trying to imitate real HW for the most part.
> > 
> > Anyway, I will think about the e820_table_kexec updating. See if we can
> > do something about it.
> 
> Yeah, for DIMMs on real HW it might definitely make sense. We might be
> able to hook into updates of /sys/firmware/memmap on memory add/remove.

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Baoquan He

On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >> ACPI SRAT is embedded into efi, need read out the rsdp pointer. If we don't
> >> pass the efi, it won't get the SRAT table correctly, if I remember
> >> correctly. Yeah, I remember kvm guest can get memory hotplugged with
> >> ACPI only, this won't happen on bare metal though. Need check carefully.
> >> I have been using kvm guest with uefi firmware recently.
> > 
> > Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> > 
> > I'm also asking because of virtio-mem. Memory added via virtio-mem is
> > not part of any efi tables or whatsoever. So I assume the kexec kernel
> > will not detect it automatically (good!), instead load the virtio-mem
> > driver and let it add memory back to the system.
> > 
> > I should probably play with kexec and virtio-mem once I have some spare
> > cycles ... to find out what's broken and needs to be addressed :)
> 
> FWIW, I just gave virtio-mem and kexec/kdump a try.
> 
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().

Right, kdump is not impacted by later added memory.

> 
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

kexec_file_load behaves just as you tested. It doesn't collect later
added memory into e820 because it uses e820_table_kexec directly to pass
e820 to the kexec-ed kernel, and e820_table_kexec is only updated during
the boot stage. I tried hot adding a DIMM after boot: the kexec-ed kernel
doesn't have it in e820 during bootup, but it's recognized and added
during ACPI scanning. I think we should update e820_table_kexec on memory
hot add/remove, at least for DIMMs. I'm not sure whether DLPAR,
virtio-mem and balloon memory need to be added into e820_table_kexec
too, or whether that is the expected behaviour.

But whatever we do, it won't impact kexec file loading, because the
search strategy is bottom up. Adding them into e820_table_kexec just
makes it consistent with a cold reboot, which recognizes them and gets
them into e820 during bootup.
> 
> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
> 
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

Yes, kexec_load reads memory regions from /sys/firmware/memmap/ or
/proc/iomem. Making it right seems a little harder. We can export them
to /proc/iomem or /sys/firmware/memmap/ marked as 'hotplug', but which
zone they belong to is not easy to tell.
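
As a rough sketch of the user-space side, assuming regions were marked the
way suggested in this thread ("System RAM (virtio-mem)" and the like), a
tool could skip anything that is not plain "System RAM". This is not
kexec-tools code; parse_iomem_line() and usable_for_kexec() are
hypothetical helpers, and only the "start-end : name" line format of
/proc/iomem is taken as given.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "start-end : name" line in the /proc/iomem format.
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_iomem_line(const char *line, unsigned long long *start,
			    unsigned long long *end, char *name, size_t len)
{
	int off = 0;

	/* %n records where the name begins; it stays 0 if " : " is missing. */
	if (sscanf(line, " %llx-%llx : %n", start, end, &off) != 2 || off == 0)
		return -1;
	snprintf(name, len, "%s", line + off);
	name[strcspn(name, "\n")] = '\0';	/* drop the trailing newline */
	return 0;
}

/* Hypothetical policy: only plain "System RAM" goes into the new e820. */
static int usable_for_kexec(const char *name)
{
	return strcmp(name, "System RAM") == 0;
}
```

A line such as "00100000-bffdffff : System RAM" would be kept, while one
named "System RAM (virtio-mem)" would be left out. It still says nothing
about which zone the memory should end up in.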

We are proactively testing kexec_file_load widely on x86_64, s390 and
arm64 by adding test cases into CKI.

> 
> 
> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.
> 
> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").
> 
> Baoquan, any opinion on that?
> 
> -- 
> Thanks,
> 
> David / dhildenb

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-22 Thread Baoquan He

On 04/22/20 at 11:24am, David Hildenbrand wrote:
> On 22.04.20 11:17, Baoquan He wrote:
> > On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >>>> ACPI SRAT is embedded into efi, need read out the rsdp pointer. If we
> >>>> don't pass the efi, it won't get the SRAT table correctly, if I remember
> >>>> correctly. Yeah, I remember kvm guest can get memory hotplugged with
> >>>> ACPI only, this won't happen on bare metal though. Need check carefully.
> >>>> I have been using kvm guest with uefi firmware recently.
> >>>
> >>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> >>>
> >>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> >>> not part of any efi tables or whatsoever. So I assume the kexec kernel
> >>> will not detect it automatically (good!), instead load the virtio-mem
> >>> driver and let it add memory back to the system.
> >>>
> >>> I should probably play with kexec and virtio-mem once I have some spare
> >>> cycles ... to find out what's broken and needs to be addressed :)
> >>
> >> FWIW, I just gave virtio-mem and kexec/kdump a try.
> >>
> >> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> >> The kexec kernel only uses memory in the crash region. The virtio-mem
> >> driver properly bails out due to is_kdump_kernel().
> > 
> > Right, kdump is not impacted by later added memory.
> > 
> >>
> >> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> >> to get placed on virtio-mem memory (pure luck due to the left-to-right
> >> search). Memory added by virtio-mem is not getting added to the e820
> >> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> >> right memory is readded.
> > 
> > kexec_file_load just behaves as you tested. It doesn't collect later
> > added memory to e820 because it uses e820_table_kexec directly to pass
> > e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
> > during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
> > doesn't have it in e820 during bootup, but it's recognized and added
> > when ACPI scanning. I think we should update e820_table_kexec when hot
> > add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
> > balloon will need be added into e820_table_kexec too, and if this is
> > expected behaviour.
> > 
> > But whatever we do, it won't impact the kexec file_loading, because of
> > the searching strategy bottom up. Just adding them into e820_table_kexec
> > will make it consistent with cold reboot which get recognizes and get
> > them into e820 during bootup.
> 
> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
> kernel should see. Not more, not less.
> 
> Regarding virtio-mem: Not in e820 on cold-boot.
> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
> IIRC. I think on real HW it can be different.

Yeah, DIMMs under KVM won't show up in the e820 map. However, this is not
a feature of QEMU/KVM but a defect of it. I once asked Igor, who is the
developer of the QEMU/KVM guest in this area, why we don't make the KVM
guest recognize hotpluggable DIMMs and add them into the e820 map. He said
he had tried to do it, but it corrupted guests on HyperV, so he had to
revert the commit in QEMU. So I think we can either leave it as is for
now on both real HW and KVM, or update e820_table_kexec to include added
DIMMs on both real HW and KVM. I hope one day the KVM developers will
find a way to overcome the defect on HyperV and make the e820 map
consistent with bare metal. After all, a KVM guest tries to imitate real
HW for the most part.

Anyway, I will think about updating e820_table_kexec and see if we can
do something about it.

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-14 Thread Baoquan He

On 04/14/20 at 11:37am, David Hildenbrand wrote:
> On 14.04.20 11:22, Baoquan He wrote:
> > On 04/14/20 at 10:00am, David Hildenbrand wrote:
> >> On 14.04.20 08:40, Baoquan He wrote:
> >>> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> >>>> Baoquan He  writes:
> >>>>
> >>>>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >>>>>>
> >>>>>> The only benefit of kexec_file_load is that it is simple enough from a
> >>>>>> kernel perspective that signatures can be checked.
> >>>>>
> >>>>> We don't have this restriction any more with below commit:
> >>>>>
> >>>>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> >>>>> and KEXEC_SIG_FORCE")
> >>>>>
> >>>>> With KEXEC_SIG_FORCE not set, we can use kexec_file_load to cover both
> >>>>> secure boot or legacy system for kexec/kdump. Being simple enough is
> >>>>> enough to attract and convince us to use it instead. And kexec_file_load
> >>>>> has been in use for several years on systems with secure boot, since
> >>>>> added in 2014, on x86_64.
> >>>>
> >>>> No.  Actually kexec_file_load is the less capable interface, and less
> >>>> flexible interface.  Which is why it is appropriate for signature
> >>>> verification.
> >>>
> >>> Well, everyone has a stance and the corresponding view. You could have
> >>> wider view from long time maintenance and in upstream position, and think
> >>> kexec_file_load is horrible. But I can only see from our work as a front
> >>> line engineer to maintain/develop kexec/kdump in RHEL, and think
> >>> kexec_file_load is easier to maintain.
> >>>
> >>> Surely except of multiple kernel image format support. No matter it is
> >>> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> >>> This is produced from kernel building by default. We have no way to
> >>> support it in our distros and add it into kexec_file_load.
> >>>
> >>> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> >>> https://lkml.org/lkml/2017/2/15/654
> >>>
> >>>>
> >>>>>> kexec_load in every other respect is the more capable and functional
> >>>>>> interface.  It makes no sense to get rid of it.
> >>>>>>
> >>>>>> It does make sense to reload with a loaded kernel on memory hotplug.
> >>>>>> That is simple and easy.  If we are going to handle something in the
> >>>>>> kernel it should simple an automated unloading of the kernel on memory
> >>>>>> hotplug.
> >>>>>>
> >>>>>>
> >>>>>> I think it would be irresponsible to deprecate kexec_load on any
> >>>>>> platform.
> >>>>>>
> >>>>>> I also suspect that kexec_file_load could be taught to copy the dtb
> >>>>>> on arm32 if someone wants to deal with signatures.
> >>>>>>
> >>>>>> We definitely can not even think of deprecating kexec_load until
> >>>>>> architecture that supports it also supports kexec_file_load and everyone
> >>>>>> is happy with that interface.  That is Linus's no regression rule.
> >>>>>
> >>>>> I should pick a milder word to express our tendency and tell our plan
> >>>>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> >>>>> much. I didn't mean to say 'deprecate' at all when replied.
> >>>>>
> >>>>> The situation and trend I understand about kexec_load and kexec_file_load
> >>>>> are:
> >>>>>
> >>>>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> >>>>> have yet, just as x86_64, arm64 and s390 have done;
> >>>>>  
> >>>>> 2) kexec_file_load is suggested to use, and take precedence over
> >>>>> kexec_load in the future, if both are supported in one ARCH.
> >>>>
> >>>> The deep problem is that kexec_file_load is distinctly less expressive
> >>>> than kexec_load.
> >>>>
> >>>>

Re: [PATCH v2 0/8] mm/memory_hotplug: allow to specify a default online_type

2020-03-18 Thread Baoquan He

On 03/18/20 at 02:54pm, Michal Hocko wrote:
> On Wed 18-03-20 21:05:17, Baoquan He wrote:
> > On 03/17/20 at 11:49am, David Hildenbrand wrote:
> > > Distributions nowadays use udev rules ([1] [2]) to specify if and
> > > how to online hotplugged memory. The rules seem to get more complex with
> > > many special cases. Due to the various special cases,
> > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
> > > is handled via udev rules.
> > > 
> > > Everytime we hotplug memory, the udev rule will come to the same
> > > conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
> > > memory in separate memory blocks and wait for memory to get onlined by user
> > > space before continuing to add more memory blocks (to not add memory faster
> > > than it is getting onlined). This of course slows down the whole memory
> > > hotplug process.
> > > 
> > > To make the job of distributions easier and to avoid udev rules that get
> > > more and more complicated, let's extend the mechanism provided by
> > > - /sys/devices/system/memory/auto_online_blocks
> > > - "memhp_default_state=" on the kernel cmdline
> > > to be able to specify also "online_movable" as well as "online_kernel"
> > 
> > This patch series looks good, thanks. Since Andrew has merged it to -mm again,
> > I won't add my Reviewed-by to bother.
> 
> JFYI, Andrew usually adds R-b or A-b tags as they are posted.

Got it, thanks for letting me know.

Re: [PATCH v2 0/8] mm/memory_hotplug: allow to specify a default online_type

2020-03-18 Thread Baoquan He

On 03/18/20 at 02:58pm, Vitaly Kuznetsov wrote:
> Baoquan He  writes:
> 
> > On 03/17/20 at 11:49am, David Hildenbrand wrote:
> >> Distributions nowadays use udev rules ([1] [2]) to specify if and
> >> how to online hotplugged memory. The rules seem to get more complex with
> >> many special cases. Due to the various special cases,
> >> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
> >> is handled via udev rules.
> >> 
> >> Everytime we hotplug memory, the udev rule will come to the same
> >> conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
> >> memory in separate memory blocks and wait for memory to get onlined by user
> >> space before continuing to add more memory blocks (to not add memory faster
> >> than it is getting onlined). This of course slows down the whole memory
> >> hotplug process.
> >> 
> >> To make the job of distributions easier and to avoid udev rules that get
> >> more and more complicated, let's extend the mechanism provided by
> >> - /sys/devices/system/memory/auto_online_blocks
> >> - "memhp_default_state=" on the kernel cmdline
> >> to be able to specify also "online_movable" as well as "online_kernel"
> >
> > This patch series looks good, thanks. Since Andrew has merged it to -mm again,
> > I won't add my Reviewed-by to bother.
> >
> > Hi David, Vitaly
> >
> > There are several things unclear to me.
> >
> > So, these improved interfaces are used to alleviate the burden of the 
> > existing udev rules, or try to replace it? As you know, we have been
> > using udev rules to interact between kernel and user space on bare metal,
> > and guests who want to hot add/remove.
> 
> With 'auto_online_blocks' interface you don't need the udev rule. David
> is trying to make it more versatile.
> 
> >
> > And also the OOM issue in hyperV when onlining pages after adding memory
> > block. I am not a virt devel expert, could this happen on bare metal
> > system?
> 
> Yes - in theory, very unlikely - in practice.
> 
> The root cause of the problem here is adding more memory to the system
> requires memory (page tables, memmaps,..) so if your system is low on
> memory and you're trying to hotplug A LOT you may run into OOM before
> you're able to online anything. With bare metal it's usually not the
> case: servers, which are able to hotplug memory, are usually booted with
> enough memory and memory hotplug is a manual action (you need to insert
> DIMMs!). But, if you boot your server with e.g. 4G, almost exhaust it
> and then try to hotplug e.g. 256G ... well, OOM is almost guaranteed.

Thanks for this detailed explanation.

I finally understand why this is a problem on HyperV. But with the current
mechanism, it can happen on any system if things are done like this.
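
To put rough numbers on it, here is a back-of-the-envelope sketch. It
assumes 4 KiB base pages and a 64-byte struct page, which is typical on
x86_64 but not guaranteed; the macro and function names are made up for
the example, and page tables are not counted.

```c
#include <assert.h>

#define PAGE_SIZE_BYTES   4096ULL	/* assumption: 4 KiB base pages */
#define STRUCT_PAGE_BYTES 64ULL		/* assumption: typical on x86_64 */

/* Bytes of memmap (struct page array) needed to describe mem_bytes of RAM. */
static unsigned long long memmap_bytes(unsigned long long mem_bytes)
{
	return mem_bytes / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES;
}
```

Under these assumptions, hot adding 256G needs about 4G just for the
memmap, which is the whole of a 4G boot memory before anything has been
onlined.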

Is there a reason HyperV needs to boot with a small amount of memory and
then enlarge it by a huge amount? Since it's a real case on HyperV, I
guess there must be a reason; I am just curious.

> With virtual machines it's very common (e.g. with Hyper-V VMs) to boot
> them with low memory and hotplug it (automatically, by some management
> software) when needed, thus the problem is way more common.

Re: [PATCH v2 0/8] mm/memory_hotplug: allow to specify a default online_type

2020-03-18 Thread Baoquan He

On 03/17/20 at 11:49am, David Hildenbrand wrote:
> Distributions nowadays use udev rules ([1] [2]) to specify if and
> how to online hotplugged memory. The rules seem to get more complex with
> many special cases. Due to the various special cases,
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
> is handled via udev rules.
> 
> Everytime we hotplug memory, the udev rule will come to the same
> conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
> memory in separate memory blocks and wait for memory to get onlined by user
> space before continuing to add more memory blocks (to not add memory faster
> than it is getting onlined). This of course slows down the whole memory
> hotplug process.
> 
> To make the job of distributions easier and to avoid udev rules that get
> more and more complicated, let's extend the mechanism provided by
> - /sys/devices/system/memory/auto_online_blocks
> - "memhp_default_state=" on the kernel cmdline
> to be able to specify also "online_movable" as well as "online_kernel"

This patch series looks good, thanks. Since Andrew has merged it into -mm
again, I won't bother adding my Reviewed-by.

Hi David, Vitaly

There are several things unclear to me.

So, are these improved interfaces meant to alleviate the burden of the
existing udev rules, or to replace them? As you know, we have been
using udev rules to interact between kernel and user space on bare metal,
and on guests that want to hot add/remove memory.

And also the OOM issue on HyperV when onlining pages after adding memory
blocks. I am not a virt development expert; could this happen on a bare
metal system?

Thanks
Baoquan

Re: [PATCH v2 0/8] mm/memory_hotplug: allow to specify a default online_type

2020-03-18 Thread Baoquan He

On 03/18/20 at 02:50pm, David Hildenbrand wrote:
> On 18.03.20 14:05, Baoquan He wrote:
> > On 03/17/20 at 11:49am, David Hildenbrand wrote:
> >> Distributions nowadays use udev rules ([1] [2]) to specify if and
> >> how to online hotplugged memory. The rules seem to get more complex with
> >> many special cases. Due to the various special cases,
> >> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
> >> is handled via udev rules.
> >>
> >> Everytime we hotplug memory, the udev rule will come to the same
> >> conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
> >> memory in separate memory blocks and wait for memory to get onlined by user
> >> space before continuing to add more memory blocks (to not add memory faster
> >> than it is getting onlined). This of course slows down the whole memory
> >> hotplug process.
> >>
> >> To make the job of distributions easier and to avoid udev rules that get
> >> more and more complicated, let's extend the mechanism provided by
> >> - /sys/devices/system/memory/auto_online_blocks
> >> - "memhp_default_state=" on the kernel cmdline
> >> to be able to specify also "online_movable" as well as "online_kernel"
> > 
> > This patch series looks good, thanks. Since Andrew has merged it to -mm again,
> > I won't add my Reviewed-by to bother.
> > 
> > Hi David, Vitaly
> > 
> > There are several things unclear to me.
> > 
> > So, these improved interfaces are used to alleviate the burden of the 
> > existing udev rules, or try to replace it? As you know, we have been
> 
> At least in RHEL, my plan is to replace it / use a udev rules as a
> fallback on older kernels (see the example scripts below). But other

OK, got it. I didn't notice that the script and the systemd service are
part of your plan; I thought you were just demonstrating the current
status. Thanks.

> distribution can handle it as they want.
> 
> > using udev rules to interact between kernel and user space on bare metal,
> > and guests who want to hot add/remove.
> >
> > And also the OOM issue in hyperV when onlining pages after adding memory
> > block. I am not a virt devel expert, could this happen on bare metal
> > system?
> 
> Don't think it's relevant on bare metal. If you plug a big DIMM, all
> memory blocks will be added first in one shot and then all memory blocks
> will be onlined. So it doesn't matter "how fast" you online that memory.
> 
> In contrast, Hyper-V (and virtio-mem) add one (or a limited number of)
> memory block at a time and wait for them to get onlined.
> 
> -- 
> Thanks,
> 
> David / dhildenb

Re: [PATCH v3 3/8] drivers/base/memory: store mapping between MMOP_* and string in an array

2020-03-20 Thread Baoquan He

On 03/19/20 at 02:12pm, David Hildenbrand wrote:
> Let's use a simple array which we can reuse soon. While at it, move the
> string->mmop conversion out of the device hotplug lock.
> 
> Reviewed-by: Wei Yang 
> Acked-by: Michal Hocko 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: "Rafael J. Wysocki" 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/base/memory.c | 38 +++---
>  1 file changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index e7e77cafef80..8a7f29c0bf97 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -28,6 +28,24 @@
>  
>  #define MEMORY_CLASS_NAME "memory"
>  
> +static const char *const online_type_to_str[] = {
> + [MMOP_OFFLINE] = "offline",
> + [MMOP_ONLINE] = "online",
> + [MMOP_ONLINE_KERNEL] = "online_kernel",
> + [MMOP_ONLINE_MOVABLE] = "online_movable",
> +};
> +
> +static int memhp_online_type_from_str(const char *str)
> +{
> + int i;

I would change it to:

for (int i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {

> +
> + for (i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
> + if (sysfs_streq(str, online_type_to_str[i]))
> + return i;
> + }
> + return -EINVAL;
> +}
> +
>  #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>  
>  static int sections_per_block;
> @@ -236,26 +254,17 @@ static int memory_subsys_offline(struct device *dev)
>  static ssize_t state_store(struct device *dev, struct device_attribute *attr,
>  const char *buf, size_t count)
>  {
> + const int online_type = memhp_online_type_from_str(buf);
>   struct memory_block *mem = to_memory_block(dev);
> - int ret, online_type;
> + int ret;
> +
> + if (online_type < 0)
> + return -EINVAL;
>  
>   ret = lock_device_hotplug_sysfs();
>   if (ret)
>   return ret;
>  
> - if (sysfs_streq(buf, "online_kernel"))
> - online_type = MMOP_ONLINE_KERNEL;
> - else if (sysfs_streq(buf, "online_movable"))
> - online_type = MMOP_ONLINE_MOVABLE;
> - else if (sysfs_streq(buf, "online"))
> - online_type = MMOP_ONLINE;
> - else if (sysfs_streq(buf, "offline"))
> - online_type = MMOP_OFFLINE;
> - else {
> - ret = -EINVAL;
> - goto err;
> - }
> -
>   switch (online_type) {
>   case MMOP_ONLINE_KERNEL:
>   case MMOP_ONLINE_MOVABLE:
> @@ -271,7 +280,6 @@ static ssize_t state_store(struct device *dev, struct 
> device_attribute *attr,
>   ret = -EINVAL; /* should never happen */
>   }
>  
> -err:
>   unlock_device_hotplug();
>  
>   if (ret < 0)
> -- 
> 2.24.1
>
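
For readers following the thread, here is a minimal userspace sketch of the
table-driven string-to-type lookup that the patch introduces. It is not the
kernel code: the DEMO_* enum values, demo_streq() and
demo_online_type_from_str() are hypothetical names, demo_streq() only
approximates the newline-tolerant behaviour of sysfs_streq(), and -1 stands
in for -EINVAL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's MMOP_* values. */
enum demo_mmop {
    DEMO_MMOP_OFFLINE = 0,
    DEMO_MMOP_ONLINE,
    DEMO_MMOP_ONLINE_KERNEL,
    DEMO_MMOP_ONLINE_MOVABLE
};

/* One table maps each type to its string, indexed by the enum value. */
static const char *const demo_type_to_str[] = {
    [DEMO_MMOP_OFFLINE] = "offline",
    [DEMO_MMOP_ONLINE] = "online",
    [DEMO_MMOP_ONLINE_KERNEL] = "online_kernel",
    [DEMO_MMOP_ONLINE_MOVABLE] = "online_movable",
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Userspace approximation of sysfs_streq(): the strings compare equal if
 * they are identical, or if one of them has a single trailing newline.
 */
static bool demo_streq(const char *s1, const char *s2)
{
    while (*s1 && *s1 == *s2) {
        s1++;
        s2++;
    }
    if (*s1 == *s2)
        return true;
    if (!*s1 && *s2 == '\n' && !s2[1])
        return true;
    if (*s1 == '\n' && !s1[1] && !*s2)
        return true;
    return false;
}

/* Returns the table index of str, or -1 if str is not a known type. */
static int demo_online_type_from_str(const char *str)
{
    size_t i;

    for (i = 0; i < DEMO_ARRAY_SIZE(demo_type_to_str); i++) {
        if (demo_streq(str, demo_type_to_str[i]))
            return (int)i;
    }
    return -1;
}
```

Because the lookup needs no shared state, it can run before any lock is
taken, which is the reordering the patch description mentions.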

Re: [PATCH v3 3/8] drivers/base/memory: store mapping between MMOP_* and string in an array

2020-03-20 Thread Baoquan He

On 03/20/20 at 10:50am, David Hildenbrand wrote:
> On 20.03.20 08:36, Baoquan He wrote:
> > On 03/19/20 at 02:12pm, David Hildenbrand wrote:
> >> Let's use a simple array which we can reuse soon. While at it, move the
> >> string->mmop conversion out of the device hotplug lock.
> >>
> >> Reviewed-by: Wei Yang 
> >> Acked-by: Michal Hocko 
> >> Cc: Greg Kroah-Hartman 
> >> Cc: Andrew Morton 
> >> Cc: Michal Hocko 
> >> Cc: Oscar Salvador 
> >> Cc: "Rafael J. Wysocki" 
> >> Cc: Baoquan He 
> >> Cc: Wei Yang 
> >> Signed-off-by: David Hildenbrand 
> >> ---
> >>  drivers/base/memory.c | 38 +++---
> >>  1 file changed, 23 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >> index e7e77cafef80..8a7f29c0bf97 100644
> >> --- a/drivers/base/memory.c
> >> +++ b/drivers/base/memory.c
> >> @@ -28,6 +28,24 @@
> >>  
> >>  #define MEMORY_CLASS_NAME "memory"
> >>  
> >> +static const char *const online_type_to_str[] = {
> >> +  [MMOP_OFFLINE] = "offline",
> >> +  [MMOP_ONLINE] = "online",
> >> +  [MMOP_ONLINE_KERNEL] = "online_kernel",
> >> +  [MMOP_ONLINE_MOVABLE] = "online_movable",
> >> +};
> >> +
> >> +static int memhp_online_type_from_str(const char *str)
> >> +{
> >> +  int i;
> > 
> > I would change it as: 
> > 
> > for (int i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
> > 
> 
> That's not allowed by the C90 standard (and -std=gnu89).
> 
> $ gcc main.c -std=gnu89
> main.c: In function 'main':
> main.c:3:2: error: 'for' loop initial declarations are only allowed in
> C99 or C11 mode
> 3 |  for (int i = 0; i < 8; i++) {
>   |  ^~~

Good to know, thanks.
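
For reference, a small sketch of the loop form that builds in both modes:
the index is declared at the top of the block instead of inside the for
statement. demo_count_nonempty() is a hypothetical helper written for this
note, not something from the patch.

```c
#include <stddef.h>

/*
 * Counts the non-empty strings in a table. The loop index is declared at
 * the top of the block rather than in the for statement, so the function
 * is also accepted with -std=gnu89.
 */
static size_t demo_count_nonempty(const char *const *table, size_t n)
{
    size_t i;
    size_t count = 0;

    for (i = 0; i < n; i++) {
        if (table[i] && table[i][0] != '\0')
            count++;
    }
    return count;
}
```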

> 
> One of the reasons why
>   git grep "for (int "
> 
> will result in very few hits (IOW, only 5, all in driver code).
> 
> -- 
> Thanks,
> 
> David / dhildenb

Re: [PATCH v3 0/8] mm/memory_hotplug: allow to specify a default online_type

2020-03-20 Thread Baoquan He

On 03/19/20 at 02:12pm, David Hildenbrand wrote:
> Distributions nowadays use udev rules ([1] [2]) to specify if and
> how to online hotplugged memory. The rules seem to get more complex with
> many special cases. Due to the various special cases,
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
> is handled via udev rules.
> 
> Every time we hotplug memory, the udev rule will come to the same
> conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
> memory in separate memory blocks and wait for memory to get onlined by user
> space before continuing to add more memory blocks (to not add memory faster
> than it is getting onlined). This of course slows down the whole memory
> hotplug process.
> 
> To make the job of distributions easier and to avoid udev rules that get
> more and more complicated, let's extend the mechanism provided by
> - /sys/devices/system/memory/auto_online_blocks
> - "memhp_default_state=" on the kernel cmdline
> to be able to specify also "online_movable" as well as "online_kernel"
> 
> v2 -> v3:
> - "hv_balloon: don't check for memhp_auto_online manually"
> -- init_completion() before register_memory_notifier()
> - Minor typo fix
> 
> v1 -> v2:
> - Tweaked some patch descriptions
> - Added
> -- "powernv/memtrace: always online added memory blocks"
> -- "hv_balloon: don't check for memhp_auto_online manually"
> -- "mm/memory_hotplug: unexport memhp_auto_online"
> - "mm/memory_hotplug: convert memhp_auto_online to store an online_type"
> -- No longer touches hv/memtrace code

Ack the series.

Reviewed-by: Baoquan He
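
As an illustration of how user space might use the extended interface, here
is a hedged sketch that validates a policy string and writes it to a file.
demo_set_auto_online() is a hypothetical helper; the path is passed in so
the code can be tried on an ordinary file. Writing "online_kernel" or
"online_movable" to the real sysfs file assumes a kernel that carries this
series, and needs sufficient privileges.

```c
#include <stdio.h>
#include <string.h>

/*
 * Writes one of the policy strings discussed in the cover letter to path.
 * Returns 0 on success, -1 on an unknown policy or an I/O error.
 */
static int demo_set_auto_online(const char *path, const char *policy)
{
    static const char *const allowed[] = {
        "offline", "online", "online_kernel", "online_movable"
    };
    size_t i;
    int ok = 0;
    FILE *f;

    /* Reject anything that is not one of the four known strings. */
    for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (strcmp(policy, allowed[i]) == 0)
            ok = 1;
    }
    if (!ok)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(policy, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

On a real system the call would look like
demo_set_auto_online("/sys/devices/system/memory/auto_online_blocks",
"online_movable").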

Re: [5.6.0-rc7] Kernel crash while running ndctl tests

2020-03-24 Thread Baoquan He

Hi Sachin,

On 03/24/20 at 11:25am, Sachin Sant wrote:
> While running ndctl[1] tests against 5.6.0-rc7 following crash is encountered.
> 
> Bisect leads me to  commit d41e2f3bd546 
> mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
> 
> Reverting this commit helps and the tests complete without any crash.

Could you paste your kernel config and the boot log?

If it's confidential, a private attachment is also OK.

Thanks
Baoquan

> 
> pmem0: detected capacity change from 0 to 10720641024
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in: dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
> libcrc32c ip6_tables nft_compat ip_set rfkill nf_tables nfnetlink sunrpc sg 
> pseries_rng papr_scm uio_pdrv_genirq uio sch_fq_codel ip_tables sd_mod t10_pi 
> ibmvscsi scsi_transport_srp ibmveth
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> NIP:  c0c3447c LR: c0088354 CTR: c018e990
> REGS: c006223fb630 TRAP: 0300   Not tainted  (5.6.0-rc7-autotest)
> MSR:  8280b033   CR: 2404  XER: 
> 
> CFAR: c000dec4 DAR:  DSISR: 4000 IRQMASK: 0 
> GPR00: c03c5820 c006223fb8c0 c1684900 0400 
> GPR04: c00c00010100 07ff c0067ff20900 c00c 
> GPR08:  c00c0001  c3f0 
> GPR12: 8000 c0001ec70200 7fffc102f9e8 1002e088 
> GPR16:  10050d88 1002f778 1002f770 
> GPR20:  0100 0001 1000 
> GPR24: 0008  0400 c00c00014000 
> GPR28: c3101aa0 c00c0001 0100 04000100 
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
> [c006223fb8c0] [c006223fb960] 0xc006223fb960 (unreliable)
> [c006223fb980] [c03c5820] section_deactivate+0x220/0x240
> [c006223fba30] [c03dc1d8] __remove_pages+0x118/0x170
> [c006223fba80] [c0086e5c] arch_remove_memory+0x3c/0x150
> [c006223fbb00] [c041a3bc] memunmap_pages+0x1cc/0x2f0
> [c006223fbb80] [c07d6d00] devm_action_release+0x30/0x50
> [c006223fbba0] [c07d7de8] release_nodes+0x2f8/0x3e0
> [c006223fbc50] [c07d0b38] 
> device_release_driver_internal+0x168/0x270
> [c006223fbc90] [c07ccf50] unbind_store+0x130/0x170
> [c006223fbcd0] [c07cc0b4] drv_attr_store+0x44/0x60
> [c006223fbcf0] [c051fdb8] sysfs_kf_write+0x68/0x80
> [c006223fbd10] [c051f200] kernfs_fop_write+0x100/0x290
> [c006223fbd60] [c042037c] __vfs_write+0x3c/0x70
> [c006223fbd80] [c042404c] vfs_write+0xcc/0x240
> [c006223fbdd0] [c042442c] ksys_write+0x7c/0x140
> [c006223fbe20] [c000b278] system_call+0x5c/0x68
> Instruction dump:
> 2ea8 4196003c 794a2428 7d685215 41820030 7d48502a 71480002 41820024 
> 714a0008 4082002c e90b0008 786adf62  7c635436 70630001 4c820020 
> ---[ end trace 579b48162da1b890 ]---
> 
> Thanks
> -Sachin
> 
> [1] 
> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py
>

Re: [5.6.0-rc7] Kernel crash while running ndctl tests

2020-03-24 Thread Baoquan He

On 03/24/20 at 03:06pm, Sachin Sant wrote:
> 
> 
> > On 24-Mar-2020, at 2:45 PM, Aneesh Kumar K.V  
> > wrote:
> > 
> > Sachin Sant  writes:
> > 
> >> While running ndctl[1] tests against 5.6.0-rc7 following crash is 
> >> encountered.
> >> 
> >> Bisect leads me to  commit d41e2f3bd546 
> >> mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
> >> 
> >> Reverting this commit helps and the tests complete without any crash.
> > 
> > 
> > Can you try this change?
> > 
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index aadb7298dcef..3012d1f3771a 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> > unsigned long nr_pages,
> > ms->usage = NULL;
> > }
> > memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > +   /* Mark the section invalid */
> > +   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > }
> > 
> > if (section_is_early && memmap)
> > 
> 
> This patch works for me. The test ran successfully without any crash/failure.

Hi Aneesh,

Could you make a formal patch to post, since Sachin has tested and
confirmed it works?

> 
> Thanks
> -Sachin
> 
> > A pfn_valid() check involves a pfn_section_valid() check if the section
> > has a MEM_MAP. In this case we ended up setting ms->usage = NULL.
> > So when we do that, update the section to not have a MEM_MAP.
> > 
> > -aneesh
>
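
To make Aneesh's explanation concrete, here is a simplified userspace model
of the interaction. It is only a sketch: the struct layouts, the flag value
and every demo_* name are hypothetical, and the real pfn_valid() does more
than this. The point it shows is that the validity check dereferences
ms->usage only when the "has mem_map" flag is set, so the flag has to be
cleared when usage is set to NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for SECTION_HAS_MEM_MAP. */
#define DEMO_SECTION_HAS_MEM_MAP (1UL << 1)

struct demo_usage {
    unsigned long subsection_map;   /* one bit per populated subsection */
};

struct demo_section {
    unsigned long section_mem_map;  /* flags live in the low bits */
    struct demo_usage *usage;
};

static bool demo_valid_section(const struct demo_section *ms)
{
    return ms && (ms->section_mem_map & DEMO_SECTION_HAS_MEM_MAP);
}

/* Model of the check: usage is only dereferenced if the flag is set. */
static bool demo_pfn_valid(const struct demo_section *ms,
                           unsigned int subsection)
{
    if (!demo_valid_section(ms))
        return false;
    return (ms->usage->subsection_map >> subsection) & 1UL;
}

/*
 * Model of deactivation. With clear_flag == false this mirrors the state
 * described in the report: usage is NULL but the flag still claims the
 * section is valid, so a later demo_pfn_valid() would dereference NULL.
 */
static void demo_section_deactivate(struct demo_section *ms, bool clear_flag)
{
    ms->usage = NULL;
    if (clear_flag)
        ms->section_mem_map &= ~DEMO_SECTION_HAS_MEM_MAP;
}
```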

Re: [PATCH] mm/hugetlb: Fix build failure with HUGETLB_PAGE but not HUGEBTLBFS

2020-03-17 Thread Baoquan He

On 03/17/20 at 08:04am, Christophe Leroy wrote:
> When CONFIG_HUGETLB_PAGE is set but not CONFIG_HUGETLBFS, the
> following build failure is encountered:

From the definition of HUGETLB_PAGE, doesn't it depend on HUGETLBFS?
I may be misunderstanding def_bool; please correct me if I am wrong.

config HUGETLB_PAGE
	def_bool HUGETLBFS

> 
> In file included from arch/powerpc/mm/fault.c:33:0:
> ./include/linux/hugetlb.h: In function 'hstate_inode':
> ./include/linux/hugetlb.h:477:9: error: implicit declaration of function 
> 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
>   return HUGETLBFS_SB(i->i_sb)->hstate;
>  ^
> ./include/linux/hugetlb.h:477:30: error: invalid type argument of '->' (have 
> 'int')
>   return HUGETLBFS_SB(i->i_sb)->hstate;
>   ^
> 
> Gate hstate_inode() with CONFIG_HUGETLBFS instead of CONFIG_HUGETLB_PAGE.
> 
> Reported-by: kbuild test robot 
> Link: https://patchwork.ozlabs.org/patch/1255548/#2386036
> Fixes: a137e1cc6d6e ("hugetlbfs: per mount huge page sizes")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Christophe Leroy 
> ---
>  include/linux/hugetlb.h | 19 ---
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1e897e4168ac..dafb3d70ff81 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -390,7 +390,10 @@ static inline bool is_file_hugepages(struct file *file)
>   return is_file_shm_hugepages(file);
>  }
>  
> -
> +static inline struct hstate *hstate_inode(struct inode *i)
> +{
> + return HUGETLBFS_SB(i->i_sb)->hstate;
> +}
>  #else /* !CONFIG_HUGETLBFS */
>  
>  #define is_file_hugepages(file)  false
> @@ -402,6 +405,10 @@ hugetlb_file_setup(const char *name, size_t size, 
> vm_flags_t acctflag,
>   return ERR_PTR(-ENOSYS);
>  }
>  
> +static inline struct hstate *hstate_inode(struct inode *i)
> +{
> + return NULL;
> +}
>  #endif /* !CONFIG_HUGETLBFS */
>  
>  #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
> @@ -472,11 +479,6 @@ extern unsigned int default_hstate_idx;
>  
>  #define default_hstate (hstates[default_hstate_idx])
>  
> -static inline struct hstate *hstate_inode(struct inode *i)
> -{
> - return HUGETLBFS_SB(i->i_sb)->hstate;
> -}
> -
>  static inline struct hstate *hstate_file(struct file *f)
>  {
>   return hstate_inode(file_inode(f));
> @@ -729,11 +731,6 @@ static inline struct hstate *hstate_vma(struct 
> vm_area_struct *vma)
>   return NULL;
>  }
>  
> -static inline struct hstate *hstate_inode(struct inode *i)
> -{
> - return NULL;
> -}
> -
>  static inline struct hstate *page_hstate(struct page *page)
>  {
>   return NULL;
> -- 
> 2.25.0
> 
>
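
The pattern the patch relies on, a real accessor under one config option and
a NULL-returning stub otherwise, can be sketched outside the kernel as
follows. DEMO_CONFIG_FS and all demo_* types are hypothetical names for
illustration; they only mirror the shape of the hstate_inode() change.

```c
#include <stddef.h>

struct demo_hstate { unsigned int order; };
struct demo_sb_info { struct demo_hstate *hstate; };
struct demo_inode { struct demo_sb_info *sb_info; };

#ifdef DEMO_CONFIG_FS
/* Real accessor: only compiled when the filesystem support exists. */
static inline struct demo_hstate *demo_hstate_inode(struct demo_inode *i)
{
    return i->sb_info->hstate;
}
#else /* !DEMO_CONFIG_FS */
/* Stub: callers still compile and link, and simply get NULL back. */
static inline struct demo_hstate *demo_hstate_inode(struct demo_inode *i)
{
    (void)i;
    return NULL;
}
#endif /* !DEMO_CONFIG_FS */
```

Keeping both variants under the same guard as the helper they depend on
means the accessor can never be built without that helper.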

Re: [PATCH 10/15] memblock: make memblock_debug and related functionality private

2020-07-29 Thread Baoquan He

On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The only user of memblock_dbg() outside memblock was s390 setup code and it
> is converted to use pr_debug() instead.
> This allows to stop exposing memblock_debug and memblock_dbg() to the rest
> of the kernel.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/s390/kernel/setup.c |  4 ++--
>  include/linux/memblock.h | 12 +---
>  mm/memblock.c| 13 +++--
>  3 files changed, 14 insertions(+), 15 deletions(-)

Nice clean up.

Reviewed-by: Baoquan He 

> 
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 07aa15ba43b3..8b284cf6e199 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -776,8 +776,8 @@ static void __init memblock_add_mem_detect_info(void)
>   unsigned long start, end;
>   int i;
>  
> - memblock_dbg("physmem info source: %s (%hhd)\n",
> -  get_mem_info_source(), mem_detect.info_source);
> + pr_debug("physmem info source: %s (%hhd)\n",
> +  get_mem_info_source(), mem_detect.info_source);
>   /* keep memblock lists close to the kernel */
>   memblock_set_bottom_up(true);
>   for_each_mem_detect_block(i, &start, &end) {
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 220b5f0dad42..e6a23b3db696 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -90,7 +90,6 @@ struct memblock {
>  };
>  
>  extern struct memblock memblock;
> -extern int memblock_debug;
>  
>  #ifndef CONFIG_ARCH_KEEP_MEMBLOCK
>  #define __init_memblock __meminit
> @@ -102,9 +101,6 @@ void memblock_discard(void);
>  static inline void memblock_discard(void) {}
>  #endif
>  
> -#define memblock_dbg(fmt, ...) \
> - if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> -
>  phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
>  phys_addr_t size, phys_addr_t align);
>  void memblock_allow_resize(void);
> @@ -456,13 +452,7 @@ bool memblock_is_region_memory(phys_addr_t base, 
> phys_addr_t size);
>  bool memblock_is_reserved(phys_addr_t addr);
>  bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
>  
> -extern void __memblock_dump_all(void);
> -
> -static inline void memblock_dump_all(void)
> -{
> - if (memblock_debug)
> - __memblock_dump_all();
> -}
> +void memblock_dump_all(void);
>  
>  /**
>   * memblock_set_current_limit - Set the current allocation limit to allow
> diff --git a/mm/memblock.c b/mm/memblock.c
> index a5b9b3df81fc..824938849f6d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -134,7 +134,10 @@ struct memblock memblock __initdata_memblock = {
>i < memblock_type->cnt;\
>i++, rgn = &memblock_type->regions[i])
>  
> -int memblock_debug __initdata_memblock;
> +#define memblock_dbg(fmt, ...) \
> + if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> +
> +static int memblock_debug __initdata_memblock;
>  static bool system_has_some_mirror __initdata_memblock = false;
>  static int memblock_can_resize __initdata_memblock;
>  static int memblock_memory_in_slab __initdata_memblock = 0;
> @@ -1919,7 +1922,7 @@ static void __init_memblock memblock_dump(struct 
> memblock_type *type)
>   }
>  }
>  
> -void __init_memblock __memblock_dump_all(void)
> +static void __init_memblock __memblock_dump_all(void)
>  {
>   pr_info("MEMBLOCK configuration:\n");
>   pr_info(" memory size = %pa reserved size = %pa\n",
> @@ -1933,6 +1936,12 @@ void __init_memblock __memblock_dump_all(void)
>  #endif
>  }
>  
> +void __init_memblock memblock_dump_all(void)
> +{
> + if (memblock_debug)
> + __memblock_dump_all();
> +}
> +
>  void __init memblock_allow_resize(void)
>  {
>   memblock_can_resize = 1;
> -- 
> 2.26.2
> 
>
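
The shape of this cleanup, a file-private debug flag with one public wrapper
that checks it, can be sketched in plain C as below. All demo_* names are
hypothetical, printf() stands in for pr_info(), and demo_set_debug() is an
assumption added only so the flag can be toggled in the example.

```c
#include <stdio.h>

/* Private to this file, like memblock_debug after the patch. */
static int demo_debug;

/* Counts how often the dump really ran; for illustration only. */
static unsigned int demo_dump_count;

/* The unconditional dump is no longer visible outside this file. */
static void demo_dump_all_unconditional(void)
{
    demo_dump_count++;
    printf("demo configuration dump #%u\n", demo_dump_count);
}

/* The only public entry point; it checks the private flag itself. */
void demo_dump_all(void)
{
    if (demo_debug)
        demo_dump_all_unconditional();
}

/* Hypothetical setter so the example can turn debugging on. */
void demo_set_debug(int on)
{
    demo_debug = on;
}
```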

Re: [PATCH 11/15] memblock: reduce number of parameters in for_each_mem_range()

2020-07-29 Thread Baoquan He

On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Currently for_each_mem_range() iterator is the most generic way to traverse
> memblock regions. As such, it has 8 parameters and it is hardly convenient
> to users. Most users choose to utilize one of its wrappers and the only
> user that actually needs most of the parameters outside memblock is s390
> crash dump implementation.
> 
> To avoid yet another naming for memblock iterators, rename the existing
> for_each_mem_range() to __for_each_mem_range() and add a new
> for_each_mem_range() wrapper with only index, start and end parameters.
> 
> The new wrapper nicely fits into init_unavailable_mem() and will be used in
> upcoming changes to simplify memblock traversals.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  .clang-format  |  1 +
>  arch/arm64/kernel/machine_kexec_file.c |  6 ++
>  arch/s390/kernel/crash_dump.c  |  8 
>  include/linux/memblock.h   | 18 ++
>  mm/page_alloc.c|  3 +--
>  5 files changed, 22 insertions(+), 14 deletions(-)

Reviewed-by: Baoquan He 
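
For illustration, the idea of keeping a fully parameterised iterator and
adding a short wrapper on top of it can be sketched in userspace like this.
It is not the memblock implementation: the range table, the flags field, the
sentinel and all demo_* names are hypothetical, and the long form is called
demo_for_each_range_full because double-underscore names are reserved
outside the kernel.

```c
#include <stddef.h>

struct demo_range {
    unsigned long start, end;
    unsigned int flags;
};

static const struct demo_range demo_memory[] = {
    { 0x0000, 0x1000, 0 },
    { 0x2000, 0x3000, 1 },  /* flagged range, skipped by the wrapper */
    { 0x4000, 0x6000, 0 },
};

#define DEMO_NR_RANGES (sizeof(demo_memory) / sizeof(demo_memory[0]))
#define DEMO_END ((size_t)-1)

/*
 * Finds the next range at or after *idx whose flags match, stores its
 * bounds, and leaves *idx pointing past it. Sets *idx to DEMO_END when
 * no further range matches.
 */
static void demo_next_range(size_t *idx, const struct demo_range *tbl,
                            size_t n, unsigned int flags,
                            unsigned long *p_start, unsigned long *p_end)
{
    size_t i = *idx;

    while (i < n && tbl[i].flags != flags)
        i++;
    if (i >= n) {
        *idx = DEMO_END;
        return;
    }
    if (p_start)
        *p_start = tbl[i].start;
    if (p_end)
        *p_end = tbl[i].end;
    *idx = i + 1;
}

/* Generic form: every knob is a parameter. */
#define demo_for_each_range_full(i, tbl, n, flags, p_start, p_end)      \
    for (i = 0, demo_next_range(&i, tbl, n, flags, p_start, p_end);     \
         i != DEMO_END;                                                 \
         demo_next_range(&i, tbl, n, flags, p_start, p_end))

/* Short form: only index, start and end, with defaults filled in. */
#define demo_for_each_range(i, p_start, p_end)                          \
    demo_for_each_range_full(i, demo_memory, DEMO_NR_RANGES, 0,         \
                             p_start, p_end)

/* Sums the sizes of all unflagged ranges using the short form. */
static unsigned long demo_total_size(void)
{
    size_t i;
    unsigned long start = 0, end = 0, total = 0;

    demo_for_each_range(i, &start, &end)
        total += end - start;
    return total;
}
```

Most callers then need only the three-argument form, while the rare caller
that needs a different table or flags can still use the long one.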

> 
> diff --git a/.clang-format b/.clang-format
> index a0a96088c74f..52ededab25ce 100644
> --- a/.clang-format
> +++ b/.clang-format
> @@ -205,6 +205,7 @@ ForEachMacros:
>- 'for_each_memblock_type'
>- 'for_each_memcg_cache_index'
>- 'for_each_mem_pfn_range'
> +  - '__for_each_mem_range'
>- 'for_each_mem_range'
>- 'for_each_mem_range_rev'
>- 'for_each_migratetype_order'
> diff --git a/arch/arm64/kernel/machine_kexec_file.c 
> b/arch/arm64/kernel/machine_kexec_file.c
> index 361a1143e09e..5b0e67b93cdc 100644
> --- a/arch/arm64/kernel/machine_kexec_file.c
> +++ b/arch/arm64/kernel/machine_kexec_file.c
> @@ -215,8 +215,7 @@ static int prepare_elf_headers(void **addr, unsigned long 
> *sz)
>   phys_addr_t start, end;
>  
>   nr_ranges = 1; /* for exclusion of crashkernel region */
> - for_each_mem_range(i, , NULL, NUMA_NO_NODE,
> - MEMBLOCK_NONE, , , NULL)
> + for_each_mem_range(i, &start, &end)
>   nr_ranges++;
>  
>   cmem = kmalloc(struct_size(cmem, ranges, nr_ranges), GFP_KERNEL);
> @@ -225,8 +224,7 @@ static int prepare_elf_headers(void **addr, unsigned long 
> *sz)
>  
>   cmem->max_nr_ranges = nr_ranges;
>   cmem->nr_ranges = 0;
> - for_each_mem_range(i, , NULL, NUMA_NO_NODE,
> - MEMBLOCK_NONE, , , NULL) {
> + for_each_mem_range(i, &start, &end) {
>   cmem->ranges[cmem->nr_ranges].start = start;
>   cmem->ranges[cmem->nr_ranges].end = end - 1;
>   cmem->nr_ranges++;
> diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
> index f96a5857bbfd..e28085c725ff 100644
> --- a/arch/s390/kernel/crash_dump.c
> +++ b/arch/s390/kernel/crash_dump.c
> @@ -549,8 +549,8 @@ static int get_mem_chunk_cnt(void)
>   int cnt = 0;
>   u64 idx;
>  
> - for_each_mem_range(idx, , _type, NUMA_NO_NODE,
> -MEMBLOCK_NONE, NULL, NULL, NULL)
> + __for_each_mem_range(idx, , _type, NUMA_NO_NODE,
> +  MEMBLOCK_NONE, NULL, NULL, NULL)
>   cnt++;
>   return cnt;
>  }
> @@ -563,8 +563,8 @@ static void loads_init(Elf64_Phdr *phdr, u64 loads_offset)
>   phys_addr_t start, end;
>   u64 idx;
>  
> - for_each_mem_range(idx, , _type, NUMA_NO_NODE,
> -MEMBLOCK_NONE, , , NULL) {
> + __for_each_mem_range(idx, , _type, NUMA_NO_NODE,
> +  MEMBLOCK_NONE, , , NULL) {
>   phdr->p_filesz = end - start;
>   phdr->p_type = PT_LOAD;
>   phdr->p_offset = start;
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index e6a23b3db696..d70c2835e913 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -142,7 +142,7 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t 
> *out_start,
>  void __memblock_free_late(phys_addr_t base, phys_addr_t size);
>  
>  /**
> - * for_each_mem_range - iterate through memblock areas from type_a and not
> + * __for_each_mem_range - iterate through memblock areas from type_a and not
>   * included in type_b. Or just type_a if type_b is NULL.
>   * @i: u64 used as loop variable
>   * @type_a: ptr to memblock_type to iterate
> @@ -153,7 +153,7 @@ void __memblock_free_late(phys_addr_t base, phys_addr_t 
> size);
>   * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
>   * @p_nid: ptr to int for nid of the range, can be %NULL
>   *
