Re: [PATCH v9 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-16 Thread Baoquan He
On 04/16/21 at 09:00am, Oscar Salvador wrote:
...  
> +/*
> + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old 
> one
> + * @h: struct hstate old page belongs to
> + * @old_page: Old page to dissolve
> + * Returns 0 on success, otherwise negated error.
> + */
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
> *old_page)
> +{
> + gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> + int nid = page_to_nid(old_page);
> + struct page *new_page;
> + int ret = 0;
> +
> + /*
> +  * Before dissolving the page, we need to allocate a new one for the
> +  * pool to remain stable. Using alloc_buddy_huge_page() allows us to
> +  * not having to deal with prep_new_page() and avoids dealing of any
   ~ prep_new_huge_page() ?
> +  * counters. This simplifies and let us do the whole thing under the
> +  * lock.
> +  */
> + new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> + if (!new_page)
> + return -ENOMEM;
> +
> +retry:
> + spin_lock_irq(&hugetlb_lock);
...



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-13 Thread Baoquan He
On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> On Mon, Apr 12, 2021 at 2:52 AM Baoquan He  wrote:
> >
> > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> > >
> > >
> > > > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > > >
> > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > > >> Why don't we do this unconditionally? At the very best we gain half a 
> > > >> megabyte of memory (except the trampoline, which has to live there, 
> > > >> but it is only a few kilobytes.)
> > > >
> > > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > > make code simpler. Then the specific caring of real mode in
> > > > efi_free_boot_services() can be removed too.
> > > >
> > >
> > > This whole situation makes me think that the code is buggy before and 
> > > buggy after.
> > >
> > > The issue here (I think) is that various pieces of code want to reserve 
> > > specific pieces of otherwise-available low memory for their own nefarious 
> > > uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
> > > too much.
> >
> > The kdump kernel also needs to go through the real mode code path during
> > bootup. It is no different from a normal kernel except that it skips the
> > firmware resetting. So the kdump kernel needs the low 1M as system RAM just
> > as a normal kernel does. Here we reserve the whole low 1M with
> > memblock_reserve() to avoid any later kernel or driver data residing in
> > this area. Otherwise, we would need to dump the content of this area to
> > vmcore. As we know, when a crash happens, the old memory of the 1st kernel
> > should stay untouched until vmcore dumping has read out its content.
> > Meanwhile, the kdump kernel needs to reuse the low 1M. In the past, we used
> > a backup region to copy out the low 1M area, and mapped the backup region
> > into the low 1M area in the vmcore ELF file. In 6f599d84231fd27
> > ("x86/kdump: Always reserve the low 1M when the crashkernel option is
> > specified"), we changed to lock down the whole low 1M to avoid writing any
> > kernel data into it, so that we can skip this area when dumping vmcore.
> >
> > The above is why we try to memblock-reserve the whole low 1M. We don't want
> > to use it, we just don't want anyone to use it in the 1st kernel.
> >
> > >
> > > I propose that the right solution is to give low-memory-reserving code 
> > > paths two chances to do what they need: once at the very beginning and 
> > > once after EFI boot services are freed.
> > >
> > > Alternatively, just reserve *all* otherwise unused sub 1M memory up 
> > > front, then release it right after releasing boot services, and then 
> > > invoke the special cases exactly once.
> >
> > I am not sure I fully understood both suggested ways. They look a little
> > complicated in our case. As I explained above, we want the whole low 1M
> > locked up, not just one piece or some pieces of it.
> 
> My second suggestion is probably the better one.  Here it is, concretely:
> 
> The early (pre-free_efi_boot_services) code just reserves all
> available sub-1M memory unconditionally, but it specially marks it as
> reserved-but-available-later.  We stop allocating the trampoline page
> at this stage.
> 
> In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> stick it in the pile of reserved memory created in the early step.
> This may involve splitting a block, kind of like the current
> trampoline late allocation works.
> 
> Then, *after* free_efi_boot_services(), we run a single block of code
> that lets everything that wants sub-1M code claim some.  This means
> that the trampoline gets allocated and, if crashkernel wants to claim
> everything else, it can.  After that, everything still unclaimed gets
> freed.

void __init setup_arch(char **cmdline_p)
{
...
efi_reserve_boot_services();
e820__memblock_alloc_reserved_mpc_new();
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check();
#endif
reserve_real_mode();
  

trim_platform_memory_ranges();
trim_low_memory_range();
...
}

After efi_reserve_boot_services(), several functions are called that need to
reserve memory under the low 1M.


asmlinkage __visible void __init __no_sanitize_address start_kernel(void)   
  
{
...
setup_arch(&command_line);
...
mm_init();
-->

Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-12 Thread Baoquan He
On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> 
> 
> > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > 
> > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> >> Why don't we do this unconditionally? At the very best we gain half a 
> >> megabyte of memory (except the trampoline, which has to live there, but it 
> >> is only a few kilobytes.)
> > 
> > This is a great suggestion, thanks. I think we can fix it in this way to
> > make code simpler. Then the specific caring of real mode in
> > efi_free_boot_services() can be removed too.
> > 
> 
> This whole situation makes me think that the code is buggy before and buggy 
> after.
> 
> The issue here (I think) is that various pieces of code want to reserve 
> specific pieces of otherwise-available low memory for their own nefarious 
> uses. I don’t know *why* crash kernel needs this, but that doesn’t matter too 
> much.

The kdump kernel also needs to go through the real mode code path during
bootup. It is no different from a normal kernel except that it skips the
firmware resetting. So the kdump kernel needs the low 1M as system RAM just as
a normal kernel does. Here we reserve the whole low 1M with memblock_reserve()
to avoid any later kernel or driver data residing in this area. Otherwise, we
would need to dump the content of this area to vmcore. As we know, when a
crash happens, the old memory of the 1st kernel should stay untouched until
vmcore dumping has read out its content. Meanwhile, the kdump kernel needs to
reuse the low 1M. In the past, we used a backup region to copy out the low 1M
area, and mapped the backup region into the low 1M area in the vmcore ELF
file. In 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
crashkernel option is specified"), we changed to lock down the whole low 1M to
avoid writing any kernel data into it, so that we can skip this area when
dumping vmcore.

The above is why we try to memblock-reserve the whole low 1M. We don't want to
use it, we just don't want anyone to use it in the 1st kernel.
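
For reference, a minimal sketch of what that lock-down amounts to,
reconstructed from memory of commit 6f599d84231fd27 (the exact body in the
tree may differ slightly):

void __init crash_reserve_low_1M(void)
{
	/* Only lock down the low 1M when a crashkernel= option is present */
	if (cmdline_find_option(boot_command_line, "crashkernel", NULL, 0) < 0)
		return;

	memblock_reserve(0, 1 << 20);
	pr_info("Reserving the low 1M of memory for crashkernel\n");
}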

> 
> I propose that the right solution is to give low-memory-reserving code paths 
> two chances to do what they need: once at the very beginning and once after 
> EFI boot services are freed.
> 
> Alternatively, just reserve *all* otherwise unused sub 1M memory up front, 
> then release it right after releasing boot services, and then invoke the 
> special cases exactly once.

I am not sure I fully understood both suggested ways. They look a little
complicated in our case. As I explained above, we want the whole low 1M locked
up, not just one piece or some pieces of it.

> 
> In either case, the result is that the crashkernel mess gets unified with the 
> trampoline mess.  One way the result is called twice and needs to be more 
> careful, and the other way it’s called only once.
> 
> Just skipping freeing boot services seems wrong.  It doesn’t unmap boot 
> services, and skipping that is incorrect, I think. And it seems to result in 
> a bogus memory map in which the system thinks that some crashkernel memory is 
> EFI memory instead.

I like hpa's idea of locking down the whole low 1M unconditionally, since only
a few KB apart from the trampoline area is there. Rethinking it, doing this in
can_free_region() may be risky because an EFI memory region could cross the 1M
boundary, e.g. [640K, 100M] with type
EFI_BOOT_SERVICES_CODE|EFI_BOOT_SERVICES_DATA; that could cause loss of
memory. Just a wild guess, not very sure whether such crossing of the 1M
boundary can really happen. efi_reserve_boot_services() won't split regions.

If moving efi_reserve_boot_services() after reserve_real_mode() is not
accepted, maybe we can call efi_mem_reserve(0, 1M) just as
efi_esrt_init() has done.
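
Purely as an illustration of that alternative (the placement and the
crashkernel check are my assumptions, not a tested patch), the call could look
like:

	/* Keep the whole low 1M as permanently reserved EFI memory, so that
	 * efi_free_boot_services() never hands it back later. */
	if (cmdline_find_option(boot_command_line, "crashkernel", NULL, 0) > 0)
		efi_mem_reserve(0, SZ_1M);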



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-11 Thread Baoquan He
On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> Why don't we do this unconditionally? At the very best we gain half a 
> megabyte of memory (except the trampoline, which has to live there, but it is 
> only a few kilobytes.)

This is a great suggestion, thanks. I think we can fix it in this way to
make code simpler. Then the specific caring of real mode in
efi_free_boot_services() can be removed too.

Thanks
Baoquan



Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-09 Thread Baoquan He
On 04/07/21 at 10:03pm, Lianbo Jiang wrote:
> Some sub-1MB memory regions may be reserved by EFI boot services, and the
> memory regions will be released later in the efi_free_boot_services().
> 
> Currently, always reserve all sub-1MB memory regions when the crashkernel
> option is specified, but unfortunately EFI boot services may have already
> reserved some sub-1MB memory regions before the crash_reserve_low_1M() is
> called, which makes that the crash_reserve_low_1M() only own the
> remaining sub-1MB memory regions, not all sub-1MB memory regions, because,
> subsequently EFI boot services will free its own sub-1MB memory regions.
> Eventually, DMA will be able to allocate memory from the sub-1MB area and
> cause the following error:
> 

So this patch is fixing a problem found with the crash utility. We have met a
similar issue before; it was later fixed by always reserving the low 1M in
commit 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
crashkernel option is specified"). It seems that commit does not fix it
completely.

> crash> kmem -s |grep invalid
> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
> 9403c0067300
> kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
> 9403c0067300
> crash> vtop 9403c0067300
> VIRTUAL   PHYSICAL
> 9403c0067300  67300   --->The physical address falls into this range 
> [0x00063000-0x0008efff]
> 
> kernel debugging log:
> ...
> [0.008927] memblock_reserve: [0x0001-0x00013fff] 
> efi_reserve_boot_services+0x85/0xd0
> [0.008930] memblock_reserve: [0x00063000-0x0008efff] 
> efi_reserve_boot_services+0x85/0xd0
> ...
> [0.009425] memblock_reserve: [0x-0x000f] 
> crash_reserve_low_1M+0x2c/0x49
> ...
> [0.010586] Zone ranges:
> [0.010587]   DMA  [mem 0x1000-0x00ff]
> [0.010589]   DMA32[mem 0x0100-0x]
> [0.010591]   Normal   [mem 0x0001-0x000c7fff]
> [0.010593]   Device   empty
> ...
> [8.814894] __memblock_free_late: [0x00063000-0x0008efff] 
> efi_free_boot_services+0x14b/0x23b
> [8.815793] __memblock_free_late: [0x0001-0x00013fff] 
> efi_free_boot_services+0x14b/0x23b


In commit 6f599d84231fd27, we call crash_reserve_low_1M() to lock down the
whole low 1M area if crashkernel is specified on the kernel cmdline.
But the earlier efi_reserve_boot_services() invocation breaks the
intention of reserving the whole low 1M. In efi_reserve_boot_services(),
if any memory under the low 1M hasn't been reserved yet, it calls
memblock_reserve() to reserve it and leaves it to
efi_free_boot_services() to free.

Hi Lianbo,

Please correct me if I am wrong or anything is missed. IIUC, can we move
efi_reserve_boot_services() after reserve_real_mode() to fix this bug?
Or move reserve_real_mode() before efi_reserve_boot_services(), since
those real mode regions are all under 1M? This assumes EFI boot code/data
no longer relies on the low 1M area at this point.

Thanks
Baoquan
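
For illustration only, a rough sketch of the reordering being asked about
(untested, elided in the same style as the setup_arch() snippet quoted earlier
in this archive):

void __init setup_arch(char **cmdline_p)
{
	...
	reserve_real_mode();		/* take the low 1M regions first */
	efi_reserve_boot_services();	/* then reserve EFI boot services regions */
	...
}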

> 
> Do not release sub-1MB memory regions even though they are reserved by
> EFI boot services, so that always reserve all sub-1MB memory regions when
> the crashkernel option is specified.
> 
> Signed-off-by: Lianbo Jiang 
> ---
>  arch/x86/platform/efi/quirks.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 67d93a243c35..637f932c4fd4 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define EFI_MIN_RESERVE 5120
>  
> @@ -303,6 +304,19 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 
> size)
>   */
>  static __init bool can_free_region(u64 start, u64 size)
>  {
> + /*
> +  * Some sub-1MB memory regions may be reserved by EFI boot
> +  * services, and these memory regions will be released later
> +  * in the efi_free_boot_services().
> +  *
> +  * Do not release sub-1MB memory regions even though they are
> +  * reserved by EFI boot services, because, always reserve all
> +  * sub-1MB memory when the crashkernel option is specified.
> +  */
> + if (cmdline_find_option(boot_command_line, "crashkernel", NULL, 0) > 0
> + && (start + size < (1<<20)))
> + return false;
> +
>   if (start + size > __pa_symbol(_text) && start <= __pa_symbol(_end))
>   return false;
>  
> -- 
> 2.17.1
> 



Re: [PATCH v3 12/12] kdump: Use vmlinux_build_id to simplify

2021-04-08 Thread Baoquan He
On 04/07/21 at 07:03pm, Petr Mladek wrote:
> On Tue 2021-03-30 20:05:20, Stephen Boyd wrote:
> > We can use the vmlinux_build_id array here now instead of open coding
> > it. This mostly consolidates code.
> > 
> > Cc: Jiri Olsa 
> > Cc: Alexei Starovoitov 
> > Cc: Jessica Yu 
> > Cc: Evan Green 
> > Cc: Hsin-Yi Wang 
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: Vivek Goyal 
> > Cc: 
> > Signed-off-by: Stephen Boyd 
> > ---
> >  include/linux/crash_core.h |  6 +-
> >  kernel/crash_core.c| 41 ++
> >  2 files changed, 3 insertions(+), 44 deletions(-)
> > 
> > diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> > index 206bde8308b2..fb8ab99bb2ee 100644
> > --- a/include/linux/crash_core.h
> > +++ b/include/linux/crash_core.h
> > @@ -39,7 +39,7 @@ phys_addr_t paddr_vmcoreinfo_note(void);
> >  #define VMCOREINFO_OSRELEASE(value) \
> > vmcoreinfo_append_str("OSRELEASE=%s\n", value)
> >  #define VMCOREINFO_BUILD_ID(value) \
> > -   vmcoreinfo_append_str("BUILD-ID=%s\n", value)
> > +   vmcoreinfo_append_str("BUILD-ID=%20phN\n", value)

I may be missing something; wondering why we need to add '20' here.

> 
> Please, add also build check that BUILD_ID_MAX == 20.
> 
> 
> >  #define VMCOREINFO_PAGESIZE(value) \
> > vmcoreinfo_append_str("PAGESIZE=%ld\n", value)
> >  #define VMCOREINFO_SYMBOL(name) \
> > @@ -69,10 +69,6 @@ extern unsigned char *vmcoreinfo_data;
> >  extern size_t vmcoreinfo_size;
> >  extern u32 *vmcoreinfo_note;
> >  
> > -/* raw contents of kernel .notes section */
> > -extern const void __start_notes __weak;
> > -extern const void __stop_notes __weak;
> > -
> >  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> >   void *data, size_t data_len);
> >  void final_note(Elf_Word *buf);
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 825284baaf46..6b560cf9f374 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -4,6 +4,7 @@
> >   * Copyright (C) 2002-2004 Eric Biederman  
> >   */
> >  
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -378,51 +379,13 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
> >  }
> >  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
> >  
> > -#define NOTES_SIZE (&__stop_notes - &__start_notes)
> > -#define BUILD_ID_MAX SHA1_DIGEST_SIZE
> > -#define NT_GNU_BUILD_ID 3
> > -
> > -struct elf_note_section {
> > -   struct elf_note n_hdr;
> > -   u8 n_data[];
> > -};
> > -
> >  /*
> >   * Add build ID from .notes section as generated by the GNU ld(1)
> >   * or LLVM lld(1) --build-id option.
> >   */
> >  static void add_build_id_vmcoreinfo(void)
> >  {
> > -   char build_id[BUILD_ID_MAX * 2 + 1];
> > -   int n_remain = NOTES_SIZE;
> > -
> > -   while (n_remain >= sizeof(struct elf_note)) {
> > -   const struct elf_note_section *note_sec =
> > -   &__start_notes + NOTES_SIZE - n_remain;
> > -   const u32 n_namesz = note_sec->n_hdr.n_namesz;
> > -
> > -   if (note_sec->n_hdr.n_type == NT_GNU_BUILD_ID &&
> > -   n_namesz != 0 &&
> > -   !strcmp((char *)&note_sec->n_data[0], "GNU")) {
> > -   if (note_sec->n_hdr.n_descsz <= BUILD_ID_MAX) {
> > -   const u32 n_descsz = note_sec->n_hdr.n_descsz;
> > -   const u8 *s = &note_sec->n_data[n_namesz];
> > -
> > -   s = PTR_ALIGN(s, 4);
> > -   bin2hex(build_id, s, n_descsz);
> > -   build_id[2 * n_descsz] = '\0';
> > -   VMCOREINFO_BUILD_ID(build_id);
> > -   return;
> > -   }
> > -   pr_warn("Build ID is too large to include in 
> > vmcoreinfo: %u > %u\n",
> > -   note_sec->n_hdr.n_descsz,
> > -   BUILD_ID_MAX);
> > -   return;
> > -   }
> > -   n_remain -= sizeof(struct elf_note) +
> > -   ALIGN(note_sec->n_hdr.n_namesz, 4) +
> > -   ALIGN(note_sec->n_hdr.n_descsz, 4);
> > -   }
> > +   VMCOREINFO_BUILD_ID(vmlinux_build_id);
> &

Re: [PATCH v1 1/3] kernel/resource: make walk_system_ram_res() find all busy IORESOURCE_SYSTEM_RAM resources

2021-03-23 Thread Baoquan He
On 03/22/21 at 05:01pm, David Hildenbrand wrote:
> It used to be true that we could have busy system RAM only on the first level
> in the resource tree. However, this no longer holds for driver-managed
> system RAM (i.e., added via dax/kmem and virtio-mem), which gets added on
> lower levels.
> 
> We have two users of walk_system_ram_res(), which currently only
> considers the first level:
> a) kernel/kexec_file.c:kexec_walk_resources() -- We properly skip
>IORESOURCE_SYSRAM_DRIVER_MANAGED resources via
>locate_mem_hole_callback(), so even after this change, we won't be
>placing kexec images onto dax/kmem and virtio-mem added memory. No
>change.
> b) arch/x86/kernel/crash.c:fill_up_crash_elf_data() -- we're currently
>not adding relevant ranges to the crash elf info, resulting in them
>not getting dumped via kdump.
> 
> This change fixes loading a crashkernel via kexec_file_load() and including
> dax/kmem and virtio-mem added System RAM in the crashdump on x86-64. Note
> that, e.g., arm64 relies on memblock data and, therefore, already considers
> all added System RAM.
> 
> Let's find all busy IORESOURCE_SYSTEM_RAM resources, making the function
> behave like walk_system_ram_range().
> 
> Cc: Andrew Morton 
> Cc: Greg Kroah-Hartman 
> Cc: Dan Williams 
> Cc: Daniel Vetter 
> Cc: Andy Shevchenko 
> Cc: Mauro Carvalho Chehab 
> Cc: Signed-off-by: David Hildenbrand 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Vivek Goyal 
> Cc: Dave Hansen 
> Cc: Keith Busch 
> Cc: Michal Hocko 
> Cc: Qian Cai 
> Cc: Oscar Salvador 
> Cc: Eric Biederman 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: Tom Lendacky 
> Cc: Brijesh Singh 
> Cc: x...@kernel.org
> Cc: ke...@lists.infradead.org
> Signed-off-by: David Hildenbrand 
> ---
>  kernel/resource.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 627e61b0c124..4efd6e912279 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -457,7 +457,7 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
>  {
>   unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
>  
> - return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, true,
> + return __walk_iomem_res_desc(start, end, flags, IORES_DESC_NONE, false,
>arg, func);

Thanks, David, this is a good fix.

Acked-by: Baoquan He 

>  }
>  
> -- 
> 2.29.2
> 
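
A hedged usage sketch of the walker after this change (the callback name and
the accounting are made up for illustration; the callback signature is assumed
from kernel/resource.c of this era). With the first-level-only restriction
dropped, the walk also visits busy System RAM that sits deeper in the resource
tree, e.g. dax/kmem and virtio-mem regions:

static int __init count_ram_res(struct resource *res, void *arg)
{
	u64 *total = arg;

	*total += resource_size(res);
	return 0;
}

static void __init report_system_ram(void)
{
	u64 total = 0;

	/* Walks every busy IORESOURCE_SYSTEM_RAM resource, not only level one */
	walk_system_ram_res(0, ULLONG_MAX, &total, count_ram_res);
	pr_info("System RAM visible to walk_system_ram_res(): %llu bytes\n", total);
}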



Re: [PATCH] include: linux: Remove duplicate include of pgtable.h

2021-03-23 Thread Baoquan He
On 03/23/21 at 11:13am, Wan Jiabing wrote:
> linux/pgtable.h has been included at line 11 with annotation.
> So we remove the duplicate one at line 8.
> 
> Signed-off-by: Wan Jiabing 

Thanks for your posting, but this resend is still not good. I pasted the
suggested log, so I am wondering why you ignored it and sent v2 without
updating it, and also without marking this as v2. Please read
Documentation/process/submitting-patches.rst before you post next time.
Anyway, I have acked Tian Tao's patch since his patch log is good
enough.

Thanks
Baoquan

> ---
>  include/linux/crash_dump.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index a5192b718dbe..be79a45d7aa3 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -5,7 +5,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  #include  /* for pgprot_t */
> -- 
> 2.25.1
> 



Re: [PATCH] crash_dump: remove duplicate include in crash_dump.h

2021-03-19 Thread Baoquan He
On 03/13/21 at 02:35am, menglong8.d...@gmail.com wrote:
> From: Zhang Yunkai 
> 
> 'linux/pgtable.h' included in 'crash_dump.h' is duplicated.
> It is also included in the 8th line.

Tian Tao posted a patch to address the same issue, and his log is better.
Please repost with the log below.

linux/pgtable.h is included more than once; remove the one that isn't
necessary.

> 
> Signed-off-by: Zhang Yunkai 
> ---
>  include/linux/crash_dump.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index a5192b718dbe..6bd8a33cb740 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -8,8 +8,6 @@
>  #include 
>  #include 
>  
> -#include  /* for pgprot_t */
> -
>  #ifdef CONFIG_CRASH_DUMP
>  #define ELFCORE_ADDR_MAX (-1ULL)
>  #define ELFCORE_ADDR_ERR (-2ULL)
> -- 
> 2.25.1
> 
> 
> 



Re: [PATCH] kernel: kexec_file: fix error return code of kexec_calculate_store_digests()

2021-03-10 Thread Baoquan He
On 03/09/21 at 12:39am, Jia-Ju Bai wrote:
> When vzalloc() returns NULL to sha_regions, no error return code of
> kexec_calculate_store_digests() is assigned.
> To fix this bug, ret is assigned with -ENOMEM in this case.
> 
> Fixes: a43cac0d9dc2 ("kexec: split kexec_file syscall code to kexec_file.c")
> Reported-by: TOTE Robot 
> Signed-off-by: Jia-Ju Bai 
> ---
>  kernel/kexec_file.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 5c3447cf7ad5..33400ff051a8 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -740,8 +740,10 @@ static int kexec_calculate_store_digests(struct kimage 
> *image)
>  
>   sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
>   sha_regions = vzalloc(sha_region_sz);
> - if (!sha_regions)
> + if (!sha_regions) {
> + ret = -ENOMEM;
>   goto out_free_desc;

A good catch. Even though the chance of failure is very small, it does
cause an issue if it happens.

Acked-by: Baoquan He 

Thanks
Baoquan

> + }
>  
>   desc->tfm   = tfm;
>  
> -- 
> 2.17.1
> 
> 
> 



Re: [PATCH] kexec: Add kexec reboot string

2021-03-10 Thread Baoquan He
On 03/04/21 at 01:46pm, Paul Menzel wrote:
> From: Joe LeVeque 
> 
> The purpose is to notify the kernel module for fast reboot.

I checked several modules registered on reboot_notifier_list; none of them
care about the passed string. Just curious, could you tell how you have used
or plan to use this string in your code?

No objection to this, even though it's trivial if there is no real use case.

Acked-by: Baoquan He 
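
To illustrate the question (hypothetical names only): kernel_restart_prepare()
passes its cmd argument straight to the reboot notifier chain, so a module
wanting to act on the string could do something like:

static int fastboot_notify(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	const char *cmd = data;	/* string given to kernel_restart_prepare() */

	if (cmd && !strcmp(cmd, "kexec reboot"))
		pr_info("preparing hardware for a kexec fast reboot\n");

	return NOTIFY_DONE;
}

static struct notifier_block fastboot_nb = {
	.notifier_call = fastboot_notify,
};

/* registered from module init with register_reboot_notifier(&fastboot_nb) */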

> 
> Upstream a patch from the SONiC network operating system [1].
> 
> [1]: https://github.com/Azure/sonic-linux-kernel/pull/46
> 
> Signed-off-by: Paul Menzel 
> ---
>  kernel/kexec_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index a0b6780740c8..f04d04d1b855 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1165,7 +1165,7 @@ int kernel_kexec(void)
>  #endif
>   {
>   kexec_in_progress = true;
> - kernel_restart_prepare(NULL);
> + kernel_restart_prepare("kexec reboot");
>   migrate_to_reboot_cpu();
>  
>   /*
> -- 
> 2.30.1
> 
> 
> 



Re: [PATCH v3 1/2] x86/setup: consolidate early memory reservations

2021-03-03 Thread Baoquan He
On 03/02/21 at 05:17pm, Mike Rapoport wrote:
> On Tue, Mar 02, 2021 at 09:04:09PM +0800, Baoquan He wrote:
...
> > > +static void __init early_reserve_memory(void)
> > > +{
> > > + /*
> > > +  * Reserve the memory occupied by the kernel between _text and
> > > +  * __end_of_kernel_reserve symbols. Any kernel sections after the
> > > +  * __end_of_kernel_reserve symbol must be explicitly reserved with a
> > > +  * separate memblock_reserve() or they will be discarded.
> > > +  */
> > > + memblock_reserve(__pa_symbol(_text),
> > > +  (unsigned long)__end_of_kernel_reserve - (unsigned 
> > > long)_text);
> > > +
> > > + /*
> > > +  * Make sure page 0 is always reserved because on systems with
> > > +  * L1TF its contents can be leaked to user processes.
> > > +  */
> > > + memblock_reserve(0, PAGE_SIZE);
> > > +
> > > + early_reserve_initrd();
> > > +
> > > + if (efi_enabled(EFI_BOOT))
> > > + efi_memblock_x86_reserve_range();
> > > +
> > > + memblock_x86_reserve_range_setup_data();
> > 
> > This patch looks good to me, thanks for the effort.
> > 
> > While at it, wondering if we can rename the above function to
> > memblock_reserve_setup_data() just as its e820 counterpart
> > e820__reserve_setup_data(), adding 'x86' to a function under arch/x86
> > seems redundant.
> 
> I'd rather keep these names for now. First, it's easier to dig to them in the 
> git
> history and second, I'm planning more changes in this area and these names
> are as good as FIXME: to remind what still needs to be checked :)

I see, thanks for the explanation.



Re: [PATCH v3 1/2] x86/setup: consolidate early memory reservations

2021-03-02 Thread Baoquan He
On 03/02/21 at 12:04pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The early reservations of memory areas used by the firmware, bootloader,
> kernel text and data are spread over setup_arch(). Moreover, some of them
> happen *after* memblock allocations, e.g trim_platform_memory_ranges() and
> trim_low_memory_range() are called after reserve_real_mode() that allocates
> memory.
> 
> There was no corruption of these memory regions because memblock always
> allocates memory either from the end of memory (in top-down mode) or above
> the kernel image (in bottom-up mode). However, the bottom up mode is going
> to be updated to span the entire memory [1] to avoid limitations caused by
> KASLR.
> 
> Consolidate early memory reservations in a dedicated function to improve
> robustness against future changes. Having the early reservations in one
> place also makes it clearer what memory must be reserved before we allow
> memblock allocations.
> 
> [1] https://lore.kernel.org/lkml/20201217201214.3414100-2-g...@fb.com
> 
> Signed-off-by: Mike Rapoport 
> Acked-by: Borislav Petkov 
> ---
>  arch/x86/kernel/setup.c | 92 -
>  1 file changed, 44 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d883176ef2ce..3e3c6036b023 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -645,18 +645,6 @@ static void __init trim_snb_memory(void)
>   }
>  }
>  
> -/*
> - * Here we put platform-specific memory range workarounds, i.e.
> - * memory known to be corrupt or otherwise in need to be reserved on
> - * specific platforms.
> - *
> - * If this gets used more widely it could use a real dispatch mechanism.
> - */
> -static void __init trim_platform_memory_ranges(void)
> -{
> - trim_snb_memory();
> -}
> -
>  static void __init trim_bios_range(void)
>  {
>   /*
> @@ -729,7 +717,38 @@ static void __init trim_low_memory_range(void)
>  {
>   memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
>  }
> - 
> +
> +static void __init early_reserve_memory(void)
> +{
> + /*
> +  * Reserve the memory occupied by the kernel between _text and
> +  * __end_of_kernel_reserve symbols. Any kernel sections after the
> +  * __end_of_kernel_reserve symbol must be explicitly reserved with a
> +  * separate memblock_reserve() or they will be discarded.
> +  */
> + memblock_reserve(__pa_symbol(_text),
> +  (unsigned long)__end_of_kernel_reserve - (unsigned 
> long)_text);
> +
> + /*
> +  * Make sure page 0 is always reserved because on systems with
> +  * L1TF its contents can be leaked to user processes.
> +  */
> + memblock_reserve(0, PAGE_SIZE);
> +
> + early_reserve_initrd();
> +
> + if (efi_enabled(EFI_BOOT))
> + efi_memblock_x86_reserve_range();
> +
> + memblock_x86_reserve_range_setup_data();

This patch looks good to me, thanks for the effort.

While at it, I am wondering if we can rename the above function to
memblock_reserve_setup_data(), just like its e820 counterpart
e820__reserve_setup_data(); adding 'x86' to a function under arch/x86
seems redundant.

FWIW,

Reviewed-by: Baoquan He 

Thanks
Baoquan

> +
> + reserve_ibft_region();
> + reserve_bios_regions();
> +
> + trim_snb_memory();
> + trim_low_memory_range();
> +}
> +
>  /*
>   * Dump out kernel offset information on panic.
>   */
> @@ -764,29 +783,6 @@ dump_kernel_offset(struct notifier_block *self, unsigned 
> long v, void *p)
>  
>  void __init setup_arch(char **cmdline_p)
>  {
> - /*
> -  * Reserve the memory occupied by the kernel between _text and
> -  * __end_of_kernel_reserve symbols. Any kernel sections after the
> -  * __end_of_kernel_reserve symbol must be explicitly reserved with a
> -  * separate memblock_reserve() or they will be discarded.
> -  */
> - memblock_reserve(__pa_symbol(_text),
> -  (unsigned long)__end_of_kernel_reserve - (unsigned 
> long)_text);
> -
> - /*
> -  * Make sure page 0 is always reserved because on systems with
> -  * L1TF its contents can be leaked to user processes.
> -  */
> - memblock_reserve(0, PAGE_SIZE);
> -
> - early_reserve_initrd();
> -
> - /*
> -  * At this point everything still needed from the boot loader
> -  * or BIOS or kernel text should be early reserved or marked not
> -  * RAM in e820. All other memory is free game.
> -  */
> -
>  #ifdef CONFIG_X86_32
>   memcpy(&new_cpu_data, &boot_cpu_data, sizeof(new_cpu_data));
>  
> @@ -910,8 +906,1

Re: [PATCH 7/7] kdump: Use vmlinux_build_id() to simplify

2021-03-02 Thread Baoquan He
On 03/01/21 at 09:47am, Stephen Boyd wrote:
> We can use the vmlinux_build_id() helper here now instead of open coding
> it. This consolidates code and possibly avoids calculating the build ID
> twice in the case of a crash with a stacktrace.
> 
> Cc: Jiri Olsa 
> Cc: Alexei Starovoitov 
> Cc: Jessica Yu 
> Cc: Evan Green 
> Cc: Hsin-Yi Wang 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Vivek Goyal 
> Cc: 
> Signed-off-by: Stephen Boyd 
> ---
>  kernel/crash_core.c | 46 -
>  1 file changed, 8 insertions(+), 38 deletions(-)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..07d3e1109a8c 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -4,6 +4,7 @@
>   * Copyright (C) 2002-2004 Eric Biederman  
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -378,51 +379,20 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>  
> -#define NOTES_SIZE (&__stop_notes - &__start_notes)
> -#define BUILD_ID_MAX SHA1_DIGEST_SIZE
> -#define NT_GNU_BUILD_ID 3
> -
> -struct elf_note_section {
> - struct elf_note n_hdr;
> - u8 n_data[];
> -};
> -
>  /*
>   * Add build ID from .notes section as generated by the GNU ld(1)
>   * or LLVM lld(1) --build-id option.
>   */
>  static void add_build_id_vmcoreinfo(void)
>  {
> - char build_id[BUILD_ID_MAX * 2 + 1];
> - int n_remain = NOTES_SIZE;
> -
> - while (n_remain >= sizeof(struct elf_note)) {
> - const struct elf_note_section *note_sec =
> - &__start_notes + NOTES_SIZE - n_remain;
> - const u32 n_namesz = note_sec->n_hdr.n_namesz;
> -
> - if (note_sec->n_hdr.n_type == NT_GNU_BUILD_ID &&
> - n_namesz != 0 &&
> - !strcmp((char *)&note_sec->n_data[0], "GNU")) {
> - if (note_sec->n_hdr.n_descsz <= BUILD_ID_MAX) {
> - const u32 n_descsz = note_sec->n_hdr.n_descsz;
> - const u8 *s = &note_sec->n_data[n_namesz];
> -
> - s = PTR_ALIGN(s, 4);
> - bin2hex(build_id, s, n_descsz);
> - build_id[2 * n_descsz] = '\0';
> - VMCOREINFO_BUILD_ID(build_id);
> - return;
> - }
> - pr_warn("Build ID is too large to include in 
> vmcoreinfo: %u > %u\n",
> - note_sec->n_hdr.n_descsz,
> - BUILD_ID_MAX);
> - return;
> - }
> - n_remain -= sizeof(struct elf_note) +
> - ALIGN(note_sec->n_hdr.n_namesz, 4) +
> - ALIGN(note_sec->n_hdr.n_descsz, 4);
> + const char *build_id = vmlinux_build_id();

It's strange that I can only see the cover letter and this patch 7; I
couldn't find the patch where vmlinux_build_id() is introduced on lkml.

> +
> + if (build_id[0] == '\0') {
> + pr_warn("Build ID cannot be included in vmcoreinfo\n");
> + return;
>   }
> +
> + VMCOREINFO_BUILD_ID(build_id);
>  }
>  
>  static int __init crash_save_vmcoreinfo_init(void)
> -- 
> https://chromeos.dev
> 
> 
> 



Re: [PATCH v14 01/11] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2021-03-02 Thread Baoquan He
On 02/26/21 at 09:38am, Eric W. Biederman wrote:
> chenzhou  writes:
> 
> > On 2021/2/25 15:25, Baoquan He wrote:
> >> On 02/24/21 at 02:19pm, Catalin Marinas wrote:
> >>> On Sat, Jan 30, 2021 at 03:10:15PM +0800, Chen Zhou wrote:
> >>>> Move CRASH_ALIGN to header asm/kexec.h for later use. Besides, the
> >>>> alignment of crash kernel regions in x86 is 16M(CRASH_ALIGN), but
> >>>> function reserve_crashkernel() also used 1M alignment. So just
> >>>> replace hard-coded alignment 1M with macro CRASH_ALIGN.
> >>> [...]
> >>>> @@ -510,7 +507,7 @@ static void __init reserve_crashkernel(void)
> >>>>  } else {
> >>>>  unsigned long long start;
> >>>>  
> >>>> -start = memblock_phys_alloc_range(crash_size, SZ_1M, 
> >>>> crash_base,
> >>>> +start = memblock_phys_alloc_range(crash_size, 
> >>>> CRASH_ALIGN, crash_base,
> >>>>crash_base + 
> >>>> crash_size);
> >>>>  if (start != crash_base) {
> >>>>  pr_info("crashkernel reservation failed - 
> >>>> memory is in use.\n");
> >>> There is a small functional change here for x86. Prior to this patch,
> >>> crash_base passed by the user on the command line is allowed to be 1MB
> >>> aligned. With this patch, such reservation will fail.
> >>>
> >>> Is the current behaviour a bug in the current x86 code or it does allow
> >>> 1MB-aligned reservations?
> >> Hmm, you are right. Here we should keep the 1MB alignment as is, because
> >> users specify the address and size and their intention should be respected.
> >> The 1MB alignment for fixed memory region reservation was introduced in
> >> the commit below, but it doesn't say what Eric's request was at that time;
> >> I guess it meant respecting the user's specified value.
> 
> 
> > I think we could make the alignment unified. Why is the alignment system 
> > reserved and
> > user specified different? Besides, there is no document about the 1MB 
> > alignment.
> > How about adding the alignment size(16MB) in doc  if user specified
> > start address as arm64 does.
> 
> Looking at what the code is doing.  Attempting to reserve a crash region
> at the location the user specified.  Adding unnecessary alignment
> constraints is totally broken. 
> 
> I am not even certain enforcing a 1MB alignment makes sense.  I suspect
> it was added so that we don't accidentally reserve low memory on x86.
> Frankly I am not even certain that makes sense.
> 
> Now in practice there might be an argument for 2MB alignment that goes
> with huge page sizes on x86.  But until someone finds that there are
> actual problems with 1MB alignment I would not touch it.
> 
> The proper response to something that isn't documented and confusing is
> not to arbitrarily change it and risk breaking users.  Especially in
> this case where it is clear that adding additional alignment is total
> nonsense.  The proper response to something that isn't clear and
> documented is to dig in and document it, or to leave it alone and let it

Sounds reasonable. Then adding documentation or a code comment around it
looks like a good way forward, so that people can easily understand why its
alignment is different from other reservations.

> be the next persons problem.
> 
> In this case there is no reason for changing this bit of code.
> All CRASH_ALIGN is about is a default alignment when none is specified.
> It is not a functional requirement but just something so that things
> come out nicely.
> 
> 
> Eric
> 



Re: [PATCH v14 02/11] x86: kdump: make the lower bound of crash kernel reservation consistent

2021-02-25 Thread Baoquan He
On 02/25/21 at 02:42pm, Catalin Marinas wrote:
> On Thu, Feb 25, 2021 at 03:08:46PM +0800, Baoquan He wrote:
> > On 02/24/21 at 02:35pm, Catalin Marinas wrote:
> > > On Sat, Jan 30, 2021 at 03:10:16PM +0800, Chen Zhou wrote:
> > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > > index da769845597d..27470479e4a3 100644
> > > > --- a/arch/x86/kernel/setup.c
> > > > +++ b/arch/x86/kernel/setup.c
> > > > @@ -439,7 +439,8 @@ static int __init reserve_crashkernel_low(void)
> > > > return 0;
> > > > }
> > > >  
> > > > -   low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 0, 
> > > > CRASH_ADDR_LOW_MAX);
> > > > +   low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 
> > > > CRASH_ALIGN,
> > > > +   CRASH_ADDR_LOW_MAX);
> > > > if (!low_base) {
> > > > pr_err("Cannot reserve %ldMB crashkernel low memory, 
> > > > please try smaller size.\n",
> > > >(unsigned long)(low_size >> 20));
> > > 
> > > Is there any reason why the lower bound can't be 0 in all low cases
> > > here? (Sorry if it's been already discussed, I lost track)
> > 
> > Seems like a good question.
> > 
> > This reserve_crashkernel_low(), paired with reserve_crashkernel_high(), is
> > used to reserve memory under 4G so that the kdump kernel owns memory for
> > DMA buffer allocation. In that case, the kernel is usually loaded in high
> > memory. On x86_64, kernel loading needs to be aligned to 16M because of
> > CONFIG_PHYSICAL_START, please see commit 32105f7fd8faa7b ("x86: find
> > offset for crashkernel reservation automatically"). But for the crashkernel
> > low memory, there seems to be no reason to ask for 16M alignment if
> > it is only taken as DMA buffer memory.
> > 
> > So we could use a different alignment for the low memory only, e.g. 2M. But
> > a 16M alignment consistent with crashkernel,high is also fine to me. The
> > only effect is that a smaller alignment increases the chance that the
> > crashkernel low reservation succeeds.
> 
> I don't mind the 16M alignment in both low and high base. But is there
> any reason that the lower bound (third argument) cannot be 0 in both
> reserve_crashkernel() (the low attempt) and reserve_crashkernel_low()
> cases? The comment in reserve_crashkernel() only talks about the 4G
> upper bound but not why we need a 16M lower bound.

Ah, sorry, I must have mixed this one up with the alignment of the fixed
memory region reservation in patch 1 when writing my comments.

Hmm, on x86 we always have memory reserved in the low 1M, so a lower bound of
0 or 16M (the kernel alignment) makes no difference for the crashkernel low
reservation. But for the crashkernel reservation itself, the reason should be
the kernel loading alignment being 16M, please see commit 32105f7fd8faa7b
("x86: find offset for crashkernel reservation automatically").

So, for crashkernel low, keeping the lower bound as 0 looks good to me, for
just the reason the patch log gives. It also skips the unnecessary memblock
search under 16M, which would always fail, even though that doesn't matter
much. Or changing it to CRASH_ALIGN as this patch does, with a code comment
added, is also fine to me.

Thanks
Baoquan



Re: [PATCH v14 01/11] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2021-02-24 Thread Baoquan He
On 02/24/21 at 02:19pm, Catalin Marinas wrote:
> On Sat, Jan 30, 2021 at 03:10:15PM +0800, Chen Zhou wrote:
> > Move CRASH_ALIGN to header asm/kexec.h for later use. Besides, the
> > alignment of crash kernel regions in x86 is 16M(CRASH_ALIGN), but
> > function reserve_crashkernel() also used 1M alignment. So just
> > replace hard-coded alignment 1M with macro CRASH_ALIGN.
> [...]
> > @@ -510,7 +507,7 @@ static void __init reserve_crashkernel(void)
> > } else {
> > unsigned long long start;
> >  
> > -   start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
> > +   start = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 
> > crash_base,
> >   crash_base + crash_size);
> > if (start != crash_base) {
> > pr_info("crashkernel reservation failed - memory is in 
> > use.\n");
> 
> There is a small functional change here for x86. Prior to this patch,
> crash_base passed by the user on the command line is allowed to be 1MB
> aligned. With this patch, such reservation will fail.
> 
> Is the current behaviour a bug in the current x86 code or it does allow
> 1MB-aligned reservations?

Hmm, you are right. Here we should keep the 1MB alignment as is, because
users specify the address and size and their intention should be respected.
The 1MB alignment for fixed memory region reservation was introduced in the
commit below, but it doesn't say what Eric's request was at that time; I
guess it meant respecting the user's specified value.

commit 44280733e71ad15377735b42d8538c109c94d7e3
Author: Yinghai Lu 
Date:   Sun Nov 22 17:18:49 2009 -0800

x86: Change crash kernel to reserve via reserve_early()

use find_e820_area()/reserve_early() instead.

-v2: address Eric's request, to restore original semantics.
 will fail, if the provided address can not be used.



Re: [PATCH v14 02/11] x86: kdump: make the lower bound of crash kernel reservation consistent

2021-02-24 Thread Baoquan He
On 02/24/21 at 02:35pm, Catalin Marinas wrote:
> On Sat, Jan 30, 2021 at 03:10:16PM +0800, Chen Zhou wrote:
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index da769845597d..27470479e4a3 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -439,7 +439,8 @@ static int __init reserve_crashkernel_low(void)
> > return 0;
> > }
> >  
> > -   low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 0, 
> > CRASH_ADDR_LOW_MAX);
> > +   low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, CRASH_ALIGN,
> > +   CRASH_ADDR_LOW_MAX);
> > if (!low_base) {
> > pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> > smaller size.\n",
> >(unsigned long)(low_size >> 20));
> 
> Is there any reason why the lower bound can't be 0 in all low cases
> here? (Sorry if it's been already discussed, I lost track)

Seems like a good question.

This reserve_crashkernel_low(), paired with reserve_crashkernel_high(), is
used to reserve memory under 4G so that the kdump kernel owns memory for DMA
buffer allocation. In that case, the kernel is usually loaded in high memory.
On x86_64, kernel loading needs to be aligned to 16M because of
CONFIG_PHYSICAL_START, please see commit 32105f7fd8faa7b ("x86: find
offset for crashkernel reservation automatically"). But for the crashkernel
low memory, there seems to be no reason to ask for 16M alignment if it is
only taken as DMA buffer memory.

So we could use a different alignment for the low memory only, e.g. 2M. But a
16M alignment consistent with crashkernel,high is also fine to me. The only
effect is that a smaller alignment increases the chance that the crashkernel
low reservation succeeds.

Thanks
Baoquan
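
Purely as an illustration of the smaller-alignment idea above (not a posted
patch; SZ_2M and the unchanged bounds are my assumptions), the low reservation
could then read:

	low_base = memblock_phys_alloc_range(low_size, SZ_2M, 0,
					     CRASH_ADDR_LOW_MAX);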



Re: [PATCH v4 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-23 Thread Baoquan He
On 02/23/21 at 09:41am, Saeed Mirzamohammadi wrote:
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
> 
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 

Looks good, thx.

Acked-by: Baoquan He 

By the way, please provide a changelog in the future. That helps people
better understand what happened during patch review and evolution.

Thanks
Baoquan

> ---
>  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
>  .../admin-guide/kernel-parameters.txt |  6 ++
>  arch/Kconfig  | 20 +++
>  kernel/crash_core.c   |  7 +++
>  4 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..ae030111e22a 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>  
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>  
>  Boot into System Kernel
>  ===
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 9e3cdb271d06..a5deda5c85fe 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -747,6 +747,12 @@
>   a memory unit (amount[KMG]). See also
>   Documentation/admin-guide/kdump/kdump.rst for an 
> example.
>  
> + crashkernel=auto
> + [KNL] This parameter will set the reserved memory for
> + the crash kernel based on the value of the 
> CRASH_AUTO_STR
> + that is the best effort estimation for each arch. See 
> also
> + arch/Kconfig for further details.
> +
>   crashkernel=size[KMG],high
>   [KNL, X86-64] range could be above 4G. Allow kernel
>   to allocate physical memory region from top, so could
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..23d047548772 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,26 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +config CRASH_AUTO_STR
> + string "Memory reserved for crash kernel"
> + depends on CRASH_CORE
> + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> + help
> +   This configures the reserved memory dependent
> +   on the value of System RAM. The syntax is:
> +   crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> +   range=start-[end]
> +
> +   For example:
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
> +
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then 
> reserve 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..90f9e4bb6704 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
>   if (suffix)
>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> + pr_info("Using crashkernel=auto, the size chosen is a best 
> effort estimation.\n");
> + }
> +#endif
>   /*
>* if the commandline contains a ':', then that's the extended
>* syntax -- if not, it must be the classic syntax
> -- 
> 2.27.0
> 
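
As a worked example of the help text above: with the default
CONFIG_CRASH_AUTO_STR of "1G-64G:128M,64G-1T:256M,1T-:512M", booting a machine
with, say, 32G of RAM and crashkernel=auto on the cmdline would reserve 128M,
exactly as if that range string had been passed explicitly via crashkernel=.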



Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-23 Thread Baoquan He
On 02/24/21 at 09:54am, Baoquan He wrote:
> On 02/11/21 at 10:08am, Saeed Mirzamohammadi wrote:
> > This adds crashkernel=auto feature to configure reserved memory for
> > vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> > different kernel distributions and different archs based on their
> > needs.
> > 
> > Signed-off-by: Saeed Mirzamohammadi 
> > Signed-off-by: John Donnelly 
> > Tested-by: John Donnelly 
> > ---
> >  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
> >  .../admin-guide/kernel-parameters.txt |  6 +
> >  arch/Kconfig  | 24 +++
> >  kernel/crash_core.c   |  7 ++++++
> >  4 files changed, 39 insertions(+), 1 deletion(-)
> 
> Acked-by: Baoquan He 

Sorry, I just acked the wrong version of the patch; please ignore this.

> 
> > 
> > diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> > b/Documentation/admin-guide/kdump/kdump.rst
> > index 2da65fef2a1c..e55cdc404c6b 100644
> > --- a/Documentation/admin-guide/kdump/kdump.rst
> > +++ b/Documentation/admin-guide/kdump/kdump.rst
> > @@ -285,7 +285,8 @@ This would mean:
> >  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
> >  3) if the RAM size is larger than 2G, then reserve 128M
> >  
> > -
> > +Or you can use crashkernel=auto to choose the crash kernel memory size
> > +based on the recommended configuration set for each arch.
> >  
> >  Boot into System Kernel
> >  ===
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 7d4e523646c3..aa2099465458 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -736,6 +736,12 @@
> > a memory unit (amount[KMG]). See also
> > Documentation/admin-guide/kdump/kdump.rst for an 
> > example.
> >  
> > +   crashkernel=auto
> > +   [KNL] This parameter will set the reserved memory for
> > +   the crash kernel based on the value of the 
> > CRASH_AUTO_STR
> > +   that is the best effort estimation for each arch. See 
> > also
> > +   arch/Kconfig for further details.
> > +
> > crashkernel=size[KMG],high
> > [KNL, X86-64] range could be above 4G. Allow kernel
> > to allocate physical memory region from top, so could
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index af14a567b493..f87c88ffa2f8 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -14,6 +14,30 @@ menu "General architecture-dependent options"
> >  config CRASH_CORE
> > bool
> >  
> > +if CRASH_CORE
> > +
> > +config CRASH_AUTO_STR
> > +   string "Memory reserved for crash kernel"
> > +   depends on CRASH_CORE
> > +   default "1G-64G:128M,64G-1T:256M,1T-:512M"
> > +   help
> > + This configures the reserved memory dependent
> > + on the value of System RAM. The syntax is:
> > + crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> > + range=start-[end]
> > +
> > + For example:
> > + crashkernel=512M-2G:64M,2G-:128M
> > +
> > + This would mean:
> > +
> > + 1) if the RAM is smaller than 512M, then don't reserve anything
> > +(this is the "rescue" case)
> > + 2) if the RAM size is between 512M and 2G (exclusive), then 
> > reserve 64M
> > + 3) if the RAM size is larger than 2G, then reserve 128M
> > +
> > +endif # CRASH_CORE
> > +
> >  config KEXEC_CORE
> > select CRASH_CORE
> > bool
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 106e4500fd53..ab0a2b4b1ffa 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -7,6 +7,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
> > if (suffix)
> > return parse_crashkernel_suffix(ck_cmdline, crash_size,
> > suffix);
> > +#ifdef CONFIG_CRASH_AUTO_STR
> > +   if (strncmp(ck_cmdline, "auto", 4) == 0) {
> > +   ck_cmdline = CONFIG_CRASH_AUTO_STR;
> > +   pr_info("Using crashkernel=auto, the size chosen is a best 
> > effort estimation.\n");
> > +   }
> > +#endif
> > /*
> >  * if the commandline contains a ':', then that's the extended
> >  * syntax -- if not, it must be the classic syntax
> > -- 
> > 2.27.0
> > 



Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-23 Thread Baoquan He
On 02/11/21 at 10:08am, Saeed Mirzamohammadi wrote:
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
> 
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
>  .../admin-guide/kernel-parameters.txt |  6 +
>  arch/Kconfig  | 24 +++
>  kernel/crash_core.c   |  7 ++
>  4 files changed, 39 insertions(+), 1 deletion(-)

Acked-by: Baoquan He 

> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 2da65fef2a1c..e55cdc404c6b 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>  
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>  
>  Boot into System Kernel
>  ===
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 7d4e523646c3..aa2099465458 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -736,6 +736,12 @@
>   a memory unit (amount[KMG]). See also
>   Documentation/admin-guide/kdump/kdump.rst for an 
> example.
>  
> + crashkernel=auto
> + [KNL] This parameter will set the reserved memory for
> + the crash kernel based on the value of the 
> CRASH_AUTO_STR
> + that is the best effort estimation for each arch. See 
> also
> + arch/Kconfig for further details.
> +
>   crashkernel=size[KMG],high
>   [KNL, X86-64] range could be above 4G. Allow kernel
>   to allocate physical memory region from top, so could
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..f87c88ffa2f8 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,30 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +if CRASH_CORE
> +
> +config CRASH_AUTO_STR
> + string "Memory reserved for crash kernel"
> + depends on CRASH_CORE
> + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> + help
> +   This configures the reserved memory dependent
> +   on the value of System RAM. The syntax is:
> +   crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> +   range=start-[end]
> +
> +   For example:
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
> +
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then 
> reserve 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
> +endif # CRASH_CORE
> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..ab0a2b4b1ffa 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
>   if (suffix)
>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> + pr_info("Using crashkernel=auto, the size chosen is a best 
> effort estimation.\n");
> + }
> +#endif
>   /*
>* if the commandline contains a ':', then that's the extended
>* syntax -- if not, it must be the classic syntax
> -- 
> 2.27.0
> 



Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-23 Thread Baoquan He
On 02/23/21 at 08:01pm, Kairui Song wrote:
> On Thu, Feb 18, 2021 at 10:03 AM Baoquan He  wrote:
> >
> > On 02/11/21 at 10:08am, Saeed Mirzamohammadi wrote:
...
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index af14a567b493..f87c88ffa2f8 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -14,6 +14,30 @@ menu "General architecture-dependent options"
> > >  config CRASH_CORE
> > >   bool
> > >
> > > +if CRASH_CORE
> > > +
> > > +config CRASH_AUTO_STR
> > > + string "Memory reserved for crash kernel"
> > > + depends on CRASH_CORE
> > > + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> > > + help
> > > +   This configures the reserved memory dependent
> > > +   on the value of System RAM. The syntax is:
> > > +   crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> > > +   range=start-[end]
> > > +
> > > +   For example:
> > > +   crashkernel=512M-2G:64M,2G-:128M
> > > +
> > > +   This would mean:
> > > +
> > > +   1) if the RAM is smaller than 512M, then don't reserve 
> > > anything
> > > +  (this is the "rescue" case)
> > > +   2) if the RAM size is between 512M and 2G (exclusive), then 
> > > reserve 64M
> > > +   3) if the RAM size is larger than 2G, then reserve 128M
> > > +
> > > +endif # CRASH_CORE
> >
> > Wondering if this CRASH_CORE ifdeffery is a little redundant here
> > since the CRASH_CORE dependency has been added. Except for this, I like this
> > patch. As we discussed in private threads, we can try to push it into
> > mainline and continue improving later.
> >
> 
> I believe "if CRASH_CORE" is not needed as it already "depends on
> CRASH_CORE", tested with CRASH_CORE=y or 'not set', it just works.

Thanks for testing and confirmation, Kairui.

Saeed, can you post a v4 with CRASH_CORE ifdeffery removed? Maybe this
week?

Thanks
Baoquan

> 
> > > +
> > >  config KEXEC_CORE
> > >   select CRASH_CORE
> > >   bool
> > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > > index 106e4500fd53..ab0a2b4b1ffa 100644
> > > --- a/kernel/crash_core.c
> > > +++ b/kernel/crash_core.c
> > > @@ -7,6 +7,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #include 
> > >  #include 
> > > @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
> > >   if (suffix)
> > >   return parse_crashkernel_suffix(ck_cmdline, crash_size,
> > >   suffix);
> > > +#ifdef CONFIG_CRASH_AUTO_STR
> > > + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> > > + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> > > + pr_info("Using crashkernel=auto, the size chosen is a best 
> > > effort estimation.\n");
> > > + }
> > > +#endif
> > >   /*
> > >* if the commandline contains a ':', then that's the extended
> > >* syntax -- if not, it must be the classic syntax
> > > --
> > > 2.27.0
> > >
> >
> >
> > ___
> > kexec mailing list
> > ke...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> >
> 
> 
> -- 
> Best Regards,
> Kairui Song
> 



Re: [PATCH v6 1/1] mm/page_alloc.c: refactor initialization of struct page for holes in memory layout

2021-02-22 Thread Baoquan He
On 02/22/21 at 12:57pm, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
> 
> Such pages are currently initialized using init_unavailable_mem() function
> that iterates through PFNs in holes in memblock.memory and if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
> 
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.
> 
> Before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
> rather that check each PFN") the holes inside a zone were re-initialized

Yeah, and the old code re-initialized the unavailable memory in an
implicit and confusing way. This patch does it explicitly in a similar
way, keeping it basically consistent with the old code. This looks great
to me.

> during memmap_init() and got their zone/node links right. However, after
> that commit nothing updates the struct pages representing such holes.
> 
> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
> 
>   # grep -A1 E820 /proc/iomem
>   7a17b000-7a216fff : Unknown E820 type
>   7a217000-7bff : System RAM
> 
> unset zone link in struct page will trigger
> 
>   VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
> in struct page) in the same pageblock.
> 
> Interleave initialization of the unavailable pages with the normal
> initialization of memory map, so that zone and node information will be
> properly set on struct pages that are not backed by the actual memory.
> 
> With this change the pages for holes inside a zone will get proper
> zone/node links and the pages that are not spanned by any node will get
> links to the adjacent zone/node.

Thanks for spending so much effort and patience on fixing this.

Reviewed-by: Baoquan He 

> 
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather 
> that check each PFN")
> Signed-off-by: Mike Rapoport 
> Reported-by: Qian Cai 
> Reported-by: Andrea Arcangeli 
> Cc: Baoquan He 
> Cc: David Hildenbrand 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Qian Cai 
> Cc: Vlastimil Babka 
> ---
>  mm/page_alloc.c | 144 
>  1 file changed, 61 insertions(+), 83 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e93f8b29bae..1f1db70b7789 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6280,12 +6280,60 @@ static void __meminit zone_init_free_lists(struct 
> zone *zone)
>   }
>  }
>  
> +#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> +/*
> + * Only struct pages that correspond to ranges defined by memblock.memory
> + * are zeroed and initialized by going through __init_single_page() during
> + * memmap_init_zone().
> + *
> + * But, there could be struct pages that correspond to holes in
> + * memblock.memory. This can happen because of the following reasons:
> + * - phyiscal memory bank size is not necessarily the exact multiple of the
> + *   arbitrary section size
> + * - early reserved memory may not be listed in memblock.memory
> + * - memory layouts defined with memmap= kernel parameter may not align
> + *   nicely with memmap sections
> + *
> + * Explicitly initialize those struct pages so that:
> + * - PG_Reserved is set
> + * - zone and node links point to zone and node that span the page
> + */
> +static u64 __meminit init_unavailable_range(unsigned long spfn,
> + unsigned long epfn,
> + int zone, int node)
> +{
> + unsigned long pfn;
> + u64 pgcnt = 0;
> +
> + for (pfn = spfn; pfn < epfn; pfn++) {
> + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> + pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> + + pageblock_nr_pages - 1;
> + continue;
> + }
> + __init_single_page(pfn_to_page(pfn), pfn, zone, node);
> + __SetPageReserved(pfn_to_page(pfn));
> + pgcnt++;
> + }
> +
> + return pgcnt;
> +}
> +#else
> +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long 
> epfn,
> +   

Re: [PATCH v14 11/11] kdump: update Documentation about crashkernel

2021-02-18 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> For arm64, the behavior of crashkernel=X has been changed, which
> tries low allocation in DMA zone and fall back to high allocation
> if it fails.
> 
> We can also use "crashkernel=X,high" to select a high region above
> DMA zone, which also tries to allocate at least 256M low memory in
> DMA zone automatically and "crashkernel=Y,low" can be used to allocate
> specified size low memory.
> 
> So update the Documentation.

Nice documentation addition, which also covers the x86 code
implementation, thanks. By the way, maybe you can remove John's
'Tested-by' since it doesn't make much sense to test a documentation
patch.

Acked-by: Baoquan He 

> 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst | 22 ---
>  .../admin-guide/kernel-parameters.txt | 11 --
>  2 files changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 75a9dd98e76e..0877c76f8015 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -299,7 +299,16 @@ Boot into System Kernel
> "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
> starting at physical address 0x0100 (16MB) for the dump-capture 
> kernel.
>  
> -   On x86 and x86_64, use "crashkernel=64M@16M".
> +   On x86 use "crashkernel=64M@16M".
> +
> +   On x86_64, use "crashkernel=X" to select a region under 4G first, and
> +   fall back to reserve region above 4G. And go for high allocation
> +   directly if the required size is too large.
> +   We can also use "crashkernel=X,high" to select a region above 4G, which
> +   also tries to allocate at least 256M below 4G automatically and
> +   "crashkernel=Y,low" can be used to allocate specified size low memory.
> +   Use "crashkernel=Y@X" if you really have to reserve memory from specified
> +   start address X.
>  
> On ppc64, use "crashkernel=128M@32M".
>  
> @@ -316,8 +325,15 @@ Boot into System Kernel
> kernel will automatically locate the crash kernel image within the
> first 512MB of RAM if X is not given.
>  
> -   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
> -   the kernel, X if explicitly specified, must be aligned to 2MiB (0x20).
> +   On arm64, use "crashkernel=X" to try low allocation in DMA zone and
> +   fall back to high allocation if it fails.
> +   We can also use "crashkernel=X,high" to select a high region above
> +   DMA zone, which also tries to allocate at least 256M low memory in
> +   DMA zone automatically.
> +   "crashkernel=Y,low" can be used to allocate specified size low memory.
> +   Use "crashkernel=Y@X" if you really have to reserve memory from
> +   specified start address X. Note that the start address of the kernel,
> +   X if explicitly specified, must be aligned to 2MiB (0x20).
>  
>  Load the Dump-capture Kernel
>  
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index a10b545c2070..908e5c8b61ba 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -738,6 +738,9 @@
>   [KNL, X86-64] Select a region under 4G first, and
>   fall back to reserve region above 4G when '@offset'
>   hasn't been specified.
> + [KNL, arm64] Try low allocation in DMA zone and fall 
> back
> + to high allocation if it fails when '@offset' hasn't 
> been
> + specified.
>   See Documentation/admin-guide/kdump/kdump.rst for 
> further details.
>  
>   crashkernel=range1:size1[,range2:size2,...][@offset]
> @@ -754,6 +757,8 @@
>   Otherwise memory region will be allocated below 4G, if
>   available.
>   It will be ignored if crashkernel=X is specified.
> + [KNL, arm64] range in high memory.
> + Allow kernel to allocate physical memory region from 
> top.
>   crashkernel=size[KMG],low
>   [KNL, X86-64] range under 4G. When crashkernel=X,high
>   is passed, kernel could allocate physical memory region
> @@ -762,13 +767,15 @@
>   requires at least 64M+32K low memory, also enough extra
> 

Re: [PATCH v14 09/11] x86, arm64: Add ARCH_WANT_RESERVE_CRASH_KERNEL config

2021-02-18 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> We make the functions reserve_crashkernel[_low]() as generic for
> x86 and arm64. Since reserve_crashkernel[_low]() implementations
> are quite similar on other architectures as well, we can have more
> users of this later.
> 
> So have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL in arch/Kconfig and
> select this by X86 and ARM64.

This looks much better with the help of
CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL. And please take the
'Suggested-by' tag off me; I just didn't like the old CONFIG_X86 and
CONFIG_ARM64 ifdeffery way in v13, and Mike suggested this ARCH_WANT_
option.

The two dummy reserve_crashkernel() functions in x86 and arm64 don't
look so good either, but I don't have a better idea. Maybe add
CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL ifdeffery at the call site of
reserve_crashkernel() in each arch? Or just leave it as is for now if
nobody else has concerns or suggestions about it.
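For illustration only, the call-site guard I have in mind would look
roughly like this in each arch's setup code (a rough, untested sketch,
not a patch):

#ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
	/* e.g. in setup_arch(): only call it when the generic code exists */
	reserve_crashkernel();
#endif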

Anyway, ack this one.

Acked-by: Baoquan He 

Thanks
Baoquan


> 
> Suggested-by: Mike Rapoport 
> Suggested-by: Baoquan He 
> Signed-off-by: Chen Zhou 
> ---
>  arch/Kconfig| 3 +++
>  arch/arm64/Kconfig  | 1 +
>  arch/x86/Kconfig| 2 ++
>  kernel/crash_core.c | 7 ++-
>  4 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..0ca1ff5bb157 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -24,6 +24,9 @@ config KEXEC_ELF
>  config HAVE_IMA_KEXEC
>   bool
>  
> +config ARCH_WANT_RESERVE_CRASH_KERNEL
> + bool
> +
>  config SET_FS
>   bool
>  
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index f39568b28ec1..09365c7ff469 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -82,6 +82,7 @@ config ARM64
>   select ARCH_WANT_FRAME_POINTERS
>   select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES 
> && !ARM64_VA_BITS_36)
>   select ARCH_WANT_LD_ORPHAN_WARN
> + select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>   select ARM_AMBA
>   select ARM_ARCH_TIMER
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 21f851179ff0..e6926fcb4a40 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -12,6 +12,7 @@ config X86_32
>   depends on !64BIT
>   # Options that are inherently 32-bit kernel only:
>   select ARCH_WANT_IPC_PARSE_VERSION
> + select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
>   select CLKSRC_I8253
>   select CLONE_BACKWARDS
>   select GENERIC_VDSO_32
> @@ -28,6 +29,7 @@ config X86_64
>   select ARCH_HAS_GIGANTIC_PAGE
>   select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>   select ARCH_USE_CMPXCHG_LOCKREF
> + select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
>   select HAVE_ARCH_SOFT_DIRTY
>   select MODULES_USE_ELF_RELA
>   select NEED_DMA_MAP_STATE
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 8479be270c0b..2c5783985db5 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -320,9 +320,7 @@ int __init parse_crashkernel_low(char *cmdline,
>   * - Crashkernel reservation --
>   */
>  
> -#ifdef CONFIG_KEXEC_CORE
> -
> -#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> +#ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
>  static int __init reserve_crashkernel_low(void)
>  {
>  #ifdef CONFIG_64BIT
> @@ -450,8 +448,7 @@ void __init reserve_crashkernel(void)
>   crashk_res.start = crash_base;
>   crashk_res.end   = crash_base + crash_size - 1;
>  }
> -#endif
> -#endif /* CONFIG_KEXEC_CORE */
> +#endif /* CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL */
>  
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> void *data, size_t data_len)
> -- 
> 2.20.1
> 



Re: [PATCH v14 03/11] x86: kdump: use macro CRASH_ADDR_LOW_MAX in functions reserve_crashkernel()

2021-02-18 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> To make the functions reserve_crashkernel() as generic,
> replace some hard-coded numbers with macro CRASH_ADDR_LOW_MAX.
> 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/kernel/setup.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 27470479e4a3..086a04235be4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -487,8 +487,9 @@ static void __init reserve_crashkernel(void)
>   if (!crash_base) {
>   /*
>* Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
> -  * crashkernel=x,high reserves memory over 4G, also allocates
> -  * 256M extra low memory for DMA buffers and swiotlb.
> +  * crashkernel=x,high reserves memory over CRASH_ADDR_LOW_MAX,
> +  * also allocates 256M extra low memory for DMA buffers
> +  * and swiotlb.
>* But the extra memory is not required for all machines.
>* So try low memory first and fall back to high memory
>* unless "crashkernel=size[KMG],high" is specified.
> @@ -516,7 +517,7 @@ static void __init reserve_crashkernel(void)
>   }
>   }
>  
> - if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
> + if (crash_base >= CRASH_ADDR_LOW_MAX && reserve_crashkernel_low()) {
>   memblock_free(crash_base, crash_size);
>   return;

Acked-by: Baoquan He 

>   }
> -- 
> 2.20.1
> 



Re: [PATCH v14 09/11] x86, arm64: Add ARCH_WANT_RESERVE_CRASH_KERNEL config

2021-02-18 Thread Baoquan He
On 02/18/21 at 03:31pm, Baoquan He wrote:
> On 01/30/21 at 03:10pm, Chen Zhou wrote:
> > We make the functions reserve_crashkernel[_low]() as generic for
> > x86 and arm64. Since reserve_crashkernel[_low]() implementations
> > are quite similar on other architectures as well, we can have more
> > users of this later.
> > 
> > So have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL in arch/Kconfig and
> > select this by X86 and ARM64.
> > 
> > Suggested-by: Mike Rapoport 
> > Suggested-by: Baoquan He 
> > Signed-off-by: Chen Zhou 
> > ---
> >  arch/Kconfig| 3 +++
> >  arch/arm64/Kconfig  | 1 +
> >  arch/x86/Kconfig| 2 ++
> >  kernel/crash_core.c | 7 ++-
> >  4 files changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 24862d15f3a3..0ca1ff5bb157 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -24,6 +24,9 @@ config KEXEC_ELF
> >  config HAVE_IMA_KEXEC
> > bool
> >  
> > +config ARCH_WANT_RESERVE_CRASH_KERNEL
> > +   bool
> > +
> >  config SET_FS
> > bool
> >  
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index f39568b28ec1..09365c7ff469 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -82,6 +82,7 @@ config ARM64
> > select ARCH_WANT_FRAME_POINTERS
> > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES 
> > && !ARM64_VA_BITS_36)
> > select ARCH_WANT_LD_ORPHAN_WARN
> > +   select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
> > select ARCH_HAS_UBSAN_SANITIZE_ALL
> > select ARM_AMBA
> > select ARM_ARCH_TIMER
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 21f851179ff0..e6926fcb4a40 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -12,6 +12,7 @@ config X86_32
> > depends on !64BIT
> > # Options that are inherently 32-bit kernel only:
> > select ARCH_WANT_IPC_PARSE_VERSION
> > +   select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
> > select CLKSRC_I8253
> > select CLONE_BACKWARDS
> > select GENERIC_VDSO_32
> > @@ -28,6 +29,7 @@ config X86_64
> > select ARCH_HAS_GIGANTIC_PAGE
> > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > select ARCH_USE_CMPXCHG_LOCKREF
> > +   select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
> > select HAVE_ARCH_SOFT_DIRTY
> > select MODULES_USE_ELF_RELA
> > select NEED_DMA_MAP_STATE
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 8479be270c0b..2c5783985db5 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -320,9 +320,7 @@ int __init parse_crashkernel_low(char *cmdline,
> >   * - Crashkernel reservation --
> >   */
> >  
> > -#ifdef CONFIG_KEXEC_CORE
> > -
> > -#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > +#ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
> >  static int __init reserve_crashkernel_low(void)
> >  {
> >  #ifdef CONFIG_64BIT
> > @@ -450,8 +448,7 @@ void __init reserve_crashkernel(void)
> > crashk_res.start = crash_base;
> > crashk_res.end   = crash_base + crash_size - 1;
> >  }
> > -#endif
> > -#endif /* CONFIG_KEXEC_CORE */
> > +#endif /* CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL */
> 
> Why don't you move the dummy reserve_crashkernel() here too?
> 
> #ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
> #ifdef CONFIG_KEXEC_CORE
> ...
>   '...the real crashkernel reservation code...'
> ...
> #else 
> static void __init reserve_crashkernel(void)
> {
> }   
> #endif
> #endif /* CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL */
> 
> That way, you wouldn't need those two dummy reserve_crashkernel()
> functions in x86 and arm64?

Sorry, I was wrong. This is not possible, since
CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL is selected only if KEXEC_CORE is
true. Please ignore this comment.



Re: [PATCH v14 09/11] x86, arm64: Add ARCH_WANT_RESERVE_CRASH_KERNEL config

2021-02-18 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> We make the functions reserve_crashkernel[_low]() as generic for
> x86 and arm64. Since reserve_crashkernel[_low]() implementations
> are quite similar on other architectures as well, we can have more
> users of this later.
> 
> So have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL in arch/Kconfig and
> select this by X86 and ARM64.
> 
> Suggested-by: Mike Rapoport 
> Suggested-by: Baoquan He 
> Signed-off-by: Chen Zhou 
> ---
>  arch/Kconfig| 3 +++
>  arch/arm64/Kconfig  | 1 +
>  arch/x86/Kconfig| 2 ++
>  kernel/crash_core.c | 7 ++-
>  4 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..0ca1ff5bb157 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -24,6 +24,9 @@ config KEXEC_ELF
>  config HAVE_IMA_KEXEC
>   bool
>  
> +config ARCH_WANT_RESERVE_CRASH_KERNEL
> + bool
> +
>  config SET_FS
>   bool
>  
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index f39568b28ec1..09365c7ff469 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -82,6 +82,7 @@ config ARM64
>   select ARCH_WANT_FRAME_POINTERS
>   select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES 
> && !ARM64_VA_BITS_36)
>   select ARCH_WANT_LD_ORPHAN_WARN
> + select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>   select ARM_AMBA
>   select ARM_ARCH_TIMER
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 21f851179ff0..e6926fcb4a40 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -12,6 +12,7 @@ config X86_32
>   depends on !64BIT
>   # Options that are inherently 32-bit kernel only:
>   select ARCH_WANT_IPC_PARSE_VERSION
> + select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
>   select CLKSRC_I8253
>   select CLONE_BACKWARDS
>   select GENERIC_VDSO_32
> @@ -28,6 +29,7 @@ config X86_64
>   select ARCH_HAS_GIGANTIC_PAGE
>   select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>   select ARCH_USE_CMPXCHG_LOCKREF
> + select ARCH_WANT_RESERVE_CRASH_KERNEL if KEXEC_CORE
>   select HAVE_ARCH_SOFT_DIRTY
>   select MODULES_USE_ELF_RELA
>   select NEED_DMA_MAP_STATE
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 8479be270c0b..2c5783985db5 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -320,9 +320,7 @@ int __init parse_crashkernel_low(char *cmdline,
>   * - Crashkernel reservation --
>   */
>  
> -#ifdef CONFIG_KEXEC_CORE
> -
> -#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> +#ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
>  static int __init reserve_crashkernel_low(void)
>  {
>  #ifdef CONFIG_64BIT
> @@ -450,8 +448,7 @@ void __init reserve_crashkernel(void)
>   crashk_res.start = crash_base;
>   crashk_res.end   = crash_base + crash_size - 1;
>  }
> -#endif
> -#endif /* CONFIG_KEXEC_CORE */
> +#endif /* CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL */

Why don't you move the dummy reserve_crashkernel() here too?

#ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
#ifdef CONFIG_KEXEC_CORE
...
  '...the real crashkernel reservation code...'
...
#else 
static void __init reserve_crashkernel(void)
{
}   
#endif
#endif /* CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL */

That way, you wouldn't need those two dummy reserve_crashkernel()
functions in x86 and arm64?

Thanks
Baoquan

>  
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> void *data, size_t data_len)
> -- 
> 2.20.1
> 



Re: [PATCH v14 06/11] x86/elf: Move vmcore_elf_check_arch_cross to arch/x86/include/asm/elf.h

2021-02-17 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> Move macro vmcore_elf_check_arch_cross from arch/x86/include/asm/kexec.h
> to arch/x86/include/asm/elf.h to fix the following compiling warning:
> 
> make ARCH=i386
> In file included from arch/x86/kernel/setup.c:39:0:
> ./arch/x86/include/asm/kexec.h:77:0: warning: "vmcore_elf_check_arch_cross" 
> redefined
>  # define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
> 
> In file included from arch/x86/kernel/setup.c:9:0:
> ./include/linux/crash_dump.h:39:0: note: this is the location of the previous 
> definition
>  #define vmcore_elf_check_arch_cross(x) 0
> 
> The root cause is that vmcore_elf_check_arch_cross under CONFIG_CRASH_CORE
> depend on CONFIG_KEXEC_CORE. Commit 2db65f1db17d ("x86: kdump: move
> reserve_crashkernel[_low]() into crash_core.c") triggered the issue.
> 
> Suggested by Mike, simply move vmcore_elf_check_arch_cross from
> arch/x86/include/asm/kexec.h to arch/x86/include/asm/elf.h to fix
> the warning.
> 
> Fixes: 2db65f1db17d ("x86: kdump: move reserve_crashkernel[_low]() into 
> crash_core.c")

Where does this commit id '2db65f1db17d' come from? Here you are fixing
another patch in the same patchset. Please merge this with patch 05/11.

> Reported-by: kernel test robot 
> Suggested-by: Mike Rapoport 
> Signed-off-by: Chen Zhou 
> ---
>  arch/x86/include/asm/elf.h   | 3 +++
>  arch/x86/include/asm/kexec.h | 3 ---
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 66bdfe838d61..5333777cc758 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -94,6 +94,9 @@ extern unsigned int vdso32_enabled;
>  
>  #define elf_check_arch(x)elf_check_arch_ia32(x)
>  
> +/* We can also handle crash dumps from 64 bit kernel. */
> +# define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
> +
>  /* SVR4/i386 ABI (pages 3-31, 3-32) says that when the program starts %edx
> contains a pointer to a function which might be registered using `atexit'.
> This provides a mean for the dynamic linker to call DT_FINI functions for
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 2b18f918203e..6fcae01a9cca 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -72,9 +72,6 @@ struct kimage;
>  
>  /* The native architecture */
>  # define KEXEC_ARCH KEXEC_ARCH_386
> -
> -/* We can also handle crash dumps from 64 bit kernel. */
> -# define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
>  #else
>  /* Maximum physical address we can use pages from */
>  # define KEXEC_SOURCE_MEMORY_LIMIT  (MAXMEM-1)
> -- 
> 2.20.1
> 



Re: [PATCH v14 04/11] x86: kdump: move xen_pv_domain() check and insert_resource() to setup_arch()

2021-02-17 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> We will make the functions reserve_crashkernel() as generic, the
> xen_pv_domain() check in reserve_crashkernel() is relevant only to
> x86, the same as insert_resource() in reserve_crashkernel[_low]().
> So move xen_pv_domain() check and insert_resource() to setup_arch()
> to keep them in x86.
> 
> Suggested-by: Mike Rapoport 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/kernel/setup.c | 19 +++
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 086a04235be4..5d676efc32f6 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -454,7 +454,6 @@ static int __init reserve_crashkernel_low(void)
>  
>   crashk_low_res.start = low_base;
>   crashk_low_res.end   = low_base + low_size - 1;
> - insert_resource(&iomem_resource, &crashk_low_res);
>  #endif
>   return 0;
>  }
> @@ -478,11 +477,6 @@ static void __init reserve_crashkernel(void)
>   high = true;
>   }
>  
> - if (xen_pv_domain()) {
> - pr_info("Ignoring crashkernel for a Xen PV domain\n");
> - return;
> - }
> -
>   /* 0 means: find the address automatically */
>   if (!crash_base) {
>   /*
> @@ -529,7 +523,6 @@ static void __init reserve_crashkernel(void)
>  
>   crashk_res.start = crash_base;
>   crashk_res.end   = crash_base + crash_size - 1;
> - insert_resource(&iomem_resource, &crashk_res);
>  }
>  #else
>  static void __init reserve_crashkernel(void)
> @@ -1151,7 +1144,17 @@ void __init setup_arch(char **cmdline_p)
>* Reserve memory for crash kernel after SRAT is parsed so that it
>* won't consume hotpluggable memory.
>*/
> - reserve_crashkernel();
> + if (xen_pv_domain())
> + pr_info("Ignoring crashkernel for a Xen PV domain\n");
> + else {
> + reserve_crashkernel();
> +#ifdef CONFIG_KEXEC_CORE
> + if (crashk_res.end > crashk_res.start)
> +         insert_resource(&iomem_resource, &crashk_res);
> + if (crashk_low_res.end > crashk_low_res.start)
> + insert_resource(&iomem_resource, &crashk_low_res);
> +#endif

Acked-by: Baoquan He 

> + }
>  
>   memblock_find_dma_reserve();
>  
> -- 
> 2.20.1
> 



Re: [PATCH v14 02/11] x86: kdump: make the lower bound of crash kernel reservation consistent

2021-02-17 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> The lower bounds of crash kernel reservation and crash kernel low
> reservation are different, use the consistent value CRASH_ALIGN.
> 
> Suggested-by: Dave Young 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/kernel/setup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index da769845597d..27470479e4a3 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -439,7 +439,8 @@ static int __init reserve_crashkernel_low(void)
>   return 0;
>   }
>  
> - low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 0, 
> CRASH_ADDR_LOW_MAX);
> + low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, CRASH_ALIGN,
> + CRASH_ADDR_LOW_MAX);

Acked-by: Baoquan He 

>   if (!low_base) {
>   pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> smaller size.\n",
>  (unsigned long)(low_size >> 20));
> -- 
> 2.20.1
> 



Re: [PATCH v14 01/11] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2021-02-17 Thread Baoquan He
On 01/30/21 at 03:10pm, Chen Zhou wrote:
> Move CRASH_ALIGN to header asm/kexec.h for later use. Besides, the
> alignment of crash kernel regions in x86 is 16M(CRASH_ALIGN), but
> function reserve_crashkernel() also used 1M alignment. So just
> replace hard-coded alignment 1M with macro CRASH_ALIGN.
> 
> Suggested-by: Dave Young 
> Suggested-by: Baoquan He 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/include/asm/kexec.h | 3 +++
>  arch/x86/kernel/setup.c  | 5 +
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 6802c59e8252..be18dc7ae51f 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -18,6 +18,9 @@
>  
>  # define KEXEC_CONTROL_CODE_MAX_SIZE 2048
>  
> +/* 16M alignment for crash kernel regions */
> +#define CRASH_ALIGN  SZ_16M
> +
>  #ifndef __ASSEMBLY__
>  
>  #include 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3412c4595efd..da769845597d 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -390,9 +390,6 @@ static void __init 
> memblock_x86_reserve_range_setup_data(void)
>  
>  #ifdef CONFIG_KEXEC_CORE
>  
> -/* 16M alignment for crash kernel regions */
> -#define CRASH_ALIGN  SZ_16M
> -
>  /*
>   * Keep the crash kernel below this limit.
>   *
> @@ -510,7 +507,7 @@ static void __init reserve_crashkernel(void)
>   } else {
>   unsigned long long start;
>  
> - start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
> + start = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 
> crash_base,
> crash_base + crash_size);

Looks good to me, thx.

Acked-by: Baoquan He 

>   if (start != crash_base) {
>   pr_info("crashkernel reservation failed - memory is in 
> use.\n");
> -- 
> 2.20.1
> 



Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-17 Thread Baoquan He
On 02/11/21 at 10:08am, Saeed Mirzamohammadi wrote:
> This adds crashkernel=auto feature to configure reserved memory for
> vmcore creation. CONFIG_CRASH_AUTO_STR is defined to be set for
> different kernel distributions and different archs based on their
> needs.
> 
> Signed-off-by: Saeed Mirzamohammadi 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> ---
>  Documentation/admin-guide/kdump/kdump.rst |  3 ++-
>  .../admin-guide/kernel-parameters.txt |  6 +
>  arch/Kconfig  | 24 +++
>  kernel/crash_core.c   |  7 ++
>  4 files changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kdump/kdump.rst 
> b/Documentation/admin-guide/kdump/kdump.rst
> index 2da65fef2a1c..e55cdc404c6b 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -285,7 +285,8 @@ This would mean:
>  2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>  3) if the RAM size is larger than 2G, then reserve 128M
>  
> -
> +Or you can use crashkernel=auto to choose the crash kernel memory size
> +based on the recommended configuration set for each arch.
>  
>  Boot into System Kernel
>  ===
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 7d4e523646c3..aa2099465458 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -736,6 +736,12 @@
>   a memory unit (amount[KMG]). See also
>   Documentation/admin-guide/kdump/kdump.rst for an 
> example.
>  
> + crashkernel=auto
> + [KNL] This parameter will set the reserved memory for
> + the crash kernel based on the value of the 
> CRASH_AUTO_STR
> + that is the best effort estimation for each arch. See 
> also
> + arch/Kconfig for further details.
> +
>   crashkernel=size[KMG],high
>   [KNL, X86-64] range could be above 4G. Allow kernel
>   to allocate physical memory region from top, so could
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..f87c88ffa2f8 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -14,6 +14,30 @@ menu "General architecture-dependent options"
>  config CRASH_CORE
>   bool
>  
> +if CRASH_CORE
> +
> +config CRASH_AUTO_STR
> + string "Memory reserved for crash kernel"
> + depends on CRASH_CORE
> + default "1G-64G:128M,64G-1T:256M,1T-:512M"
> + help
> +   This configures the reserved memory dependent
> +   on the value of System RAM. The syntax is:
> +   crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> +   range=start-[end]
> +
> +   For example:
> +   crashkernel=512M-2G:64M,2G-:128M
> +
> +   This would mean:
> +
> +   1) if the RAM is smaller than 512M, then don't reserve anything
> +  (this is the "rescue" case)
> +   2) if the RAM size is between 512M and 2G (exclusive), then 
> reserve 64M
> +   3) if the RAM size is larger than 2G, then reserve 128M
> +
> +endif # CRASH_CORE

Wondering if this CRASH_CORE ifdeffery is a little redundant here
since the CRASH_CORE dependency has been added. Except for this, I like this
patch. As we discussed in private threads, we can try to push it into
mainline and continue improving later.
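
As an aside, here is a small userspace-only sketch of how such a range
string maps the System RAM size to a reservation size (illustrative
only; parse_size() and pick_crash_size() are made-up helpers, not the
kernel's real parsing code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "128M", "1G", "1T", ... into bytes (simplified memparse()). */
static unsigned long long parse_size(const char *s, char **retp)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'K': v <<= 10; end++; break;
	case 'M': v <<= 20; end++; break;
	case 'G': v <<= 30; end++; break;
	case 'T': v <<= 40; end++; break;
	}
	if (retp)
		*retp = end;
	return v;
}

/* Pick the size whose "start-[end]" range contains the RAM size. */
static unsigned long long pick_crash_size(const char *spec,
					  unsigned long long ram)
{
	char buf[256];
	char *range, *save;

	strncpy(buf, spec, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (range = strtok_r(buf, ",", &save); range;
	     range = strtok_r(NULL, ",", &save)) {
		char *p = range;
		unsigned long long start, end = ~0ULL, size;

		start = parse_size(p, &p);
		if (*p++ != '-')
			continue;
		if (*p != ':')		/* empty end means "no upper bound" */
			end = parse_size(p, &p);
		if (*p++ != ':')
			continue;
		size = parse_size(p, &p);
		if (ram >= start && ram < end)
			return size;
	}
	return 0;	/* below the first range: reserve nothing */
}

int main(void)
{
	const char *spec = "1G-64G:128M,64G-1T:256M,1T-:512M";

	/* A 32 GiB machine falls into 1G-64G and gets 128 MiB reserved. */
	printf("%llu MiB\n", pick_crash_size(spec, 32ULL << 30) >> 20);
	return 0;
}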


> +
>  config KEXEC_CORE
>   select CRASH_CORE
>   bool
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..ab0a2b4b1ffa 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -250,6 +251,12 @@ static int __init __parse_crashkernel(char *cmdline,
>   if (suffix)
>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
>   suffix);
> +#ifdef CONFIG_CRASH_AUTO_STR
> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
> + ck_cmdline = CONFIG_CRASH_AUTO_STR;
> + pr_info("Using crashkernel=auto, the size chosen is a best 
> effort estimation.\n");
> + }
> +#endif
>   /*
>* if the commandline contains a ':', then that's the extended
>* syntax -- if not, it must be the classic syntax
> -- 
> 2.27.0
> 



Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory

2021-02-01 Thread Baoquan He
On 02/01/21 at 04:34pm, Mike Rapoport wrote:
> On Mon, Feb 01, 2021 at 07:26:05PM +0800, Baoquan He wrote:
> > On 02/01/21 at 10:32am, David Hildenbrand wrote:
> > > 
> > > 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
> > > PAGES_PER_SECTION) handling, consider range
> > >   [round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
> > > which would handle in the x86-64 case [0..0] and, therefore, initialize 
> > > PFN
> > > 0.
> > 
> > Sounds reasonable. Maybe we can change to get the real expected lowest
> > pfn from find_min_pfn_for_node() by iterating memblock.memory and
> > memblock.reserved and comparing.
> 
> As I've found out the hard way [1], reserved memory is not necessarily present.
> 
> There could be a system that instead of reserving memory at 0xfe00 like
> in Guillaume's report, could have it reserved at 0x0 and populated only
> from the first gigabyte...

OK. I thought that we could even compare memblock.memory.regions[0].base
with memblock.reserved.regions[0].base, take the smaller one as the
lowest pfn, and assign it to arch_zone_lowest_possible_pfn[0]. When we
try to get the present pages, we still check memblock.memory with
for_each_mem_pfn_range(). Since we will take reserved memory into the
zone anyway, arch_zone_lowest_possible_pfn[] only impacts the zone
boundary. Just a rough thought; please ignore it if something is
missed.
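
Something like this rough, uncompiled sketch is what I mean
(illustration only, not a real patch; find_lowest_pfn() is a made-up
helper name):

/*
 * Take the smaller of the first memory/reserved region bases as the
 * lowest possible pfn.  Assumes both memblock types have at least one
 * region registered at this point.
 */
static unsigned long __init find_lowest_pfn(void)
{
	phys_addr_t base = memblock.memory.regions[0].base;

	if (memblock.reserved.cnt &&
	    memblock.reserved.regions[0].base < base)
		base = memblock.reserved.regions[0].base;

	return PHYS_PFN(base);
}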

Thanks
Baoquan



Re: [PATCH v4 1/2] x86/setup: always add the beginning of RAM as memblock.memory

2021-02-01 Thread Baoquan He
On 02/01/21 at 10:32am, David Hildenbrand wrote:
> On 30.01.21 23:10, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The physical memory on an x86 system starts at address 0, but this is not
> > always reflected in e820 map. For example, the BIOS can have e820 entries
> > like
> > 
> > [0.00] BIOS-provided physical RAM map:
> > [0.00] BIOS-e820: [mem 0x1000-0x0009] usable
> > 
> > or
> > 
> > [0.00] BIOS-provided physical RAM map:
> > [0.00] BIOS-e820: [mem 0x-0x0fff] 
> > reserved
> > [0.00] BIOS-e820: [mem 0x1000-0x00057fff] usable
> > 
> > In either case, e820__memblock_setup() won't add the range 0x - 0x1000
> > to memblock.memory and later during memory map initialization this range is
> > left outside any zone.
> > 
> > With SPARSEMEM=y there is always a struct page for pfn 0 and this struct
> > page will have it's zone link wrong no matter what value will be set there.
> > 
> > To avoid this inconsistency, add the beginning of RAM to memblock.memory.
> > Limit the added chunk size to match the reserved memory to avoid
> > registering memory that may be used by the firmware but never reserved at
> > e820__memblock_setup() time.
> > 
> > Fixes: bde9cfa3afe4 ("x86/setup: don't remove E820_TYPE_RAM for pfn 0")
> > Signed-off-by: Mike Rapoport 
> > Cc: sta...@vger.kernel.org
> > ---
> >   arch/x86/kernel/setup.c | 8 
> >   1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3412c4595efd..67c77ed6eef8 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -727,6 +727,14 @@ static void __init trim_low_memory_range(void)
> >  * Kconfig help text for X86_RESERVE_LOW.
> >  */
> > memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> > +
> > +   /*
> > +* Even if the firmware does not report the memory at address 0 as
> > +* usable, inform the generic memory management about its existence
> > +* to ensure it is a part of ZONE_DMA and the memory map for it is
> > +* properly initialized.
> > +*/
> > +   memblock_add(0, ALIGN(reserve_low, PAGE_SIZE));
> >   }
> > 
> >   /*
> > 
> 
> I think, to make that code more robust, and to not rely on archs to do the
> right thing, we should do something like
> 
> 1) Make sure in free_area_init() that each PFN with a memmap (i.e., falls
> into a partial present section) is spanned by a zone; that would include PFN
> 0 in this case.
> 
> 2) In init_zone_unavailable_mem(), similar to round_up(max_pfn,
> PAGES_PER_SECTION) handling, consider range
>   [round_down(min_pfn, PAGES_PER_SECTION), min_pfn - 1]
> which would handle in the x86-64 case [0..0] and, therefore, initialize PFN
> 0.

Sounds reasonable. Maybe we can change find_min_pfn_for_node() to get
the real expected lowest pfn by iterating both memblock.memory and
memblock.reserved and comparing them.

> 
> Also, I think the special-case of PFN 0 is analogous to the
> round_up(max_pfn, PAGES_PER_SECTION) handling in
> init_zone_unavailable_mem(): who guarantees that these PFN above the highest
> present PFN are actually spanned by a zone?
> 
> I'd suggest going through all zone ranges in free_area_init() first, dealing
> with zones that have "not section aligned start/end", clamping them up/down
> if required such that no holes within a section are left uncovered by a
> zone.
> 
> -- 
> Thanks,
> 
> David / dhildenb



Re: [PATCH v3 2/2] mm: fix initialization of struct page for holes in memory layout

2021-02-01 Thread Baoquan He
On 02/01/21 at 10:14am, David Hildenbrand wrote:
> On 11.01.21 20:40, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > There could be struct pages that are not backed by actual physical memory.
> > This can happen when the actual memory bank is not a multiple of
> > SECTION_SIZE or when an architecture does not register memory holes
> > reserved by the firmware as memblock.memory.
> > 
> > Such pages are currently initialized using init_unavailable_mem() function
> > that iterates through PFNs in holes in memblock.memory and if there is a
> > struct page corresponding to a PFN, the fields of this page are set to
> > default values and the page is marked as Reserved.
> > 
> > init_unavailable_mem() does not take into account zone and node the page
> > belongs to and sets both zone and node links in struct page to zero.
> > 
> > On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> > instance in a configuration below:
> > 
> > # grep -A1 E820 /proc/iomem
> > 7a17b000-7a216fff : Unknown E820 type
> > 7a217000-7bff : System RAM
> > 
> > unset zone link in struct page will trigger
> > 
> > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > 
> > because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in
> > struct page) in the same pageblock.
> > 
> > Update init_unavailable_mem() to use zone constraints defined by an
> > architecture to properly setup the zone link and use node ID of the
> > adjacent range in memblock.memory to set the node link.
> > 
> > Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather 
> > that check each PFN")
> > Reported-by: Andrea Arcangeli 
> > Signed-off-by: Mike Rapoport 
> > ---
> >   mm/page_alloc.c | 84 +
> >   1 file changed, 50 insertions(+), 34 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bdbec4c98173..0b56c3ca354e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7077,23 +7077,26 @@ void __init free_area_init_memoryless_node(int nid)
> >* Initialize all valid struct pages in the range [spfn, epfn) and mark 
> > them
> >* PageReserved(). Return the number of struct pages that were 
> > initialized.
> >*/
> > -static u64 __init init_unavailable_range(unsigned long spfn, unsigned long 
> > epfn)
> > +static u64 __init init_unavailable_range(unsigned long spfn, unsigned long 
> > epfn,
> > +int zone, int nid)
> >   {
> > -   unsigned long pfn;
> > +   unsigned long pfn, zone_spfn, zone_epfn;
> > u64 pgcnt = 0;
> > +   zone_spfn = arch_zone_lowest_possible_pfn[zone];
> > +   zone_epfn = arch_zone_highest_possible_pfn[zone];
> > +
> > +   spfn = clamp(spfn, zone_spfn, zone_epfn);
> > +   epfn = clamp(epfn, zone_spfn, zone_epfn);
> > +
> > for (pfn = spfn; pfn < epfn; pfn++) {
> > if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> > pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> > + pageblock_nr_pages - 1;
> > continue;
> > }
> > -   /*
> > -* Use a fake node/zone (0) for now. Some of these pages
> > -* (in memblock.reserved but not in memblock.memory) will
> > -* get re-initialized via reserve_bootmem_region() later.
> > -*/
> > -   __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> > +
> > +   __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
> > __SetPageReserved(pfn_to_page(pfn));
> > pgcnt++;
> > }
> > @@ -7102,51 +7105,64 @@ static u64 __init init_unavailable_range(unsigned 
> > long spfn, unsigned long epfn)
> >   }
> >   /*
> > - * Only struct pages that are backed by physical memory are zeroed and
> > - * initialized by going through __init_single_page(). But, there are some
> > - * struct pages which are reserved in memblock allocator and their fields
> > - * may be accessed (for example page_to_pfn() on some configuration 
> > accesses
> > - * flags). We must explicitly initialize those struct pages.
> > + * Only struct pages that correspond to ranges defined by memblock.memory
> > + * are zeroed and initialized by going through __init_single_page() during
> > + * memmap_init().
> > + *
> > + * But, there could be struct pages that correspond to holes in
> > + * memblock.memory. This can happen because of the following reasons:
> > + * - phyiscal memory bank size is not necessarily the exact multiple of the
> > + *   arbitrary section size
> > + * - early reserved memory may not be listed in memblock.memory
> > + * - memory layouts defined with memmap= kernel parameter may not align
> > + *   nicely with memmap sections
> >*
> > - * This function also addresses a similar issue where struct pages are left
> > - * uninitialized because the physical address range is not covered by
> > - * memblock.memory or 

Re: [PATCH v2 1/1] kexec: dump kmessage before machine_kexec

2021-01-27 Thread Baoquan He
On 01/26/21 at 03:41pm, Pavel Tatashin wrote:
> kmsg_dump(KMSG_DUMP_SHUTDOWN) is called before
> machine_restart(), machine_halt(), machine_power_off(), the only one that
> is missing is  machine_kexec().
> 
> The dmesg output that it contains can be used to study the shutdown
> performance of both kernel and systemd during kexec reboot.
> 
> Here is example of dmesg data collected after kexec:
> 
> root@dplat-cp22:~# cat /sys/fs/pstore/dmesg-ramoops-0 | tail
> ...
> <6>[   70.914592] psci: CPU3 killed (polled 0 ms)
> <5>[   70.915705] CPU4: shutdown
> <6>[   70.916643] psci: CPU4 killed (polled 4 ms)
> <5>[   70.917715] CPU5: shutdown
> <6>[   70.918725] psci: CPU5 killed (polled 0 ms)
> <5>[   70.919704] CPU6: shutdown
> <6>[   70.920726] psci: CPU6 killed (polled 4 ms)
> <5>[   70.921642] CPU7: shutdown
> <6>[   70.922650] psci: CPU7 killed (polled 0 ms)
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Kees Cook 
> Reviewed-by: Petr Mladek 
> Reviewed-by: Bhupesh Sharma 
> ---
>  kernel/kexec_core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 4f8efc278aa7..e253c8b59145 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1180,6 +1181,7 @@ int kernel_kexec(void)
>   machine_shutdown();
>   }
>  
> + kmsg_dump(KMSG_DUMP_SHUTDOWN);
>   machine_kexec(kexec_image);

Looks good to me, thx.

Acked-by: Baoquan He 

>  
>  #ifdef CONFIG_KEXEC_JUMP
> -- 
> 2.25.1
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 



Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

2021-01-27 Thread Baoquan He
On 01/27/21 at 08:26pm, Mike Rapoport wrote:
> Hi Lukasz,
> 
> On Wed, Jan 27, 2021 at 02:15:53PM +0100, Łukasz Majczak wrote:
> > Hi Mike,
> > 
> > I have started bisecting your patch and I have figured out that there
> > might be something wrong with the clamping - with these lines commented
> > out, it started to work.
> > The full log (with logs from below patch) can be found here:
> > https://gist.github.com/semihalf-majczak-lukasz/3cecbab0ddc59a6c3ce11ddc29645725
> > it's fresh - I haven't analyze it yet, just sharing with hope it will help.
> 
> Thanks, that helps!
> 
> The first page is never considered by the kernel as memory and so
> arch_zone_lowest_possible_pfn[ZONE_DMA] is set to 0x1000. As the result,
> init_unavailable_mem() skips pfn 0 and then __SetPageReserved(page) in
> reserve_bootmem_region() panics because the struct page for pfn 0 remains
> poisoned.

It's a great finding and a quick fix. Previously I tested my cleanup
patches based on Mike's commit 9ebeee59af4cdd4d ("mm: fix initialization
of struct page for holes in memory layout") on a hardware system and
didn't hit this crash. But this crash seems to be reproducible every
time, so I wonder why I didn't reproduce it.

> 
> Can you please try the below patch on top of v5.11-rc5?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 783913e41f65..3ce9ef238dfc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7083,10 +7083,11 @@ void __init free_area_init_memoryless_node(int nid)
>  static u64 __init init_unavailable_range(unsigned long spfn, unsigned long 
> epfn,
>int zone, int nid)
>  {
> - unsigned long pfn, zone_spfn, zone_epfn;
> + unsigned long pfn, zone_spfn = 0, zone_epfn;
>   u64 pgcnt = 0;
>  
> - zone_spfn = arch_zone_lowest_possible_pfn[zone];
> + if (zone > 0)
> + zone_spfn = arch_zone_highest_possible_pfn[zone - 1];
>   zone_epfn = arch_zone_highest_possible_pfn[zone];
>  
>   spfn = clamp(spfn, zone_spfn, zone_epfn);
> 
>  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index eed54ce26ad1..9f4468c413a1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7093,9 +7093,11 @@ static u64 __init
> > init_unavailable_range(unsigned long spfn, unsigned long epfn,
> > zone_spfn = arch_zone_lowest_possible_pfn[zone];
> > zone_epfn = arch_zone_highest_possible_pfn[zone];
> > 
> > -   spfn = clamp(spfn, zone_spfn, zone_epfn);
> > -   epfn = clamp(epfn, zone_spfn, zone_epfn);
> > -
> > +   //spfn = clamp(spfn, zone_spfn, zone_epfn);
> > +   //epfn = clamp(epfn, zone_spfn, zone_epfn);
> > +   pr_info("LMA DBG: zone_spfn: %llx, zone_epfn %llx\n",
> > zone_spfn, zone_epfn);
> > +   pr_info("LMA DBG: spfn: %llx, epfn %llx\n", spfn, epfn);
> > +   pr_info("LMA DBG: clamp_spfn: %llx, clamp_epfn %llx\n",
> > clamp(spfn, zone_spfn, zone_epfn), clamp(epfn, zone_spfn, zone_epfn));
> > for (pfn = spfn; pfn < epfn; pfn++) {
> > if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> > pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> > 
> > Best regards,
> > Lukasz
> > 
> > 
> > śr., 27 sty 2021 o 13:15 Łukasz Majczak  napisał(a):
> > >
> > > Unfortunately nothing :( my current kernel command line contains:
> > > console=ttyS0,115200n8 debug earlyprintk=serial loglevel=7
> > >
> > > I was thinking about using earlycon, but it seems to be blocked.
> > > (I think the lack of earlycon might be related to Chromebook HW
> > > security design. There is an EC controller which is a part of AP ->
> > > serial chain as kernel messages are considered sensitive from a
> > > security standpoint.)
> > >
> > > Best regards,
> > > Lukasz
> > >
> > > śr., 27 sty 2021 o 12:19 Mike Rapoport  napisał(a):
> > > >
> > > > On Wed, Jan 27, 2021 at 11:08:17AM +0100, Łukasz Majczak wrote:
> > > > > Hi Mike,
> > > > >
> > > > > Actually I have a serial console attached (via servo device), but
> > > > > there is no output :( and also the reboot/crash is very fast/immediate
> > > > > after power on.
> > > >
> > > > If you boot with earlyprintk=serial are there any messages?
> > > >
> > > > > Best regards
> > > > > Lukasz
> > > > >
> > > > > śr., 27 sty 2021 o 11:05 Mike Rapoport  
> > > > > napisał(a):
> > > > > >
> > > > > > Hi Lukasz,
> > > > > >
> > > > > > On Wed, Jan 27, 2021 at 10:22:29AM +0100, Łukasz Majczak wrote:
> > > > > > > Crash after mm: fix initialization of struct page for holes in 
> > > > > > > memory layout
> > > > > > >
> > > > > > > Hi,
> > > > > > > I was trying to run v5.11-rc5 on my Samsung Chromebook Pro 
> > > > > > > (Caroline),
> > > > > > > but I've noticed it has crashed - unfortunately it seems to 
> > > > > > > happen at
> > > > > > > a very early stage - No output to the console nor to the screen, 
> > > > > > > so I
> > > > > > > have started a bisect (between 5.11-rc4 - which works just find - 
> > > > > > > and
> > > > > > > 5.11-rc5),
> > 

[PATCH v5 0/5] mm: clean up names and parameters of memmap_init_xxxx functions

2021-01-22 Thread Baoquan He
This patchset corrects the inappropriate function names of
memmap_init_xxx and simplifies the parameters of functions in the code
flow. It also fixes a prototype warning reported by lkp.

This is based on the latest next/master.

V4 can be found here:
https://lore.kernel.org/linux-mm/20210120045213.6571-1-...@redhat.com/

v4->v5:
 - Add patch 1 to the series, which fixes a prototype warning from the
   kernel test robot. Then rebase the v4 patches on top of it.

v3->v4:
 - Rebased patches 1 and 2 on top of Mike's new patch below.
   [PATCH v3 0/2] mm: fix initialization of struct page for holes in memory layout

 - Move the code of renaming function parameter 'range_start_pfn' and local
   variable 'range_end_pfn' of memmap_init() from patch 1 to patch 2
   according to David's comment.

 - Use the reverse Christmas tree style to reorder the local variables
   in memmap_init_zone() in patch 2 according to David's comment.

Baoquan He (5):
  mm: fix prototype warning from kernel test robot
  mm: rename memmap_init() and memmap_init_zone()
  mm: simplify parater of function memmap_init_zone()
  mm: simplify parameter of setup_usemap()
  mm: remove unneeded local variable in free_area_init_core

 arch/ia64/include/asm/pgtable.h |  6 -
 arch/ia64/mm/init.c | 14 +-
 include/linux/mm.h  |  3 ++-
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 46 ++---
 5 files changed, 31 insertions(+), 40 deletions(-)

-- 
2.17.2



[PATCH v5 1/5] mm: fix prototype warning from kernel test robot

2021-01-22 Thread Baoquan He
The kernel test robot, calling make with 'W=1', triggers a warning like
the one below for the memmap_init_zone() function.

mm/page_alloc.c:6259:23: warning: no previous prototype for 'memmap_init_zone' 
[-Wmissing-prototypes]
 6259 | void __meminit __weak memmap_init_zone(unsigned long size, int nid,
  |   ^~~~

Fix it by adding the function declaration in include/linux/mm.h.
Since memmap_init_zone() has a generic version marked '__weak',
the declaration in the ia64 header file can simply be removed.

Signed-off-by: Baoquan He 
Reported-by: kernel test robot 
---
 arch/ia64/include/asm/pgtable.h | 6 --
 include/linux/mm.h  | 2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 779b6972aa84..9b4efe89e62d 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -517,12 +517,6 @@ extern struct page *zero_page_memmap_ptr;
__changed;  \
 })
 #endif
-
-#  ifdef CONFIG_VIRTUAL_MEM_MAP
-  /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init (unsigned long size, int nid, unsigned long zone,
-unsigned long start_pfn);
-#  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3dac7bc667ee..3d82b4f7cabc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,6 +2401,8 @@ extern void set_dma_reserve(unsigned long 
new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
+extern void memmap_init(unsigned long size, int nid,
+   unsigned long zone, unsigned long range_start_pfn);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
-- 
2.17.2



[PATCH v5 2/5] mm: rename memmap_init() and memmap_init_zone()

2021-01-22 Thread Baoquan He
The current memmap_init_zone() only handles a memory region inside one
zone, while it is actually memmap_init() that does the memmap init of a
whole zone. So rename both of them accordingly.

Signed-off-by: Baoquan He 
---
 arch/ia64/mm/init.c | 6 +++---
 include/linux/mm.h  | 4 ++--
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 8 
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e76386a3479e..c8e68e92beb3 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
/ sizeof(struct page));
 
if (map_start < map_end)
-   memmap_init_zone((unsigned long)(map_end - map_start),
+   memmap_init_range((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
 void __meminit
-memmap_init (unsigned long size, int nid, unsigned long zone,
+memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3d82b4f7cabc..2395dc212221 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2398,10 +2398,10 @@ extern int __meminit early_pfn_to_nid(unsigned long 
pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long,
+extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
-extern void memmap_init(unsigned long size, int nid,
+extern void memmap_init_zone(unsigned long size, int nid,
unsigned long zone, unsigned long range_start_pfn);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..ddcb1cd24c60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
+   memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 44ec5594798d..42a1d2d2a87d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6117,7 +6117,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
  * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
+void __meminit memmap_init_range(unsigned long size, int nid, unsigned long 
zone,
unsigned long start_pfn, unsigned long zone_end_pfn,
enum meminit_context context,
struct vmem_altmap *altmap, int migratetype)
@@ -6254,7 +6254,7 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-void __meminit __weak memmap_init(unsigned long size, int nid,
+void __meminit __weak memmap_init_zone(unsigned long size, int nid,
  unsigned long zone,
  unsigned long range_start_pfn)
 {
@@ -6268,7 +6268,7 @@ void __meminit __weak memmap_init(unsigned long size, int 
nid,
 
if (end_pfn > start_pfn) {
size = end_pfn - start_pfn;
-   memmap_init_zone(size, nid, zone, start_pfn, 
range_end_pfn,
+   memmap_init_range(size, nid, zone, start_pfn, 
range_end_pfn,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
}
@@ -6978,7 +6978,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
set_pageblock_order();
setup_usemap(pgdat, zone, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
-   memmap_init(size, nid, j, zone_start_pfn);
+   memmap_init_zone(size, nid, j, zone_start_pfn);
}
 }
 
-- 
2.17.2



[PATCH v5 4/5] mm: simplify parameter of setup_usemap()

2021-01-22 Thread Baoquan He
Parameter 'zone' already carries all the needed information, so remove the
other, now unnecessary parameters.

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
---
 mm/page_alloc.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cbb67d9c1b2a..69cf19baac12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6761,25 +6761,22 @@ static unsigned long __init usemap_size(unsigned long 
zone_start_pfn, unsigned l
return usemapsize / 8;
 }
 
-static void __ref setup_usemap(struct pglist_data *pgdat,
-   struct zone *zone,
-   unsigned long zone_start_pfn,
-   unsigned long zonesize)
+static void __ref setup_usemap(struct zone *zone)
 {
-   unsigned long usemapsize = usemap_size(zone_start_pfn, zonesize);
+   unsigned long usemapsize = usemap_size(zone->zone_start_pfn,
+  zone->spanned_pages);
zone->pageblock_flags = NULL;
if (usemapsize) {
zone->pageblock_flags =
memblock_alloc_node(usemapsize, SMP_CACHE_BYTES,
-   pgdat->node_id);
+   zone_to_nid(zone));
if (!zone->pageblock_flags)
panic("Failed to allocate %ld bytes for zone %s 
pageblock flags on node %d\n",
- usemapsize, zone->name, pgdat->node_id);
+ usemapsize, zone->name, zone_to_nid(zone));
}
 }
 #else
-static inline void setup_usemap(struct pglist_data *pgdat, struct zone *zone,
-   unsigned long zone_start_pfn, unsigned long 
zonesize) {}
+static inline void setup_usemap(struct zone *zone) {}
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -6974,7 +6971,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
continue;
 
set_pageblock_order();
-   setup_usemap(pgdat, zone, zone_start_pfn, size);
+   setup_usemap(zone);
init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init_zone(zone);
}
-- 
2.17.2



[PATCH v5 3/5] mm: simplify parameter of function memmap_init_zone()

2021-01-22 Thread Baoquan He
As David suggested, simply passing 'struct zone *zone' is enough. We can
get all needed information from 'struct zone*' easily.

Suggested-by: David Hildenbrand 
Signed-off-by: Baoquan He 
---
 arch/ia64/mm/init.c | 12 +++-
 include/linux/mm.h  |  3 +--
 mm/page_alloc.c | 24 +++-
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index c8e68e92beb3..88fb44895408 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -541,12 +541,14 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
return 0;
 }
 
-void __meminit
-memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-unsigned long start_pfn)
+void __meminit memmap_init_zone(struct zone *zone)
 {
+   int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long start_pfn = zone->zone_start_pfn;
+   unsigned long size = zone->spanned_pages;
+
if (!vmem_map) {
-   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + 
size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
@@ -556,7 +558,7 @@ memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
args.start = start;
args.end = start + size;
args.nid = nid;
-   args.zone = zone;
+   args.zone = zone_id;
 
efi_memmap_walk(virtual_memmap_init, &args);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2395dc212221..073049bd0b29 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,8 +2401,7 @@ extern void set_dma_reserve(unsigned long 
new_dma_reserve);
 extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
-extern void memmap_init_zone(unsigned long size, int nid,
-   unsigned long zone, unsigned long range_start_pfn);
+extern void memmap_init_zone(struct zone *zone);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 42a1d2d2a87d..cbb67d9c1b2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6254,23 +6254,21 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-void __meminit __weak memmap_init_zone(unsigned long size, int nid,
- unsigned long zone,
- unsigned long range_start_pfn)
+void __meminit __weak memmap_init_zone(struct zone *zone)
 {
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
+   unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
+   int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
unsigned long start_pfn, end_pfn;
-   unsigned long range_end_pfn = range_start_pfn + size;
-   int i;
 
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
-   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
-   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+   start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
+   end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
 
-   if (end_pfn > start_pfn) {
-   size = end_pfn - start_pfn;
-   memmap_init_range(size, nid, zone, start_pfn, 
range_end_pfn,
-MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
-   }
+   if (end_pfn > start_pfn)
+   memmap_init_range(end_pfn - start_pfn, nid,
+   zone_id, start_pfn, zone_end_pfn,
+   MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
 }
 
@@ -6978,7 +6976,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
set_pageblock_order();
setup_usemap(pgdat, zone, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
-   memmap_init_zone(size, nid, j, zone_start_pfn);
+   memmap_init_zone(zone);
}
 }
 
-- 
2.17.2



[PATCH v5 5/5] mm: remove unneeded local variable in free_area_init_core

2021-01-22 Thread Baoquan He
Local variable 'zone_start_pfn' is not needed since there's only
one call site in free_area_init_core(). Let's remove it and pass
zone->zone_start_pfn directly to init_currently_empty_zone().

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69cf19baac12..e0df67948ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6923,7 +6923,6 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, freesize, memmap_pages;
-   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
freesize = zone->present_pages;
@@ -6972,7 +6971,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
 
set_pageblock_order();
setup_usemap(zone);
-   init_currently_empty_zone(zone, zone_start_pfn, size);
+   init_currently_empty_zone(zone, zone->zone_start_pfn, size);
memmap_init_zone(zone);
}
 }
-- 
2.17.2



Re: [PATCH] mm: fix prototype warning from kernel test robot

2021-01-22 Thread Baoquan He
On 01/22/21 at 09:55am, David Hildenbrand wrote:
> On 22.01.21 09:46, Baoquan He wrote:
> > On 01/22/21 at 09:40am, David Hildenbrand wrote:
> >> On 22.01.21 08:03, Baoquan He wrote:
> >>> Kernel test robot calling make with 'W=1' triggers the warning below
> >>> for the memmap_init_zone() function.
> >>>
> >>> mm/page_alloc.c:6259:23: warning: no previous prototype for 
> >>> 'memmap_init_zone' [-Wmissing-prototypes]
> >>>  6259 | void __meminit __weak memmap_init_zone(unsigned long size, int 
> >>> nid,
> >>>   |   ^~~~
> >>>
> >>> Fix it by adding the function declaration in include/linux/mm.h.
> >>> Since memmap_init_zone() has a generic version with '__weak',
> >>> the declaration in the ia64 header file can be simply removed.
> >>>
> >>> Signed-off-by: Baoquan He 
> >>> Reported-by: kernel test robot 
> >>> ---
> >>>  arch/ia64/include/asm/pgtable.h | 5 -
> >>>  include/linux/mm.h  | 1 +
> >>>  2 files changed, 1 insertion(+), 5 deletions(-)
> >>>
> >>> diff --git a/arch/ia64/include/asm/pgtable.h 
> >>> b/arch/ia64/include/asm/pgtable.h
> >>> index 2c81394a2430..9b4efe89e62d 100644
> >>> --- a/arch/ia64/include/asm/pgtable.h
> >>> +++ b/arch/ia64/include/asm/pgtable.h
> >>> @@ -517,11 +517,6 @@ extern struct page *zero_page_memmap_ptr;
> >>>   __changed;  \
> >>>  })
> >>>  #endif
> >>> -
> >>> -#  ifdef CONFIG_VIRTUAL_MEM_MAP
> >>> -  /* arch mem_map init routine is needed due to holes in a virtual 
> >>> mem_map */
> >>> -extern void memmap_init_zone(struct zone *zone);
> >>> -#  endif /* CONFIG_VIRTUAL_MEM_MAP */
> >>>  # endif /* !__ASSEMBLY__ */
> >>>  
> >>>  /*
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index 56bb239f9150..073049bd0b29 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -2401,6 +2401,7 @@ extern void set_dma_reserve(unsigned long 
> >>> new_dma_reserve);
> >>>  extern void memmap_init_range(unsigned long, int, unsigned long,
> >>>   unsigned long, unsigned long, enum meminit_context,
> >>>   struct vmem_altmap *, int migratetype);
> >>> +extern void memmap_init_zone(struct zone *zone);
> >>>  extern void setup_per_zone_wmarks(void);
> >>>  extern int __meminit init_per_zone_wmark_min(void);
> >>>  extern void mem_init(void);
> >>>
> >>
> >> This patch is on top of your other series, no?
> >>
> >> In -next, we have
> >>
> >> extern void memmap_init_zone(unsigned long, int, unsigned long, ...
> >>
> >> In that file, so something is wrong.
> > 
> > Right, this one is based on the memmap_init_xx clean up patchset. I
> > mentioned this in the sub-thread of the kernel test robot reporting issues.
> > 
> 
> I think it would make things easier to move that fix to the front and
> resend the whole (5 patches) series.

OK, it's fine by me, I will resend the series with this one added in. I also
need to polish the log of this patch. Thanks for looking into this.



Re: [PATCH] mm: fix prototype warning from kernel test robot

2021-01-22 Thread Baoquan He
On 01/22/21 at 09:40am, David Hildenbrand wrote:
> On 22.01.21 08:03, Baoquan He wrote:
> > Kernel test robot calling make with 'W=1' triggers the warning below
> > for the memmap_init_zone() function.
> > 
> > mm/page_alloc.c:6259:23: warning: no previous prototype for 
> > 'memmap_init_zone' [-Wmissing-prototypes]
> >  6259 | void __meminit __weak memmap_init_zone(unsigned long size, int nid,
> >   |   ^~~~
> > 
> > Fix it by adding the function declaration in include/linux/mm.h.
> > Since memmap_init_zone() has a generic version with '__weak',
> > the declaration in the ia64 header file can be simply removed.
> > 
> > Signed-off-by: Baoquan He 
> > Reported-by: kernel test robot 
> > ---
> >  arch/ia64/include/asm/pgtable.h | 5 -
> >  include/linux/mm.h  | 1 +
> >  2 files changed, 1 insertion(+), 5 deletions(-)
> > 
> > diff --git a/arch/ia64/include/asm/pgtable.h 
> > b/arch/ia64/include/asm/pgtable.h
> > index 2c81394a2430..9b4efe89e62d 100644
> > --- a/arch/ia64/include/asm/pgtable.h
> > +++ b/arch/ia64/include/asm/pgtable.h
> > @@ -517,11 +517,6 @@ extern struct page *zero_page_memmap_ptr;
> > __changed;  \
> >  })
> >  #endif
> > -
> > -#  ifdef CONFIG_VIRTUAL_MEM_MAP
> > -  /* arch mem_map init routine is needed due to holes in a virtual mem_map 
> > */
> > -extern void memmap_init_zone(struct zone *zone);
> > -#  endif /* CONFIG_VIRTUAL_MEM_MAP */
> >  # endif /* !__ASSEMBLY__ */
> >  
> >  /*
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 56bb239f9150..073049bd0b29 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2401,6 +2401,7 @@ extern void set_dma_reserve(unsigned long 
> > new_dma_reserve);
> >  extern void memmap_init_range(unsigned long, int, unsigned long,
> > unsigned long, unsigned long, enum meminit_context,
> > struct vmem_altmap *, int migratetype);
> > +extern void memmap_init_zone(struct zone *zone);
> >  extern void setup_per_zone_wmarks(void);
> >  extern int __meminit init_per_zone_wmark_min(void);
> >  extern void mem_init(void);
> > 
> 
> This patch is on top of your other series, no?
> 
> In -next, we have
> 
> extern void memmap_init_zone(unsigned long, int, unsigned long, ...
> 
> In that file, so something is wrong.

Right, this one is based on the memmap_init_xx clean up patchset. I
mentioned this in the sub-thread of the kernel test robot reporting issues.



Re: [PATCH v4 1/4] mm: rename memmap_init() and memmap_init_zone()

2021-01-21 Thread Baoquan He
On 01/20/21 at 11:47pm, kernel test robot wrote:
> Hi Baoquan,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on linux/master]
> [also build test WARNING on linus/master v5.11-rc4 next-20210120]
> [cannot apply to mmotm/master hnaz-linux-mm/master ia64/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:
> https://github.com/0day-ci/linux/commits/Baoquan-He/mm-clean-up-names-and-parameters-of-memmap_init_-functions/20210120-135239
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> 1e2a199f6ccdc15cf111d68d212e2fd4ce65682e
> config: mips-randconfig-r036-20210120 (attached as .config)
> compiler: mips-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://github.com/0day-ci/linux/commit/1bbb0b35dd2fae4a7a38098e63899677c2e53108
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review 
> Baoquan-He/mm-clean-up-names-and-parameters-of-memmap_init_-functions/20210120-135239
> git checkout 1bbb0b35dd2fae4a7a38098e63899677c2e53108
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
> ARCH=mips 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> All warnings (new ones prefixed by >>):
> 
>mm/page_alloc.c:3597:15: warning: no previous prototype for 
> 'should_fail_alloc_page' [-Wmissing-prototypes]
> 3597 | noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int 
> order)
>  |   ^~
> >> mm/page_alloc.c:6258:23: warning: no previous prototype for 
> >> 'memmap_init_zone' [-Wmissing-prototypes]
> 6258 | void __meminit __weak memmap_init_zone(unsigned long size, int nid,
>  |   ^~~~

I have posted a patch to fix this warning, as below. The patch is based on
this patchset.

https://lore.kernel.org/linux-mm/20210122070359.24010-1-...@redhat.com/

Thanks
Baoquan



[PATCH] mm: fix prototype warning from kernel test robot

2021-01-21 Thread Baoquan He
Kernel test robot calling make with 'W=1' triggers the warning below
for the memmap_init_zone() function.

mm/page_alloc.c:6259:23: warning: no previous prototype for 'memmap_init_zone' 
[-Wmissing-prototypes]
 6259 | void __meminit __weak memmap_init_zone(unsigned long size, int nid,
  |   ^~~~

Fix it by adding the function declaration in include/linux/mm.h.
Since memmap_init_zone() has a generic version with '__weak',
the declaration in the ia64 header file can be simply removed.

Signed-off-by: Baoquan He 
Reported-by: kernel test robot 
---
 arch/ia64/include/asm/pgtable.h | 5 -
 include/linux/mm.h  | 1 +
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 2c81394a2430..9b4efe89e62d 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -517,11 +517,6 @@ extern struct page *zero_page_memmap_ptr;
__changed;  \
 })
 #endif
-
-#  ifdef CONFIG_VIRTUAL_MEM_MAP
-  /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init_zone(struct zone *zone);
-#  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 56bb239f9150..073049bd0b29 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,6 +2401,7 @@ extern void set_dma_reserve(unsigned long 
new_dma_reserve);
 extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
+extern void memmap_init_zone(struct zone *zone);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
-- 
2.17.2



Re: [PATCH v4 1/4] mm: rename memmap_init() and memmap_init_zone()

2021-01-21 Thread Baoquan He
On 01/21/21 at 10:25am, Mike Rapoport wrote:
> On Thu, Jan 21, 2021 at 04:17:27PM +0800, Baoquan He wrote:
> > On 01/20/21 at 11:47pm, kernel test robot wrote:
> > > Hi Baoquan,
> > > 
> > > I love your patch! Perhaps something to improve:
> > > 
> > > [auto build test WARNING on linux/master]
> > > [also build test WARNING on linus/master v5.11-rc4 next-20210120]
> > > [cannot apply to mmotm/master hnaz-linux-mm/master ia64/next]
> > > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > > And when submitting patch, we suggest to use '--base' as documented in
> > > https://git-scm.com/docs/git-format-patch]
> > > 
> > > url:
> > > https://github.com/0day-ci/linux/commits/Baoquan-He/mm-clean-up-names-and-parameters-of-memmap_init_-functions/20210120-135239
> > > base:   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> > > 1e2a199f6ccdc15cf111d68d212e2fd4ce65682e
> > > config: mips-randconfig-r036-20210120 (attached as .config)
> > > compiler: mips-linux-gcc (GCC) 9.3.0
> > > reproduce (this is a W=1 build):
> > > wget 
> > > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross 
> > > -O ~/bin/make.cross
> > > chmod +x ~/bin/make.cross
> > >     # 
> > > https://github.com/0day-ci/linux/commit/1bbb0b35dd2fae4a7a38098e63899677c2e53108
> > > git remote add linux-review https://github.com/0day-ci/linux
> > > git fetch --no-tags linux-review 
> > > Baoquan-He/mm-clean-up-names-and-parameters-of-memmap_init_-functions/20210120-135239
> > > git checkout 1bbb0b35dd2fae4a7a38098e63899677c2e53108
> > > # save the attached .config to linux build tree
> > > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
> > > ARCH=mips 
> > > 
> > > If you fix the issue, kindly add following tag as appropriate
> > > Reported-by: kernel test robot 
> > > 
> > > All warnings (new ones prefixed by >>):
> > > 
> > >mm/page_alloc.c:3597:15: warning: no previous prototype for 
> > > 'should_fail_alloc_page' [-Wmissing-prototypes]
> > > 3597 | noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned 
> > > int order)
> > >  |   ^~
> > > >> mm/page_alloc.c:6258:23: warning: no previous prototype for 
> > > >> 'memmap_init_zone' [-Wmissing-prototypes]
> > > 6258 | void __meminit __weak memmap_init_zone(unsigned long size, int 
> > > nid,
> > 
> > This is not introduced by this patch but is an existing issue, and should
> > not be related to this patchset. I will investigate and see what we
> > should do with memmap_init_zone(): add static, add it to a header
> > file, or just leave it as is, like should_fail_alloc_page().
> > 
> > 
> > By the way, I tried to reproduce this on a Fedora 32 system of x86 arch, but
> > met the below issue. Could you help check what I can do to fix the error?
> > 
> > 
> > [root@dell-per710-01 linux]# COMPILER_INSTALL_PATH=~/0day 
> > COMPILER=gcc-9.3.0 ~/bin/make.cross ARCH=mips
> > Compiler will be installed in /root/0day
> > make W=1 CONFIG_OF_ALL_DTBS=y CONFIG_DTC=y 
> > CROSS_COMPILE=/root/0day/gcc-9.3.0-nolibc/mips-linux/bin/mips-linux- 
> > --jobs=16 ARCH=mips
> >   HOSTCXX scripts/gcc-plugins/latent_entropy_plugin.so
> >   HOSTCXX scripts/gcc-plugins/structleak_plugin.so
> >   HOSTCXX scripts/gcc-plugins/randomize_layout_plugin.so
> > In file included from 
> > /root/0day/gcc-9.3.0-nolibc/mips-linux/bin/../lib/gcc/mips-linux/9.3.0/plugin/include/gcc-plugin.h:28,
> >  from scripts/gcc-plugins/gcc-common.h:7,
> >  from scripts/gcc-plugins/latent_entropy_plugin.c:78:
> > /root/0day/gcc-9.3.0-nolibc/mips-linux/bin/../lib/gcc/mips-linux/9.3.0/plugin/include/system.h:687:10:
> >  fatal error: gmp.h: No such file or directory
> >   687 | #include <gmp.h>
> >   |  ^~~
> > compilation terminated.
> > make[2]: *** [scripts/gcc-plugins/Makefile:47: 
> > scripts/gcc-plugins/latent_entropy_plugin.so] Error 1
> > make[2]: *** Waiting for unfinished jobs..
> 
> Do you have gmp-devel installed?

Ah, I didn't, thanks. Then libmpc-devel is needed. Will continue.



Re: [PATCH v4 1/4] mm: rename memmap_init() and memmap_init_zone()

2021-01-21 Thread Baoquan He
On 01/20/21 at 11:47pm, kernel test robot wrote:
> Hi Baoquan,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on linux/master]
> [also build test WARNING on linus/master v5.11-rc4 next-20210120]
> [cannot apply to mmotm/master hnaz-linux-mm/master ia64/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:
> https://github.com/0day-ci/linux/commits/Baoquan-He/mm-clean-up-names-and-parameters-of-memmap_init_-functions/20210120-135239
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> 1e2a199f6ccdc15cf111d68d212e2fd4ce65682e
> config: mips-randconfig-r036-20210120 (attached as .config)
> compiler: mips-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://github.com/0day-ci/linux/commit/1bbb0b35dd2fae4a7a38098e63899677c2e53108
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review 
> Baoquan-He/mm-clean-up-names-and-parameters-of-memmap_init_-functions/20210120-135239
> git checkout 1bbb0b35dd2fae4a7a38098e63899677c2e53108
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
> ARCH=mips 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> All warnings (new ones prefixed by >>):
> 
>mm/page_alloc.c:3597:15: warning: no previous prototype for 
> 'should_fail_alloc_page' [-Wmissing-prototypes]
> 3597 | noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int 
> order)
>  |   ^~
> >> mm/page_alloc.c:6258:23: warning: no previous prototype for 
> >> 'memmap_init_zone' [-Wmissing-prototypes]
> 6258 | void __meminit __weak memmap_init_zone(unsigned long size, int nid,

This is not introduced by this patch but is an existing issue, and should
not be related to this patchset. I will investigate and see what we
should do with memmap_init_zone(): add static, add it to a header
file, or just leave it as is, like should_fail_alloc_page().


By the way, I tried to reproduce this on a Fedora 32 system of x86 arch, but
met the below issue. Could you help check what I can do to fix the error?


[root@dell-per710-01 linux]# COMPILER_INSTALL_PATH=~/0day COMPILER=gcc-9.3.0 
~/bin/make.cross ARCH=mips
Compiler will be installed in /root/0day
make W=1 CONFIG_OF_ALL_DTBS=y CONFIG_DTC=y 
CROSS_COMPILE=/root/0day/gcc-9.3.0-nolibc/mips-linux/bin/mips-linux- --jobs=16 
ARCH=mips
  HOSTCXX scripts/gcc-plugins/latent_entropy_plugin.so
  HOSTCXX scripts/gcc-plugins/structleak_plugin.so
  HOSTCXX scripts/gcc-plugins/randomize_layout_plugin.so
In file included from 
/root/0day/gcc-9.3.0-nolibc/mips-linux/bin/../lib/gcc/mips-linux/9.3.0/plugin/include/gcc-plugin.h:28,
 from scripts/gcc-plugins/gcc-common.h:7,
 from scripts/gcc-plugins/latent_entropy_plugin.c:78:
/root/0day/gcc-9.3.0-nolibc/mips-linux/bin/../lib/gcc/mips-linux/9.3.0/plugin/include/system.h:687:10:
 fatal error: gmp.h: No such file or directory
  687 | #include <gmp.h>
  |  ^~~
compilation terminated.
make[2]: *** [scripts/gcc-plugins/Makefile:47: 
scripts/gcc-plugins/latent_entropy_plugin.so] Error 1
make[2]: *** Waiting for unfinished jobs..

Thanks
Baoquan

>  |   ^~~~
> 
> 
> vim +/memmap_init_zone +6258 mm/page_alloc.c
> 
>   6257
> > 6258void __meminit __weak memmap_init_zone(unsigned long size, int 
> > nid,
>   6259  unsigned long zone,
>   6260  unsigned long range_start_pfn)
>   6261{
>   6262unsigned long start_pfn, end_pfn;
>   6263unsigned long range_end_pfn = range_start_pfn + size;
>   6264int i;
>   6265
>   6266for_each_mem_pfn_range(i, nid, _pfn, _pfn, 
> NULL) {
>   6267start_pfn = clamp(start_pfn, range_start_pfn, 
> range_end_pfn);
>   6268end_pfn = clamp(end_pfn, range_start_pfn, 
> range_end_pfn);
>   6269
>   6270if (end_pfn > start_pfn) {
>   6271size = end_pfn - start_pfn;
>   6272memmap_init_range(size, nid, zone, 
> start_pfn, range_end_pfn,
>   6273

Re: [PATCH] vmalloc: remove redundant NULL check

2021-01-21 Thread Baoquan He
On 01/21/21 at 04:12pm, Yang Li wrote:
> Fix below warnings reported by coccicheck:
> ./fs/proc/vmcore.c:1503:2-7: WARNING: NULL check before some freeing
> functions is not needed.
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  fs/proc/vmcore.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index c3a345c..9a15334 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -1503,11 +1503,8 @@ int vmcore_add_device_dump(struct vmcoredd_data *data)
>   return 0;
>  
>  out_err:
> - if (buf)
> - vfree(buf);
> -
> - if (dump)
> - vfree(dump);
> +     vfree(buf);
> + vfree(dump);

Looks good, thx.

Acked-by: Baoquan He 

Thanks
Baoquan
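
For context on why the check is redundant: vfree() and kfree(), like C's
free(), are defined to do nothing when passed a NULL pointer, so an error path
can free unconditionally. A trivial stand-alone analogue in plain C (not
kernel code):

    #include <stdlib.h>

    int main(void)
    {
    	char *buf = NULL;

    	/* free(NULL) is guaranteed to be a no-op by the C standard; the
    	 * kernel's kfree()/vfree() follow the same convention, so guards
    	 * like "if (buf) vfree(buf);" add nothing.                        */
    	free(buf);
    	return 0;
    }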



[PATCH v4 1/4] mm: rename memmap_init() and memmap_init_zone()

2021-01-19 Thread Baoquan He
The current memmap_init_zone() only handles a memory region inside one zone,
while memmap_init() actually does the memmap init of one whole zone. So rename
both of them accordingly.

Signed-off-by: Baoquan He 
---
 arch/ia64/include/asm/pgtable.h | 2 +-
 arch/ia64/mm/init.c | 6 +++---
 include/linux/mm.h  | 2 +-
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 8 
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 779b6972aa84..dce2ff37df65 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -520,7 +520,7 @@ extern struct page *zero_page_memmap_ptr;
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init (unsigned long size, int nid, unsigned long zone,
+extern void memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
 unsigned long start_pfn);
 #  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e76386a3479e..c8e68e92beb3 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
/ sizeof(struct page));
 
if (map_start < map_end)
-   memmap_init_zone((unsigned long)(map_end - map_start),
+   memmap_init_range((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
 void __meminit
-memmap_init (unsigned long size, int nid, unsigned long zone,
+memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3dac7bc667ee..56bb239f9150 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2398,7 +2398,7 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long,
+extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..ddcb1cd24c60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
+   memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 44ec5594798d..42a1d2d2a87d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6117,7 +6117,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
  * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
+void __meminit memmap_init_range(unsigned long size, int nid, unsigned long 
zone,
unsigned long start_pfn, unsigned long zone_end_pfn,
enum meminit_context context,
struct vmem_altmap *altmap, int migratetype)
@@ -6254,7 +6254,7 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-void __meminit __weak memmap_init(unsigned long size, int nid,
+void __meminit __weak memmap_init_zone(unsigned long size, int nid,
  unsigned long zone,
  unsigned long range_start_pfn)
 {
@@ -6268,7 +6268,7 @@ void __meminit __weak memmap_init(unsigned long size, int 
nid,
 
if (end_pfn > start_pfn) {
size = end_pfn - start_pfn;
-   memmap_init_zone(size, nid, zone, start_pfn, 
range_end_pfn,
+   memmap_init_range(size, nid, zone, start_pfn, 
range_end_pfn,
 MEMINIT_EARLY, NULL, MIG

[PATCH v4 3/4] mm: simplify parameter of setup_usemap()

2021-01-19 Thread Baoquan He
Parameter 'zone' already carries all the needed information, so remove the
other, now unnecessary parameters.

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
---
 mm/page_alloc.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cbb67d9c1b2a..69cf19baac12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6761,25 +6761,22 @@ static unsigned long __init usemap_size(unsigned long 
zone_start_pfn, unsigned l
return usemapsize / 8;
 }
 
-static void __ref setup_usemap(struct pglist_data *pgdat,
-   struct zone *zone,
-   unsigned long zone_start_pfn,
-   unsigned long zonesize)
+static void __ref setup_usemap(struct zone *zone)
 {
-   unsigned long usemapsize = usemap_size(zone_start_pfn, zonesize);
+   unsigned long usemapsize = usemap_size(zone->zone_start_pfn,
+  zone->spanned_pages);
zone->pageblock_flags = NULL;
if (usemapsize) {
zone->pageblock_flags =
memblock_alloc_node(usemapsize, SMP_CACHE_BYTES,
-   pgdat->node_id);
+   zone_to_nid(zone));
if (!zone->pageblock_flags)
panic("Failed to allocate %ld bytes for zone %s 
pageblock flags on node %d\n",
- usemapsize, zone->name, pgdat->node_id);
+ usemapsize, zone->name, zone_to_nid(zone));
}
 }
 #else
-static inline void setup_usemap(struct pglist_data *pgdat, struct zone *zone,
-   unsigned long zone_start_pfn, unsigned long 
zonesize) {}
+static inline void setup_usemap(struct zone *zone) {}
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -6974,7 +6971,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
continue;
 
set_pageblock_order();
-   setup_usemap(pgdat, zone, zone_start_pfn, size);
+   setup_usemap(zone);
init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init_zone(zone);
}
-- 
2.17.2



[PATCH v4 4/4] mm: remove unneeded local variable in free_area_init_core

2021-01-19 Thread Baoquan He
Local variable 'zone_start_pfn' is not needed since there's only
one call site in free_area_init_core(). Let's remove it and pass
zone->zone_start_pfn directly to init_currently_empty_zone().

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69cf19baac12..e0df67948ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6923,7 +6923,6 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, freesize, memmap_pages;
-   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
freesize = zone->present_pages;
@@ -6972,7 +6971,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
 
set_pageblock_order();
setup_usemap(zone);
-   init_currently_empty_zone(zone, zone_start_pfn, size);
+   init_currently_empty_zone(zone, zone->zone_start_pfn, size);
memmap_init_zone(zone);
}
 }
-- 
2.17.2



[PATCH v4 2/4] mm: simplify parameter of function memmap_init_zone()

2021-01-19 Thread Baoquan He
As David suggested, simply passing 'struct zone *zone' is enough. We can
get all needed information from 'struct zone*' easily.

Suggested-by: David Hildenbrand 
Signed-off-by: Baoquan He 
---
 arch/ia64/include/asm/pgtable.h |  3 +--
 arch/ia64/mm/init.c | 12 +++-
 mm/page_alloc.c | 24 +++-
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index dce2ff37df65..2c81394a2430 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -520,8 +520,7 @@ extern struct page *zero_page_memmap_ptr;
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
-unsigned long start_pfn);
+extern void memmap_init_zone(struct zone *zone);
 #  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
 
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index c8e68e92beb3..88fb44895408 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -541,12 +541,14 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
return 0;
 }
 
-void __meminit
-memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-unsigned long start_pfn)
+void __meminit memmap_init_zone(struct zone *zone)
 {
+   int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long start_pfn = zone->zone_start_pfn;
+   unsigned long size = zone->spanned_pages;
+
if (!vmem_map) {
-   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + 
size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
@@ -556,7 +558,7 @@ memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
args.start = start;
args.end = start + size;
args.nid = nid;
-   args.zone = zone;
+   args.zone = zone_id;
 
efi_memmap_walk(virtual_memmap_init, &args);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 42a1d2d2a87d..cbb67d9c1b2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6254,23 +6254,21 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-void __meminit __weak memmap_init_zone(unsigned long size, int nid,
- unsigned long zone,
- unsigned long range_start_pfn)
+void __meminit __weak memmap_init_zone(struct zone *zone)
 {
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
+   unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
+   int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
unsigned long start_pfn, end_pfn;
-   unsigned long range_end_pfn = range_start_pfn + size;
-   int i;
 
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
-   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
-   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+   start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
+   end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
 
-   if (end_pfn > start_pfn) {
-   size = end_pfn - start_pfn;
-   memmap_init_range(size, nid, zone, start_pfn, 
range_end_pfn,
-MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
-   }
+   if (end_pfn > start_pfn)
+   memmap_init_range(end_pfn - start_pfn, nid,
+   zone_id, start_pfn, zone_end_pfn,
+   MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
 }
 
@@ -6978,7 +6976,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
set_pageblock_order();
setup_usemap(pgdat, zone, zone_start_pfn, size);
init_currently_empty_zone(zone, zone_start_pfn, size);
-   memmap_init_zone(size, nid, j, zone_start_pfn);
+   memmap_init_zone(zone);
}
 }
 
-- 
2.17.2



[PATCH v4 0/4] mm: clean up names and parameters of memmap_init_xxxx functions

2021-01-19 Thread Baoquan He
This patchset corrects the inappropriate function names of
memmap_init_xxx, and simplifies parameters of functions in the code flow,
which I noticed when I tried to fix a regression bug in memmap defer init.

This is based on the latest next/master.

v3 can be found here:
https://lore.kernel.org/linux-mm/20210105074708.18483-1-...@redhat.com/

v3->v4:
 - Rebased patch 1, 2 on top of Mike's below new patch.
   [PATCH v3 0/2] mm: fix initialization of struct page for holes in memory layout
 - Move the code of renaming function parameter 'range_start_pfn' and local
   variable 'range_end_pfn' of memmap_init() from patch 1 to patch 2
   according to David's comment.

 - Use the reverse Christmas tree style to reorder the local variables
   in memmap_init_zone() in patch 2 according to David's comment.

Baoquan He (4):
  mm: rename memmap_init() and memmap_init_zone()
  mm: simplify parameter of function memmap_init_zone()
  mm: simplify parameter of setup_usemap()
  mm: remove unneeded local variable in free_area_init_core

 arch/ia64/include/asm/pgtable.h |  3 +--
 arch/ia64/mm/init.c | 14 +-
 include/linux/mm.h  |  2 +-
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 46 ++---
 5 files changed, 31 insertions(+), 36 deletions(-)

-- 
2.17.2



Re: [PATCH 0/2] x86/setup: consolidate early memory reservations

2021-01-15 Thread Baoquan He
On 01/15/21 at 07:42pm, Baoquan He wrote:
> On 01/15/21 at 10:32am, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Hi,
> > 
> > David noticed that we do some of memblock_reserve() calls after allocations
> > are possible:
> > 
> > https://lore.kernel.org/lkml/6ba6bde3-1520-5cd0-f987-32d543f0b...@redhat.com
> 
> Thanks for CC-ing me. I think the above patch from Roman is dangerous.
> KASLR does put the kernel randomly in a place, but we do a brutal parse to
> get the SRAT table so that we know where the hotpluggable area is during the
> boot decompression stage. In the kernel, at the beginning, we don't know that
> before ACPI init. Roman's patch is wrong if I don't miss something.

Sorry, I was wrong. Bottom-up searching disregarding the kernel end is a
good optimization. Please ignore this noise.

> 
> I will add a comment in that thread.
> 
> Thanks
> Baoquan
> 
> > 
> > For now there is no actual problem because in top-down mode we allocate
> > from the end of the memory and in bottom-up mode we allocate above the
> > kernel image. But there is a patch in the mm tree that allow bottom-up
> > allocations below the kernel:
> > 
> > https://lore.kernel.org/lkml/20201217201214.3414100-2-g...@fb.com
> > 
> > and with this change we may get a memory corruption if an allocation steps
> > on some of the firmware areas that are yet to be reserved.
> > 
> > The below patches consolidate early memory reservations done during
> > setup_arch() so that memory used by firmware, bootloader, kernel text/data
> > and the memory that should be excluded from the available memory for
> > whatever other reason is reserved before memblock allocations are possible.
> > 
> > The patches are vs v5.11-rc3-mmots-2021-01-12-02-00 as I think they are
> > prerequisite for the memblock bottom-up changes, but if needed I can rebase
> > then on another tree.
> > 
> > Mike Rapoport (2):
> >   x86/setup: consolidate early memory reservations
> >   x86/setup: merge several reservations of start of the memory
> > 
> >  arch/x86/kernel/setup.c | 85 +
> >  1 file changed, 43 insertions(+), 42 deletions(-)
> > 
> > -- 
> > 2.28.0
> > 



Re: [PATCH 0/2] x86/setup: consolidate early memory reservations

2021-01-15 Thread Baoquan He
On 01/15/21 at 10:32am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Hi,
> 
> David noticed that we do some of memblock_reserve() calls after allocations
> are possible:
> 
> https://lore.kernel.org/lkml/6ba6bde3-1520-5cd0-f987-32d543f0b...@redhat.com

Thanks for CC-ing me. I think the above patch from Roman is dangerous.
KASLR does put the kernel randomly in a place, but we do a brutal parse to
get the SRAT table so that we know where the hotpluggable area is during the
boot decompression stage. In the kernel, at the beginning, we don't know that
before ACPI init. Roman's patch is wrong if I don't miss something.

I will add a comment in that thread.

Thanks
Baoquan

> 
> For now there is no actual problem because in top-down mode we allocate
> from the end of the memory and in bottom-up mode we allocate above the
> kernel image. But there is a patch in the mm tree that allow bottom-up
> allocations below the kernel:
> 
> https://lore.kernel.org/lkml/20201217201214.3414100-2-g...@fb.com
> 
> and with this change we may get a memory corruption if an allocation steps
> on some of the firmware areas that are yet to be reserved.
> 
> The below patches consolidate early memory reservations done during
> setup_arch() so that memory used by firmware, bootloader, kernel text/data
> and the memory that should be excluded from the available memory for
> whatever other reason is reserved before memblock allocations are possible.
> 
> The patches are vs v5.11-rc3-mmots-2021-01-12-02-00 as I think they are
> prerequisite for the memblock bottom-up changes, but if needed I can rebase
> then on another tree.
> 
> Mike Rapoport (2):
>   x86/setup: consolidate early memory reservations
>   x86/setup: merge several reservations of start of the memory
> 
>  arch/x86/kernel/setup.c | 85 +
>  1 file changed, 43 insertions(+), 42 deletions(-)
> 
> -- 
> 2.28.0
> 
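
To make the ordering hazard discussed above concrete, here is a small
stand-alone toy in plain C. It is not memblock code, just a mock bottom-up
allocator: if the "firmware" range is reserved only after the first
allocation, that allocation lands inside the range that should have been off
limits.

    #include <stdio.h>

    #define MEM_SIZE 64

    static int busy[MEM_SIZE];	/* 1 = not available to the allocator */

    /* Bottom-up: return the lowest free run of 'size' cells, or -1. */
    static int toy_alloc(int size)
    {
    	for (int base = 0; base + size <= MEM_SIZE; base++) {
    		int ok = 1;

    		for (int i = 0; i < size; i++)
    			if (busy[base + i])
    				ok = 0;
    		if (ok) {
    			for (int i = 0; i < size; i++)
    				busy[base + i] = 1;
    			return base;
    		}
    	}
    	return -1;
    }

    static void toy_reserve(int base, int size)
    {
    	for (int i = 0; i < size; i++)
    		busy[base + i] = 1;
    }

    int main(void)
    {
    	/* Wrong order: allocate first, then reserve the "firmware" cells
    	 * 0..15.  The allocation lands at 0, inside the firmware range.
    	 * Swapping the two calls makes the allocation land at 16.         */
    	int a = toy_alloc(8);

    	toy_reserve(0, 16);
    	printf("allocation at %d, firmware owns 0-15 -> %s\n",
    	       a, a < 16 ? "possible corruption" : "ok");
    	return 0;
    }

The patches above aim for exactly that ordering: perform every
memblock_reserve() of firmware, bootloader and kernel ranges before any
memblock allocation can happen.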



Re: [PATCH v3 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2021-01-11 Thread Baoquan He
On 01/11/21 at 10:16am, gre...@linuxfoundation.org wrote:
> On Fri, Jan 08, 2021 at 06:22:24PM +0800, Baoquan He wrote:
> > On 01/08/21 at 10:07am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > > Hi Baoquan,
> > > 
> > > -Original Message-
> > > > On 09/30/20 at 12:23pm, Alexander Egorenkov wrote:
> > > > > The offset of the field 'init_uts_ns.name' has changed
> > > > > since commit 9a56493f6942 ("uts: Use generic ns_common::count").
> > > > 
> > > > This patch is merged into 5.11-rc1, but we met the makedumpfile failure
> > > > of kdump test case in 5.10.0 kernel. Should affect 5.9 too since
> > > > commit 9a56493f6942 is merged into 5.9-rc2.
> > > 
> > > Hmm, commit 9a56493f6942 should have been merged into 5.11-rc1
> > > together with commit ca4a9241cc5e.
> > 
> > Checked on master branch of mainline kernel, commit 9a56493f6942 is in
> > 5.9-rc1.
> 
> 
> No, that commit is in 5.11-rc1, not 5.9-rc1:
>   $ git describe --contains 9a56493f6942
>   v5.11-rc1~182^2~9

Oh, then I was wrong about it. I added the linux-next repo in my linux kernel
folder, so I get a linux-next tag with the above command on the master branch
of the mainline kernel:

[bhe@~ linux]$ git describe --contains 9a56493f6942
next-20200820~107^2~8

I just used 'git log --oneline' and found that commit 9a56493f6942 was
added before 5.9-rc1. It seems this is not the right way to determine the
kernel release. So please ignore the backporting request for the stable tree.
Sorry about the confusion.

Thanks
Baoquan

> 
> > commit ca4a9241cc5e is merged into 5.11-rc1.
> > 
> > commit 9a56493f6942c0e2df1579986128721da96e00d8
> > Author: Kirill Tkhai 
> > Date:   Mon Aug 3 13:16:21 2020 +0300
> > 
> > uts: Use generic ns_common::count
> > 
> > 
> > commit ca4a9241cc5e718de86a34afd41972869546a5e3
> > Author: Alexander Egorenkov 
> > Date:   Tue Dec 15 20:45:31 2020 -0800
> > 
> > kdump: append uts_namespace.name offset to VMCOREINFO
> 
> 
> Are you all sure this is needed in 5.10.y?
> 
> thanks,
> 
> greg k-h
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec



Re: [PATCH v3 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2021-01-08 Thread Baoquan He
On 01/08/21 at 10:07am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> Hi Baoquan,
> 
> -Original Message-
> > On 09/30/20 at 12:23pm, Alexander Egorenkov wrote:
> > > The offset of the field 'init_uts_ns.name' has changed
> > > since commit 9a56493f6942 ("uts: Use generic ns_common::count").
> > 
> > This patch is merged into 5.11-rc1, but we met the makedumpfile failure
> > of kdump test case in 5.10.0 kernel. Should affect 5.9 too since
> > commit 9a56493f6942 is merged into 5.9-rc2.
> 
> Hmm, commit 9a56493f6942 should have been merged into 5.11-rc1
> together with commit ca4a9241cc5e.

Checked on the master branch of the mainline kernel, commit 9a56493f6942 is in
5.9-rc1. Commit ca4a9241cc5e is merged into 5.11-rc1.

commit 9a56493f6942c0e2df1579986128721da96e00d8
Author: Kirill Tkhai 
Date:   Mon Aug 3 13:16:21 2020 +0300

uts: Use generic ns_common::count


commit ca4a9241cc5e718de86a34afd41972869546a5e3
Author: Alexander Egorenkov 
Date:   Tue Dec 15 20:45:31 2020 -0800

kdump: append uts_namespace.name offset to VMCOREINFO


> 
> Does your makedumpfile have the following patch?
> https://github.com/makedumpfile/makedumpfile/commit/54aec3878b3f91341e6bc735eda158cca5c54ec9

We met this issue on the 5.10 kernel; the latest makedumpfile 1.6.8+ fixes
it. Makedumpfile 1.6.8+ includes commit 54aec3878b3f. Not sure if I
got the kernel commit right for their corresponding release.

Thanks
Baoquan



Re: [PATCH v3 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2021-01-08 Thread Baoquan He
On 01/08/21 at 09:12am, Greg KH wrote:
> On Fri, Jan 08, 2021 at 11:32:48AM +0800, Baoquan He wrote:
> > On 09/30/20 at 12:23pm, Alexander Egorenkov wrote:
> > > The offset of the field 'init_uts_ns.name' has changed
> > > since commit 9a56493f6942 ("uts: Use generic ns_common::count").
> > 
> > This patch is merged into 5.11-rc1, but we met the makedumpfile failure
> > of kdump test case in 5.10.0 kernel. Should affect 5.9 too since
> > commit 9a56493f6942 is merged into 5.9-rc2.
> > 
> > Below tag and CC should have been added into patch when posted. 
> > 
> > Fixes: commit 9a56493f6942 ("uts: Use generic ns_common::count")
> > Cc: 
> > 
> > Hi Greg,
> > 
> > Do we still have chance to make it added into stable?
> 
> Sure, what is the git commit id of this patch in Linus's tree?

This commit:

ca4a9241cc5e kdump: append uts_namespace.name offset to VMCOREINFO

> 
> In the future, please read:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.

Sure, will do as the doc said in the future, thanks a lot for the
pointer.

Thanks
Baoquan



Re: [PATCH v3 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2021-01-07 Thread Baoquan He
On 09/30/20 at 12:23pm, Alexander Egorenkov wrote:
> The offset of the field 'init_uts_ns.name' has changed
> since commit 9a56493f6942 ("uts: Use generic ns_common::count").

This patch is merged into 5.11-rc1, but we met the makedumpfile failure
of the kdump test case in the 5.10.0 kernel. It should affect 5.9 too since
commit 9a56493f6942 is merged into 5.9-rc2.

The below tag and CC should have been added into the patch when it was posted.

Fixes: commit 9a56493f6942 ("uts: Use generic ns_common::count")
Cc: 

Hi Greg,

Do we still have a chance to get it added into stable?

Thanks
Baoquan

> 
> Link: 
> https://lore.kernel.org/r/159644978167.604812.1773586504374412107.stgit@localhost.localdomain
> 
> Make the offset of the field 'uts_namespace.name' available
> in VMCOREINFO because tools like 'crash-utility' and
> 'makedumpfile' must be able to read it from crash dumps.
> 
> Signed-off-by: Alexander Egorenkov 
> ---
> 
> v2 -> v3:
>  * Added documentation to vmcoreinfo.rst
>  * Use the short form of the commit reference
> 
> v1 -> v2:
>  * Improved commit message
>  * Added link to the discussion of the uts namespace changes
> 
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 6 ++
>  kernel/crash_core.c| 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst 
> b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index e44a6c01f336..3861a25faae1 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -39,6 +39,12 @@ call.
>  User-space tools can get the kernel name, host name, kernel release
>  number, kernel version, architecture name and OS type from it.
>  
> +(uts_namespace, name)
> +-
> +
> +Offset of the name's member. Crash Utility and Makedumpfile get
> +the start address of the init_uts_ns.name from this.
> +
>  node_online_map
>  ---
>  
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..173fdc261882 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -447,6 +447,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_PAGESIZE(PAGE_SIZE);
>  
>   VMCOREINFO_SYMBOL(init_uts_ns);
> + VMCOREINFO_OFFSET(uts_namespace, name);
>   VMCOREINFO_SYMBOL(node_online_map);
>  #ifdef CONFIG_MMU
>   VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> -- 
> 2.26.2
> 
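
For background on how the exported offset is consumed, here is a small
userspace sketch built around offsetof(). The structures below only mimic the
idea of the kernel layout and are illustrative, not the real
uts_namespace/new_utsname definitions:

    #include <stddef.h>
    #include <stdio.h>

    struct new_utsname_like {
    	char sysname[65];
    	char nodename[65];
    	char release[65];
    };

    struct uts_namespace_like {
    	long count;
    	struct new_utsname_like name;
    };

    int main(void)
    {
    	/* The patch makes the kernel emit a line roughly like
    	 *     OFFSET(uts_namespace.name)=<value>
    	 * into VMCOREINFO.  A dump tool then reads the member at
    	 *     SYMBOL(init_uts_ns) + OFFSET(uts_namespace.name)
    	 * instead of hard-coding a structure layout that may change
    	 * between kernel releases, as it did with commit 9a56493f6942. */
    	printf("name offset: %zu\n",
    	       offsetof(struct uts_namespace_like, name));
    	return 0;
    }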



Re: [PATCH v3 2/4] mm: simplify parameter of function memmap_init_zone()

2021-01-07 Thread Baoquan He
On 01/05/21 at 05:53pm, David Hildenbrand wrote:
> [...]
> 
> > -void __meminit
> > -memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > -unsigned long start_pfn)
> > +void __meminit memmap_init_zone(struct zone *zone)
> >  {
> > +   unsigned long size = zone->spanned_pages;
> > +   int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
> > +   unsigned long start_pfn = zone->zone_start_pfn;
> > +
> 
> Nit: reverse Christmas tree.

Ah, yes, I will reorder these lines.

> 
> > if (!vmem_map) {
> > -   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
> > +   memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + 
> > size,
> >  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> > } else {
> > struct page *start;
> > @@ -556,7 +558,7 @@ memmap_init_zone(unsigned long size, int nid, unsigned 
> > long zone,
> > args.start = start;
> > args.end = start + size;
> > args.nid = nid;
> > -   args.zone = zone;
> > +   args.zone = zone_id;
> >  
> > efi_memmap_walk(virtual_memmap_init, );
> > }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 69ebf75be91c..b2a46ffdaf0b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6292,16 +6292,16 @@ static inline u64 init_unavailable_range(unsigned 
> > long spfn, unsigned long epfn,
> >  }
> >  #endif
> >  
> > -void __init __weak memmap_init_zone(unsigned long size, int nid,
> > -  unsigned long zone,
> > -  unsigned long zone_start_pfn)
> > +void __init __weak memmap_init_zone(struct zone *zone)
> >  {
> > unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
> > -   unsigned long zone_end_pfn = zone_start_pfn + size;
> > +   int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
> > +   unsigned long zone_start_pfn = zone->zone_start_pfn;
> > +   unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
> 
> dito.

OK.

> 
> > u64 pgcnt = 0;
> > -   int i;
> >  
> > for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +   unsigned long size;
> 
> You can just get rid of this parameter IMHO.

Some people may like an intermediate local variable better in this case,
but I am fine with both; will change as you suggested.

> 
> (Also, there is an empty line missing right now)

Sure. Thanks.

> 
> 
> Apart from that LGTM
> 
> -- 
> Thanks,
> 
> David / dhildenb
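
For anyone unfamiliar with the "reverse Christmas tree" nit above: it simply
asks that local variable declarations in a block be ordered from the longest
line down to the shortest. A minimal illustration (plain C, not the kernel
function being reviewed):

    #include <stdio.h>

    int main(void)
    {
    	unsigned long zone_start_pfn = 0;	/* longest declaration first */
    	unsigned long zone_end_pfn = 16;
    	unsigned long size;
    	int nid = 0;				/* shortest declaration last */

    	size = zone_end_pfn - zone_start_pfn;
    	printf("nid %d spans %lu pages\n", nid, size);
    	return 0;
    }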



Re: [PATCH v3 1/4] mm: rename memmap_init() and memmap_init_zone()

2021-01-07 Thread Baoquan He
On 01/05/21 at 05:49pm, David Hildenbrand wrote:
> On 05.01.21 08:47, Baoquan He wrote:
> > The current memmap_init_zone() only handles memory region inside one zone,
> > actually memmap_init() does the memmap init of one zone. So rename both of
> > them accordingly.
> > 
> > And also rename the function parameter 'range_start_pfn' and local variable
> > 'range_end_pfn' of memmap_init() to zone_start_pfn/zone_end_pfn.
> > 
> > Signed-off-by: Baoquan He 
> > Reviewed-by: Mike Rapoport 
> > ---
> >  arch/ia64/include/asm/pgtable.h |  2 +-
> >  arch/ia64/mm/init.c |  6 +++---
> >  include/linux/mm.h  |  2 +-
> >  mm/memory_hotplug.c |  2 +-
> >  mm/page_alloc.c | 24 
> >  5 files changed, 18 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/ia64/include/asm/pgtable.h 
> > b/arch/ia64/include/asm/pgtable.h
> > index 779b6972aa84..dce2ff37df65 100644
> > --- a/arch/ia64/include/asm/pgtable.h
> > +++ b/arch/ia64/include/asm/pgtable.h
> > @@ -520,7 +520,7 @@ extern struct page *zero_page_memmap_ptr;
> >  
> >  #  ifdef CONFIG_VIRTUAL_MEM_MAP
> >/* arch mem_map init routine is needed due to holes in a virtual mem_map 
> > */
> > -extern void memmap_init (unsigned long size, int nid, unsigned long 
> > zone,
> > +extern void memmap_init_zone(unsigned long size, int nid, unsigned 
> > long zone,
> >  unsigned long start_pfn);
> >  #  endif /* CONFIG_VIRTUAL_MEM_MAP */
> >  # endif /* !__ASSEMBLY__ */
> > diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> > index e76386a3479e..c8e68e92beb3 100644
> > --- a/arch/ia64/mm/init.c
> > +++ b/arch/ia64/mm/init.c
> > @@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
> > / sizeof(struct page));
> >  
> > if (map_start < map_end)
> > -   memmap_init_zone((unsigned long)(map_end - map_start),
> > +   memmap_init_range((unsigned long)(map_end - map_start),
> >  args->nid, args->zone, page_to_pfn(map_start), 
> > page_to_pfn(map_end),
> >  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> > return 0;
> >  }
> >  
> >  void __meminit
> > -memmap_init (unsigned long size, int nid, unsigned long zone,
> > +memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  unsigned long start_pfn)
> >  {
> > if (!vmem_map) {
> > -   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
> > +   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
> >  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> > } else {
> > struct page *start;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 6b3de3c09cd5..26c01f5a028b 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2439,7 +2439,7 @@ extern int __meminit early_pfn_to_nid(unsigned long 
> > pfn);
> >  #endif
> >  
> >  extern void set_dma_reserve(unsigned long new_dma_reserve);
> > -extern void memmap_init_zone(unsigned long, int, unsigned long,
> > +extern void memmap_init_range(unsigned long, int, unsigned long,
> > unsigned long, unsigned long, enum meminit_context,
> > struct vmem_altmap *, int migratetype);
> >  extern void setup_per_zone_wmarks(void);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index f9d57b9be8c7..ddcb1cd24c60 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
> > unsigned long start_pfn,
> >  * expects the zone spans the pfn range. All the pages in the range
> >  * are reserved so nobody should be touching them so we should be safe
> >  */
> > -   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
> > +   memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0,
> >  MEMINIT_HOTPLUG, altmap, migratetype);
> >  
> > set_zone_contiguous(zone);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3ea9d5cd6058..69ebf75be91c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6117,7 +6117,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
> > *pfn)
> >   * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
> >   * zone stats (e.g., nr_isolate_

Re: [PATCH] mm/memcontrol: fix warning in mem_cgroup_page_lruvec()

2021-01-06 Thread Baoquan He
On 01/06/21 at 11:35am, Andrew Morton wrote:
> On Wed, 6 Jan 2021 14:49:35 +0800 Baoquan He  wrote:
> 
> > > Fixes: 9a1ac2288cf1 ("mm/memcontrol:rewrite mem_cgroup_page_lruvec()")
> > 
> > ...
> >
> > Thanks for fixing this. We also encountered this issue in kdump kernel
> > with the mainline 5.10 kernel since 'cgroup_disable=memory' is added.
> 
> Wait.  9a1ac2288cf1 isn't present in 5.10?
> 

Yes, just checked: commit 9a1ac2288cf1 was merged in 5.11-rc1, not in
5.10.0. It seems Red Hat CKI doesn't report the kernel release correctly;
it calls all 5.11-rcX kernels 5.10.0. Sorry for the confusion, I will
send mail to them to change this.

I got the failure report from Red Hat's CKI test; the kernel repo is as
below, but the subject of the failure report and 'uname -r' said it was
5.10.0.

   Kernel repo: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Commit: 58cf05f597b0 - Merge tag 'sound-fix-5.11-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound



Re: [PATCH] mm/memcontrol: fix warning in mem_cgroup_page_lruvec()

2021-01-05 Thread Baoquan He
On 01/03/21 at 09:03pm, Hugh Dickins wrote:
> Boot a CONFIG_MEMCG=y kernel with "cgroup_disabled=memory" and you are
> met by a series of warnings from the VM_WARN_ON_ONCE_PAGE(!memcg, page)
> recently added to the inline mem_cgroup_page_lruvec().
> 
> An earlier attempt to place that warning, in mem_cgroup_lruvec(), had
> been careful to do so after weeding out the mem_cgroup_disabled() case;
> but was itself invalid because of the mem_cgroup_lruvec(NULL, pgdat) in
> clear_pgdat_congested() and age_active_anon().
> 
> Warning in mem_cgroup_page_lruvec() was once useful in detecting a KSM
> charge bug, so may be worth keeping: but skip if mem_cgroup_disabled().
> 
> Fixes: 9a1ac2288cf1 ("mm/memcontrol:rewrite mem_cgroup_page_lruvec()")
> Signed-off-by: Hugh Dickins 
> ---
> 
>  include/linux/memcontrol.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- 5.11-rc2/include/linux/memcontrol.h   2020-12-27 20:39:36.751923135 
> -0800
> +++ linux/include/linux/memcontrol.h  2021-01-03 19:38:24.822978559 -0800
> @@ -665,7 +665,7 @@ static inline struct lruvec *mem_cgroup_
>  {
>   struct mem_cgroup *memcg = page_memcg(page);
>  
> - VM_WARN_ON_ONCE_PAGE(!memcg, page);
> + VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
>   return mem_cgroup_lruvec(memcg, pgdat);

Thanks for fixing this. We also encountered this issue in the kdump kernel
with the mainline 5.10 kernel, since 'cgroup_disable=memory' is added there.

Reviewed-by: Baoquan He 



[PATCH v3 1/4] mm: rename memmap_init() and memmap_init_zone()

2021-01-04 Thread Baoquan He
The current memmap_init_zone() only handles a memory region inside one zone,
while it is actually memmap_init() that does the memmap init of one whole
zone. So rename both of them accordingly.

And also rename the function parameter 'range_start_pfn' and local variable
'range_end_pfn' of memmap_init() to zone_start_pfn/zone_end_pfn.

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
---
 arch/ia64/include/asm/pgtable.h |  2 +-
 arch/ia64/mm/init.c |  6 +++---
 include/linux/mm.h  |  2 +-
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 24 
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 779b6972aa84..dce2ff37df65 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -520,7 +520,7 @@ extern struct page *zero_page_memmap_ptr;
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init (unsigned long size, int nid, unsigned long zone,
+extern void memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
 unsigned long start_pfn);
 #  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e76386a3479e..c8e68e92beb3 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
/ sizeof(struct page));
 
if (map_start < map_end)
-   memmap_init_zone((unsigned long)(map_end - map_start),
+   memmap_init_range((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
 void __meminit
-memmap_init (unsigned long size, int nid, unsigned long zone,
+memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b3de3c09cd5..26c01f5a028b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2439,7 +2439,7 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long,
+extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..ddcb1cd24c60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
+   memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3ea9d5cd6058..69ebf75be91c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6117,7 +6117,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
  * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
+void __meminit memmap_init_range(unsigned long size, int nid, unsigned long 
zone,
unsigned long start_pfn, unsigned long zone_end_pfn,
enum meminit_context context,
struct vmem_altmap *altmap, int migratetype)
@@ -6292,24 +6292,24 @@ static inline u64 init_unavailable_range(unsigned long 
spfn, unsigned long epfn,
 }
 #endif
 
-void __init __weak memmap_init(unsigned long size, int nid,
+void __init __weak memmap_init_zone(unsigned long size, int nid,
   unsigned long zone,
-  unsigned long range_start_pfn)
+  unsigned long zone_start_pfn)
 {
unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
-   unsigned long range_end_pfn = range_start_pfn + size;
+   unsigned long zone_end_pfn = zone_start_p

[PATCH v3 4/4] mm: remove unneeded local variable in free_area_init_core

2021-01-04 Thread Baoquan He
Local variable 'zone_start_pfn' is not needed since there's only
one call site in free_area_init_core(). Let's remove it and pass
zone->zone_start_pfn directly to init_currently_empty_zone().

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0ce6fb6373b..9cacb8652239 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6986,7 +6986,6 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, freesize, memmap_pages;
-   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
freesize = zone->present_pages;
@@ -7035,7 +7034,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
 
set_pageblock_order();
setup_usemap(zone);
-   init_currently_empty_zone(zone, zone_start_pfn, size);
+   init_currently_empty_zone(zone, zone->zone_start_pfn, size);
memmap_init_zone(zone);
}
 }
-- 
2.17.2



[PATCH v3 2/4] mm: simplify parameter of function memmap_init_zone()

2021-01-04 Thread Baoquan He
As David suggested, simply passing 'struct zone *zone' is enough. We can
get all needed information from 'struct zone*' easily.

Suggested-by: David Hildenbrand 
Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
---
 arch/ia64/include/asm/pgtable.h |  3 +--
 arch/ia64/mm/init.c | 12 +++-
 mm/page_alloc.c | 20 ++--
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index dce2ff37df65..2c81394a2430 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -520,8 +520,7 @@ extern struct page *zero_page_memmap_ptr;
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
-unsigned long start_pfn);
+extern void memmap_init_zone(struct zone *zone);
 #  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
 
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index c8e68e92beb3..ccbda1a74c95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -541,12 +541,14 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
return 0;
 }
 
-void __meminit
-memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-unsigned long start_pfn)
+void __meminit memmap_init_zone(struct zone *zone)
 {
+   unsigned long size = zone->spanned_pages;
+   int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long start_pfn = zone->zone_start_pfn;
+
if (!vmem_map) {
-   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + 
size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
@@ -556,7 +558,7 @@ memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
args.start = start;
args.end = start + size;
args.nid = nid;
-   args.zone = zone;
+   args.zone = zone_id;
 
efi_memmap_walk(virtual_memmap_init, &args);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69ebf75be91c..b2a46ffdaf0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6292,16 +6292,16 @@ static inline u64 init_unavailable_range(unsigned long 
spfn, unsigned long epfn,
 }
 #endif
 
-void __init __weak memmap_init_zone(unsigned long size, int nid,
-  unsigned long zone,
-  unsigned long zone_start_pfn)
+void __init __weak memmap_init_zone(struct zone *zone)
 {
unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
-   unsigned long zone_end_pfn = zone_start_pfn + size;
+   int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
+   unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
u64 pgcnt = 0;
-   int i;
 
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+   unsigned long size;
start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
hole_start_pfn = clamp(hole_start_pfn, zone_start_pfn,
@@ -6309,13 +6309,13 @@ void __init __weak memmap_init_zone(unsigned long size, 
int nid,
 
if (end_pfn > start_pfn) {
size = end_pfn - start_pfn;
-   memmap_init_range(size, nid, zone, start_pfn, 
zone_end_pfn,
+   memmap_init_range(size, nid, zone_id, start_pfn, 
zone_end_pfn,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
 
if (hole_start_pfn < start_pfn)
pgcnt += init_unavailable_range(hole_start_pfn,
-   start_pfn, zone, nid);
+   start_pfn, zone_id, 
nid);
hole_start_pfn = end_pfn;
}
 
@@ -6328,11 +6328,11 @@ void __init __weak memmap_init_zone(unsigned long size, 
int nid,
 */
if (hole_start_pfn < zone_end_pfn)
pgcnt += init_unavailable_range(hole_start_pfn, zone_end_pfn,
-   zone, nid);
+   zone_id, nid);
 
if (pgcnt)
pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
-   zone_names[zone], pgcnt);
+   zone_names[zone_id], pgcnt);
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -7039,7 +7039,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)

[PATCH v3 3/4] mm: simplify parameter of setup_usemap()

2021-01-04 Thread Baoquan He
Parameter 'zone' already carries the needed information, so let's remove
the other, unnecessary parameters.

Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
---
 mm/page_alloc.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2a46ffdaf0b..e0ce6fb6373b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6824,25 +6824,22 @@ static unsigned long __init usemap_size(unsigned long 
zone_start_pfn, unsigned l
return usemapsize / 8;
 }
 
-static void __ref setup_usemap(struct pglist_data *pgdat,
-   struct zone *zone,
-   unsigned long zone_start_pfn,
-   unsigned long zonesize)
+static void __ref setup_usemap(struct zone *zone)
 {
-   unsigned long usemapsize = usemap_size(zone_start_pfn, zonesize);
+   unsigned long usemapsize = usemap_size(zone->zone_start_pfn,
+  zone->spanned_pages);
zone->pageblock_flags = NULL;
if (usemapsize) {
zone->pageblock_flags =
memblock_alloc_node(usemapsize, SMP_CACHE_BYTES,
-   pgdat->node_id);
+   zone_to_nid(zone));
if (!zone->pageblock_flags)
panic("Failed to allocate %ld bytes for zone %s 
pageblock flags on node %d\n",
- usemapsize, zone->name, pgdat->node_id);
+ usemapsize, zone->name, zone_to_nid(zone));
}
 }
 #else
-static inline void setup_usemap(struct pglist_data *pgdat, struct zone *zone,
-   unsigned long zone_start_pfn, unsigned long 
zonesize) {}
+static inline void setup_usemap(struct zone *zone) {}
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -7037,7 +7034,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
continue;
 
set_pageblock_order();
-   setup_usemap(pgdat, zone, zone_start_pfn, size);
+   setup_usemap(zone);
init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init_zone(zone);
}
-- 
2.17.2



[PATCH v3 0/4] mm: clean up names and parameters of memmap_init_xxxx functions

2021-01-04 Thread Baoquan He
This patchset corrects the inappropriate names of the memmap_init_xxx
functions and simplifies the parameters of functions in that code flow,
which I noticed when trying to fix a regression bug in memmap defer init.
These are taken from the v2 patchset; the bug-fixing patch has been sent
alone and merged, so the rest are sent as v3.

No change compared with v2, except for adding Mike's 'Reviewed-by' tag.

V2 post is here:
https://lore.kernel.org/linux-mm/20201220082754.6900-1-...@redhat.com/

Baoquan He (4):
  mm: rename memmap_init() and memmap_init_zone()
  mm: simplify parameter of function memmap_init_zone()
  mm: simplify parameter of setup_usemap()
  mm: remove unneeded local variable in free_area_init_core

 arch/ia64/include/asm/pgtable.h |  3 +-
 arch/ia64/mm/init.c | 14 +
 include/linux/mm.h  |  2 +-
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 54 +++--
 5 files changed, 36 insertions(+), 39 deletions(-)

-- 
2.17.2



Re: [PATCH v2 0/5] Fix the incorrect memmap defer init handling and do some cleanup

2020-12-23 Thread Baoquan He
On 12/23/20 at 10:05am, Baoquan He wrote:
> On 12/22/20 at 05:46pm, Andrew Morton wrote:
> > On Sun, 20 Dec 2020 16:27:49 +0800 Baoquan He  wrote:
> > 
> > > VMware reported the performance regression during memmap_init() 
> > > invocation.
> > > And they bisected to commit 73a6e474cb376 ("mm: memmap_init: iterate over
> > > memblock regions rather that check each PFN") causing it.
> > > 
> > > https://lore.kernel.org/linux-mm/dm6pr05mb52921ff90fa01cc337dd23a1a4...@dm6pr05mb5292.namprd05.prod.outlook.com/
> > > 
> > > After investigation, it's caused by incorrect memmap init defer handling
> > > in memmap_init_zone() after commit 73a6e474cb376. The current
> > > memmap_init_zone() only handle one memory region of one zone, while
> > > memmap_init() iterates over all its memory regions and pass them one by
> > > one into memmap_init_zone() to handle.
> > > 
> > > So in this patchset, patch 1/5 fixes the bug observed by VMware. Patch
> > > 2~5/5 clean up codes.
> > > accordingly.
> > 
> > This series doesn't apply well to current mainline (plus, perhaps,
> > material which I sent to Linus today).
> > 
> > So please check all that against mainline in a day or so, refresh,
> > retest and resend.
> > 
> > Please separate the fix for the performance regression (1/5) into a
> > single standalone patch, ready for -stable backporting.  And then a
> > separate 4-patch series with the cleanups for a 5.11 merge.

I have sent 1/5 as a standalone patch, and will send the remaining 4 patches
as a patchset once patch 1/5 is merged into linux-next. Thanks, Andrew.

> 
> Sure, doing now. 
> 
> By the way, when sending patches to linux-mm ML, which branch should I
> rebase them on? I usually take your akpm/master as base, thought this
> will make your patch picking easier. Seems my understanding is not true,
> akpm/master is changed very soon, we should always base patch on linus's
> master branch, whether patch is sending to linux-mm or not, right?



[PATCH v3 1/1] mm: memmap defer init doesn't work as expected

2020-12-23 Thread Baoquan He
VMware observed a performance regression during memmap init on their platform,
and bisected it to commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN").

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do the memmap init of one whole zone:
it fully initializes all low zones of one NUMA node, but defers the memmap
init of the last zone in that node. However, since commit 73a6e474cb376,
memmap_init() is adapted to iterate over the memblock regions inside one zone
and then calls memmap_init_zone() to do the memmap init for each region.

E.g., on VMware's system, the memory layout is as below; there are two memory
regions in node 2. The current code will mistakenly initialize the whole 1st
region [mem 0xab-0xfc], then apply the memmap defer so that only one
memory section of the 2nd region [mem 0x100-0x1033fff] is initialized.
In fact, we expect only one memory section's memmap to be initialized in this
node. That's why much more time is spent here.

[0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
[0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0xbfff]
[0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x55]
[0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x56-0xaa]
[0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab-0xfc]
[0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x100-0x1033fff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn, so that defer_init() can use it to judge
whether deferral should be applied zone-wide.
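
Below is a simplified sketch of the idea (not the exact kernel code; a few
details are elided). With the zone-wide end pfn available, defer_init() can
tell whether it is looking at the node's last zone and defer everything past
the first memory section of it, no matter how many memblock regions that
zone is split into:

static bool defer_init(int nid, unsigned long pfn, unsigned long zone_end_pfn)
{
	static unsigned long prev_zone_end_pfn, nr_initialised;

	if (!zone_end_pfn)	/* no zone-wide end known, never defer */
		return false;

	/* Entered a new zone: restart the per-zone counter. */
	if (prev_zone_end_pfn != zone_end_pfn) {
		prev_zone_end_pfn = zone_end_pfn;
		nr_initialised = 0;
	}

	/* Always fully initialise zones below the node's last zone. */
	if (zone_end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
		return false;

	if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX)
		return true;

	/* After one section of the last zone, defer the rest of the node. */
	if (++nr_initialised > PAGES_PER_SECTION &&
	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
		NODE_DATA(nid)->first_deferred_pfn = pfn;
		return true;
	}
	return false;
}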

Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions 
rather that check each PFN")
Reported-by: Rahul Gopakumar 
Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport 
Cc: sta...@vger.kernel.org
---
 arch/ia64/mm/init.c | 4 ++--
 include/linux/mm.h  | 5 +++--
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 8 +---
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 9b5acf8fb092..e76386a3479e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
 
if (map_start < map_end)
memmap_init_zone((unsigned long)(map_end - map_start),
-args->nid, args->zone, page_to_pfn(map_start),
+args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
@@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long 
zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn,
+   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5299b90a6c40..af0d3a8d77f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2432,8 +2432,9 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-   enum meminit_context, struct vmem_altmap *, int migratetype);
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+   unsigned long, unsigned long, enum meminit_context,
+   struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c01604224299..789fceb4f2d5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
+   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a2c89b21115..bdbec4c98173 10

[PATCH v3 0/1] mm: memmap defer init doesn't work as expected

2020-12-23 Thread Baoquan He
Post the regression fix as a standalone patch, as Andrew suggested, for
easier backporting to the -stable branch. This is rebased on the latest
master branch of the mainline kernel; there is almost no change compared
with v2.
https://lore.kernel.org/linux-mm/20201220082754.6900-1-...@redhat.com/

Tested on a system with 24G of RAM as below, adding 'memmap=128M!0x5'
to split the single RAM region into two regions in NUMA node 1, to simulate
VMware's scenario.

[  +0.00] BIOS-provided physical RAM map:
[  +0.00] BIOS-e820: [mem 0x-0x0009bfff] usable
[  +0.00] BIOS-e820: [mem 0x0009c000-0x0009] reserved
[  +0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[  +0.00] BIOS-e820: [mem 0x0010-0x6cdcefff] usable
[  +0.00] BIOS-e820: [mem 0x6cdcf000-0x6efcefff] reserved
[  +0.00] BIOS-e820: [mem 0x6efcf000-0x6fdfefff] ACPI NVS
[  +0.00] BIOS-e820: [mem 0x6fdff000-0x6fffefff] ACPI data
[  +0.00] BIOS-e820: [mem 0x6000-0x6fff] usable
[  +0.00] BIOS-e820: [mem 0x7000-0x8fff] reserved
[  +0.00] BIOS-e820: [mem 0xe000-0x] reserved
[  +0.00] BIOS-e820: [mem 0x0001-0x00067f1f] usable
[  +0.00] BIOS-e820: [mem 0x00067f20-0x00067fff] reserved

The test passed as below. As you can see, with the patch applied, memmap init
costs much less time on NUMA node 1:

Without the patch:
[0.065029] Early memory node ranges
[0.065030]   node   0: [mem 0x1000-0x0009bfff]
[0.065032]   node   0: [mem 0x0010-0x6cdcefff]
[0.065034]   node   0: [mem 0x6000-0x6fff]
[0.065036]   node   0: [mem 0x0001-0x00027fff]
[0.065038]   node   1: [mem 0x00028000-0x0004]
[0.065040]   node   1: [mem 0x00050800-0x00067f1f]
[0.065185] Zeroed struct page in unavailable ranges: 16533 pages
[0.065187] Initmem setup node 0 [mem 0x1000-0x00027fff]
[0.069616] Initmem setup node 1 [mem 0x00028000-0x00067f1f]
[0.096298] ACPI: PM-Timer IO Port: 0x408

With the patch applied:
[0.065029] Early memory node ranges
[0.065030]   node   0: [mem 0x1000-0x0009bfff]
[0.065032]   node   0: [mem 0x0010-0x6cdcefff]
[0.065034]   node   0: [mem 0x6000-0x6fff]
[0.065036]   node   0: [mem 0x0001-0x00027fff]
[0.065038]   node   1: [mem 0x00028000-0x0004]
[0.065041]   node   1: [mem 0x00050800-0x00067f1f]
[0.065187] Zeroed struct page in unavailable ranges: 16533 pages
[0.065189] Initmem setup node 0 [mem 0x1000-0x00027fff]
[0.069572] Initmem setup node 1 [mem 0x00028000-0x00067f1f]
[0.070161] ACPI: PM-Timer IO Port: 0x408


Baoquan He (1):
  mm: memmap defer init doesn't work as expected

 arch/ia64/mm/init.c | 4 ++--
 include/linux/mm.h  | 5 +++--
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 8 +---
 4 files changed, 11 insertions(+), 8 deletions(-)

-- 
2.17.2



Re: [PATCH v2 0/5] Fix the incorrect memmap defer init handling and do some cleanup

2020-12-22 Thread Baoquan He
On 12/22/20 at 05:46pm, Andrew Morton wrote:
> On Sun, 20 Dec 2020 16:27:49 +0800 Baoquan He  wrote:
> 
> > VMware reported the performance regression during memmap_init() invocation.
> > And they bisected to commit 73a6e474cb376 ("mm: memmap_init: iterate over
> > memblock regions rather that check each PFN") causing it.
> > 
> > https://lore.kernel.org/linux-mm/dm6pr05mb52921ff90fa01cc337dd23a1a4...@dm6pr05mb5292.namprd05.prod.outlook.com/
> > 
> > After investigation, it's caused by incorrect memmap init defer handling
> > in memmap_init_zone() after commit 73a6e474cb376. The current
> > memmap_init_zone() only handle one memory region of one zone, while
> > memmap_init() iterates over all its memory regions and pass them one by
> > one into memmap_init_zone() to handle.
> > 
> > So in this patchset, patch 1/5 fixes the bug observed by VMware. Patch
> > 2~5/5 clean up codes.
> > accordingly.
> 
> This series doesn't apply well to current mainline (plus, perhaps,
> material which I sent to Linus today).
> 
> So please check all that against mainline in a day or so, refresh,
> retest and resend.
> 
> Please separate the fix for the performance regression (1/5) into a
> single standalone patch, ready for -stable backporting.  And then a
> separate 4-patch series with the cleanups for a 5.11 merge.

Sure, doing now. 

By the way, when sending patches to the linux-mm ML, which branch should I
rebase them on? I usually take your akpm/master as the base, thinking this
would make your patch picking easier. It seems my understanding is not right:
akpm/master changes very quickly, so we should always base patches on Linus's
master branch, whether the patch is sent to linux-mm or not, right?

Thanks
Baoquan



Re: [RFC]: kexec: change to handle memory/cpu changes

2020-12-21 Thread Baoquan He
On 12/14/20 at 10:50am, Eric DeVolder wrote:
...
> The cell contents show the number of seconds it took for the system to
> process all of the 3840 memblocks. The value in parenthesis is the
> number of kdump unload-then-reload operations per second.
> 
>   1 480GB DIMM   480 1GB DIMMs
> ---+-++
>  RHEL7 | 181s (21.2 ops) | 389s (9.8 ops) |
> ---+-++
>  RHEL8 |  86s (44.7 ops) | 419s (9.2 ops) |
> ---+-++
> 
> The scenario of adding 480 1GiB virtual DIMMs takes more time given
> the larger number of round trips of QEMU -> kernel -> udev -> kernel ->
> QEMU, and are both roughly 400s.
> 
> The RHEL7 system process all 3840 memblocks individually and perform
> 3840 kdump unload-then-reload operations.
> 
> However, RHEL8 data in the best case scenario (1 480GiB DIMM) suggests
> that approximately 86/4= 21 kdump unload-then-reload operations
> happened, and in the worst case scenario (480 1GiB DIMMs), the data
> suggests that approximately 419/4 = 105 kdump unload-then-reload
> operations happened. For RHEL8, the final number of kdump
> unload-then-reload operations are 0.5% (21 of 3840) and 2.7% (105 of
> 3840), respectively, compared to that of the RHEL7 system.
> 
> The throttle approach is quite effective in reducing the number of
> kdump unload-then-reload operations. However, the kdump capture kernel
> is still reloaded multiple times, and each kdump capture kernel reload
> is a race window in which kdump can fail.
> 
> A quick peek at Ubuntu 20.04 LTS reveals it has 50-kdump-tools.rules
> that looks like:
> 
>   SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/usr/sbin/kdump-config 
> try-reload"
>   SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/usr/sbin/kdump-config 
> try-reload"
>   SUBSYSTEM=="cpu", ACTION=="add", PROGRAM="/usr/sbin/kdump-config try-reload"
>   SUBSYSTEM=="cpu", ACTION=="remove", PROGRAM="/usr/sbin/kdump-config 
> try-reload"
>   SUBSYSTEM=="cpu", ACTION=="offline", PROGRAM="/usr/sbin/kdump-config 
> try-reload"
> 
> which produces the equivalent behavior to RHEL7 whereby every event
> results in a kdump capture kernel reload.
> 
> Fedora 33 and CentOS 8-stream behave the same as RHEL8.
> 
> Perhaps a better solution is to rewrite the vmcoreinfo structure that
> contains the memory and CPU layout information, as those changes to
> memory and CPUs occur. Rewriting vmcoreinfo is an in-kernel activity
> and would certainly avoid the relatively large unload-then-reload
> times of the kdump capture kernel. The pointer to the vmcoreinfo
> structure is provided to the capture kernel via the elfcorehdr=
> parameter to the capture kernel cmdline. Rewriting the vmcoreinfo
> structure as well as rewriting the capture kernel cmdline parameter is
> needed to utilize this approach.

Great investigation and conclusion, and a very nice idea below. When I
read the first half of this mail, I thought maybe we could add a new
option to the kexec-tools utility to update the elfcorehdr only when
hotplug udev events are detected. Then, coming to this part, I would say
yes, doing it inside the kernel looks better. Special handling for hotplug
looks necessary, as you have said. I will check what we can do and come
back with some details; thanks for doing all this.
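
Just to make the in-kernel idea concrete, here is a rough, hypothetical
sketch. crash_update_elfcorehdr() is a made-up placeholder for the
"regenerate the ELF core headers and install them into the crash image"
step you listed; only the memory-notifier API around it is real, and CPU
hotplug would need a similar hook:

#include <linux/init.h>
#include <linux/memory.h>
#include <linux/notifier.h>

/* Hypothetical: rebuild the elfcorehdr from the updated memory map and
 * swap it into the loaded crash kimage. */
static void crash_update_elfcorehdr(void);

static int crash_memhp_notifier(struct notifier_block *nb,
				unsigned long action, void *data)
{
	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		crash_update_elfcorehdr();
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block crash_memhp_nb = {
	.notifier_call = crash_memhp_notifier,
};

static int __init crash_hotplug_init(void)
{
	return register_memory_notifier(&crash_memhp_nb);
}
late_initcall(crash_hotplug_init);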

Thanks
Baoquan

> 
> Based upon some amount of examining code, I think the challenges
> involved in updating the CPU and memory layout in-kernel are:
> 
>  - adding call-outs on the add_memory()/try_remove_memory() and
>cpu_up()/cpu_down() paths for notifying the kdump subsystem of
>memory and/or CPU changes.
> 
>  - updating the struct kimage with the memory or CPU changes
> 
>  - Rewriting the vmcoreinfo structure from the data contained
>in struct kimage, eg crash_prepare_elf64_headers()
> 
>  - Installing the updated vmcoreinfo struct via
>kimage_crash_copy_vmcoreinfo() and rewriting the kdump kernel
>cmdline in order to update parameter elfcorehdr= with the
>new address
> 
> As I am not overly familiar with all the code paths involved, yet, I'm
> sure the devil is in the details. However, due the kexec_file_load
> syscall, it appears most of the infrastructure is already in place,
> and we essentially need to tap into it again for memory and cpu
> changes.
> 
> It appears that this change could be applicable to both kexec_load and
> kexec_file_load, it has the potential to (eventually) simplify the
> userland kexec utility for kexec_load, and would eliminate the need
> for 98-kexec.rules and the associated churn.
> 
> Comments please!
> eric
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 



[PATCH v2 1/5] mm: memmap defer init doesn't work as expected

2020-12-20 Thread Baoquan He
VMware observed a performance regression during memmap init on their platform,
and bisected it to commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN").

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do the memmap init of one whole zone:
it fully initializes all low zones of one NUMA node, but defers the memmap
init of the last zone in that node. However, since commit 73a6e474cb376,
memmap_init() is adapted to iterate over the memblock regions inside one zone
and then calls memmap_init_zone() to do the memmap init for each region.

E.g., on VMware's system, the memory layout is as below; there are two memory
regions in node 2. The current code will mistakenly initialize the whole 1st
region [mem 0xab-0xfc], then apply the memmap defer so that only one
memory section of the 2nd region [mem 0x100-0x1033fff] is initialized.
In fact, we expect only one memory section's memmap to be initialized in this
node. That's why much more time is spent here.

[0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
[0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0xbfff]
[0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x55]
[0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x56-0xaa]
[0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab-0xfc]
[0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x100-0x1033fff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn, so that defer_init() can use it to judge
whether deferral should be applied zone-wide.
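
Roughly, the caller side then looks like the sketch below (simplified, not
the literal hunk): memmap_init() still walks the memblock regions
intersecting the zone, but it always hands the zone-wide end pfn down to
memmap_init_zone() instead of the end of the current region:

void __init memmap_init(unsigned long size, int nid, unsigned long zone,
			unsigned long zone_start_pfn)
{
	unsigned long start_pfn, end_pfn;
	unsigned long zone_end_pfn = zone_start_pfn + size;
	int i;

	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
		end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);

		if (end_pfn > start_pfn)
			/* pass the *zone* end so defer_init() judges zone-wide */
			memmap_init_zone(end_pfn - start_pfn, nid, zone,
					 start_pfn, zone_end_pfn,
					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
	}
}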

Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions 
rather that check each PFN")
Reported-by: Rahul Gopakumar 
Signed-off-by: Baoquan He 
Cc: sta...@vger.kernel.org
---
 arch/ia64/mm/init.c | 4 ++--
 include/linux/mm.h  | 5 +++--
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 8 +---
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 9b5acf8fb092..e76386a3479e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
 
if (map_start < map_end)
memmap_init_zone((unsigned long)(map_end - map_start),
-args->nid, args->zone, page_to_pfn(map_start),
+args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
@@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long 
zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn,
+   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e4e5be20b0c2..92e06ea053f4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2432,8 +2432,9 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-   enum meminit_context, struct vmem_altmap *, int migratetype);
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+   unsigned long, unsigned long, enum meminit_context,
+   struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index af41fb990820..f9d57b9be8c7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
+   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8cea0823b70e..32645f2e7b96 100644
--- a/mm/page_alloc.c
+++

[PATCH v2 5/5] mm: remove unneeded local variable in free_area_init_core

2020-12-20 Thread Baoquan He
Local variable 'zone_start_pfn' is not needed since there's only
one call site in free_area_init_core(). Let's remove it and pass
zone->zone_start_pfn directly to init_currently_empty_zone().

Signed-off-by: Baoquan He 
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f0a917ab858..189a86253c93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6986,7 +6986,6 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, freesize, memmap_pages;
-   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
freesize = zone->present_pages;
@@ -7035,7 +7034,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
 
set_pageblock_order();
setup_usemap(zone);
-   init_currently_empty_zone(zone, zone_start_pfn, size);
+   init_currently_empty_zone(zone, zone->zone_start_pfn, size);
memmap_init_zone(zone);
}
 }
-- 
2.17.2



[PATCH v2 3/5] mm: simplify parameter of function memmap_init_zone()

2020-12-20 Thread Baoquan He
As David suggested, simply passing 'struct zone *zone' is enough. We can
get all needed information from 'struct zone*' easily.

Suggested-by: David Hildenbrand 
Signed-off-by: Baoquan He 
---
 arch/ia64/include/asm/pgtable.h |  3 +--
 arch/ia64/mm/init.c | 12 +++-
 mm/page_alloc.c | 20 ++--
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index dce2ff37df65..2c81394a2430 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -520,8 +520,7 @@ extern struct page *zero_page_memmap_ptr;
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
-unsigned long start_pfn);
+extern void memmap_init_zone(struct zone *zone);
 #  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
 
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index c8e68e92beb3..ccbda1a74c95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -541,12 +541,14 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
return 0;
 }
 
-void __meminit
-memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-unsigned long start_pfn)
+void __meminit memmap_init_zone(struct zone *zone)
 {
+   unsigned long size = zone->spanned_pages;
+   int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long start_pfn = zone->zone_start_pfn;
+
if (!vmem_map) {
-   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + 
size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
@@ -556,7 +558,7 @@ memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
args.start = start;
args.end = start + size;
args.nid = nid;
-   args.zone = zone;
+   args.zone = zone_id;
 
efi_memmap_walk(virtual_memmap_init, &args);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b46326099d9..7a6626351ed7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6292,16 +6292,16 @@ static inline u64 init_unavailable_range(unsigned long 
spfn, unsigned long epfn,
 }
 #endif
 
-void __init __weak memmap_init_zone(unsigned long size, int nid,
-  unsigned long zone,
-  unsigned long zone_start_pfn)
+void __init __weak memmap_init_zone(struct zone *zone)
 {
unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
-   unsigned long zone_end_pfn = zone_start_pfn + size;
+   int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
+   unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
u64 pgcnt = 0;
-   int i;
 
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+   unsigned long size;
start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
hole_start_pfn = clamp(hole_start_pfn, zone_start_pfn,
@@ -6309,13 +6309,13 @@ void __init __weak memmap_init_zone(unsigned long size, 
int nid,
 
if (end_pfn > start_pfn) {
size = end_pfn - start_pfn;
-   memmap_init_range(size, nid, zone, start_pfn, 
zone_end_pfn,
+   memmap_init_range(size, nid, zone_id, start_pfn, 
zone_end_pfn,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
 
if (hole_start_pfn < start_pfn)
pgcnt += init_unavailable_range(hole_start_pfn,
-   start_pfn, zone, nid);
+   start_pfn, zone_id, 
nid);
hole_start_pfn = end_pfn;
}
 
@@ -6328,11 +6328,11 @@ void __init __weak memmap_init_zone(unsigned long size, 
int nid,
 */
if (hole_start_pfn < zone_end_pfn)
pgcnt += init_unavailable_range(hole_start_pfn, zone_end_pfn,
-   zone, nid);
+   zone_id, nid);
 
if (pgcnt)
pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
-   zone_names[zone], pgcnt);
+   zone_names[zone_id], pgcnt);
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -7039,7 +7039,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
set_pageblock_order();
   

[PATCH v2 4/5] mm: simplify parameter of setup_usemap()

2020-12-20 Thread Baoquan He
Parameter 'zone' already carries the needed information, so let's remove
the other, unnecessary parameters.

Signed-off-by: Baoquan He 
---
 mm/page_alloc.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a6626351ed7..7f0a917ab858 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6824,25 +6824,22 @@ static unsigned long __init usemap_size(unsigned long 
zone_start_pfn, unsigned l
return usemapsize / 8;
 }
 
-static void __ref setup_usemap(struct pglist_data *pgdat,
-   struct zone *zone,
-   unsigned long zone_start_pfn,
-   unsigned long zonesize)
+static void __ref setup_usemap(struct zone *zone)
 {
-   unsigned long usemapsize = usemap_size(zone_start_pfn, zonesize);
+   unsigned long usemapsize = usemap_size(zone->zone_start_pfn,
+  zone->spanned_pages);
zone->pageblock_flags = NULL;
if (usemapsize) {
zone->pageblock_flags =
memblock_alloc_node(usemapsize, SMP_CACHE_BYTES,
-   pgdat->node_id);
+   zone_to_nid(zone));
if (!zone->pageblock_flags)
panic("Failed to allocate %ld bytes for zone %s 
pageblock flags on node %d\n",
- usemapsize, zone->name, pgdat->node_id);
+ usemapsize, zone->name, zone_to_nid(zone));
}
 }
 #else
-static inline void setup_usemap(struct pglist_data *pgdat, struct zone *zone,
-   unsigned long zone_start_pfn, unsigned long 
zonesize) {}
+static inline void setup_usemap(struct zone *zone) {}
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -7037,7 +7034,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
continue;
 
set_pageblock_order();
-   setup_usemap(pgdat, zone, zone_start_pfn, size);
+   setup_usemap(zone);
init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init_zone(zone);
}
-- 
2.17.2



[PATCH v2 2/5] mm: rename memmap_init() and memmap_init_zone()

2020-12-20 Thread Baoquan He
The current memmap_init_zone() only handles a memory region inside one zone,
while it is actually memmap_init() that does the memmap init of one whole
zone. So rename both of them accordingly.

And also rename the function parameter 'range_start_pfn' and local variable
'range_end_pfn' of memmap_init() to zone_start_pfn/zone_end_pfn.

Signed-off-by: Baoquan He 
---
 arch/ia64/include/asm/pgtable.h |  2 +-
 arch/ia64/mm/init.c |  6 +++---
 include/linux/mm.h  |  2 +-
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 24 
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 779b6972aa84..dce2ff37df65 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -520,7 +520,7 @@ extern struct page *zero_page_memmap_ptr;
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */
-extern void memmap_init (unsigned long size, int nid, unsigned long zone,
+extern void memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
 unsigned long start_pfn);
 #  endif /* CONFIG_VIRTUAL_MEM_MAP */
 # endif /* !__ASSEMBLY__ */
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e76386a3479e..c8e68e92beb3 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
/ sizeof(struct page));
 
if (map_start < map_end)
-   memmap_init_zone((unsigned long)(map_end - map_start),
+   memmap_init_range((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
 void __meminit
-memmap_init (unsigned long size, int nid, unsigned long zone,
+memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 92e06ea053f4..f72c138c2272 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2432,7 +2432,7 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long,
+extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..ddcb1cd24c60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
+   memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32645f2e7b96..4b46326099d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6117,7 +6117,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
  * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
+void __meminit memmap_init_range(unsigned long size, int nid, unsigned long 
zone,
unsigned long start_pfn, unsigned long zone_end_pfn,
enum meminit_context context,
struct vmem_altmap *altmap, int migratetype)
@@ -6292,24 +6292,24 @@ static inline u64 init_unavailable_range(unsigned long 
spfn, unsigned long epfn,
 }
 #endif
 
-void __init __weak memmap_init(unsigned long size, int nid,
+void __init __weak memmap_init_zone(unsigned long size, int nid,
   unsigned long zone,
-  unsigned long range_start_pfn)
+  unsigned long zone_start_pfn)
 {
unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
-   unsigned long range_end_pfn = range_start_pfn + size;
+   unsigned long zone_end_pfn = zone_start_pfn + size;
u64 

[PATCH v2 0/5] Fix the incorrect memmap defer init handling and do some cleanup

2020-12-20 Thread Baoquan He
VMware reported a performance regression during memmap_init() invocation,
and bisected it to commit 73a6e474cb376 ("mm: memmap_init: iterate over
memblock regions rather that check each PFN").

https://lore.kernel.org/linux-mm/dm6pr05mb52921ff90fa01cc337dd23a1a4...@dm6pr05mb5292.namprd05.prod.outlook.com/

After investigation, it's caused by incorrect memmap init defer handling
in memmap_init_zone() after commit 73a6e474cb376. The current
memmap_init_zone() only handles one memory region of one zone, while
memmap_init() iterates over all its memory regions and passes them one by
one into memmap_init_zone() to handle.

So in this patchset, patch 1/5 fixes the bug observed by VMware. Patches
2~5/5 clean up the code accordingly.

VMware helped test patch 1 of the v1 version, which was based on the master
branch of Linus's tree, on their VMware ESXi platform; patch 1 is functionally
unchanged in v2. And I haven't got an ia64 machine to compile or test on, so I
would really appreciate it if anyone could help compile this patchset on one.
This patchset is based on the latest next/master and has only had basic
testing.

Baoquan He (5):
  mm: memmap defer init doesn't work as expected
  mm: rename memmap_init() and memmap_init_zone()
  mm: simplify parameter of function memmap_init_zone()
  mm: simplify parameter of setup_usemap()
  mm: remove unneeded local variable in free_area_init_core

 arch/ia64/include/asm/pgtable.h |  3 +-
 arch/ia64/mm/init.c | 16 +
 include/linux/mm.h  |  5 +--
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 60 -
 5 files changed, 43 insertions(+), 43 deletions(-)

-- 
2.17.2



Re: [PATCH 2/2] mm: rename memmap_init() and memmap_init_zone()

2020-12-15 Thread Baoquan He
On 12/14/20 at 01:04pm, Mike Rapoport wrote:
> On Mon, Dec 14, 2020 at 11:00:07AM +0100, David Hildenbrand wrote:
> > On 13.12.20 16:09, Baoquan He wrote:
> > > The current memmap_init_zone() only handles memory region inside one zone.
> > > Actually memmap_init() does the memmap init of one zone. So rename both of
> > > them accordingly.
> > > 
> > > And also rename the function parameter 'range_start_pfn' and local 
> > > variable
> > > 'range_end_pfn' to zone_start_pfn/zone_end_pfn.
> > > 
> > > Signed-off-by: Baoquan He 
> > > ---
..  

> > >   set_zone_contiguous(zone);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 315c22974f0d..fac599deba56 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -6050,7 +6050,7 @@ overlap_memmap_init(unsigned long zone, unsigned 
> > > long *pfn)
> > >   * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
> > >   * zone stats (e.g., nr_isolate_pageblock) are touched.
> > >   */
> > > -void __meminit memmap_init_zone(unsigned long size, int nid, unsigned 
> > > long zone,
> > > +void __meminit memmap_init_range(unsigned long size, int nid, unsigned 
> > > long zone,
> > >   unsigned long start_pfn, unsigned long zone_end_pfn,
> > >   enum meminit_context context,
> > >   struct vmem_altmap *altmap, int migratetype)
> > > @@ -6187,21 +6187,21 @@ static void __meminit zone_init_free_lists(struct 
> > > zone *zone)
> > >   }
> > >  }
> > >  
> > > -void __meminit __weak memmap_init(unsigned long size, int nid,
> > > +void __meminit __weak memmap_init_zone(unsigned long size, int nid,
> > > unsigned long zone,
> > > -   unsigned long range_start_pfn)
> > > +   unsigned long zone_start_pfn)
> > 
> > Why are we not simply passing "struct zone" like
> > 
> > void __meminit __weak  memmap_init_zone(struct zone *zone)
> > 
> > from which we can derive
> > - nid
> > - zone idx
> > - zone_start_pfn
> > - spanned_pages / zone_end_pfn
> > 
> > At least when called from free_area_init_core() this should work just
> > fine I think.
>  
> There is also a custom memmap init in ia64 which at least should be
> tested ;-)

Right. I tried it in arch/ia64/mm/init.c; the change is as below. It looks
simple, and if it compiles on ia64 it should be OK.


diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index af678197ac2d..4fa49a762d58 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -541,12 +541,14 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
return 0;
 }
 
-void __meminit
-memmap_init_zone (unsigned long size, int nid, unsigned long zone,
-unsigned long start_pfn)
+void __meminit memmap_init_zone (struct zone *zone)
 {
+   unsigned long size = zone->spanned_pages;
+   int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+   unsigned long start_pfn = zone->zone_start_pfn;
+
if (!vmem_map) {
-   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + 
size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
@@ -556,7 +558,7 @@ memmap_init_zone (unsigned long size, int nid, unsigned 
long zone,
args.start = start;
args.end = start + size;
args.nid = nid;
-   args.zone = zone;
+   args.zone = zone_id;
 
efi_memmap_walk(virtual_memmap_init, &args);
}
> 
> More broadly, while Baoquan's fix looks Ok to me, I think we can
> calculate node->first_deferred_pfn earlier in, say,
> free_area_init_node() rather than do defer_init() check for each pfn.

I did try to move the defer init up one level into memmap_init() when making
the draft patch in the first place. I finally ended up with this approach
because of overlap_memmap_init().

>  
> > >  {
> > >   unsigned long start_pfn, end_pfn;
> > > - unsigned long range_end_pfn = range_start_pfn + size;
> > > + unsigned long zone_end_pfn = zone_start_pfn + size;
> > >   int i;
> > >  
> > >   for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > - start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > - end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > + start_pfn = clamp(

Re: [PATCH 2/2] mm: rename memmap_init() and memmap_init_zone()

2020-12-14 Thread Baoquan He
On 12/14/20 at 11:00am, David Hildenbrand wrote:
> On 13.12.20 16:09, Baoquan He wrote:
> > The current memmap_init_zone() only handles memory region inside one zone.
> > Actually memmap_init() does the memmap init of one zone. So rename both of
> > them accordingly.
> > 
> > And also rename the function parameter 'range_start_pfn' and local variable
> > 'range_end_pfn' to zone_start_pfn/zone_end_pfn.
> > 
> > Signed-off-by: Baoquan He 
> > ---
> >  arch/ia64/mm/init.c |  6 +++---
> >  include/linux/mm.h  |  2 +-
> >  mm/memory_hotplug.c |  2 +-
> >  mm/page_alloc.c | 16 
> >  4 files changed, 13 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> > index 27ca549ff47e..af678197ac2d 100644
> > --- a/arch/ia64/mm/init.c
> > +++ b/arch/ia64/mm/init.c
> > @@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
> > / sizeof(struct page));
> >  
> > if (map_start < map_end)
> > -   memmap_init_zone((unsigned long)(map_end - map_start),
> > +   memmap_init_range((unsigned long)(map_end - map_start),
> >  args->nid, args->zone, page_to_pfn(map_start), 
> > page_to_pfn(map_end),
> >  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> > return 0;
> >  }
> >  
> >  void __meminit
> > -memmap_init (unsigned long size, int nid, unsigned long zone,
> > +memmap_init_zone (unsigned long size, int nid, unsigned long zone,
> >  unsigned long start_pfn)
> 
> While at it s/zone /zone/ please. :)

Yeah, when I git grepped 'memmap_init(', I didn't notice the one in ia64
and didn't adjust it, since I saw so many functions with a space between the
name and the parenthesis in arch/ia64/mm/. I will clean up this one anyway.

> 
> >  {
> > if (!vmem_map) {
> > -   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
> > +   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
> >  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> > } else {
> > struct page *start;
...

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 315c22974f0d..fac599deba56 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6050,7 +6050,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
> > *pfn)
> >   * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
> >   * zone stats (e.g., nr_isolate_pageblock) are touched.
> >   */
> > -void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long 
> > zone,
> > +void __meminit memmap_init_range(unsigned long size, int nid, unsigned 
> > long zone,
> > unsigned long start_pfn, unsigned long zone_end_pfn,
> > enum meminit_context context,
> > struct vmem_altmap *altmap, int migratetype)
> > @@ -6187,21 +6187,21 @@ static void __meminit zone_init_free_lists(struct 
> > zone *zone)
> > }
> >  }
> >  
> > -void __meminit __weak memmap_init(unsigned long size, int nid,
> > +void __meminit __weak memmap_init_zone(unsigned long size, int nid,
> >   unsigned long zone,
> > - unsigned long range_start_pfn)
> > + unsigned long zone_start_pfn)
> 
> Why are we not simply passing "struct zone" like
> 
> void __meminit __weak  memmap_init_zone(struct zone *zone)
> 
> from which we can derive
> - nid
> - zone idx
> - zone_start_pfn
> - spanned_pages / zone_end_pfn
> 
> At least when called from free_area_init_core() this should work just
> fine I think.

Yes, passing 'struct zone *zone' looks much better, I will append a patch to
do this. Thanks.
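
A minimal sketch of what the generic weak memmap_init_zone() could look like
once it takes a struct zone (an illustration of the idea only, not the actual
follow-up patch):

void __meminit __weak memmap_init_zone(struct zone *zone)
{
	unsigned long zone_start_pfn = zone->zone_start_pfn;
	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
	int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
	unsigned long start_pfn, end_pfn;

	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		/* Clamp each memblock region to the span of this zone. */
		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
		end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);

		if (end_pfn > start_pfn)
			memmap_init_range(end_pfn - start_pfn, nid, zone_id,
					  start_pfn, zone_end_pfn,
					  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
	}
}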

> 
> 
> 
> >  {
> > unsigned long start_pfn, end_pfn;
> > -   unsigned long range_end_pfn = range_start_pfn + size;
> > +   unsigned long zone_end_pfn = zone_start_pfn + size;
> > int i;
> >  
> > for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > -   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > -   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +   start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
> > +   end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
> >  
> > if (end_pfn > start_pfn) {
> > size = end_pfn - start_pfn;
> > -   memmap_ini

[PATCH 2/2] mm: rename memmap_init() and memmap_init_zone()

2020-12-13 Thread Baoquan He
The current memmap_init_zone() only handles memory region inside one zone.
Actually memmap_init() does the memmap init of one zone. So rename both of
them accordingly.

And also rename the function parameter 'range_start_pfn' and local variable
'range_end_pfn' to zone_start_pfn/zone_end_pfn.

Signed-off-by: Baoquan He 
---
 arch/ia64/mm/init.c |  6 +++---
 include/linux/mm.h  |  2 +-
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 16 
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 27ca549ff47e..af678197ac2d 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -535,18 +535,18 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
/ sizeof(struct page));
 
if (map_start < map_end)
-   memmap_init_zone((unsigned long)(map_end - map_start),
+   memmap_init_range((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
 void __meminit
-memmap_init (unsigned long size, int nid, unsigned long zone,
+memmap_init_zone (unsigned long size, int nid, unsigned long zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
+   memmap_init_range(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd5c313729ea..3d81ebbbef89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2439,7 +2439,7 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long,
+extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 47b75da63f01..579762e4f8d8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -714,7 +714,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
+   memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 315c22974f0d..fac599deba56 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6050,7 +6050,7 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
  * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long 
zone,
+void __meminit memmap_init_range(unsigned long size, int nid, unsigned long 
zone,
unsigned long start_pfn, unsigned long zone_end_pfn,
enum meminit_context context,
struct vmem_altmap *altmap, int migratetype)
@@ -6187,21 +6187,21 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-void __meminit __weak memmap_init(unsigned long size, int nid,
+void __meminit __weak memmap_init_zone(unsigned long size, int nid,
  unsigned long zone,
- unsigned long range_start_pfn)
+ unsigned long zone_start_pfn)
 {
unsigned long start_pfn, end_pfn;
-   unsigned long range_end_pfn = range_start_pfn + size;
+   unsigned long zone_end_pfn = zone_start_pfn + size;
int i;
 
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
-   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
-   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+   start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
+   end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
 
if (end_pfn > start_pfn) {
size = end_pfn - start_pfn;
-   memmap_init_zone(size, nid, zone, start_pfn, 
range_end_pfn,
+   memmap_init_range(size, nid, zone, start_pfn, 
zone_end_pfn,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
}
@@ -6903,7 +6903,7 @@ static void __init free_area_init

[PATCH 1/2] mm: memmap defer init doesn't work as expected

2020-12-13 Thread Baoquan He
VMware observed a performance regression during memmap init on their platform,
and bisected it to commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN").

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do the memmap init of one whole zone:
initialize all low zones of one numa node, but defer the memmap init of the
last zone in that numa node. However, since commit 73a6e474cb376,
memmap_init() iterates over the memblock regions inside one zone and then
calls memmap_init_zone() to do the memmap init for each region.

E.g., on VMware's system, the memory layout is as below; there are two memory
regions in node 2. The current code will mistakenly initialize the whole 1st
region [mem 0xab-0xfc], then do the memmap defer to initialize
only one memory section on the 2nd region [mem 0x100-0x1033fff].
In fact, we only expect to see one memory section's memmap initialized.
That's why more time is spent here.

[0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
[0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0xbfff]
[0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x55]
[0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x56-0xaa]
[0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab-0xfc]
[0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x100-0x1033fff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge whether
the defer needs to be taken zone wide.

Fixes: 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions
rather that check each PFN")
Signed-off-by: Baoquan He 
Cc: stable@vger.kernel.org

---
 arch/ia64/mm/init.c | 4 ++--
 include/linux/mm.h  | 5 +++--
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 8 +---
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index ef12e097f318..27ca549ff47e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
 
if (map_start < map_end)
memmap_init_zone((unsigned long)(map_end - map_start),
-args->nid, args->zone, page_to_pfn(map_start),
+args->nid, args->zone, page_to_pfn(map_start), 
page_to_pfn(map_end),
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
@@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long 
zone,
 unsigned long start_pfn)
 {
if (!vmem_map) {
-   memmap_init_zone(size, nid, zone, start_pfn,
+   memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..cd5c313729ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2439,8 +2439,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-   enum meminit_context, struct vmem_altmap *, int migratetype);
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+   unsigned long, unsigned long, enum meminit_context,
+   struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 63b2e46b6555..47b75da63f01 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -714,7 +714,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * expects the zone spans the pfn range. All the pages in the range
 * are reserved so nobody should be touching them so we should be safe
 */
-   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
+   memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
 MEMINIT_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..315c22974f0d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -448,6 +44
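
To illustrate why the real zone end pfn matters here, below is a simplified
sketch of defer_init() (the static-key check for deferred struct page init is
omitted; this is not the exact upstream code). Only when zone_end_pfn is the
real end of the zone can it tell the last zone of the node apart from the low
zones and count the initialised sections zone wide:

static inline bool __meminit
defer_init(int nid, unsigned long pfn, unsigned long zone_end_pfn)
{
	static unsigned long prev_end_pfn, nr_initialised;

	/* A new zone has started: reset the per-zone init counter. */
	if (prev_end_pfn != zone_end_pfn) {
		prev_end_pfn = zone_end_pfn;
		nr_initialised = 0;
	}

	/* Always fully initialise the low zones of the node. */
	if (zone_end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
		return false;

	/* Defer the rest of the last zone once one section is initialised. */
	nr_initialised++;
	if ((nr_initialised > PAGES_PER_SECTION) &&
	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
		NODE_DATA(nid)->first_deferred_pfn = pfn;
		return true;
	}
	return false;
}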

[PATCH 0/2] Fix the incorrect memmap init defer handling

2020-12-13 Thread Baoquan He
VMware reported a performance regression during memmap_init() invocation,
and bisected it to commit 73a6e474cb376 ("mm: memmap_init: iterate over
memblock regions rather that check each PFN").

After investigation, it's caused by incorrect memmap init defer handling
in memmap_init_zone() after commit 73a6e474cb376. The current
memmap_init_zone() only handles one memory region of one zone, while
memmap_init() iterates over all its memory regions and passes them one by
one into memmap_init_zone() to handle.

So in this patchset, patch 1/2 fixes the bug observed by VMware. Patch
2/2 cleans up the inappropriate names of memmap_init() and memmap_init_zone()
accordingly.

VMware helped do the testing on their VMware ESXi platform. This patchset
is based on 5.10.0-rc7+, the master branch of Linus's tree.

Baoquan He (2):
  mm: memmap defer init doesn't work as expected
  mm: rename memmap_init() and memmap_init_zone()

 arch/ia64/mm/init.c |  8 
 include/linux/mm.h  |  5 +++--
 mm/memory_hotplug.c |  2 +-
 mm/page_alloc.c | 22 --
 4 files changed, 20 insertions(+), 17 deletions(-)

-- 
2.17.2



Re: [PATCH v13 6/8] arm64: kdump: reimplement crashkernel=X

2020-11-12 Thread Baoquan He
On 11/12/20 at 10:25am, Mike Rapoport wrote:
> On Wed, Nov 11, 2020 at 09:54:48PM +0800, Baoquan He wrote:
> > On 11/11/20 at 09:27pm, chenzhou wrote:
> > > Hi Baoquan,
> > ...
> > > >>  #ifdef CONFIG_CRASH_DUMP
> > > >>  static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
> > > >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > >> index 1c0f3e02f731..c55cee290bbb 100644
> > > >> --- a/arch/arm64/mm/mmu.c
> > > >> +++ b/arch/arm64/mm/mmu.c
> > > >> @@ -488,6 +488,10 @@ static void __init map_mem(pgd_t *pgdp)
> > > >> */
> > > >>memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > > >>  #ifdef CONFIG_KEXEC_CORE
> > > >> +  if (crashk_low_res.end)
> > > >> +  memblock_mark_nomap(crashk_low_res.start,
> > > >> +  resource_size(&crashk_low_res));
> > > >> +
> > > >>if (crashk_res.end)
> > > >>memblock_mark_nomap(crashk_res.start,
> > > >>resource_size(&crashk_res));
> > > >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > > >> index d39892bdb9ae..cdef7d8c91a6 100644
> > > >> --- a/kernel/crash_core.c
> > > >> +++ b/kernel/crash_core.c
> > > >> @@ -321,7 +321,7 @@ int __init parse_crashkernel_low(char *cmdline,
> > > >>  
> > > >>  int __init reserve_crashkernel_low(void)
> > > >>  {
> > > >> -#ifdef CONFIG_X86_64
> > > >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
> > > > Not very sure whether a CONFIG_64BIT check would be better.
> > > If we do it like this, there may be compile errors for other 64-bit 
> > > kernels, such as mips.
> > > >
> > > >>unsigned long long base, low_base = 0, low_size = 0;
> > > >>unsigned long low_mem_limit;
> > > >>int ret;
> > > >> @@ -362,12 +362,14 @@ int __init reserve_crashkernel_low(void)
> > > >>  
> > > >>crashk_low_res.start = low_base;
> > > >>crashk_low_res.end   = low_base + low_size - 1;
> > > >> +#ifdef CONFIG_X86_64
> > > >>insert_resource(&iomem_resource, &crashk_low_res);
> > > >> +#endif
> > > >>  #endif
> > > >>return 0;
> > > >>  }
> > > >>  
> > > >> -#ifdef CONFIG_X86
> > > >> +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > > > Should we make this weak default so that we can remove the ARCH config?
> > > Same as above: some arches may not support kdump, and in that case
> > > compile errors occur.
> > 
> > OK, not sure if other people have a better idea; otherwise, we can live with
> > it.
> > Thanks for telling me.
> 
> I think it would be better to have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
> in arch/Kconfig and select this by X86 and ARM64.
> 
> Since reserve_crashkernel() implementations are quite similart on other
> architectures as well, we can have more users of this later.

Yes, this sounds like a nice way.
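
A rough sketch of that shape (an assumption about the form, not a posted
patch) would be a new symbol in arch/Kconfig:

config ARCH_WANT_RESERVE_CRASH_KERNEL
	bool

selected from arch/x86/Kconfig and arch/arm64/Kconfig, with
kernel/crash_core.c keying off it instead of listing architectures:

-#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
+#ifdef CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL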



Re: [PATCH v13 6/8] arm64: kdump: reimplement crashkernel=X

2020-11-11 Thread Baoquan He
On 11/11/20 at 09:27pm, chenzhou wrote:
> Hi Baoquan,
...
> >>  #ifdef CONFIG_CRASH_DUMP
> >>  static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index 1c0f3e02f731..c55cee290bbb 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -488,6 +488,10 @@ static void __init map_mem(pgd_t *pgdp)
> >> */
> >>memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> >>  #ifdef CONFIG_KEXEC_CORE
> >> +  if (crashk_low_res.end)
> >> +  memblock_mark_nomap(crashk_low_res.start,
> >> +  resource_size(&crashk_low_res));
> >> +
> >>if (crashk_res.end)
> >>memblock_mark_nomap(crashk_res.start,
> >>resource_size(&crashk_res));
> >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> >> index d39892bdb9ae..cdef7d8c91a6 100644
> >> --- a/kernel/crash_core.c
> >> +++ b/kernel/crash_core.c
> >> @@ -321,7 +321,7 @@ int __init parse_crashkernel_low(char *cmdline,
> >>  
> >>  int __init reserve_crashkernel_low(void)
> >>  {
> >> -#ifdef CONFIG_X86_64
> >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
> > Not very sure whether a CONFIG_64BIT check would be better.
> If we do it like this, there may be compile errors for other 64-bit 
> kernels, such as mips.
> >
> >>unsigned long long base, low_base = 0, low_size = 0;
> >>unsigned long low_mem_limit;
> >>int ret;
> >> @@ -362,12 +362,14 @@ int __init reserve_crashkernel_low(void)
> >>  
> >>crashk_low_res.start = low_base;
> >>crashk_low_res.end   = low_base + low_size - 1;
> >> +#ifdef CONFIG_X86_64
> >>insert_resource(&iomem_resource, &crashk_low_res);
> >> +#endif
> >>  #endif
> >>return 0;
> >>  }
> >>  
> >> -#ifdef CONFIG_X86
> >> +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > Should we make this weak default so that we can remove the ARCH config?
> Same as above: some arches may not support kdump, and in that case compile 
> errors occur.

OK, not sure if other people have a better idea; otherwise, we can live with it.
Thanks for telling me.



Re: [PATCH v13 0/8] support reserving crashkernel above 4G on arm64 kdump

2020-11-10 Thread Baoquan He
Hi Zhou, Bhupesh

On 10/31/20 at 03:44pm, Chen Zhou wrote:
> There are the following issues in arm64 kdump:
> 1. We use crashkernel=X to reserve the crashkernel below 4G, which
> will fail when there is not enough low memory.
> 2. If the crashkernel is reserved above 4G, the crash dump kernel
> will fail to boot because there is no low memory available for
> allocation.
> 3. Since commit 1a8e1cef7603 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
> if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
> devices in the crash dump kernel that need ZONE_DMA will fail to
> allocate.

I went through this patchset, mainly the x86-related and generic
changes; the changes look great and carry no risk. And I know Bhupesh is
following up on this and helping review, thanks to you both.

So you have also tested crashkernel reservation on x86_64, with both the
normal reservation and the high/low reservation, and it is working well,
right? I'm asking because I didn't see a test result description; just
noting it.

Thanks
Baoquan

> 
> To solve these issues, change the behavior of crashkernel=X.
> crashkernel=X tries a low allocation in the DMA zone (or the DMA32 zone if
> CONFIG_ZONE_DMA is disabled), and falls back to a high allocation if it fails.
> 
> We can also use "crashkernel=X,high" to select a high region above the
> DMA zone, which also tries to allocate at least 256M of low memory in the
> DMA zone automatically (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
> "crashkernel=Y,low" can be used to allocate a specified amount of low memory.
> 
> When reserving the crashkernel in high memory, some low memory is reserved
> for crash dump kernel devices, so there may be two regions reserved for
> the crash dump kernel.
> In order to distinguish it from the high region and have no effect on the use
> of existing kexec-tools, rename the low region to "Crash kernel (low)",
> and pass the low region by reusing the DT property
> "linux,usable-memory-range". We made the low memory region the last
> range of "linux,usable-memory-range" to keep compatibility with existing
> user-space and older kdump kernels.
> 
> Besides, we need to modify kexec-tools:
> arm64: support more than one crash kernel regions(see [1])
> 
> Another update is document about DT property 'linux,usable-memory-range':
> schemas: update 'linux,usable-memory-range' node schema(see [2])
> 
> This patchset contains the following eight patches:
> 0001-x86-kdump-replace-the-hard-coded-alignment-with-macr.patch
> 0002-x86-kdump-make-the-lower-bound-of-crash-kernel-reser.patch
> 0003-x86-kdump-use-macro-CRASH_ADDR_LOW_MAX-in-functions-.patch
> 0004-x86-kdump-move-reserve_crashkernel-_low-into-crash_c.patch
> 0005-arm64-kdump-introduce-some-macroes-for-crash-kernel-.patch
> 0006-arm64-kdump-reimplement-crashkernel-X.patch
> 0007-arm64-kdump-add-memory-for-devices-by-DT-property-li.patch
> 0008-kdump-update-Documentation-about-crashkernel.patch
> 
> 0001-0003 are some x86 cleanups which prepares for making
> functionsreserve_crashkernel[_low]() generic.
> 0004 makes functions reserve_crashkernel[_low]() generic.
> 0005-0006 reimplements arm64 crashkernel=X.
> 0007 adds memory for devices by DT property linux,usable-memory-range.
> 0008 updates the doc.
> 
> Changes since [v12]
> - Rebased on top of 5.10-rc1.
> - Keep CRASH_ALIGN as 16M suggested by Dave.
> - Drop patch "kdump: add threshold for the required memory".
> - Add Tested-by from John.
> 
> Changes since [v11]
> - Rebased on top of 5.9-rc4.
> - Make the function reserve_crashkernel() of x86 generic.
> Suggested by Catalin, make the function reserve_crashkernel() of x86 generic
> and arm64 use the generic version to reimplement crashkernel=X.
> 
> Changes since [v10]
> - Reimplement crashkernel=X suggested by Catalin, Many thanks to Catalin.
> 
> Changes since [v9]
> - Patch 1 add Acked-by from Dave.
> - Update patch 5 according to Dave's comments.
> - Update chosen schema.
> 
> Changes since [v8]
> - Reuse DT property "linux,usable-memory-range".
> Suggested by Rob, reuse DT property "linux,usable-memory-range" to pass the 
> low
> memory region.
> - Fix kdump broken with ZONE_DMA reintroduced.
> - Update chosen schema.
> 
> Changes since [v7]
> - Move x86 CRASH_ALIGN to 2M
> Suggested by Dave and do some test, move x86 CRASH_ALIGN to 2M.
> - Update Documentation/devicetree/bindings/chosen.txt.
> Add corresponding documentation to 
> Documentation/devicetree/bindings/chosen.txt
> suggested by Arnd.
> - Add Tested-by from Jhon and pk.
> 
> Changes since [v6]
> - Fix build errors reported by kbuild test robot.
> 
> Changes since [v5]
> - Move reserve_crashkernel_low() into kernel/crash_core.c.
> - Delete crashkernel=X,high.
> - Modify crashkernel=X,low.
> If crashkernel=X,low is specified simultaneously, reserve spcified size low
> memory for crash kdump kernel devices firstly and then reserve memory above 
> 4G.
> In addition, rename crashk_low_res as "Crash kernel (low)" for arm64, and then
> pass to crash dump kernel by DT property 

Re: [PATCH v13 6/8] arm64: kdump: reimplement crashkernel=X

2020-11-10 Thread Baoquan He
On 10/31/20 at 03:44pm, Chen Zhou wrote:
> There are the following issues in arm64 kdump:
> 1. We use crashkernel=X to reserve the crashkernel below 4G, which
> will fail when there is not enough low memory.
> 2. If the crashkernel is reserved above 4G, the crash dump kernel
> will fail to boot because there is no low memory available for
> allocation.
> 3. Since commit 1a8e1cef7603 ("arm64: use both ZONE_DMA and ZONE_DMA32"),
> if the memory reserved for the crash dump kernel falls in ZONE_DMA32,
> devices in the crash dump kernel that need ZONE_DMA will fail to
> allocate.
> 
> To solve these issues, change the behavior of crashkernel=X and
> introduce crashkernel=X,[high,low]. crashkernel=X tries a low allocation
> in the DMA zone (or the DMA32 zone if CONFIG_ZONE_DMA is disabled), and falls
> back to a high allocation if it fails.
> We can also use "crashkernel=X,high" to select a region above the DMA zone,
> which also tries to allocate at least 256M in the DMA zone automatically
> (or the DMA32 zone if CONFIG_ZONE_DMA is disabled).
> "crashkernel=Y,low" can be used to allocate a specified amount of low memory.
> 
> Another minor change: there may be two regions reserved for the crash
> dump kernel; in order to distinguish it from the high region and have no
> effect on the use of existing kexec-tools, rename the low region to
> "Crash kernel (low)".
> 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/arm64/include/asm/kexec.h |  9 +
>  arch/arm64/kernel/setup.c  | 13 +++-
>  arch/arm64/mm/init.c   | 60 ++
>  arch/arm64/mm/mmu.c|  4 +++
>  kernel/crash_core.c|  8 +++--
>  5 files changed, 34 insertions(+), 60 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
> index 402d208265a3..79909ae5e22e 100644
> --- a/arch/arm64/include/asm/kexec.h
> +++ b/arch/arm64/include/asm/kexec.h
> @@ -28,7 +28,12 @@
>  /* 2M alignment for crash kernel regions */
>  #define CRASH_ALIGN  SZ_2M
>  
> +#ifdef CONFIG_ZONE_DMA
> +#define CRASH_ADDR_LOW_MAX   arm64_dma_phys_limit
> +#else
>  #define CRASH_ADDR_LOW_MAX   arm64_dma32_phys_limit
> +#endif
> +
>  #define CRASH_ADDR_HIGH_MAX  MEMBLOCK_ALLOC_ACCESSIBLE
>  
>  #ifndef __ASSEMBLY__
> @@ -96,6 +101,10 @@ static inline void crash_prepare_suspend(void) {}
>  static inline void crash_post_resume(void) {}
>  #endif
>  
> +#ifdef CONFIG_KEXEC_CORE
> +extern void __init reserve_crashkernel(void);
> +#endif
> +
>  #ifdef CONFIG_KEXEC_FILE
>  #define ARCH_HAS_KIMAGE_ARCH
>  
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 133257ffd859..6aff30de8f47 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -238,7 +238,18 @@ static void __init request_standard_resources(void)
>   kernel_data.end <= res->end)
>   request_resource(res, _data);
>  #ifdef CONFIG_KEXEC_CORE
> - /* Userspace will find "Crash kernel" region in /proc/iomem. */
> + /*
> +  * Userspace will find "Crash kernel" or "Crash kernel (low)"
> +  * region in /proc/iomem.
> +  * In order to distinguish it from the high region and have no effect
> +  * on the use of existing kexec-tools, rename the low region to
> +  * "Crash kernel (low)".
> +  */
> + if (crashk_low_res.end && crashk_low_res.start >= res->start &&
> + crashk_low_res.end <= res->end) {
> + crashk_low_res.name = "Crash kernel (low)";
> + request_resource(res, &crashk_low_res);
> + }
>   if (crashk_res.end && crashk_res.start >= res->start &&
>   crashk_res.end <= res->end)
>   request_resource(res, &crashk_res);
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index a07fd8e1f926..888c4f7eadc3 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -62,66 +63,11 @@ EXPORT_SYMBOL(memstart_addr);
>  phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  phys_addr_t arm64_dma32_phys_limit __ro_after_init;
>  
> -#ifdef CONFIG_KEXEC_CORE
> -/*
> - * reserve_crashkernel() - reserves memory for crash kernel
> - *
> - * This function reserves memory area given in "crashkernel=" kernel command
> - * line parameter. The memory reserved is used by dump capture kernel when
> - * primary kernel is crashing.
> - */
> -static void __init reserve_crashkernel(void)
> -{
> - unsigned long long crash_base, crash_size;
> - int ret;
> -
> - ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> - &crash_size, &crash_base);
> - /* no crashkernel= or invalid value specified */
> - if (ret || !crash_size)
> - return;
> -
> - crash_size = PAGE_ALIGN(crash_size);
> -
> - if (crash_base 

Re: [PATCH v13 1/8] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2020-11-10 Thread Baoquan He
On 10/31/20 at 03:44pm, Chen Zhou wrote:
> Move CRASH_ALIGN to header asm/kexec.h and replace the hard-coded
> alignment with macro CRASH_ALIGN in function reserve_crashkernel().

It seems you describe what you have done in this patch, but don't add a
few more words to explain why it's done that way. Please see the inline
comments below.

I can see a similar problem in other patches as well.

> 
> Suggested-by: Dave Young 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/include/asm/kexec.h | 3 +++
>  arch/x86/kernel/setup.c  | 5 +
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 6802c59e8252..8cf9d3fd31c7 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -18,6 +18,9 @@
>  
>  # define KEXEC_CONTROL_CODE_MAX_SIZE 2048
>  
> +/* 16M alignment for crash kernel regions */
> +#define CRASH_ALIGN  SZ_16M
> +
>  #ifndef __ASSEMBLY__
>  
>  #include 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 84f581c91db4..bf373422dc8a 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -395,9 +395,6 @@ static void __init 
> memblock_x86_reserve_range_setup_data(void)
>  
>  #ifdef CONFIG_KEXEC_CORE
>  
> -/* 16M alignment for crash kernel regions */
> -#define CRASH_ALIGN  SZ_16M
> -
>  /*
>   * Keep the crash kernel below this limit.
>   *
> @@ -515,7 +512,7 @@ static void __init reserve_crashkernel(void)
>   } else {
>   unsigned long long start;
>  
> - start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
> + start = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 
> crash_base,
> crash_base + crash_size);

Here, SZ_1M is replaced with CRASH_ALIGN, which is 16M. I remember
commenting before that this had better be mentioned in the patch log.

>   if (start != crash_base) {
>   pr_info("crashkernel reservation failed - memory is in 
> use.\n");
> -- 
> 2.20.1
> 
> 
> 



Re: [PATCH v3 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2020-10-19 Thread Baoquan He
On 09/30/20 at 12:23pm, Alexander Egorenkov wrote:
> The offset of the field 'init_uts_ns.name' has changed
> since commit 9a56493f6942 ("uts: Use generic ns_common::count").
> 
> Link: 
> https://lore.kernel.org/r/159644978167.604812.1773586504374412107.stgit@localhost.localdomain
> 
> Make the offset of the field 'uts_namespace.name' available
> in VMCOREINFO because tools like 'crash-utility' and
> 'makedumpfile' must be able to read it from crash dumps.
> 
> Signed-off-by: Alexander Egorenkov 

Ack, thanks.

Acked-by: Baoquan He 

> ---
> 
> v2 -> v3:
>  * Added documentation to vmcoreinfo.rst
>  * Use the short form of the commit reference
> 
> v1 -> v2:
>  * Improved commit message
>  * Added link to the discussion of the uts namespace changes
> 
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 6 ++
>  kernel/crash_core.c| 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst 
> b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index e44a6c01f336..3861a25faae1 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -39,6 +39,12 @@ call.
>  User-space tools can get the kernel name, host name, kernel release
>  number, kernel version, architecture name and OS type from it.
>  
> +(uts_namespace, name)
> +-
> +
> +Offset of the name's member. Crash Utility and Makedumpfile get
> +the start address of the init_uts_ns.name from this.
> +
>  node_online_map
>  ---
>  
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..173fdc261882 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -447,6 +447,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_PAGESIZE(PAGE_SIZE);
>  
>   VMCOREINFO_SYMBOL(init_uts_ns);
> + VMCOREINFO_OFFSET(uts_namespace, name);
>   VMCOREINFO_SYMBOL(node_online_map);
>  #ifdef CONFIG_MMU
>   VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> -- 
> 2.26.2
> 
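
As a rough sketch of how a consumer would use the two entries (illustrative
only; vmcoreinfo_symbol() and vmcoreinfo_offset() are hypothetical helpers,
and the actual crash/makedumpfile code differs):

	/*
	 * Hypothetical helpers standing in for however the tool parses the
	 * SYMBOL(init_uts_ns) and OFFSET(uts_namespace.name) lines of the note.
	 */
	unsigned long init_uts_ns_addr = vmcoreinfo_symbol("init_uts_ns");
	unsigned long name_offset      = vmcoreinfo_offset("uts_namespace", "name");

	/* start address of init_uts_ns.name in the crashed kernel's memory */
	unsigned long uts_name_addr = init_uts_ns_addr + name_offset;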



Re: [PATCH v2 1/1] kdump: append uts_namespace.name offset to VMCOREINFO

2020-09-24 Thread Baoquan He
On 09/24/20 at 02:46pm, Alexander Egorenkov wrote:
> The offset of the field 'init_uts_ns.name' has changed
> since
> 
> commit 9a56493f6942c0e2df1579986128721da96e00d8
> Author: Kirill Tkhai 
> Date:   Mon Aug 3 13:16:21 2020 +0300
> 
> uts: Use generic ns_common::count
> 
> Link: 
> https://lore.kernel.org/r/159644978167.604812.1773586504374412107.stgit@localhost.localdomain

It seems there's some argument about the generic ns_common::count in the
thread at the above link. Apart from that, adding the offset of
uts_namespace.name looks good to me.

Acked-by: Baoquan He 


> 
> Make the offset of the field 'uts_namespace.name' available
> in VMCOREINFO because tools like 'crash-utility' and
> 'makedumpfile' must be able to read it from crash dumps.
> 
> Signed-off-by: Alexander Egorenkov 
> ---
>  kernel/crash_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 106e4500fd53..173fdc261882 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -447,6 +447,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_PAGESIZE(PAGE_SIZE);
>  
>   VMCOREINFO_SYMBOL(init_uts_ns);
> + VMCOREINFO_OFFSET(uts_namespace, name);
>   VMCOREINFO_SYMBOL(node_online_map);
>  #ifdef CONFIG_MMU
>   VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> -- 
> 2.26.2
> 



Re: [PATCH] Revert "iommu/amd: Treat per-device exclusion ranges as r/w unity-mapped regions"

2020-09-22 Thread Baoquan He
Forgot CC-ing Jerry, add him.

On 09/23/20 at 10:26am, Baoquan He wrote:
> A kdump kernel boot regression was reported on an HPE system.
> Bisect points at commit 387caf0b759ac43 ("iommu/amd: Treat per-device
> exclusion ranges as r/w unity-mapped regions") as the culprit. Reverting it
> fixes the failure.
> 
> With the commit, the kdump kernel will always print the error message below,
> and then naturally the AMD iommu can't function normally during kdump kernel bootup.
> 
>   ~
>   AMD-Vi: [Firmware Bug]: IVRS invalid checksum
> 
> Why commit 387caf0b759ac43 causes it hasn't been made clear yet.

Hi Joerg, Adrian

We only have one machine which can reproduce the issue; it's an HPE
gen10-01. If any log or info is needed, please let me know and I can
attach it here.

Thanks
Baoquan

> 
> In the commit log, a discussion thread link is pasted. In that discussion
> thread, Adrian said the fix is for a system with an already broken BIOS, and
> Joerg suggested two options. Finally option 2) was taken. Maybe option 1)
> would be the right approach?
> 
>   1) Bail out and disable the IOMMU as the BIOS screwed up
>   2) Treat per-device exclusion ranges just as r/w unity-mapped
>  regions.
> 
> https://lists.linuxfoundation.org/pipermail/iommu/2019-November/040117.html
> Signed-off-by: Baoquan He 
> ---
>  drivers/iommu/amd/init.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
> index 9aa1eae26634..bbe7ceae5949 100644
> --- a/drivers/iommu/amd/init.c
> +++ b/drivers/iommu/amd/init.c
> @@ -1109,17 +1109,22 @@ static int __init add_early_maps(void)
>   */
>  static void __init set_device_exclusion_range(u16 devid, struct ivmd_header 
> *m)
>  {
> + struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
> +
>   if (!(m->flags & IVMD_FLAG_EXCL_RANGE))
>   return;
>  
> - /*
> -  * Treat per-device exclusion ranges as r/w unity-mapped regions
> -  * since some buggy BIOSes might lead to the overwritten exclusion
> -  * range (exclusion_start and exclusion_length members). This
> -  * happens when there are multiple exclusion ranges (IVMD entries)
> -  * defined in ACPI table.
> -  */
> - m->flags = (IVMD_FLAG_IW | IVMD_FLAG_IR | IVMD_FLAG_UNITY_MAP);
> + if (iommu) {
> + /*
> +  * We only can configure exclusion ranges per IOMMU, not
> +  * per device. But we can enable the exclusion range per
> +  * device. This is done here
> +  */
> + set_dev_entry_bit(devid, DEV_ENTRY_EX);
> + iommu->exclusion_start = m->range_start;
> + iommu->exclusion_length = m->range_length;
> + }
> +
>  }
>  
>  /*
> -- 
> 2.17.2
> 
> 



[PATCH] Revert "iommu/amd: Treat per-device exclusion ranges as r/w unity-mapped regions"

2020-09-22 Thread Baoquan He
A kdump kernel boot regression was reported on an HPE system.
Bisect points at commit 387caf0b759ac43 ("iommu/amd: Treat per-device
exclusion ranges as r/w unity-mapped regions") as the culprit. Reverting it
fixes the failure.

With the commit, the kdump kernel will always print the error message below,
and then naturally the AMD iommu can't function normally during kdump kernel bootup.

  ~
  AMD-Vi: [Firmware Bug]: IVRS invalid checksum

Why commit 387caf0b759ac43 causes it hasn't been made clear yet.

In the commit log, a discussion thread link is pasted. In that discussion
thread, Adrian said the fix is for a system with an already broken BIOS, and
Joerg suggested two options. Finally option 2) was taken. Maybe option 1)
would be the right approach?

  1) Bail out and disable the IOMMU as the BIOS screwed up
  2) Treat per-device exclusion ranges just as r/w unity-mapped
 regions.

https://lists.linuxfoundation.org/pipermail/iommu/2019-November/040117.html
Signed-off-by: Baoquan He 
---
 drivers/iommu/amd/init.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 9aa1eae26634..bbe7ceae5949 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1109,17 +1109,22 @@ static int __init add_early_maps(void)
  */
 static void __init set_device_exclusion_range(u16 devid, struct ivmd_header *m)
 {
+   struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
+
if (!(m->flags & IVMD_FLAG_EXCL_RANGE))
return;
 
-   /*
-* Treat per-device exclusion ranges as r/w unity-mapped regions
-* since some buggy BIOSes might lead to the overwritten exclusion
-* range (exclusion_start and exclusion_length members). This
-* happens when there are multiple exclusion ranges (IVMD entries)
-* defined in ACPI table.
-*/
-   m->flags = (IVMD_FLAG_IW | IVMD_FLAG_IR | IVMD_FLAG_UNITY_MAP);
+   if (iommu) {
+   /*
+* We only can configure exclusion ranges per IOMMU, not
+* per device. But we can enable the exclusion range per
+* device. This is done here
+*/
+   set_dev_entry_bit(devid, DEV_ENTRY_EX);
+   iommu->exclusion_start = m->range_start;
+   iommu->exclusion_length = m->range_length;
+   }
+
 }
 
 /*
-- 
2.17.2



Re: [PATCH v12 3/9] x86: kdump: use macro CRASH_ADDR_LOW_MAX in functions reserve_crashkernel[_low]()

2020-09-18 Thread Baoquan He
Hi,

On 09/07/20 at 09:47pm, Chen Zhou wrote:
> To make the functions reserve_crashkernel[_low]() generic,
> replace some hard-coded numbers with the macro CRASH_ADDR_LOW_MAX.
> 
> Signed-off-by: Chen Zhou 
> ---
>  arch/x86/kernel/setup.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d7fd90c52dae..71a6a6e7ca5b 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -430,7 +430,7 @@ static int __init reserve_crashkernel_low(void)
>   unsigned long total_low_mem;
>   int ret;
>  
> - total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
> + total_low_mem = memblock_mem_size(CRASH_ADDR_LOW_MAX >> PAGE_SHIFT);

Just a note that the replacement has been partially done in another patch
from Mike Rapoport. He seems to have done the reserve_crashkernel_low()
part; there's one left in reserve_crashkernel(), which you might want to
check.

Mike's patch, which is from a patchset, has been merged into Andrew's next
tree.

commit 6e50f7672ffa362e9bd4bc0c0d2524ed872828c5
Author: Mike Rapoport 
Date:   Wed Aug 26 15:22:32 2020 +1000

x86/setup: simplify reserve_crashkernel()

>  
>   /* crashkernel=Y,low */
>   ret = parse_crashkernel_low(boot_command_line, total_low_mem, 
> > &low_size, &base);
> @@ -451,7 +451,7 @@ static int __init reserve_crashkernel_low(void)
>   return 0;
>   }
>  
> - low_base = memblock_find_in_range(CRASH_ALIGN, 1ULL << 32, low_size, 
> CRASH_ALIGN);
> + low_base = memblock_find_in_range(CRASH_ALIGN, CRASH_ADDR_LOW_MAX, 
> low_size, CRASH_ALIGN);
>   if (!low_base) {
>   pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> smaller size.\n",
>  (unsigned long)(low_size >> 20));
> @@ -504,8 +504,9 @@ static void __init reserve_crashkernel(void)
>   if (!crash_base) {
>   /*
>* Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
> -  * crashkernel=x,high reserves memory over 4G, also allocates
> -  * 256M extra low memory for DMA buffers and swiotlb.
> +  * crashkernel=x,high reserves memory over CRASH_ADDR_LOW_MAX,
> +  * also allocates 256M extra low memory for DMA buffers
> +  * and swiotlb.
>* But the extra memory is not required for all machines.
>* So try low memory first and fall back to high memory
>* unless "crashkernel=size[KMG],high" is specified.
> @@ -539,7 +540,7 @@ static void __init reserve_crashkernel(void)
>   return;
>   }
>  
> - if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
> + if (crash_base >= CRASH_ADDR_LOW_MAX && reserve_crashkernel_low()) {
>   memblock_free(crash_base, crash_size);
>   return;
>   }
> -- 
> 2.20.1
> 



Re: [PATCH v5 0/6] mm / virtio-mem: support ZONE_MOVABLE

2020-08-21 Thread Baoquan He
On 08/21/20 at 10:31am, David Hildenbrand wrote:
> On 16.08.20 14:53, David Hildenbrand wrote:
> > For 5.10. Patch #1-#4,#6 have RBs or ACKs, patch #5 is virtio-mem stuff
> > maintained by me. This should go via the -mm tree.
> > 
> 
> @Andrew, can we give this a churn if there are no further comments? Thanks!

Saw this series in next already.



Re: [PATCH 10/10] mm/hugetlb: not necessary to abuse temporary page to workaround the nasty free_huge_page

2020-08-11 Thread Baoquan He
On 08/11/20 at 02:43pm, Mike Kravetz wrote:
> Here is a patch to do that.  However, we are optimizing a return path in
> a race condition that we are unlikely to ever hit.  I 'tested' it by 
> allocating
> an 'extra' page and freeing it via this method in alloc_surplus_huge_page.
> 
> From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz 
> Date: Tue, 11 Aug 2020 12:45:41 -0700
> Subject: [PATCH] hugetlb: optimize race error return in
>  alloc_surplus_huge_page
> 
> The routine alloc_surplus_huge_page() could race with a pool
> size change.  If this happens, the allocated page may not be needed.
> To free the page, the current code will 'Abuse temporary page to
> workaround the nasty free_huge_page codeflow'.  Instead, directly
> call the low level routine that free_huge_page uses.  This works
> out well because the page is new, we hold the only reference and
> already hold the hugetlb_lock.
> 
> Signed-off-by: Mike Kravetz 
> ---
>  mm/hugetlb.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 590111ea6975..ac89b91fba86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct 
> hstate *h, gfp_t gfp_mask,
>   /*
>* We could have raced with the pool size change.
>* Double check that and simply deallocate the new page
> -  * if we would end up overcommiting the surpluses. Abuse
> -  * temporary page to workaround the nasty free_huge_page
> -  * codeflow
> +  * if we would end up overcommiting the surpluses.
>*/
>   if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> - SetPageHugeTemporary(page);
> + /*
> +  * Since this page is new, we hold the only reference, and
> +  * we already hold the hugetlb_lock call the low level free
> +  * page routine.  This saves at least a lock roundtrip.
> +  */
> + (void)put_page_testzero(page); /* don't call destructor */
> + update_and_free_page(h, page);

Yeah, either taking this code change or keeping the temporary page way
as is looks good to me.

>   spin_unlock(&hugetlb_lock);
> - put_page(page);
>   return NULL;
>   } else {
>   h->surplus_huge_pages++;



Re: [PATCH 10/10] mm/hugetlb: not necessary to abuse temporary page to workaround the nasty free_huge_page

2020-08-11 Thread Baoquan He
On 08/11/20 at 08:54am, Michal Hocko wrote:
> On Tue 11-08-20 09:51:48, Baoquan He wrote:
> > On 08/10/20 at 05:19pm, Mike Kravetz wrote:
> > > On 8/9/20 7:17 PM, Baoquan He wrote:
> > > > On 08/07/20 at 05:12pm, Wei Yang wrote:
> > > >> Let's always increase surplus_huge_pages and so that free_huge_page
> > > >> could decrease it at free time.
> > > >>
> > > >> Signed-off-by: Wei Yang 
> > > >> ---
> > > >>  mm/hugetlb.c | 14 ++
> > > >>  1 file changed, 6 insertions(+), 8 deletions(-)
> > > >>
> > > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > >> index 1f2010c9dd8d..a0eb81e0e4c5 100644
> > > >> --- a/mm/hugetlb.c
> > > >> +++ b/mm/hugetlb.c
> > > >> @@ -1913,21 +1913,19 @@ static struct page 
> > > >> *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> > > >>return NULL;
> > > >>  
> > > >>spin_lock(&hugetlb_lock);
> > > >> +
> > > >> +  h->surplus_huge_pages++;
> > > >> +  h->surplus_huge_pages_node[page_to_nid(page)]++;
> > > >> +
> > > >>/*
> > > >> * We could have raced with the pool size change.
> > > >> * Double check that and simply deallocate the new page
> > > >> -   * if we would end up overcommiting the surpluses. Abuse
> > > >> -   * temporary page to workaround the nasty free_huge_page
> > > >> -   * codeflow
> > > >> +   * if we would end up overcommiting the surpluses.
> > > >> */
> > > >> -  if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> > > >> -  SetPageHugeTemporary(page);
> > > > 
> > > > Hmm, the temporary page way is taken intentionally in
> > > > commit 9980d744a0428 ("mm, hugetlb: get rid of surplus page accounting 
> > > > tricks").
> > > > From the code, this is done while holding hugetlb_lock, and the code flow
> > > > is straightforward, so it should be safe. Adding Michal to CC.
> 
> But the lock is not held during the migration, right?

I see what I misunderstood about the hugetlb_lock holding. put_page()
is called after releasing hugetlb_lock in alloc_surplus_huge_page();
I mistakenly thought put_page() was called inside hugetlb_lock. Yes,
there's obviously a race window, and the temporary page way is an
effective way to not mess up the surplus_huge_pages accounting.
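
To spell out the window being discussed, roughly (an editorial illustration,
not from the thread):

  alloc_surplus_huge_page()                  another task resizing the pool
  -----------------------------------        ------------------------------
  spin_lock(&hugetlb_lock)
  surplus < overcommit, proceed
  spin_unlock(&hugetlb_lock)
  allocate the new huge page                 spin_lock(&hugetlb_lock)
                                             pool/overcommit targets change
                                             spin_unlock(&hugetlb_lock)
  spin_lock(&hugetlb_lock)
  surplus_huge_pages >= nr_overcommit_huge_pages
    -> the freshly allocated page is not needed and has to be freed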

> 
> > > I remember when the temporary page code was added for page migration.
> > > The use of temporary page here was added at about the same time.  
> > > Temporary
> > > page does have one advantage in that it will not CAUSE surplus count to
> > > exceed overcommit.  This patch could cause surplus to exceed overcommit
> > > for a very short period of time.  However, do note that for this to happen
> > > the code needs to race with a pool resize which itself could cause surplus
> > > to exceed overcommit.
> 
> Correct.
> 
> > > IMO both approaches are valid.
> > > - Advantage of temporary page is that it can not cause surplus to exceed
> > >   overcommit.  Disadvantage is as mentioned in the comment 'abuse of 
> > > temporary
> > >   page'.
> > > - Advantage of this patch is that it uses existing counters.  Disadvantage
> > >   is that it can momentarily cause surplus to exceed overcommit.
> 
> Do I remember correctly that this can cause an allocation failure due to
> overcommit check? In other words it would be user space visible thing?
> 
> > Yeah, since it's all done inside hugetlb_lock, should be OK even
> > though it may cause surplus to exceed overcommit.
> > > 
> > > Unless someone has a strong opinion, I prefer the changes in this patch.
> > 
> > Agree, I also prefer the code change in this patch, to remove the
> > unnecessary confusion about the temporary page.
> 
> I have managed to forgot all the juicy details since I have made that
> change. All that remains is that the surplus pages accounting was quite
> tricky and back then I didn't figure out a simpler method that would
> achieve the consistent look at those counters. As mentioned above I
> suspect this could lead to pre-mature allocation failures while the
> migration is ongoing. Sure quite unlikely to happen and the race window
> is likely very small. Maybe this is even acceptable but I would strongly
> recommend to have all this thinking documented in the changelog.
> -- 
> Michal Hocko
> SUSE Labs
> 


