Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-18 Thread Andrew Morton
On Wed, 18 Dec 2019 18:08:04 +0100 David Hildenbrand  wrote:

> On 01.12.19 00:21, Andrew Morton wrote:
> > On Sun, 27 Oct 2019 23:45:52 +0100 David Hildenbrand  
> > wrote:
> > 
> >> I think I just found an issue with try_offline_node(). 
> >> try_offline_node() is pretty much broken already (touches garbage 
> >> memmaps and will not consider mixed NIDs within sections), however, 
> >> relies on the node span to look for memory sections to probe. So it 
> >> seems to rely on the nodes getting shrunk when removing memory, not when 
> >> offlining.
> >>
> >> As we shrink the node span when offlining now and not when removing, 
> >> this can go wrong once we offline the last memory block of the node and 
> >> offline the last CPU. We could still have memory around that we could 
> >> re-online, however, the node would already be offline. Unlikely, but 
> >> possible.
> >>
> >> Note that the same is also broken without this patch in case memory is 
> >> never onlined. The "pfn_to_nid(pfn) != nid" can easily succeed on the 
> >> garbage memmap, resulting in no memory being detected as belonging to 
> >> the node. Also, resize_pgdat_range() is called when onlining memory, not 
> >> when adding it. :/ Oh this is so broken :)
> >>
> >> The right fix is probably to walk over all memory blocks that could 
> >> exist and test if they belong to the nid (if offline, check the 
> >> block->nid, if online check all pageblocks). A fix we can then move in 
> >> front of this patch.
> >>
> >> Will look into this this week.
> > 
> > And this series shows almost no sign of having been reviewed.  I'll hold
> > it over for 5.6.
> > 
> 
> Hi Andrew, any chance we can get the (now at least reviewed - thx Oscar)
> fix in patch #5 into 5.5? (I want to do the final stable backports for
> the uninitialized memmap stuff)

Sure, I queued it for the next batch of 5.5 fixes.


Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-18 Thread David Hildenbrand
On 01.12.19 00:21, Andrew Morton wrote:
> On Sun, 27 Oct 2019 23:45:52 +0100 David Hildenbrand  wrote:
> 
>> I think I just found an issue with try_offline_node(). 
>> try_offline_node() is pretty much broken already (touches garbage 
>> memmaps and will not consider mixed NIDs within sections), however, 
>> relies on the node span to look for memory sections to probe. So it 
>> seems to rely on the nodes getting shrunk when removing memory, not when 
>> offlining.
>>
>> As we shrink the node span when offlining now and not when removing, 
>> this can go wrong once we offline the last memory block of the node and 
>> offline the last CPU. We could still have memory around that we could 
>> re-online, however, the node would already be offline. Unlikely, but 
>> possible.
>>
>> Note that the same is also broken without this patch in case memory is 
>> never onlined. The "pfn_to_nid(pfn) != nid" can easily succeed on the 
>> garbage memmap, resulting in no memory being detected as belonging to 
>> the node. Also, resize_pgdat_range() is called when onlining memory, not 
>> when adding it. :/ Oh this is so broken :)
>>
>> The right fix is probably to walk over all memory blocks that could 
>> exist and test if they belong to the nid (if offline, check the 
>> block->nid, if online check all pageblocks). A fix we can then move in 
>> front of this patch.
>>
>> Will look into this this week.
> 
> And this series shows almost no sign of having been reviewed.  I'll hold
> it over for 5.6.
> 

Hi Andrew, any chance we can get the (now at least reviewed - thx Oscar)
fix in patch #5 into 5.5? (I want to do the final stable backports for
the uninitialized memmap stuff)

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-03 Thread David Hildenbrand
On 03.12.19 16:10, Oscar Salvador wrote:
> On Sun, Oct 06, 2019 at 10:56:41AM +0200, David Hildenbrand wrote:
>> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
>> Signed-off-by: David Hildenbrand 
> 
> I did not see anything wrong with the approach taken, and it makes sense to me.
> The only thing that puzzles me is we seem to not balance spanned_pages
> for ZONE_DEVICE anymore.
> memremap_pages() increments them via move_pfn_range_to_zone, but we skip
> ZONE_DEVICE in remove_pfn_range_from_zone.

Yes, documented e.g., in

commit 7ce700bf11b5e2cb84e4352bbdf2123a7a239c84
Author: David Hildenbrand 
Date:   Thu Nov 21 17:53:56 2019 -0800

mm/memory_hotplug: don't access uninitialized memmaps in
shrink_zone_span()

Needs some more thought - but is definitely not urgent (well, now it's
at least no longer completely broken).

> 
> That is not really related to this patch, so I might be missing something,
> but it caught my eye while reviewing this.
> 
> Anyway, for this one:
> 
> Reviewed-by: Oscar Salvador 
> 

Thanks!

> 
> off-topic: I __think__ we really need to trim the CC list.

Yes we should :) - done.

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-03 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:41AM +0200, David Hildenbrand wrote:
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> Signed-off-by: David Hildenbrand 

I did not see anything wrong with the approach taken, and it makes sense to me.
The only thing that puzzles me is we seem to not balance spanned_pages
for ZONE_DEVICE anymore.
memremap_pages() increments them via move_pfn_range_to_zone, but we skip
ZONE_DEVICE in remove_pfn_range_from_zone.
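
To make the imbalance concrete, here is a toy user-space model, not kernel
code. It assumes spanned_pages can be treated as a plain counter (the real
field describes a span), and every toy_* name is hypothetical and only
mirrors the kernel function it is named after:

```c
#include <assert.h>

enum toy_zone_type { TOY_ZONE_NORMAL, TOY_ZONE_DEVICE };

/* Hypothetical stand-in for the span accounting in struct zone. */
struct toy_acct_zone {
	enum toy_zone_type type;
	unsigned long spanned_pages;
};

/* Mirrors the accounting side of move_pfn_range_to_zone(): always grows. */
void toy_move_pfn_range_to_zone(struct toy_acct_zone *zone,
				unsigned long nr_pages)
{
	zone->spanned_pages += nr_pages;
}

/* Mirrors remove_pfn_range_from_zone(): ZONE_DEVICE is skipped. */
void toy_remove_pfn_range_from_zone(struct toy_acct_zone *zone,
				    unsigned long nr_pages)
{
	if (zone->type == TOY_ZONE_DEVICE)
		return;		/* the span is left untouched */

	assert(zone->spanned_pages >= nr_pages);
	zone->spanned_pages -= nr_pages;
}
```

With this model, an add/remove pair leaves a normal zone at zero but leaves
the device zone with the pages still accounted.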

That is not really related to this patch, so I might be missing something,
but it caught my eye while reviewing this.

Anyway, for this one:

Reviewed-by: Oscar Salvador 


off-topic: I __think__ we really need to trim the CC list.

> ---
>  arch/arm64/mm/mmu.c            |  4 +---
>  arch/ia64/mm/init.c            |  4 +---
>  arch/powerpc/mm/mem.c          |  3 +--
>  arch/s390/mm/init.c            |  4 +---
>  arch/sh/mm/init.c              |  4 +---
>  arch/x86/mm/init_32.c          |  4 +---
>  arch/x86/mm/init_64.c          |  4 +---
>  include/linux/memory_hotplug.h |  7 +--
>  mm/memory_hotplug.c            | 31 ---
>  mm/memremap.c                  |  2 +-
>  10 files changed, 29 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 60c929f3683b..d10247fab0fd 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1069,7 +1069,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
>   /*
>* FIXME: Cleanup page tables (also in arch_add_memory() in case
> @@ -1078,7 +1077,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>* unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
>* unlocked yet.
>*/
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index bf9df2625bc8..a6dd80a2c939 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index be941d382c8d..97e5922cb52e 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 
> size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
>   int ret;
>  
> - __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  
>   /* Remove htab bolted mappings for this section of memory */
>   start = (unsigned long)__va(start);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index a124f19f7b3c..c1d96e588152 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -291,10 +291,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>   vmem_remove_mapping(start, size);
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index dfdbaa50946e..d1b1ff2be17a 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = PFN_DOWN(start);
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 930edeb41ec3..0a74407ef92e 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -865,10 +865,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + 

Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-11-30 Thread David Hildenbrand



> On 01.12.2019 at 00:22, Andrew Morton wrote:
> 
> On Sun, 27 Oct 2019 23:45:52 +0100 David Hildenbrand  
> wrote:
> 
>> I think I just found an issue with try_offline_node(). 
>> try_offline_node() is pretty much broken already (touches garbage 
>> memmaps and will not consider mixed NIDs within sections), however, 
>> relies on the node span to look for memory sections to probe. So it 
>> seems to rely on the nodes getting shrunk when removing memory, not when 
>> offlining.
>> 
>> As we shrink the node span when offlining now and not when removing, 
>> this can go wrong once we offline the last memory block of the node and 
>> offline the last CPU. We could still have memory around that we could 
>> re-online, however, the node would already be offline. Unlikely, but 
>> possible.
>> 
>> Note that the same is also broken without this patch in case memory is 
>> never onlined. The "pfn_to_nid(pfn) != nid" can easily succeed on the 
>> garbage memmap, resulting in no memory being detected as belonging to 
>> the node. Also, resize_pgdat_range() is called when onlining memory, not 
>> when adding it. :/ Oh this is so broken :)
>> 
>> The right fix is probably to walk over all memory blocks that could 
>> exist and test if they belong to the nid (if offline, check the 
>> block->nid, if online check all pageblocks). A fix we can then move in 
>> front of this patch.
>> 
>> Will look into this this week.
> 
> And this series shows almost no sign of having been reviewed.  I'll hold
> it over for 5.6.
> 

Makes sense, can't do anything about it. Btw, this one is the last stable patch 
to fix access of uninitialized memmaps that is not upstream yet... so it has to 
remain broken for a while longer.



Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-11-30 Thread Andrew Morton
On Sun, 27 Oct 2019 23:45:52 +0100 David Hildenbrand  wrote:

> I think I just found an issue with try_offline_node(). 
> try_offline_node() is pretty much broken already (touches garbage 
> memmaps and will not consider mixed NIDs within sections), however, 
> relies on the node span to look for memory sections to probe. So it 
> seems to rely on the nodes getting shrunk when removing memory, not when 
> offlining.
> 
> As we shrink the node span when offlining now and not when removing, 
> this can go wrong once we offline the last memory block of the node and 
> offline the last CPU. We could still have memory around that we could 
> re-online, however, the node would already be offline. Unlikely, but 
> possible.
> 
> Note that the same is also broken without this patch in case memory is 
> never onlined. The "pfn_to_nid(pfn) != nid" can easily succeed on the 
> garbage memmap, resulting in no memory being detected as belonging to 
> the node. Also, resize_pgdat_range() is called when onlining memory, not 
> when adding it. :/ Oh this is so broken :)
> 
> The right fix is probably to walk over all memory blocks that could 
> exist and test if they belong to the nid (if offline, check the 
> block->nid, if online check all pageblocks). A fix we can then move in 
> front of this patch.
> 
> Will look into this this week.

And this series shows almost no sign of having been reviewed.  I'll hold
it over for 5.6.
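
For reference, the block walk proposed in the quoted message could look
roughly like the sketch below. This is a user-space illustration only: the
toy_* types and helpers are hypothetical, the per-pageblock nid array stands
in for what the kernel would read from an initialized memmap, and nothing
here is the actual fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGEBLOCKS_PER_BLOCK 4	/* hypothetical block geometry */

/* Hypothetical stand-in for a memory block and its node information. */
struct toy_memory_block {
	bool present;	/* block exists at all */
	bool online;	/* memmap is initialized only when online */
	int nid;	/* node recorded when the block was added */
	int pageblock_nid[TOY_PAGEBLOCKS_PER_BLOCK];
};

/* Offline: trust block->nid. Online: check every pageblock. */
bool toy_block_belongs_to_nid(const struct toy_memory_block *b, int nid)
{
	if (!b->present)
		return false;
	if (!b->online)
		return b->nid == nid;

	for (size_t i = 0; i < TOY_PAGEBLOCKS_PER_BLOCK; i++) {
		if (b->pageblock_nid[i] == nid)
			return true;
	}
	return false;
}

/* Walk all blocks that could exist instead of relying on the node span. */
bool toy_node_has_memory(const struct toy_memory_block *blocks, size_t n,
			 int nid)
{
	assert(blocks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (toy_block_belongs_to_nid(&blocks[i], nid))
			return true;
	}
	return false;
}
```

The point of the sketch is that an offline block is never probed through its
pageblocks, so a garbage memmap cannot influence the result.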



Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-27 Thread David Hildenbrand

On 06.10.19 10:56, David Hildenbrand wrote:

We currently try to shrink a single zone when removing memory. We use the
zone of the first page of the memory we are removing. If that memmap was
never initialized (e.g., memory was never onlined), we will read garbage
and can trigger kernel BUGs (due to a stale pointer):

:/# [   23.912993] BUG: unable to handle page fault for address: 353d
[   23.914219] #PF: supervisor write access in kernel mode
[   23.915199] #PF: error_code(0x0002) - not-present page
[   23.916160] PGD 0 P4D 0
[   23.916627] Oops: 0002 [#1] SMP PTI
[   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
[   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
[   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
[   23.926876] RSP: 0018:ad2400043c98 EFLAGS: 00010246
[   23.927928] RAX:  RBX: 0002 RCX: 
[   23.929458] RDX: 0020 RSI: 0014 RDI: 2f40
[   23.930899] RBP: 00014000 R08:  R09: 0001
[   23.932362] R10:  R11:  R12: 0014
[   23.933603] R13: 0014 R14: 2f40 R15: 9e3e7aff3680
[   23.934913] FS:  () GS:9e3e7bb0() knlGS:
[   23.936294] CS:  0010 DS:  ES:  CR0: 80050033
[   23.937481] CR2: 353d CR3: 5861 CR4: 06e0
[   23.938687] DR0:  DR1:  DR2: 
[   23.939889] DR3:  DR6: fffe0ff0 DR7: 0400
[   23.941168] Call Trace:
[   23.941580]  __remove_pages+0x4b/0x640
[   23.942303]  ? mark_held_locks+0x49/0x70
[   23.943149]  arch_remove_memory+0x63/0x8d
[   23.943921]  try_remove_memory+0xdb/0x130
[   23.944766]  ? walk_memory_blocks+0x7f/0x9e
[   23.945616]  __remove_memory+0xa/0x11
[   23.946274]  acpi_memory_device_remove+0x70/0x100
[   23.947308]  acpi_bus_trim+0x55/0x90
[   23.947914]  acpi_device_hotplug+0x227/0x3a0
[   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
[   23.949433]  process_one_work+0x221/0x550
[   23.950190]  worker_thread+0x50/0x3b0
[   23.950993]  kthread+0x105/0x140
[   23.951644]  ? process_one_work+0x550/0x550
[   23.952508]  ? kthread_park+0x80/0x80
[   23.953367]  ret_from_fork+0x3a/0x50
[   23.954025] Modules linked in:
[   23.954613] CR2: 353d
[   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone() for that. We now properly
shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Andrew Morton 
Cc: Mark Rutland 
Cc: Steve Capper 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Yu Zhao 
Cc: Jun Yao 
Cc: Robin Murphy 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Pavel Tatashin 
Cc: Gerald Schaefer 
Cc: Halil Pasic 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Masahiro Yamada 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Signed-off-by: David Hildenbrand 
---
 arch/arm64/mm/mmu.c            |  4 +---
 arch/ia64/mm/init.c            |  4 +---
 arch/powerpc/mm/mem.c          |  3 +--
 arch/s390/mm/init.c            |  4 +---
 arch/sh/mm/init.c              |  4 +---
 arch/x86/mm/init_32.c          |  4 +---
 arch/x86/mm/init_64.c          |  4 +---
 include/linux/memory_hotplug.h |  7 +--
 mm/memory_hotplug.c            | 31 ---
 mm/memremap.c                  |  2 +-
 10 files changed, 29 insertions(+), 38 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f3683b..d10247fab0fd 100644
--- 

Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-14 Thread Andrew Morton
On Mon, 14 Oct 2019 11:39:13 +0200 David Hildenbrand  wrote:

> > Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> 
> @Andrew, can you convert that to
> 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
> zones until online") # visible after d0dc12e86b319

Done.

> While cc'ing sta...@vger.kernel.org # v4.13+ would be nice,
> I doubt it will be easily possible to backport, as we are missing
> some prereq patches (e.g., from Oscar like 2c2a5af6fed2 ("mm,
> memory_hotplug: add nid parameter to arch_remove_memory")). But, it could
> be done with some work.
> 
> I think "Cc: sta...@vger.kernel.org # v5.0+" could be done more
> easily. Maybe it's okay to not cc:stable this one. We usually
> online all memory (except s390x), however, s390x does not remove that
> memory ever. Devmem with driver reserved memory would be, however,
> worth backporting this.

I added 

Cc: <sta...@vger.kernel.org> [5.0+]


Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-14 Thread David Hildenbrand
On 06.10.19 10:56, David Hildenbrand wrote:
> We currently try to shrink a single zone when removing memory. We use the
> zone of the first page of the memory we are removing. If that memmap was
> never initialized (e.g., memory was never onlined), we will read garbage
> and can trigger kernel BUGs (due to a stale pointer):
> 
> :/# [   23.912993] BUG: unable to handle page fault for address: 353d
> [   23.914219] #PF: supervisor write access in kernel mode
> [   23.915199] #PF: error_code(0x0002) - not-present page
> [   23.916160] PGD 0 P4D 0
> [   23.916627] Oops: 0002 [#1] SMP PTI
> [   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
> [   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
> [   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
> [   23.926876] RSP: 0018:ad2400043c98 EFLAGS: 00010246
> [   23.927928] RAX:  RBX: 0002 RCX: 
> [   23.929458] RDX: 0020 RSI: 0014 RDI: 2f40
> [   23.930899] RBP: 00014000 R08:  R09: 0001
> [   23.932362] R10:  R11:  R12: 0014
> [   23.933603] R13: 0014 R14: 2f40 R15: 9e3e7aff3680
> [   23.934913] FS:  () GS:9e3e7bb0() knlGS:
> [   23.936294] CS:  0010 DS:  ES:  CR0: 80050033
> [   23.937481] CR2: 353d CR3: 5861 CR4: 06e0
> [   23.938687] DR0:  DR1:  DR2: 
> [   23.939889] DR3:  DR6: fffe0ff0 DR7: 0400
> [   23.941168] Call Trace:
> [   23.941580]  __remove_pages+0x4b/0x640
> [   23.942303]  ? mark_held_locks+0x49/0x70
> [   23.943149]  arch_remove_memory+0x63/0x8d
> [   23.943921]  try_remove_memory+0xdb/0x130
> [   23.944766]  ? walk_memory_blocks+0x7f/0x9e
> [   23.945616]  __remove_memory+0xa/0x11
> [   23.946274]  acpi_memory_device_remove+0x70/0x100
> [   23.947308]  acpi_bus_trim+0x55/0x90
> [   23.947914]  acpi_device_hotplug+0x227/0x3a0
> [   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
> [   23.949433]  process_one_work+0x221/0x550
> [   23.950190]  worker_thread+0x50/0x3b0
> [   23.950993]  kthread+0x105/0x140
> [   23.951644]  ? process_one_work+0x550/0x550
> [   23.952508]  ? kthread_park+0x80/0x80
> [   23.953367]  ret_from_fork+0x3a/0x50
> [   23.954025] Modules linked in:
> [   23.954613] CR2: 353d
> [   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---
> 
> Instead, shrink the zones when offlining memory or when onlining failed.
> Introduce and use remove_pfn_range_from_zone() for that. We now properly
> shrink the zones, even if we have DIMMs whereby
> - Some memory blocks fall into no zone (never onlined)
> - Some memory blocks fall into multiple zones (offlined+re-onlined)
> - Multiple memory blocks that fall into different zones
> 
> Drop the zone parameter (with a potential dubious value) from
> __remove_pages() and __remove_section().
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Andrew Morton 
> Cc: Mark Rutland 
> Cc: Steve Capper 
> Cc: Mike Rapoport 
> Cc: Anshuman Khandual 
> Cc: Yu Zhao 
> Cc: Jun Yao 
> Cc: Robin Murphy 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: Pavel Tatashin 
> Cc: Gerald Schaefer 
> Cc: Halil Pasic 
> Cc: Tom Lendacky 
> Cc: Greg Kroah-Hartman 
> Cc: Masahiro Yamada 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Cc: Qian Cai 
> Cc: Jason Gunthorpe 
> Cc: Logan Gunthorpe 
> Cc: Ira Weiny 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")

@Andrew, can you convert that to

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
zones until online") # visible after d0dc12e86b319

While cc'ing sta...@vger.kernel.org # v4.13+ would be nice,
I doubt it will be easily possible to backport, as we are missing
some prereq patches (e.g., from Oscar like 2c2a5af6fed2 ("mm,
memory_hotplug: add nid parameter to 

[PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-06 Thread David Hildenbrand
We currently try to shrink a single zone when removing memory. We use the
zone of the first page of the memory we are removing. If that memmap was
never initialized (e.g., memory was never onlined), we will read garbage
and can trigger kernel BUGs (due to a stale pointer):

:/# [   23.912993] BUG: unable to handle page fault for address: 353d
[   23.914219] #PF: supervisor write access in kernel mode
[   23.915199] #PF: error_code(0x0002) - not-present page
[   23.916160] PGD 0 P4D 0
[   23.916627] Oops: 0002 [#1] SMP PTI
[   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
[   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
[   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
[   23.926876] RSP: 0018:ad2400043c98 EFLAGS: 00010246
[   23.927928] RAX:  RBX: 0002 RCX: 
[   23.929458] RDX: 0020 RSI: 0014 RDI: 2f40
[   23.930899] RBP: 00014000 R08:  R09: 0001
[   23.932362] R10:  R11:  R12: 0014
[   23.933603] R13: 0014 R14: 2f40 R15: 9e3e7aff3680
[   23.934913] FS:  () GS:9e3e7bb0() knlGS:
[   23.936294] CS:  0010 DS:  ES:  CR0: 80050033
[   23.937481] CR2: 353d CR3: 5861 CR4: 06e0
[   23.938687] DR0:  DR1:  DR2: 
[   23.939889] DR3:  DR6: fffe0ff0 DR7: 0400
[   23.941168] Call Trace:
[   23.941580]  __remove_pages+0x4b/0x640
[   23.942303]  ? mark_held_locks+0x49/0x70
[   23.943149]  arch_remove_memory+0x63/0x8d
[   23.943921]  try_remove_memory+0xdb/0x130
[   23.944766]  ? walk_memory_blocks+0x7f/0x9e
[   23.945616]  __remove_memory+0xa/0x11
[   23.946274]  acpi_memory_device_remove+0x70/0x100
[   23.947308]  acpi_bus_trim+0x55/0x90
[   23.947914]  acpi_device_hotplug+0x227/0x3a0
[   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
[   23.949433]  process_one_work+0x221/0x550
[   23.950190]  worker_thread+0x50/0x3b0
[   23.950993]  kthread+0x105/0x140
[   23.951644]  ? process_one_work+0x550/0x550
[   23.952508]  ? kthread_park+0x80/0x80
[   23.953367]  ret_from_fork+0x3a/0x50
[   23.954025] Modules linked in:
[   23.954613] CR2: 353d
[   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone() for that. We now properly
shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().
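
To illustrate the intended behaviour, here is a minimal user-space sketch,
not kernel code. All toy_* names, TOY_NR_BLOCKS and TOY_PAGES_PER_BLOCK are
made up for illustration, and the span is simply recomputed from an online
bitmap, which is an assumption and not how the kernel tracks it. The point is
that the zone span shrinks when a block goes offline, so removing memory
never has to look up a zone via a possibly uninitialized memmap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BLOCKS		8	/* hypothetical: tiny address space */
#define TOY_PAGES_PER_BLOCK	4UL	/* hypothetical block size in pages */

struct toy_span_zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

struct toy_mem {
	bool online[TOY_NR_BLOCKS];	/* which blocks sit in the zone */
	struct toy_span_zone zone;
};

/* Recompute the zone span from the blocks that are currently online. */
static void toy_recalc_span(struct toy_mem *m)
{
	size_t first = TOY_NR_BLOCKS, last = 0;

	for (size_t i = 0; i < TOY_NR_BLOCKS; i++) {
		if (!m->online[i])
			continue;
		if (first == TOY_NR_BLOCKS)
			first = i;
		last = i;
	}

	if (first == TOY_NR_BLOCKS) {	/* nothing online: empty zone */
		m->zone.start_pfn = 0;
		m->zone.spanned_pages = 0;
		return;
	}
	m->zone.start_pfn = first * TOY_PAGES_PER_BLOCK;
	m->zone.spanned_pages = (last - first + 1) * TOY_PAGES_PER_BLOCK;
}

void toy_online_block(struct toy_mem *m, size_t block)
{
	assert(block < TOY_NR_BLOCKS);
	m->online[block] = true;
	toy_recalc_span(m);
}

/* The zone is shrunk here, while the block's state is still well defined. */
void toy_offline_block(struct toy_mem *m, size_t block)
{
	assert(block < TOY_NR_BLOCKS);
	m->online[block] = false;
	toy_recalc_span(m);
}

/* Removal needs no zone at all; the block must already be offline. */
void toy_remove_block(struct toy_mem *m, size_t block)
{
	assert(block < TOY_NR_BLOCKS);
	assert(!m->online[block]);
}
```

In this model a block that was never onlined can be removed without the zone
being consulted, which is the case that triggered the oops above.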

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Andrew Morton 
Cc: Mark Rutland 
Cc: Steve Capper 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Yu Zhao 
Cc: Jun Yao 
Cc: Robin Murphy 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Pavel Tatashin 
Cc: Gerald Schaefer 
Cc: Halil Pasic 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Masahiro Yamada 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Signed-off-by: David Hildenbrand 
---
 arch/arm64/mm/mmu.c            |  4 +---
 arch/ia64/mm/init.c            |  4 +---
 arch/powerpc/mm/mem.c          |  3 +--
 arch/s390/mm/init.c            |  4 +---
 arch/sh/mm/init.c              |  4 +---
 arch/x86/mm/init_32.c          |  4 +---
 arch/x86/mm/init_64.c          |  4 +---
 include/linux/memory_hotplug.h |  7 +--
 mm/memory_hotplug.c            | 31 ---
 mm/memremap.c                  |  2 +-
 10 files changed, 29 insertions(+), 38 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f3683b..d10247fab0fd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1069,7