Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-22 Thread Oscar Salvador
On Wed, May 22, 2024 at 05:46:09PM -0400, Peter Xu wrote:
> > Now, ProcessB still has the page mapped, so upon re-accessing it,
> > it will trigger a new MCE event. memory-failure code will see that this
> 
> The question is why accessing that hwpoison entry from ProcB triggers an
> MCE.  Logically that's a swap entry and it should generate a page fault
> rather than MCE.  Then in the pgfault handler we don't need that encoded
> pfn as we have vmf->address.

It would be a swap entry if we reach try_to_unmap_one() without trouble.
Then we have the code that converts it:

 ...
 if (PageHWPoison(p))
         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 set_{huge_}pte_at(...)
 ...

But maybe we only managed to do that for ProcA, while ProcB failed, which
means that for ProcA that is a hwpoisoned swap entry, but ProcB still has
the page mapped as usual. So if ProcB re-accesses it, that will not trigger
a fault (because the page is still mapped in its pagetables).


-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()

2024-05-21 Thread Oscar Salvador
On Mon, May 20, 2024 at 04:24:51PM +, Christophe Leroy wrote:
> I had a quick look at that document and it seems to provide a good 
> summary of MMU features and principles. However there are some 
> theoretical information which is not fully right in practice. For 
> instance when they say "Segment attributes. These fields define 
> attributes common to all pages in this segment.". This is right in 
> theory if you consider it from Linux page table topology point of view, 
> hence what they call a segment is a PMD entry for Linux. However, in 
> practice each page has its own L1 and L2 attributes and there is not 
> requirement at HW level to have all L1 attributes of all pages of a 
> segment the same.

Thanks for taking the time Christophe, highly appreciated.

 
> rlwimi = Rotate Left Word Immediate then Mask Insert. Here it rotates 
> r10 by 23 bits to the left (or 9 to the right) then masks with 
> _PMD_PAGE_512K and inserts it into r11.
> 
> It means _PAGE_HUGE bit is copied into lower bit of PS attribute.
> 
> PS takes the following values:
> 
> PS = 00 ==> Small page (4k or 16k)
> PS = 01 ==> 512k page
> PS = 10 ==> Undefined
> PS = 11 ==> 8M page

I see, thanks for the explanation.

> That's a RFC, all ideas are welcome, I needed something to replace 
> hugepd_populate()

The only user interested in pmd_populate() having a sz parameter
is 8xx because it will toggle _PMD_PAGE_8M in case of a 8MB mapping.

Would it be possible for 8xx to encode the 'sz' in the *pmd pointer
prior to calling down the chain? (something like what we do with PTR_ERR()).
Then pmd_populate_{kernel_}size() on 8xx would extract it like:

 unsigned long sz = PTR_SIZE(pmd)

Then we would not need all these 'sz' parameters scattered.
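
For illustration, a minimal sketch of what I have in mind, assuming pmd
pointers are at least 4-byte aligned so the low bits are free to use.
PTR_WITH_SIZE()/PTR_SIZE()/PTR_STRIP_SIZE() are made-up names, not existing
kernel API:

 /* Hypothetical helpers, PTR_ERR()-style, to stash the size in the pointer: */
 #define PMD_PTR_SZ_MASK	0x3UL	/* low bits free on 4-byte aligned pointers */

 static inline pmd_t *PTR_WITH_SIZE(pmd_t *pmdp, unsigned long sz)
 {
 	/* Encode "this is an 8M mapping" in bit 0 of the pointer. */
 	return (pmd_t *)((unsigned long)pmdp | (sz == SZ_8M ? 1UL : 0UL));
 }

 static inline unsigned long PTR_SIZE(pmd_t *pmdp)
 {
 	return ((unsigned long)pmdp & PMD_PTR_SZ_MASK) ? SZ_8M : PAGE_SIZE;
 }

 static inline pmd_t *PTR_STRIP_SIZE(pmd_t *pmdp)
 {
 	/* Recover the real pointer before dereferencing it. */
 	return (pmd_t *)((unsigned long)pmdp & ~PMD_PTR_SZ_MASK);
 }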

Can that work?


PS: Do you know a way to emulate an 8xx VM? qemu does not seem to have
support for it.

Thanks


-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size()

2024-05-21 Thread Oscar Salvador
On Fri, May 17, 2024 at 08:59:57PM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M size, the information is in the PMD
> entry. So provide it to pte_leaf_size().
> 
> Signed-off-by: Christophe Leroy 

Overall looks good to me.

It would be nicer if we could leave the arch code untouched.
I wanted to see how this would look if we go down that road and focus only
on 8xx, at the risk of being more esoteric.
pmd_pte_leaf_size() is a hell of a name, but it could be replaced
with __pte_leaf_size, for example.

Worth it? Maybe not, anyway, just wanted to give it a go:


 diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
 index 137dc3c84e45..9e3fe6e1083f 100644
 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
 +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
 @@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd)
 
  #define pgd_leaf_size pgd_leaf_size
 
 -static inline unsigned long pte_leaf_size(pte_t pte)
 +static inline unsigned long pmd_pte_leaf_size(pmd_t pmd, pte_t pte)
  {
 pte_basic_t val = pte_val(pte);
 
 @@ -162,7 +162,7 @@ static inline unsigned long pte_leaf_size(pte_t pte)
 return SZ_4K;
  }
 
 -#define pte_leaf_size pte_leaf_size
 +#define pmd_pte_leaf_size pmd_pte_leaf_size
 
  /*
   * On the 8xx, the page tables are a bit special. For 16k pages, we have
 diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
 index 18019f037bae..2bc2fe3b2b53 100644
 --- a/include/linux/pgtable.h
 +++ b/include/linux/pgtable.h
 @@ -1891,6 +1891,9 @@ typedef unsigned int pgtbl_mod_mask;
  #ifndef pte_leaf_size
  #define pte_leaf_size(x) PAGE_SIZE
  #endif
 +#ifndef pmd_pte_leaf_size
 +#define pmd_pte_leaf_size(x, y) pte_leaf_size(y)
 +#endif
 
  /*
   * We always define pmd_pfn for all archs as it's used in lots of generic
 diff --git a/kernel/events/core.c b/kernel/events/core.c
 index f0128c5ff278..e90a547d2fb2 100644
 --- a/kernel/events/core.c
 +++ b/kernel/events/core.c
 @@ -7596,7 +7596,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 
 pte = ptep_get_lockless(ptep);
 if (pte_present(pte))
 -   size = pte_leaf_size(pte);
 +   size = pmd_pte_leaf_size(pmd, pte);
 pte_unmap(ptep);
  #endif /* CONFIG_HAVE_GUP_FAST */

 

-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()

2024-05-21 Thread Oscar Salvador
On Tue, May 21, 2024 at 10:48:21AM +1000, Michael Ellerman wrote:
> Yeah I can. Does it actually cause a bug at runtime (I assume so)?

No, currently set_huge_pte_at() on 8xx ignores the 'sz' parameter.
But it will be used after this series.

-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()

2024-05-20 Thread Oscar Salvador
On Mon, May 20, 2024 at 04:31:39PM +, Christophe Leroy wrote:
> Hi Oscar, hi Michael,
> 
> Le 20/05/2024 à 11:14, Oscar Salvador a écrit :
> > On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
> >> set_huge_pte_at() expects the real page size, not the psize which is
> > 
> > "expects the size of the huge page" sounds bettter?
> 
> Parameter 'psize' already provides the size of the hugepage, but not in 
> the way set_huge_pte_at() expects it.
> 
> psize has one of the values defined by MMU_PAGE_XXX macros defined in 
> arch/powerpc/include/asm/mmu.h while set_huge_pte_at() expects the size 
> as a value.

Yes, psize is an index, which is not a size by itself but is used to look up
mmu_psize_defs[].shift to get the actual size, I guess.
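As a quick illustration (assuming the 8xx definitions, where IIRC
MMU_PAGE_512K has a .shift of 19):

 /* psize is an index into mmu_psize_defs[], not a byte count: */
 unsigned long sz = 1UL << mmu_psize_to_shift(MMU_PAGE_512K);
 /* sz == 0x80000 (512K), which is what set_huge_pte_at() actually wants */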
This is why I thought that being explicit about "expects the size of the
huge page" was better.

But no strong feelings here.


-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()

2024-05-20 Thread Oscar Salvador
On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
> set_huge_pte_at() expects the real page size, not the psize which is

"expects the size of the huge page" sounds bettter? 

> the index of the page definition in table mmu_psize_defs[]
> 
> Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to 
> set_huge_pte_at()")
> Signed-off-by: Christophe Leroy 

Reviewed-by: Oscar Salvador 

AFAICS, this fixup is not related to the series, right? (yes, you will use
the parameter later)
I would have it at the very beginning of the series.


> ---
>  arch/powerpc/mm/nohash/8xx.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
> index 43d4842bb1c7..d93433e26ded 100644
> --- a/arch/powerpc/mm/nohash/8xx.c
> +++ b/arch/powerpc/mm/nohash/8xx.c
> @@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
>   return -EINVAL;
>  
>   set_huge_pte_at(_mm, va, ptep,
> - pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
> + pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
> +     1UL << mmu_psize_to_shift(psize));
>  
>   return 0;
>  }
> -- 
> 2.44.0
> 

-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()

2024-05-20 Thread Oscar Salvador
>   NULL: pte_offset_kernel(pmd, address))
>  
>  #endif /* _LINUX_PGALLOC_TRACK_H */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 3c3539c573e7..0f129d5c5aa2 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -764,7 +764,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>   break;
>   }
>   if (unlikely(pmd_none(dst_pmdval)) &&
> - unlikely(__pte_alloc(dst_mm, dst_pmd))) {
> + unlikely(__pte_alloc(dst_mm, dst_pmd, PAGE_SIZE))) {
>   err = -ENOMEM;
>   break;
>   }
> @@ -1687,7 +1687,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
>   err = -ENOENT;
>   break;
>   }
> - if (unlikely(__pte_alloc(mm, src_pmd))) {
> + if (unlikely(__pte_alloc(mm, src_pmd, PAGE_SIZE))) {
>   err = -ENOMEM;
>   break;
>   }
> -- 
> 2.44.0
> 

-- 
Oscar Salvador
SUSE Labs


Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-05-17 Thread Oscar Salvador
On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> This series reimplements hugepages without hugepd on powerpc 8xx.
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
> 
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> a lower size. I hope this can be handled by core-mm in the future.
> 
> There are probably several ways to implement stuff, so feedback is
> very welcome.


Hi Christophe,

I have been looking into this because I am interested in the ongoing work of
the hugetlb unification, but my knowledge of ppc pagetables tends to zero,
so be prepared for some stupid questions.

First, let me have a clear picture of the current situation:

power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operates on 2-level
pagetables. The wiki [1] mentions PGD + PTE, but here you seem to be
referring to them as PMD + PTE.

And we can have 1024 PGDs, each of which covers 4MB, so we can cover a total
of 4GB.

Looking at the page table diagram for power8xx, it seems power8xx also has
some sort of CONTIG_PTE (same as arm64 does), so we can have contig_ptes
representing bigger page sizes?
I also guess that although power8xx supports all these different sizes, only
one of them can be active at any time, right?

It also seems that this whole hugepd thing is only used when we are using 8MB
hugepages, right?
And when that is active, we do have 2 PGD entries (4MB each) pointing to the
same 8MB hugepd.
E.g:

 [PGD#0] ------> |------------|
                 | 8MB hugepd |
 [PGD#1] ------> |------------|

What you want to do with this work is to get rid of that hugepd abstraction
because it is something power8xx/hugetlb specific and cannot be represented
with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE).
I did not check, but I guess we cannot walk the hugepd thing with a normal
page table walker, or can we? (how special is a hugepd? can you describe its
internal layout?)

So, what you propose is something like the following?

 [PGD#X] ---> [PTE#0]
         ---> [PTE..#1023]
 [PGD#Y] ---> [PTE#0]
         ---> [PTE..#1023]

so an 8MB hugepage will be covered by PGD#X and PGD#Y using contiguous PTEs.
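
If I got that right, the replication could look roughly like the sketch
below (a hypothetical helper just to check my understanding, not code from
the series):

 /*
  * Hypothetical sketch: write an 8M mapping as 1024 identical entries in
  * one page table; this would be done for each of the two page tables
  * referenced by the two PMD entries (each table covers 4M).
  */
 static void replicate_8m_pte(pte_t *ptep, pte_t pte)
 {
 	int i;

 	for (i = 0; i < PTRS_PER_PTE; i++)	/* 1024 entries on 8xx/4K */
 		ptep[i] = pte;
 }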

The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is
just me). It says that a Level-2 table (aka PTE) covers 4KB chunks regardless
of the pagesize, but either I read that wrong or something else is going on,
because for 16K page size it shows each pte covering a 16KB memory chunk.
That would mean 16KB << 10 = 16M per PGD entry, which is not the case, so
what is the deal there? Or is it that we always use 4KB PTEs, and use
contiguous PTEs for bigger sizes?

Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right?

Peter mentioned that we should have something like:

                          X       X
[PGD] - [P4D] - [PUD] - [PMD] - [PTE]

where the PMD and PTE would be the ones we use for representing the 2Level
ptage table, and PGD,P4D and PUD would just be dummies.

But, is not the convention to at least have PGD-PTE always, and have anything
in between as optional? E.g:

  X       ?       ?       ?       X
[PGD] - [P4D] - [PUD] - [PMD] - [PTE]

I mean, are page table walkers ready to deal with non-PGD? I thought they were
not.

Also, in patch#1, you mentioned:

"At the time being, for 512k pages the flag is kept in the PTE and inserted in
the PMD entry at TLB miss exception".

Can you point out where to look for that in the code?

Also, what exactly is the "sz" parameter that gets passed down to 
pmd_populate_size()?
Is the size of the current mapping we are establishing?
I see that you only make a distinction when the mapping size is 8MB.
So the PMD will have _PMD_PAGE_8MB, so it works that all 1024 PTEs below are 
contiguous
representing a 4MB chunk?

I will start looking deeper into this series on Monday, but just wanted to
have a better insight of what is going on.

PS: I think we could make the changelog of the cover letter a bit fatter and
cover some details in there, e.g: layout of page-tables for different page
sizes, layout of hugepd, expected layout after the work, etc.
I think it would help in reviewing this series.

Thanks!

[1] https://github.com/linuxppc/wiki/wiki/Huge-pages


-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-15 Thread Oscar Salvador
On Wed, May 15, 2024 at 12:41:42PM +0200, Borislav Petkov wrote:
> On Fri, May 10, 2024 at 11:29:26AM -0700, Axel Rasmussen wrote:
> > @@ -3938,7 +3938,7 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >  
> > /* Higher priority than uffd-wp when data corrupted */
> > if (marker & PTE_MARKER_POISONED)
> > -   return VM_FAULT_HWPOISON;
> > +   return VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_SILENT;
> 
> If you know here that this poisoning should be silent, why do you have
> to make it all complicated and propagate it into arch code, waste
> a separate VM_FAULT flag just for that instead of simply returning here
> a VM_FAULT_COMPLETED or some other innocuous value which would stop
> processing the fault?

AFAIK, he only wants it to be silent wrt. the arch fault handler not
screaming, but he still wants to be able to trigger force_sig_mceerr().


-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-15 Thread Oscar Salvador
On Tue, May 14, 2024 at 03:34:24PM -0600, Peter Xu wrote:
> The question is whether we can't.
> 
> Now we reserved a swp entry just for hwpoison and it makes sense only
> because we cached the poisoned pfn inside.  My long standing question is
> why do we ever need that pfn after all.  If we don't need the pfn, we
> simply need a bit in the pgtable entry saying that it's poisoned, if
> accessed we should kill the process using sigbus.
> 
> I used to comment on this before, the only path that uses that pfn is
> check_hwpoisoned_entry(), which was introduced in:
> 
> commit a3f5d80ea401ac857f2910e28b15f35b2cf902f4
> Author: Naoya Horiguchi 
> Date:   Mon Jun 28 19:43:14 2021 -0700
> 
> mm,hwpoison: send SIGBUS with error virutal address
> 
> Now an action required MCE in already hwpoisoned address surely sends a
> SIGBUS to current process, but the SIGBUS doesn't convey error virtual
> address.  That's not optimal for hwpoison-aware applications.
> 
> To fix the issue, make memory_failure() call kill_accessing_process(),
> that does pagetable walk to find the error virtual address.  It could find
> multiple virtual addresses for the same error page, and it seems hard to
> tell which virtual address is correct one.  But that's rare and sending
> incorrect virtual address could be better than no address.  So let's
> report the first found virtual address for now.
> 
> So this time I read more on this and Naoya explained why - it's only used
> so far to dump the VA of the poisoned entry.

Well, not really dumped, but we just pass the VA down the chain to the
signal handler.

But on the question whether we need the pfn encoded, I am not sure
actually.
check_hwpoisoned_entry() checks whether the pfn the pte (or hwpoison swap
entry) points at is the same as the hwpoisoned one, so we know if the
process has to be killed.
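
For reference, this is roughly what check_hwpoisoned_entry() does in
mm/memory-failure.c (paraphrased from recent kernels, so details may differ
per version):

 static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 				  unsigned long poisoned_pfn, struct to_kill *tk)
 {
 	unsigned long pfn = 0;

 	if (pte_present(pte)) {
 		pfn = pte_pfn(pte);
 	} else {
 		swp_entry_t swp = pte_to_swp_entry(pte);

 		/* This is where the pfn encoded in the hwpoison entry is used. */
 		if (is_hwpoison_entry(swp))
 			pfn = swp_offset_pfn(swp);
 	}

 	/* Only kill if this pte actually maps the poisoned pfn. */
 	if (!pfn || pfn != poisoned_pfn)
 		return 0;

 	set_to_kill(tk, addr, shift);
 	return 1;
 }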

Now, could we get away with using pte_page() instead of pte_pfn() in there,
and passing the hwpoisoned page instead of the pfn?
I am trying to think of the implications, but then we should not need the
encoded pfn?


> However what confused me is, if an entry is poisoned already logically we
> dump that message in the fault handler not memory_failure(), which is:
> 
> MCE: Killing uffd-unit-tests:650 due to hardware memory corruption fault at 7f3589d7e000
> 
> So perhaps we're trying to also dump that when the MCEs (points to the same
> pfn) are only generated concurrently?  I donno much on hwpoison so I cannot
> tell, there's also implication where it's only triggered if
> MF_ACTION_REQUIRED.  But I think it means hwpoison may work without pfn
> encoded, but I don't know the implication to lose that dmesg line.

Not necessarily concurrently, but simply if for any reason the page could
not be unmapped.
Let us say you have ProcessA and ProcessB mapping the same page.
We get an MCE for that page but we could not unmap it, at least not
from all processes (maybe just from ProcessA).

memory-failure code will mark it as hwpoison; now ProcessA faults it in
and gets killed because we see that the page is hwpoisoned in the fault
path, so we still send VM_FAULT_HWPOISON all the way down till you see
the:

"MCE: Killing uffd-unit-tests:650 due to hardware memory corruption
fault at 7f3589d7e000" from e.g: arch/x86/mm/fault.c:do_sigbus()

Now, ProcessB still has the page mapped, so upon re-accessing it,
it will trigger a new MCE event. memory-failure code will see that this
page has already been marked as hwpoisoned, but since we failed to
unmap it (otherwise no one should be re-accessing it), it goes: "ok,
let's just kill everybody who has this page mapped".


> We used to not dump error for swapin error.  Note that here what I am
> saying is not that Axel is doing things wrong, but it's just that logically
> swapin error (as pte marker) can also be with !QUIET, so my final point is
> we may want to avoid having the assumption that "pte marker should always
> be QUITE", because I want to make it clear that pte marker can used in any
> form, so itself shouldn't imply anything..

I think it would make more sense if we had a separate marker for swapin
errors?
I mean, deep down, they do not mean the same as poison, right?

Then you can choose which events get to be silent because you do not
care, and which ones need to scream loud.
I think swapin errors belong to the latter. At least a heads-up on why a
process is getting killed is appreciated, IMHO.
 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-14 Thread Oscar Salvador
On Fri, May 10, 2024 at 03:29:48PM -0400, Peter Xu wrote:
> IMHO we shouldn't mention that detail, but only state the effect which is
> to not report the event to syslog.
> 
> There's no hard rule that a pte marker can't reflect a real page poison in
> the future even MCE.  Actually I still remember most places don't care
> about the pfn in the hwpoison swap entry so maybe we can even do it? But
> that's another story regardless..

But we should not use pte markers for real hwpoison events (aka MCE), right?
I mean, we do have the means to mark a page as hwpoisoned when a real
MCE gets triggered, so why would we want a pte marker to also reflect that?
Or is that something for the userfaultfd realm?

> And also not report swapin error is, IMHO, only because arch errors said
> "MCE" in the error logs which may not apply here.  Logically speaking
> swapin error should also be reported so admin knows better on why a proc is
> killed.  Now it can still confuse the admin if it really happens, iiuc.

I am a bit confused by this.
It seems we create poisoned pte markers on swap errors (e.g:
unuse_pte()), which get passed down the chain as VM_FAULT_HWPOISON,
which ends up in a sigbus (I guess?).

This all seems very subtle to me.

First of all, why not pass VM_FAULT_SIGBUS if that is what will end
up happening?
I mean, at the moment that is not possible because we conflate swapin
errors and uffd poison in the same type of marker, so we do not have any
means to differentiate between the two of them.

Would it make sense to create yet another pte marker type to split that
up? Because when I look at VM_FAULT_HWPOISON, I get reminded of MCE
stuff, and that does not hold here.
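
Something along these lines is what I have in mind; PTE_MARKER_SWAPIN_ERROR
is a made-up name here, just to sketch the split:

 /* Existing markers in include/linux/swapops.h: */
 #define PTE_MARKER_UFFD_WP		BIT(0)
 #define PTE_MARKER_POISONED		BIT(1)
 /* Hypothetical new marker for swap read errors: */
 #define PTE_MARKER_SWAPIN_ERROR	BIT(2)

 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 {
 	/* ... fetch the marker from the non-present pte as today ... */

 	/* A swapin error is not an MCE: plain SIGBUS, no arch "MCE:" log. */
 	if (marker & PTE_MARKER_SWAPIN_ERROR)
 		return VM_FAULT_SIGBUS;

 	/* Higher priority than uffd-wp when data corrupted */
 	if (marker & PTE_MARKER_POISONED)
 		return VM_FAULT_HWPOISON;

 	/* ... uffd-wp handling as today ... */
 }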

 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v3] powerpc/memhotplug: Add add_pages override for PPC

2022-06-29 Thread Oscar Salvador
On Wed, Jun 29, 2022 at 10:39:25AM +0530, Aneesh Kumar K.V wrote:
> With commit ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 
> 32-bit")
> the kernel now validates the addr against the high_memory value. This results
> in the below BUG_ON with dax pfns.
> 
> [  635.798741][T26531] kernel BUG at mm/page_alloc.c:5521!
> 1:mon> e
> cpu 0x1: Vector: 700 (Program Check) at [c7287630]
> pc: c055ed48: free_pages.part.0+0x48/0x110
> lr: c053ca70: tlb_finish_mmu+0x80/0xd0
> sp: c72878d0
>msr: 8282b033
>   current = 0xcafabe00
>   paca= 0xc0037300   irqmask: 0x03   irq_happened: 0x05
> pid   = 26531, comm = 50-landscape-sy
> kernel BUG at :5521!
> Linux version 5.19.0-rc3-14659-g4ec05be7c2e1 (kvaneesh@ltc-boston8) (gcc 
> (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) 
> #625 SMP Thu Jun 23 00:35:43 CDT 2022
> 1:mon> t
> [link register   ] c053ca70 tlb_finish_mmu+0x80/0xd0
> [c72878d0] c053ca54 tlb_finish_mmu+0x64/0xd0 (unreliable)
> [c7287900] c0539424 exit_mmap+0xe4/0x2a0
> [c72879e0] c019fc1c mmput+0xcc/0x210
> [c7287a20] c0629230 begin_new_exec+0x5e0/0xf40
> [c7287ae0] c070b3cc load_elf_binary+0x3ac/0x1e00
> [c7287c10] c0627af0 bprm_execve+0x3b0/0xaf0
> [c7287cd0] c0628414 do_execveat_common.isra.0+0x1e4/0x310
> [c7287d80] c062858c sys_execve+0x4c/0x60
> [c7287db0] c002c1b0 system_call_exception+0x160/0x2c0
> [c7287e10] c000c53c system_call_common+0xec/0x250
> 
> The fix is to make sure we update high_memory on memory hotplug.
> This is similar to what x86 does in commit 3072e413e305 ("mm/memory_hotplug: 
> introduce add_pages")
> 
> Fixes: ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 
> 32-bit")
> Cc: Kefeng Wang 
> Cc: Christophe Leroy 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Oscar Salvador 

> ---
> Changes from v2:
> * drop WARN_ON_ONCE
> * check for error from __add_pages
> 
>  arch/powerpc/Kconfig  |  4 
>  arch/powerpc/mm/mem.c | 33 -
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index c2ce2e60c8f0..7aa12e88c580 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -358,6 +358,10 @@ config ARCH_SUSPEND_NONZERO_CPU
>   def_bool y
>   depends on PPC_POWERNV || PPC_PSERIES
>  
> +config ARCH_HAS_ADD_PAGES
> + def_bool y
> + depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +
>  config PPC_DCR_NATIVE
>   bool
>  
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 52b77684acda..a97128a48817 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -105,6 +105,37 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
>   vm_unmap_aliases();
>  }
>  
> +/*
> + * After memory hotplug the variables max_pfn, max_low_pfn and high_memory
> + * need updating.
> + */
> +static void update_end_of_memory_vars(u64 start, u64 size)
> +{
> + unsigned long end_pfn = PFN_UP(start + size);
> +
> + if (end_pfn > max_pfn) {
> + max_pfn = end_pfn;
> + max_low_pfn = end_pfn;
> + high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
> + }
> +}
> +
> +int __ref add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> + struct mhp_params *params)
> +{
> + int ret;
> +
> + ret = __add_pages(nid, start_pfn, nr_pages, params);
> + if (ret)
> + return ret;
> +
> + /* update max_pfn, max_low_pfn and high_memory */
> + update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
> +   nr_pages << PAGE_SHIFT);
> +
> + return ret;
> +}
> +
>  int __ref arch_add_memory(int nid, u64 start, u64 size,
> struct mhp_params *params)
>  {
> @@ -115,7 +146,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>   rc = arch_create_linear_mapping(nid, start, size, params);
>   if (rc)
>   return rc;
> - rc = __add_pages(nid, start_pfn, nr_pages, params);
> + rc = add_pages(nid, start_pfn, nr_pages, params);
>   if (rc)
>   arch_remove_linear_mapping(start, size);
>   return rc;
> -- 
> 2.36.1
> 
> 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH] powerpc/numa: Associate numa node to its cpu earlier

2022-04-11 Thread Oscar Salvador
On Mon, Apr 11, 2022 at 02:28:08PM +0530, Srikar Dronamraju wrote:
> Given that my patch got accepted into powerpc tree
> https://git.kernel.org/powerpc/c/e4ff77598a109bd36789ad5e80aba66fc53d0ffb
> is now part of Linus tree, this line may need a slight tweak.

Right.

@Michael: Will you resolve the conflict, or would you rather have me send
v2 with the amendment?

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes

2022-04-11 Thread Oscar Salvador
On Sun, Apr 10, 2022 at 09:28:38PM +1000, Michael Ellerman wrote:
> Yeah agreed, thanks for getting to the root of the problem.
> 
> Can you resend as a standalone patch. Because you sent it as a reply it
> won't be recognised by patchwork[1] which means it risks getting lost.

Hi Michael,

It's done [1].

thanks!

[1] 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220411074934.4632-1-osalva...@suse.de/

 
-- 
Oscar Salvador
SUSE Labs


[PATCH] powerpc/numa: Associate numa node to its cpu earlier

2022-04-11 Thread Oscar Salvador
powerpc is the only platform that does not rely on
cpu_up()->try_online_node() to bring up a numa node,
and instead special-cases it deep in its own machinery:

dlpar_online_cpu
 find_and_online_cpu_nid
  try_online_node

This should not be needed, but the thing is that the try_online_node()
from cpu_up() will not apply to the right node, because cpu_to_node()
will return the old numa<->cpu mapping that gets set at boot stage
for all possible cpus.

That can be seen easily if we try to print out the numa node passed
to try_online_node() in cpu_up().

The thing is that the numa<->cpu mapping does not get updated till a much
later stage in start_secondary:

start_secondary:
 set_numa_node(numa_cpu_lookup_table[cpu])

But we do not really care, as we already know the
CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
so let us make use of that and set the proper numa<->cpu mapping,
so that cpu_to_node() in cpu_up() returns the right node and
try_online_node() can do its work.

Signed-off-by: Oscar Salvador 
Reviewed-by: Srikar Dronamraju 
Tested-by: Geetika Moolchandani 
---
 arch/powerpc/include/asm/topology.h  |  8 ++-
 arch/powerpc/mm/numa.c   | 31 +++-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 +-
 3 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 36fcafb1fd6d..6ae1b2dce83e 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
-extern int find_and_online_cpu_nid(int cpu);
+extern void find_and_update_cpu_nid(int cpu);
 extern int cpu_to_coregroup_id(int cpu);
 #else
-static inline int find_and_online_cpu_nid(int cpu)
-{
-   return 0;
-}
-
+static inline void find_and_update_cpu_nid(int cpu) {}
 static inline int cpu_to_coregroup_id(int cpu)
 {
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b9b7fefbb64b..b5bc8b1a833d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu,
return rc;
 }
 
-int find_and_online_cpu_nid(int cpu)
+void find_and_update_cpu_nid(int cpu)
 {
__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
int new_nid;
 
/* Use associativity from first thread for all siblings */
if (vphn_get_associativity(cpu, associativity))
-   return cpu_to_node(cpu);
+   return;
 
+   /* Do not have previous associativity, so find it now. */
new_nid = associativity_to_nid(associativity);
+
if (new_nid < 0 || !node_possible(new_nid))
new_nid = first_online_node;
-
-   if (NODE_DATA(new_nid) == NULL) {
-#ifdef CONFIG_MEMORY_HOTPLUG
-   /*
-* Need to ensure that NODE_DATA is initialized for a node from
-* available memory (see memblock_alloc_try_nid). If unable to
-* init the node, then default to nearest node that has memory
-* installed. Skip onlining a node if the subsystems are not
-* yet initialized.
-*/
-   if (!topology_inited || try_online_node(new_nid))
-   new_nid = first_online_node;
-#else
-   /*
-* Default to using the nearest node that has memory installed.
-* Otherwise, it would be necessary to patch the kernel MM code
-* to deal with more memoryless-node error conditions.
+   else
+   /* Associate node <-> cpu, so cpu_up() calls
+* try_online_node() on the right node.
 */
-   new_nid = first_online_node;
-#endif
-   }
+   set_cpu_numa_node(cpu, new_nid);
 
pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__,
cpu, new_nid);
-   return new_nid;
 }
 
 int cpu_to_coregroup_id(int cpu)
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index b81fc846d99c..0f8cd8b06432 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn)
if (get_hard_smp_processor_id(cpu) != thread)
continue;
cpu_maps_update_done();
-   find_and_online_cpu_nid(cpu);
+   find_and_update_cpu_nid(cpu);
rc = device_online(get_cpu_device(cpu));
if (rc) {
dlpar_offline_cpu(dn);
-- 
2.16.4



Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes

2022-04-06 Thread Oscar Salvador
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..13022d734951 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>   if (new_nid < 0 || !node_possible(new_nid))
>   new_nid = first_online_node;
>  
> - if (NODE_DATA(new_nid) == NULL) {
> + if (!node_online(new_nid)) {
>  #ifdef CONFIG_MEMORY_HOTPLUG
>   /*
>* Need to ensure that NODE_DATA is initialized for a node from

Because of this fix, I wanted to check whether we might have any more
fallouts due to ("mm: handle uninitialized numa nodes gracefully"), and it
made me look closer at why powerpc is the only platform that special cases
try_online_node(), while all others rely on cpu_up()->try_online_node() to
do the right thing.

So, I had a look.
Let us rewind a bit:

The commit that added find_and_online_cpu_nid() to dlpar_online_cpu() was
e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").

In there, it says that we have the following picture:

partition_sched_domains
 arch_update_cpu_topology
  numa_update_cpu_topology
   find_and_online_cpu_nid

and by the time find_and_online_cpu_nid() gets called to online the node,
it might be too late, as we might have referenced some NODE_DATA() fields.
Note that this happens at a much later stage in cpuhp.

Also note that at a much earlier stage, we already have a try_online_node()
in cpu_up(), which should allocate-and-online the node and prevent accessing
garbage.
But the problem is that, on powerpc, all possible cpus have the same node
set at boot stage (see arch/powerpc/mm/numa.c:mem_topology_setup),
so cpu_to_node() returns the same thing until the mapping gets updated
(which happens in start_secondary()->set_numa_node()), and so the
try_online_node() from cpu_up() does not apply to the right node, because
it still holds the not-up-to-date node <-> cpu mapping.

(e.g: in my test case, when onlining a CPU belonging to node1,
cpu_up()->try_online_node() tries to online node0, or whatever old
numa<->cpu mapping is there).

So, we have something like:

dlpar_online_cpu
 device_online
  dev->bus->online
   cpu_subsys_online
cpu_device_up
 cpu_up
  try_online_node (old mapping nid <-> cpu )
  ...
  ...
  cphp_callbacks
   sched_cpu_activate
cpuset_update_active_cpus
 schedule_work(&cpuset_hotplug_work)
  cpuset_hotplug_work
   partition_sched_domains
arch_update_cpu_topology
 numa_update_cpu_topology
  find_and_online_cpu_nid (online new_nid)


So, yeah, the real onlining in
numa_update_cpu_topology()->find_and_online_cpu_nid() happens too late;
that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
fixed the issue, but we should not need this special casing at all.

We do already know the numa<->cpu associativity in
dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
update the numa<->cpu mapping, and let try_online_node() do the right thing.

With this in mind, I came up with the following patch, where I carried out
a battery of tests for CPU hotplug stuff and it worked as expected.
But I am not familiar with all powerpc pitfalls (e.g: dedicated vs shared
cpus, etc.), so I would like to hear from more familiar people.

The patch is:

From: Oscar Salvador 
Date: Wed, 6 Apr 2022 14:39:15 +0200
Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier

powerpc is the only platform that does not rely on
cpu_up()->try_online_node() to bring up a numa node,
and instead special-cases it deep in its own machinery:

dlpar_online_cpu
 find_and_online_cpu_nid
  try_online_node

This should not be needed, but the thing is that the try_online_node()
from cpu_up() will not apply to the right node, because cpu_to_node()
will return the old numa<->cpu mapping that gets set at boot stage
for all possible cpus.

That can be seen easily if we try to print out the numa node passed
to try_online_node() in cpu_up().

The thing is that the numa<->cpu mapping does not get updated till a much
later stage in start_secondary:

start_secondary:
 set_numa_node(numa_cpu_lookup_table[cpu])

But we do not really care, as we already know the
CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
so let us make use of that and set the proper numa<->cpu mapping,
so that cpu_to_node() in cpu_up() returns the right node and
try_online_node() can do its work.

Signed-off-by: Oscar Salvador 
---
 arch/powerpc/include/asm/topology.h  |  8 ++-
 arch/powerpc/mm/numa.c   | 

Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes

2022-03-30 Thread Oscar Salvador
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> With commit 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully") NODE_DATA for even a memoryless/cpuless node is partially
> initialized at boot time.
> 
> Before onlining the node, current Powerpc code checks for NODE_DATA to
> be NULL. However since NODE_DATA is partially initialized, this check
> will end up always being false.
> 
> This causes hotplugging a CPU to a memoryless/cpuless node to fail.
> 
> Before adding CPUs
> $ numactl -H
> available: 1 nodes (4)
> node 4 cpus: 0 1 2 3 4 5 6 7
> node 4 size: 97372 MB
> node 4 free: 95545 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00
> 
> %user  %sys  %wait  %idle  physc  %entc  lbusy  app   vcsw   phint
> -----  ----  -----  -----  -----  -----  -----  ----  -----  -----
> 2.66   2.67  0.16   94.51  0.00   0.00   5.33   0.00  67749  0
> 
> After hotplugging 32 cores
> $ numactl -H
> node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130
> 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
> 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
> 167 168 169 170 171 172 173 174 175
> node 4 size: 97372 MB
> node 4 free: 93636 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00
> 
> %user  %sys  %wait  %idle  physc  %entc  lbusy  app   vcsw     phint
> -----  ----  -----  -----  -----  -----  -----  ----  -------  -----
> 0.04   0.02  0.00   99.94  0.00   0.00   0.06   0.00  1128751  3
> 
> As we can see numactl is listing only 8 cores while lparstat is showing
> 33 cores.
> 
> Also dmesg is showing messages like:
> [ 2261.318350 ] BUG: arch topology borken
> [ 2261.318357 ]  the DIE domain not a subset of the NODE domain
> 
> Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: Michal Hocko 
> Cc: Michael Ellerman 
> Reported-by: Geetika Moolchandani 
> Signed-off-by: Srikar Dronamraju 

Acked-by: Oscar Salvador 

Thanks Srikar!

> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..13022d734951 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>   if (new_nid < 0 || !node_possible(new_nid))
>   new_nid = first_online_node;
>  
> - if (NODE_DATA(new_nid) == NULL) {
> + if (!node_online(new_nid)) {
>  #ifdef CONFIG_MEMORY_HOTPLUG
>   /*
>* Need to ensure that NODE_DATA is initialized for a node from
> -- 
> 2.27.0
> 
> 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v4 4/7] mm: make alloc_contig_range work at pageblock granularity

2022-02-04 Thread Oscar Salvador
On Wed, Jan 19, 2022 at 02:06:20PM -0500, Zi Yan wrote:
> From: Zi Yan 
> 
> alloc_contig_range() worked at MAX_ORDER-1 granularity to avoid merging
> pageblocks with different migratetypes. It might unnecessarily convert
> extra pageblocks at the beginning and at the end of the range. Change
> alloc_contig_range() to work at pageblock granularity.
> 
> It is done by restoring pageblock types and split >pageblock_order free
> pages after isolating at MAX_ORDER-1 granularity and migrating pages
> away at pageblock granularity. The reason for this process is that
> during isolation, some pages, either free or in-use, might have >pageblock
> sizes and isolating part of them can cause free accounting issues.
> Restoring the migratetypes of the pageblocks not in the interesting
> range later is much easier.

Hi Zi Yan,

Due to time constraints I only glanced over it, so some comments below
about stuff that caught my eye:

> +static inline void split_free_page_into_pageblocks(struct page *free_page,
> + int order, struct zone *zone)
> +{
> + unsigned long pfn;
> +
> + spin_lock(&zone->lock);
> + del_page_from_free_list(free_page, zone, order);
> + for (pfn = page_to_pfn(free_page);
> +  pfn < page_to_pfn(free_page) + (1UL << order);

It might make sense to have an end_pfn variable so that it does not have to
be constantly evaluated. Or maybe the compiler is clever enough to only
evaluate it once.
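
Something like this, reusing the names from your patch (just a sketch):

 unsigned long pfn = page_to_pfn(free_page);
 unsigned long end_pfn = pfn + (1UL << order);	/* computed once */

 for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
 	int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

 	__free_one_page(pfn_to_page(pfn), pfn, zone, pageblock_order,
 			mt, FPI_NONE);
 }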

> +  pfn += pageblock_nr_pages) {
> + int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> +
> + __free_one_page(pfn_to_page(pfn), pfn, zone, pageblock_order,
> + mt, FPI_NONE);
> + }
> + spin_unlock(&zone->lock);

It is possible that free_page's order is already pageblock_order, so I
would add a one-liner upfront to catch that case and return; otherwise
we do the delete_from_freelist-and-free_it_again dance.
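
E.g. something like this at the top of split_free_page_into_pageblocks():

 /* Already a single pageblock, nothing to split. */
 if (order == pageblock_order)
 	return;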

> + /* Save the migratepages of the pageblocks before start and after end */
> + num_pageblock_to_save = (alloc_start - isolate_start) / pageblock_nr_pages
> +                         + (isolate_end - alloc_end) / pageblock_nr_pages;
> + saved_mt =
> +         kmalloc_array(num_pageblock_to_save,
> +                       sizeof(unsigned char), GFP_KERNEL);
> + if (!saved_mt)
> + return -ENOMEM;
> +
> + num = save_migratetypes(saved_mt, isolate_start, alloc_start);
> +
> + num = save_migratetypes(&saved_mt[num], alloc_end, isolate_end);

I really hope we can put all this magic within start_isolate_page_range(),
and the counterparts in undo_isolate_page_range().

Also, I kinda dislike the saved_mt thing. I thought about some other
approaches, but nothing that wasn't too specific for this case, and I
guess we want that function to be as generic as possible.

> + /*
> +  * Split free page spanning [alloc_end, isolate_end) and put the
> +  * pageblocks in the right migratetype list
> +  */
> + for (outer_end = alloc_end; outer_end < isolate_end;) {
> + unsigned long begin_pfn = outer_end;
> +
> + order = 0;
> + while (!PageBuddy(pfn_to_page(outer_end))) {
> + if (++order >= MAX_ORDER) {
> + outer_end = begin_pfn;
> + break;
> + }
> + outer_end &= ~0UL << order;
> + }
> +
> + if (outer_end != begin_pfn) {
> + order = buddy_order(pfn_to_page(outer_end));
> +
> + /*
> +  * split the free page at start page and put the pageblocks
> +  * in the right migratetype list
> +  */
> + VM_BUG_ON(outer_end + (1UL << order) <= begin_pfn);

How could this possibily happen?

> + {
> + struct page *free_page = pfn_to_page(outer_end);
> +
> + split_free_page_into_pageblocks(free_page, order, cc.zone);
> + }
> + outer_end += 1UL << order;
> + } else
> + outer_end = begin_pfn + 1;
>   }

I think there are cases we could optimize for. If the page has already been
split in pageblocks by the outer_start loop, we could skip this outer_end
logic altogether.

E.g: An order-10 page is split in two pageblocks. There's nothing else
to be done, right? We could skip this.


-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-02-02 Thread Oscar Salvador
On Wed, Feb 02, 2022 at 01:25:28PM +0100, David Hildenbrand wrote:
> That's the whole idea for being able to allocate parts of an unmovable
> pageblock that are movable.
> 
> If the first part is unmovable but the second part is movable, nothing
> should stop us from trying to allocate the second part.

Yeah, I see, I was a bit slow there, but I see the point now.
 
Thanks David

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-02-02 Thread Oscar Salvador
On Wed, Jan 19, 2022 at 02:06:19PM -0500, Zi Yan wrote:
> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) granularity, but not all
> pages within that granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.

Another thing that came to my mind.
Prior to this patch, has_unmovable_pages() was checking on pageblock
granularity, starting at pfn#0 of the pageblock.
With this patch, you no longer check on pageblock granularity, which
means you might isolate a pageblock, but some pages that sneaked in
might actually be unmovable.

E.g:

Let's say you have a pageblock that spans (pfn#512,pfn#1024),
and you pass alloc_contig_range() (pfn#514,pfn#1024).
has_unmovable_pages() will start checking the pageblock at pfn#514,
and so it will miss pfn#512 and pfn#513. Isn't that a problem, if those
pfns turn out to be actually unmovable?


-- 
Oscar Salvador
SUSE Labs


Re: [PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()

2022-01-31 Thread Oscar Salvador
On Fri, Jan 28, 2022 at 04:15:40PM +0100, David Hildenbrand wrote:
> ... and call node_dev_init() after memory_dev_init() from driver_init(),
> so before any of the existing arch/subsys calls. All online nodes should
> be known at that point.
> 
> This is in line with memory_dev_init(), which initializes the memory
> device subsystem and creates all memory block devices.
> 
> Similar to memory_dev_init(), panic() if anything goes wrong, we don't
> want to continue with such basic initialization errors.
> 
> The important part is that node_dev_init() gets called after
> memory_dev_init() and after cpu_dev_init(), but before any of the
> relevant archs call register_cpu() to register the new cpu device under
> the node device. The latter should be the case for the current users
> of topology_init().
> 
> Cc: Andrew Morton 
> Cc: Greg Kroah-Hartman 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Thomas Bogendoerfer 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: Albert Ou 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: "David S. Miller" 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: "Rafael J. Wysocki" 
> Cc: x...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@kvack.org
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 


-- 
Oscar Salvador
SUSE Labs


Re: [PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()

2022-01-31 Thread Oscar Salvador
On Mon, Jan 31, 2022 at 08:48:54AM +0100, David Hildenbrand wrote:
> Hi Oscar,

Hi David :-),

> Right, and the idea is that the online state of nodes (+ node/zone
> ranges) already has to be known at that point in time, because
> otherwise, we'd be in bigger trouble.

Yeah, I wanted to check where exactly we mark the nodes online,
and for the few architectures I checked it happens in setup_arch(),
which is called very early in start_kernel(), while driver_init()
gets called through arch_call_rest_init(), which happens at the end
of the function.

I am not sure whether we want to remark that somehow in the changelog,
so it is crystal clear that by the time node_dev_init() gets called,
we have already set the nodes online.

Anyway, just saying, it is fine as is.
 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()

2022-01-30 Thread Oscar Salvador
On Fri, Jan 28, 2022 at 04:15:40PM +0100, David Hildenbrand wrote:
> ... and call node_dev_init() after memory_dev_init() from driver_init(),
> so before any of the existing arch/subsys calls. All online nodes should
> be known at that point.
> 
> This is in line with memory_dev_init(), which initializes the memory
> device subsystem and creates all memory block devices.
> 
> Similar to memory_dev_init(), panic() if anything goes wrong, we don't
> want to continue with such basic initialization errors.
> 
> The important part is that node_dev_init() gets called after
> memory_dev_init() and after cpu_dev_init(), but before any of the
> relevant archs call register_cpu() to register the new cpu device under
> the node device. The latter should be the case for the current users
> of topology_init().

So, before this change we had something like this:

do_basic_setup
 driver_init
  memory_dev_init
 do_init_calls
  ...
   topology_init
register_nodes/register_one_node

And after the patch all happens in driver_init()

driver_init
 memory_dev_init
 node_dev_init

I guess this is fine as we do not have any ordering problems (aka: none
of the functions we used to call before expect the nodes not to be
there for some weird reason).

So, no functional change, right?

This certainly looks like an improvement.

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-01-25 Thread Oscar Salvador
On Tue, Jan 25, 2022 at 02:19:46PM +0100, Oscar Salvador wrote:
> I know that this has been discussed previously, and the cover-letter already
> mentions it, but I think it would be great to have some sort of information
> about the problem in the commit message as well, so people do not have to go
> and find it somewhere else.

Sorry, the commit already points it out, but I meant to elaborate some more.

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-01-25 Thread Oscar Salvador
On Mon, Jan 24, 2022 at 12:17:23PM -0500, Zi Yan wrote:
> You are right. Sorry for the confusion. I think it should be
> “Page isolation is done at least on max(MAX_ORDER_NR_PAGES,
> pageblock_nr_pages) granularity.”
> 
> memory_hotplug uses PAGES_PER_SECTION. It is greater than that.

Or just specify that the max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity
only comes from alloc_contig_range at the moment. Other callers might want
to work at other granularities (e.g: memory-hotplug), although ultimately the
range has to be aligned to something.

> > True is that start_isolate_page_range() expects the range to be pageblock 
> > aligned and works in pageblock_nr_pages chunks, but I do not think that is 
> > what you meant to say here.
> 
> Actually, start_isolate_page_range() should expect max(MAX_ORDER_NR_PAGES,
> pageblock_nr_pages) alignment instead of pageblock alignment. It seems to
> be an uncovered bug in the current code, since all callers use at least
> max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) alignment.
> 
> The reason is that if start_isolate_page_range() is only pageblock aligned
> and a caller wants to isolate one pageblock from a MAX_ORDER-1
> (2 pageblocks on x86_64 systems) free page, this will lead to a MIGRATE_ISOLATE
> accounting error. To avoid it, start_isolate_page_range() needs to isolate
> the max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) aligned range.

So, let me see if I get this straight:

You are saying that, currently, alloc_contig_range() works on the biggest
alignment, because otherwise we might have this scenario:

[  MAX_ORDER-1   ]
[pageblock#0][pageblock#1]

We only want to isolate pageblock#1, so we pass a pageblock-aligned range to
start_isolate_page_range(), but the page belonging to pageblock#1 spans
pageblock#0 and pageblock#1 because it is a MAX_ORDER-1 page.

So when we call set_migratetype_isolate()->set_pageblock_migratetype(), this
will mark either pageblock#0 or pageblock#1 as isolated, but the whole page
will be put in the MIGRATE_ISOLATE freelist by
move_freepages_block()->move_freepages().
Meaning, we will effectively have two pageblocks isolated, but only one
marked as such?

Did I get it right or did I miss something?

I know that this has been discussed previously, and the cover-letter already
mentions it, but I think it would be great to have some sort of information
about the problem in the commit message as well, so people do not have to go
and find it somewhere else.


-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v4 2/7] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c

2022-01-24 Thread Oscar Salvador

On 2022-01-19 20:06, Zi Yan wrote:

> From: Zi Yan 
> 
> has_unmovable_pages() is only used in mm/page_isolation.c. Move it from
> mm/page_alloc.c and make it static.
> 
> Signed-off-by: Zi Yan 


Reviewed-by: Oscar Salvador 

--
Oscar Salvador
SUSE Labs


Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-01-24 Thread Oscar Salvador

On 2022-01-19 20:06, Zi Yan wrote:

> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
> pages within that granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.


Hi Zi Yan,

I had to re-read this several times as I found this a bit misleading.
I was mainly confused by the fact that memory_hotplug does isolation on
PAGES_PER_SECTION granularity, and reading the above seems to indicate
that we either do it at MAX_ORDER_NR_PAGES or at pageblock_nr_pages
granularity.


True is that start_isolate_page_range() expects the range to be 
pageblock aligned and works in pageblock_nr_pages chunks, but I do not 
think that is what you meant to say here.


Now, to the change itself, below:



> @@ -47,8 +51,8 @@ static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>                         return page;
>                 }
> 
> -       for (; iter < pageblock_nr_pages - offset; iter++) {
> -               page = pfn_to_page(pfn + iter);
> +       for (pfn = first_pfn; pfn < last_pfn; pfn++) {


You already did pfn = first_pfn before.


> /**
>  * start_isolate_page_range() - make page-allocation-type of range of pages to
>  * be MIGRATE_ISOLATE.
> - * @start_pfn:     The lower PFN of the range to be isolated.
> - * @end_pfn:       The upper PFN of the range to be isolated.
> + * @start_pfn:     The lower PFN of the range to be checked for
> + *                 possibility of isolation.
> + * @end_pfn:       The upper PFN of the range to be checked for
> + *                 possibility of isolation.
> + * @isolate_start: The lower PFN of the range to be isolated.
> + * @isolate_end:   The upper PFN of the range to be isolated.


So, what does "possibility" means here. I think this need to be 
clarified a bit better.


From what you pointed out in the commit message, I think what you are
doing is:

- alloc_contig_range() gets a range to be isolated.
- then you pass two ranges to start_isolate_page_range():
  (start_pfn, end_pfn]: the unaligned range you got in alloc_contig_range()
  (isolate_start, isolate_end]: which got aligned, let's say, to MAX_ORDER_NR_PAGES

Now, most likely, (start_pfn, end_pfn] only covers a sub-range of
(isolate_start, isolate_end], and that sub-range is what you really want to
isolate (so (start_pfn, end_pfn])?

If so, should not the names be reversed?


--
Oscar Salvador
SUSE Labs


Re: [PATCH v1 6/6] x86: remove memory hotplug support on X86_32

2021-10-07 Thread Oscar Salvador
On Wed, Sep 29, 2021 at 04:36:00PM +0200, David Hildenbrand wrote:
> CONFIG_MEMORY_HOTPLUG was marked BROKEN over one year and we just
> restricted it to 64 bit. Let's remove the unused x86 32bit implementation
> and simplify the Kconfig.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  arch/x86/Kconfig  |  6 +++---
>  arch/x86/mm/init_32.c | 31 ---
>  2 files changed, 3 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ab83c22d274e..85f4762429f1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -62,7 +62,7 @@ config X86
>   select ARCH_32BIT_OFF_T if X86_32
>   select ARCH_CLOCKSOURCE_INIT
>   select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && 
> MIGRATION
> - select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
> + select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64
>   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
>   select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 
> || X86_PAE)
>   select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
> @@ -1615,7 +1615,7 @@ config ARCH_SELECT_MEMORY_MODEL
>  
>  config ARCH_MEMORY_PROBE
>   bool "Enable sysfs memory/probe interface"
> - depends on X86_64 && MEMORY_HOTPLUG
> + depends on MEMORY_HOTPLUG
>   help
> This option enables a sysfs memory/probe interface for testing.
> See Documentation/admin-guide/mm/memory-hotplug.rst for more 
> information.
> @@ -2395,7 +2395,7 @@ endmenu
>  
>  config ARCH_HAS_ADD_PAGES
>   def_bool y
> - depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
> + depends on ARCH_ENABLE_MEMORY_HOTPLUG
>  
>  config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>   def_bool y
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index bd90b8fe81e4..5cd7ea6d645c 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -779,37 +779,6 @@ void __init mem_init(void)
>   test_wp_bit();
>  }
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_params *params)
> -{
> - unsigned long start_pfn = start >> PAGE_SHIFT;
> - unsigned long nr_pages = size >> PAGE_SHIFT;
> - int ret;
> -
> - /*
> -  * The page tables were already mapped at boot so if the caller
> -  * requests a different mapping type then we must change all the
> -  * pages with __set_memory_prot().
> -  */
> - if (params->pgprot.pgprot != PAGE_KERNEL.pgprot) {
> - ret = __set_memory_prot(start, nr_pages, params->pgprot);
> - if (ret)
> - return ret;
> - }
> -
> - return __add_pages(nid, start_pfn, nr_pages, params);
> -}
> -
> -void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> -{
> - unsigned long start_pfn = start >> PAGE_SHIFT;
> - unsigned long nr_pages = size >> PAGE_SHIFT;
> -
> - __remove_pages(start_pfn, nr_pages, altmap);
> -}
> -#endif
> -
>  int kernel_set_to_readonly __read_mostly;
>  
>  static void mark_nxdata_nx(void)
> -- 
> 2.31.1
> 
> 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v1 5/6] mm/memory_hotplug: remove stale function declarations

2021-10-07 Thread Oscar Salvador
On Wed, Sep 29, 2021 at 04:35:59PM +0200, David Hildenbrand wrote:
> These functions no longer exist.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  include/linux/memory_hotplug.h | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index e5a867c950b2..be48e003a518 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -98,9 +98,6 @@ static inline void zone_seqlock_init(struct zone *zone)
>  {
>   seqlock_init(&zone->span_seqlock);
>  }
> -extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
> -extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
> -extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
>  extern void adjust_present_page_count(struct page *page,
> struct memory_group *group,
>     long nr_pages);
> -- 
> 2.31.1
> 
> 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v1 4/6] mm/memory_hotplug: remove HIGHMEM leftovers

2021-10-07 Thread Oscar Salvador
On Wed, Sep 29, 2021 at 04:35:58PM +0200, David Hildenbrand wrote:
> We don't support CONFIG_MEMORY_HOTPLUG on 32 bit and consequently not
> HIGHMEM. Let's remove any leftover code -- including the unused
> "status_change_nid_high" field part of the memory notifier.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  Documentation/core-api/memory-hotplug.rst |  3 --
>  .../zh_CN/core-api/memory-hotplug.rst |  4 ---
>  include/linux/memory.h|  1 -
>  mm/memory_hotplug.c   | 36 ++-
>  4 files changed, 2 insertions(+), 42 deletions(-)
> 
> diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
> index de7467e48067..682259ee633a 100644
> --- a/Documentation/core-api/memory-hotplug.rst
> +++ b/Documentation/core-api/memory-hotplug.rst
> @@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
>   unsigned long start_pfn;
>   unsigned long nr_pages;
>   int status_change_nid_normal;
> - int status_change_nid_high;
>   int status_change_nid;
>   }
>  
> @@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
>  - nr_pages is # of pages of online/offline memory.
>  - status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
>is (will be) set/clear, if this is -1, then nodemask status is not changed.
> -- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
> -  is (will be) set/clear, if this is -1, then nodemask status is not changed.
>  - status_change_nid is set node id when N_MEMORY of nodemask is (will be)
>set/clear. It means a new(memoryless) node gets new memory by online and a
>node loses all memory. If this is -1, then nodemask status is not changed.
> diff --git a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
> index 161f4d2c18cc..9a204eb196f2 100644
> --- a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
> +++ b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
> @@ -63,7 +63,6 @@ memory_notify结构体的指针::
>   unsigned long start_pfn;
>   unsigned long nr_pages;
>   int status_change_nid_normal;
> - int status_change_nid_high;
>   int status_change_nid;
>   }
>  
> @@ -74,9 +73,6 @@ memory_notify结构体的指针::
>  - status_change_nid_normal是当nodemask的N_NORMAL_MEMORY被设置/清除时设置节
>点id,如果是-1,则nodemask状态不改变。
>  
> -- status_change_nid_high是当nodemask的N_HIGH_MEMORY被设置/清除时设置的节点
> -  id,如果这个值为-1,那么nodemask状态不会改变。
> -
>  - status_change_nid是当nodemask的N_MEMORY被(将)设置/清除时设置的节点id。这
>意味着一个新的(没上线的)节点通过联机获得新的内存,而一个节点失去了所有的内
>存。如果这个值为-1,那么nodemask的状态就不会改变。
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index dd6e608c3e0b..c46ff374d48d 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -96,7 +96,6 @@ struct memory_notify {
>   unsigned long start_pfn;
>   unsigned long nr_pages;
>   int status_change_nid_normal;
> - int status_change_nid_high;
>   int status_change_nid;
>  };
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8d7b2b593a26..95c927c8bfb8 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -21,7 +21,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -585,10 +584,6 @@ void generic_online_page(struct page *page, unsigned int order)
>   debug_pagealloc_map_pages(page, 1 << order);
>   __free_pages_core(page, order);
>   totalram_pages_add(1UL << order);
> -#ifdef CONFIG_HIGHMEM
> - if (PageHighMem(page))
> - totalhigh_pages_add(1UL << order);
> -#endif
>  }
>  EXPORT_SYMBOL_GPL(generic_online_page);
>  
> @@ -625,16 +620,11 @@ static void node_states_check_changes_online(unsigned long nr_pages,
>  
>   arg->status_change_nid = NUMA_NO_NODE;
>   arg->status_change_nid_normal = NUMA_NO_NODE;
> - arg->status_change_nid_high = NUMA_NO_NODE;
>  
>   if (!node_state(nid, N_MEMORY))
>   arg->status_change_nid = nid;
>   if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
>   arg->status_change_nid_normal = nid;
> -#ifdef CONFIG_HIGHMEM
> - if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY))
> - arg->status_change_nid_high = nid;
> -#endif
>  }
>  
>  static 

Re: [PATCH v1 3/6] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit

2021-10-07 Thread Oscar Salvador
On Wed, Sep 29, 2021 at 04:35:57PM +0200, David Hildenbrand wrote:
> 32 bit support is broken in various ways: for example, we can online
> memory that should actually go to ZONE_HIGHMEM to ZONE_MOVABLE or in
> some cases even to one of the other kernel zones.
> 
> We marked it BROKEN in commit b59d02ed0869 ("mm/memory_hotplug: disable the
> functionality for 32b") almost one year ago. According to that commit
> it might be broken at least since 2017. Further, there is hardly a sane use
> case nowadays.
> 
> Let's just depend completely on 64bit, dropping the "BROKEN" dependency to
> make clear that we are not going to support it again. Next, we'll remove
> some HIGHMEM leftovers from memory hotplug code to clean up.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ea8762cd8e1e..88273dd5c6d6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -125,7 +125,7 @@ config MEMORY_HOTPLUG
>   select MEMORY_ISOLATION
>   depends on SPARSEMEM
>   depends on ARCH_ENABLE_MEMORY_HOTPLUG
> - depends on 64BIT || BROKEN
> + depends on 64BIT
>   select NUMA_KEEP_MEMINFO if NUMA
>  
>  config MEMORY_HOTPLUG_DEFAULT_ONLINE
> -- 
> 2.31.1
> 
> 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH v1 2/6] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE

2021-10-07 Thread Oscar Salvador
On Wed, Sep 29, 2021 at 04:35:56PM +0200, David Hildenbrand wrote:
> CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
> CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
> CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.
> 
> Signed-off-by: David Hildenbrand 

Acked-by: Oscar Salvador 

> ---
>  arch/powerpc/include/asm/machdep.h|  2 +-
>  arch/powerpc/kernel/setup_64.c|  2 +-
>  arch/powerpc/platforms/powernv/setup.c|  4 ++--
>  arch/powerpc/platforms/pseries/setup.c|  2 +-
>  drivers/base/Makefile |  2 +-
>  drivers/base/node.c   |  9 -
>  drivers/virtio/Kconfig|  2 +-
>  include/linux/memory.h| 18 +++---
>  include/linux/node.h  |  4 ++--
>  lib/Kconfig.debug |  2 +-
>  mm/Kconfig|  4 
>  mm/memory_hotplug.c   |  2 --
>  tools/testing/selftests/memory-hotplug/config |  1 -
>  13 files changed, 21 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index 764f2732a821..d8a2ca007082 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -32,7 +32,7 @@ struct machdep_calls {
>   void(*iommu_save)(void);
>   void(*iommu_restore)(void);
>  #endif
> -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +#ifdef CONFIG_MEMORY_HOTPLUG
>   unsigned long   (*memory_block_size)(void);
>  #endif
>  #endif /* CONFIG_PPC64 */
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index eaa79a0996d1..21f15d82f062 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -912,7 +912,7 @@ void __init setup_per_cpu_areas(void)
>  }
>  #endif
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +#ifdef CONFIG_MEMORY_HOTPLUG
>  unsigned long memory_block_size_bytes(void)
>  {
>   if (ppc_md.memory_block_size)
> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index a8db3f153063..ad56a54ac9c5 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -440,7 +440,7 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
>  }
>  #endif /* CONFIG_KEXEC_CORE */
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +#ifdef CONFIG_MEMORY_HOTPLUG
>  static unsigned long pnv_memory_block_size(void)
>  {
>   /*
> @@ -553,7 +553,7 @@ define_machine(powernv) {
>  #ifdef CONFIG_KEXEC_CORE
>   .kexec_cpu_down = pnv_kexec_cpu_down,
>  #endif
> -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +#ifdef CONFIG_MEMORY_HOTPLUG
>   .memory_block_size  = pnv_memory_block_size,
>  #endif
>  };
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index f79126f16258..d29f6f1f7f37 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -1089,7 +1089,7 @@ define_machine(pseries) {
>   .machine_kexec  = pSeries_machine_kexec,
>   .kexec_cpu_down = pseries_kexec_cpu_down,
>  #endif
> -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +#ifdef CONFIG_MEMORY_HOTPLUG
>   .memory_block_size  = pseries_memory_block_size,
>  #endif
>  };
> diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> index ef8e44a7d288..02f7f1358e86 100644
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -13,7 +13,7 @@ obj-y   += power/
>  obj-$(CONFIG_ISA_BUS_API)+= isa.o
>  obj-y+= firmware_loader/
>  obj-$(CONFIG_NUMA)   += node.o
> -obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
> +obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o
>  ifeq ($(CONFIG_SYSFS),y)
>  obj-$(CONFIG_MODULES)+= module.o
>  endif
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index c56d34f8158f..b5a4ba18f9f9 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -629,7 +629,7 @@ static void node_device_release(struct device *dev)
>  {
>   struct node *node = to_node(dev);
>  
> -#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
> +#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_HUGETLBFS)
>   /*
>* We schedule the work only when a memory section is
>* onlined/offlined on this node. When we come here,
> @@ -782,7 +782,7 @@ int unregister_cpu_under_node(unsigned int

Re: [PATCH v1 1/6] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG

2021-10-07 Thread Oscar Salvador
On Wed, Sep 29, 2021 at 04:35:55PM +0200, David Hildenbrand wrote:
> SPARSEMEM is the only possible memory model for x86-64, FLATMEM is not
> possible:
>   config ARCH_FLATMEM_ENABLE
>   def_bool y
>   depends on X86_32 && !NUMA
> 
> And X86_64_ACPI_NUMA (obviously) only supports x86-64:
>   config X86_64_ACPI_NUMA
>   def_bool y
>   depends on X86_64 && NUMA && ACPI && PCI
> 
> Let's just remove the CONFIG_X86_64_ACPI_NUMA dependency, as it does no
> longer make sense.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d16ba9249bc5..b7fb3f0b485e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -123,7 +123,7 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
>  config MEMORY_HOTPLUG
>   bool "Allow for memory hot-add"
>   select MEMORY_ISOLATION
> - depends on SPARSEMEM || X86_64_ACPI_NUMA
> + depends on SPARSEMEM
>   depends on ARCH_ENABLE_MEMORY_HOTPLUG
>   depends on 64BIT || BROKEN
>   select NUMA_KEEP_MEMINFO if NUMA
> -- 
> 2.31.1
> 
> 

-- 
Oscar Salvador
SUSE Labs


Re: [PATCH V2 4/6] mm: Drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION

2021-04-12 Thread Oscar Salvador
On Thu, Apr 01, 2021 at 12:14:06PM +0530, Anshuman Khandual wrote:
> ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION configs have duplicate definitions on
> platforms that subscribe them. Drop these reduntant definitions and instead
> just select them appropriately.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Andrew Morton 
> Cc: x...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Acked-by: Catalin Marinas  (arm64)
> Signed-off-by: Anshuman Khandual 

Hi Anshuman, 

X86 needs fixing, see below:

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 503d8b2e8676..10702ef1eb57 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -60,8 +60,10 @@ config X86
>   select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
>   select ARCH_32BIT_OFF_T if X86_32
>   select ARCH_CLOCKSOURCE_INIT
> + select ARCH_ENABLE_HUGEPAGE_MIGRATION if x86_64 && HUGETLB_PAGE && MIGRATION
>   select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
>   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
> + select ARCH_ENABLE_THP_MIGRATION if x86_64 && TRANSPARENT_HUGEPAGE

you need s/x86_64/X86_64/, otherwise we are left with no migration :-)
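
i.e., since Kconfig symbols are case-sensitive, the lowercase "x86_64" never evaluates to true, and both selects silently become no-ops. The two lines should read:

	select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
	select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE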

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 8/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:22PM +0100, David Hildenbrand wrote:
> Suggested-by: Michal Hocko 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

LGTM, and definitely much cleaner:

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:21PM +0100, David Hildenbrand wrote:
> Let's revert what we did in case seomthing goes wrong and we return an
"something" :-)
> error - as already done on arm64 and s390x.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 6/8] powerepc/book3s64/hash: drop WARN_ON in hash__remove_section_mapping

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:20PM +0100, David Hildenbrand wrote:
> The single caller (arch_remove_linear_mapping()) prints a proper warning
> when this function fails. No need to eventually crash the kernel - let's
> drop this WARN_ON.
> 
> Suggested-by: Oscar Salvador 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Cc: "Aneesh Kumar K.V" 
> Cc: Nicholas Piggin 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 4/8] powerpc/mm: protect linear mapping modifications by a mutex

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:18PM +0100, David Hildenbrand wrote:
> @@ -144,7 +147,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
>   start = (unsigned long)__va(start);
>   flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
>  
> + mutex_lock(&linear_mapping_mutex);
>   ret = remove_section_mapping(start, start + size);
> + mutex_unlock(&linear_mapping_mutex);
>   WARN_ON_ONCE(ret);

My expertise in this area is low, so bear with me.

Why do we not need to protect flush_dcache_range_chunked() and vm_unmap_aliases() as well?
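
Naively I would have expected the whole sequence to sit under the lock, something like the below (only to illustrate the question, not a suggestion):

	mutex_lock(&linear_mapping_mutex);
	flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
	ret = remove_section_mapping(start, start + size);
	vm_unmap_aliases();
	mutex_unlock(&linear_mapping_mutex);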

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 3/8] powerpc/mm: factor out creating/removing linear mapping

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:17PM +0100, David Hildenbrand wrote:
> We want to stop abusing memory hotplug infrastructure in memtrace code
> to perform allocations and remove the linear mapping. Instead we will use
> alloc_contig_pages() and remove the linear mapping manually.
> 
> Let's factor out creating/removing the linear mapping into
> arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the
> future, we might be able to have whole arch_add_memory() /
> arch_remove_memory() be implemented in common code.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 2/8] powernv/memtrace: fix crashing the kernel when enabling concurrently

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:16PM +0100, David Hildenbrand wrote:
> Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
> Cc: sta...@vger.kernel.org# v4.14+
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 1/8] powernv/memtrace: don't leak kernel memory to user space

2020-11-17 Thread Oscar Salvador
On Wed, Nov 11, 2020 at 03:53:15PM +0100, David Hildenbrand wrote:
> Reported-by: Michael Ellerman 
> Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
> Cc: sta...@vger.kernel.org # v4.14+
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

2020-11-04 Thread Oscar Salvador
On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
> Let's revert what we did in case seomthing goes wrong and we return an
> error.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  arch/powerpc/mm/mem.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 685028451dd2..69b3e8072261 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -165,7 +165,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>   rc = arch_create_linear_mapping(nid, start, size, params);
>   if (rc)
>   return rc;
> - return __add_pages(nid, start_pfn, nr_pages, params);
> + rc = __add_pages(nid, start_pfn, nr_pages, params);
> + if (rc)
> + arch_remove_linear_mapping(start, size);
> + return rc;
>  }
>  
>  void __ref arch_remove_memory(int nid, u64 start, u64 size,
> -- 
> 2.26.2
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

2020-11-04 Thread Oscar Salvador
On Wed, Nov 04, 2020 at 02:06:51PM +0200, Mike Rapoport wrote:
> On Wed, Nov 04, 2020 at 10:50:07AM +0100, osalvador wrote:
> > On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
> > > Let's revert what we did in case seomthing goes wrong and we return an
> > > error.
> > 
> > Dumb question, but should not we do this for other arches as well?
> 
> It seems arm64 and s390 already do that. 
> x86 could have its arch_add_memory() improved though :)

Right, I only stared at x86 and saw that it did not have it.
I guess we want to have all arches aligned on this.

Thanks

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory

2020-02-04 Thread Oscar Salvador
On Tue, Feb 04, 2020 at 09:45:24AM +0100, David Hildenbrand wrote:
> I really hope we'll find more reviewers in general - I'm also not happy
> if my patches go upstream with little/no review. However, patches
> shouldn't be stuck for multiple merge windows in linux-next IMHO
> (excluding exceptions of course) - then they should either be sent
> upstream (and eventually fixed later) or dropped.

First of all sorry for my lack of review, as lately I have been a bit disconnected from the list due to lack of time.

Luckily I managed to find some time, so I went through the patches that lacked review (#6-#10).

I hope this helps in moving the series forward, although Michal's review would be great as well.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:46AM +0200, David Hildenbrand wrote:
> Let's drop the basically unused section stuff and simplify.
> 
> Also, let's use a shorter variant to calculate the number of pages to
> the next section boundary.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

I have to confess that it took me a while to wrap my head around the new min() change, but it looks ok:

Reviewed-by: Oscar Salvador 
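
For anyone else who has to stare at it: with an unsigned pfn, -(pfn | PAGE_SECTION_MASK) is exactly the number of pages from pfn up to the next section boundary, i.e. it equals the old PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK). A tiny standalone check I used to convince myself (my own toy, not part of the patch):

	#include <assert.h>

	/* example section size; the real value is per-architecture */
	#define PAGES_PER_SECTION	(1UL << 15)
	#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))

	int main(void)
	{
		unsigned long pfn = 0x12345;
		unsigned long old_way = PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK);
		unsigned long new_way = -(pfn | PAGE_SECTION_MASK);

		/* both count the pages from pfn to the next section boundary */
		assert(old_way == new_way);
		return 0;
	}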

> ---
>  mm/memory_hotplug.c | 17 ++---
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 843481bd507d..2275240cfa10 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -490,25 +490,20 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
>  void __remove_pages(unsigned long pfn, unsigned long nr_pages,
>   struct vmem_altmap *altmap)
>  {
> + const unsigned long end_pfn = pfn + nr_pages;
> + unsigned long cur_nr_pages;
>   unsigned long map_offset = 0;
> - unsigned long nr, start_sec, end_sec;
>  
>   map_offset = vmem_altmap_offset(altmap);
>  
>   if (check_pfn_span(pfn, nr_pages, "remove"))
>   return;
>  
> - start_sec = pfn_to_section_nr(pfn);
> - end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> - for (nr = start_sec; nr <= end_sec; nr++) {
> - unsigned long pfns;
> -
> + for (; pfn < end_pfn; pfn += cur_nr_pages) {
>   cond_resched();
> - pfns = min(nr_pages, PAGES_PER_SECTION
> - - (pfn & ~PAGE_SECTION_MASK));
> - __remove_section(pfn, pfns, map_offset, altmap);
> - pfn += pfns;
> - nr_pages -= pfns;
> + /* Select all remaining pages up to the next section boundary */
> + cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
> + __remove_section(pfn, cur_nr_pages, map_offset, altmap);
>   map_offset = 0;
>   }
>  }
> -- 
> 2.21.0
> 
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:45AM +0200, David Hildenbrand wrote:
> Get rid of the unnecessary local variables.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8dafa1ba8d9f..843481bd507d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>unsigned long end_pfn)
>  {
> - unsigned long zone_start_pfn = zone->zone_start_pfn;
> - unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
> - unsigned long zone_end_pfn = z;
>   unsigned long pfn;
>   int nid = zone_to_nid(zone);

We could also remove the nid, right?
AFAICS, the nid is only used in find_{smallest,biggest}_section_pfn(), so we could compute it there instead, as sketched below.
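
Rough sketch of what I mean, on top of this patch (untested; the loop body is taken from the current code):

	static unsigned long find_smallest_section_pfn(struct zone *zone,
						       unsigned long start_pfn,
						       unsigned long end_pfn)
	{
		const int nid = zone_to_nid(zone);

		for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) {
			if (unlikely(!pfn_to_online_page(start_pfn)))
				continue;
			if (unlikely(pfn_to_nid(start_pfn) != nid))
				continue;
			if (zone != page_zone(pfn_to_page(start_pfn)))
				continue;
			return start_pfn;
		}
		return 0;
	}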

Anyway, nothing to nit-pick about:

Reviewed-by: Oscar Salvador 

>  
>   zone_span_writelock(zone);
> - if (zone_start_pfn == start_pfn) {
> + if (zone->zone_start_pfn == start_pfn) {
>   /*
>* If the section is smallest section in the zone, it need
>* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
> @@ -389,25 +386,25 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>* for shrinking zone.
>*/
>   pfn = find_smallest_section_pfn(nid, zone, end_pfn,
> - zone_end_pfn);
> + zone_end_pfn(zone));
>   if (pfn) {
> + zone->spanned_pages = zone_end_pfn(zone) - pfn;
>   zone->zone_start_pfn = pfn;
> - zone->spanned_pages = zone_end_pfn - pfn;
>   } else {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
>   }
> - } else if (zone_end_pfn == end_pfn) {
> + } else if (zone_end_pfn(zone) == end_pfn) {
>   /*
>* If the section is biggest section in the zone, it need
>* shrink zone->spanned_pages.
>* In this case, we find second biggest valid mem_section for
>* shrinking zone.
>*/
> - pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
> + pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
>  start_pfn);
>   if (pfn)
> - zone->spanned_pages = pfn - zone_start_pfn + 1;
> +     zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
>   else {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
> -- 
> 2.21.0
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:44AM +0200, David Hildenbrand wrote:
> If we have holes, the holes will automatically get detected and removed
> once we remove the next bigger/smaller section. The extra checks can
> go.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Heh, I have been here before.
I have to confess that when I wrote my version of this I was not really 100% sure about removing it, because hotplug was a sort of "catchall" for all sorts of weird and corner-case configurations, but thinking more about it, I cannot think of any situation that would make this blow up.

Reviewed-by: Oscar Salvador 

> ---
>  mm/memory_hotplug.c | 34 +++---
>  1 file changed, 7 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f294918f7211..8dafa1ba8d9f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>   if (pfn) {
>   zone->zone_start_pfn = pfn;
>   zone->spanned_pages = zone_end_pfn - pfn;
> + } else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;
>   }
>   } else if (zone_end_pfn == end_pfn) {
>   /*
> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  start_pfn);
>   if (pfn)
>   zone->spanned_pages = pfn - zone_start_pfn + 1;
> + else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;
> + }
>   }
> -
> - /*
> -  * The section is not biggest or smallest mem_section in the zone, it
> -  * only creates a hole in the zone. So in this case, we need not
> -  * change the zone. But perhaps, the zone has only hole data. Thus
> -  * it check the zone has only hole or not.
> -  */
> - pfn = zone_start_pfn;
> - for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
> - if (unlikely(!pfn_to_online_page(pfn)))
> - continue;
> -
> - if (page_zone(pfn_to_page(pfn)) != zone)
> - continue;
> -
> - /* Skip range to be removed */
> - if (pfn >= start_pfn && pfn < end_pfn)
> - continue;
> -
> - /* If we find valid section, we have nothing to do */
> - zone_span_writeunlock(zone);
> - return;
> - }
> -
> - /* The zone has no valid section */
> - zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
>   zone_span_writeunlock(zone);
>  }
>  
> -- 
> 2.21.0
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:43AM +0200, David Hildenbrand wrote:
> With shrink_pgdat_span() out of the way, we now always have a valid
> zone.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  mm/memory_hotplug.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index bf5173e7913d..f294918f7211 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -337,7 +337,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
>   if (unlikely(pfn_to_nid(start_pfn) != nid))
>   continue;
>  
> - if (zone && zone != page_zone(pfn_to_page(start_pfn)))
> + if (zone != page_zone(pfn_to_page(start_pfn)))
>   continue;
>  
>   return start_pfn;
> @@ -362,7 +362,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>   if (unlikely(pfn_to_nid(pfn) != nid))
>   continue;
>  
> - if (zone && zone != page_zone(pfn_to_page(pfn)))
> + if (zone != page_zone(pfn_to_page(pfn)))
>   continue;
>  
>   return pfn;
> -- 
> 2.21.0
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:42AM +0200, David Hildenbrand wrote:
> Let's poison the pages similar to when adding new memory in
> sparse_add_section(). Also call remove_pfn_range_from_zone() from
> memunmap_pages(), so we can poison the memmap from there as well.
> 
> While at it, calculate the pfn in memunmap_pages() only once.
> 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Signed-off-by: David Hildenbrand 

Looks good to me. It is fine as long as we do not access those pages later on, and if my eyes did not lie to me, we have the proper checks (pfn_to_online_page()) in place to avoid that.
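
The guard I have in mind is the usual pattern below (illustrative only, not a hunk from this patch):

	struct page *page = pfn_to_online_page(pfn);

	if (!page)
		continue;	/* offline range: never dereference the (poisoned) memmap */

So: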

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-03 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:41AM +0200, David Hildenbrand wrote:
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> Signed-off-by: David Hildenbrand 

I did not see anything wrong with the approach taken, and it makes sense to me.
The only thing that puzzles me is that we no longer seem to balance spanned_pages
for ZONE_DEVICE.
memremap_pages() increments them via move_pfn_range_to_zone, but we skip
ZONE_DEVICE in remove_pfn_range_from_zone.

That is not really related to this patch, so I might be missing something,
but it caught my eye while reviewing this.

Anyway, for this one:

Reviewed-by: Oscar Salvador 


off-topic: I __think__ we really need to trim the CC list.

> ---
>  arch/arm64/mm/mmu.c|  4 +---
>  arch/ia64/mm/init.c|  4 +---
>  arch/powerpc/mm/mem.c  |  3 +--
>  arch/s390/mm/init.c|  4 +---
>  arch/sh/mm/init.c  |  4 +---
>  arch/x86/mm/init_32.c  |  4 +---
>  arch/x86/mm/init_64.c  |  4 +---
>  include/linux/memory_hotplug.h |  7 +--
>  mm/memory_hotplug.c| 31 ---
>  mm/memremap.c  |  2 +-
>  10 files changed, 29 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 60c929f3683b..d10247fab0fd 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1069,7 +1069,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
>   /*
>* FIXME: Cleanup page tables (also in arch_add_memory() in case
> @@ -1078,7 +1077,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>* unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
>* unlocked yet.
>*/
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index bf9df2625bc8..a6dd80a2c939 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index be941d382c8d..97e5922cb52e 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
>   int ret;
>  
> - __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  
>   /* Remove htab bolted mappings for this section of memory */
>   start = (unsigned long)__va(start);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index a124f19f7b3c..c1d96e588152 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -291,10 +291,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>   vmem_remove_mapping(start, size);
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index dfdbaa50946e..d1b1ff2be17a 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = PFN_DOWN(start);
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 930edeb41ec3..0a74407ef92e 100644
> --- a/arch/x86/

Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory

2019-12-03 Thread Oscar Salvador
On Mon, Dec 02, 2019 at 10:09:51AM +0100, David Hildenbrand wrote:
> @Michal, @Oscar, can some of you at least have a patch #5 now so we can
> proceed with that? (the other patches can stay in -next some time longer)

Hi, 

I will be having a look at patch #5 shortly.

Thanks for the reminder

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-07-16 Thread Oscar Salvador
On Mon, Jul 15, 2019 at 01:10:33PM +0200, David Hildenbrand wrote:
> On 01.07.19 12:27, Michal Hocko wrote:
> > On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
> >> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> >>> Yeah, we do not allow to offline multi zone (node) ranges so the current
> >>> code seems to be over engineered.
> >>>
> >>> Anyway, I am wondering why do we have to strictly check for already
> >>> removed nodes links. Is the sysfs code going to complain we we try to
> >>> remove again?
> >>
> >> No, sysfs will silently "fail" if the symlink has already been removed.
> >> At least that is what I saw last time I played with it.
> >>
> >> I guess the question is what if sysfs handling changes in the future
> >> and starts dropping warnings when trying to remove a symlink is not there.
> >> Maybe that is unlikely to happen?
> > 
> > And maybe we handle it then rather than have a static allocation that
> > everybody with hotremove configured has to pay for.
> > 
> 
> So what's the suggestion? Dropping the nodemask_t completely and calling
> sysfs_remove_link() on already potentially removed links?
> 
> Of course, we can also just use mem_blk->nid and rest assured that it
> will never be called for memory blocks belonging to multiple nodes.

Hi David,

While it is easy to construct a scenario where a memblock belongs to multiple nodes, I have to confess that I have not yet seen that in a real-world scenario.

That said, I think the less risky way is to just drop the nodemask_t and not care about calling sysfs_remove_link() for already removed links.
As I said, sysfs_remove_link() will silently fail when it cannot find the symlink, so I do not think it is a big deal.
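
And once we can rely on mem_blk->nid, as you say, the whole thing could probably boil down to something like this (rough sketch, untested):

	void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
	{
		if (mem_blk->nid == NUMA_NO_NODE)
			return;

		sysfs_remove_link(&node_devices[mem_blk->nid]->dev.kobj,
				  kobject_name(&mem_blk->dev.kobj));
		sysfs_remove_link(&mem_blk->dev.kobj,
				  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
	}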


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-07-01 Thread Oscar Salvador
On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
> Yeah, we do not allow to offline multi zone (node) ranges so the current
> code seems to be over engineered.
> 
> Anyway, I am wondering why do we have to strictly check for already
> removed nodes links. Is the sysfs code going to complain we we try to
> remove again?

No, sysfs will silently "fail" if the symlink has already been removed.
At least that is what I saw last time I played with it.

I guess the question is: what if sysfs handling changes in the future and starts emitting warnings when the symlink to be removed is not there?
Maybe that is unlikely to happen?
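
FWIW, today the "failure" is silent because sysfs_remove_link() simply drops the result; from memory it is essentially the below (paraphrased from fs/sysfs/symlink.c, so double-check me):

	void sysfs_remove_link(struct kobject *kobj, const char *name)
	{
		struct kernfs_node *parent;

		if (!kobj)
			parent = sysfs_root_kn;
		else
			parent = kobj->sd;

		/*
		 * kernfs_remove_by_name() returns -ENOENT for a missing
		 * symlink, but the return value is ignored here.
		 */
		kernfs_remove_by_name(parent, name);
	}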

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 02/11] s390x/mm: Fail when an altmap is used for arch_add_memory()

2019-06-10 Thread Oscar Salvador
On Mon, May 27, 2019 at 01:11:43PM +0200, David Hildenbrand wrote:
> ZONE_DEVICE is not yet supported, fail if an altmap is passed, so we
> don't forget arch_add_memory()/arch_remove_memory() when unlocking
> support.
> 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Mike Rapoport 
> Cc: David Hildenbrand 
> Cc: Vasily Gorbik 
> Cc: Oscar Salvador 
> Suggested-by: Dan Williams 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 11/11] mm/memory_hotplug: Remove "zone" parameter from sparse_remove_one_section

2019-06-10 Thread Oscar Salvador
On Mon, May 27, 2019 at 01:11:52PM +0200, David Hildenbrand wrote:
> The parameter is unused, so let's drop it. Memory removal paths should
> never care about zones. This is the job of memory offlining and will
> require more refactorings.
> 
> Reviewed-by: Dan Williams 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-06-10 Thread Oscar Salvador
On Mon, May 27, 2019 at 01:11:51PM +0200, David Hildenbrand wrote:
> We really don't want anything during memory hotunplug to fail.
> We always pass a valid memory block device, that check can go. Avoid
> allocating memory and eventually failing. As we are always called under
> lock, we can use a static piece of memory. This avoids having to put
> the structure onto the stack, having to guess about the stack size
> of callers.
> 
> Patch inspired by a patch from Oscar Salvador.
> 
> In the future, there might be no need to iterate over nodes at all.
> mem->nid should tell us exactly what to remove. Memory block devices
> with mixed nodes (added during boot) should properly fenced off and never
> removed.
> 
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: Alex Deucher 
> Cc: "David S. Miller" 
> Cc: Mark Brown 
> Cc: Chris Wilson 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Jonathan Cameron 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 01/11] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()

2019-06-10 Thread Oscar Salvador
On Mon, May 27, 2019 at 01:11:42PM +0200, David Hildenbrand wrote:
> By converting start and size to page granularity, we actually ignore
> unaligned parts within a page instead of properly bailing out with an
> error.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Qian Cai 
> Cc: Wei Yang 
> Cc: Arun KS 
> Cc: Mathieu Malaterre 
> Reviewed-by: Dan Williams 
> Reviewed-by: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 2/6] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2018-10-05 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 11:25:50AM +0200, David Hildenbrand wrote:
> Reviewed-by: Pavel Tatashin 
> Reviewed-by: Rafael J. Wysocki 
> Reviewed-by: Rashmica Gupta 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 1/6] mm/memory_hotplug: make remove_memory() take the device_hotplug_lock

2018-10-05 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 11:25:49AM +0200, David Hildenbrand wrote:
> Reviewed-by: Pavel Tatashin 
> Reviewed-by: Rafael J. Wysocki 
> Reviewed-by: Rashmica Gupta 
> Signed-off-by: David Hildenbrand 
 
Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 6/6] memory-hotplug.txt: Add some details about locking internals

2018-10-05 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 11:25:54AM +0200, David Hildenbrand wrote:
> Cc: Jonathan Corbet 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Reviewed-by: Pavel Tatashin 
> Reviewed-by: Rashmica Gupta 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 3/6] mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

2018-10-05 Thread Oscar Salvador
On Thu, Sep 27, 2018 at 11:25:51AM +0200, David Hildenbrand wrote:
> Reviewed-by: Pavel Tatashin 
> Reviewed-by: Rashmica Gupta 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 
-- 
Oscar Salvador
SUSE L3


Re: [PATCH RFCv2 0/6] mm: online/offline_pages called w.o. mem_hotplug_lock

2018-08-31 Thread Oscar Salvador
On Tue, Aug 21, 2018 at 12:44:12PM +0200, David Hildenbrand wrote:
> This is the same approach as in the first RFC, but this time without
> exporting device_hotplug_lock (requested by Greg) and with some more
> details and documentation regarding locking. Tested only on x86 so far.

Hi David,

I would like to review this but I am on vacation, so I will not be able to get to it soon.
I plan to do it once I am back.

Thanks
-- 
Oscar Salvador
SUSE L3


Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-12 Thread Oscar Salvador
> > I just roughly check, but if I checked the right place,
> > vmemmap_populated() checks for the section to contain the flags we are
> > setting in sparse_init_one_section().
> 
> Yes.
> 
> > But with this patch, we populate first everything, and then we call
> > sparse_init_one_section() in sparse_init().
> > As I said I could be mistaken because I just checked the surface.
> 
> Yeah I think that's correct.
> 
> This might just be a bug in our code, let me look at it a bit.

I wonder if something like this could do the trick:

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 51ce091914f9..e281651f50cd 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -177,6 +177,8 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
vmemmap_list = vmem_back;
 }
 
+static unsigned long last_addr_populated = 0;
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
		struct vmem_altmap *altmap)
 {
@@ -191,7 +193,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
void *p;
int rc;
 
-   if (vmemmap_populated(start, page_size))
+   if (start + page_size <= last_addr_populated)
continue;
 
if (altmap)
@@ -212,6 +214,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
__func__, rc);
return -EFAULT;
}
+   last_addr_populated = start + page_size;
}

I know it looks hacky, and chances are it is wrong, but could you give it a try?
I will try to grab a ppc server and try it out too.
 
Thanks
-- 
Oscar Salvador
SUSE L3


Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)

2018-07-11 Thread Oscar Salvador
On Wed, Jul 11, 2018 at 10:49:58PM +1000, Michael Ellerman wrote:
> a...@linux-foundation.org writes:
> > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> >
> >http://www.ozlabs.org/~akpm/mmotm/
> ...
> 
> > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > * 
> > mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > * 
> > mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> 
> > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> 
> This seems to be breaking my powerpc pseries qemu boots.
> 
> The boot log with some extra debug shows eg:
> 
>   $ make pseries_le_defconfig

Could you please share the config?
I was not able to find such a config in the kernel tree.
-- 
Oscar Salvador
SUSE L3