Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED

2023-05-23 Thread David Rientjes
On Tue, 23 May 2023, Vlastimil Babka wrote:

> As discussed at LSF/MM [1] [2] and with no objections raised there,
> deprecate the SLAB allocator. Rename the user-visible option so that
> users with CONFIG_SLAB=y get a new prompt with explanation during make
> oldconfig, while make olddefconfig will just switch to SLUB.
> 
> In all defconfigs with CONFIG_SLAB=y remove the line so those also
> switch to SLUB. Regressions due to the switch should be reported to
> linux-mm and slab maintainers.
> 
> [1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
> [2] https://lwn.net/Articles/932201/
> 
> Signed-off-by: Vlastimil Babka 

Acked-by: David Rientjes 

The Kconfig option says that SLAB will be removed in a few cycles.  I 
think we should wait until at least the next LTS kernel is forked at the 
end of the year so that users who upgrade to only the LTS releases can be 
prompted for this change and surface any concerns.  Slab allocation is a 
critical subsystem, so I presume this is the safest and most responsible 
way to do the SLAB deprecation.  Hopefully that timeline works for 
everybody.
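
For reference, the shape of the Kconfig change is roughly the following
(a sketch from memory, not the exact hunk from the patch):

config SLAB_DEPRECATED
	bool "SLAB (DEPRECATED)"
	help
	  Deprecated and scheduled for removal in a few cycles; SLUB is
	  the recommended allocator.

config SLAB
	def_bool y
	depends on SLAB_DEPRECATED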


Re: [PATCH] mm: Remove HARDENED_USERCOPY_FALLBACK

2021-09-26 Thread David Rientjes
On Tue, 21 Sep 2021, Stephen Kitt wrote:

> This has served its purpose and is no longer used. All usercopy
> violations appear to have been handled by now, any remaining
> instances (or new bugs) will cause copies to be rejected.
> 
> This isn't a direct revert of commit 2d891fbc3bb6 ("usercopy: Allow
> strict enforcement of whitelists"); since usercopy_fallback is
> effectively 0, the fallback handling is removed too.
> 
> This also removes the usercopy_fallback module parameter on
> slab_common.
> 
> Link: https://github.com/KSPP/linux/issues/153
> Signed-off-by: Stephen Kitt 
> Suggested-by: Kees Cook 

Acked-by: David Rientjes 


Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread David Rientjes
On Tue, 10 Apr 2018, Laurent Dufour wrote:

> > On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
> >>  arch/powerpc/include/asm/pte-common.h  | 3 ---
> >>  arch/riscv/Kconfig | 1 +
> >>  arch/s390/Kconfig  | 1 +
> > 
> > You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> > arch/riscv/include/asm/pgtable-bits.h
> 
> Damned !
> Thanks for catching it.
> 

Squashing the two patches together at least allowed it to be caught 
easily.  After it's fixed, feel free to add

Acked-by: David Rientjes <rient...@google.com>

Thanks for doing this!


Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event

2018-04-10 Thread David Rientjes
On Mon, 26 Mar 2018, Andi Kleen wrote:

> > Aside: should there be a new spec_flt field for struct task_struct that 
> > complements maj_flt and min_flt and be exported through /proc/pid/stat?
> 
> No. task_struct is already too bloated. If you need per process tracking 
> you can always get it through trace points.
> 

Hi Andi,

We have

count_vm_event(PGFAULT);
count_memcg_event_mm(vma->vm_mm, PGFAULT);

in handle_mm_fault() but no counterpart for spf.  I think it would be 
helpful to be able to determine, without tracing, how much faulting can be 
done speculatively if there is no per-process tracking.
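
For illustration, the counterpart could be as simple as the following
(SPF here is a hypothetical new vm_event_item, named only for the sketch):

	/* in handle_speculative_fault(), mirroring handle_mm_fault() */
	count_vm_event(SPF);
	count_memcg_event_mm(vma->vm_mm, SPF);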


Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL

2018-04-09 Thread David Rientjes
On Mon, 9 Apr 2018, Christoph Hellwig wrote:

> > -#ifdef __HAVE_ARCH_PTE_SPECIAL
> > +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >  # define HAVE_PTE_SPECIAL 1
> >  #else
> >  # define HAVE_PTE_SPECIAL 0
> 
> I'd say kill this odd indirection and just use the
> CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly.
> 
> 

Agree, and I think it would be easier to audit/review if patches 1 and 3 
were folded together to see the relationship between the newly added 
selects and the #defines they replace.  Otherwise, looks good!
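
For example, for a single architecture the pairing to review side by side
would look roughly like this (paths approximate):

	# arch/x86/Kconfig: newly added select
	config X86
		...
		select ARCH_HAS_PTE_SPECIAL

	# arch/x86/include/asm/pgtable_types.h: define being removed
	#define __HAVE_ARCH_PTE_SPECIAL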


Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-04-04 Thread David Rientjes
On Wed, 4 Apr 2018, Laurent Dufour wrote:

> > I also think the following is needed:
> > 
> > diff --git a/fs/exec.c b/fs/exec.c
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
> > vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
> > VM_STACK_INCOMPLETE_SETUP;
> > vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > +   seqcount_init(&vma->vm_sequence);
> > +   atomic_set(&vma->vm_ref_count, 0);
> > +#endif
> > 
> > err = insert_vm_struct(mm, vma);
> > if (err)
> 
> No, this is not needed because the vma is allocated with kmem_cache_zalloc() so
> vm_ref_count is 0, and insert_vm_struct() will later call
> __vma_link_rb() which will call seqcount_init().
> 
> Furthermore, in case of error, the vma structure is freed without calling
> get_vma() so there is no risk of lockdep warning.
> 

Perhaps you're working from a different tree than I am, or you fixed the 
lockdep warning differently when adding to dup_mmap() and mmap_region().

I got the following two lockdep errors.

I fixed it locally by doing the seqcount_init() and atomic_set() 
everywhere a vma could be initialized.

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 12 PID: 1 Comm: init Not tainted
Call Trace:
 [] dump_stack+0x67/0x98
 [] register_lock_class+0x1e6/0x4e0
 [] __lock_acquire+0xb9/0x1710
 [] lock_acquire+0xba/0x200
 [] mprotect_fixup+0x10f/0x310
 [] setup_arg_pages+0x12d/0x230
 [] load_elf_binary+0x44a/0x1740
 [] search_binary_handler+0x9b/0x1e0
 [] load_script+0x206/0x270
 [] search_binary_handler+0x9b/0x1e0
 [] do_execveat_common.isra.32+0x6b5/0x9d0
 [] do_execve+0x2c/0x30
 [] run_init_process+0x2b/0x30
 [] kernel_init+0x54/0x110
 [] ret_from_fork+0x3a/0x50

and

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 21 PID: 1926 Comm: mkdir Not tainted
Call Trace:
 [] dump_stack+0x67/0x98
 [] register_lock_class+0x1e6/0x4e0
 [] __lock_acquire+0xb9/0x1710
 [] lock_acquire+0xba/0x200
 [] unmap_page_range+0x89/0xaa0
 [] unmap_single_vma+0x8f/0x100
 [] unmap_vmas+0x4b/0x90
 [] exit_mmap+0xa3/0x1c0
 [] mmput+0x73/0x120
 [] do_exit+0x2bd/0xd60
 [] SyS_exit+0x17/0x20
 [] do_syscall_64+0x6d/0x1a0
 [] entry_SYSCALL_64_after_hwframe+0x26/0x9b

I think it would just be better to generalize vma allocation to initialize 
certain fields and init both spf fields properly for 
CONFIG_SPECULATIVE_PAGE_FAULT.  It's obviously too delicate as is.


Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-04-03 Thread David Rientjes
On Tue, 3 Apr 2018, David Rientjes wrote:

> > >> I found the root cause of this lockdep warning.
> > >>
> > >> In mmap_region(), unmap_region() may be called while vma_link() has not 
> > >> been
> > >> called. This happens during the error path if call_mmap() failed.
> > >>
> > >> The only way to fix that particular case is to call
> > >> seqcount_init(&vma->vm_sequence) when initializing the vma in 
> > >> mmap_region().
> > >>
> > > 
> > > Ack, although that would require a fixup to dup_mmap() as well.
> > 
> > You're right, I'll fix that too.
> > 
> 
> I also think the following is needed:
> 
> diff --git a/fs/exec.c b/fs/exec.c
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
>   vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
> VM_STACK_INCOMPLETE_SETUP;
>   vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_init(&vma->vm_sequence);
> + atomic_set(&vma->vm_ref_count, 0);
> +#endif
>  
>   err = insert_vm_struct(mm, vma);
>   if (err)
> 

Ugh, I think there are a number of other places where this is needed as 
well in mm/mmap.c.  I think it would be better to just create a new 
alloc_vma(unsigned long flags) that all vma allocators can use and that, 
for CONFIG_SPECULATIVE_PAGE_FAULT, initializes the seqcount_t and atomic_t.
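
Roughly something like this (name and initial values are only a
suggestion, matching the hunk above):

static inline struct vm_area_struct *alloc_vma(unsigned long flags)
{
	struct vm_area_struct *vma = kmem_cache_zalloc(vm_area_cachep, flags);

#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
	if (vma) {
		seqcount_init(&vma->vm_sequence);
		atomic_set(&vma->vm_ref_count, 0);
	}
#endif
	return vma;
}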


Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-04-03 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> On 26/03/2018 00:10, David Rientjes wrote:
> > On Wed, 21 Mar 2018, Laurent Dufour wrote:
> > 
> >> I found the root cause of this lockdep warning.
> >>
> >> In mmap_region(), unmap_region() may be called while vma_link() has not 
> >> been
> >> called. This happens during the error path if call_mmap() failed.
> >>
> >> The only way to fix that particular case is to call
> >> seqcount_init(&vma->vm_sequence) when initializing the vma in 
> >> mmap_region().
> >>
> > 
> > Ack, although that would require a fixup to dup_mmap() as well.
> 
> You're right, I'll fix that too.
> 

I also think the following is needed:

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_init(&vma->vm_sequence);
+   atomic_set(&vma->vm_ref_count, 0);
+#endif
 
err = insert_vm_struct(mm, vma);
if (err)


Re: [PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE

2018-04-03 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 4d02524a7998..2f3e98edc94a 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -300,6 +300,7 @@ extern pgprot_t protection_map[16];
> >>  #define FAULT_FLAG_USER   0x40/* The fault originated in 
> >> userspace */
> >>  #define FAULT_FLAG_REMOTE 0x80/* faulting for non current tsk/mm */
> >>  #define FAULT_FLAG_INSTRUCTION  0x100 /* The fault was during an 
> >> instruction fetch */
> >> +#define FAULT_FLAG_SPECULATIVE0x200   /* Speculative fault, not 
> >> holding mmap_sem */
> >>  
> >>  #define FAULT_FLAG_TRACE \
> >>{ FAULT_FLAG_WRITE, "WRITE" }, \
> > 
> > I think FAULT_FLAG_SPECULATIVE should be introduced in the patch that 
> > actually uses it.
> 
> I think you're right, I'll move down this define in the series.
> 
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0ae4999c824..8ac241b9f370 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, 
> >> unsigned long addr,
> >>  }
> >>  EXPORT_SYMBOL_GPL(apply_to_page_range);
> >>  
> >> +static bool pte_map_lock(struct vm_fault *vmf)
> > 
> > inline?
> 
> Agreed.
> 

Ignore this, the final form of the function after the full patchset 
shouldn't be inline.

> >> +{
> >> +  vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> >> + vmf->address, &vmf->ptl);
> >> +  return true;
> >> +}
> >> +
> >>  /*
> >>   * handle_pte_fault chooses page fault handler according to an entry 
> >> which was
> >>   * read non-atomically.  Before making any commitment, on those 
> >> architectures
> >> @@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf)
> >>const unsigned long mmun_start = vmf->address & PAGE_MASK;
> >>const unsigned long mmun_end = mmun_start + PAGE_SIZE;
> >>struct mem_cgroup *memcg;
> >> +  int ret = VM_FAULT_OOM;
> >>  
> >>if (unlikely(anon_vma_prepare(vma)))
> >>goto oom;
> >> @@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf)
> >>/*
> >> * Re-check the pte - we dropped the lock
> >> */
> >> -  vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> >> +  if (!pte_map_lock(vmf)) {
> >> +  mem_cgroup_cancel_charge(new_page, memcg, false);
> >> +  ret = VM_FAULT_RETRY;
> >> +  goto oom_free_new;
> >> +  }
> > 
> > Ugh, but we aren't oom here, so maybe rename oom_free_new so that it makes 
> > sense for return values other than VM_FAULT_OOM?
> 
> You're right, now this label name is not correct, I'll rename it to
> "out_free_new" and rename also the label "oom" to "out" since it is generic 
> too
> now.
> 

I think it would just be better to introduce an out_uncharge label that 
handles the mem_cgroup_cancel_charge() in the exit path:

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,9 +2645,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 * Re-check the pte - we dropped the lock
 */
if (!pte_map_lock(vmf)) {
-   mem_cgroup_cancel_charge(new_page, memcg, false);
ret = VM_FAULT_RETRY;
-   goto oom_free_new;
+   goto out_uncharge;
}
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
@@ -2735,6 +2734,8 @@ static int wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
+out_uncharge:
+   mem_cgroup_cancel_charge(new_page, memcg, false);
 oom_free_new:
put_page(new_page);
 oom:


Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-03 Thread David Rientjes
On Tue, 3 Apr 2018, Jerome Glisse wrote:

> > When dealing with the speculative fault path we should use the VMA's field
> > cached value stored in the vm_fault structure.
> > 
> > Currently vm_normal_page() is using the pointer to the VMA to fetch the
> > vm_flags value. This patch provides a new __vm_normal_page() which is
> > receiving the vm_flags flags value as parameter.
> > 
> > Note: The speculative path is turned on for architecture providing support
> > for special PTE flag. So only the first block of vm_normal_page is used
> > during the speculative path.
> 
> Might be a good idea to explicitly have SPECULATIVE Kconfig option depends
> on ARCH_PTE_SPECIAL and a comment for !HAVE_PTE_SPECIAL in the function
> explaining that speculative page fault should never reach that point.

Yeah, I think that's appropriate but in a follow-up patch since this is 
only propagating vma_flags.  It will require that __HAVE_ARCH_PTE_SPECIAL 
become an actual Kconfig entry, however.


Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-04-03 Thread David Rientjes
On Tue, 3 Apr 2018, Jerome Glisse wrote:

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 21b1212a0892..4bc7b0bdcb40 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
> >   * parts, do_swap_page must check under lock before unmapping the pte and
> >   * proceeding (but do_wp_page is only called after already making such a 
> > check;
> >   * and do_anonymous_page can safely check later on).
> > + *
> > + * pte_unmap_same() returns:
> > + * 0   if the PTE are the same
> > + * VM_FAULT_PTNOTSAME  if the PTE are different
> > + * VM_FAULT_RETRY  if the VMA has changed in our back during
> > + * a speculative page fault handling.
> >   */
> > -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> > -   pte_t *page_table, pte_t orig_pte)
> > +static inline int pte_unmap_same(struct vm_fault *vmf)
> >  {
> > -   int same = 1;
> > +   int ret = 0;
> > +
> >  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
> > if (sizeof(pte_t) > sizeof(unsigned long)) {
> > -   spinlock_t *ptl = pte_lockptr(mm, pmd);
> > -   spin_lock(ptl);
> > -   same = pte_same(*page_table, orig_pte);
> > -   spin_unlock(ptl);
> > +   if (pte_spinlock(vmf)) {
> > +   if (!pte_same(*vmf->pte, vmf->orig_pte))
> > +   ret = VM_FAULT_PTNOTSAME;
> > +   spin_unlock(vmf->ptl);
> > +   } else
> > +   ret = VM_FAULT_RETRY;
> > }
> >  #endif
> > -   pte_unmap(page_table);
> > -   return same;
> > +   pte_unmap(vmf->pte);
> > +   return ret;
> >  }
> >  
> >  static inline void cow_user_page(struct page *dst, struct page *src, 
> > unsigned long va, struct vm_area_struct *vma)
> > @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
> > int exclusive = 0;
> > int ret = 0;
> >  
> > -   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> > +   ret = pte_unmap_same(vmf);
> > +   if (ret)
> > goto out;
> >  
> 
> This change what do_swap_page() returns ie before it was returning 0
> when locked pte lookup was different from orig_pte. After this patch
> it returns VM_FAULT_PTNOTSAME but this is a new return value for
> handle_mm_fault() (the do_swap_page() return value is what ultimately
> get return by handle_mm_fault())
> 
> Do we really want that ? This might confuse some existing user of
> handle_mm_fault() and i am not sure of the value of that information
> to caller.
> 
> Note i do understand that you want to return retry if anything did
> change from underneath and thus need to differentiate from when the
> pte value are not the same.
> 

I think VM_FAULT_RETRY should be handled appropriately for any user of 
handle_mm_fault() already, and would be surprised to learn differently.  
Khugepaged has the appropriate handling.  I think the concern is whether a 
user is handling anything other than VM_FAULT_RETRY and VM_FAULT_ERROR 
(which VM_FAULT_PTNOTSAME is not set in)?  I haven't done a full audit of 
the users.
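
For context, the arch fault handlers generally do something like the
following, so VM_FAULT_RETRY is already on a well-trodden path while a
stray VM_FAULT_PTNOTSAME would presumably just be treated as a handled
fault (a sketch of the common pattern, not any particular arch):

	fault = handle_mm_fault(vma, address, flags);
	if (unlikely(fault & VM_FAULT_RETRY)) {
		/* mmap_sem was dropped by the fault handler */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}
	if (unlikely(fault & VM_FAULT_ERROR)) {
		/* OOM, SIGBUS, SIGSEGV, ... */
		mm_fault_error(regs, address, fault);
		return;
	}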


Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> This change is inspired by the Peter's proposal patch [1] which was
> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
> that particular case, and it is introducing major performance degradation
> due to excessive scheduling operations.
> 
> To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
> is protecting it access using a rwlock.  As the mm_rb tree is a O(log n)
> search it is safe to protect it using such a lock.  The VMA cache is not
> protected by the new rwlock and it should not be used without holding the
> mmap_sem.
> 
> To allow the picked VMA structure to be used once the rwlock is released, a
> use count is added to the VMA structure. When the VMA is allocated it is
> set to 1.  Each time the VMA is picked with the rwlock held its use count
> is incremented. Each time the VMA is released it is decremented. When the
> use count hits zero, this means that the VMA is no more used and should be
> freed.
> 
> This patch is preparing for 2 kind of VMA access :
>  - as usual, under the control of the mmap_sem,
>  - without holding the mmap_sem for the speculative page fault handler.
> 
> Access done under the control the mmap_sem doesn't require to grab the
> rwlock to protect read access to the mm_rb tree, but access in write must
> be done under the protection of the rwlock too. This affects inserting and
> removing of elements in the RB tree.
> 
> The patch is introducing 2 new functions:
>  - vma_get() to find a VMA based on an address by holding the new rwlock.
>  - vma_put() to release the VMA when its no more used.
> These services are designed to be used when access are made to the RB tree
> without holding the mmap_sem.
> 
> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
> we rely on the WMB done when releasing the rwlock to serialize the write
> with the RMB done in a later patch to check for the VMA's validity.
> 
> When free_vma is called, the file associated with the VMA is closed
> immediately, but the policy and the file structure remained in used until
> the VMA's use count reach 0, which may happens later when exiting an
> in progress speculative page fault.
> 
> [1] https://patchwork.kernel.org/patch/5108281/
> 
> Cc: Peter Zijlstra (Intel) 
> Cc: Matthew Wilcox 
> Signed-off-by: Laurent Dufour 

Can __free_vma() be generalized for mm/nommu.c's delete_vma() and 
do_mmap()?
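
For reference, what I have in mind is a shared helper roughly along these
lines (a sketch; the nommu side would additionally need its region
handling):

static void __free_vma(struct vm_area_struct *vma)
{
	if (vma->vm_file)
		fput(vma->vm_file);
	mpol_put(vma_policy(vma));
	kmem_cache_free(vm_area_cachep, vma);
}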


Re: [PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> When dealing with speculative page fault handler, we may race with VMA
> being split or merged. In this case the vma->vm_start and vm->vm_end
> fields may not match the address the page fault is occurring.
> 
> This can only happens when the VMA is split but in that case, the
> anon_vma pointer of the new VMA will be the same as the original one,
> because in __split_vma the new->anon_vma is set to src->anon_vma when
> *new = *vma.
> 
> So even if the VMA boundaries are not correct, the anon_vma pointer is
> still valid.
> 
> If the VMA has been merged, then the VMA in which it has been merged
> must have the same anon_vma pointer otherwise the merge can't be done.
> 
> So in all the case we know that the anon_vma is valid, since we have
> checked before starting the speculative page fault that the anon_vma
> pointer is valid for this VMA and since there is an anon_vma this
> means that at one time a page has been backed and that before the VMA
> is cleaned, the page table lock would have to be grab to clean the
> PTE, and the anon_vma field is checked once the PTE is locked.
> 
> This patch introduce a new __page_add_new_anon_rmap() service which
> doesn't check for the VMA boundaries, and create a new inline one
> which do the check.
> 
> When called from a page fault handler, if this is not a speculative one,
> there is a guarantee that vm_start and vm_end match the faulting address,
> so this check is useless. In the context of the speculative page fault
> handler, this check may be wrong but anon_vma is still valid as explained
> above.
> 
> Signed-off-by: Laurent Dufour 

I'm indifferent on this: it could be argued either way whether the new 
function and its variant are worth it for a simple VM_BUG_ON(), or whether 
the check should rather be done in the callers of page_add_new_anon_rmap().  
It feels like it would be better left to the caller, with a comment added 
to page_add_anon_rmap() itself in mm/rmap.c.
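
i.e., instead of a second helper, keep page_add_new_anon_rmap() as is and
do the sanity check explicitly where a caller wants it, something like
(sketch only):

	/* caller-side sanity check, only where the vma bounds are known stable */
	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
	page_add_new_anon_rmap(page, vma, address, false);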


Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a84ddc218bbd..73b8b99f482b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1263,8 +1263,11 @@ struct zap_details {
>   pgoff_t last_index; /* Highest page->index to unmap 
> */
>  };
>  
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -  pte_t pte, bool with_public_device);
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +   pte_t pte, bool with_public_device,
> +   unsigned long vma_flags);
> +#define _vm_normal_page(vma, addr, pte, with_public_device) \
> + __vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags)
>  #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
>  
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long 
> addr,

If _vm_normal_page() is a static inline function does it break somehow?  
It's nice to avoid the #define's.
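
i.e., something like:

static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
		unsigned long addr, pte_t pte, bool with_public_device)
{
	return __vm_normal_page(vma, addr, pte, with_public_device,
				vma->vm_flags);
}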

> diff --git a/mm/memory.c b/mm/memory.c
> index af0338fbc34d..184a0d663a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -826,8 +826,9 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>  #else
>  # define HAVE_PTE_SPECIAL 0
>  #endif
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -  pte_t pte, bool with_public_device)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +   pte_t pte, bool with_public_device,
> +   unsigned long vma_flags)
>  {
>   unsigned long pfn = pte_pfn(pte);
>  

Would it be possible to update the comment since the function itself is no 
longer named vm_normal_page?


Re: [PATCH v9 14/24] mm: Introduce __maybe_mkwrite()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dfa81a638b7c..a84ddc218bbd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,13 +684,18 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> - if (likely(vma->vm_flags & VM_WRITE))
> + if (likely(vma_flags & VM_WRITE))
>   pte = pte_mkwrite(pte);
>   return pte;
>  }
>  
> +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> + return __maybe_mkwrite(pte, vma->vm_flags);
> +}
> +
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>   struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> diff --git a/mm/memory.c b/mm/memory.c
> index 0a0a483d9a65..af0338fbc34d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>   entry = pte_mkyoung(vmf->orig_pte);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
>   update_mmu_cache(vma, vmf->address, vmf->pte);
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>   inc_mm_counter_fast(mm, MM_ANONPAGES);
>   }
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> - entry = mk_pte(new_page, vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = mk_pte(new_page, vmf->vma_page_prot);
> + entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   /*
>* Clear the pte entry and flush it first, before updating the
>* pte with the new entry. This will avoid a race condition

Don't you also need to do this in do_swap_page()?

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3067,9 +3067,9 @@ int do_swap_page(struct vm_fault *vmf)
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-   pte = mk_pte(page, vma->vm_page_prot);
+   pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
-   pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+   pte = __maybe_mkwrite(pte_mkdirty(pte), vmf->vma_flags);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;


Re: [PATCH v9 13/24] mm: Introduce __lru_cache_add_active_or_unevictable

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> The speculative page fault handler which is run without holding the
> mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags
> is not guaranteed to remain constant.
> Introducing __lru_cache_add_active_or_unevictable() which has the vma flags
> value parameter instead of the vma pointer.
> 
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>


Re: [PATCH v9 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> migrate_misplaced_page() is only called during the page fault handling so
> it's better to pass the pointer to the struct vm_fault instead of the vma.
> 
> This way during the speculative page fault path the saved vma->vm_flags
> could be used.
> 
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>


Re: [PATCH v9 11/24] mm: Cache some VMA fields in the vm_fault structure

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef6ef0627090..dfa81a638b7c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -359,6 +359,12 @@ struct vm_fault {
>* page table to avoid allocation from
>* atomic context.
>*/
> + /*
> +  * These entries are required when handling speculative page fault.
> +  * This way the page handling is done using consistent field values.
> +  */
> + unsigned long vma_flags;
> + pgprot_t vma_page_prot;
>  };
>  
>  /* page entry size for vm->huge_fault() */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 446427cafa19..f71db2b42b30 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3717,6 +3717,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   .vma = vma,
>   .address = address,
>   .flags = flags,
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   /*
>* Hard to debug if it ends up being
>* used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 32314e9e48dd..a946d5306160 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
> *mm,
>   .flags = FAULT_FLAG_ALLOW_RETRY,
>   .pmd = pmd,
>   .pgoff = linear_page_index(vma, address),
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   };
>  
>   /* we only decide to swapin, if there is enough young ptes */
> diff --git a/mm/memory.c b/mm/memory.c
> index 0200340ef089..46fe92b93682 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>* Don't let another task, with possibly unlocked vma,
>* keep the mlocked page.
>*/
> - if (page_copied && (vma->vm_flags & VM_LOCKED)) {
> + if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
>   lock_page(old_page);/* LRU manipulation */
>   if (PageMlocked(old_page))
>   munlock_vma_page(old_page);

Doesn't wp_page_copy() also need to pass this to anon_vma_prepare() so 
that find_mergeable_anon_vma() works correctly?

> @@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>   */
>  int finish_mkwrite_fault(struct vm_fault *vmf)
>  {
> - WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
> + WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
>   if (!pte_map_lock(vmf))
>   return VM_FAULT_RETRY;
>   /*
> @@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
>* We should not cow pages in a shared writeable mapping.
>* Just mark the pages writable and/or call ops->pfn_mkwrite.
>*/
> - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> + if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>(VM_WRITE|VM_SHARED))
>   return wp_pfn_shared(vmf);
>  
> @@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
>   return VM_FAULT_WRITE;
>   }
>   unlock_page(vmf->page);
> - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> + } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>   (VM_WRITE|VM_SHARED))) {
>   return wp_page_shared(vmf);
>   }
> @@ -3067,7 +3067,7 @@ int do_swap_page(struct vm_fault *vmf)
>  
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>   dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> - pte = mk_pte(page, vma->vm_page_prot);
> + pte = mk_pte(page, vmf->vma_page_prot);
>   if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>   pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>   vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -3093,7 +3093,7 @@ int do_swap_page(struct vm_fault *vmf)
>  
>   swap_free(entry);
>   if (mem_cgroup_swap_full(page) ||
> - (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> + (vmf->vma_flags & VM_LOCKED) || PageMlocked(page))
>   try_to_free_swap(page);
>   unlock_page(page);
>   if (page != swapcache && swapcache) {
> @@ -3150,7 +3150,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>   pte_t entry;
>  
>   /* File mapping without ->vm_ops ? */
> - if (vma->vm_flags & VM_SHARED)
> + if (vmf->vma_flags & VM_SHARED)
>  

Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder

2018-03-28 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> @@ -326,7 +336,10 @@ static unsigned long move_vma(struct vm_area_struct 
> >> *vma,
> >>mremap_userfaultfd_prep(new_vma, uf);
> >>arch_remap(mm, old_addr, old_addr + old_len,
> >>   new_addr, new_addr + new_len);
> >> +  if (vma != new_vma)
> >> +  vm_raw_write_end(vma);
> >>}
> >> +  vm_raw_write_end(new_vma);
> > 
> > Just do
> > 
> > vm_raw_write_end(vma);
> > vm_raw_write_end(new_vma);
> > 
> > here.
> 
> Are you sure ? we can have vma = new_vma done if (unlikely(err))
> 

Sorry, what I meant was do

if (vma != new_vma)
vm_raw_write_end(vma);
vm_raw_write_end(new_vma);

after the conditional.  Having the locking unnecessarily embedded in the 
conditional has been an issue in the past with other areas of core code, 
unless you have a strong reason for it.


Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-28 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> > Putting this in mm/Kconfig is definitely the right way to go about it 
> > instead of any generic option in arch/*.
> > 
> > My question, though, was making this configurable by the user:
> > 
> > config SPECULATIVE_PAGE_FAULT
> > bool "Speculative page faults"
> > depends on X86_64 || PPC
> > default y
> > help
> >   ..
> > 
> > It's a question about whether we want this always enabled on x86_64 and 
> > power or whether the user should be able to disable it (right now they 
> > can't).  With a large feature like this, you may want to offer something 
> > simple (disable CONFIG_SPECULATIVE_PAGE_FAULT) if someone runs into 
> > regressions.
> 
> I agree, but I think it would be important to get the per architecture
> enablement to avoid complex check here. For instance in the case of powerPC
> this is only supported for PPC_BOOK3S_64.
> 
> To avoid exposing such per architecture define here, what do you think about
> having supporting architectures setting ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> and the SPECULATIVE_PAGE_FAULT depends on this, like this:
> 
> In mm/Kconfig:
> config SPECULATIVE_PAGE_FAULT
>   bool "Speculative page faults"
>   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && SMP
>   default y
>   help
>   ...
> 
> In arch/powerpc/Kconfig:
> config PPC
>   ...
>   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
> 
> In arch/x86/Kconfig:
> config X86_64
>   ...
>   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> 
> 

Looks good to me!  It feels like this will add more assurance that, if 
things regress for certain workloads, it can be disabled.  I don't feel 
strongly about the default value; I'm ok with it being enabled by default.


Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-03-28 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
> >>int exclusive = 0;
> >>int ret = 0;
> > 
> > Initialization is now unneeded.
> 
> I'm sorry, what "initialization" are you talking about here ?
> 

The initialization of the ret variable.

@@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
int exclusive = 0;
int ret = 0;
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+   ret = pte_unmap_same(vmf);
+   if (ret)
goto out;
 
entry = pte_to_swp_entry(vmf->orig_pte);

"ret" is immediately set to the return value of pte_unmap_same(), so there 
is no need to initialize it to 0.


Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-28 Thread David Rientjes
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> This configuration variable will be used to build the code needed to
> >> handle speculative page fault.
> >>
> >> By default it is turned off, and activated depending on architecture
> >> support.
> >>
> >> Suggested-by: Thomas Gleixner 
> >> Signed-off-by: Laurent Dufour 
> >> ---
> >>  mm/Kconfig | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index abefa573bcd8..07c566c88faf 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -759,3 +759,6 @@ config GUP_BENCHMARK
> >>  performance of get_user_pages_fast().
> >>  
> >>  See tools/testing/selftests/vm/gup_benchmark.c
> >> +
> >> +config SPECULATIVE_PAGE_FAULT
> >> +   bool
> > 
> > Should this be configurable even if the arch supports it?
> 
> Actually, this is not configurable unless by manually editing the .config 
> file.
> 
> I made it this way on the Thomas's request :
> https://lkml.org/lkml/2018/1/15/969
> 
> That sounds to be the smarter way to achieve that, isn't it ?
> 

Putting this in mm/Kconfig is definitely the right way to go about it 
instead of any generic option in arch/*.

My question, though, was making this configurable by the user:

config SPECULATIVE_PAGE_FAULT
bool "Speculative page faults"
depends on X86_64 || PPC
default y
help
  ..

It's a question about whether we want this always enabled on x86_64 and 
power or whether the user should be able to disable it (right now they 
can't).  With a large feature like this, you may want to offer something 
simple (disable CONFIG_SPECULATIVE_PAGE_FAULT) if someone runs into 
regressions.


Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder

2018-03-27 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 88042d843668..ef6ef0627090 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct 
> anon_vma_chain *node);
>  extern int __vm_enough_memory(struct mm_struct *mm, long pages, int 
> cap_sys_admin);
>  extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> - struct vm_area_struct *expand);
> + struct vm_area_struct *expand, bool keep_locked);
>  static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
>  {
> - return __vma_adjust(vma, start, end, pgoff, insert, NULL);
> + return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
>  }
> -extern struct vm_area_struct *vma_merge(struct mm_struct *,
> +extern struct vm_area_struct *__vma_merge(struct mm_struct *,
>   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>   unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> - struct mempolicy *, struct vm_userfaultfd_ctx);
> + struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked);
> +static inline struct vm_area_struct *vma_merge(struct mm_struct *vma,
> + struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
> + pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
> +{
> + return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off,
> +pol, uff, false);
> +}

The first formal parameter to vma_merge() is an mm, not a vma.

This area could use an uncluttering.
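
At a minimum the wrapper should name (and pass) the mm correctly, i.e.:

static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
		struct vm_area_struct *prev, unsigned long addr,
		unsigned long end, unsigned long vm_flags,
		struct anon_vma *anon, struct file *file, pgoff_t off,
		struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
{
	return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
			   pol, uff, false);
}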

>  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>  extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
>   unsigned long addr, int new_below);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d6533cb85213..ac32b577a0c9 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
>   */
>  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> - struct vm_area_struct *expand)
> + struct vm_area_struct *expand, bool keep_locked)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
> @@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned 
> long start,
>  
>   if (next && next != vma)
>   vm_raw_write_end(next);
> - vm_raw_write_end(vma);
> + if (!keep_locked)
> + vm_raw_write_end(vma);
>  
>   validate_mm(mm);
>  

This will require a fixup for the following patch where a retval from 
anon_vma_close() can also return without vma locked even though 
keep_locked == true.

How does vma_merge() handle that error wrt vm_raw_write_begin(vma)?

> @@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, 
> unsigned long vm_flags,
>   * parameter) may establish ptes with the wrong permissions of 
>   * instead of the right permissions of .
>   */
> -struct vm_area_struct *vma_merge(struct mm_struct *mm,
> +struct vm_area_struct *__vma_merge(struct mm_struct *mm,
>   struct vm_area_struct *prev, unsigned long addr,
>   unsigned long end, unsigned long vm_flags,
>   struct anon_vma *anon_vma, struct file *file,
>   pgoff_t pgoff, struct mempolicy *policy,
> - struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> + struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> + bool keep_locked)
>  {
>   pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
>   struct vm_area_struct *area, *next;
> @@ -1185,10 +1187,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>   /* cases 1, 6 */
>   err = __vma_adjust(prev, prev->vm_start,
>next->vm_end, prev->vm_pgoff, NULL,
> -  prev);
> +  prev, keep_locked);
>   } else  /* cases 2, 5, 7 */
>   err = __vma_adjust(prev, prev->vm_start,
> -  end, prev->vm_pgoff, NULL, prev);
> +end, prev->vm_pgoff, NULL, prev,
> +keep_locked);
>   if (err)
>   return NULL;
>   khugepaged_enter_vma_merge(prev, vm_flags);
> @@ -1205,10 +1208,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> 

Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count

2018-03-27 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5898255d0aeb..d6533cb85213 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -847,17 +847,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned 
> long start,
>   }
>  
>   if (start != vma->vm_start) {
> - vma->vm_start = start;
> + WRITE_ONCE(vma->vm_start, start);
>   start_changed = true;
>   }
>   if (end != vma->vm_end) {
> - vma->vm_end = end;
> + WRITE_ONCE(vma->vm_end, end);
>   end_changed = true;
>   }
> - vma->vm_pgoff = pgoff;
> + WRITE_ONCE(vma->vm_pgoff, pgoff);
>   if (adjust_next) {
> - next->vm_start += adjust_next << PAGE_SHIFT;
> - next->vm_pgoff += adjust_next;
> + WRITE_ONCE(next->vm_start,
> +next->vm_start + (adjust_next << PAGE_SHIFT));
> + WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next);
>   }
>  
>   if (root) {
> @@ -1781,6 +1782,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>  out:
>   perf_event_mmap(vma);
>  
> + vm_write_begin(vma);
>   vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
>   if (vm_flags & VM_LOCKED) {
>   if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
> @@ -1803,6 +1805,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>   vma->vm_flags |= VM_SOFTDIRTY;
>  
>   vma_set_page_prot(vma);
> + vm_write_end(vma);
>  
>   return addr;
>  

Shouldn't this also protect vma->vm_flags?

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1796,7 +1796,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vma == get_gate_vma(current->mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & VM_LOCKED_CLEAR_MASK);
}
 
if (file)
@@ -1809,7 +1810,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
 * then new mapped in-place (which must be aimed as
 * a completely new data area).
 */
-   vma->vm_flags |= VM_SOFTDIRTY;
+   WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY);
 
vma_set_page_prot(vma);
vm_write_end(vma);


Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count

2018-03-27 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 65ae54659833..a2d9c87b7b0b 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, 
> const char __user *buf,
>   goto out_mm;
>   }
>   for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - vma->vm_flags &= ~VM_SOFTDIRTY;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags,
> +vma->vm_flags & 
> ~VM_SOFTDIRTY);
>   vma_set_page_prot(vma);
> + vm_write_end(vma);
>   }
>   downgrade_write(&mm->mmap_sem);
>   break;
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index cec550c8468f..b8212ba17695 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct 
> list_head *fcs)
>  
>   octx = vma->vm_userfaultfd_ctx.ctx;
>   if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> + vm_write_begin(vma);
>   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> + WRITE_ONCE(vma->vm_flags,
> +vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
> + vm_write_end(vma);
>   return 0;
>   }
>  

In several locations in this patch vm_write_begin(vma) -> 
vm_write_end(vma) is nesting things other than vma->vm_flags, 
vma->vm_policy, etc.  I think it's better to do vm_write_end(vma) as soon 
as the members that the seqcount protects are modified.  In other words, 
this isn't offering protection for vma->vm_userfaultfd_ctx.  There are 
several examples of this in the patch.
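
For the dup_userfaultfd() hunk above, that would mean keeping the write
side as tight as possible, roughly:

	octx = vma->vm_userfaultfd_ctx.ctx;
	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
		vm_write_begin(vma);
		WRITE_ONCE(vma->vm_flags,
			   vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
		vm_write_end(vma);
		return 0;
	}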

> @@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, 
> struct file *file)
>   vma = prev;
>   else
>   prev = vma;
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
>   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> + vm_write_end(vma);
>   }
>   up_write(>mmap_sem);
>   mmput(mm);
> @@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx 
> *ctx,
>* the next vma was merged into the current one and
>* the current one has not been updated yet.
>*/
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
>   vma->vm_userfaultfd_ctx.ctx = ctx;
> + vm_write_end(vma);
>  
>   skip:
>   prev = vma;
> @@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct 
> userfaultfd_ctx *ctx,
>* the next vma was merged into the current one and
>* the current one has not been updated yet.
>*/
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
>   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> + vm_write_end(vma);
>  
>   skip:
>   prev = vma;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b7e2268dfc9a..32314e9e48dd 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>   if (mm_find_pmd(mm, address) != pmd)
>   goto out;
>  
> + vm_write_begin(vma);
>   anon_vma_lock_write(vma->anon_vma);
>  
>   pte = pte_offset_map(pmd, address);
> @@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>   pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>   spin_unlock(pmd_ptl);
>   anon_vma_unlock_write(vma->anon_vma);
> + vm_write_end(vma);
>   result = SCAN_FAIL;
>   goto out;
>   }
> @@ -1075,6 +1077,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>   set_pmd_at(mm, address, pmd, _pmd);
>   update_mmu_cache_pmd(vma, address, pmd);
>   spin_unlock(pmd_ptl);
> + vm_write_end(vma);
>  
>   *hpage = NULL;
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922ea1a1..e328f7ab5942 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -184,7 +184,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
>   /*
>* vm_flags is protected by the mmap_sem held in write mode.
>*/
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, 

Re: [PATCH v9 07/24] mm: VMA sequence count

2018-03-27 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/mm/mmap.c b/mm/mmap.c
> index faf85699f1a1..5898255d0aeb 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -558,6 +558,10 @@ void __vma_link_rb(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   else
>   mm->highest_vm_end = vm_end_gap(vma);
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_init(>vm_sequence);
> +#endif
> +
>   /*
>* vma->vm_prev wasn't known when we followed the rbtree to find the
>* correct insertion point for that vma. As a result, we could not
> @@ -692,6 +696,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned 
> long start,
>   long adjust_next = 0;
>   int remove_next = 0;
>  
> + /*
> +  * Why using vm_raw_write*() functions here to avoid lockdep's warning ?
> +  *
> +  * Locked is complaining about a theoretical lock dependency, involving
> +  * 3 locks:
> +  *   mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
> +  *
> +  * Here are the major path leading to this dependency :
> +  *  1. __vma_adjust() mmap_sem  -> vm_sequence -> i_mmap_rwsem
> +  *  2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim
> +  *  3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
> +  *  4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
> +  *
> +  * So there is no way to solve this easily, especially because in
> +  * unmap_mapping_range() the i_mmap_rwsem is grab while the impacted
> +  * VMAs are not yet known.
> +  * However, the way the vm_seq is used is guarantying that we will
> +  * never block on it since we just check for its value and never wait
> +  * for it to move, see vma_has_changed() and handle_speculative_fault().
> +  */
> + vm_raw_write_begin(vma);
> + if (next)
> + vm_raw_write_begin(next);
> +
>   if (next && !insert) {
>   struct vm_area_struct *exporter = NULL, *importer = NULL;
>  

Eek, what about later on:

/*
 * Easily overlooked: when mprotect shifts the boundary,
 * make sure the expanding vma has anon_vma set if the
 * shrinking vma had, to cover any anon pages imported.
 */
if (exporter && exporter->anon_vma && !importer->anon_vma) {
int error;

importer->anon_vma = exporter->anon_vma;
error = anon_vma_clone(importer, exporter);
if (error)
return error;
}

This needs

if (error) {
if (next && next != vma)
vm_raw_write_end(next);
vm_raw_write_end(vma);
return error;
}


Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-03-27 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2f3e98edc94a..b6432a261e63 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1199,6 +1199,7 @@ static inline void clear_page_pfmemalloc(struct page 
> *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000   /* ->fault did not modify page tables
>* and needs fsync() to complete (for
>* synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000/* Page table entries have changed */
>  
>  #define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS | 
> VM_FAULT_SIGSEGV | \
>VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
> diff --git a/mm/memory.c b/mm/memory.c
> index 21b1212a0892..4bc7b0bdcb40 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
>   * parts, do_swap_page must check under lock before unmapping the pte and
>   * proceeding (but do_wp_page is only called after already making such a 
> check;
>   * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + *   0   if the PTE are the same
> + *   VM_FAULT_PTNOTSAME  if the PTE are different
> + *   VM_FAULT_RETRY  if the VMA has changed in our back during
> + *   a speculative page fault handling.
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
> - int same = 1;
> + int ret = 0;
> +
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
>   if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
>   }
>  #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
>  }
>  
>  static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
>   int exclusive = 0;
>   int ret = 0;

Initialization is now unneeded.

Otherwise:

Acked-by: David Rientjes <rient...@google.com>


Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event

2018-03-26 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> Add support for the new speculative faults event.
> 
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>

Aside: should there be a new spec_flt field for struct task_struct that 
complements maj_flt and min_flt and be exported through /proc/pid/stat?


Re: [PATCH v9 20/24] perf: Add a speculative page fault sw event

2018-03-26 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> Add a new software event to count succeeded speculative page faults.
> 
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>


Re: [PATCH v9 23/24] x86/mm: Add speculative pagefault handling

2018-03-26 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index e6af2b464c3d..a73cf227edd6 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1239,6 +1239,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>   unsigned long address)
>  {
>   struct vm_area_struct *vma;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + struct vm_area_struct *spf_vma = NULL;
> +#endif
>   struct task_struct *tsk;
>   struct mm_struct *mm;
>   int fault, major = 0;
> @@ -1332,6 +1335,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>   if (error_code & X86_PF_INSTR)
>   flags |= FAULT_FLAG_INSTRUCTION;
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + if ((error_code & X86_PF_USER) && (atomic_read(&mm->mm_users) > 1)) {
> + fault = handle_speculative_fault(mm, address, flags,
> +  &spf_vma);
> +
> + if (!(fault & VM_FAULT_RETRY)) {
> + if (!(fault & VM_FAULT_ERROR)) {
> + perf_sw_event(PERF_COUNT_SW_SPF, 1,
> +   regs, address);
> + goto done;
> + }
> + /*
> +  * In case of error we need the pkey value, but
> +  * can't get it from the spf_vma as it is only returned
> +  * when VM_FAULT_RETRY is returned. So we have to
> +  * retry the page fault with the mmap_sem grabbed.
> +  */
> + }
> + }
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

All the comments from the powerpc version apply here as well; the 
only interesting point is whether VM_FAULT_FALLBACK can be returned from 
handle_speculative_fault() to indicate it's not possible.

> +
>   /*
>* When running in the kernel we expect faults to occur only to
>* addresses in user space.  All other faults represent errors in
> @@ -1365,7 +1389,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>   might_sleep();
>   }
>  
> - vma = find_vma(mm, address);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + if (spf_vma) {
> + if (can_reuse_spf_vma(spf_vma, address))
> + vma = spf_vma;
> + else
> + vma = find_vma(mm, address);
> + spf_vma = NULL;
> + } else
> +#endif
> + vma = find_vma(mm, address);
>   if (unlikely(!vma)) {
>   bad_area(regs, error_code, address);
>   return;
> @@ -1451,6 +1484,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>   return;
>   }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +done:
> +#endif
>   /*
>* Major/minor page fault accounting. If any of the events
>* returned VM_FAULT_MAJOR, we account it as a major fault.


Re: [PATCH v9 24/24] powerpc/mm: Add speculative page fault

2018-03-26 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 866446cf2d9a..104f3cc86b51 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -392,6 +392,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
> long address,
>  unsigned long error_code)
>  {
>   struct vm_area_struct * vma;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + struct vm_area_struct *spf_vma = NULL;
> +#endif
>   struct mm_struct *mm = current->mm;
>   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>   int is_exec = TRAP(regs) == 0x400;
> @@ -459,6 +462,20 @@ static int __do_page_fault(struct pt_regs *regs, 
> unsigned long address,
>   if (is_exec)
>   flags |= FAULT_FLAG_INSTRUCTION;
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + if (is_user && (atomic_read(&mm->mm_users) > 1)) {
> + /* let's try a speculative page fault without grabbing the
> +  * mmap_sem.
> +  */
> + fault = handle_speculative_fault(mm, address, flags, &spf_vma);
> + if (!(fault & VM_FAULT_RETRY)) {
> + perf_sw_event(PERF_COUNT_SW_SPF, 1,
> +   regs, address);
> + goto done;
> + }
> + }
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +

Can't you eliminate all #ifdef's in this patch if 
handle_speculative_fault() can be passed is_user and return some error 
code that fallback is needed?  Maybe reuse VM_FAULT_FALLBACK?
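
Something like the following stub (untested sketch; the exact prototype is
whatever the series ends up with) would let this call site, the find_vma()
hunk below and the done: label all lose their #ifdefs, with the caller only
checking for VM_FAULT_FALLBACK in addition to VM_FAULT_RETRY:

#ifndef CONFIG_SPECULATIVE_PAGE_FAULT
static inline int handle_speculative_fault(struct mm_struct *mm,
					   unsigned long address,
					   unsigned int flags,
					   struct vm_area_struct **vma)
{
	return VM_FAULT_FALLBACK;
}
#endif

The is_user check could then also move into (or be passed to) the function
so the guard logic lives in one place.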

>   /* When running in the kernel we expect faults to occur only to
>* addresses in user space.  All other faults represent errors in the
>* kernel and should generate an OOPS.  Unfortunately, in the case of an
> @@ -489,7 +506,16 @@ static int __do_page_fault(struct pt_regs *regs, 
> unsigned long address,
>   might_sleep();
>   }
>  
> - vma = find_vma(mm, address);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + if (spf_vma) {
> + if (can_reuse_spf_vma(spf_vma, address))
> + vma = spf_vma;
> + else
> + vma =  find_vma(mm, address);
> + spf_vma = NULL;
> + } else
> +#endif
> + vma = find_vma(mm, address);
>   if (unlikely(!vma))
>   return bad_area(regs, address);
>   if (likely(vma->vm_start <= address))

I think the code quality here could be improved such that you can pass mm, 
&spf_vma, and address and some helper function would return spf_vma if 
can_reuse_spf_vma() is true (and set *spf_vma to NULL) or otherwise return 
find_vma(mm, address).
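
IOW something along these lines (untested, and get_fault_vma() is a made-up
name):

static struct vm_area_struct *get_fault_vma(struct mm_struct *mm,
					    unsigned long address,
					    struct vm_area_struct **spf_vma)
{
	struct vm_area_struct *vma = *spf_vma;

	*spf_vma = NULL;
	if (vma && can_reuse_spf_vma(vma, address))
		return vma;
	return find_vma(mm, address);
}

so both the powerpc and x86 fault handlers reduce to a plain
vma = get_fault_vma(mm, address, &spf_vma); with no #ifdef.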

Also, spf_vma is being set to NULL because of VM_FAULT_RETRY, but does it 
make sense to retry handle_speculative_fault() in this case since we've 
dropped mm->mmap_sem and there may have been a writer queued behind it?

> @@ -568,6 +594,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
> long address,
>  
>   up_read(&current->mm->mmap_sem);
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +done:
> +#endif
>   if (unlikely(fault & VM_FAULT_ERROR))
>   return mm_fault_error(regs, address, fault);
>  

And things like this are trivially handled by doing

done: __maybe_unused


Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-03-25 Thread David Rientjes
On Wed, 21 Mar 2018, Laurent Dufour wrote:

> I found the root cause of this lockdep warning.
> 
> In mmap_region(), unmap_region() may be called while vma_link() has not been
> called. This happens during the error path if call_mmap() failed.
> 
> The only to fix that particular case is to call
> seqcount_init(>vm_sequence) when initializing the vma in mmap_region().
> 

Ack, although that would require a fixup to dup_mmap() as well.


Re: [PATCH v9 05/24] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2018-03-25 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> When handling page fault without holding the mmap_sem the fetch of the
> pte lock pointer and the locking will have to be done while ensuring
> that the VMA is not touched in our back.
> 
> So move the fetch and locking operations in a dedicated function.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  mm/memory.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8ac241b9f370..21b1212a0892 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
> long addr,
>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> +static bool pte_spinlock(struct vm_fault *vmf)

inline?

> +{
> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> + spin_lock(vmf->ptl);
> + return true;
> +}
> +
>  static bool pte_map_lock(struct vm_fault *vmf)
>  {
>   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,

Shouldn't pte_unmap_same() take struct vm_fault * and use the new 
pte_spinlock()?
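
That is, roughly (untested, and the speculative case would still need to
decide what a failed pte_spinlock() should mean here):

static inline int pte_unmap_same(struct vm_fault *vmf)
{
	int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
	if (sizeof(pte_t) > sizeof(unsigned long)) {
		if (pte_spinlock(vmf)) {
			same = pte_same(*vmf->pte, vmf->orig_pte);
			spin_unlock(vmf->ptl);
		}
	}
#endif
	pte_unmap(vmf->pte);
	return same;
}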


Re: [PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE

2018-03-25 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> From: Peter Zijlstra 
> 
> When speculating faults (without holding mmap_sem) we need to validate
> that the vma against which we loaded pages is still valid when we're
> ready to install the new PTE.
> 
> Therefore, replace the pte_offset_map_lock() calls that (re)take the
> PTL with pte_map_lock() which can fail in case we find the VMA changed
> since we started the fault.
> 

Based on how its used, I would have suspected this to be named 
pte_map_trylock().

> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Port to 4.12 kernel]
> [Remove the comment about the fault_env structure which has been
>  implemented as the vm_fault structure in the kernel]
> [move pte_map_lock()'s definition upper in the file]
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 56 
> ++
>  2 files changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4d02524a7998..2f3e98edc94a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -300,6 +300,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_USER  0x40/* The fault originated in 
> userspace */
>  #define FAULT_FLAG_REMOTE0x80/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100/* The fault was during an 
> instruction fetch */
> +#define FAULT_FLAG_SPECULATIVE   0x200   /* Speculative fault, not 
> holding mmap_sem */
>  
>  #define FAULT_FLAG_TRACE \
>   { FAULT_FLAG_WRITE, "WRITE" }, \

I think FAULT_FLAG_SPECULATIVE should be introduced in the patch that 
actually uses it.

> diff --git a/mm/memory.c b/mm/memory.c
> index e0ae4999c824..8ac241b9f370 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
> long addr,
>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> +static bool pte_map_lock(struct vm_fault *vmf)

inline?

> +{
> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +vmf->address, &vmf->ptl);
> + return true;
> +}
> +
>  /*
>   * handle_pte_fault chooses page fault handler according to an entry which 
> was
>   * read non-atomically.  Before making any commitment, on those architectures
> @@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>   const unsigned long mmun_start = vmf->address & PAGE_MASK;
>   const unsigned long mmun_end = mmun_start + PAGE_SIZE;
>   struct mem_cgroup *memcg;
> + int ret = VM_FAULT_OOM;
>  
>   if (unlikely(anon_vma_prepare(vma)))
>   goto oom;
> @@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>   /*
>* Re-check the pte - we dropped the lock
>*/
> - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> + if (!pte_map_lock(vmf)) {
> + mem_cgroup_cancel_charge(new_page, memcg, false);
> + ret = VM_FAULT_RETRY;
> + goto oom_free_new;
> + }

Ugh, but we aren't oom here, so maybe rename oom_free_new so that it makes 
sense for return values other than VM_FAULT_OOM?
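
E.g. just relabelling, picking names arbitrarily (untested):

out_free_new:
	put_page(new_page);
out:
	if (old_page)
		put_page(old_page);
	return ret;

with the corresponding gotos updated, so the VM_FAULT_RETRY paths no longer
funnel through labels named "oom".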

>   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>   if (old_page) {
>   if (!PageAnon(old_page)) {
> @@ -2596,7 +2608,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>  oom:
>   if (old_page)
>   put_page(old_page);
> - return VM_FAULT_OOM;
> + return ret;
>  }
>  
>  /**
> @@ -2617,8 +2629,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>  int finish_mkwrite_fault(struct vm_fault *vmf)
>  {
>   WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
> - vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> + if (!pte_map_lock(vmf))
> + return VM_FAULT_RETRY;
>   /*
>* We might have raced with another page fault while we released the
>* pte_offset_map_lock.
> @@ -2736,8 +2748,11 @@ static int do_wp_page(struct vm_fault *vmf)
>   get_page(vmf->page);
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>   lock_page(vmf->page);
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + if (!pte_map_lock(vmf)) {
> + unlock_page(vmf->page);
> + put_page(vmf->page);
> + return VM_FAULT_RETRY;
> + }
>   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
>   unlock_page(vmf->page);
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2947,8 +2962,10 @@ int 

Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-25 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> This configuration variable will be used to build the code needed to
> handle speculative page fault.
> 
> By default it is turned off, and activated depending on architecture
> support.
> 
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Laurent Dufour 
> ---
>  mm/Kconfig | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index abefa573bcd8..07c566c88faf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -759,3 +759,6 @@ config GUP_BENCHMARK
> performance of get_user_pages_fast().
>  
> See tools/testing/selftests/vm/gup_benchmark.c
> +
> +config SPECULATIVE_PAGE_FAULT
> +   bool

Should this be configurable even if the arch supports it?


Re: [PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more

2017-04-18 Thread David Rientjes
On Wed, 12 Apr 2017, Oliver O'Halloran wrote:

> Depending flags of the PMD being zapped there may or may not be a
> deposited pgtable to be freed. In two of the three cases this is open
> coded while the third uses the zap_deposited_table() helper. This patch
> converts the others to use the helper to clean things up a bit.
> 
> Cc: "Aneesh Kumar K.V" <aneesh.ku...@linux.vnet.ibm.com>
> Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
> Cc: linux...@kvack.org
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>

Acked-by: David Rientjes <rient...@google.com>


[no subject]

2016-02-02 Thread David Rientjes via Linuxppc-dev
--- Begin Message ---
On Thu, 28 Jan 2016, David Rientjes wrote:

> On Thu, 28 Jan 2016, Christian Borntraeger wrote:
> 
> > Indeed, I only touched the identity mapping and dump stack.
> > The question is do we really want to change free_init_pages as well?
> > The unmapping during runtime causes significant overhead, but the
> > unmapping after init imposes almost no runtime overhead. Of course,
> > things get fishy now as what is enabled and what not.
> > 
> > Kconfig after my patch "mm/debug_pagealloc: Ask users for default setting 
> > of debug_pagealloc"
> > (in mm) now states
> > snip
> > By default this option will have a small overhead, e.g. by not
> > allowing the kernel mapping to be backed by large pages on some
> > architectures. Even bigger overhead comes when the debugging is
> > enabled by DEBUG_PAGEALLOC_ENABLE_DEFAULT or the debug_pagealloc
> > command line parameter.
> > snip
> > 
> > So I am tempted to NOT change free_init_pages, but the x86 maintainers
> > can certainly decide differently. Ingo, Thomas, H. Peter, please advise.
> > 
> 
> I'm sorry, but I thought the discussion of the previous version of the 
> patchset led to deciding that all CONFIG_DEBUG_PAGEALLOC behavior would be 
> controlled by being enabled on the commandline and checked with 
> debug_pagealloc_enabled().
> 
> I don't think we should have a CONFIG_DEBUG_PAGEALLOC that does some stuff 
> and then a commandline parameter or CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT 
> to enable more stuff.  It should either be all enabled by the commandline 
> (or config option) or split into a separate entity.  
> CONFIG_DEBUG_PAGEALLOC_LIGHT and CONFIG_DEBUG_PAGEALLOC would be fine, but 
> the current state is very confusing about what is being done and what 
> isn't.
> 

Ping?
--- End Message ---

[no subject]

2016-01-28 Thread David Rientjes via Linuxppc-dev
--- Begin Message ---
On Thu, 28 Jan 2016, Christian Borntraeger wrote:

> Indeed, I only touched the identity mapping and dump stack.
> The question is do we really want to change free_init_pages as well?
> The unmapping during runtime causes significant overhead, but the
> unmapping after init imposes almost no runtime overhead. Of course,
> things get fishy now as what is enabled and what not.
> 
> Kconfig after my patch "mm/debug_pagealloc: Ask users for default setting of 
> debug_pagealloc"
> (in mm) now states
> snip
> By default this option will have a small overhead, e.g. by not
> allowing the kernel mapping to be backed by large pages on some
> architectures. Even bigger overhead comes when the debugging is
> enabled by DEBUG_PAGEALLOC_ENABLE_DEFAULT or the debug_pagealloc
> command line parameter.
> snip
> 
> So I am tempted to NOT change free_init_pages, but the x86 maintainers
> can certainly decide differently. Ingo, Thomas, H. Peter, please advise.
> 

I'm sorry, but I thought the discussion of the previous version of the 
patchset led to deciding that all CONFIG_DEBUG_PAGEALLOC behavior would be 
controlled by being enabled on the commandline and checked with 
debug_pagealloc_enabled().

I don't think we should have a CONFIG_DEBUG_PAGEALLOC that does some stuff 
and then a commandline parameter or CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT 
to enable more stuff.  It should either be all enabled by the commandline 
(or config option) or split into a separate entity.  
CONFIG_DEBUG_PAGEALLOC_LIGHT and CONFIG_DEBUG_PAGEALLOC would be fine, but 
the current state is very confusing about what is being done and what 
isn't.

It also wouldn't hurt to enumerate what is enabled and what isn't enabled 
in the Kconfig entry.
--- End Message ---

[no subject]

2016-01-27 Thread David Rientjes via Linuxppc-dev
--- Begin Message ---
On Wed, 27 Jan 2016, Christian Borntraeger wrote:

> We can use debug_pagealloc_enabled() to check if we can map
> the identity mapping with 2MB pages. We can also add the state
> into the dump_stack output.
> 
> The patch does not touch the code for the 1GB pages, which ignored
> CONFIG_DEBUG_PAGEALLOC. Do we need to fence this as well?
> 
> Signed-off-by: Christian Borntraeger 
> Reviewed-by: Thomas Gleixner 
> ---
>  arch/x86/kernel/dumpstack.c |  5 ++---
>  arch/x86/mm/init.c  |  7 ---
>  arch/x86/mm/pageattr.c  | 14 --
>  3 files changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index 9c30acf..32e5699 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -265,9 +265,8 @@ int __die(const char *str, struct pt_regs *regs, long err)
>  #ifdef CONFIG_SMP
>   printk("SMP ");
>  #endif
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> - printk("DEBUG_PAGEALLOC ");
> -#endif
> + if (debug_pagealloc_enabled())
> + printk("DEBUG_PAGEALLOC ");
>  #ifdef CONFIG_KASAN
>   printk("KASAN");
>  #endif
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 493f541..39823fd 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -150,13 +150,14 @@ static int page_size_mask;
>  
>  static void __init probe_page_size_mask(void)
>  {
> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
> +#if !defined(CONFIG_KMEMCHECK)
>   /*
> -  * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
> +  * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will
> +  * use small pages.
>* This will simplify cpa(), which otherwise needs to support splitting
>* large pages into small in interrupt context, etc.
>*/
> - if (cpu_has_pse)
> + if (cpu_has_pse && !debug_pagealloc_enabled())
>   page_size_mask |= 1 << PG_LEVEL_2M;
>  #endif
>  

I would have thought free_init_pages() would be modified to use 
debug_pagealloc_enabled() as well?
--- End Message ---

[no subject]

2016-01-27 Thread David Rientjes via Linuxppc-dev
--- Begin Message ---
On Wed, 27 Jan 2016, Christian Borntraeger wrote:

> We can use debug_pagealloc_enabled() to check if we can map
> the identity mapping with 1MB/2GB pages as well as to print
> the current setting in dump_stack.
> 
> Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com>
> Reviewed-by: Heiko Carstens <heiko.carst...@de.ibm.com>

Acked-by: David Rientjes <rient...@google.com>
--- End Message ---

Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-31 Thread David Rientjes
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

 diff --git a/mm/huge_memory.c b/mm/huge_memory.c
 index aa58a32..56355f2 100644
 --- a/mm/huge_memory.c
 +++ b/mm/huge_memory.c
 @@ -2469,7 +2469,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, 
 struct mm_struct *mm,
*/
  up_read(&mm->mmap_sem);
  
 - *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
 + *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
   if (unlikely(!*hpage)) {
   count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
   *hpage = ERR_PTR(-ENOMEM);
 @@ -2568,9 +2568,7 @@ static void collapse_huge_page(struct mm_struct *mm,
  
   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
  
 - /* Only allocate from the target node */
 - gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) |
 - __GFP_THISNODE;
 + gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), 0);
  
   /* release the mmap_sem read lock. */
   new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node);

Hmm, where is the __GFP_THISNODE enforcement in khugepaged_alloc_page() 
that is removed in collapse_huge_page()?  I also don't see what happened 
to the __GFP_OTHER_NODE.

Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-23 Thread David Rientjes
On Thu, 23 Jul 2015, Christoph Lameter wrote:

  The only possible downside would be existing users of
  alloc_pages_node() that are calling it with an offline node.  Since it's a
  VM_BUG_ON() that would catch that, I think it should be changed to a
  VM_WARN_ON() and eventually fixed up because it's nonsensical.
  VM_BUG_ON() here should be avoided.
 
 The offline node thing could be addresses by using numa_mem_id()?
 

I was concerned about any callers that were passing an offline node, not 
NUMA_NO_NODE, today.  One of the alloc-node functions has a VM_BUG_ON() 
for it, the other silently calls node_zonelist() on it.

I suppose the final alloc_pages_node() implementation could be

if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid)))
nid = numa_mem_id();

VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));

though.

Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-22 Thread David Rientjes
On Wed, 22 Jul 2015, Paolo Bonzini wrote:

  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 2d73807..a8723a8 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
  struct page *pages;
  struct vmcs *vmcs;
   
  -   pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
  +   pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
  if (!pages)
  return NULL;
  vmcs = page_address(pages);
 
 Even though there's a pretty strong preference for the right node,
 things can work if the node is the wrong one.  The order is always zero
 in practice, so the allocation should succeed.
 

Your code is fine both before and after the patch since __GFP_THISNODE 
isn't set.  The allocation will eventually succeed but, as you said, may 
be from remote memory (and the success of allocating on node may be 
influenced by the global setting of zone_reclaim_mode).

Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-22 Thread David Rientjes
On Wed, 22 Jul 2015, Vlastimil Babka wrote:

  alloc_pages_exact_node(), as you said, connotates that the allocation will
  take place on that node or will fail.  So why not go beyond this patch and
  actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
  into a new alloc_pages_prefer_node(), which would be the current
  alloc_pages_exact_node() implementation, and then fix up the callers?
 
 OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and
 alloc_pages_exact_node(). Isn't that a bit too much? The first two
 differ only in tiny bit:
 
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 unsigned int order)
 {
 /* Unknown node is current node */
 if (nid < 0)
 nid = numa_node_id();
 
 return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
 unsigned int order)
 {
 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
 
 return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 

Eek, yeah, that does look bad.  I'm not even sure the

if (nid < 0)
nid = numa_node_id();

is correct; I think this should be comparing to NUMA_NO_NODE rather than
all negative numbers, otherwise we silently ignore overflow and nobody 
ever knows.

 So _prefer_node is just a tiny optimization over the other one. It
 should be maybe called __alloc_pages_node() then? This would perhaps
 discourage users outside of mm/arch code (where it may matter). The
 savings of a skipped branch is likely dubious anyway... It would be also
 nice if alloc_pages_node() could use __alloc_pages_node() internally, but
 I'm not sure if all callers are safe wrt the
 VM_BUG_ON(!node_online(nid)) part.
 

I'm not sure how large you want to make your patch :)  In a perfect world 
I would think that we wouldn't have an alloc_pages_prefer_node() at all 
and everything would be converted to alloc_pages_node() which would do

if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));

and then alloc_pages_exact_node() would do

return alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, order);

and existing alloc_pages_exact_node() callers fixed up depending on 
whether they set the bit or not.

The only possible downside would be existing users of 
alloc_pages_node() that are calling it with an offline node.  Since it's a 
VM_BUG_ON() that would catch that, I think it should be changed to a 
VM_WARN_ON() and eventually fixed up because it's nonsensical.  
VM_BUG_ON() here should be avoided.

Or just go with a single alloc_pages_node() and rename __GFP_THISNODE to 
__GFP_EXACT_NODE which may accomplish the same thing :)

Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-21 Thread David Rientjes
On Tue, 21 Jul 2015, Vlastimil Babka wrote:

 The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 (page
 allocator: do not check NUMA node ID when the caller knows the node is valid)
 as an optimized variant of alloc_pages_node(), that doesn't allow the node id
 to be -1. Unfortunately the name of the function can easily suggest that the
 allocation is restricted to the given node. In truth, the node is only
 preferred, unless __GFP_THISNODE is among the gfp flags.
 
 The misleading name has led to mistakes in the past, see 5265047ac301 (mm,
 thp: really limit transparent hugepage allocation to local node) and
 b360edb43f8e (mm, mempolicy: migrate_to_node should only migrate to node).
 
 To prevent further mistakes, this patch renames the function to
 alloc_pages_prefer_node() and documents it together with alloc_pages_node().
 

alloc_pages_exact_node(), as you said, connotates that the allocation will 
take place on that node or will fail.  So why not go beyond this patch and 
actually make alloc_pages_exact_node() set __GFP_THISNODE and then call 
into a new alloc_pages_prefer_node(), which would be the current 
alloc_pages_exact_node() implementation, and then fix up the callers?

Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-14 Thread David Rientjes
On Fri, 10 Jul 2015, Nishanth Aravamudan wrote:

  After the percpu areas are initialized and cpu_to_node() is correct, it 
  would be really nice to be able to make numa_cpu_lookup_table[] be 
  __initdata since it shouldn't be necessary anymore.  That probably has cpu 
  callbacks that need to be modified to no longer look at 
  numa_cpu_lookup_table[] or pass the value in, but it would make it much 
  cleaner.  Then nobody will have to worry about figuring out whether 
  early_cpu_to_node() or cpu_to_node() is the right one to call.
 
 When I worked on the original pcpu patches for power, I wanted to do
 this, but got myself confused and never came back to it. Thank you for
 suggesting it and I'll work on it soon.
 

Great, thanks for taking it on!  I have powerpc machines so I can test 
this and try to help where possible.

Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-08 Thread David Rientjes
On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

 Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
 have an ordering issue during boot with early calls to cpu_to_node().
 The value returned by those calls now depend on the per-cpu area being
 setup, but that is not guaranteed to be the case during boot. Instead,
 we need to add an early_cpu_to_node() which doesn't use the per-CPU area
 and call that from certain spots that are known to invoke cpu_to_node()
 before the per-CPU areas are configured.
 
 On an example 2-node NUMA system with the following topology:
 
 available: 2 nodes (0-1)
 node 0 cpus: 0 1 2 3
 node 0 size: 2029 MB
 node 0 free: 1753 MB
 node 1 cpus: 4 5 6 7
 node 1 size: 2045 MB
 node 1 free: 1945 MB
 node distances:
 node   0   1 
   0:  10  40 
   1:  40  10 
 
 we currently emit at boot:
 
 [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
 
 After this commit, we correctly emit:
 
 [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
 
 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 
 diff --git a/arch/powerpc/include/asm/topology.h 
 b/arch/powerpc/include/asm/topology.h
 index 5f1048e..f2c4c89 100644
 --- a/arch/powerpc/include/asm/topology.h
 +++ b/arch/powerpc/include/asm/topology.h
 @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
  extern int __node_distance(int, int);
  #define node_distance(a, b) __node_distance(a, b)
  
 +extern int early_cpu_to_node(int);
 +
  extern void __init dump_numa_cpu_topology(void);
  
  extern int sysfs_add_device_to_node(struct device *dev, int nid);
 diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
 index c69671c..23a2cf3 100644
 --- a/arch/powerpc/kernel/setup_64.c
 +++ b/arch/powerpc/kernel/setup_64.c
 @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
  
  static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t 
 align)
  {
 - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
 - __pa(MAX_DMA_ADDRESS));
 + return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
 + align, __pa(MAX_DMA_ADDRESS));
  }
  
  static void __init pcpu_fc_free(void *ptr, size_t size)
 @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
  
  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
  {
 - if (cpu_to_node(from) == cpu_to_node(to))
 + if (early_cpu_to_node(from) == early_cpu_to_node(to))
   return LOCAL_DISTANCE;
   else
   return REMOTE_DISTANCE;
 diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
 index 5e80621..9ffabf4 100644
 --- a/arch/powerpc/mm/numa.c
 +++ b/arch/powerpc/mm/numa.c
 @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
   cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
  }
  
 +int early_cpu_to_node(int cpu)
 +{
 + return numa_cpu_lookup_table[cpu];
 +}
 +
  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
  static void unmap_cpu_from_node(unsigned long cpu)
  {
 
 

early_cpu_to_node() looks like it's begging to be __init since we 
shouldn't need to reference numa_cpu_lookup_table after boot and 
that appears like it can be done if pcpu_cpu_distance() is made __init in 
this patch and smp_prepare_boot_cpu() is made __init in the next patch.  
So I think this is fine, but those functions and things like 
reset_numa_cpu_lookup_table() should be in init.text.
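
i.e. simply (sketch):

int __init early_cpu_to_node(int cpu)
{
	return numa_cpu_lookup_table[cpu];
}

which is fine as long as pcpu_cpu_distance() and smp_prepare_boot_cpu() are
annotated __init as well, otherwise we'd trip section mismatch warnings.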

After the percpu areas are initialized and cpu_to_node() is correct, it 
would be really nice to be able to make numa_cpu_lookup_table[] be 
__initdata since it shouldn't be necessary anymore.  That probably has cpu 
callbacks that need to be modified to no longer look at 
numa_cpu_lookup_table[] or pass the value in, but it would make it much 
cleaner.  Then nobody will have to worry about figuring out whether 
early_cpu_to_node() or cpu_to_node() is the right one to call.

Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-08 Thread David Rientjes
On Wed, 8 Jul 2015, Nishanth Aravamudan wrote:

  It looks like the symptom is that the per-cpu areas are all allocated on 
  node
  0, is that all that goes wrong?
 
 Yes, that's the symptom. I cc'd a few folks to see if they could help
 indicate the performance implications of such a setup -- sorry, I should
 have been more explicit about that.
 

Yeah, not sure it's really a bugfix but rather a performance optimization 
since cpu_to_node() with CONFIG_USE_PERCPU_NUMA_NODE_ID is only about 
performance.

Re: [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table

2015-07-08 Thread David Rientjes
On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

 A simple move to a wrapper function to numa_cpu_lookup_table, now that
 power has the early_cpu_to_node() API.
 
 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

When early_cpu_to_node() is __init:

Acked-by: David Rientjes rient...@google.com

Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread David Rientjes
On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:

 diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
 index 0257a7d659ef..24de29b3651b 100644
 --- a/arch/powerpc/mm/numa.c
 +++ b/arch/powerpc/mm/numa.c
 @@ -958,9 +958,17 @@ void __init initmem_init(void)
  
   memblock_dump_all();
  
 + /*
 +  * zero out the possible nodes after we parse the device-tree,
 +  * so that we lower the maximum NUMA node ID to what is actually
 +  * present.
 +  */
 + nodes_clear(node_possible_map);
 +
   for_each_online_node(nid) {
   unsigned long start_pfn, end_pfn;
  
 + node_set(nid, node_possible_map);
   get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
   setup_node_data(nid, start_pfn, end_pfn);
   sparse_memory_present_with_active_regions(nid);

This seems a bit strange, node_possible_map is supposed to be a superset 
of node_online_map and this loop is iterating over node_online_map to set 
nodes in node_possible_map.

Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread David Rientjes
On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:

 So if we compare to x86:
 
 arch/x86/mm/numa.c::numa_init():
 
 nodes_clear(numa_nodes_parsed);
 nodes_clear(node_possible_map);
 nodes_clear(node_online_map);
   ...
   numa_register_memblks(...);
 
 arch/x86/mm/numa.c::numa_register_memblks():
 
   node_possible_map = numa_nodes_parsed;
 
 Basically, it looks like x86 NUMA init clears out possible map and
 online map, probably for a similar reason to what I gave in the
 changelog that by default, the possible map seems to be based off
 MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
 
 My patch was an attempt to emulate the same thing on powerpc. You are
 right that there is a window in which the node_possible_map and
 node_online_map are out of sync with my patch. It seems like it
 shouldn't matter given how early in boot we are, but perhaps the
 following would have been clearer:
 
 diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
 index 0257a7d659ef..1a118b08fad2 100644
 --- a/arch/powerpc/mm/numa.c
 +++ b/arch/powerpc/mm/numa.c
 @@ -958,6 +958,13 @@ void __init initmem_init(void)
  
 memblock_dump_all();
  
 +   /*
 +* Reduce the possible NUMA nodes to the online NUMA nodes,
 +* since we do not support node hotplug. This ensures that  we
 +* lower the maximum NUMA node ID to what is actually present.
 +*/
 +   nodes_and(node_possible_map, node_possible_map, node_online_map);

If you don't support node hotplug, then a node should always be possible 
if it's online unless there are other tricks powerpc plays with 
node_possible_map.  Shouldn't this just be 
node_possible_map = node_online_map?

 +
 for_each_online_node(nid) {
 unsigned long start_pfn, end_pfn;
  
 
 

Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

2015-03-05 Thread David Rientjes
On Fri, 6 Mar 2015, Michael Ellerman wrote:

   diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
   index 0257a7d659ef..24de29b3651b 100644
   --- a/arch/powerpc/mm/numa.c
   +++ b/arch/powerpc/mm/numa.c
   @@ -958,9 +958,17 @@ void __init initmem_init(void)

 memblock_dump_all();

   + /*
   +  * zero out the possible nodes after we parse the device-tree,
   +  * so that we lower the maximum NUMA node ID to what is actually
   +  * present.
   +  */
   + nodes_clear(node_possible_map);
   +
 for_each_online_node(nid) {
 unsigned long start_pfn, end_pfn;

   + node_set(nid, node_possible_map);
 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 setup_node_data(nid, start_pfn, end_pfn);
 sparse_memory_present_with_active_regions(nid);
  
  This seems a bit strange, node_possible_map is supposed to be a superset 
  of node_online_map and this loop is iterating over node_online_map to set 
  nodes in node_possible_map.
  
 Yeah. Though at this point in boot I don't think it matters that the two maps
 are out-of-sync temporarily.
 
 But it would simpler to just set the possible map to be the online map. That
 would also maintain the invariant that the possible map is always a superset 
 of
 the online map.
 
 Or did I miss a detail there (sleep deprived parent mode).
 

I think reset_numa_cpu_lookup_table() which iterates over the possible 
map, and thus only a subset of nodes now, may be concerning.

I'm not sure why this is being proposed as a powerpc patch and not a patch 
for mem_cgroup_css_alloc().  In other words, why do we have to allocate 
for all possible nodes?  We should only be allocating for online nodes in 
N_MEMORY with mem hotplug disabled initially and then have a mem hotplug 
callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that 
transition from memoryless -> memory.  The extra bonus is that 
alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the 
TODO in that function can be removed.
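
Roughly what I have in mind, as a completely untested sketch in
mm/memcontrol.c (names from memory, and ignoring races with concurrent
memcg creation):

static int memcg_memory_callback(struct notifier_block *self,
				 unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;
	struct mem_cgroup *memcg;
	int ret = 0;

	if (action != MEM_GOING_ONLINE || mn->status_change_nid < 0)
		return NOTIFY_OK;

	/* populate per-node info for the node that is gaining memory */
	for_each_mem_cgroup(memcg) {
		ret = alloc_mem_cgroup_per_zone_info(memcg,
						     mn->status_change_nid);
		if (ret) {
			mem_cgroup_iter_break(NULL, memcg);
			break;
		}
	}
	return notifier_from_errno(ret);
}

registered with hotplug_memory_notifier() at init time, while
mem_cgroup_css_alloc() only allocates for nodes in N_MEMORY.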

Re: [PATCH v2] slab: Fix nodeid bounds check for non-contiguous node IDs

2014-12-01 Thread David Rientjes
On Mon, 1 Dec 2014, Paul Mackerras wrote:

 The bounds check for nodeid in cache_alloc_node gives false
 positives on machines where the node IDs are not contiguous, leading
 to a panic at boot time.  For example, on a POWER8 machine the node
 IDs are typically 0, 1, 16 and 17.  This means that num_online_nodes()
 returns 4, so when cache_alloc_node is called with nodeid = 16 the
 VM_BUG_ON triggers, like this:
 
 kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
 Oops: Exception in kernel mode, sig: 5 [#1]
 SMP NR_CPUS=1024 NUMA PowerNV
 Modules linked in:
 CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.0-rc5-kvm+ #17
 task: c13ba230 ti: c1494000 task.ti: c1494000
 NIP: c0264f6c LR: c0264f5c CTR: 
 REGS: c14979a0 TRAP: 0700   Not tainted  (3.18.0-rc5-kvm+)
 MSR: 92021032 SF,HV,VEC,ME,IR,DR,RI  CR: 28000448  XER: 2000
 CFAR: c047e978 SOFTE: 0
 GPR00: c0264f5c c1497c20 c1499d48 0004
 GPR04: 0100 0010 0068 
 GPR08:  0001 082d c0cca5a8
 GPR12: 48000448 cfda 01003bd44ff0 10020578
 GPR16: 01003bd44ff8 01003bd45000 0001 
 GPR20:    0010
 GPR24: c00ffe80 c0c824ec 0068 c00ffe80
 GPR28: 0010 c00ffe80 0010 
 NIP [c0264f6c] .cache_alloc_node+0x6c/0x270
 LR [c0264f5c] .cache_alloc_node+0x5c/0x270
 Call Trace:
 [c1497c20] [c0264f5c] .cache_alloc_node+0x5c/0x270 
 (unreliable)
 [c1497cf0] [c026552c] .kmem_cache_alloc_node_trace+0xdc/0x360
 [c1497dc0] [c0c824ec] .init_list+0x3c/0x128
 [c1497e50] [c0c827b4] .kmem_cache_init+0x1dc/0x258
 [c1497ef0] [c0c54090] .start_kernel+0x2a0/0x568
 [c1497f90] [c0008c6c] start_here_common+0x20/0xa8
 Instruction dump:
 7c7d1b78 7c962378 4bda4e91 6000 3c620004 38800100 386370d8 48219959
 6000 7f83e000 7d301026 5529effe 0b09 393c0010 79291f24 7d3d4a14
 
 To fix this, we instead compare the nodeid with MAX_NUMNODES, and
 additionally make sure it isn't negative (since nodeid is an int).
 The check is there mainly to protect the array dereference in the
 get_node() call in the next line, and the array being dereferenced is
 of size MAX_NUMNODES.  If the nodeid is in range but invalid (for
 example if the node is off-line), the BUG_ON in the next line will
 catch that.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Acked-by: David Rientjes rient...@google.com

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-22 Thread David Rientjes
On Tue, 22 Jul 2014, Nishanth Aravamudan wrote:

  I think there's two use cases of interest:
  
   - allocating from a memoryless node where numa_node_id() is memoryless, 
 and
  
   - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
  
  I believe the first should have its own node_zonelist[0], whether it's 
  memoryless or not, that points to a list of zones that start with those 
  with the smallest distance.
 
 Ok, and that would be used for falling back in the appropriate priority?
 

There's no real fallback since there's never a case when you can allocate 
on a memoryless node.  The zonelist defines the appropriate order in which 
to try to allocate from zones, so it depends on things like the 
numa_node_id() in alloc_pages_current() and whether the zonelist for a 
memoryless node is properly initialized or whether this needs to be 
numa_mem_id().  It depends on the intended behavior of calling 
alloc_pages_{node,vma}() with a memoryless node, the complexity of 
(re-)building the zonelists at bootstrap and for memory hotplug isn't a 
hotpath.

This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL 
is set.

  I think its own node_zonelist[1], for __GFP_THISNODE allocations,
  should point to the node with present memory that has the smallest
  distance.
 
 And so would this, but with the caveat that we can fail here and don't
 go further? Semantically, __GFP_THISNODE then means as close as
 physically possible ignoring run-time memory constraints. I say that
 because obviously we might get off-node memory without memoryless nodes,
 but that shouldn't be used to satisfy __GPF_THISNODE allocations.
 

alloc_pages_current() substitutes any existing mempolicy for the default 
local policy when __GFP_THISNODE is set, and that would require local 
allocation.  That, currently, is numa_node_id() and not numa_mem_id().

The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so 
it will allocate remotely anyway.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-07-21 Thread David Rientjes
On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:

 Sorry for bringing up this old thread again, but I had a question for
 you, David. node_to_mem_node(), which does seem like a useful API,
 doesn't seem like it can just node_distance() solely, right? Because
 that just tells us the relative cost (or so I think about it) of using
 resources from that node. But we also need to know if that node itself
 has memory, etc. So using the zonelists is required no matter what? And
 upon memory hotplug (or unplug), the topology can change in a way that
 affects things, so node online time isn't right either?
 

I think there's two use cases of interest:

 - allocating from a memoryless node where numa_node_id() is memoryless, 
   and

 - using node_to_mem_node() for a possibly-memoryless node for kmalloc().

I believe the first should have its own node_zonelist[0], whether it's 
memoryless or not, that points to a list of zones that start with those 
with the smallest distance.  I think its own node_zonelist[1], for 
__GFP_THISNODE allocations, should point to the node with present memory 
that has the smallest distance.

For sure node_zonelist[0] cannot be NULL since things like 
first_online_pgdat() would break and it should be unnecessary to do 
node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
since the zonelists should already be defined properly.  All nodes, 
regardless of whether they have memory or not, should probably end up 
having a struct pglist_data unless there's a reason for another level of 
indirection.

Re: [PATCH] powerpc: Fix build warning

2014-06-16 Thread David Rientjes
On Fri, 13 Jun 2014, Guenter Roeck wrote:

 If compiled with W=1, the following warning is seen in powerpc builds.
 
 arch/powerpc/kernel/smp.c:750:18: warning:
   type qualifiers ignored on function return type
 static const int powerpc_smt_flags(void)
  ^
 
 This is caused by a function returning 'const int', which doesn't
 make sense to gcc. Drop 'const' to fix the problem.
 
 Reported-by: Vincent Guittot vincent.guit...@linaro.org
 Signed-off-by: Guenter Roeck li...@roeck-us.net

Acked-by: David Rientjes rient...@google.com

Although it's strange you report this happening on line 750 in the 
changelog but the patch shows it differently.

 ---
  arch/powerpc/kernel/smp.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
 index 10e..49d5d4e 100644
 --- a/arch/powerpc/kernel/smp.c
 +++ b/arch/powerpc/kernel/smp.c
 @@ -768,7 +768,7 @@ int setup_profiling_timer(unsigned int multiplier)
  
  #ifdef CONFIG_SCHED_SMT
  /* cpumask of CPUs with asymetric SMT dependancy */
 -static const int powerpc_smt_flags(void)
 +static int powerpc_smt_flags(void)
  {
   int flags = SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
  

Re: NUMA topology question wrt. d4edc5b6

2014-06-09 Thread David Rientjes
On Fri, 23 May 2014, Srivatsa S. Bhat wrote:

 diff --git a/arch/powerpc/include/asm/topology.h 
 b/arch/powerpc/include/asm/topology.h
 index c920215..58e6469 100644
 --- a/arch/powerpc/include/asm/topology.h
 +++ b/arch/powerpc/include/asm/topology.h
 @@ -18,6 +18,7 @@ struct device_node;
   */
  #define RECLAIM_DISTANCE 10
  
 +#include <linux/nodemask.h>
  #include <asm/mmzone.h>
  
  static inline int cpu_to_node(int cpu)
 @@ -30,7 +31,7 @@ static inline int cpu_to_node(int cpu)
* During early boot, the numa-cpu lookup table might not have been
* setup for all CPUs yet. In such cases, default to node 0.
*/
 - return (nid < 0) ? 0 : nid;
 + return (nid < 0) ? first_online_node : nid;
  }
  
  #define parent_node(node)(node)

I wonder what would happen on ppc if we just returned NUMA_NO_NODE here 
for cpus that have not been mapped (they shouldn't even be possible).  
This would at least allow callers that do
kmalloc_node(..., cpu_to_node(cpu)) to be allocated on the local cpu 
rather than on a perhaps offline or remote node 0.

It would seem better to catch callers that do 
cpu_to_node(not-possible-cpu) rather than blindly return an online node.
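
i.e. (sketch only):

static inline int cpu_to_node(int cpu)
{
	int nid = numa_cpu_lookup_table[cpu];

	/* report unmapped cpus instead of silently using node 0 */
	return (nid < 0) ? NUMA_NO_NODE : nid;
}

so kmalloc_node(..., cpu_to_node(cpu)) falls back to local allocation and
bogus callers actually become visible.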

Re: Node 0 not necessary for powerpc?

2014-06-09 Thread David Rientjes
On Wed, 21 May 2014, Nishanth Aravamudan wrote:

 For context: I was looking at why N_ONLINE was statically setting Node 0
 to be online, whether or not the topology is that way -- I've been
 getting several bugs lately where Node 0 is online, but has no CPUs and
 no memory on it, on powerpc. 
 
 On powerpc, setup_per_cpu_areas calls into ___alloc_bootmem_node using
 NODE_DATA(cpu_to_node(cpu)).
 
 Currently, cpu_to_node() in arch/powerpc/include/asm/topology.h does:
 
 /*
  * During early boot, the numa-cpu lookup table might not have been
  * setup for all CPUs yet. In such cases, default to node 0.
  */
 return (nid < 0) ? 0 : nid;
 
 And so early at boot, if node 0 is not present, we end up accessing an
 uninitialized NODE_DATA(). So this seems buggy (I'll contact the powerpc
 developers separately on that).
 

I think what this really wants to do is NODE_DATA(cpu_to_mem(cpu)) and I 
thought ppc had the cpu-to-local-memory-node mappings correct?
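
i.e. in the percpu first-chunk allocator, something like (untested, modulo
the early-boot ordering caveat of cpu_to_mem() itself):

static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
	return __alloc_bootmem_node(NODE_DATA(cpu_to_mem(cpu)), size, align,
				    __pa(MAX_DMA_ADDRESS));
}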

Re: [RESEND PATCH] slub: search partial list on numa_mem_id(), instead of numa_node_id()

2014-06-04 Thread David Rientjes
On Wed, 21 May 2014, Joonsoo Kim wrote:

 Currently, if allocation constraint to node is NUMA_NO_NODE, we search
 a partial slab on numa_node_id() node. This doesn't work properly on the
 system having memoryless node, since it can have no memory on that node and
 there must be no partial slab on that node.
 
 On that node, page allocation always fallback to numa_mem_id() first. So
 searching a partial slab on numa_node_id() in that case is proper solution
 for memoryless node case.
 
 Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 Acked-by: David Rientjes rient...@google.com
 Acked-by: Christoph Lameter c...@linux.com
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/mm/slub.c b/mm/slub.c
 index 545a170..cc1f995 100644
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -1698,7 +1698,7 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
 flags, int node,
   struct kmem_cache_cpu *c)
  {
   void *object;
 - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
 + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
  
   object = get_partial_node(s, get_node(s, searchnode), c, flags);
   if (object || node != NUMA_NO_NODE)

Andrew, can you merge this please?  It's still not in linux-next.

Re: Node 0 not necessary for powerpc?

2014-03-11 Thread David Rientjes
On Tue, 11 Mar 2014, Nishanth Aravamudan wrote:

 I have a P7 system that has no node0, but a node0 shows up in numactl
 --hardware, which has no cpus and no memory (and no PCI devices):
 
 numactl --hardware
 available: 4 nodes (0-3)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
 node 1 size: 0 MB
 node 1 free: 0 MB
 node 2 cpus:
 node 2 size: 7935 MB
 node 2 free: 7716 MB
 node 3 cpus:
 node 3 size: 8395 MB
 node 3 free: 8015 MB
 node distances:
 node   0   1   2   3 
   0:  10  20  10  20 
   1:  20  10  20  20 
   2:  10  20  10  20 
   3:  20  20  20  10 
 
 This is because we statically initialize N_ONLINE to be [0] in
 mm/page_alloc.c:
 
 [N_ONLINE] = { { [0] = 1UL } },
 
 I'm not sure what the architectural requirements are here, but at least
 on this test system, removing this initialization, it boots fine and is
 running. I've not yet tried stress tests, but it's survived the
 beginnings of kernbench so far.
 
 numactl --hardware
 available: 3 nodes (1-3)
 node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
 node 1 size: 0 MB
 node 1 free: 0 MB
 node 2 cpus:
 node 2 size: 7935 MB
 node 2 free: 7479 MB
 node 3 cpus:
 node 3 size: 8396 MB
 node 3 free: 8375 MB
 node distances:
 node   1   2   3 
   1:  10  20  20 
   2:  20  10  20 
   3:  20  20  10
 
 Perhaps we could put in a ARCH_DOES_NOT_NEED_NODE0 and only define it on
 powerpc for now, conditionalizing the above initialization on that?
 

I don't know if anything has recently changed in the past year or so, but 
I've booted x86 machines with a hacked BIOS so that all memory on node 0 
is hotpluggable and offline, so I believe this is possible on x86 as well.

[patch] mm, hotplug: avoid compiling memory hotremove functions when disabled

2013-04-10 Thread David Rientjes
__remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE.  PowerPC
pseries will return -EOPNOTSUPP if unsupported.

Adding an #ifdef causes several other functions it depends on to also
become unnecessary, which saves in .text when disabled (it's disabled in
most defconfigs besides powerpc, including x86).  remove_memory_block()
becomes static since it is not referenced outside of
drivers/base/memory.c.

Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
and disabled.

Signed-off-by: David Rientjes rient...@google.com
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 12 +
 drivers/base/memory.c   | 44 +++
 include/linux/memory.h  |  3 +-
 include/linux/memory_hotplug.h  |  4 +-
 mm/memory_hotplug.c | 68 +++
 mm/sparse.c | 72 +
 6 files changed, 113 insertions(+), 90 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -72,6 +72,7 @@ unsigned long memory_block_size_bytes(void)
return get_memblock_size();
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
 static int pseries_remove_memblock(unsigned long base, unsigned int 
memblock_size)
 {
unsigned long start, start_pfn;
@@ -153,6 +154,17 @@ static int pseries_remove_memory(struct device_node *np)
ret = pseries_remove_memblock(base, lmb_size);
return ret;
 }
+#else
+static inline int pseries_remove_memblock(unsigned long base,
+ unsigned int memblock_size)
+{
+   return -EOPNOTSUPP;
+}
+static inline int pseries_remove_memory(struct device_node *np)
+{
+   return -EOPNOTSUPP;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 static int pseries_add_memory(struct device_node *np)
 {
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -93,16 +93,6 @@ int register_memory(struct memory_block *memory)
return error;
 }
 
-static void
-unregister_memory(struct memory_block *memory)
-{
-   BUG_ON(memory->dev.bus != &memory_subsys);
-
-   /* drop the ref. we got in remove_memory_block() */
-   kobject_put(&memory->dev.kobj);
-   device_unregister(&memory->dev);
-}
-
 unsigned long __weak memory_block_size_bytes(void)
 {
return MIN_MEMORY_BLOCK_SIZE;
@@ -637,8 +627,28 @@ static int add_memory_section(int nid, struct mem_section 
*section,
return ret;
 }
 
-int remove_memory_block(unsigned long node_id, struct mem_section *section,
-   int phys_device)
+/*
+ * need an interface for the VM to add new memory regions,
+ * but without onlining it.
+ */
+int register_new_memory(int nid, struct mem_section *section)
+{
+   return add_memory_section(nid, section, NULL, MEM_OFFLINE, HOTPLUG);
+}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static void
+unregister_memory(struct memory_block *memory)
+{
+   BUG_ON(memory->dev.bus != &memory_subsys);
+
+   /* drop the ref. we got in remove_memory_block() */
+   kobject_put(&memory->dev.kobj);
+   device_unregister(&memory->dev);
+}
+
+static int remove_memory_block(unsigned long node_id,
+  struct mem_section *section, int phys_device)
 {
struct memory_block *mem;
 
@@ -661,15 +671,6 @@ int remove_memory_block(unsigned long node_id, struct 
mem_section *section,
return 0;
 }
 
-/*
- * need an interface for the VM to add new memory regions,
- * but without onlining it.
- */
-int register_new_memory(int nid, struct mem_section *section)
-{
-   return add_memory_section(nid, section, NULL, MEM_OFFLINE, HOTPLUG);
-}
-
 int unregister_memory_section(struct mem_section *section)
 {
if (!present_section(section))
@@ -677,6 +678,7 @@ int unregister_memory_section(struct mem_section *section)
 
return remove_memory_block(0, section, 0);
 }
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /*
  * offline one memory block. If the memory block has been offlined, do nothing.
diff --git a/include/linux/memory.h b/include/linux/memory.h
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -114,9 +114,10 @@ extern void unregister_memory_notifier(struct 
notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 extern int register_new_memory(int, struct mem_section *);
+#ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
+#endif
 extern int memory_dev_init(void);
-extern int remove_memory_block(unsigned long, struct mem_section *, int);
 extern int memory_notify(unsigned long val, void *v);
 extern int memory_isolate_notify(unsigned long val, void *v);
 extern

[patch 4/4] mm, oom: remove statically defined arch functions of same name

2012-11-14 Thread David Rientjes
out_of_memory() is a globally defined function to call the oom killer.
x86, sh, and powerpc all use a function of the same name within file
scope in their respective fault.c unnecessarily.  Inline the functions
into the pagefault handlers to clean the code up.

Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Paul Mundt let...@linux-sh.org
Signed-off-by: David Rientjes rient...@google.com
---
 arch/powerpc/mm/fault.c |   27 ---
 arch/sh/mm/fault.c  |   19 +++
 arch/x86/mm/fault.c |   23 ---
 3 files changed, 27 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -113,19 +113,6 @@ static int store_updates_sp(struct pt_regs *regs)
 #define MM_FAULT_CONTINUE  -1
 #define MM_FAULT_ERR(sig)  (sig)
 
-static int out_of_memory(struct pt_regs *regs)
-{
-   /*
-* We ran out of memory, or some other thing happened to us that made
-* us unable to handle the page fault gracefully.
-*/
-   up_read(&current->mm->mmap_sem);
-   if (!user_mode(regs))
-   return MM_FAULT_ERR(SIGKILL);
-   pagefault_out_of_memory();
-   return MM_FAULT_RETURN;
-}
-
 static int do_sigbus(struct pt_regs *regs, unsigned long address)
 {
siginfo_t info;
@@ -169,8 +156,18 @@ static int mm_fault_error(struct pt_regs *regs, unsigned 
long addr, int fault)
return MM_FAULT_CONTINUE;
 
/* Out of memory */
-   if (fault & VM_FAULT_OOM)
-   return out_of_memory(regs);
+   if (fault & VM_FAULT_OOM) {
+   up_read(&current->mm->mmap_sem);
+
+   /*
+* We ran out of memory, or some other thing happened to us that
+* made us unable to handle the page fault gracefully.
+*/
+   if (!user_mode(regs))
+   return MM_FAULT_ERR(SIGKILL);
+   pagefault_out_of_memory();
+   return MM_FAULT_RETURN;
+   }
 
/* Bus error. x86 handles HWPOISON here, we'll add this if/when
 * we support the feature in HW
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -301,17 +301,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long 
error_code,
__bad_area(regs, error_code, address, SEGV_ACCERR);
 }
 
-static void out_of_memory(void)
-{
-   /*
-* We ran out of memory, call the OOM killer, and return the userspace
-* (which will retry the fault, or kill us if we got oom-killed):
-*/
-   up_read(&current->mm->mmap_sem);
-
-   pagefault_out_of_memory();
-}
-
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long 
address)
 {
@@ -353,8 +342,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long 
error_code,
no_context(regs, error_code, address);
return 1;
}
+   up_read(&current->mm->mmap_sem);
 
-   out_of_memory();
+   /*
+* We ran out of memory, call the OOM killer, and return the
+* userspace (which will retry the fault, or kill us if we got
+* oom-killed):
+*/
+   pagefault_out_of_memory();
} else {
   if (fault & VM_FAULT_SIGBUS)
do_sigbus(regs, error_code, address);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -803,20 +803,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long 
error_code,
__bad_area(regs, error_code, address, SEGV_ACCERR);
 }
 
-/* TODO: fixup for mm-invoke-oom-killer-from-page-fault.patch */
-static void
-out_of_memory(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
-{
-   /*
-* We ran out of memory, call the OOM killer, and return the userspace
-* (which will retry the fault, or kill us if we got oom-killed):
-*/
-   up_read(&current->mm->mmap_sem);
-
-   pagefault_out_of_memory();
-}
-
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long 
address,
  unsigned int fault)
@@ -879,7 +865,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long 
error_code,
return 1;
}
 
-   out_of_memory(regs, error_code, address);
+   up_read(&current->mm->mmap_sem);
+
+   /*
+* We ran out of memory, call the OOM killer, and return the
+* userspace (which will retry the fault, or kill us if we got
+* oom-killed

Re: [RFC PATCH 1/12] memory-hotplug : rename remove_memory to offline_memory

2012-06-27 Thread David Rientjes
On Wed, 27 Jun 2012, Yasuaki Ishimatsu wrote:

 remove_memory() does not remove memory but just offlines memory. The patch
 changes name of it to offline_memory().
 

The kernel is never going to physically remove the memory itself, so I 
don't see the big problem with calling it remove_memory().  If you're 
going to change it to offline_memory(), which is just as good but not 
better, then I'd suggest changing add_memory() to online_memory() for 
completeness.


Re: [RFC PATCH 2/12] memory-hogplug : check memory offline in offline_pages

2012-06-27 Thread David Rientjes
On Wed, 27 Jun 2012, Yasuaki Ishimatsu wrote:

 Index: linux-3.5-rc4/mm/memory_hotplug.c
 ===
 --- linux-3.5-rc4.orig/mm/memory_hotplug.c2012-06-26 13:28:16.743211538 
 +0900
 +++ linux-3.5-rc4/mm/memory_hotplug.c 2012-06-26 13:48:38.264940468 +0900
 @@ -887,6 +887,11 @@ static int __ref offline_pages(unsigned
 
   lock_memory_hotplug();
 
 + if (memory_is_offline(start_pfn, end_pfn)) {
 + ret = 0;
 + goto out;
 + }
 +
   zone = page_zone(pfn_to_page(start_pfn));
   node = zone_to_nid(zone);
   nr_pages = end_pfn - start_pfn;

Are there additional prerequisites for this patch?  Otherwise it changes 
the return value of offline_memory() which will now call 
acpi_memory_powerdown_device() in the acpi memhotplug case when disabling.  
Is that a problem?


[patch] sched, numa: Allocate node_queue on any node for offline nodes

2012-05-22 Thread David Rientjes
struct node_queue must be allocated with NUMA_NO_NODE for nodes that are 
not (yet) online, otherwise the page allocator has a bad zonelist and 
results in an early crash.

Tested-by: Stephen Rothwell s...@canb.auug.org.au
Signed-off-by: David Rientjes rient...@google.com
---
 kernel/sched/numa.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -885,7 +885,8 @@ static __init int numa_init(void)
 
for_each_node(node) {
struct node_queue *nq = kmalloc_node(sizeof(*nq),
-   GFP_KERNEL | __GFP_ZERO, node);
+   GFP_KERNEL | __GFP_ZERO,
+   node_online(node) ? node : NUMA_NO_NODE);
BUG_ON(!nq);
 
   spin_lock_init(&nq->lock);


Re: linux-next: PowerPC boot failures in next-20120521

2012-05-21 Thread David Rientjes
On Tue, 22 May 2012, Stephen Rothwell wrote:

 Unable to handle kernel paging request for data at address 0x1688
 Faulting instruction address: 0xc016e154
 Oops: Kernel access of bad area, sig: 11 [#1]
 SMP NR_CPUS=32 NUMA pSeries
 Modules linked in:
 NIP: c016e154 LR: c01b9140 CTR: 
 REGS: c003fc8c76d0 TRAP: 0300   Not tainted  (3.4.0-autokern1)
 MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 24044022  XER: 0003
 SOFTE: 1
 CFAR: 562c
 DAR: 1688, DSISR: 4000
 TASK = c003fc8c8000[1] 'swapper/0' THREAD: c003fc8c4000 CPU: 0
 GPR00:  c003fc8c7950 c0d05b30 12d0 
 GPR04:  1680  c003fe032f60 
 GPR08: 000400540001  c980 c0d24fe0 
 GPR12: 24044024 cf33b000 01a3fa78 009bac00 
 GPR16: 00e1f338 02d513f0 1680  
 GPR20: 0001 c003fc8c7c00  0001 
 GPR24: 0001 c0d1b490  1680 
 GPR28:   c0c7ce58 c003fe009200 
 NIP [c016e154] .__alloc_pages_nodemask+0xc4/0x8f0
 LR [c01b9140] .new_slab+0xd0/0x3c0
 Call Trace:
 [c003fc8c7950] [2e6e756d615f696e] 0x2e6e756d615f696e (unreliable)
 [c003fc8c7ae0] [c01b9140] .new_slab+0xd0/0x3c0
 [c003fc8c7b90] [c01b9844] .__slab_alloc+0x254/0x5b0
 [c003fc8c7cd0] [c01bb7a4] .kmem_cache_alloc_node_trace+0x94/0x260
 [c003fc8c7d80] [c0ba36d0] .numa_init+0x98/0x1dc
 [c003fc8c7e10] [c000ace4] .do_one_initcall+0x1a4/0x1e0
 [c003fc8c7ed0] [c0b7b354] .kernel_init+0x124/0x2e0
 [c003fc8c7f90] [c00211c8] .kernel_thread+0x54/0x70
 Instruction dump:
 5400d97e 7b170020 0b00 eb3e8000 3b80 80190088 2f80 40de0014 
 7860efe2 787c6fe2 78000fa4 7f9c0378 e81b0008 83f9 2fa0 7fff1838 
 ---[ end trace 31fd0ba7d8756001 ]---
 
 swapper/0 (1) used greatest stack depth: 10864 bytes left
 Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b
 
 I may be completely wrong, but I guess the obvious target would be the
 sched/numa branch that came in via the tip tree.
 
 Config file attached.  I haven't had a chance to try to bisect this yet.
 
 Anyone have any ideas?

Yeah, it's sched/numa since that's what introduced numa_init().  It does a 
kmalloc_node() for every node via for_each_node() even though the node may 
not be online.  Slub ends up passing this node to the page allocator 
through alloc_pages_exact_node().  CONFIG_DEBUG_VM would have caught this, 
and your config confirms it's not enabled.

sched/numa either needs a memory hotplug notifier or it needs to pass 
NUMA_NO_NODE for nodes that aren't online.  Until we get the former, the 
following should fix it.


sched, numa: Allocate node_queue on any node for offline nodes

struct node_queue must be allocated with NUMA_NO_NODE for nodes that are 
not (yet) online, otherwise the page allocator has a bad zonelist.

Signed-off-by: David Rientjes rient...@google.com
---
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -885,7 +885,8 @@ static __init int numa_init(void)
 
for_each_node(node) {
struct node_queue *nq = kmalloc_node(sizeof(*nq),
-   GFP_KERNEL | __GFP_ZERO, node);
+   GFP_KERNEL | __GFP_ZERO,
+   node_online(node) ? node : NUMA_NO_NODE);
BUG_ON(!nq);
 
   spin_lock_init(&nq->lock);


Re: linux-next: PowerPC boot failures in next-20120521

2012-05-21 Thread David Rientjes
On Tue, 22 May 2012, Michael Neuling wrote:

 console [tty0] enabled
 console [hvc0] enabled
 pid_max: default: 32768 minimum: 301
 Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
 Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
 Mount-cache hash table entries: 4096
 Initializing cgroup subsys cpuacct
 Initializing cgroup subsys devices
 Initializing cgroup subsys freezer
 POWER7 performance monitor hardware support registered
 Unable to handle kernel paging request for data at address 0x1388
 Faulting instruction address: 0xc014a070
 Oops: Kernel access of bad area, sig: 11 [#1]
 SMP NR_CPUS=1024 NUMA pSeries
 Modules linked in:
 NIP: c014a070 LR: c01978cc CTR: c00b6870
 REGS: c0007e5836b0 TRAP: 0300   Tainted: GW (3.4.0-rc6-mikey)
 MSR: 90009032 SF,HV,EE,ME,IR,DR,RI  CR: 28004022  XER: 0200
 SOFTE: 1
 CFAR: 50fc
 DAR: 1388, DSISR: 4000
 TASK = c0007e56[1] 'swapper/0' THREAD: c0007e58 CPU: 0
 GPR00:  c0007e583930 c0c034d8 12d0 
 GPR04:  1380  0001 
 GPR08: c0007e0dff60  c0ca05a0  
 GPR12: 28004024 cff2   
 GPR16:   0001 1380 
 GPR20: 0001 c0e14900 c0e148f0 0001 
 GPR24: c0c6f378  1380 02aa 
 GPR28:   c0b576b0 c0007e021200 
 NIP [c014a070] .__alloc_pages_nodemask+0xd0/0x910
 LR [c01978cc] .new_slab+0xcc/0x3d0
 Call Trace:
 [c0007e583930] [c0007e5839c0] 0xc0007e5839c0 (unreliable)
 [c0007e583ac0] [c01978cc] .new_slab+0xcc/0x3d0
 [c0007e583b70] [c072ae98] .__slab_alloc+0x38c/0x4f8
 [c0007e583cb0] [c0198190] .kmem_cache_alloc_node_trace+0x90/0x260
 [c0007e583d60] [c0a5a404] .numa_init+0x9c/0x188
 [c0007e583e00] [c000aa30] .do_one_initcall+0x60/0x1e0
 [c0007e583ec0] [c0a40b60] .kernel_init+0x128/0x294
 [c0007e583f90] [c0020788] .kernel_thread+0x54/0x70
 Instruction dump:
 0b00 eb1e8000 3b80 801800a8 2f80 409e001c 7860efe3 3800 
 41820008 3802 787c6fe2 7f9c0378 e93a0008 801800a4 3b60 2fa9 
 ---[ end trace 31fd0ba7d8756002 ]---
 
 Which seems to be this code in __alloc_pages_nodemask
 ---
 /*
  * Check the zones suitable for the gfp_mask contain at least one
  * valid zone. It's possible to have an empty zonelist as a result
  * of GFP_THISNODE and a memoryless node
  */
 if (unlikely(!zonelist->_zonerefs->zone))
 c014a070:   e9 3a 00 08 ld  r9,8(r26)
 ---
 
 r26 is coming from r5 which is the struct zonelist *zonelist parameter
 to __alloc_pages_nodemask.  Having 1380 in there is clearly
 a bogus pointer.
 
 Bisecting it points to b4cdf91668c27a5a6a5a3ed4234756c042dd8288
   b4cdf91 sched/numa: Implement numa balancer
 
 Trying David's patch just posted doesn't fix it.
 

Hmm, what does CONFIG_DEBUG_VM say?


Re: linux-next: PowerPC boot failures in next-20120521

2012-05-21 Thread David Rientjes
On Tue, 22 May 2012, Michael Neuling wrote:

Trying David's patch just posted doesn't fix it.

   
   Hmm, what does CONFIG_DEBUG_VM say?
  
  No set.
 
 Sorry, should have read Not set
 

I mean if it's set, what does it emit to the kernel log with my patch 
applied?

I made CONFIG_DEBUG_VM catch !node_online(node) about six months ago, so I 
was thinking it would have caught this if either you or Stephen had enabled it.


Re: linux-next: PowerPC boot failures in next-20120521

2012-05-21 Thread David Rientjes
On Tue, 22 May 2012, Michael Neuling wrote:

 Sorry, got it... CONFIG_DEBUG_VM enabled below...
 
 pid_max: default: 32768 minimum: 301
 Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes)
 Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes)
 Mount-cache hash table entries: 4096
 Initializing cgroup subsys cpuacct
 Initializing cgroup subsys devices
 Initializing cgroup subsys freezer
 POWER7 performance monitor hardware support registered
 [ cut here ]
 kernel BUG at /scratch/mikey/src/linux-next/include/linux/gfp.h:318!

Yeah, this is what I was expecting; it's tripping on

VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

and slub won't pass nid < 0.  You're sure my patch is applied? :)


[patch] powerpc, mm: fix section mismatch for mark_reserved_regions_for_nid

2011-12-08 Thread David Rientjes
On Fri, 9 Dec 2011, Subrata Modak wrote:

 WARNING: vmlinux.o(.text+0x4c760): Section mismatch in reference from
 the function .mark_reserved_regions_for_nid() to the
 function .meminit.text:.early_pfn_to_nid()
 The function .mark_reserved_regions_for_nid() references
 the function __meminit .early_pfn_to_nid().
 This is often because .mark_reserved_regions_for_nid lacks a __meminit 
 annotation or the annotation of .early_pfn_to_nid is wrong.
 
 WARNING: vmlinux.o(.text+0x4c780): Section mismatch in reference from
 the function .mark_reserved_regions_for_nid() to the
 function .init.text:.work_with_active_regions()
 The function .mark_reserved_regions_for_nid() references
 the function __init .work_with_active_regions().
 This is often because .mark_reserved_regions_for_nid lacks a __init 
 annotation or the annotation of .work_with_active_regions is wrong.
 
 WARNING: vmlinux.o(.text+0x4c7d4): Section mismatch in reference from
 the function .mark_reserved_regions_for_nid() to the
 function .meminit.text:.early_pfn_to_nid()
 The function .mark_reserved_regions_for_nid() references
 the function __meminit .early_pfn_to_nid().
 This is often because .mark_reserved_regions_for_nid lacks a __meminit 
 annotation or the annotation of .early_pfn_to_nid is wrong.
 
 WARNING: vmlinux.o(.text+0x4c7f0): Section mismatch in reference from
 the function .mark_reserved_regions_for_nid() to the
 function .init.text:.work_with_active_regions()
 The function .mark_reserved_regions_for_nid() references
 the function __init .work_with_active_regions().
 This is often because .mark_reserved_regions_for_nid lacks a __init 
 annotation or the annotation of .work_with_active_regions is wrong.
 
 WARNING: vmlinux.o(.text+0x4c828): Section mismatch in reference from
 the function .mark_reserved_regions_for_nid() to the
 function .init.text:.reserve_bootmem_node()
 The function .mark_reserved_regions_for_nid() references
 the function __init .reserve_bootmem_node().
 This is often because .mark_reserved_regions_for_nid lacks a __init 
 annotation or the annotation of .reserve_bootmem_node is wrong.
 

Wow, lots of ibm folks on the cc :)  I can only talk about the mm related 
section mismatches, but these five can easily be solved with the 
following.


powerpc, mm: fix section mismatch for mark_reserved_regions_for_nid

mark_reserved_regions_for_nid() is only called from do_init_bootmem(), 
which is in .init.text, so it must be in the same section to avoid a 
section mismatch warning.

Reported-by: Subrata Modak subr...@linux.vnet.ibm.com
Signed-off-by: David Rientjes rient...@google.com
---
 arch/powerpc/mm/numa.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -969,7 +969,7 @@ static struct notifier_block __cpuinitdata ppc64_numa_nb = {
.priority = 1 /* Must run before sched domains notifier. */
 };
 
-static void mark_reserved_regions_for_nid(int nid)
+static void __init mark_reserved_regions_for_nid(int nid)
 {
struct pglist_data *node = NODE_DATA(nid);
struct memblock_region *reg;
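
As a generic illustration of the pattern modpost complains about (made-up
names, not the powerpc code itself):

	static int __init setup_foo(void)	/* .init.text, discarded after boot */
	{
		return 0;
	}

	static void bar(void)			/* plain .text, kept */
	{
		setup_foo();	/* .text referencing .init.text -> section mismatch */
	}

Marking bar() __init, as the patch above does for
mark_reserved_regions_for_nid(), puts caller and callee in the same section
and the warning goes away.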


[patch] powerpc, mm: fix section mismatch for read_n_cells

2011-12-08 Thread David Rientjes
read_n_cells() cannot be marked as .devinit.text since it is referenced 
from two functions that are not in that section: of_get_lmb_size() and 
hot_add_drconf_scn_to_nid().

Signed-off-by: David Rientjes rient...@google.com
---
 arch/powerpc/mm/numa.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -406,7 +406,7 @@ static void __init get_n_mem_cells(int *n_addr_cells, int 
*n_size_cells)
of_node_put(memory);
 }
 
-static unsigned long __devinit read_n_cells(int n, const unsigned int **buf)
+static unsigned long read_n_cells(int n, const unsigned int **buf)
 {
unsigned long result = 0;
 


Re: Badness with the kernel version 2.6.35-rc1-git1 running on P6 box

2010-07-16 Thread David Rientjes
On Fri, 16 Jul 2010, Dave Hansen wrote:

   SLUB: Unable to allocate memory on node -1 (gfp=0x20)
  cache: kmalloc-16384, object size: 16384, buffer size: 16384,
  default order: 2, min order: 0
  node 0: slabs: 28, objs: 292, free: 0
   ip: page allocation failure. order:0, mode:0x8020
   Call Trace:
   [c6a0eb40] [c0011c30] .show_stack+0x6c/0x16c (unreliable)
   [c6a0ebf0] [c012129c] .__alloc_pages_nodemask+0x6a0/0x75c
   [c6a0ed70] [c01527cc] .alloc_pages_current+0xc4/0x104
   [c6a0ee10] [c011fca4] .__get_free_pages+0x18/0x90
   [c6a0ee90] [c04f7058] .ehea_get_stats+0x4c/0x1bc
   [c6a0ef30] [c05a0a04] .dev_get_stats+0x38/0x64
   [c6a0efc0] [c05b456c] .rtnl_fill_ifinfo+0x35c/0x85c
   [c6a0f150] [c05b5920] .rtmsg_ifinfo+0x164/0x204
   [c6a0f210] [c05a6d6c] .dev_change_flags+0x4c/0x7c
   [c6a0f2a0] [c05b50b4] .do_setlink+0x31c/0x750
   [c6a0f3b0] [c05b6724] .rtnl_newlink+0x388/0x618
   [c6a0f5f0] [c05b6350] .rtnetlink_rcv_msg+0x268/0x2b4
   [c6a0f6a0] [c05cfdc0] .netlink_rcv_skb+0x74/0x108
   [c6a0f730] [c05b60c4] .rtnetlink_rcv+0x38/0x5c
   [c6a0f7c0] [c05cf8c8] .netlink_unicast+0x318/0x3f4
   [c6a0f890] [c05d05b4] .netlink_sendmsg+0x2d0/0x310
   [c6a0f970] [c058e1e8] .sock_sendmsg+0xd4/0x110
   [c6a0fb50] [c058e514] .SyS_sendmsg+0x1f4/0x288
   [c6a0fd70] [c058c2b8] .SyS_socketcall+0x214/0x280
   [c6a0fe30] [c00085b4] syscall_exit+0x0/0x40
   Mem-Info:
   Node 0 DMA per-cpu:
   CPU0: hi:0, btch:   1 usd:   0
   CPU1: hi:0, btch:   1 usd:   0
   CPU2: hi:0, btch:   1 usd:   0
   CPU3: hi:0, btch:   1 usd:   0
   
   The mainline 2.6.35-rc5 worked fine.
  
  Maybe you were lucky with 2.6.35-rc5
  
  Anyway ehea should not use GFP_ATOMIC in its ehea_get_stats() method,
  called in process context, but GFP_KERNEL.
  
  Another patch is needed for ehea_refill_rq_def() as well.
 
 You're right that this is abusing GFP_ATOMIC.
 
 But is, this is just a normal GFP_ATOMIC allocation failure?  SLUB:
 Unable to allocate memory on node -1 seems like a somewhat
 inappropriate error message for that.  
 

The slub message is separate and doesn't generate a call trace, even 
though it is a (minimum) order-0 GFP_ATOMIC allocation as well.  The page 
allocation failure is a separate instance that is calling the page 
allocator, not the slab allocator.

 It isn't immediately obvious where the -1 is coming from.  Does it truly
 mean allocate from any node here, or is that a buglet in and of
 itself?
 

Yes, slub uses -1 to indicate that the allocation need not come from a 
specific node.
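
For illustration only (this is the generic pattern, not the ehea code
itself; the -1 is what current kernels spell NUMA_NO_NODE):

	/* no node preference -- behaves like a plain kmalloc() */
	void *buf = kmalloc_node(size, GFP_KERNEL, -1);

And since ehea_get_stats() is called from process context, a sleeping
GFP_KERNEL allocation like that is what it should be using rather than
GFP_ATOMIC.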


Re: [patch 11/14] powerpc: invoke oom-killer from page fault

2010-04-22 Thread David Rientjes
On Fri, 23 Apr 2010, npig...@suse.de wrote:

 As explained in commit 1c0fe6e3bd, we want to call the architecture 
 independent
 oom killer when getting an unexplained OOM from handle_mm_fault, rather than
 simply killing current.
 
 Cc: linuxppc-...@ozlabs.org
 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
 Cc: linux-a...@vger.kernel.org
 Signed-off-by: Nick Piggin npig...@suse.de
 ---
 Index: linux-2.6/arch/powerpc/mm/fault.c
 ===
 --- linux-2.6.orig/arch/powerpc/mm/fault.c
 +++ linux-2.6/arch/powerpc/mm/fault.c
 @@ -359,15 +359,10 @@ bad_area_nosemaphore:
   */
  out_of_memory:
   up_read(&mm->mmap_sem);
 - if (is_global_init(current)) {
 - yield();
 - down_read(&mm->mmap_sem);
 - goto survive;
 - }
 - printk("VM: killing process %s\n", current->comm);
 - if (user_mode(regs))
 - do_group_exit(SIGKILL);
 - return SIGKILL;
 + if (!user_mode(regs))
 + return SIGKILL;
 + pagefault_out_of_memory();
 + return 0;

Do we really want to return 0 and indicate that the fault was handled?  It 
seems more consistent to do

	if (user_mode(regs))
		pagefault_out_of_memory();
	return SIGKILL;


Re: [PATCH] Fix fake numa on ppc

2009-09-02 Thread David Rientjes
On Wed, 2 Sep 2009, Ankita Garg wrote:

   With the patch,
   
   # cat /proc/cmdline
   root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
   # cat /sys/devices/system/node/node0/cpulist
   0-3
   # cat /sys/devices/system/node/node1/cpulist
   
  
  Oh! interesting.. cpuless nodes :) I think we need to fix this in the
  longer run and distribute cpus between fake numa nodes of a real node
  using some acceptable heuristic.
 
 
 True. Presently this is broken on both x86 and ppc systems. It would be
 interesting to find a way to map, for example, 4 cpus to 4 number of
 fake nodes created from a single real numa node!
  

We've done it for years on x86_64.  It's quite trivial to map all fake 
nodes within a physical node to the cpus to which they have affinity both 
via node_to_cpumask_map() and cpu_to_node_map().  There should be no 
kernel-space dependencies on a cpu appearing in only a single node's 
cpumask, and if you map each fake node to its physical node's pxm, you can 
index into the SLIT and generate local NUMA distances amongst fake nodes.

So if you map the apicids and pxms appropriately depending on the 
physical topology of the machine, that is the only emulation necessary on 
x86_64 for the page allocator zonelist ordering, task migration, etc.  (If 
you use CONFIG_SLAB, you'll need to avoid the exponential growth of alien 
caches, but that's an implementation detail and isn't really within the 
scope of numa=fake's purpose to modify.)
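
A rough sketch of the distance side (fake_to_phys_nid[] is a made-up name
for whatever table records which physical node each fake node was carved
out of, and __node_distance() stands in for the arch's SLIT lookup):

	static int fake_node_distance(int from, int to)
	{
		/* fake nodes on the same physical node get LOCAL_DISTANCE */
		return __node_distance(fake_to_phys_nid[from],
				       fake_to_phys_nid[to]);
	}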


Re: [PATCH] Fix fake numa on ppc

2009-09-02 Thread David Rientjes
On Wed, 2 Sep 2009, Benjamin Herrenschmidt wrote:

 Since I'm pretty sure there could be CPU less nodes just like there
 could be memory-less nodes, it would be good if fake numa could
 simulate them too :-)
 

You don't want to simulate cpu-less nodes since they do have affinity to 
ranges of memory; you want to map each fake node to a cpumask including 
all cpus with affinity to its memory, map each cpu to one fake node (with 
memory) that it has physical affinity to, and then give all fake nodes 
local NUMA distance to those on the same physical node.  Memoryless nodes 
take care of themselves since they rely purely on node_distance(), so the 
index into the SLIT for all fake nodes to those without memory will be the 
same.


Re: [PATCH v2] Fix fake numa on ppc

2009-09-02 Thread David Rientjes
On Wed, 2 Sep 2009, Ankita Garg wrote:

 Hi,
 
 Below is a patch to fix a couple of issues with fake numa node creation
 on ppc:
 
 1) Presently, fake nodes could be created such that real numa node
 boundaries are not respected. So a node could have lmbs that belong to
 different real nodes.
 

On x86_64, we can use numa=off to completely disable NUMA so that all 
memory and all cpus are mapped to a single node 0.  That's an extreme 
example of the above and is totally permissible.

 2) The cpu association is broken. On a JS22 blade for example, which is
 a 2-node numa machine, I get the following:
 
 # cat /proc/cmdline
 root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
 # cat /sys/devices/system/node/node0/cpulist
 0-3
 # cat /sys/devices/system/node/node1/cpulist
 4-7
 # cat /sys/devices/system/node/node4/cpulist
 
 #
 

This doesn't show what the true NUMA topology of the machine is, could you 
please post the output of

$ cat /sys/devices/system/node/node*/cpulist
$ cat /sys/devices/system/node/node*/distance
$ ls -d /sys/devices/system/node/node*/cpu[0-8]

from a normal boot without any numa=fake?

 So, though the cpus 4-7 should have been associated with node4, they
 still belong to node1. The patch works by recording a real numa node
 boundary and incrementing the fake node count. At the same time, a
 mapping is stored from the real numa node to the first fake node that
 gets created on it.
 

If there are multiple fake nodes on a real physical node, all cpus in that 
node should appear in the cpulist for each fake node for which it has 
local distance.
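
For example, on a hypothetical two-node box with cpus 0-3 and 4-7 on the two
physical nodes, split into four fake nodes (output is purely illustrative,
not from a real machine), the expectation would be:

 # cat /sys/devices/system/node/node0/cpulist
 0-3
 # cat /sys/devices/system/node/node1/cpulist
 0-3
 # cat /sys/devices/system/node/node2/cpulist
 4-7
 # cat /sys/devices/system/node/node3/cpulist
 4-7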


Re: [PATCH v2] Fix fake numa on ppc

2009-09-02 Thread David Rientjes
On Wed, 2 Sep 2009, Ankita Garg wrote:

 Currently, the behavior of fake numa is not so on x86 as well? Below is
 a sample output from a single node x86 system booted with numa=fake=8:
 
 # cat node0/cpulist
 
 # cat node1/cpulist
 
 ...
 # cat node6/cpulist
 
 # cat node7/cpulist
 0-7
 
 Presently, just fixing the cpu association issue with ppc, as explained
 in my previous mail.
 

Right, I'm proposing an alternate mapping scheme (which we've used for 
years) for both platforms such that a cpu is bound (and is set in 
cpumask_of_node()) to each fake node with which it has physical affinity.  
That is the only way to get correct zonelist ordering in node order, task 
migration from offlined cpus, correct sched domains, etc.  I can propose a 
patchset for x86_64 to do exactly this if there aren't any objections, and 
I hope you'll help do the ppc side.


Re: [PATCH v2] Fix fake numa on ppc

2009-09-02 Thread David Rientjes
On Thu, 3 Sep 2009, Balbir Singh wrote:

  Right, I'm proposing an alternate mapping scheme (which we've used for
  years) for both platforms such that a cpu is bound (and is set in
  cpumask_of_node()) to each fake node with which it has physical affinity.
  That is the only way for zonelist ordering in node order, task migration
  from offlined cpus, correct sched domains, etc.  I can propose a patchset
  for x86_64 to do exactly this if there aren't any objections and I hope
  you'll help do ppc.
 
 Sounds interesting, I'd definitely be interested in seeing your
 proposal, but I would think of that as additional development on top
 of this patch
 

Absolutely.  I'm not familiar with numa=fake on ppc, but if cpus are being 
bound to nodes with which they don't have affinity, it definitely warrants 
a fix such as this (although the initial value for fake_enabled looks 
wrong and fake_numa_node_mapping[] can be __cpuinitdata).  I'll cc you, 
Ben, and Ankita on the x86_64 patches.  Thanks.

Re: [PATCH] Fake NUMA emulation for PowerPC

2007-12-07 Thread David Rientjes
On Sat, 8 Dec 2007, Balbir Singh wrote:

  You're going to want to distribute the cpu's based on how they match up 
  physically with the actual platform that you're running on.  x86_64 does 
 
 Could you explain this better, how does it match up CPU's with fake NUMA
 memory? Is there some smartness there? I'll try and look at the code and
 also see what I can do for PowerPC
 

numa_cpumask_lookup_table[] would return the correct cpumask for the fake 
node index.  Then all the code that uses node_to_cpumask() in generic 
kernel code like the scheduler and VM still preserve their true NUMA 
affinity that matches the underlying hardware.  I tried to make x86_64 
fake NUMA as close to the real thing as possible.

You also probably want to make all your changes dependent on 
CONFIG_NUMA_EMU like the x86_64 case.  That'll probably be helpful as you 
extend this tool more and more.
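
Very roughly (untested sketch; fake_nid and real_nid are placeholders for
however your patch tracks which real node a fake node's memory came from):

#ifdef CONFIG_NUMA_EMU
	/* the fake node reports the cpumask of the real node backing it */
	numa_cpumask_lookup_table[fake_nid] =
			numa_cpumask_lookup_table[real_nid];
#endif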

David


Re: [PATCH] Fake NUMA emulation for PowerPC

2007-12-07 Thread David Rientjes
On Sat, 8 Dec 2007, Balbir Singh wrote:

 Yes, they all appear on node 0. We could have tweaks to distribute CPU's
 as well.
 

You're going to want to distribute the cpu's based on how they match up 
physically with the actual platform that you're running on.  x86_64 does 
this already and it makes fake NUMA more useful because it matches the 
real-life case more often.


Re: [PATCH] Fake NUMA emulation for PowerPC

2007-12-07 Thread David Rientjes
On Sat, 8 Dec 2007, Balbir Singh wrote:

 To be able to test the memory controller under NUMA, I use fake NUMA
 nodes. x86-64 has a similar feature, the code I have here is the
 simplest I could come up with for PowerPC.
 

Magnus Damm had patches from over a year ago that, I believe, made much of 
the x86_64 fake NUMA code generic so that it could be extended for 
architectures such as i386.  Perhaps he could resurrect those patches if 
there is wider interest in such a tool.


Re: [PATCH] Fake NUMA emulation for PowerPC

2007-12-07 Thread David Rientjes
On Fri, 7 Dec 2007, Olof Johansson wrote:

  Comments are as always welcome!
 
 Care to explain what this is useful for? (Not saying it's a stupid idea,
 just wondering what the reason for doing it is).
 

Fake NUMA has always been useful for testing NUMA code without having to 
have a wide range of hardware available to you.  It's a clever tool on 
x86_64 intended for kernel developers that simply makes it easier to test 
code and adds an increased level of robustness to the kernel.  I think 
it's a valuable addition.