subject:"Re\: linux\-next\: manual merge of the akpm\-current tree with the tip tree"

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2020-12-11 Thread Jason Gunthorpe

On Fri, Dec 11, 2020 at 07:56:54PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   mm/gup.c
> 
> between commit:
> 
>   2a4a06da8a4b ("mm/gup: Provide gup_get_pte() more generic")
> 
> from the tip tree and commit:
> 
>   1eb2fe862a51 ("mm/gup: combine put_compound_head() and unpin_user_page()")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Looks OK

Thanks,
Jason

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2020-11-30 Thread Thomas Gleixner

On Fri, Nov 27 2020 at 13:54, Andy Shevchenko wrote:
>> I fixed it up (see below) and can carry the fix as necessary. This
>> is now fixed as far as linux-next is concerned, but any non trivial
>> conflicts should be mentioned to your upstream maintainer when your tree
>> is submitted for merging.  You may also want to consider cooperating
>> with the maintainer of the conflicting tree to minimise any particularly
>> complex conflicts.
>
> Thanks, from my perspective looks good, dunno if scheduler part is okay.

The final outcome in -next looks correct.

Thanks,

tglx

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2020-11-27 Thread Andy Shevchenko

On Fri, Nov 27, 2020 at 06:39:24PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   include/linux/kernel.h
> 
> between commit:
> 
>   74d862b682f5 ("sched: Make migrate_disable/enable() independent of RT")
> 
> from the tip tree and commit:
> 
>   761ace49e56f ("kernel.h: Split out mathematical helpers")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Thanks, from my perspective looks good, dunno if scheduler part is okay.

> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc include/linux/kernel.h
> index dbf6018fc312,f97ab3283a8b..
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@@ -272,48 -145,13 +159,6 @@@ extern void __cant_migrate(const char *
>   
>   #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
>   
> - /**
> -  * abs - return absolute value of an argument
> -  * @x: the value.  If it is unsigned type, it is converted to signed type 
> first.
> -  * char is treated as if it was signed (regardless of whether it really 
> is)
> -  * but the macro's return type is preserved as char.
> -  *
> -  * Return: an absolute value of x.
> -  */
> - #define abs(x)  __abs_choose_expr(x, long long, 
> \
> - __abs_choose_expr(x, long,  \
> - __abs_choose_expr(x, int,   \
> - __abs_choose_expr(x, short, \
> - __abs_choose_expr(x, char,  \
> - __builtin_choose_expr(  \
> - __builtin_types_compatible_p(typeof(x), char),  \
> - (char)({ signed char __x = (x); __x<0?-__x:__x; }), \
> - ((void)0)))
> - 
> - #define __abs_choose_expr(x, type, other) __builtin_choose_expr(\
> - __builtin_types_compatible_p(typeof(x),   signed type) ||   \
> - __builtin_types_compatible_p(typeof(x), unsigned type), \
> - ({ signed type __x = (x); __x < 0 ? -__x : __x; }), other)
> - 
> - /**
> -  * reciprocal_scale - "scale" a value into range [0, ep_ro)
> -  * @val: value
> -  * @ep_ro: right open interval endpoint
> -  *
> -  * Perform a "reciprocal multiplication" in order to "scale" a value into
> -  * range [0, @ep_ro), where the upper interval endpoint is right-open.
> -  * This is useful, e.g. for accessing a index of an array containing
> -  * @ep_ro elements, for example. Think of it as sort of modulus, only that
> -  * the result isn't that of modulo. ;) Note that if initial input is a
> -  * small value, then result will return 0.
> -  *
> -  * Return: a result based on @val in interval [0, @ep_ro).
> -  */
> - static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
> - {
> - return (u32)(((u64) val * ep_ro) >> 32);
> - }
>  -#ifndef CONFIG_PREEMPT_RT
>  -# define cant_migrate() cant_sleep()
>  -#else
>  -  /* Placeholder for now */
>  -# define cant_migrate() do { } while (0)
>  -#endif
> --
>   #if defined(CONFIG_MMU) && \
>   (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
>   #define might_fault() __might_fault(__FILE__, __LINE__)



-- 
With Best Regards,
Andy Shevchenko

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2020-06-02 Thread Stephen Rothwell

Hi all,

On Mon, 25 May 2020 21:04:43 +1000 Stephen Rothwell  
wrote:
>
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   arch/x86/mm/tlb.c
> 
> between commit:
> 
>   83ce56f712af ("x86/mm: Refactor cond_ibpb() to support other use cases")
> 
> from the tip tree and commit:
> 
>   36c8e34d03a1 ("x86/mm: remove vmalloc faulting")
> 
> from the akpm-current tree.
> 
> diff --cc arch/x86/mm/tlb.c
> index c8524c506ab0,f3fe261e5936..
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@@ -345,48 -161,16 +345,20 @@@ void switch_mm(struct mm_struct *prev, 
>   local_irq_restore(flags);
>   }
>   
> - static void sync_current_stack_to_mm(struct mm_struct *mm)
> - {
> - unsigned long sp = current_stack_pointer;
> - pgd_t *pgd = pgd_offset(mm, sp);
> - 
> - if (pgtable_l5_enabled()) {
> - if (unlikely(pgd_none(*pgd))) {
> - pgd_t *pgd_ref = pgd_offset_k(sp);
> - 
> - set_pgd(pgd, *pgd_ref);
> - }
> - } else {
> - /*
> -  * "pgd" is faked.  The top level entries are "p4d"s, so sync
> -  * the p4d.  This compiles to approximately the same code as
> -  * the 5-level case.
> -  */
> - p4d_t *p4d = p4d_offset(pgd, sp);
> - 
> - if (unlikely(p4d_none(*p4d))) {
> - pgd_t *pgd_ref = pgd_offset_k(sp);
> - p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
> - 
> - set_p4d(p4d, *p4d_ref);
> - }
> - }
> - }
> - 
>  -static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
>  +static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct 
> *next)
>   {
>   unsigned long next_tif = task_thread_info(next)->flags;
>  -unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
>  +unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & 
> LAST_USER_MM_SPEC_MASK;
>   
>  -return (unsigned long)next->mm | ibpb;
>  +BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
>  +
>  +return (unsigned long)next->mm | spec_bits;
>   }
>   
>  -static void cond_ibpb(struct task_struct *next)
>  +static void cond_mitigation(struct task_struct *next)
>   {
>  +unsigned long prev_mm, next_mm;
>  +
>   if (!next || !next->mm)
>   return;
>   
> @@@ -587,20 -343,12 +559,11 @@@ void switch_mm_irqs_off(struct mm_struc
>   need_flush = true;
>   } else {
>   /*
>  - * Avoid user/user BTB poisoning by flushing the branch
>  - * predictor when switching between processes. This stops
>  - * one process from doing Spectre-v2 attacks on another.
>  + * Apply process to process speculation vulnerability
>  + * mitigations if applicable.
>*/
>  -cond_ibpb(tsk);
>  +cond_mitigation(tsk);
>   
> - if (IS_ENABLED(CONFIG_VMAP_STACK)) {
> - /*
> -  * If our current stack is in vmalloc space and isn't
> -  * mapped in the new pgd, we'll double-fault.  Forcibly
> -  * map it.
> -  */
> - sync_current_stack_to_mm(next);
> - }
> - 
>   /*
>* Stop remote flushes for the previous mm.
>* Skip kernel threads; we never send init_mm TLB flushing IPIs,

This is now a conflict between commit

  94709049fb84 ("Merge branch 'akpm' (patches from Andrew)")

from Linus' tree and the above tip tree commit.

-- 
Cheers,
Stephen Rothwell


pgpgZXYeA_qYi.pgp
Description: OpenPGP digital signature

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2020-05-25 Thread Singh, Balbir

On Mon, 2020-05-25 at 21:04 +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   arch/x86/mm/tlb.c
> 
> between commit:
> 
>   83ce56f712af ("x86/mm: Refactor cond_ibpb() to support other use cases")
> 
> from the tip tree and commit:
> 
>   36c8e34d03a1 ("x86/mm: remove vmalloc faulting")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 

The changes look reasonable to me (in terms of the merge resolution).

Acked-by: Balbir Singh

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2018-08-20 Thread Andrew Morton

On Mon, 20 Aug 2018 14:32:22 +1000 Stephen Rothwell  
wrote:

> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   fs/proc/kcore.c
>   include/linux/kcore.h
> 
> between commit:
> 
>   6855dc41b246 ("x86: Add entry trampolines to kcore")
> 
> from the tip tree and commits:
> 
>   4eb27c275abf ("fs/proc/kcore.c: use __pa_symbol() for KCORE_TEXT list 
> entries")
>   ea551910d3f4 ("proc/kcore: clean up ELF header generation")
>   537412a2958f ("proc/kcore: don't grab lock for kclist_add()")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Yup.

What's happening here?  A two month old patch turns up in linux-next in the
middle of the merge window, in the "perf/urgent" branch.  That's a strange
branch for a June 6 patch!

Is it intended that this material be merged into 4.19-rc1?

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2018-08-20 Thread Andrew Morton

On Mon, 20 Aug 2018 14:32:22 +1000 Stephen Rothwell  
wrote:

> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   fs/proc/kcore.c
>   include/linux/kcore.h
> 
> between commit:
> 
>   6855dc41b246 ("x86: Add entry trampolines to kcore")
> 
> from the tip tree and commits:
> 
>   4eb27c275abf ("fs/proc/kcore.c: use __pa_symbol() for KCORE_TEXT list 
> entries")
>   ea551910d3f4 ("proc/kcore: clean up ELF header generation")
>   537412a2958f ("proc/kcore: don't grab lock for kclist_add()")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Yup.

What's happening here?  A two month old patch turns up in linux-next in the
middle of the merge window, in the "perf/urgent" branch.  That's a strange
branch for a June 6 patch!

Is it intended that this material be merged into 4.19-rc1?

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-23 Thread Vlastimil Babka

On 08/22/2017 08:57 AM, Stephen Rothwell wrote:
> Hi Andrew,

Hi,

> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   init/main.c
> 
> between commit:
> 
>   caba4cbbd27d ("debugobjects: Make kmemleak ignore debug objects")
> 
> from the tip tree and commit:
> 
>   50a7dc046b58 ("mm, page_ext: move page_ext_init() after 
> page_alloc_init_late()")

This patch can be also dropped from mmotm. It was a RFC and review
suggested a different approach which I didn't get to try yet. (The other
patches in the series should be fine to stay in any case).

> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-23 Thread Vlastimil Babka

On 08/22/2017 08:57 AM, Stephen Rothwell wrote:
> Hi Andrew,

Hi,

> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   init/main.c
> 
> between commit:
> 
>   caba4cbbd27d ("debugobjects: Make kmemleak ignore debug objects")
> 
> from the tip tree and commit:
> 
>   50a7dc046b58 ("mm, page_ext: move page_ext_init() after 
> page_alloc_init_late()")

This patch can be also dropped from mmotm. It was a RFC and review
suggested a different approach which I didn't get to try yet. (The other
patches in the series should be fine to stay in any case).

> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-15 Thread Minchan Kim

On Mon, Aug 14, 2017 at 09:57:23PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 14, 2017 at 05:38:39PM +0900, Minchan Kim wrote:
> > memory-barrier.txt always scares me. I have read it for a while
> > and IIUC, it seems semantic of spin_unlock(_pte) would be
> > enough without some memory-barrier inside mm_tlb_flush_nested.
> 
> Indeed, see the email I just send. Its both spin_lock() and
> spin_unlock() that we care about.
> 
> Aside from the semi permeable barrier of these primitives, RCpc ensures
> these orderings only work against the _same_ lock variable.
> 
> Let me try and explain the ordering for PPC (which is by far the worst
> we have in this regard):
> 
> 
> spin_lock(lock)
> {
>   while (test_and_set(lock))
>   cpu_relax();
>   lwsync();
> }
> 
> 
> spin_unlock(lock)
> {
>   lwsync();
>   clear(lock);
> }
> 
> Now LWSYNC has fairly 'simple' semantics, but with fairly horrible
> ramifications. Consider LWSYNC to provide _local_ TSO ordering, this
> means that it allows 'stores reordered after loads'.
> 
> For the spin_lock() that implies that all load/store's inside the lock
> do indeed stay in, but the ACQUIRE is only on the LOAD of the
> test_and_set(). That is, the actual _set_ can leak in. After all it can
> re-order stores after load (inside the lock).
> 
> For unlock it again means all load/store's prior stay prior, and the
> RELEASE is on the store clearing the lock state (nothing surprising
> here).
> 
> Now the _local_ part, the main take-away is that these orderings are
> strictly CPU local. What makes the spinlock work across CPUs (as we'd
> very much expect it to) is the address dependency on the lock variable.
> 
> In order for the spin_lock() to succeed, it must observe the clear. Its
> this link that crosses between the CPUs and builds the ordering. But
> only the two CPUs agree on this order. A third CPU not involved in
> this transaction can disagree on the order of events.

The detail explanation in your previous reply makes me comfortable
from scary memory-barrier.txt but this reply makes me scared again. ;-)

Thanks for the kind clarification, Peter!

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-15 Thread Minchan Kim

On Mon, Aug 14, 2017 at 09:57:23PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 14, 2017 at 05:38:39PM +0900, Minchan Kim wrote:
> > memory-barrier.txt always scares me. I have read it for a while
> > and IIUC, it seems semantic of spin_unlock(_pte) would be
> > enough without some memory-barrier inside mm_tlb_flush_nested.
> 
> Indeed, see the email I just send. Its both spin_lock() and
> spin_unlock() that we care about.
> 
> Aside from the semi permeable barrier of these primitives, RCpc ensures
> these orderings only work against the _same_ lock variable.
> 
> Let me try and explain the ordering for PPC (which is by far the worst
> we have in this regard):
> 
> 
> spin_lock(lock)
> {
>   while (test_and_set(lock))
>   cpu_relax();
>   lwsync();
> }
> 
> 
> spin_unlock(lock)
> {
>   lwsync();
>   clear(lock);
> }
> 
> Now LWSYNC has fairly 'simple' semantics, but with fairly horrible
> ramifications. Consider LWSYNC to provide _local_ TSO ordering, this
> means that it allows 'stores reordered after loads'.
> 
> For the spin_lock() that implies that all load/store's inside the lock
> do indeed stay in, but the ACQUIRE is only on the LOAD of the
> test_and_set(). That is, the actual _set_ can leak in. After all it can
> re-order stores after load (inside the lock).
> 
> For unlock it again means all load/store's prior stay prior, and the
> RELEASE is on the store clearing the lock state (nothing surprising
> here).
> 
> Now the _local_ part, the main take-away is that these orderings are
> strictly CPU local. What makes the spinlock work across CPUs (as we'd
> very much expect it to) is the address dependency on the lock variable.
> 
> In order for the spin_lock() to succeed, it must observe the clear. Its
> this link that crosses between the CPUs and builds the ordering. But
> only the two CPUs agree on this order. A third CPU not involved in
> this transaction can disagree on the order of events.

The detail explanation in your previous reply makes me comfortable
from scary memory-barrier.txt but this reply makes me scared again. ;-)

Thanks for the kind clarification, Peter!

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-15 Thread Nadav Amit

Peter Zijlstra  wrote:

> On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
 So I'm not entirely clear about this yet.
 
 How about:
 
 
CPU0CPU1
 
tlb_gather_mmu()
 
lock PTLn
no mod
unlock PTLn
 
tlb_gather_mmu()
 
lock PTLm
mod
include in tlb range
unlock PTLm
 
lock PTLn
mod
unlock PTLn
 
tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);
 
 
... more ...
 
tlb_finish_mmu()
 
 
 
 In this case you also want CPU1's mm_tlb_flush_nested() call to return
 true, right?
>>> 
>>> No, because CPU 1 mofified pte and added it into tlb range
>>> so regardless of nested, it will flush TLB so there is no stale
>>> TLB problem.
> 
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
>> PTE is already what I wanted - no flush is needed”.
> 
> OK, that simplifies things.
> 
>> For some reason (I would assume intentional), all the examples here first
>> “do not modify” the PTE, and then modify it - which is not an “interesting”
>> case.
> 
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get they were
> excluded from the set, I worried.
> 
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
> 
>> However, based on what I understand on the memory barriers, I think
>> there is indeed a missing barrier before reading it in
>> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
>> before reading, would solve the problem with least impact on systems with
>> strong memory ordering.
> 
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification we can rely on the RCpc PTL
> locks.
> 
> Consider:
> 
> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   tlb_gather_mmu()
> inc   .
>   | (inc is constrained by RELEASE)
>   lock PTLn   |
>   mod ^
>   unlock PTLn ->  lock PTLn
>   v   no mod
>   |   unlock PTLn
>   |
>   |   lock PTLm
>   |   mod
>   |   include in tlb range
>   |   unlock PTLm
>   |
>   (read is constrained|
> by ACQUIRE)   |
>   |   tlb_finish_mmu()
>   ` force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> 
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
> 
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
> 
> At the same time, the decrements need to be after the TLB invalidate is
> complete, this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
> 
> Something like the below is then sufficient.
> 
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra 
> Date: Fri, 11 Aug 2017 16:04:50 +0200
> 
> Better document the ordering around tlb_flush_pending.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
> include/linux/mm_types.h |   78 
> +++
> 1 file changed, 45 insertions(+), 33 deletions(-)
> 
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
> extern void tlb_finish_mmu(struct mmu_gather *tlb,
>   unsigned long start, unsigned long end);
> 
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-15 Thread Nadav Amit

Peter Zijlstra  wrote:

> On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
 So I'm not entirely clear about this yet.
 
 How about:
 
 
CPU0CPU1
 
tlb_gather_mmu()
 
lock PTLn
no mod
unlock PTLn
 
tlb_gather_mmu()
 
lock PTLm
mod
include in tlb range
unlock PTLm
 
lock PTLn
mod
unlock PTLn
 
tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);
 
 
... more ...
 
tlb_finish_mmu()
 
 
 
 In this case you also want CPU1's mm_tlb_flush_nested() call to return
 true, right?
>>> 
>>> No, because CPU 1 mofified pte and added it into tlb range
>>> so regardless of nested, it will flush TLB so there is no stale
>>> TLB problem.
> 
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
>> PTE is already what I wanted - no flush is needed”.
> 
> OK, that simplifies things.
> 
>> For some reason (I would assume intentional), all the examples here first
>> “do not modify” the PTE, and then modify it - which is not an “interesting”
>> case.
> 
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get they were
> excluded from the set, I worried.
> 
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
> 
>> However, based on what I understand on the memory barriers, I think
>> there is indeed a missing barrier before reading it in
>> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
>> before reading, would solve the problem with least impact on systems with
>> strong memory ordering.
> 
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification we can rely on the RCpc PTL
> locks.
> 
> Consider:
> 
> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   tlb_gather_mmu()
> inc   .
>   | (inc is constrained by RELEASE)
>   lock PTLn   |
>   mod ^
>   unlock PTLn ->  lock PTLn
>   v   no mod
>   |   unlock PTLn
>   |
>   |   lock PTLm
>   |   mod
>   |   include in tlb range
>   |   unlock PTLm
>   |
>   (read is constrained|
> by ACQUIRE)   |
>   |   tlb_finish_mmu()
>   ` force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> 
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
> 
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
> 
> At the same time, the decrements need to be after the TLB invalidate is
> complete, this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
> 
> Something like the below is then sufficient.
> 
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra 
> Date: Fri, 11 Aug 2017 16:04:50 +0200
> 
> Better document the ordering around tlb_flush_pending.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
> include/linux/mm_types.h |   78 
> +++
> 1 file changed, 45 insertions(+), 33 deletions(-)
> 
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
> extern void tlb_finish_mmu(struct mmu_gather *tlb,
>   unsigned long start, unsigned long end);
> 
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the order between tlb_flush_pending 
> updates,
> - * which happen while

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Peter Zijlstra

On Mon, Aug 14, 2017 at 05:38:39PM +0900, Minchan Kim wrote:
> memory-barrier.txt always scares me. I have read it for a while
> and IIUC, it seems semantic of spin_unlock(_pte) would be
> enough without some memory-barrier inside mm_tlb_flush_nested.

Indeed, see the email I just send. Its both spin_lock() and
spin_unlock() that we care about.

Aside from the semi permeable barrier of these primitives, RCpc ensures
these orderings only work against the _same_ lock variable.

Let me try and explain the ordering for PPC (which is by far the worst
we have in this regard):

spin_lock(lock)
{
while (test_and_set(lock))
cpu_relax();
lwsync();
}

spin_unlock(lock)
{
lwsync();
clear(lock);
}

Now LWSYNC has fairly 'simple' semantics, but with fairly horrible
ramifications. Consider LWSYNC to provide _local_ TSO ordering, this
means that it allows 'stores reordered after loads'.

For the spin_lock() that implies that all load/store's inside the lock
do indeed stay in, but the ACQUIRE is only on the LOAD of the
test_and_set(). That is, the actual _set_ can leak in. After all it can
re-order stores after load (inside the lock).

For unlock it again means all load/store's prior stay prior, and the
RELEASE is on the store clearing the lock state (nothing surprising
here).

Now the _local_ part, the main take-away is that these orderings are
strictly CPU local. What makes the spinlock work across CPUs (as we'd
very much expect it to) is the address dependency on the lock variable.

In order for the spin_lock() to succeed, it must observe the clear. Its
this link that crosses between the CPUs and builds the ordering. But
only the two CPUs agree on this order. A third CPU not involved in
this transaction can disagree on the order of events.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Peter Zijlstra

On Mon, Aug 14, 2017 at 05:38:39PM +0900, Minchan Kim wrote:
> memory-barrier.txt always scares me. I have read it for a while
> and IIUC, it seems semantic of spin_unlock(_pte) would be
> enough without some memory-barrier inside mm_tlb_flush_nested.

Indeed, see the email I just send. Its both spin_lock() and
spin_unlock() that we care about.

Aside from the semi permeable barrier of these primitives, RCpc ensures
these orderings only work against the _same_ lock variable.

Let me try and explain the ordering for PPC (which is by far the worst
we have in this regard):

spin_lock(lock)
{
while (test_and_set(lock))
cpu_relax();
lwsync();
}

spin_unlock(lock)
{
lwsync();
clear(lock);
}

Now LWSYNC has fairly 'simple' semantics, but with fairly horrible
ramifications. Consider LWSYNC to provide _local_ TSO ordering, this
means that it allows 'stores reordered after loads'.

For the spin_lock() that implies that all load/store's inside the lock
do indeed stay in, but the ACQUIRE is only on the LOAD of the
test_and_set(). That is, the actual _set_ can leak in. After all it can
re-order stores after load (inside the lock).

For unlock it again means all load/store's prior stay prior, and the
RELEASE is on the store clearing the lock state (nothing surprising
here).

Now the _local_ part, the main take-away is that these orderings are
strictly CPU local. What makes the spinlock work across CPUs (as we'd
very much expect it to) is the address dependency on the lock variable.

In order for the spin_lock() to succeed, it must observe the clear. Its
this link that crosses between the CPUs and builds the ordering. But
only the two CPUs agree on this order. A third CPU not involved in
this transaction can disagree on the order of events.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Peter Zijlstra

On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
> >> So I'm not entirely clear about this yet.
> >> 
> >> How about:
> >> 
> >> 
> >>CPU0CPU1
> >> 
> >>tlb_gather_mmu()
> >> 
> >>lock PTLn
> >>no mod
> >>unlock PTLn
> >> 
> >>tlb_gather_mmu()
> >> 
> >>lock PTLm
> >>mod
> >>include in tlb range
> >>unlock PTLm
> >> 
> >>lock PTLn
> >>mod
> >>unlock PTLn
> >> 
> >>tlb_finish_mmu()
> >>  force = mm_tlb_flush_nested(tlb->mm);
> >>  arch_tlb_finish_mmu(force);
> >> 
> >> 
> >>... more ...
> >> 
> >>tlb_finish_mmu()
> >> 
> >> 
> >> 
> >> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> >> true, right?
> > 
> > No, because CPU 1 mofified pte and added it into tlb range
> > so regardless of nested, it will flush TLB so there is no stale
> > TLB problem.

> To clarify: the main problem that these patches address is when the first
> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
> PTE is already what I wanted - no flush is needed”.

OK, that simplifies things.

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case.

Depends on what you call 'interesting' :-) They are 'interesting' to
make work from a memory ordering POV. And since I didn't get they were
excluded from the set, I worried.

In fact, if they were to be included, I couldn't make it work at all. So
I'm really glad to hear we can disregard them.

> However, based on what I understand on the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
> before reading, would solve the problem with least impact on systems with
> strong memory ordering.

No, all is well. If, as you say, we're naturally constrained to the case
where we only care about prior modification we can rely on the RCpc PTL
locks.

Consider:


CPU0CPU1

tlb_gather_mmu()

tlb_gather_mmu()
  inc   .
| (inc is constrained by RELEASE)
lock PTLn   |
mod ^
unlock PTLn ->  lock PTLn
v   no mod
|   unlock PTLn
|
|   lock PTLm
|   mod
|   include in tlb range
|   unlock PTLm
|
(read is constrained|
  by ACQUIRE)   |
|   tlb_finish_mmu()
` force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()


Then CPU1's acquire of PTLn orders against CPU0's release of that same
PTLn which guarantees we observe both its (prior) modified PTE and the
mm->tlb_flush_pending increment from tlb_gather_mmu().

So all we need for mm_tlb_flush_nested() to work is having acquired the
right PTL at least once before calling it.

At the same time, the decrements need to be after the TLB invalidate is
complete, this ensures that _IF_ we observe the decrement, we must've
also observed the corresponding invalidate.

Something like the below is then sufficient.

---
Subject: mm: Clarify tlb_flush_pending barriers
From: Peter Zijlstra 
Date: Fri, 11 Aug 2017 16:04:50 +0200

Better document the ordering around tlb_flush_pending.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/mm_types.h |   78 +++
 1 file changed, 45 insertions(+), 33 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
 extern void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end);
 
-/*
- * Memory barriers to keep this state in sync are graciously provided by
- * the page table locks, outside of which no page table modifications happen.
- * The barriers are used to ensure the order between tlb_flush_pending updates,
- * which happen while the lock is not taken, and the PTE updates, which happen
- * while the lock is taken, are serialized.
- */
-static

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Peter Zijlstra

On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
> >> So I'm not entirely clear about this yet.
> >> 
> >> How about:
> >> 
> >> 
> >>CPU0CPU1
> >> 
> >>tlb_gather_mmu()
> >> 
> >>lock PTLn
> >>no mod
> >>unlock PTLn
> >> 
> >>tlb_gather_mmu()
> >> 
> >>lock PTLm
> >>mod
> >>include in tlb range
> >>unlock PTLm
> >> 
> >>lock PTLn
> >>mod
> >>unlock PTLn
> >> 
> >>tlb_finish_mmu()
> >>  force = mm_tlb_flush_nested(tlb->mm);
> >>  arch_tlb_finish_mmu(force);
> >> 
> >> 
> >>... more ...
> >> 
> >>tlb_finish_mmu()
> >> 
> >> 
> >> 
> >> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> >> true, right?
> > 
> > No, because CPU 1 mofified pte and added it into tlb range
> > so regardless of nested, it will flush TLB so there is no stale
> > TLB problem.

> To clarify: the main problem that these patches address is when the first
> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
> PTE is already what I wanted - no flush is needed”.

OK, that simplifies things.

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case.

Depends on what you call 'interesting' :-) They are 'interesting' to
make work from a memory ordering POV. And since I didn't get they were
excluded from the set, I worried.

In fact, if they were to be included, I couldn't make it work at all. So
I'm really glad to hear we can disregard them.

> However, based on what I understand on the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
> before reading, would solve the problem with least impact on systems with
> strong memory ordering.

No, all is well. If, as you say, we're naturally constrained to the case
where we only care about prior modification we can rely on the RCpc PTL
locks.

Consider:


CPU0CPU1

tlb_gather_mmu()

tlb_gather_mmu()
  inc   .
| (inc is constrained by RELEASE)
lock PTLn   |
mod ^
unlock PTLn ->  lock PTLn
v   no mod
|   unlock PTLn
|
|   lock PTLm
|   mod
|   include in tlb range
|   unlock PTLm
|
(read is constrained|
  by ACQUIRE)   |
|   tlb_finish_mmu()
` force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()


Then CPU1's acquire of PTLn orders against CPU0's release of that same
PTLn which guarantees we observe both its (prior) modified PTE and the
mm->tlb_flush_pending increment from tlb_gather_mmu().

So all we need for mm_tlb_flush_nested() to work is having acquired the
right PTL at least once before calling it.

At the same time, the decrements need to be after the TLB invalidate is
complete, this ensures that _IF_ we observe the decrement, we must've
also observed the corresponding invalidate.

Something like the below is then sufficient.

---
Subject: mm: Clarify tlb_flush_pending barriers
From: Peter Zijlstra 
Date: Fri, 11 Aug 2017 16:04:50 +0200

Better document the ordering around tlb_flush_pending.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/mm_types.h |   78 +++
 1 file changed, 45 insertions(+), 33 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
 extern void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end);
 
-/*
- * Memory barriers to keep this state in sync are graciously provided by
- * the page table locks, outside of which no page table modifications happen.
- * The barriers are used to ensure the order between tlb_flush_pending updates,
- * which happen while the lock is not taken, and the PTE updates, which happen
- * while the lock is taken, are serialized.
- */
-static inline bool mm_tlb_flush_pending(struct

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Peter Zijlstra

On Mon, Aug 14, 2017 at 12:09:14PM +0900, Minchan Kim wrote:
> @@ -446,9 +450,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>*
>*/
>   bool force = mm_tlb_flush_nested(tlb->mm);
> -
>   arch_tlb_finish_mmu(tlb, start, end, force);
> - dec_tlb_flush_pending(tlb->mm);
>  }

No, I think this breaks all the mm_tlb_flush_pending() users. They need
the decrement to not be visible until the TLB flush is complete.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Peter Zijlstra

On Mon, Aug 14, 2017 at 12:09:14PM +0900, Minchan Kim wrote:
> @@ -446,9 +450,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
>*
>*/
>   bool force = mm_tlb_flush_nested(tlb->mm);
> -
>   arch_tlb_finish_mmu(tlb, start, end, force);
> - dec_tlb_flush_pending(tlb->mm);
>  }

No, I think this breaks all the mm_tlb_flush_pending() users. They need
the decrement to not be visible until the TLB flush is complete.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Minchan Kim

Hi Nadav,

On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
< snip >

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case. However, based on what I understand on the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,

memory-barrier.txt always scares me. I have read it for a while
and IIUC, it seems semantic of spin_unlock(_pte) would be
enough without some memory-barrier inside mm_tlb_flush_nested.

I would be missing something totally.

Could you explain what kinds of sequence you have in mind to
have such problem?

Thanks.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-14 Thread Minchan Kim

Hi Nadav,

On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
< snip >

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case. However, based on what I understand on the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,

memory-barrier.txt always scares me. I have read it for a while
and IIUC, it seems semantic of spin_unlock(_pte) would be
enough without some memory-barrier inside mm_tlb_flush_nested.

I would be missing something totally.

Could you explain what kinds of sequence you have in mind to
have such problem?

Thanks.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Minchan Kim

On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
< snip >

> Minchan, as for the solution you proposed, it seems to open again a race,
> since the “pending” indication is removed before the actual TLB flush is
> performed.

Oops, you're right!

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Minchan Kim

On Mon, Aug 14, 2017 at 05:07:19AM +, Nadav Amit wrote:
< snip >

> Minchan, as for the solution you proposed, it seems to open again a race,
> since the “pending” indication is removed before the actual TLB flush is
> performed.

Oops, you're right!

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Nadav Amit

Minchan Kim  wrote:

> On Sun, Aug 13, 2017 at 02:50:19PM +0200, Peter Zijlstra wrote:
>> On Sun, Aug 13, 2017 at 06:06:32AM +, Nadav Amit wrote:
 however mm_tlb_flush_nested() is a mystery, it appears to care about
 anything inside the range. For now rely on it doing at least _a_ PTL
 lock instead of taking  _the_ PTL lock.
>>> 
>>> It does not care about “anything” inside the range, but only on situations
>>> in which there is at least one (same) PT that was modified by one core and
>>> then read by the other. So, yes, it will always be _the_ same PTL, and not
>>> _a_ PTL - in the cases that flush is really needed.
>>> 
>>> The issue that might require additional barriers is that
>>> inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
>>> not held. IIUC, since the release-acquire might not behave as a full memory
>>> barrier, this requires an explicit memory barrier.
>> 
>> So I'm not entirely clear about this yet.
>> 
>> How about:
>> 
>> 
>>  CPU0CPU1
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  no mod
>>  unlock PTLn
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLm
>>  mod
>>  include in tlb range
>>  unlock PTLm
>> 
>>  lock PTLn
>>  mod
>>  unlock PTLn
>> 
>>  tlb_finish_mmu()
>>force = mm_tlb_flush_nested(tlb->mm);
>>arch_tlb_finish_mmu(force);
>> 
>> 
>>  ... more ...
>> 
>>  tlb_finish_mmu()
>> 
>> 
>> 
>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>> true, right?
> 
> No, because CPU 1 mofified pte and added it into tlb range
> so regardless of nested, it will flush TLB so there is no stale
> TLB problem.
> 
>> But even with an smp_mb__after_atomic() at CPU0's tlg_bather_mmu()
>> you're not guaranteed CPU1 sees the increment. The only way to do that
>> is to make the PTL locks RCsc and that is a much more expensive
>> proposition.
>> 
>> 
>> What about:
>> 
>> 
>>  CPU0CPU1
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  no mod
>>  unlock PTLn
>> 
>> 
>>  lock PTLm
>>  mod
>>  include in tlb range
>>  unlock PTLm
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  mod
>>  unlock PTLn
>> 
>>  tlb_finish_mmu()
>>force = mm_tlb_flush_nested(tlb->mm);
>>arch_tlb_finish_mmu(force);
>> 
>> 
>>  ... more ...
>> 
>>  tlb_finish_mmu()
>> 
>> Do we want CPU1 to see it here? If so, where does it end?
> 
> Ditto. Since CPU 1 has added range, it will flush TLB regardless
> of nested condition.
> 
>> CPU0 CPU1
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  no mod
>>  unlock PTLn
>> 
>> 
>>  lock PTLm
>>  mod
>>  include in tlb range
>>  unlock PTLm
>> 
>>  tlb_finish_mmu()
>>force = mm_tlb_flush_nested(tlb->mm);
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  mod
>>  unlock PTLn
>> 
>>arch_tlb_finish_mmu(force);
>> 
>> 
>>  ... more ...
>> 
>>  tlb_finish_mmu()
>> 
>> 
>> This?
>> 
>> 
>> Could you clarify under what exact condition mm_tlb_flush_nested() must
>> return true?
> 
> mm_tlb_flush_nested aims for the CPU side where there is no pte update
> but need TLB flush.
> As I wrote 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dmm-26m-3D150267398226529-26w-3D2=DwIDaQ=uilaK90D4TOVoH58JNXRgQ=x9zhXCtCLvTDtvE65-BGSA=v2Z7eDi7z1H9zdngcjZvlNeBudWzA9KvcXFNpU2A77s=amaSu_gurmBHHPcl3Pxfdl0Tk_uTnmf60tMQAsNDHVU=
>  ,
> it has stable TLB problem if we don't flush TLB although there is no
> pte modification.

To clarify: the main problem that these patches address is when the first
CPU updates the PTE, and second CPU sees the updated value and thinks: “the
PTE is already what I wanted - no flush is needed”.

For some reason (I would assume intentional), all the

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Nadav Amit

Minchan Kim  wrote:

> On Sun, Aug 13, 2017 at 02:50:19PM +0200, Peter Zijlstra wrote:
>> On Sun, Aug 13, 2017 at 06:06:32AM +, Nadav Amit wrote:
 however mm_tlb_flush_nested() is a mystery, it appears to care about
 anything inside the range. For now rely on it doing at least _a_ PTL
 lock instead of taking  _the_ PTL lock.
>>> 
>>> It does not care about “anything” inside the range, but only on situations
>>> in which there is at least one (same) PT that was modified by one core and
>>> then read by the other. So, yes, it will always be _the_ same PTL, and not
>>> _a_ PTL - in the cases that flush is really needed.
>>> 
>>> The issue that might require additional barriers is that
>>> inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
>>> not held. IIUC, since the release-acquire might not behave as a full memory
>>> barrier, this requires an explicit memory barrier.
>> 
>> So I'm not entirely clear about this yet.
>> 
>> How about:
>> 
>> 
>>  CPU0CPU1
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  no mod
>>  unlock PTLn
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLm
>>  mod
>>  include in tlb range
>>  unlock PTLm
>> 
>>  lock PTLn
>>  mod
>>  unlock PTLn
>> 
>>  tlb_finish_mmu()
>>force = mm_tlb_flush_nested(tlb->mm);
>>arch_tlb_finish_mmu(force);
>> 
>> 
>>  ... more ...
>> 
>>  tlb_finish_mmu()
>> 
>> 
>> 
>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>> true, right?
> 
> No, because CPU 1 mofified pte and added it into tlb range
> so regardless of nested, it will flush TLB so there is no stale
> TLB problem.
> 
>> But even with an smp_mb__after_atomic() at CPU0's tlg_bather_mmu()
>> you're not guaranteed CPU1 sees the increment. The only way to do that
>> is to make the PTL locks RCsc and that is a much more expensive
>> proposition.
>> 
>> 
>> What about:
>> 
>> 
>>  CPU0CPU1
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  no mod
>>  unlock PTLn
>> 
>> 
>>  lock PTLm
>>  mod
>>  include in tlb range
>>  unlock PTLm
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  mod
>>  unlock PTLn
>> 
>>  tlb_finish_mmu()
>>force = mm_tlb_flush_nested(tlb->mm);
>>arch_tlb_finish_mmu(force);
>> 
>> 
>>  ... more ...
>> 
>>  tlb_finish_mmu()
>> 
>> Do we want CPU1 to see it here? If so, where does it end?
> 
> Ditto. Since CPU 1 has added range, it will flush TLB regardless
> of nested condition.
> 
>> CPU0 CPU1
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  no mod
>>  unlock PTLn
>> 
>> 
>>  lock PTLm
>>  mod
>>  include in tlb range
>>  unlock PTLm
>> 
>>  tlb_finish_mmu()
>>force = mm_tlb_flush_nested(tlb->mm);
>> 
>>  tlb_gather_mmu()
>> 
>>  lock PTLn
>>  mod
>>  unlock PTLn
>> 
>>arch_tlb_finish_mmu(force);
>> 
>> 
>>  ... more ...
>> 
>>  tlb_finish_mmu()
>> 
>> 
>> This?
>> 
>> 
>> Could you clarify under what exact condition mm_tlb_flush_nested() must
>> return true?
> 
> mm_tlb_flush_nested aims for the CPU side where there is no pte update
> but need TLB flush.
> As I wrote 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dmm-26m-3D150267398226529-26w-3D2=DwIDaQ=uilaK90D4TOVoH58JNXRgQ=x9zhXCtCLvTDtvE65-BGSA=v2Z7eDi7z1H9zdngcjZvlNeBudWzA9KvcXFNpU2A77s=amaSu_gurmBHHPcl3Pxfdl0Tk_uTnmf60tMQAsNDHVU=
>  ,
> it has stable TLB problem if we don't flush TLB although there is no
> pte modification.

To clarify: the main problem that these patches address is when the first
CPU updates the PTE, and second CPU sees the updated value and thinks: “the
PTE is already what I wanted - no flush is needed”.

For some reason (I would assume intentional), all the examples here first
“do

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Minchan Kim

On Sun, Aug 13, 2017 at 02:50:19PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 13, 2017 at 06:06:32AM +, Nadav Amit wrote:
> > > however mm_tlb_flush_nested() is a mystery, it appears to care about
> > > anything inside the range. For now rely on it doing at least _a_ PTL
> > > lock instead of taking  _the_ PTL lock.
> > 
> > It does not care about “anything” inside the range, but only on situations
> > in which there is at least one (same) PT that was modified by one core and
> > then read by the other. So, yes, it will always be _the_ same PTL, and not
> > _a_ PTL - in the cases that flush is really needed.
> > 
> > The issue that might require additional barriers is that
> > inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
> > not held. IIUC, since the release-acquire might not behave as a full memory
> > barrier, this requires an explicit memory barrier.
> 
> So I'm not entirely clear about this yet.
> 
> How about:
> 
> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   no mod
>   unlock PTLn
> 
>   tlb_gather_mmu()
> 
>   lock PTLm
>   mod
>   include in tlb range
>   unlock PTLm
> 
>   lock PTLn
>   mod
>   unlock PTLn
> 
>   tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> 
> 
> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> true, right?

No, because CPU 1 mofified pte and added it into tlb range
so regardless of nested, it will flush TLB so there is no stale
TLB problem.

> 
> But even with an smp_mb__after_atomic() at CPU0's tlg_bather_mmu()
> you're not guaranteed CPU1 sees the increment. The only way to do that
> is to make the PTL locks RCsc and that is a much more expensive
> proposition.
> 
> 
> What about:
> 
> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   no mod
>   unlock PTLn
> 
> 
>   lock PTLm
>   mod
>   include in tlb range
>   unlock PTLm
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   mod
>   unlock PTLn
> 
>   tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> Do we want CPU1 to see it here? If so, where does it end?

Ditto. Since CPU 1 has added range, it will flush TLB regardless
of nested condition.

> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   no mod
>   unlock PTLn
> 
> 
>   lock PTLm
>   mod
>   include in tlb range
>   unlock PTLm
> 
>   tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   mod
>   unlock PTLn
> 
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> 
> This?
> 
> 
> Could you clarify under what exact condition mm_tlb_flush_nested() must
> return true?

mm_tlb_flush_nested aims for the CPU side where there is no pte update
but need TLB flush.
As I wrote https://marc.info/?l=linux-mm=150267398226529=2,
it has stable TLB problem if we don't flush TLB although there is no
pte modification.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Minchan Kim

On Sun, Aug 13, 2017 at 02:50:19PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 13, 2017 at 06:06:32AM +, Nadav Amit wrote:
> > > however mm_tlb_flush_nested() is a mystery, it appears to care about
> > > anything inside the range. For now rely on it doing at least _a_ PTL
> > > lock instead of taking  _the_ PTL lock.
> > 
> > It does not care about “anything” inside the range, but only on situations
> > in which there is at least one (same) PT that was modified by one core and
> > then read by the other. So, yes, it will always be _the_ same PTL, and not
> > _a_ PTL - in the cases that flush is really needed.
> > 
> > The issue that might require additional barriers is that
> > inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
> > not held. IIUC, since the release-acquire might not behave as a full memory
> > barrier, this requires an explicit memory barrier.
> 
> So I'm not entirely clear about this yet.
> 
> How about:
> 
> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   no mod
>   unlock PTLn
> 
>   tlb_gather_mmu()
> 
>   lock PTLm
>   mod
>   include in tlb range
>   unlock PTLm
> 
>   lock PTLn
>   mod
>   unlock PTLn
> 
>   tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> 
> 
> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> true, right?

No, because CPU 1 mofified pte and added it into tlb range
so regardless of nested, it will flush TLB so there is no stale
TLB problem.

> 
> But even with an smp_mb__after_atomic() at CPU0's tlg_bather_mmu()
> you're not guaranteed CPU1 sees the increment. The only way to do that
> is to make the PTL locks RCsc and that is a much more expensive
> proposition.
> 
> 
> What about:
> 
> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   no mod
>   unlock PTLn
> 
> 
>   lock PTLm
>   mod
>   include in tlb range
>   unlock PTLm
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   mod
>   unlock PTLn
> 
>   tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> Do we want CPU1 to see it here? If so, where does it end?

Ditto. Since CPU 1 has added range, it will flush TLB regardless
of nested condition.

> 
>   CPU0CPU1
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   no mod
>   unlock PTLn
> 
> 
>   lock PTLm
>   mod
>   include in tlb range
>   unlock PTLm
> 
>   tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> 
>   tlb_gather_mmu()
> 
>   lock PTLn
>   mod
>   unlock PTLn
> 
> arch_tlb_finish_mmu(force);
> 
> 
>   ... more ...
> 
>   tlb_finish_mmu()
> 
> 
> This?
> 
> 
> Could you clarify under what exact condition mm_tlb_flush_nested() must
> return true?

mm_tlb_flush_nested aims for the CPU side where there is no pte update
but need TLB flush.
As I wrote https://marc.info/?l=linux-mm=150267398226529=2,
it has stable TLB problem if we don't flush TLB although there is no
pte modification.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Minchan Kim

Hi Peter,

On Fri, Aug 11, 2017 at 04:04:50PM +0200, Peter Zijlstra wrote:
> 
> Ok, so I have the below to still go on-top.
> 
> Ideally someone would clarify the situation around
> mm_tlb_flush_nested(), because ideally we'd remove the
> smp_mb__after_atomic() and go back to relying on PTL alone.
> 
> This also removes the pointless smp_mb__before_atomic()

I'm not an expert of barrier stuff but IIUC, mm_tlb_flush_nested's
side full memory barrier can go with removing smp_mb__after_atomic
in inc_tlb_flush_pending side?


diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 490af494c2da..5ad0e66df363 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -544,7 +544,12 @@ static inline bool mm_tlb_flush_pending(struct mm_struct 
*mm)
  */
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
-   return atomic_read(>tlb_flush_pending) > 1;
+   /*
+* atomic_dec_and_test's full memory barrier guarantees
+* to see uptodate tlb_flush_pending count in other CPU
+* without relying on page table lock.
+*/
+   return !atomic_dec_and_test(>tlb_flush_pending);
 }
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/mm/memory.c b/mm/memory.c
index f571b0eb9816..e90b57bc65fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -407,6 +407,10 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct 
mm_struct *mm,
unsigned long start, unsigned long end)
 {
arch_tlb_gather_mmu(tlb, mm, start, end);
+   /*
+* couterpart is mm_tlb_flush_nested in tlb_finish_mmu
+* which decreases pending count.
+*/
inc_tlb_flush_pending(tlb->mm);
 }
 
@@ -446,9 +450,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 *
 */
bool force = mm_tlb_flush_nested(tlb->mm);
-
arch_tlb_finish_mmu(tlb, start, end, force);
-   dec_tlb_flush_pending(tlb->mm);
 }
 
 /*

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Minchan Kim

Hi Peter,

On Fri, Aug 11, 2017 at 04:04:50PM +0200, Peter Zijlstra wrote:
> 
> Ok, so I have the below to still go on-top.
> 
> Ideally someone would clarify the situation around
> mm_tlb_flush_nested(), because ideally we'd remove the
> smp_mb__after_atomic() and go back to relying on PTL alone.
> 
> This also removes the pointless smp_mb__before_atomic()

I'm not an expert of barrier stuff but IIUC, mm_tlb_flush_nested's
side full memory barrier can go with removing smp_mb__after_atomic
in inc_tlb_flush_pending side?


diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 490af494c2da..5ad0e66df363 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -544,7 +544,12 @@ static inline bool mm_tlb_flush_pending(struct mm_struct 
*mm)
  */
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
-   return atomic_read(>tlb_flush_pending) > 1;
+   /*
+* atomic_dec_and_test's full memory barrier guarantees
+* to see uptodate tlb_flush_pending count in other CPU
+* without relying on page table lock.
+*/
+   return !atomic_dec_and_test(>tlb_flush_pending);
 }
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/mm/memory.c b/mm/memory.c
index f571b0eb9816..e90b57bc65fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -407,6 +407,10 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct 
mm_struct *mm,
unsigned long start, unsigned long end)
 {
arch_tlb_gather_mmu(tlb, mm, start, end);
+   /*
+* couterpart is mm_tlb_flush_nested in tlb_finish_mmu
+* which decreases pending count.
+*/
inc_tlb_flush_pending(tlb->mm);
 }
 
@@ -446,9 +450,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 *
 */
bool force = mm_tlb_flush_nested(tlb->mm);
-
arch_tlb_finish_mmu(tlb, start, end, force);
-   dec_tlb_flush_pending(tlb->mm);
 }
 
 /*

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Peter Zijlstra

On Sun, Aug 13, 2017 at 06:06:32AM +, Nadav Amit wrote:
> > however mm_tlb_flush_nested() is a mystery, it appears to care about
> > anything inside the range. For now rely on it doing at least _a_ PTL
> > lock instead of taking  _the_ PTL lock.
> 
> It does not care about “anything” inside the range, but only on situations
> in which there is at least one (same) PT that was modified by one core and
> then read by the other. So, yes, it will always be _the_ same PTL, and not
> _a_ PTL - in the cases that flush is really needed.
> 
> The issue that might require additional barriers is that
> inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
> not held. IIUC, since the release-acquire might not behave as a full memory
> barrier, this requires an explicit memory barrier.

So I'm not entirely clear about this yet.

How about:


CPU0CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn

tlb_gather_mmu()

lock PTLm
mod
include in tlb range
unlock PTLm

lock PTLn
mod
unlock PTLn

tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()



In this case you also want CPU1's mm_tlb_flush_nested() call to return
true, right?

But even with an smp_mb__after_atomic() at CPU0's tlg_bather_mmu()
you're not guaranteed CPU1 sees the increment. The only way to do that
is to make the PTL locks RCsc and that is a much more expensive
proposition.


What about:


CPU0CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn


lock PTLm
mod
include in tlb range
unlock PTLm

tlb_gather_mmu()

lock PTLn
mod
unlock PTLn

tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()

Do we want CPU1 to see it here? If so, where does it end?


CPU0CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn


lock PTLm
mod
include in tlb range
unlock PTLm

tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);

tlb_gather_mmu()

lock PTLn
mod
unlock PTLn

  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()


This?


Could you clarify under what exact condition mm_tlb_flush_nested() must
return true?

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Peter Zijlstra

On Sun, Aug 13, 2017 at 06:06:32AM +, Nadav Amit wrote:
> > however mm_tlb_flush_nested() is a mystery, it appears to care about
> > anything inside the range. For now rely on it doing at least _a_ PTL
> > lock instead of taking  _the_ PTL lock.
> 
> It does not care about “anything” inside the range, but only on situations
> in which there is at least one (same) PT that was modified by one core and
> then read by the other. So, yes, it will always be _the_ same PTL, and not
> _a_ PTL - in the cases that flush is really needed.
> 
> The issue that might require additional barriers is that
> inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
> not held. IIUC, since the release-acquire might not behave as a full memory
> barrier, this requires an explicit memory barrier.

So I'm not entirely clear about this yet.

How about:


CPU0CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn

tlb_gather_mmu()

lock PTLm
mod
include in tlb range
unlock PTLm

lock PTLn
mod
unlock PTLn

tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()



In this case you also want CPU1's mm_tlb_flush_nested() call to return
true, right?

But even with an smp_mb__after_atomic() at CPU0's tlg_bather_mmu()
you're not guaranteed CPU1 sees the increment. The only way to do that
is to make the PTL locks RCsc and that is a much more expensive
proposition.


What about:


CPU0CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn


lock PTLm
mod
include in tlb range
unlock PTLm

tlb_gather_mmu()

lock PTLn
mod
unlock PTLn

tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);
  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()

Do we want CPU1 to see it here? If so, where does it end?


CPU0CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn


lock PTLm
mod
include in tlb range
unlock PTLm

tlb_finish_mmu()
  force = mm_tlb_flush_nested(tlb->mm);

tlb_gather_mmu()

lock PTLn
mod
unlock PTLn

  arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()


This?


Could you clarify under what exact condition mm_tlb_flush_nested() must
return true?

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Nadav Amit

Peter Zijlstra  wrote:

> 
> Ok, so I have the below to still go on-top.
> 
> Ideally someone would clarify the situation around
> mm_tlb_flush_nested(), because ideally we'd remove the
> smp_mb__after_atomic() and go back to relying on PTL alone.
> 
> This also removes the pointless smp_mb__before_atomic()
> 
> ---
> Subject: mm: Fix barriers for the tlb_flush_pending thing
> From: Peter Zijlstra 
> Date: Fri Aug 11 12:43:33 CEST 2017
> 
> I'm not 100% sure we always care about the same PTL and when we have
> SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of one does not
> in fact order against the LOCK of another lock. Therefore the
> documented scheme does not work if we care about multiple PTLs
> 
> mm_tlb_flush_pending() appears to only care about a single PTL:
> 
> - arch pte_accessible() (x86, arm64) only cares about that one PTE.
> - do_huge_pmd_numa_page() also only cares about a single (huge) page.
> - ksm write_protect_page() also only cares about a single page.
> 
> however mm_tlb_flush_nested() is a mystery, it appears to care about
> anything inside the range. For now rely on it doing at least _a_ PTL
> lock instead of taking  _the_ PTL lock.

It does not care about “anything” inside the range, but only on situations
in which there is at least one (same) PT that was modified by one core and
then read by the other. So, yes, it will always be _the_ same PTL, and not
_a_ PTL - in the cases that flush is really needed.

The issue that might require additional barriers is that
inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
not held. IIUC, since the release-acquire might not behave as a full memory
barrier, this requires an explicit memory barrier.

> Therefore add an explicit smp_mb__after_atomic() to cure things.
> 
> Also remove the smp_mb__before_atomic() on the dec side, as its
> completely pointless. We must rely on flush_tlb_range() to DTRT.

Good. It seemed fishy to me, but I was focused on the TLB consistency and
less on the barriers (that’s my excuse).

Nadav

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-13 Thread Nadav Amit

Peter Zijlstra  wrote:

> 
> Ok, so I have the below to still go on-top.
> 
> Ideally someone would clarify the situation around
> mm_tlb_flush_nested(), because ideally we'd remove the
> smp_mb__after_atomic() and go back to relying on PTL alone.
> 
> This also removes the pointless smp_mb__before_atomic()
> 
> ---
> Subject: mm: Fix barriers for the tlb_flush_pending thing
> From: Peter Zijlstra 
> Date: Fri Aug 11 12:43:33 CEST 2017
> 
> I'm not 100% sure we always care about the same PTL and when we have
> SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of one does not
> in fact order against the LOCK of another lock. Therefore the
> documented scheme does not work if we care about multiple PTLs
> 
> mm_tlb_flush_pending() appears to only care about a single PTL:
> 
> - arch pte_accessible() (x86, arm64) only cares about that one PTE.
> - do_huge_pmd_numa_page() also only cares about a single (huge) page.
> - ksm write_protect_page() also only cares about a single page.
> 
> however mm_tlb_flush_nested() is a mystery, it appears to care about
> anything inside the range. For now rely on it doing at least _a_ PTL
> lock instead of taking  _the_ PTL lock.

It does not care about “anything” inside the range, but only on situations
in which there is at least one (same) PT that was modified by one core and
then read by the other. So, yes, it will always be _the_ same PTL, and not
_a_ PTL - in the cases that flush is really needed.

The issue that might require additional barriers is that
inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
not held. IIUC, since the release-acquire might not behave as a full memory
barrier, this requires an explicit memory barrier.

> Therefore add an explicit smp_mb__after_atomic() to cure things.
> 
> Also remove the smp_mb__before_atomic() on the dec side, as its
> completely pointless. We must rely on flush_tlb_range() to DTRT.

Good. It seemed fishy to me, but I was focused on the TLB consistency and
less on the barriers (that’s my excuse).

Nadav

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra


Ok, so I have the below to still go on-top.

Ideally someone would clarify the situation around
mm_tlb_flush_nested(), because ideally we'd remove the
smp_mb__after_atomic() and go back to relying on PTL alone.

This also removes the pointless smp_mb__before_atomic()

---
Subject: mm: Fix barriers for the tlb_flush_pending thing
From: Peter Zijlstra 
Date: Fri Aug 11 12:43:33 CEST 2017

I'm not 100% sure we always care about the same PTL and when we have
SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of one does not
in fact order against the LOCK of another lock. Therefore the
documented scheme does not work if we care about multiple PTLs

mm_tlb_flush_pending() appears to only care about a single PTL:

 - arch pte_accessible() (x86, arm64) only cares about that one PTE.
 - do_huge_pmd_numa_page() also only cares about a single (huge) page.
 - ksm write_protect_page() also only cares about a single page.

however mm_tlb_flush_nested() is a mystery, it appears to care about
anything inside the range. For now rely on it doing at least _a_ PTL
lock instead of taking  _the_ PTL lock.

Therefore add an explicit smp_mb__after_atomic() to cure things.

Also remove the smp_mb__before_atomic() on the dec side, as its
completely pointless. We must rely on flush_tlb_range() to DTRT.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/mm_types.h |   38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -537,13 +537,13 @@ static inline bool mm_tlb_flush_pending(
 {
/*
 * Must be called with PTL held; such that our PTL acquire will have
-* observed the store from set_tlb_flush_pending().
+* observed the increment from inc_tlb_flush_pending().
 */
-   return atomic_read(>tlb_flush_pending) > 0;
+   return atomic_read(>tlb_flush_pending);
 }
 
 /*
- * Returns true if there are two above TLB batching threads in parallel.
+ * Returns true if there are two or more TLB batching threads in parallel.
  */
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
@@ -558,15 +558,12 @@ static inline void init_tlb_flush_pendin
 static inline void inc_tlb_flush_pending(struct mm_struct *mm)
 {
atomic_inc(>tlb_flush_pending);
-
/*
-* The only time this value is relevant is when there are indeed pages
-* to flush. And we'll only flush pages after changing them, which
-* requires the PTL.
-*
 * So the ordering here is:
 *
 *  atomic_inc(>tlb_flush_pending);
+*  smp_mb__after_atomic();
+*
 *  spin_lock();
 *  ...
 *  set_pte_at();
@@ -580,21 +577,30 @@ static inline void inc_tlb_flush_pending
 *  flush_tlb_range();
 *  atomic_dec(>tlb_flush_pending);
 *
-* So the =true store is constrained by the PTL unlock, and the =false
-* store is constrained by the TLB invalidate.
+* Where we order the increment against the PTE modification with the
+* smp_mb__after_atomic(). It would appear that the spin_unlock()
+* is sufficient to constrain the inc, because we only care about the
+* value if there is indeed a pending PTE modification. However with
+* SPLIT_PTE_PTLOCKS and RCpc locks (PPC) the UNLOCK of one lock does
+* not order against the LOCK of another lock.
+*
+* The decrement is ordered by the flush_tlb_range(), such that
+* mm_tlb_flush_pending() will not return false unless all flushes have
+* completed.
 */
+   smp_mb__after_atomic();
 }
 
-/* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
/*
-* Guarantee that the tlb_flush_pending does not not leak into the
-* critical section, since we must order the PTE change and changes to
-* the pending TLB flush indication. We could have relied on TLB flush
-* as a memory barrier, but this behavior is not clearly documented.
+* See inc_tlb_flush_pending().
+*
+* This cannot be smp_mb__before_atomic() because smp_mb() simply does
+* not order against TLB invalidate completion, which is what we need.
+*
+* Therefore we must rely on tlb_flush_*() to guarantee order.
 */
-   smp_mb__before_atomic();
atomic_dec(>tlb_flush_pending);
 }

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra


Ok, so I have the below to still go on-top.

Ideally someone would clarify the situation around
mm_tlb_flush_nested(), because ideally we'd remove the
smp_mb__after_atomic() and go back to relying on PTL alone.

This also removes the pointless smp_mb__before_atomic()

---
Subject: mm: Fix barriers for the tlb_flush_pending thing
From: Peter Zijlstra 
Date: Fri Aug 11 12:43:33 CEST 2017

I'm not 100% sure we always care about the same PTL and when we have
SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of one does not
in fact order against the LOCK of another lock. Therefore the
documented scheme does not work if we care about multiple PTLs

mm_tlb_flush_pending() appears to only care about a single PTL:

 - arch pte_accessible() (x86, arm64) only cares about that one PTE.
 - do_huge_pmd_numa_page() also only cares about a single (huge) page.
 - ksm write_protect_page() also only cares about a single page.

however mm_tlb_flush_nested() is a mystery, it appears to care about
anything inside the range. For now rely on it doing at least _a_ PTL
lock instead of taking  _the_ PTL lock.

Therefore add an explicit smp_mb__after_atomic() to cure things.

Also remove the smp_mb__before_atomic() on the dec side, as its
completely pointless. We must rely on flush_tlb_range() to DTRT.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/mm_types.h |   38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -537,13 +537,13 @@ static inline bool mm_tlb_flush_pending(
 {
/*
 * Must be called with PTL held; such that our PTL acquire will have
-* observed the store from set_tlb_flush_pending().
+* observed the increment from inc_tlb_flush_pending().
 */
-   return atomic_read(>tlb_flush_pending) > 0;
+   return atomic_read(>tlb_flush_pending);
 }
 
 /*
- * Returns true if there are two above TLB batching threads in parallel.
+ * Returns true if there are two or more TLB batching threads in parallel.
  */
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
@@ -558,15 +558,12 @@ static inline void init_tlb_flush_pendin
 static inline void inc_tlb_flush_pending(struct mm_struct *mm)
 {
atomic_inc(>tlb_flush_pending);
-
/*
-* The only time this value is relevant is when there are indeed pages
-* to flush. And we'll only flush pages after changing them, which
-* requires the PTL.
-*
 * So the ordering here is:
 *
 *  atomic_inc(>tlb_flush_pending);
+*  smp_mb__after_atomic();
+*
 *  spin_lock();
 *  ...
 *  set_pte_at();
@@ -580,21 +577,30 @@ static inline void inc_tlb_flush_pending
 *  flush_tlb_range();
 *  atomic_dec(>tlb_flush_pending);
 *
-* So the =true store is constrained by the PTL unlock, and the =false
-* store is constrained by the TLB invalidate.
+* Where we order the increment against the PTE modification with the
+* smp_mb__after_atomic(). It would appear that the spin_unlock()
+* is sufficient to constrain the inc, because we only care about the
+* value if there is indeed a pending PTE modification. However with
+* SPLIT_PTE_PTLOCKS and RCpc locks (PPC) the UNLOCK of one lock does
+* not order against the LOCK of another lock.
+*
+* The decrement is ordered by the flush_tlb_range(), such that
+* mm_tlb_flush_pending() will not return false unless all flushes have
+* completed.
 */
+   smp_mb__after_atomic();
 }
 
-/* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
/*
-* Guarantee that the tlb_flush_pending does not not leak into the
-* critical section, since we must order the PTE change and changes to
-* the pending TLB flush indication. We could have relied on TLB flush
-* as a memory barrier, but this behavior is not clearly documented.
+* See inc_tlb_flush_pending().
+*
+* This cannot be smp_mb__before_atomic() because smp_mb() simply does
+* not order against TLB invalidate completion, which is what we need.
+*
+* Therefore we must rely on tlb_flush_*() to guarantee order.
 */
-   smp_mb__before_atomic();
atomic_dec(>tlb_flush_pending);
 }

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Stephen Rothwell

Hi Ingo,

On Fri, 11 Aug 2017 14:44:25 +0200 Ingo Molnar  wrote:
>
> * Peter Zijlstra  wrote:
> 
> > On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:  
> > > I've done a minimal conflict resolution merge locally. Peter, could you 
> > > please 
> > > double check my resolution, in:
> > > 
> > >   040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve 
> > > conflicts  
> > 
> > That merge is a bit wonky, but not terminally broken afaict.
> > 
> > It now does two TLB flushes, the below cleans that up.  
> 
> Cool, thanks - I've applied it as a separate commit, to reduce the evilness 
> of the 
> merge commit.
> 
> Will push it all out in time to make Stephen's Monday morning a bit less of a 
> Monday morning.

Thanks you very much.

-- 
Cheers,
Stephen Rothwell

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Stephen Rothwell

Hi Ingo,

On Fri, 11 Aug 2017 14:44:25 +0200 Ingo Molnar  wrote:
>
> * Peter Zijlstra  wrote:
> 
> > On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:  
> > > I've done a minimal conflict resolution merge locally. Peter, could you 
> > > please 
> > > double check my resolution, in:
> > > 
> > >   040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve 
> > > conflicts  
> > 
> > That merge is a bit wonky, but not terminally broken afaict.
> > 
> > It now does two TLB flushes, the below cleans that up.  
> 
> Cool, thanks - I've applied it as a separate commit, to reduce the evilness 
> of the 
> merge commit.
> 
> Will push it all out in time to make Stephen's Monday morning a bit less of a 
> Monday morning.

Thanks you very much.

-- 
Cheers,
Stephen Rothwell

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> > I've done a minimal conflict resolution merge locally. Peter, could you 
> > please 
> > double check my resolution, in:
> > 
> >   040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts
> 
> That merge is a bit wonky, but not terminally broken afaict.
> 
> It now does two TLB flushes, the below cleans that up.

Cool, thanks - I've applied it as a separate commit, to reduce the evilness of 
the 
merge commit.

Will push it all out in time to make Stephen's Monday morning a bit less of a 
Monday morning.

Thanks,

Ingo

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> > I've done a minimal conflict resolution merge locally. Peter, could you 
> > please 
> > double check my resolution, in:
> > 
> >   040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts
> 
> That merge is a bit wonky, but not terminally broken afaict.
> 
> It now does two TLB flushes, the below cleans that up.

Cool, thanks - I've applied it as a separate commit, to reduce the evilness of 
the 
merge commit.

Will push it all out in time to make Stephen's Monday morning a bit less of a 
Monday morning.

Thanks,

Ingo

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra

On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> I've done a minimal conflict resolution merge locally. Peter, could you 
> please 
> double check my resolution, in:
> 
>   040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts

That merge is a bit wonky, but not terminally broken afaict.

It now does two TLB flushes, the below cleans that up.

---
 mm/huge_memory.c | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ce883459e246..08f6c1993832 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1410,7 +1410,6 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
-   bool need_flush = false;
bool page_locked;
bool migrated = false;
bool was_writable;
@@ -1497,22 +1496,18 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t 
pmd)
}
 
/*
-* The page_table_lock above provides a memory barrier
-* with change_protection_range.
-*/
-   if (mm_tlb_flush_pending(vma->vm_mm))
-   flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
-   /*
 * Since we took the NUMA fault, we must have observed the !accessible
 * bit. Make sure all other CPUs agree with that, to avoid them
 * modifying the page we're about to migrate.
 *
 * Must be done under PTL such that we'll observe the relevant
-* set_tlb_flush_pending().
+* inc_tlb_flush_pending().
+*
+* We are not sure a pending tlb flush here is for a huge page
+* mapping or not. Hence use the tlb range variant
 */
if (mm_tlb_flush_pending(vma->vm_mm))
-   need_flush = true;
+   flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 
/*
 * Migrate the THP to the requested node, returns with page unlocked
@@ -1520,13 +1515,6 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t 
pmd)
 */
spin_unlock(vmf->ptl);
 
-   /*
-* We are not sure a pending tlb flush here is for a huge page
-* mapping or not. Hence use the tlb range variant
-*/
-   if (need_flush)
-   flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
vmf->pmd, pmd, vmf->address, page, target_nid);
if (migrated) {

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra

On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> I've done a minimal conflict resolution merge locally. Peter, could you 
> please 
> double check my resolution, in:
> 
>   040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts

That merge is a bit wonky, but not terminally broken afaict.

It now does two TLB flushes, the below cleans that up.

---
 mm/huge_memory.c | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ce883459e246..08f6c1993832 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1410,7 +1410,6 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
-   bool need_flush = false;
bool page_locked;
bool migrated = false;
bool was_writable;
@@ -1497,22 +1496,18 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t 
pmd)
}
 
/*
-* The page_table_lock above provides a memory barrier
-* with change_protection_range.
-*/
-   if (mm_tlb_flush_pending(vma->vm_mm))
-   flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
-   /*
 * Since we took the NUMA fault, we must have observed the !accessible
 * bit. Make sure all other CPUs agree with that, to avoid them
 * modifying the page we're about to migrate.
 *
 * Must be done under PTL such that we'll observe the relevant
-* set_tlb_flush_pending().
+* inc_tlb_flush_pending().
+*
+* We are not sure a pending tlb flush here is for a huge page
+* mapping or not. Hence use the tlb range variant
 */
if (mm_tlb_flush_pending(vma->vm_mm))
-   need_flush = true;
+   flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 
/*
 * Migrate the THP to the requested node, returns with page unlocked
@@ -1520,13 +1515,6 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t 
pmd)
 */
spin_unlock(vmf->ptl);
 
-   /*
-* We are not sure a pending tlb flush here is for a huge page
-* mapping or not. Hence use the tlb range variant
-*/
-   if (need_flush)
-   flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
vmf->pmd, pmd, vmf->address, page, target_nid);
if (migrated) {

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Ingo Molnar


* Stephen Rothwell  wrote:

> Hi Peter,
> 
> On Fri, 11 Aug 2017 11:34:49 +0200 Peter Zijlstra  
> wrote:
> >
> > On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > > 
> > > Today's linux-next merge of the akpm-current tree got conflicts in:
> > > 
> > >   include/linux/mm_types.h
> > >   mm/huge_memory.c
> > > 
> > > between commit:
> > > 
> > >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > > 
> > > from the tip tree and commits:
> > > 
> > >   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> > >   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as 
> > > long as possible"")
> > > 
> > > from the akpm-current tree.
> > > 
> > > The latter 2 are now in Linus' tree as well (but were not when I started
> > > the day).
> >
> > Here's two patches that apply on top of tip.
> 
> What I will really need (on Monday) is a merge resolution between
> Linus' tree and the tip tree ...

I've done a minimal conflict resolution merge locally. Peter, could you please 
double check my resolution, in:

  040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts

Thanks,

Ingo

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Ingo Molnar


* Stephen Rothwell  wrote:

> Hi Peter,
> 
> On Fri, 11 Aug 2017 11:34:49 +0200 Peter Zijlstra  
> wrote:
> >
> > On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > > 
> > > Today's linux-next merge of the akpm-current tree got conflicts in:
> > > 
> > >   include/linux/mm_types.h
> > >   mm/huge_memory.c
> > > 
> > > between commit:
> > > 
> > >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > > 
> > > from the tip tree and commits:
> > > 
> > >   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> > >   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as 
> > > long as possible"")
> > > 
> > > from the akpm-current tree.
> > > 
> > > The latter 2 are now in Linus' tree as well (but were not when I started
> > > the day).
> >
> > Here's two patches that apply on top of tip.
> 
> What I will really need (on Monday) is a merge resolution between
> Linus' tree and the tip tree ...

I've done a minimal conflict resolution merge locally. Peter, could you please 
double check my resolution, in:

  040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts

Thanks,

Ingo

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Stephen Rothwell

Hi Peter,

On Fri, 11 Aug 2017 11:34:49 +0200 Peter Zijlstra  wrote:
>
> On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > 
> > Today's linux-next merge of the akpm-current tree got conflicts in:
> > 
> >   include/linux/mm_types.h
> >   mm/huge_memory.c
> > 
> > between commit:
> > 
> >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > 
> > from the tip tree and commits:
> > 
> >   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> >   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as 
> > long as possible"")
> > 
> > from the akpm-current tree.
> > 
> > The latter 2 are now in Linus' tree as well (but were not when I started
> > the day).
>
> Here's two patches that apply on top of tip.

What I will really need (on Monday) is a merge resolution between
Linus' tree and the tip tree ...

-- 
Cheers,
Stephen Rothwell

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Stephen Rothwell

Hi Peter,

On Fri, 11 Aug 2017 11:34:49 +0200 Peter Zijlstra  wrote:
>
> On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > 
> > Today's linux-next merge of the akpm-current tree got conflicts in:
> > 
> >   include/linux/mm_types.h
> >   mm/huge_memory.c
> > 
> > between commit:
> > 
> >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > 
> > from the tip tree and commits:
> > 
> >   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> >   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as 
> > long as possible"")
> > 
> > from the akpm-current tree.
> > 
> > The latter 2 are now in Linus' tree as well (but were not when I started
> > the day).
>
> Here's two patches that apply on top of tip.

What I will really need (on Monday) is a merge resolution between
Linus' tree and the tip tree ...

-- 
Cheers,
Stephen Rothwell

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra

On Fri, Aug 11, 2017 at 11:34:49AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Today's linux-next merge of the akpm-current tree got conflicts in:
> > 
> >   include/linux/mm_types.h
> >   mm/huge_memory.c
> > 
> > between commit:
> > 
> >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > 
> > from the tip tree and commits:
> > 
> >   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> >   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as 
> > long as possible"")
> > 
> > from the akpm-current tree.
> > 
> > The latter 2 are now in Linus' tree as well (but were not when I started
> > the day).
> > 
> > The only way forward I could see was to revert
> > 
> >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > 
> > and the three following commits
> > 
> >   ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() 
> > usage")
> >   d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
> >   a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
> > 
> > before merging the akpm-current tree again.
> 
> Here's two patches that apply on top of tip.
> 


And here's one to fix the PPC ordering issue I found while doing those
patches.


---
Subject: mm: Fix barrier for inc_tlb_flush_pending() for PPC
From: Peter Zijlstra 
Date: Fri Aug 11 12:43:33 CEST 2017

When we have SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of
one does not in fact order against the LOCK of another lock. Therefore
the documented scheme does not work.

Add an explicit smp_mb__after_atomic() to cure things.

Also update the comment to reflect the new inc/dec thing.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/mm_types.h |   34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -533,7 +533,7 @@ static inline bool mm_tlb_flush_pending(
 {
/*
 * Must be called with PTL held; such that our PTL acquire will have
-* observed the store from set_tlb_flush_pending().
+* observed the increment from inc_tlb_flush_pending().
 */
return atomic_read(>tlb_flush_pending);
 }
@@ -547,13 +547,11 @@ static inline void inc_tlb_flush_pending
 {
atomic_inc(>tlb_flush_pending);
/*
-* The only time this value is relevant is when there are indeed pages
-* to flush. And we'll only flush pages after changing them, which
-* requires the PTL.
-*
 * So the ordering here is:
 *
-*  mm->tlb_flush_pending = true;
+*  atomic_inc(>tlb_flush_pending)
+*  smp_mb__after_atomic();
+*
 *  spin_lock();
 *  ...
 *  set_pte_at();
@@ -565,17 +563,33 @@ static inline void inc_tlb_flush_pending
 *  spin_unlock();
 *
 *  flush_tlb_range();
-*  mm->tlb_flush_pending = false;
+*  atomic_dec(>tlb_flush_pending);
 *
-* So the =true store is constrained by the PTL unlock, and the =false
-* store is constrained by the TLB invalidate.
+* Where we order the increment against the PTE modification with the
+* smp_mb__after_atomic(). It would appear that the spin_unlock()
+* is sufficient to constrain the inc, because we only care about the
+* value if there is indeed a pending PTE modification. However with
+* SPLIT_PTE_PTLOCKS and RCpc locks (PPC) the UNLOCK of one lock does
+* not order against the LOCK of another lock.
+*
+* The decrement is ordered by the flush_tlb_range(), such that
+* mm_tlb_flush_pending() will not return false unless all flushes have
+* completed.
 */
+   smp_mb__after_atomic();
 }
 
 /* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
-   /* see set_tlb_flush_pending */
+   /*
+* See inc_tlb_flush_pending().
+*
+* This cannot be smp_mb__before_atomic() because smp_mb() simply does
+* not order against TLB invalidate completion, which is what we need.
+*
+* Therefore we must rely on tlb_flush_*() to guarantee order.
+*/
atomic_dec(>tlb_flush_pending);
 }
 #else

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra

On Fri, Aug 11, 2017 at 11:34:49AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Today's linux-next merge of the akpm-current tree got conflicts in:
> > 
> >   include/linux/mm_types.h
> >   mm/huge_memory.c
> > 
> > between commit:
> > 
> >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > 
> > from the tip tree and commits:
> > 
> >   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> >   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as 
> > long as possible"")
> > 
> > from the akpm-current tree.
> > 
> > The latter 2 are now in Linus' tree as well (but were not when I started
> > the day).
> > 
> > The only way forward I could see was to revert
> > 
> >   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > 
> > and the three following commits
> > 
> >   ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() 
> > usage")
> >   d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
> >   a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
> > 
> > before merging the akpm-current tree again.
> 
> Here's two patches that apply on top of tip.
> 


And here's one to fix the PPC ordering issue I found while doing those
patches.


---
Subject: mm: Fix barrier for inc_tlb_flush_pending() for PPC
From: Peter Zijlstra 
Date: Fri Aug 11 12:43:33 CEST 2017

When we have SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of
one does not in fact order against the LOCK of another lock. Therefore
the documented scheme does not work.

Add an explicit smp_mb__after_atomic() to cure things.

Also update the comment to reflect the new inc/dec thing.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/mm_types.h |   34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -533,7 +533,7 @@ static inline bool mm_tlb_flush_pending(
 {
/*
 * Must be called with PTL held; such that our PTL acquire will have
-* observed the store from set_tlb_flush_pending().
+* observed the increment from inc_tlb_flush_pending().
 */
return atomic_read(>tlb_flush_pending);
 }
@@ -547,13 +547,11 @@ static inline void inc_tlb_flush_pending
 {
atomic_inc(>tlb_flush_pending);
/*
-* The only time this value is relevant is when there are indeed pages
-* to flush. And we'll only flush pages after changing them, which
-* requires the PTL.
-*
 * So the ordering here is:
 *
-*  mm->tlb_flush_pending = true;
+*  atomic_inc(>tlb_flush_pending)
+*  smp_mb__after_atomic();
+*
 *  spin_lock();
 *  ...
 *  set_pte_at();
@@ -565,17 +563,33 @@ static inline void inc_tlb_flush_pending
 *  spin_unlock();
 *
 *  flush_tlb_range();
-*  mm->tlb_flush_pending = false;
+*  atomic_dec(>tlb_flush_pending);
 *
-* So the =true store is constrained by the PTL unlock, and the =false
-* store is constrained by the TLB invalidate.
+* Where we order the increment against the PTE modification with the
+* smp_mb__after_atomic(). It would appear that the spin_unlock()
+* is sufficient to constrain the inc, because we only care about the
+* value if there is indeed a pending PTE modification. However with
+* SPLIT_PTE_PTLOCKS and RCpc locks (PPC) the UNLOCK of one lock does
+* not order against the LOCK of another lock.
+*
+* The decrement is ordered by the flush_tlb_range(), such that
+* mm_tlb_flush_pending() will not return false unless all flushes have
+* completed.
 */
+   smp_mb__after_atomic();
 }
 
 /* Clearing is done after a TLB flush, which also provides a barrier. */
 static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
-   /* see set_tlb_flush_pending */
+   /*
+* See inc_tlb_flush_pending().
+*
+* This cannot be smp_mb__before_atomic() because smp_mb() simply does
+* not order against TLB invalidate completion, which is what we need.
+*
+* Therefore we must rely on tlb_flush_*() to guarantee order.
+*/
atomic_dec(>tlb_flush_pending);
 }
 #else

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra

On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   include/linux/mm_types.h
>   mm/huge_memory.c
> 
> between commit:
> 
>   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> 
> from the tip tree and commits:
> 
>   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
>   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long 
> as possible"")
> 
> from the akpm-current tree.
> 
> The latter 2 are now in Linus' tree as well (but were not when I started
> the day).
> 
> The only way forward I could see was to revert
> 
>   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> 
> and the three following commits
> 
>   ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() usage")
>   d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
>   a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
> 
> before merging the akpm-current tree again.

Here's two patches that apply on top of tip.

Subject: mm: migrate: prevent racy access to tlb_flush_pending
From: Nadav Amit 
Date: Tue, 1 Aug 2017 17:08:12 -0700

Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.

This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.

An actual data corruption was not observed, yet the race was
was confirmed by adding assertion to check tlb_flush_pending is not set
by two threads, adding artificial latency in change_protection_range()
and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")


Cc: 
Cc: CC: 
Cc: Andy Lutomirski 
Signed-off-by: Nadav Amit 
Acked-by: Mel Gorman 
Acked-by: Rik van Riel 
Acked-by: Minchan Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: http://lkml.kernel.org/r/20170802000818.4760-2-na...@vmware.com
---
 include/linux/mm_types.h |   29 +
 kernel/fork.c|2 +-
 mm/debug.c   |2 +-
 mm/mprotect.c|4 ++--
 4 files changed, 25 insertions(+), 12 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -493,7 +493,7 @@ struct mm_struct {
 	 * can move process memory needs to flush the TLB when moving a
 	 * PROT_NONE or PROT_NUMA mapped page.
 	 */
-	bool tlb_flush_pending;
+	atomic_t tlb_flush_pending;
 #endif
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	/* See flush_tlb_batched_pending() */
@@ -535,11 +535,17 @@ static inline bool mm_tlb_flush_pending(
 	 * Must be called with PTL held; such that our PTL acquire will have
 	 * observed the store from set_tlb_flush_pending().
 	 */
-	return mm->tlb_flush_pending;
+	return atomic_read(>tlb_flush_pending);
 }
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
-	mm->tlb_flush_pending = true;
+	atomic_set(>tlb_flush_pending, 0);
+}
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+	atomic_inc(>tlb_flush_pending);
 	/*
 	 * The only time this value is relevant is when there are indeed pages
 	 * to flush. And we'll only flush pages after changing them, which
@@ -565,21 +571,28 @@ static inline void set_tlb_flush_pending
 	 * store is constrained by the TLB invalidate.
 	 */
 }
+
 /* Clearing is done after a TLB flush, which also provides a barrier. */
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 	/* see set_tlb_flush_pending */
-	mm->tlb_flush_pending = false;
+	atomic_dec(>tlb_flush_pending);
 }
 #else
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
 	return false;
 }
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
 }
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 }
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -809,7 +809,7 @@ static struct mm_struct *mm_init(struct
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
-	clear_tlb_flush_pending(mm);
+	init_tlb_flush_pending(mm);
 #if

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-08-11 Thread Peter Zijlstra

On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   include/linux/mm_types.h
>   mm/huge_memory.c
> 
> between commit:
> 
>   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> 
> from the tip tree and commits:
> 
>   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
>   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long 
> as possible"")
> 
> from the akpm-current tree.
> 
> The latter 2 are now in Linus' tree as well (but were not when I started
> the day).
> 
> The only way forward I could see was to revert
> 
>   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> 
> and the three following commits
> 
>   ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() usage")
>   d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
>   a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
> 
> before merging the akpm-current tree again.

Here's two patches that apply on top of tip.

Subject: mm: migrate: prevent racy access to tlb_flush_pending
From: Nadav Amit 
Date: Tue, 1 Aug 2017 17:08:12 -0700

Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.

This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.

An actual data corruption was not observed, yet the race was
was confirmed by adding assertion to check tlb_flush_pending is not set
by two threads, adding artificial latency in change_protection_range()
and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")


Cc: 
Cc: CC: 
Cc: Andy Lutomirski 
Signed-off-by: Nadav Amit 
Acked-by: Mel Gorman 
Acked-by: Rik van Riel 
Acked-by: Minchan Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: http://lkml.kernel.org/r/20170802000818.4760-2-na...@vmware.com
---
 include/linux/mm_types.h |   29 +
 kernel/fork.c|2 +-
 mm/debug.c   |2 +-
 mm/mprotect.c|4 ++--
 4 files changed, 25 insertions(+), 12 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -493,7 +493,7 @@ struct mm_struct {
 	 * can move process memory needs to flush the TLB when moving a
 	 * PROT_NONE or PROT_NUMA mapped page.
 	 */
-	bool tlb_flush_pending;
+	atomic_t tlb_flush_pending;
 #endif
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	/* See flush_tlb_batched_pending() */
@@ -535,11 +535,17 @@ static inline bool mm_tlb_flush_pending(
 	 * Must be called with PTL held; such that our PTL acquire will have
 	 * observed the store from set_tlb_flush_pending().
 	 */
-	return mm->tlb_flush_pending;
+	return atomic_read(>tlb_flush_pending);
 }
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
-	mm->tlb_flush_pending = true;
+	atomic_set(>tlb_flush_pending, 0);
+}
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+	atomic_inc(>tlb_flush_pending);
 	/*
 	 * The only time this value is relevant is when there are indeed pages
 	 * to flush. And we'll only flush pages after changing them, which
@@ -565,21 +571,28 @@ static inline void set_tlb_flush_pending
 	 * store is constrained by the TLB invalidate.
 	 */
 }
+
 /* Clearing is done after a TLB flush, which also provides a barrier. */
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 	/* see set_tlb_flush_pending */
-	mm->tlb_flush_pending = false;
+	atomic_dec(>tlb_flush_pending);
 }
 #else
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
 	return false;
 }
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
 }
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 }
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -809,7 +809,7 @@ static struct mm_struct *mm_init(struct
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
-	clear_tlb_flush_pending(mm);
+	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-04-19 Thread NeilBrown

On Wed, Apr 12 2017, Vlastimil Babka wrote:

> On 12.4.2017 8:46, Stephen Rothwell wrote:
>> Hi Andrew,
>> 
>> Today's linux-next merge of the akpm-current tree got conflicts in:
>> 
>>   drivers/block/nbd.c
>>   drivers/scsi/iscsi_tcp.c
>>   net/core/dev.c
>>   net/core/sock.c
>> 
>> between commit:
>> 
>>   717a94b5fc70 ("sched/core: Remove 'task' parameter and rename 
>> tsk_restore_flags() to current_restore_flags()")
>> 
>> from the tip tree and commit:
>> 
>>   61d5ad5b2e8a ("treewide: convert PF_MEMALLOC manipulations to new helpers")
>> 
>> from the akpm-current tree.
>
> Yeah, the first patch from Neil renames a function (as its subject says) and 
> the
> second patch from me converts most of its users to new helpers specific to the
> PF_MEMALLOC flags.
>
>> I fixed it up (the latter is just a superset of the former, so I used
>
> It's not a complete superset though, more on that below.
>
>> that) and can carry the fix as necessary. This is now fixed as far as
>> linux-next is concerned, but any non trivial conflicts should be mentioned
>> to your upstream maintainer when your tree is submitted for merging.
>> You may also want to consider cooperating with the maintainer of the
>> conflicting tree to minimise any particularly complex conflicts.
>
> Hmm I could redo my patch on top of Neil's patch, but then Andrew would have 
> to
> carry Neil's patch as well just to have a working mmotm? And then make sure to
> send my patch (but not Neil's) only after the tip tree is pulled? Would that
> work for the maintainers involved?
>
>> It looks like there may be more instances that the latter patch should
>> update.
>
> I see two remaining instances of current_restore_flags(). One in 
> __do_softirq()
> is even for PF_MEMALLOC, but there the flag is cleared first and then set 
> back,
> which is opposite of the common case that my helpers provide. The other in 
> nfsd
> is for PF_LESS_THROTTLE which is not common enough to earn own helpers yet. 
> IIRC
> Neil originally wanted to add a new one?

[Sorry - I thought I had sent this last week, but just noticed that I didn't]

In general, I'm not a fan of overly-specific helpers.

As a general rule, tsk_restore_flags() is probably better than
current_restore_flags() as it is more general.
However in this specific case, using any task other than 'current' would
almost certainly be incorrect code as locking is impossible.  So I
prefer the 'current' to be implicit, but the actual flag to be explicit.

If you are going to add helpers for setting/clearing PF flags, I would
much rather that you take

#define current_test_flags(f)   (current->flags & (f))
#define current_set_flags_nested(sp, f) \
(*(sp) = current->flags, current->flags |= (f))
#define current_clear_flags_nested(sp, f)   \
(*(sp) = current->flags, current->flags &= ~(f))
#define current_restore_flags_nested(sp, f) \
(current->flags = ((current->flags & ~(f)) | (*(sp) & (f

out of fs/xfs/xfs_linux.h and use them globally.

Your
  noreclaim_flag = memalloc_reclaim_save()
becomes
  current_set_flags_nested_flag, PF_MEMALLOC)
which is more typing, but arguably easier to read.

If you then changed all uses of tsk_restore_flags() to use
current_restore_flags_nested(), my patch could be discarded as
irrelevant.

Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-04-19 Thread NeilBrown

On Wed, Apr 12 2017, Vlastimil Babka wrote:

> On 12.4.2017 8:46, Stephen Rothwell wrote:
>> Hi Andrew,
>> 
>> Today's linux-next merge of the akpm-current tree got conflicts in:
>> 
>>   drivers/block/nbd.c
>>   drivers/scsi/iscsi_tcp.c
>>   net/core/dev.c
>>   net/core/sock.c
>> 
>> between commit:
>> 
>>   717a94b5fc70 ("sched/core: Remove 'task' parameter and rename 
>> tsk_restore_flags() to current_restore_flags()")
>> 
>> from the tip tree and commit:
>> 
>>   61d5ad5b2e8a ("treewide: convert PF_MEMALLOC manipulations to new helpers")
>> 
>> from the akpm-current tree.
>
> Yeah, the first patch from Neil renames a function (as its subject says) and 
> the
> second patch from me converts most of its users to new helpers specific to the
> PF_MEMALLOC flags.
>
>> I fixed it up (the latter is just a superset of the former, so I used
>
> It's not a complete superset though, more on that below.
>
>> that) and can carry the fix as necessary. This is now fixed as far as
>> linux-next is concerned, but any non trivial conflicts should be mentioned
>> to your upstream maintainer when your tree is submitted for merging.
>> You may also want to consider cooperating with the maintainer of the
>> conflicting tree to minimise any particularly complex conflicts.
>
> Hmm I could redo my patch on top of Neil's patch, but then Andrew would have 
> to
> carry Neil's patch as well just to have a working mmotm? And then make sure to
> send my patch (but not Neil's) only after the tip tree is pulled? Would that
> work for the maintainers involved?
>
>> It looks like there may be more instances that the latter patch should
>> update.
>
> I see two remaining instances of current_restore_flags(). One in 
> __do_softirq()
> is even for PF_MEMALLOC, but there the flag is cleared first and then set 
> back,
> which is opposite of the common case that my helpers provide. The other in 
> nfsd
> is for PF_LESS_THROTTLE which is not common enough to earn own helpers yet. 
> IIRC
> Neil originally wanted to add a new one?

[Sorry - I thought I had sent this last week, but just noticed that I didn't]

In general, I'm not a fan of overly-specific helpers.

As a general rule, tsk_restore_flags() is probably better than
current_restore_flags() as it is more general.
However in this specific case, using any task other than 'current' would
almost certainly be incorrect code as locking is impossible.  So I
prefer the 'current' to be implicit, but the actual flag to be explicit.

If you are going to add helpers for setting/clearing PF flags, I would
much rather that you take

#define current_test_flags(f)   (current->flags & (f))
#define current_set_flags_nested(sp, f) \
(*(sp) = current->flags, current->flags |= (f))
#define current_clear_flags_nested(sp, f)   \
(*(sp) = current->flags, current->flags &= ~(f))
#define current_restore_flags_nested(sp, f) \
(current->flags = ((current->flags & ~(f)) | (*(sp) & (f

out of fs/xfs/xfs_linux.h and use them globally.

Your
  noreclaim_flag = memalloc_reclaim_save()
becomes
  current_set_flags_nested_flag, PF_MEMALLOC)
which is more typing, but arguably easier to read.

If you then changed all uses of tsk_restore_flags() to use
current_restore_flags_nested(), my patch could be discarded as
irrelevant.

Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-04-12 Thread Vlastimil Babka

On 12.4.2017 8:46, Stephen Rothwell wrote:
> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   drivers/block/nbd.c
>   drivers/scsi/iscsi_tcp.c
>   net/core/dev.c
>   net/core/sock.c
> 
> between commit:
> 
>   717a94b5fc70 ("sched/core: Remove 'task' parameter and rename 
> tsk_restore_flags() to current_restore_flags()")
> 
> from the tip tree and commit:
> 
>   61d5ad5b2e8a ("treewide: convert PF_MEMALLOC manipulations to new helpers")
> 
> from the akpm-current tree.

Yeah, the first patch from Neil renames a function (as its subject says) and the
second patch from me converts most of its users to new helpers specific to the
PF_MEMALLOC flags.

> I fixed it up (the latter is just a superset of the former, so I used

It's not a complete superset though, more on that below.

> that) and can carry the fix as necessary. This is now fixed as far as
> linux-next is concerned, but any non trivial conflicts should be mentioned
> to your upstream maintainer when your tree is submitted for merging.
> You may also want to consider cooperating with the maintainer of the
> conflicting tree to minimise any particularly complex conflicts.

Hmm I could redo my patch on top of Neil's patch, but then Andrew would have to
carry Neil's patch as well just to have a working mmotm? And then make sure to
send my patch (but not Neil's) only after the tip tree is pulled? Would that
work for the maintainers involved?

> It looks like there may be more instances that the latter patch should
> update.

I see two remaining instances of current_restore_flags(). One in __do_softirq()
is even for PF_MEMALLOC, but there the flag is cleared first and then set back,
which is opposite of the common case that my helpers provide. The other in nfsd
is for PF_LESS_THROTTLE which is not common enough to earn own helpers yet. IIRC
Neil originally wanted to add a new one?

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2017-04-12 Thread Vlastimil Babka

On 12.4.2017 8:46, Stephen Rothwell wrote:
> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   drivers/block/nbd.c
>   drivers/scsi/iscsi_tcp.c
>   net/core/dev.c
>   net/core/sock.c
> 
> between commit:
> 
>   717a94b5fc70 ("sched/core: Remove 'task' parameter and rename 
> tsk_restore_flags() to current_restore_flags()")
> 
> from the tip tree and commit:
> 
>   61d5ad5b2e8a ("treewide: convert PF_MEMALLOC manipulations to new helpers")
> 
> from the akpm-current tree.

Yeah, the first patch from Neil renames a function (as its subject says) and the
second patch from me converts most of its users to new helpers specific to the
PF_MEMALLOC flags.

> I fixed it up (the latter is just a superset of the former, so I used

It's not a complete superset though, more on that below.

> that) and can carry the fix as necessary. This is now fixed as far as
> linux-next is concerned, but any non trivial conflicts should be mentioned
> to your upstream maintainer when your tree is submitted for merging.
> You may also want to consider cooperating with the maintainer of the
> conflicting tree to minimise any particularly complex conflicts.

Hmm I could redo my patch on top of Neil's patch, but then Andrew would have to
carry Neil's patch as well just to have a working mmotm? And then make sure to
send my patch (but not Neil's) only after the tip tree is pulled? Would that
work for the maintainers involved?

> It looks like there may be more instances that the latter patch should
> update.

I see two remaining instances of current_restore_flags(). One in __do_softirq()
is even for PF_MEMALLOC, but there the flag is cleared first and then set back,
which is opposite of the common case that my helpers provide. The other in nfsd
is for PF_LESS_THROTTLE which is not common enough to earn own helpers yet. IIRC
Neil originally wanted to add a new one?

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-06-18 Thread Manfred Spraul


Hi,

On 06/15/2016 07:23 AM, Stephen Rothwell wrote:

Hi Andrew,

Today's linux-next merge of the akpm-current tree got a conflict in:

   ipc/sem.c

between commit:

   33ac279677dc ("locking/barriers: Introduce smp_acquire__after_ctrl_dep()")

from the tip tree and commit:

   a1c58ea067cb ("ipc/sem.c: Fix complex_count vs. simple op race")

from the akpm-current tree.

Just in case, I have created a rediff of my patch against -tip.
And the patch with hysteresis would be ready as well.

I will send both patches.

More testers would be welcome, I can only test it on my laptop.

--
Manfred

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-06-18 Thread Manfred Spraul


Hi,

On 06/15/2016 07:23 AM, Stephen Rothwell wrote:

Hi Andrew,

Today's linux-next merge of the akpm-current tree got a conflict in:

   ipc/sem.c

between commit:

   33ac279677dc ("locking/barriers: Introduce smp_acquire__after_ctrl_dep()")

from the tip tree and commit:

   a1c58ea067cb ("ipc/sem.c: Fix complex_count vs. simple op race")

from the akpm-current tree.

Just in case, I have created a rediff of my patch against -tip.
And the patch with hysteresis would be ready as well.

I will send both patches.

More testers would be welcome, I can only test it on my laptop.

--
Manfred

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-04-29 Thread Ingo Molnar


* Stephen Rothwell  wrote:

> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   include/linux/efi.h
> 
> between commit:
> 
>   2c23b73c2d02 ("Ard Biesheuvel ")
> 
> from the tip tree and commit:
> 
>   9f2c36a7b097 ("include/linux/efi.h: redefine type, constant, macro from 
> generic code")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Btw., while looking at this, I noticed that akpm-current introduced this 
namespace 
collision:

include/acpi/acconfig.h:#define UUID_STRING_LENGTH  36  /* Total length 
of a UUID string */
include/linux/uuid.h:#defineUUID_STRING_LEN 36

I suspect the include/acpi/acconfig.h define should be renamed:

UUID_STRING_LENGTH -> ACPI_UUID_STRING_LENGTH
UUID_BUFFER_LENGTH -> ACPI_UUID_BUFFER_LENGTH

... before the collision causes any trouble.

Thanks,

Ingo

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-04-29 Thread Ingo Molnar


* Stephen Rothwell  wrote:

> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   include/linux/efi.h
> 
> between commit:
> 
>   2c23b73c2d02 ("Ard Biesheuvel ")
> 
> from the tip tree and commit:
> 
>   9f2c36a7b097 ("include/linux/efi.h: redefine type, constant, macro from 
> generic code")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Btw., while looking at this, I noticed that akpm-current introduced this 
namespace 
collision:

include/acpi/acconfig.h:#define UUID_STRING_LENGTH  36  /* Total length 
of a UUID string */
include/linux/uuid.h:#defineUUID_STRING_LEN 36

I suspect the include/acpi/acconfig.h define should be renamed:

UUID_STRING_LENGTH -> ACPI_UUID_STRING_LENGTH
UUID_BUFFER_LENGTH -> ACPI_UUID_BUFFER_LENGTH

... before the collision causes any trouble.

Thanks,

Ingo

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-02-26 Thread Andrew Morton

On Fri, 26 Feb 2016 16:07:12 +1100 Stephen Rothwell  
wrote:

> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   mm/mprotect.c
> 
> between commit:
> 
>   62b5f7d013fc ("mm/core, x86/mm/pkeys: Add execute-only protection keys 
> support")
> 
> from the tip tree and commit:
> 
>   aff3915ff831 ("mm/mprotect.c: don't imply PROT_EXEC on non-exec fs")
> 
> from the akpm-current tree.
> 
> I fixed it up (I think - see below) and can carry the fix as necessary
> (no action is required).
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc mm/mprotect.c
> index fa37c4cd973a,6ff5dfa65b33..
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@@ -414,7 -409,11 +411,11 @@@ SYSCALL_DEFINE3(mprotect, unsigned long
>   
>   /* Here we know that vma->vm_start <= nstart < vma->vm_end. */
>   
> + /* Does the application expect PROT_READ to imply PROT_EXEC */
> + if (rier && (vma->vm_flags & VM_MAYEXEC))
> + prot |= PROT_EXEC;
> + 
>  -newflags = calc_vm_prot_bits(prot);
>  +newflags = calc_vm_prot_bits(prot, pkey);
>   newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
>   
>   /* newflags >> 4 shift VM_MAY% in place of VM_% */

OK, thanks.

I moved this patch
(mm-mprotectc-dont-imply-prot_exec-on-non-exec-fs.patch) into the
"post-linux-next" section and reworked it to accommodate the -tip
changes.



From: Piotr Kwapulinski 
Subject: mm/mprotect.c: don't imply PROT_EXEC on non-exec fs

The mprotect(PROT_READ) fails when called by the READ_IMPLIES_EXEC binary
on a memory mapped file located on non-exec fs.  The mprotect does not
check whether fs is _executable_ or not.  The PROT_EXEC flag is set
automatically even if a memory mapped file is located on non-exec fs.  Fix
it by checking whether a memory mapped file is located on a non-exec fs. 
If so the PROT_EXEC is not implied by the PROT_READ.  The implementation
uses the VM_MAYEXEC flag set properly in mmap.  Now it is consistent with
mmap.

I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs).  I
also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
seems to work.

Signed-off-by: Piotr Kwapulinski 
Cc: Mel Gorman 
Cc: Kirill A. Shutemov 
Cc: Aneesh Kumar K.V 
Cc: Benjamin Herrenschmidt 
Cc: Konstantin Khlebnikov 
Cc: Dan Williams 
Cc: Dave Hansen 
Signed-off-by: Andrew Morton 
---

 mm/mprotect.c |   13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff -puN mm/mprotect.c~mm-mprotectc-dont-imply-prot_exec-on-non-exec-fs 
mm/mprotect.c
--- a/mm/mprotect.c~mm-mprotectc-dont-imply-prot_exec-on-non-exec-fs
+++ a/mm/mprotect.c
@@ -359,6 +359,9 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
struct vm_area_struct *vma, *prev;
int error = -EINVAL;
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
+   const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
+   (prot & PROT_READ);
+
prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
return -EINVAL;
@@ -375,11 +378,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
return -EINVAL;
 
reqprot = prot;
-   /*
-* Does the application expect PROT_READ to imply PROT_EXEC:
-*/
-   if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
-   prot |= PROT_EXEC;
 
down_write(>mm->mmap_sem);
 
@@ -414,6 +412,10 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
+   /* Does the application expect PROT_READ to imply PROT_EXEC */
+   if (rier && (vma->vm_flags & VM_MAYEXEC))
+   prot |= PROT_EXEC;
+
newflags = calc_vm_prot_bits(prot, pkey);
newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
@@ -445,6 +447,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
error = -ENOMEM;
goto out;
}
+   prot = reqprot;
}
 out:
up_write(>mm->mmap_sem);
_

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-02-26 Thread Andrew Morton

On Fri, 26 Feb 2016 16:07:12 +1100 Stephen Rothwell  
wrote:

> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   mm/mprotect.c
> 
> between commit:
> 
>   62b5f7d013fc ("mm/core, x86/mm/pkeys: Add execute-only protection keys 
> support")
> 
> from the tip tree and commit:
> 
>   aff3915ff831 ("mm/mprotect.c: don't imply PROT_EXEC on non-exec fs")
> 
> from the akpm-current tree.
> 
> I fixed it up (I think - see below) and can carry the fix as necessary
> (no action is required).
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc mm/mprotect.c
> index fa37c4cd973a,6ff5dfa65b33..
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@@ -414,7 -409,11 +411,11 @@@ SYSCALL_DEFINE3(mprotect, unsigned long
>   
>   /* Here we know that vma->vm_start <= nstart < vma->vm_end. */
>   
> + /* Does the application expect PROT_READ to imply PROT_EXEC */
> + if (rier && (vma->vm_flags & VM_MAYEXEC))
> + prot |= PROT_EXEC;
> + 
>  -newflags = calc_vm_prot_bits(prot);
>  +newflags = calc_vm_prot_bits(prot, pkey);
>   newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
>   
>   /* newflags >> 4 shift VM_MAY% in place of VM_% */

OK, thanks.

I moved this patch
(mm-mprotectc-dont-imply-prot_exec-on-non-exec-fs.patch) into the
"post-linux-next" section and reworked it to accommodate the -tip
changes.



From: Piotr Kwapulinski 
Subject: mm/mprotect.c: don't imply PROT_EXEC on non-exec fs

The mprotect(PROT_READ) fails when called by the READ_IMPLIES_EXEC binary
on a memory mapped file located on non-exec fs.  The mprotect does not
check whether fs is _executable_ or not.  The PROT_EXEC flag is set
automatically even if a memory mapped file is located on non-exec fs.  Fix
it by checking whether a memory mapped file is located on a non-exec fs. 
If so the PROT_EXEC is not implied by the PROT_READ.  The implementation
uses the VM_MAYEXEC flag set properly in mmap.  Now it is consistent with
mmap.

I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs).  I
also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
seems to work.

Signed-off-by: Piotr Kwapulinski 
Cc: Mel Gorman 
Cc: Kirill A. Shutemov 
Cc: Aneesh Kumar K.V 
Cc: Benjamin Herrenschmidt 
Cc: Konstantin Khlebnikov 
Cc: Dan Williams 
Cc: Dave Hansen 
Signed-off-by: Andrew Morton 
---

 mm/mprotect.c |   13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff -puN mm/mprotect.c~mm-mprotectc-dont-imply-prot_exec-on-non-exec-fs 
mm/mprotect.c
--- a/mm/mprotect.c~mm-mprotectc-dont-imply-prot_exec-on-non-exec-fs
+++ a/mm/mprotect.c
@@ -359,6 +359,9 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
struct vm_area_struct *vma, *prev;
int error = -EINVAL;
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
+   const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
+   (prot & PROT_READ);
+
prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
return -EINVAL;
@@ -375,11 +378,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
return -EINVAL;
 
reqprot = prot;
-   /*
-* Does the application expect PROT_READ to imply PROT_EXEC:
-*/
-   if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
-   prot |= PROT_EXEC;
 
down_write(>mm->mmap_sem);
 
@@ -414,6 +412,10 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
+   /* Does the application expect PROT_READ to imply PROT_EXEC */
+   if (rier && (vma->vm_flags & VM_MAYEXEC))
+   prot |= PROT_EXEC;
+
newflags = calc_vm_prot_bits(prot, pkey);
newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
@@ -445,6 +447,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
error = -ENOMEM;
goto out;
}
+   prot = reqprot;
}
 out:
up_write(>mm->mmap_sem);
_

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-02-19 Thread Ard Biesheuvel

On 19 February 2016 at 05:09, Stephen Rothwell  wrote:
> Hi Andrew,
>
> Today's linux-next merge of the akpm-current tree got a conflict in:
>
>   arch/x86/mm/extable.c
>
> between commit:
>
>   548acf19234d ("x86/mm: Expand the exception table logic to allow new 
> handling options")
>
> from the tip tree and commit:
>
>   f1cd2c09ff09 ("x86/extable: use generic search and sort routines")
>
> from the akpm-current tree.
>
> I couldn't figure out how to fix this up, so I just dropped the
> akpm-current tree patch.
>

Hi Andrew,

Unfortunately, this is not the only problem currently with my extable
series. The arm64 patch now also conflicts with patches that are
queued in the arm64 tree.

So could you please drop all six of them for now? I will ask Catalin
to take the ones that are essential to the arm64 KASLR implementation
via the arm64 tree, and once that hits mainline, I will rebase and
resubmit the remaining patches.

extable-add-support-for-relative-extables-to-search-and-sort-routines.patch
alpha-extable-use-generic-search-and-sort-routines.patch
s390-extable-use-generic-search-and-sort-routines.patch
x86-extable-use-generic-search-and-sort-routines.patch
ia64-extable-use-generic-search-and-sort-routines.patch
arm64-switch-to-relative-exception-tables.patch

Thanks,
Ard.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-02-19 Thread Ard Biesheuvel

On 19 February 2016 at 05:09, Stephen Rothwell  wrote:
> Hi Andrew,
>
> Today's linux-next merge of the akpm-current tree got a conflict in:
>
>   arch/x86/mm/extable.c
>
> between commit:
>
>   548acf19234d ("x86/mm: Expand the exception table logic to allow new 
> handling options")
>
> from the tip tree and commit:
>
>   f1cd2c09ff09 ("x86/extable: use generic search and sort routines")
>
> from the akpm-current tree.
>
> I couldn't figure out how to fix this up, so I just dropped the
> akpm-current tree patch.
>

Hi Andrew,

Unfortunately, this is not the only problem currently with my extable
series. The arm64 patch now also conflicts with patches that are
queued in the arm64 tree.

So could you please drop all six of them for now? I will ask Catalin
to take the ones that are essential to the arm64 KASLR implementation
via the arm64 tree, and once that hits mainline, I will rebase and
resubmit the remaining patches.

extable-add-support-for-relative-extables-to-search-and-sort-routines.patch
alpha-extable-use-generic-search-and-sort-routines.patch
s390-extable-use-generic-search-and-sort-routines.patch
x86-extable-use-generic-search-and-sort-routines.patch
ia64-extable-use-generic-search-and-sort-routines.patch
arm64-switch-to-relative-exception-tables.patch

Thanks,
Ard.

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-16 Thread Geert Uytterhoeven

Hi Andrew,

On Wed, Sep 9, 2015 at 1:21 AM, Andrew Morton  wrote:
> New syscalls are rather a pain, both from the patch-monkeying POV and
> also because nobody knows what the syscall numbers will be until
> everything lands in mainline.  Oh well, it doesn't happen often and
> it's easy stuff.

One more reason to let the assignment of syscall numbers be handled
(1) by the architecture maintainer, (2) after -rc1, even for x86.

If x86 is no more the canonical source, scripts/checksyscalls.sh needs an
update, though.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-16 Thread Geert Uytterhoeven

Hi Andrew,

On Wed, Sep 9, 2015 at 1:21 AM, Andrew Morton  wrote:
> New syscalls are rather a pain, both from the patch-monkeying POV and
> also because nobody knows what the syscall numbers will be until
> everything lands in mainline.  Oh well, it doesn't happen often and
> it's easy stuff.

One more reason to let the assignment of syscall numbers be handled
(1) by the architecture maintainer, (2) after -rc1, even for x86.

If x86 is no more the canonical source, scripts/checksyscalls.sh needs an
update, though.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Andrew Morton

On Tue, 8 Sep 2015 16:03:23 -0700 Linus Torvalds 
 wrote:

> On Tue, Sep 8, 2015 at 3:56 PM, Stephen Rothwell  
> wrote:
> >
> > I have been applying that patch I sent to you to -next for some time.
> > I guess I expected Andrew to pick it up when he rebased his patch
> > series before submitting it to you.  These things sometimes slip
> > through the cracks.
> 
> I suspect Andrew saw that patch, and thought it was a merge fixup like
> you sometimes send out, and didn't realize that it actually applied
> directly to his series.

I've had it all the time, as a post-linux-next fixup - the idea being
that I send it to you after its linux-next preconditions have been
merged up.

However I failed to put that patch inside the stephen-take-these-bits
markers, so it never went from -mm into -next.

New syscalls are rather a pain, both from the patch-monkeying POV and
also because nobody knows what the syscall numbers will be until
everything lands in mainline.  Oh well, it doesn't happen often and
it's easy stuff.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Linus Torvalds

On Tue, Sep 8, 2015 at 3:56 PM, Stephen Rothwell  wrote:
>
> I have been applying that patch I sent to you to -next for some time.
> I guess I expected Andrew to pick it up when he rebased his patch
> series before submitting it to you.  These things sometimes slip
> through the cracks.

I suspect Andrew saw that patch, and thought it was a merge fixup like
you sometimes send out, and didn't realize that it actually applied
directly to his series.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Stephen Rothwell

Hi Linus,

On Tue, 8 Sep 2015 11:11:25 -0700 Linus Torvalds 
 wrote:
>
> On Mon, Sep 7, 2015 at 4:35 PM, Stephen Rothwell  
> wrote:
> >
> > The below patch was missed when the userfaultfd stuff and the x86 changes
> > were merged.  I have repeated the patch in the clear below.
> 
> When forwarding patches, please add your sign-off. I can see (and
> apply) the original in this thread, so I guess it doesn't matter in
> this particular case, but in general that's what you should be doing
> so that I don't then have to find the other email just to apply the
> patch from the original author.

Ooops, sorry about that.  I really should know better.

> Also, this wasn't actually a real merge error. This was just a bug in
> the whole process. Andrew's patches sent to me had been updated with
> the right number for all the other cases, why hadn't this been folded
> into the series too? Apparently Andrew's series has simply been buggy
> for the last month or more, and that was true even before it was sent
> to me and while it was cooking in -next.. Tssk.

I have been applying that patch I sent to you to -next for some time.
I guess I expected Andrew to pick it up when he rebased his patch
series before submitting it to you.  These things sometimes slip
through the cracks.

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Linus Torvalds

On Mon, Sep 7, 2015 at 4:35 PM, Stephen Rothwell  wrote:
>
> The below patch was missed when the userfaultfd stuff and the x86 changes
> were merged.  I have repeated the patch in the clear below.

When forwarding patches, please add your sign-off. I can see (and
apply) the original in this thread, so I guess it doesn't matter in
this particular case, but in general that's what you should be doing
so that I don't then have to find the other email just to apply the
patch from the original author.

Also, this wasn't actually a real merge error. This was just a bug in
the whole process. Andrew's patches sent to me had been updated with
the right number for all the other cases, why hadn't this been folded
into the series too? Apparently Andrew's series has simply been buggy
for the last month or more, and that was true even before it was sent
to me and while it was cooking in -next.. Tssk.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Andrew Morton

On Tue, 8 Sep 2015 16:03:23 -0700 Linus Torvalds 
 wrote:

> On Tue, Sep 8, 2015 at 3:56 PM, Stephen Rothwell  
> wrote:
> >
> > I have been applying that patch I sent to you to -next for some time.
> > I guess I expected Andrew to pick it up when he rebased his patch
> > series before submitting it to you.  These things sometimes slip
> > through the cracks.
> 
> I suspect Andrew saw that patch, and thought it was a merge fixup like
> you sometimes send out, and didn't realize that it actually applied
> directly to his series.

I've had it all the time, as a post-linux-next fixup - the idea being
that I send it to you after its linux-next preconditions have been
merged up.

However I failed to put that patch inside the stephen-take-these-bits
markers, so it never went from -mm into -next.

New syscalls are rather a pain, both from the patch-monkeying POV and
also because nobody knows what the syscall numbers will be until
everything lands in mainline.  Oh well, it doesn't happen often and
it's easy stuff.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Stephen Rothwell

Hi Linus,

On Tue, 8 Sep 2015 11:11:25 -0700 Linus Torvalds 
 wrote:
>
> On Mon, Sep 7, 2015 at 4:35 PM, Stephen Rothwell  
> wrote:
> >
> > The below patch was missed when the userfaultfd stuff and the x86 changes
> > were merged.  I have repeated the patch in the clear below.
> 
> When forwarding patches, please add your sign-off. I can see (and
> apply) the original in this thread, so I guess it doesn't matter in
> this particular case, but in general that's what you should be doing
> so that I don't then have to find the other email just to apply the
> patch from the original author.

Ooops, sorry about that.  I really should know better.

> Also, this wasn't actually a real merge error. This was just a bug in
> the whole process. Andrew's patches sent to me had been updated with
> the right number for all the other cases, why hadn't this been folded
> into the series too? Apparently Andrew's series has simply been buggy
> for the last month or more, and that was true even before it was sent
> to me and while it was cooking in -next.. Tssk.

I have been applying that patch I sent to you to -next for some time.
I guess I expected Andrew to pick it up when he rebased his patch
series before submitting it to you.  These things sometimes slip
through the cracks.

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Linus Torvalds

On Tue, Sep 8, 2015 at 3:56 PM, Stephen Rothwell  wrote:
>
> I have been applying that patch I sent to you to -next for some time.
> I guess I expected Andrew to pick it up when he rebased his patch
> series before submitting it to you.  These things sometimes slip
> through the cracks.

I suspect Andrew saw that patch, and thought it was a merge fixup like
you sometimes send out, and didn't realize that it actually applied
directly to his series.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-08 Thread Linus Torvalds

On Mon, Sep 7, 2015 at 4:35 PM, Stephen Rothwell  wrote:
>
> The below patch was missed when the userfaultfd stuff and the x86 changes
> were merged.  I have repeated the patch in the clear below.

When forwarding patches, please add your sign-off. I can see (and
apply) the original in this thread, so I guess it doesn't matter in
this particular case, but in general that's what you should be doing
so that I don't then have to find the other email just to apply the
patch from the original author.

Also, this wasn't actually a real merge error. This was just a bug in
the whole process. Andrew's patches sent to me had been updated with
the right number for all the other cases, why hadn't this been folded
into the series too? Apparently Andrew's series has simply been buggy
for the last month or more, and that was true even before it was sent
to me and while it was cooking in -next.. Tssk.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-07 Thread Stephen Rothwell

Hi Linus,

On Wed, 29 Jul 2015 19:12:56 +0200 Andrea Arcangeli  wrote:
>
> On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
> >  -359   i386userfaultfd sys_userfaultfd
> > ++374   i386userfaultfd sys_userfaultfd
> 
> Do I understand correctly the syscall number of userfaultfd for x86
> 32bit has just changed from 359 to 374? Appreciated that you CCed me
> on such a relevant change to be sure I didn't miss it.
> 
> Then the below is needed as well.

The below patch was missed when the userfaultfd stuff and the x86 changes
were merged.  I have repeated the patch in the clear below.

From: Andrea Arcangeli 
Date: Wed, 29 Jul 2015 18:53:17 +0200
Subject: [PATCH] userfaultfd: selftest: update userfaultfd x86 32bit syscall 
number

It changed as result of linux-next merge of other syscalls.

Signed-off-by: Andrea Arcangeli 
---
 tools/testing/selftests/vm/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 0c0b839..76071b1 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -69,7 +69,7 @@
 #ifdef __x86_64__
 #define __NR_userfaultfd 323
 #elif defined(__i386__)
-#define __NR_userfaultfd 359
+#define __NR_userfaultfd 374
 #elif defined(__powewrpc__)
 #define __NR_userfaultfd 364
 #else

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-09-07 Thread Stephen Rothwell

Hi Linus,

On Wed, 29 Jul 2015 19:12:56 +0200 Andrea Arcangeli  wrote:
>
> On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
> >  -359   i386userfaultfd sys_userfaultfd
> > ++374   i386userfaultfd sys_userfaultfd
> 
> Do I understand correctly the syscall number of userfaultfd for x86
> 32bit has just changed from 359 to 374? Appreciated that you CCed me
> on such a relevant change to be sure I didn't miss it.
> 
> Then the below is needed as well.

The below patch was missed when the userfaultfd stuff and the x86 changes
were merged.  I have repeated the patch in the clear below.

From: Andrea Arcangeli 
Date: Wed, 29 Jul 2015 18:53:17 +0200
Subject: [PATCH] userfaultfd: selftest: update userfaultfd x86 32bit syscall 
number

It changed as result of linux-next merge of other syscalls.

Signed-off-by: Andrea Arcangeli 
---
 tools/testing/selftests/vm/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 0c0b839..76071b1 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -69,7 +69,7 @@
 #ifdef __x86_64__
 #define __NR_userfaultfd 323
 #elif defined(__i386__)
-#define __NR_userfaultfd 359
+#define __NR_userfaultfd 374
 #elif defined(__powewrpc__)
 #define __NR_userfaultfd 364
 #else

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-30 Thread Andrea Arcangeli

On Wed, Jul 29, 2015 at 08:46:10PM +0200, Thomas Gleixner wrote:
> On Wed, 29 Jul 2015, Andy Lutomirski wrote:
> > -tip people: want to assign Andrea a pair of syscall numbers?
> 
> Sure, just send a patch 

Awesome, I just sent the patch to register the syscall against -tip
with the usual placeholder that will return -ENOSYS.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-30 Thread Andrea Arcangeli

On Wed, Jul 29, 2015 at 08:46:10PM +0200, Thomas Gleixner wrote:
 On Wed, 29 Jul 2015, Andy Lutomirski wrote:
  -tip people: want to assign Andrea a pair of syscall numbers?
 
 Sure, just send a patch 

Awesome, I just sent the patch to register the syscall against -tip
with the usual placeholder that will return -ENOSYS.

Thanks,
Andrea
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Thomas Gleixner

On Thu, 30 Jul 2015, Stephen Rothwell wrote:
> Hi Andrea,
> 
> On Wed, 29 Jul 2015 19:12:56 +0200 Andrea Arcangeli  
> wrote:
> >
> > On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
> > >  -359 i386userfaultfd sys_userfaultfd
> > > ++374 i386userfaultfd sys_userfaultfd
> > 
> > Do I understand correctly the syscall number of userfaultfd for x86
> > 32bit has just changed from 359 to 374? Appreciated that you CCed me
> > on such a relevant change to be sure I didn't miss it.
> > 
> > Then the below is needed as well.
> 
> I have added the below patch to linux-next from today.
> 
> > One related question: is it ok to ship kernels in production right now
> > with the userfaultfd syscall number 374 for x86 32bit ABI (after the
> > above change) and 323 for x86-64 64bit ABI, with these syscalls number
> > registered in linux-next or it may keep changing like it has just
> > happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
> > 64bit, not all other syscalls in linux-next.
> 
> These numbers are certainly not in any way official, they are just the
> result of my merge conflict fixup.  So, yes, they could change again if
> someone adds another new syscall to any tree but Andrew's.
> 
> > Of course, I know full well that the standard answer is no, and in
> > fact the above is an expected and fine change. In other words what I'm
> > really asking is if I wonder if I could get an agreement here that
> > from now on, the syscall number of userfaultfd for x86 32bit and
> > x86-64 64bit won't change anymore in linux-next and it's already
> > reserved just like if it was already upstream.
> 
> Like Thomas said, send a patch to the x86 maintainers.  I suspect (if
> the rest of the implementation needs to stay in Andrew's tree) that it
> could be a simple as a patch to the syscall tables using ni_syscall and
> a comment.  Thomas?

Yes, that's all it takes to reserve a syscall number.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Stephen Rothwell

Hi Andrea,

On Wed, 29 Jul 2015 19:12:56 +0200 Andrea Arcangeli  wrote:
>
> On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
> >  -359   i386userfaultfd sys_userfaultfd
> > ++374   i386userfaultfd sys_userfaultfd
> 
> Do I understand correctly the syscall number of userfaultfd for x86
> 32bit has just changed from 359 to 374? Appreciated that you CCed me
> on such a relevant change to be sure I didn't miss it.
> 
> Then the below is needed as well.

I have added the below patch to linux-next from today.

> One related question: is it ok to ship kernels in production right now
> with the userfaultfd syscall number 374 for x86 32bit ABI (after the
> above change) and 323 for x86-64 64bit ABI, with these syscalls number
> registered in linux-next or it may keep changing like it has just
> happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
> 64bit, not all other syscalls in linux-next.

These numbers are certainly not in any way official, they are just the
result of my merge conflict fixup.  So, yes, they could change again if
someone adds another new syscall to any tree but Andrew's.

> Of course, I know full well that the standard answer is no, and in
> fact the above is an expected and fine change. In other words what I'm
> really asking is if I wonder if I could get an agreement here that
> from now on, the syscall number of userfaultfd for x86 32bit and
> x86-64 64bit won't change anymore in linux-next and it's already
> reserved just like if it was already upstream.

Like Thomas said, send a patch to the x86 maintainers.  I suspect (if
the rest of the implementation needs to stay in Andrew's tree) that it
could be a simple as a patch to the syscall tables using ni_syscall and
a comment.  Thomas?

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Thomas Gleixner

On Wed, 29 Jul 2015, Andy Lutomirski wrote:
> On Wed, Jul 29, 2015 at 10:12 AM, Andrea Arcangeli  
> wrote:
> > Hello Stephen,
> >
> > On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
> >>  -359 i386userfaultfd sys_userfaultfd
> >> ++374 i386userfaultfd sys_userfaultfd
> >
> > Do I understand correctly the syscall number of userfaultfd for x86
> > 32bit has just changed from 359 to 374? Appreciated that you CCed me
> > on such a relevant change to be sure I didn't miss it.
> >
> > Then the below is needed as well.
> >
> > One related question: is it ok to ship kernels in production right now
> > with the userfaultfd syscall number 374 for x86 32bit ABI (after the
> > above change) and 323 for x86-64 64bit ABI, with these syscalls number
> > registered in linux-next or it may keep changing like it has just
> > happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
> > 64bit, not all other syscalls in linux-next.
> >
> > Of course, I know full well that the standard answer is no, and in
> > fact the above is an expected and fine change. In other words what I'm
> > really asking is if I wonder if I could get an agreement here that
> > from now on, the syscall number of userfaultfd for x86 32bit and
> > x86-64 64bit won't change anymore in linux-next and it's already
> > reserved just like if it was already upstream.
> >
> > Again: I'd only seek such guarantee for the x86-64 64bit and x86 32bit
> > ABIs (not any other arch, and not any other syscall). If I could get
> > such a guarantee from you within the next week or two, that would
> > avoid me complications and some work, so I thought it was worth
> > asking. If it's not possible never mind.
> 
> My (limited) understanding is that this is up to the arch maintainers.
> I certainly didn't intend to preempt your syscall number, but my patch
> beat your patch to -tip :-p
> 
> -tip people: want to assign Andrea a pair of syscall numbers?

Sure, just send a patch 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Andy Lutomirski

On Wed, Jul 29, 2015 at 10:12 AM, Andrea Arcangeli  wrote:
> Hello Stephen,
>
> On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
>>  -359 i386userfaultfd sys_userfaultfd
>> ++374 i386userfaultfd sys_userfaultfd
>
> Do I understand correctly the syscall number of userfaultfd for x86
> 32bit has just changed from 359 to 374? Appreciated that you CCed me
> on such a relevant change to be sure I didn't miss it.
>
> Then the below is needed as well.
>
> One related question: is it ok to ship kernels in production right now
> with the userfaultfd syscall number 374 for x86 32bit ABI (after the
> above change) and 323 for x86-64 64bit ABI, with these syscalls number
> registered in linux-next or it may keep changing like it has just
> happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
> 64bit, not all other syscalls in linux-next.
>
> Of course, I know full well that the standard answer is no, and in
> fact the above is an expected and fine change. In other words what I'm
> really asking is if I wonder if I could get an agreement here that
> from now on, the syscall number of userfaultfd for x86 32bit and
> x86-64 64bit won't change anymore in linux-next and it's already
> reserved just like if it was already upstream.
>
> Again: I'd only seek such guarantee for the x86-64 64bit and x86 32bit
> ABIs (not any other arch, and not any other syscall). If I could get
> such a guarantee from you within the next week or two, that would
> avoid me complications and some work, so I thought it was worth
> asking. If it's not possible never mind.

My (limited) understanding is that this is up to the arch maintainers.
I certainly didn't intend to preempt your syscall number, but my patch
beat your patch to -tip :-p

-tip people: want to assign Andrea a pair of syscall numbers?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Andrea Arcangeli

Hello Stephen,

On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
>  -359 i386userfaultfd sys_userfaultfd
> ++374 i386userfaultfd sys_userfaultfd

Do I understand correctly the syscall number of userfaultfd for x86
32bit has just changed from 359 to 374? Appreciated that you CCed me
on such a relevant change to be sure I didn't miss it.

Then the below is needed as well.

One related question: is it ok to ship kernels in production right now
with the userfaultfd syscall number 374 for x86 32bit ABI (after the
above change) and 323 for x86-64 64bit ABI, with these syscalls number
registered in linux-next or it may keep changing like it has just
happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
64bit, not all other syscalls in linux-next.

Of course, I know full well that the standard answer is no, and in
fact the above is an expected and fine change. In other words what I'm
really asking is if I wonder if I could get an agreement here that
from now on, the syscall number of userfaultfd for x86 32bit and
x86-64 64bit won't change anymore in linux-next and it's already
reserved just like if it was already upstream.

Again: I'd only seek such guarantee for the x86-64 64bit and x86 32bit
ABIs (not any other arch, and not any other syscall). If I could get
such a guarantee from you within the next week or two, that would
avoid me complications and some work, so I thought it was worth
asking. If it's not possible never mind.

Thanks,
Andrea

===
>From 873093c32b4b1d0b6c3f18ec1e52b56c24f67457 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Wed, 29 Jul 2015 18:53:17 +0200
Subject: [PATCH] userfaultfd: selftest: update userfaultfd x86 32bit syscall
 number

It changed as result of linux-next merge of other syscalls.

Signed-off-by: Andrea Arcangeli 
---
 tools/testing/selftests/vm/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 0c0b839..76071b1 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -69,7 +69,7 @@
 #ifdef __x86_64__
 #define __NR_userfaultfd 323
 #elif defined(__i386__)
-#define __NR_userfaultfd 359
+#define __NR_userfaultfd 374
 #elif defined(__powewrpc__)
 #define __NR_userfaultfd 364
 #else
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Andrea Arcangeli

Hello Stephen,

On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
  -359 i386userfaultfd sys_userfaultfd
 ++374 i386userfaultfd sys_userfaultfd

Do I understand correctly the syscall number of userfaultfd for x86
32bit has just changed from 359 to 374? Appreciated that you CCed me
on such a relevant change to be sure I didn't miss it.

Then the below is needed as well.

One related question: is it ok to ship kernels in production right now
with the userfaultfd syscall number 374 for x86 32bit ABI (after the
above change) and 323 for x86-64 64bit ABI, with these syscalls number
registered in linux-next or it may keep changing like it has just
happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
64bit, not all other syscalls in linux-next.

Of course, I know full well that the standard answer is no, and in
fact the above is an expected and fine change. In other words what I'm
really asking is if I wonder if I could get an agreement here that
from now on, the syscall number of userfaultfd for x86 32bit and
x86-64 64bit won't change anymore in linux-next and it's already
reserved just like if it was already upstream.

Again: I'd only seek such guarantee for the x86-64 64bit and x86 32bit
ABIs (not any other arch, and not any other syscall). If I could get
such a guarantee from you within the next week or two, that would
avoid me complications and some work, so I thought it was worth
asking. If it's not possible never mind.

Thanks,
Andrea

===
From 873093c32b4b1d0b6c3f18ec1e52b56c24f67457 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli aarca...@redhat.com
Date: Wed, 29 Jul 2015 18:53:17 +0200
Subject: [PATCH] userfaultfd: selftest: update userfaultfd x86 32bit syscall
 number

It changed as result of linux-next merge of other syscalls.

Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---
 tools/testing/selftests/vm/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 0c0b839..76071b1 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -69,7 +69,7 @@
 #ifdef __x86_64__
 #define __NR_userfaultfd 323
 #elif defined(__i386__)
-#define __NR_userfaultfd 359
+#define __NR_userfaultfd 374
 #elif defined(__powewrpc__)
 #define __NR_userfaultfd 364
 #else
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Andy Lutomirski

On Wed, Jul 29, 2015 at 10:12 AM, Andrea Arcangeli aarca...@redhat.com wrote:
 Hello Stephen,

 On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
  -359 i386userfaultfd sys_userfaultfd
 ++374 i386userfaultfd sys_userfaultfd

 Do I understand correctly the syscall number of userfaultfd for x86
 32bit has just changed from 359 to 374? Appreciated that you CCed me
 on such a relevant change to be sure I didn't miss it.

 Then the below is needed as well.

 One related question: is it ok to ship kernels in production right now
 with the userfaultfd syscall number 374 for x86 32bit ABI (after the
 above change) and 323 for x86-64 64bit ABI, with these syscalls number
 registered in linux-next or it may keep changing like it has just
 happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
 64bit, not all other syscalls in linux-next.

 Of course, I know full well that the standard answer is no, and in
 fact the above is an expected and fine change. In other words what I'm
 really asking is if I wonder if I could get an agreement here that
 from now on, the syscall number of userfaultfd for x86 32bit and
 x86-64 64bit won't change anymore in linux-next and it's already
 reserved just like if it was already upstream.

 Again: I'd only seek such guarantee for the x86-64 64bit and x86 32bit
 ABIs (not any other arch, and not any other syscall). If I could get
 such a guarantee from you within the next week or two, that would
 avoid me complications and some work, so I thought it was worth
 asking. If it's not possible never mind.

My (limited) understanding is that this is up to the arch maintainers.
I certainly didn't intend to preempt your syscall number, but my patch
beat your patch to -tip :-p

-tip people: want to assign Andrea a pair of syscall numbers?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Thomas Gleixner

On Wed, 29 Jul 2015, Andy Lutomirski wrote:
 On Wed, Jul 29, 2015 at 10:12 AM, Andrea Arcangeli aarca...@redhat.com 
 wrote:
  Hello Stephen,
 
  On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
   -359 i386userfaultfd sys_userfaultfd
  ++374 i386userfaultfd sys_userfaultfd
 
  Do I understand correctly the syscall number of userfaultfd for x86
  32bit has just changed from 359 to 374? Appreciated that you CCed me
  on such a relevant change to be sure I didn't miss it.
 
  Then the below is needed as well.
 
  One related question: is it ok to ship kernels in production right now
  with the userfaultfd syscall number 374 for x86 32bit ABI (after the
  above change) and 323 for x86-64 64bit ABI, with these syscalls number
  registered in linux-next or it may keep changing like it has just
  happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
  64bit, not all other syscalls in linux-next.
 
  Of course, I know full well that the standard answer is no, and in
  fact the above is an expected and fine change. In other words what I'm
  really asking is if I wonder if I could get an agreement here that
  from now on, the syscall number of userfaultfd for x86 32bit and
  x86-64 64bit won't change anymore in linux-next and it's already
  reserved just like if it was already upstream.
 
  Again: I'd only seek such guarantee for the x86-64 64bit and x86 32bit
  ABIs (not any other arch, and not any other syscall). If I could get
  such a guarantee from you within the next week or two, that would
  avoid me complications and some work, so I thought it was worth
  asking. If it's not possible never mind.
 
 My (limited) understanding is that this is up to the arch maintainers.
 I certainly didn't intend to preempt your syscall number, but my patch
 beat your patch to -tip :-p
 
 -tip people: want to assign Andrea a pair of syscall numbers?

Sure, just send a patch 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Thomas Gleixner

On Thu, 30 Jul 2015, Stephen Rothwell wrote:
 Hi Andrea,
 
 On Wed, 29 Jul 2015 19:12:56 +0200 Andrea Arcangeli aarca...@redhat.com 
 wrote:
 
  On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
-359 i386userfaultfd sys_userfaultfd
   ++374 i386userfaultfd sys_userfaultfd
  
  Do I understand correctly the syscall number of userfaultfd for x86
  32bit has just changed from 359 to 374? Appreciated that you CCed me
  on such a relevant change to be sure I didn't miss it.
  
  Then the below is needed as well.
 
 I have added the below patch to linux-next from today.
 
  One related question: is it ok to ship kernels in production right now
  with the userfaultfd syscall number 374 for x86 32bit ABI (after the
  above change) and 323 for x86-64 64bit ABI, with these syscalls number
  registered in linux-next or it may keep changing like it has just
  happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
  64bit, not all other syscalls in linux-next.
 
 These numbers are certainly not in any way official, they are just the
 result of my merge conflict fixup.  So, yes, they could change again if
 someone adds another new syscall to any tree but Andrew's.
 
  Of course, I know full well that the standard answer is no, and in
  fact the above is an expected and fine change. In other words what I'm
  really asking is if I wonder if I could get an agreement here that
  from now on, the syscall number of userfaultfd for x86 32bit and
  x86-64 64bit won't change anymore in linux-next and it's already
  reserved just like if it was already upstream.
 
 Like Thomas said, send a patch to the x86 maintainers.  I suspect (if
 the rest of the implementation needs to stay in Andrew's tree) that it
 could be a simple as a patch to the syscall tables using ni_syscall and
 a comment.  Thomas?

Yes, that's all it takes to reserve a syscall number.

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2015-07-29 Thread Stephen Rothwell

Hi Andrea,

On Wed, 29 Jul 2015 19:12:56 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

 On Tue, Jul 28, 2015 at 04:00:15PM +1000, Stephen Rothwell wrote:
   -359   i386userfaultfd sys_userfaultfd
  ++374   i386userfaultfd sys_userfaultfd
 
 Do I understand correctly the syscall number of userfaultfd for x86
 32bit has just changed from 359 to 374? Appreciated that you CCed me
 on such a relevant change to be sure I didn't miss it.
 
 Then the below is needed as well.

I have added the below patch to linux-next from today.

 One related question: is it ok to ship kernels in production right now
 with the userfaultfd syscall number 374 for x86 32bit ABI (after the
 above change) and 323 for x86-64 64bit ABI, with these syscalls number
 registered in linux-next or it may keep changing like it has just
 happened? I refer only to userfaultfd syscalls of x86 32bit and x86-64
 64bit, not all other syscalls in linux-next.

These numbers are certainly not in any way official, they are just the
result of my merge conflict fixup.  So, yes, they could change again if
someone adds another new syscall to any tree but Andrew's.

 Of course, I know full well that the standard answer is no, and in
 fact the above is an expected and fine change. In other words what I'm
 really asking is if I wonder if I could get an agreement here that
 from now on, the syscall number of userfaultfd for x86 32bit and
 x86-64 64bit won't change anymore in linux-next and it's already
 reserved just like if it was already upstream.

Like Thomas said, send a patch to the x86 maintainers.  I suspect (if
the rest of the implementation needs to stay in Andrew's tree) that it
could be a simple as a patch to the syscall tables using ni_syscall and
a comment.  Thomas?

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-03-19 Thread Andrew Morton

On Mon, 17 Mar 2014 10:36:02 +0100 Peter Zijlstra  wrote:

> On Mon, Mar 17, 2014 at 08:31:24PM +1100, Stephen Rothwell wrote:
> > Hi Andrew,
> > 
> > Today's linux-next merge of the akpm-current tree got a conflict in
> > kernel/locking/Makefile between commit fb0527bd5ea9 ("locking/mutexes:
> > Introduce cancelable MCS lock for adaptive spinning") from the  tree and
> > commit 4dc0fe493027 ("lglock: map to spinlock when !CONFIG_SMP") from the
> > akpm-current tree.
> > 
> > I fixed it up (see below) and can carry the fix as necessary (no action
> > is required).
> 
> I'm a bit sad of not having seen that lglock patch at all.

Here 'tis:

From: Josh Triplett 
Subject: lglock: map to spinlock when !CONFIG_SMP

When the system has only one CPU, lglock is effectively a spinlock; map it
directly to spinlock to eliminate the indirection and duplicate code.

In addition to removing overhead, this drops 1.6k of code with a defconfig
modified to have !CONFIG_SMP, and 1.1k with a minimal config.

Signed-off-by: Josh Triplett 
Cc: Rusty Russell 
Cc: Michal Marek 
Cc: Thomas Gleixner 
Cc: David Howells 
Cc: "H. Peter Anvin" 
Cc: Nick Piggin 
Signed-off-by: Andrew Morton 
---

 include/linux/lglock.h  |   16 
 kernel/locking/Makefile |3 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff -puN include/linux/lglock.h~lglock-map-to-spinlock-when-config_smp 
include/linux/lglock.h
--- a/include/linux/lglock.h~lglock-map-to-spinlock-when-config_smp
+++ a/include/linux/lglock.h
@@ -25,6 +25,8 @@
 #include 
 #include 
 
+#ifdef CONFIG_SMP
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define LOCKDEP_INIT_MAP lockdep_init_map
 #else
@@ -57,4 +59,18 @@ void lg_local_unlock_cpu(struct lglock *
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+#else
+/* When !CONFIG_SMP, map lglock to spinlock */
+#define lglock spinlock
+#define DEFINE_LGLOCK(name) DEFINE_SPINLOCK(name)
+#define DEFINE_STATIC_LGLOCK(name) static DEFINE_SPINLOCK(name)
+#define lg_lock_init(lg, name) spin_lock_init(lg)
+#define lg_local_lock spin_lock
+#define lg_local_unlock spin_unlock
+#define lg_local_lock_cpu(lg, cpu) spin_lock(lg)
+#define lg_local_unlock_cpu(lg, cpu) spin_unlock(lg)
+#define lg_global_lock spin_lock
+#define lg_global_unlock spin_unlock
+#endif
+
 #endif
diff -puN kernel/locking/Makefile~lglock-map-to-spinlock-when-config_smp 
kernel/locking/Makefile
--- a/kernel/locking/Makefile~lglock-map-to-spinlock-when-config_smp
+++ a/kernel/locking/Makefile
@@ -1,5 +1,5 @@
 
-obj-y += mutex.o semaphore.o rwsem.o lglock.o mcs_spinlock.o
+obj-y += mutex.o semaphore.o rwsem.o mcs_spinlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = -pg
@@ -14,6 +14,7 @@ ifeq ($(CONFIG_PROC_FS),y)
 obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
+obj-$(CONFIG_SMP) += lglock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
_

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-03-19 Thread Andrew Morton

On Mon, 17 Mar 2014 10:36:02 +0100 Peter Zijlstra pet...@infradead.org wrote:

 On Mon, Mar 17, 2014 at 08:31:24PM +1100, Stephen Rothwell wrote:
  Hi Andrew,
  
  Today's linux-next merge of the akpm-current tree got a conflict in
  kernel/locking/Makefile between commit fb0527bd5ea9 (locking/mutexes:
  Introduce cancelable MCS lock for adaptive spinning) from the  tree and
  commit 4dc0fe493027 (lglock: map to spinlock when !CONFIG_SMP) from the
  akpm-current tree.
  
  I fixed it up (see below) and can carry the fix as necessary (no action
  is required).
 
 I'm a bit sad of not having seen that lglock patch at all.

Here 'tis:

From: Josh Triplett j...@joshtriplett.org
Subject: lglock: map to spinlock when !CONFIG_SMP

When the system has only one CPU, lglock is effectively a spinlock; map it
directly to spinlock to eliminate the indirection and duplicate code.

In addition to removing overhead, this drops 1.6k of code with a defconfig
modified to have !CONFIG_SMP, and 1.1k with a minimal config.

Signed-off-by: Josh Triplett j...@joshtriplett.org
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Michal Marek mma...@suse.cz
Cc: Thomas Gleixner t...@linutronix.de
Cc: David Howells dhowe...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Nick Piggin npig...@kernel.dk
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 include/linux/lglock.h  |   16 
 kernel/locking/Makefile |3 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff -puN include/linux/lglock.h~lglock-map-to-spinlock-when-config_smp 
include/linux/lglock.h
--- a/include/linux/lglock.h~lglock-map-to-spinlock-when-config_smp
+++ a/include/linux/lglock.h
@@ -25,6 +25,8 @@
 #include linux/cpu.h
 #include linux/notifier.h
 
+#ifdef CONFIG_SMP
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define LOCKDEP_INIT_MAP lockdep_init_map
 #else
@@ -57,4 +59,18 @@ void lg_local_unlock_cpu(struct lglock *
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+#else
+/* When !CONFIG_SMP, map lglock to spinlock */
+#define lglock spinlock
+#define DEFINE_LGLOCK(name) DEFINE_SPINLOCK(name)
+#define DEFINE_STATIC_LGLOCK(name) static DEFINE_SPINLOCK(name)
+#define lg_lock_init(lg, name) spin_lock_init(lg)
+#define lg_local_lock spin_lock
+#define lg_local_unlock spin_unlock
+#define lg_local_lock_cpu(lg, cpu) spin_lock(lg)
+#define lg_local_unlock_cpu(lg, cpu) spin_unlock(lg)
+#define lg_global_lock spin_lock
+#define lg_global_unlock spin_unlock
+#endif
+
 #endif
diff -puN kernel/locking/Makefile~lglock-map-to-spinlock-when-config_smp 
kernel/locking/Makefile
--- a/kernel/locking/Makefile~lglock-map-to-spinlock-when-config_smp
+++ a/kernel/locking/Makefile
@@ -1,5 +1,5 @@
 
-obj-y += mutex.o semaphore.o rwsem.o lglock.o mcs_spinlock.o
+obj-y += mutex.o semaphore.o rwsem.o mcs_spinlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = -pg
@@ -14,6 +14,7 @@ ifeq ($(CONFIG_PROC_FS),y)
 obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
+obj-$(CONFIG_SMP) += lglock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
_

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-03-17 Thread Peter Zijlstra

On Mon, Mar 17, 2014 at 08:31:24PM +1100, Stephen Rothwell wrote:
> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in
> kernel/locking/Makefile between commit fb0527bd5ea9 ("locking/mutexes:
> Introduce cancelable MCS lock for adaptive spinning") from the  tree and
> commit 4dc0fe493027 ("lglock: map to spinlock when !CONFIG_SMP") from the
> akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary (no action
> is required).

I'm a bit sad of not having seen that lglock patch at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-03-17 Thread Peter Zijlstra

On Mon, Mar 17, 2014 at 08:31:24PM +1100, Stephen Rothwell wrote:
 Hi Andrew,
 
 Today's linux-next merge of the akpm-current tree got a conflict in
 kernel/locking/Makefile between commit fb0527bd5ea9 (locking/mutexes:
 Introduce cancelable MCS lock for adaptive spinning) from the  tree and
 commit 4dc0fe493027 (lglock: map to spinlock when !CONFIG_SMP) from the
 akpm-current tree.
 
 I fixed it up (see below) and can carry the fix as necessary (no action
 is required).

I'm a bit sad of not having seen that lglock patch at all.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex: Switch to USER_DS for futex test (was: Re: linux-next: manual merge of the akpm-current tree with the tip tree)

2014-01-15 Thread Geert Uytterhoeven

On Tue, Jan 14, 2014 at 5:32 PM, Geert Uytterhoeven
 wrote:
> https://lkml.org/lkml/2013/12/11/141
>
> On Tue, Jan 14, 2014 at 5:19 PM, H. Peter Anvin  wrote:
>> On 01/14/2014 05:17 AM, Geert Uytterhoeven wrote:

 This seems terribly broken, the *futex_value*() ops should not need
 that; they are supposed to access userspace without any of that.
>>>
>>> Why don't they need set_fs(USER_DS)?
>>
>> Because USER_DS is the normal operating state?  It would appear m68k is
>
> Is it?
>
>> the only(?) arch that calls initcalls with get_fs() == KERNEL_DS...
>
> On ARM:
>
> fs = 0x0, USER_DS = 0xbf00, KERNEL_DS = 0x0
>
> Presumably also on s390, as the fix for m68k broke s390. That's why it's still
> in -mm and not yet in mainline.

On uml/amd64:

fs = 0x, USER_DS = 0x7fc000, KERNEL_DS = 0x

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex: Switch to USER_DS for futex test (was: Re: linux-next: manual merge of the akpm-current tree with the tip tree)

2014-01-15 Thread Geert Uytterhoeven

On Tue, Jan 14, 2014 at 5:32 PM, Geert Uytterhoeven
ge...@linux-m68k.org wrote:
 https://lkml.org/lkml/2013/12/11/141

 On Tue, Jan 14, 2014 at 5:19 PM, H. Peter Anvin h...@zytor.com wrote:
 On 01/14/2014 05:17 AM, Geert Uytterhoeven wrote:

 This seems terribly broken, the *futex_value*() ops should not need
 that; they are supposed to access userspace without any of that.

 Why don't they need set_fs(USER_DS)?

 Because USER_DS is the normal operating state?  It would appear m68k is

 Is it?

 the only(?) arch that calls initcalls with get_fs() == KERNEL_DS...

 On ARM:

 fs = 0x0, USER_DS = 0xbf00, KERNEL_DS = 0x0

 Presumably also on s390, as the fix for m68k broke s390. That's why it's still
 in -mm and not yet in mainline.

On uml/amd64:

fs = 0x, USER_DS = 0x7fc000, KERNEL_DS = 0x

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex: Switch to USER_DS for futex test (was: Re: linux-next: manual merge of the akpm-current tree with the tip tree)

2014-01-14 Thread Geert Uytterhoeven

CC linux-arch

https://lkml.org/lkml/2013/12/11/141

On Tue, Jan 14, 2014 at 5:19 PM, H. Peter Anvin  wrote:
> On 01/14/2014 05:17 AM, Geert Uytterhoeven wrote:
>>>
>>> This seems terribly broken, the *futex_value*() ops should not need
>>> that; they are supposed to access userspace without any of that.
>>
>> Why don't they need set_fs(USER_DS)?
>
> Because USER_DS is the normal operating state?  It would appear m68k is

Is it?

> the only(?) arch that calls initcalls with get_fs() == KERNEL_DS...

On ARM:

fs = 0x0, USER_DS = 0xbf00, KERNEL_DS = 0x0

Presumably also on s390, as the fix for m68k broke s390. That's why it's still
in -mm and not yet in mainline.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread H. Peter Anvin

On 01/14/2014 05:17 AM, Geert Uytterhoeven wrote:
>>
>> This seems terribly broken, the *futex_value*() ops should not need
>> that; they are supposed to access userspace without any of that.
> 
> Why don't they need set_fs(USER_DS)?
> 
> Gr{oetje,eeting}s,
> 
> Geert

Because USER_DS is the normal operating state?  It would appear m68k is
the only(?) arch that calls initcalls with get_fs() == KERNEL_DS...

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread H. Peter Anvin

On 01/14/2014 07:41 AM, Peter Zijlstra wrote:
>>>
>>> I am *guessing* that m68k is has get_fs() == KERNEL_DS at the point that
>>> futex_init() is called.  This would seem a bit of a peculiarity to m68k,
>>> and as such it would seem like it would be better for it to belong in
>>> the m68k-specific code, but since futex_init() is init code and only
>>> called once anyway it shouldn't cause any harm...
>>
>> Yes it does. So when getting the exception on 68030, we notice it's a kernel
>> space access error, not a user space access error, and crash.
> 
> Is there a good reason m68k works like this? That is, shouldn't we fix
> the arch code instead of littering the generic code with weirdness like
> this?
> 

Given that futex_init is called from initcall, this seems *really* weird
on the part of m68k.  I agree this should be fixed where the problem sits.

-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread Peter Zijlstra

On Tue, Jan 14, 2014 at 04:20:36PM +0100, Geert Uytterhoeven wrote:
> On Tue, Jan 14, 2014 at 4:15 PM, H. Peter Anvin  wrote:
> > On 01/14/2014 04:51 AM, Peter Zijlstra wrote:
> >> On Tue, Jan 14, 2014 at 03:53:31PM +1100, Stephen Rothwell wrote:
> >>> Hi Andrew,
> >>>
> >>> Today's linux-next merge of the akpm-current tree got a conflict in
> >>> kernel/futex.c between commit a52b89ebb6d4 ("futexes: Increase hash table
> >>> size for better performance") from the tip tree and commit 61beee6c76e5
> >>> ("futex: switch to USER_DS for futex test") from the akpm-current tree.
> >>>
> >>> @@@ -2869,10 -2748,13 +2871,13 @@@
> >>>   * implementation, the non-functional ones will return
> >>>   * -ENOSYS.
> >>>   */
> >>> +fs = get_fs();
> >>> +set_fs(USER_DS);
> >>>  if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
> >>>  futex_cmpxchg_enabled = 1;
> >>> +set_fs(fs);
> >>>
> >>
> >> This seems terribly broken, the *futex_value*() ops should not need
> >> that; they are supposed to access userspace without any of that.
> >
> > I am *guessing* that m68k is has get_fs() == KERNEL_DS at the point that
> > futex_init() is called.  This would seem a bit of a peculiarity to m68k,
> > and as such it would seem like it would be better for it to belong in
> > the m68k-specific code, but since futex_init() is init code and only
> > called once anyway it shouldn't cause any harm...
> 
> Yes it does. So when getting the exception on 68030, we notice it's a kernel
> space access error, not a user space access error, and crash.

Is there a good reason m68k works like this? That is, shouldn't we fix
the arch code instead of littering the generic code with weirdness like
this?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread Geert Uytterhoeven

On Tue, Jan 14, 2014 at 4:15 PM, H. Peter Anvin  wrote:
> On 01/14/2014 04:51 AM, Peter Zijlstra wrote:
>> On Tue, Jan 14, 2014 at 03:53:31PM +1100, Stephen Rothwell wrote:
>>> Hi Andrew,
>>>
>>> Today's linux-next merge of the akpm-current tree got a conflict in
>>> kernel/futex.c between commit a52b89ebb6d4 ("futexes: Increase hash table
>>> size for better performance") from the tip tree and commit 61beee6c76e5
>>> ("futex: switch to USER_DS for futex test") from the akpm-current tree.
>>>
>>> @@@ -2869,10 -2748,13 +2871,13 @@@
>>>   * implementation, the non-functional ones will return
>>>   * -ENOSYS.
>>>   */
>>> +fs = get_fs();
>>> +set_fs(USER_DS);
>>>  if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
>>>  futex_cmpxchg_enabled = 1;
>>> +set_fs(fs);
>>>
>>
>> This seems terribly broken, the *futex_value*() ops should not need
>> that; they are supposed to access userspace without any of that.
>
> I am *guessing* that m68k is has get_fs() == KERNEL_DS at the point that
> futex_init() is called.  This would seem a bit of a peculiarity to m68k,
> and as such it would seem like it would be better for it to belong in
> the m68k-specific code, but since futex_init() is init code and only
> called once anyway it shouldn't cause any harm...

Yes it does. So when getting the exception on 68030, we notice it's a kernel
space access error, not a user space access error, and crash.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread H. Peter Anvin

On 01/14/2014 04:51 AM, Peter Zijlstra wrote:
> On Tue, Jan 14, 2014 at 03:53:31PM +1100, Stephen Rothwell wrote:
>> Hi Andrew,
>>
>> Today's linux-next merge of the akpm-current tree got a conflict in
>> kernel/futex.c between commit a52b89ebb6d4 ("futexes: Increase hash table
>> size for better performance") from the tip tree and commit 61beee6c76e5
>> ("futex: switch to USER_DS for futex test") from the akpm-current tree.
>>
>> @@@ -2869,10 -2748,13 +2871,13 @@@
>>   * implementation, the non-functional ones will return
>>   * -ENOSYS.
>>   */
>> +fs = get_fs();
>> +set_fs(USER_DS);
>>  if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
>>  futex_cmpxchg_enabled = 1;
>> +set_fs(fs);
>>   
> 
> This seems terribly broken, the *futex_value*() ops should not need
> that; they are supposed to access userspace without any of that.
> 

I am *guessing* that m68k is has get_fs() == KERNEL_DS at the point that
futex_init() is called.  This would seem a bit of a peculiarity to m68k,
and as such it would seem like it would be better for it to belong in
the m68k-specific code, but since futex_init() is init code and only
called once anyway it shouldn't cause any harm...

-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread Peter Zijlstra

On Tue, Jan 14, 2014 at 02:17:55PM +0100, Geert Uytterhoeven wrote:
> On Tue, Jan 14, 2014 at 1:51 PM, Peter Zijlstra  wrote:
> > On Tue, Jan 14, 2014 at 03:53:31PM +1100, Stephen Rothwell wrote:
> >> Today's linux-next merge of the akpm-current tree got a conflict in
> >> kernel/futex.c between commit a52b89ebb6d4 ("futexes: Increase hash table
> >> size for better performance") from the tip tree and commit 61beee6c76e5
> >> ("futex: switch to USER_DS for futex test") from the akpm-current tree.
> >>
> >> @@@ -2869,10 -2748,13 +2871,13 @@@
> >>* implementation, the non-functional ones will return
> >>* -ENOSYS.
> >>*/
> >> + fs = get_fs();
> >> + set_fs(USER_DS);
> >>   if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
> >>   futex_cmpxchg_enabled = 1;
> >> + set_fs(fs);
> >>
> >
> > This seems terribly broken, the *futex_value*() ops should not need
> > that; they are supposed to access userspace without any of that.
> 
> Why don't they need set_fs(USER_DS)?

Why would they need it? These functions explicitly take a __user pointer
and are expected to do whatever is needed to perform the operation. None
of the other futex bits use USER_DS either.


Furthermore, from the Changelog of:
e4f2dfbb5e92be4e46c0625f4f8eb101110f756f

"   This patch adds that support, but only for uniprocessor machines,
which is adequate for M68K.  For UP it's enough to disable preemption
to ensure mutual exclusion (futexes don't need to care about other
hardware agents), and the mandatory pagefault_disable() does just that.  "

It is wrong to rely on the fact that pagefault_disable() disables
preemption. This is fact not true on -rt.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread Geert Uytterhoeven

On Tue, Jan 14, 2014 at 1:51 PM, Peter Zijlstra  wrote:
> On Tue, Jan 14, 2014 at 03:53:31PM +1100, Stephen Rothwell wrote:
>> Today's linux-next merge of the akpm-current tree got a conflict in
>> kernel/futex.c between commit a52b89ebb6d4 ("futexes: Increase hash table
>> size for better performance") from the tip tree and commit 61beee6c76e5
>> ("futex: switch to USER_DS for futex test") from the akpm-current tree.
>>
>> @@@ -2869,10 -2748,13 +2871,13 @@@
>>* implementation, the non-functional ones will return
>>* -ENOSYS.
>>*/
>> + fs = get_fs();
>> + set_fs(USER_DS);
>>   if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
>>   futex_cmpxchg_enabled = 1;
>> + set_fs(fs);
>>
>
> This seems terribly broken, the *futex_value*() ops should not need
> that; they are supposed to access userspace without any of that.

Why don't they need set_fs(USER_DS)?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: manual merge of the akpm-current tree with the tip tree

2014-01-14 Thread Peter Zijlstra

On Tue, Jan 14, 2014 at 03:53:31PM +1100, Stephen Rothwell wrote:
> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in
> kernel/futex.c between commit a52b89ebb6d4 ("futexes: Increase hash table
> size for better performance") from the tip tree and commit 61beee6c76e5
> ("futex: switch to USER_DS for futex test") from the akpm-current tree.
> 
> @@@ -2869,10 -2748,13 +2871,13 @@@
>* implementation, the non-functional ones will return
>* -ENOSYS.
>*/
> + fs = get_fs();
> + set_fs(USER_DS);
>   if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
>   futex_cmpxchg_enabled = 1;
> + set_fs(fs);
>   

This seems terribly broken, the *futex_value*() ops should not need
that; they are supposed to access userspace without any of that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 119 matches

Mail list logo