Re: BUG : PowerPC RCU: torture test failed with __stack_chk_fail

2023-04-24 Thread Boqun Feng
On Mon, Apr 24, 2023 at 12:29:00PM -0500, Segher Boessenkool wrote:
> On Mon, Apr 24, 2023 at 08:28:55AM -0700, Boqun Feng wrote:
> > On Mon, Apr 24, 2023 at 10:13:51AM -0500, Segher Boessenkool wrote:
> > > At what points can r13 change?  Only when some particular functions are
> > > called?
> > 
> > r13 is the local paca:
> > 
> > register struct paca_struct *local_paca asm("r13");
> > 
> > , which is a pointer to percpu data.
> 
> Yes, it is a global register variable.
> 
> > So if a task schedule from one CPU to anotehr CPU, the value gets
> > changed.
> 
> But the compiler does not see that something else changes local_paca (or

It's more like this; however, in this case r13 is not changed:

	CPU 0				CPU 1
	{r13 = 0x00}			{r13 = 0x04}

	<thread 1 running>

	 _switch():
	  <save thread 1's stack pointer>
	  <restore thread 2's stack pointer>
	  <thread 2 running>

					<thread 3 running>
					_switch():
					 <save thread 3's stack pointer>
					 <restore thread 1's stack pointer>
					 <thread 1 running>

as you can see, thread 1 schedules from CPU 0 to CPU 1 and neither CPU
changes its r13, but from thread 1's point of view, its r13 changes.

> r13 some other way, via assembler code perhaps)?  Or is there a compiler
> bug?
> 

This looks to me like a compiler bug, but I'm not 100% sure.
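
If someone wants to open the GCC PR Segher asked for, a reduced test
case along the following lines might be a starting point. This is only a
sketch: the options mirror what the kernel build uses IIRC, and I haven't
verified that it actually reproduces the early copy of r13:

/*
 * reduced.c -- NOT verified. Build roughly the way the kernel does:
 *
 *   gcc -O2 -fstack-protector-strong -mstack-protector-guard=tls \
 *       -mstack-protector-guard-reg=r13 \
 *       -mstack-protector-guard-offset=3192 -S reduced.c
 *
 * then check whether the canary load in the epilogue goes through a
 * copy of r13 taken before the ll/sc loop.
 */
struct paca_struct {
	char pad[3192];
	unsigned long canary;
};

/* mirrors the kernel's global register variable */
register struct paca_struct *local_paca asm("r13");

void use(char *buf);

static inline void fetch_inc(unsigned long *p)
{
	unsigned long tmp;

	/* stand-in for the atomic in __srcu_read_unlock_nmisafe() */
	asm volatile("1: ldarx %0,0,%2\n"
		     "   addic %0,%0,1\n"
		     "   stdcx. %0,0,%2\n"
		     "   bne- 1b"
		     : "=&r" (tmp), "+m" (*p)
		     : "r" (p)
		     : "cr0", "xer", "memory");
}

void foo(unsigned long *p)
{
	char buf[64];	/* make foo() get a stack canary */

	fetch_inc(p);
	use(buf);
}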

Regards,
Boqun


> If the latter is true:
> 
> Can you make a reproducer and open a GCC PR?  <https://gcc.gnu.org/bugs/>
> for how to get started doing that.  We need *exact* code that shows the
> problem, together with a compiler command line.  So that we can
> reproduce the problem.  That is step 0 in figuring out what is going on,
> and then maybe fixing the problem :-)
> 
> 
> Segher


Re: BUG : PowerPC RCU: torture test failed with __stack_chk_fail

2023-04-24 Thread Boqun Feng
On Mon, Apr 24, 2023 at 10:13:51AM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Apr 24, 2023 at 11:14:00PM +1000, Michael Ellerman wrote:
> > Boqun Feng  writes:
> > > On Sat, Apr 22, 2023 at 09:28:39PM +0200, Joel Fernandes wrote:
> > >> On Sat, Apr 22, 2023 at 2:47 PM Zhouyi Zhou  wrote:
> > >> > by debugging, I see the r10 is assigned with r13 on c0226eb4,
> > >> > but if there is a context-switch before c0226edc, a false
> > >> > positive will be reported.
> 
> > I've never understood why the compiler wants to make a copy of a
> > register variable into another register!? >:#
> 
> It is usually because a) you told it to (maybe via an earlyclobber), or
> b) it looked cheaper.  I don't see either here :-(
> 
> > > here I think that the compiler is using r10 as an alias to r13, since
> > > for userspace program, it's safe to assume the TLS pointer doesn't
> > > change. However this is not true for kernel percpu pointer.
> 
> r13 is a "fixed" register, but that means it has a fixed purpose (so not
> available for allocation), it does not mean "unchanging".
> 
> > > The real intention here is to compare 40(r1) vs 3192(r13) for stack
> > > guard checking, however since r13 is the percpu pointer in kernel, so
> > > the value of r13 can be changed if the thread gets scheduled to a
> > > different CPU after reading r13 for r10.
> > 
> > Yeah that's not good.
> 
> The GCC pattern here makes the four machine insns all stay together.
> That is to make sure to not leak any secret value, which is impossible
> to guarantee otherwise.
> 
> What tells GCC r13 can randomly change behind its back?  And, what then
> makes GCC ignore that fact?
> 
> > >   +   asm volatile("" : : : "r13", "memory");
> 
> Any asm without output is always volatile.
> 
> > > Needless to say, the correct fix is to make ppc stack protector aware of
> > > r13 is volatile.
> > 
> > I suspect the compiler developers will tell us to go jump :)
> 
> Why would r13 change over the course of *this* function / this macro,
> why can this not happen anywhere else?
> 
> > The problem of the compiler caching r13 has come up in the past, but I
> > only remember it being "a worry" rather than causing an actual bug.
> 
> In most cases the compiler is smart enough to use r13 directly, instead
> of copying it to another reg and then using that one.  But not here for
> some strange reason.  That of course is a very minor generated machine
> code quality bug and nothing more :-(
> 
> > We've had the DEBUG_PREEMPT checks in get_paca(), which have given us at
> > least some comfort that if the compiler is caching r13, it shouldn't be
> > doing it in preemptable regions.
> > 
> > But obviously that doesn't help at all with the stack protector check.
> > 
> > I don't see an easy fix.
> > 
> > Adding "volatile" to the definition of local_paca seems to reduce but
> > not elimate the caching of r13, and the GCC docs explicitly say *not* to
> > use volatile. It also triggers lots of warnings about volatile being
> > discarded.
> 
> The point here is to say some code clobbers r13, not the asm volatile?
> 
> > Or something simple I haven't thought of? :)
> 
> At what points can r13 change?  Only when some particular functions are
> called?
> 

r13 is the local paca:

register struct paca_struct *local_paca asm("r13");

, which is a pointer to percpu data.

So if a task is scheduled from one CPU to another CPU, the value gets
changed.
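
To be a bit more specific, the per-cpu machinery is wired to r13 roughly
like below (quoting from memory of arch/powerpc/include/asm/paca.h and
percpu.h, so details may differ):

/* r13 always points at the current CPU's paca */
register struct paca_struct *local_paca asm("r13");

#ifdef CONFIG_DEBUG_PREEMPT
#define get_paca()	((void) debug_smp_processor_id(), local_paca)
#else
#define get_paca()	local_paca
#endif

/* per_cpu()/this_cpu_*() accesses are offset by a field in the paca */
#define __my_cpu_offset	local_paca->data_offset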

Regards,
Boqun


> 
> Segher


Re: BUG : PowerPC RCU: torture test failed with __stack_chk_fail

2023-04-23 Thread Boqun Feng
On Sat, Apr 22, 2023 at 09:28:39PM +0200, Joel Fernandes wrote:
> On Sat, Apr 22, 2023 at 2:47 PM Zhouyi Zhou  wrote:
> >
> > Dear PowerPC and RCU developers:
> > During the RCU torture test on mainline (on the VM of Opensource Lab
> > of Oregon State University), SRCU-P failed with __stack_chk_fail:
> > [  264.381952][   T99] [c6c7bab0] [c10c67c0]
> > dump_stack_lvl+0x94/0xd8 (unreliable)
> > [  264.383786][   T99] [c6c7bae0] [c014fc94] 
> > panic+0x19c/0x468
> > [  264.385128][   T99] [c6c7bb80] [c10fca24]
> > __stack_chk_fail+0x24/0x30
> > [  264.386610][   T99] [c6c7bbe0] [c02293b4]
> > srcu_gp_start_if_needed+0x5c4/0x5d0
> > [  264.388188][   T99] [c6c7bc70] [c022f7f4]
> > srcu_torture_call+0x34/0x50
> > [  264.389611][   T99] [c6c7bc90] [c022b5e8]
> > rcu_torture_fwd_prog+0x8c8/0xa60
> > [  264.391439][   T99] [c6c7be00] [c018e37c] 
> > kthread+0x15c/0x170
> > [  264.392792][   T99] [c6c7be50] [c000df94]
> > ret_from_kernel_thread+0x5c/0x64
> > The kernel config file can be found in [1].
> > And I write a bash script to accelerate the bug reproducing [2].
> > After a week's debugging, I found the cause of the bug is because the
> > register r10 used to judge for stack overflow is not constant between
> > context switches.
> > The assembly code for srcu_gp_start_if_needed is located at [3]:
> > c0226eb4:   78 6b aa 7d mr  r10,r13
> > c0226eb8:   14 42 29 7d add r9,r9,r8
> > c0226ebc:   ac 04 00 7c hwsync
> > c0226ec0:   10 00 7b 3b addir27,r27,16
> > c0226ec4:   14 da 29 7d add r9,r9,r27
> > c0226ec8:   a8 48 00 7d ldarx   r8,0,r9
> > c0226ecc:   01 00 08 31 addic   r8,r8,1
> > c0226ed0:   ad 49 00 7d stdcx.  r8,0,r9
> > c0226ed4:   f4 ff c2 40 bne-c0226ec8
> > 
> > c0226ed8:   28 00 21 e9 ld  r9,40(r1)
> > c0226edc:   78 0c 4a e9 ld  r10,3192(r10)
> > c0226ee0:   79 52 29 7d xor.r9,r9,r10
> > c0226ee4:   00 00 40 39 li  r10,0
> > c0226ee8:   b8 03 82 40 bne c02272a0
> > 
> > by debugging, I see the r10 is assigned with r13 on c0226eb4,
> > but if there is a context-switch before c0226edc, a false
> > positive will be reported.
> >
> > [1] http://154.220.3.115/logs/0422/configformainline.txt
> > [2] 154.220.3.115/logs/0422/whilebash.sh
> > [3] http://154.220.3.115/logs/0422/srcu_gp_start_if_needed.txt
> >
> > My analysis and debugging may not be correct, but the bug is easily
> > reproducible.
> 
> If this is a bug in the stack smashing protection as you seem to hint,
> I wonder if you see the issue with a specific gcc version and is a
> compiler-specific issue. It's hard to say, but considering this I

Very likely, more asm code from Zhouyi's link:

This is __srcu_read_unlock_nmisafe(): the "hwsync" is
smp_mb__{before,after}_atomic(), and since the following code is barrier
first and then atomic, it must be the unlock side.

c0226eb4:   78 6b aa 7d mr  r10,r13

^ r13 is the pointer to percpu data on PPC64 kernel, and it's also
the pointer to TLS data for userspace code.

c0226eb8:   14 42 29 7d add r9,r9,r8
c0226ebc:   ac 04 00 7c hwsync
c0226ec0:   10 00 7b 3b addi    r27,r27,16
c0226ec4:   14 da 29 7d add r9,r9,r27
c0226ec8:   a8 48 00 7d ldarx   r8,0,r9
c0226ecc:   01 00 08 31 addic   r8,r8,1
c0226ed0:   ad 49 00 7d stdcx.  r8,0,r9
c0226ed4:   f4 ff c2 40 bne-    c0226ec8

c0226ed8:   28 00 21 e9 ld  r9,40(r1)
c0226edc:   78 0c 4a e9 ld  r10,3192(r10)

here I think that the compiler is using r10 as an alias to r13, since
for a userspace program it's safe to assume the TLS pointer doesn't
change. However, this is not true for the kernel percpu pointer.

The real intention here is to compare 40(r1) with 3192(r13) for the
stack guard check; however, since r13 is the percpu pointer in the
kernel, the value of r13 can change if the thread gets scheduled to a
different CPU after r13 was copied into r10.

__srcu_read_unlock_nmisafe() triggers this issue, because:

* it contains a read from r13
* it is located at the very end of srcu_gp_start_if_needed().

This gives the compiler more opportunity to "optimize" a read from r13
away.

c0226ee0:   79 52 29 7d xor.    r9,r9,r10
c0226ee4:   00 00 40 39 li  r10,0
c0226ee8:   b8 03 82 40 bne c02272a0 


As a result, __stack_chk_fail() is triggered here if they mismatch.
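
Putting the same thing in C form, a purely illustrative user-space model
of what goes wrong could look like this (the names and the 3192 offset
are only borrowed from the disassembly above, not the real paca layout):

#include <stdio.h>

struct paca { unsigned long canary; };

static struct paca cpu_paca[2];
static struct paca *local_paca;		/* models the r13 register */

int main(void)
{
	unsigned long task_canary = 0x1234;	/* thread 1's canary */
	unsigned long frame_canary;		/* models 40(r1) */
	struct paca *cached;			/* models r10 */

	/* thread 1 runs on CPU 0; _switch() put its canary in the paca */
	cpu_paca[0].canary = task_canary;
	local_paca = &cpu_paca[0];

	/* prologue: store the canary into the stack frame */
	frame_canary = local_paca->canary;

	/* what GCC emitted: r13 is copied into r10 early (mr r10,r13) */
	cached = local_paca;

	/*
	 * context switch: thread 1 migrates to CPU 1.  CPU 1's paca now
	 * carries thread 1's canary, CPU 0's paca carries another task's.
	 */
	cpu_paca[1].canary = task_canary;
	cpu_paca[0].canary = 0x5678;
	local_paca = &cpu_paca[1];

	/* epilogue as emitted: ld r10,3192(r10) goes through the stale copy */
	if (frame_canary != cached->canary)
		printf("stale paca pointer -> bogus __stack_chk_fail()\n");

	/* what was intended: reload the canary through the live r13 */
	if (frame_canary == local_paca->canary)
		printf("checking against the current paca would have passed\n");

	return 0;
}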

If I'm correct, the following should be a workaround:

diff --git 

Re: [PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-04-06 Thread Boqun Feng
Hi,

On Wed, Mar 31, 2021 at 02:30:32PM +, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Some architectures don't have sub-word swap atomic instruction,
> they only have the full word's one.
> 
> The sub-word swap only improve the performance when:
> NR_CPUS < 16K
>  *  0- 7: locked byte
>  * 8: pending
>  *  9-15: not used
>  * 16-17: tail index
>  * 18-31: tail cpu (+1)
> 
> The 9-15 bits are wasted to use xchg16 in xchg_tail.
> 
> Please let architecture select xchg16/xchg32 to implement
> xchg_tail.
> 

If the architecture doesn't have a sub-word swap atomic instruction,
won't it generate the same/similar code no matter which version of
xchg_tail() is used? That is, even with
CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y, xchg_tail() acts like an
xchg16() implemented with cmpxchg(), which means we still don't have a
forward-progress guarantee. So this configuration doesn't solve the
problem.

I think it's OK to introduce this config and not provide xchg16() for
RISC-V. But I don't see the point of converting other architectures to
use it.
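
For reference, the 32-bit xchg_tail() (the one used when _Q_PENDING_BITS
!= 8, and roughly what a cmpxchg()-based xchg16() boils down to) looks
like the following, quoting from memory; the cmpxchg() can keep failing
under contention, hence no forward-progress guarantee:

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	u32 old, new, val = atomic_read(&lock->val);

	for (;;) {
		/* keep the locked+pending bits, replace only the tail */
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
		if (old == val)
			break;		/* no concurrent update raced with us */

		val = old;		/* lost the race, retry with the new value */
	}
	return old;
}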

Regards,
Boqun

> Signed-off-by: Guo Ren 
> Cc: Peter Zijlstra 
> Cc: Will Deacon 
> Cc: Ingo Molnar 
> Cc: Waiman Long 
> Cc: Arnd Bergmann 
> Cc: Anup Patel 
> ---
>  kernel/Kconfig.locks   |  3 +++
>  kernel/locking/qspinlock.c | 46 +-
>  2 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 3de8fd11873b..d02f1261f73f 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -239,6 +239,9 @@ config LOCK_SPIN_ON_OWNER
>  config ARCH_USE_QUEUED_SPINLOCKS
>   bool
>  
> +config ARCH_USE_QUEUED_SPINLOCKS_XCHG32
> + bool
> +
>  config QUEUED_SPINLOCKS
>   def_bool y if ARCH_USE_QUEUED_SPINLOCKS
>   depends on SMP
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba53d56..4bfaa969bd15 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -163,26 +163,6 @@ static __always_inline void 
> clear_pending_set_locked(struct qspinlock *lock)
>   WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>  }
>  
> -/*
> - * xchg_tail - Put in the new queue tail code word & retrieve previous one
> - * @lock : Pointer to queued spinlock structure
> - * @tail : The new queue tail code word
> - * Return: The previous queue tail code word
> - *
> - * xchg(lock, tail), which heads an address dependency
> - *
> - * p,*,* -> n,*,* ; prev = xchg(lock, node)
> - */
> -static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> -{
> - /*
> -  * We can use relaxed semantics since the caller ensures that the
> -  * MCS node is properly initialized before updating the tail.
> -  */
> - return (u32)xchg_relaxed(&lock->tail,
> -  tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
> -}
> -
>  #else /* _Q_PENDING_BITS == 8 */
>  
>  /**
> @@ -206,6 +186,30 @@ static __always_inline void 
> clear_pending_set_locked(struct qspinlock *lock)
>  {
>   atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
>  }
> +#endif /* _Q_PENDING_BITS == 8 */
> +
> +#if _Q_PENDING_BITS == 8 && !defined(CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32)
> +/*
> + * xchg_tail - Put in the new queue tail code word & retrieve previous one
> + * @lock : Pointer to queued spinlock structure
> + * @tail : The new queue tail code word
> + * Return: The previous queue tail code word
> + *
> + * xchg(lock, tail), which heads an address dependency
> + *
> + * p,*,* -> n,*,* ; prev = xchg(lock, node)
> + */
> +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> +{
> + /*
> +  * We can use relaxed semantics since the caller ensures that the
> +  * MCS node is properly initialized before updating the tail.
> +  */
> + return (u32)xchg_relaxed(&lock->tail,
> +  tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
> +}
> +
> +#else
>  
>  /**
>   * xchg_tail - Put in the new queue tail code word & retrieve previous one
> @@ -236,7 +240,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
> *lock, u32 tail)
>   }
>   return old;
>  }
> -#endif /* _Q_PENDING_BITS == 8 */
> +#endif
>  
>  /**
>   * queued_fetch_set_pending_acquire - fetch the whole lock value and set 
> pending
> -- 
> 2.17.1
> 


Re: [PATCH 3/3] powerpc: rewrite atomics to use ARCH_ATOMIC

2020-12-22 Thread Boqun Feng
On Tue, Dec 22, 2020 at 01:52:50PM +1000, Nicholas Piggin wrote:
> Excerpts from Boqun Feng's message of November 14, 2020 1:30 am:
> > Hi Nicholas,
> > 
> > On Wed, Nov 11, 2020 at 09:07:23PM +1000, Nicholas Piggin wrote:
> >> All the cool kids are doing it.
> >> 
> >> Signed-off-by: Nicholas Piggin 
> >> ---
> >>  arch/powerpc/include/asm/atomic.h  | 681 ++---
> >>  arch/powerpc/include/asm/cmpxchg.h |  62 +--
> >>  2 files changed, 248 insertions(+), 495 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/include/asm/atomic.h 
> >> b/arch/powerpc/include/asm/atomic.h
> >> index 8a55eb8cc97b..899aa2403ba7 100644
> >> --- a/arch/powerpc/include/asm/atomic.h
> >> +++ b/arch/powerpc/include/asm/atomic.h
> >> @@ -11,185 +11,285 @@
> >>  #include 
> >>  #include 
> >>  
> >> +#define ARCH_ATOMIC
> >> +
> >> +#ifndef CONFIG_64BIT
> >> +#include 
> >> +#endif
> >> +
> >>  /*
> >>   * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
> >>   * a "bne-" instruction at the end, so an isync is enough as a acquire 
> >> barrier
> >>   * on the platform without lwsync.
> >>   */
> >>  #define __atomic_acquire_fence()  \
> >> -  __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory")
> >> +  asm volatile(PPC_ACQUIRE_BARRIER "" : : : "memory")
> >>  
> >>  #define __atomic_release_fence()  \
> >> -  __asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory")
> >> +  asm volatile(PPC_RELEASE_BARRIER "" : : : "memory")
> >>  
> >> -static __inline__ int atomic_read(const atomic_t *v)
> >> -{
> >> -  int t;
> >> +#define __atomic_pre_full_fence   smp_mb
> >>  
> >> -  __asm__ __volatile__("lwz%U1%X1 %0,%1" : "=r"(t) : "m"(v->counter));
> >> +#define __atomic_post_full_fence  smp_mb
> >>  
> 
> Thanks for the review.
> 
> > Do you need to define __atomic_{pre,post}_full_fence for PPC? IIRC, they
> > are default smp_mb__{before,atomic}_atomic(), so are smp_mb() defautly
> > on PPC.
> 
> Okay I didn't realise that's not required.
> 
> >> -  return t;
> >> +#define arch_atomic_read(v)   
> >> __READ_ONCE((v)->counter)
> >> +#define arch_atomic_set(v, i) 
> >> __WRITE_ONCE(((v)->counter), (i))
> >> +#ifdef CONFIG_64BIT
> >> +#define ATOMIC64_INIT(i)  { (i) }
> >> +#define arch_atomic64_read(v) 
> >> __READ_ONCE((v)->counter)
> >> +#define arch_atomic64_set(v, i)   
> >> __WRITE_ONCE(((v)->counter), (i))
> >> +#endif
> >> +
> > [...]
> >>  
> >> +#define ATOMIC_FETCH_OP_UNLESS_RELAXED(name, type, dtype, width, asm_op) \
> >> +static inline int arch_##name##_relaxed(type *v, dtype a, dtype u)
> >> \
> > 
> > I don't think we have atomic_fetch_*_unless_relaxed() at atomic APIs,
> > ditto for:
> > 
> > atomic_fetch_add_unless_relaxed()
> > atomic_inc_not_zero_relaxed()
> > atomic_dec_if_positive_relaxed()
> > 
> > , and we don't have the _acquire() and _release() variants for them
> > either, and if you don't define their fully-ordered version (e.g.
> > atomic_inc_not_zero()), atomic-arch-fallback.h will use read and cmpxchg
> > to implement them, and I think not what we want.
> 
> Okay. How can those be added? The atoimc generation is pretty 
> complicated.
> 

Yeah, I know ;-) I think you can just implement and define the
fully-ordered versions:

arch_atomic_fetch_*_unless()
arch_atomic_inc_not_zero()
arch_atomic_dec_if_positive()

, and that should work.

Rules of atomic generation, IIRC:

1.  If you define _relaxed, _acquire, _release or fully-ordered
version, atomic generation will use that version

2.  If you define _relaxed, atomic generation will use that and
barriers to generate _acquire, _release and fully-ordered
versions, unless they are already defined (as Rule #1 says)

3.  If you don't define _relaxed, but define the fully-ordered
version, atomic generation will use the fully-ordered version
and use it as _relaxed variants and generate the rest using Rule
#2.
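
As an example of Rule #2: if an arch only defines
arch_atomic_fetch_add_relaxed(), the generated header builds the _acquire
variant roughly as below (a sketch of what atomic-arch-fallback.h emits,
from memory):

#ifndef arch_atomic_fetch_add_acquire
static __always_inline int
arch_atomic_fetch_add_acquire(int i, atomic_t *v)
{
	int ret = arch_atomic_fetch_add_relaxed(i, v);
	__atomic_acquire_fence();	/* isync-based on PPC without lwsync */
	return ret;
}
#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add_acquire
#endif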

> > [...]
> >>  
> >>  #endif /* __KERNEL__ */
> >>  #endif /* _ASM_POWERPC_ATOMIC_H_ */
> >> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> >> b/arch/powerpc/include/asm/cmpxchg.h
> >> index cf091c4c22e5..181f7e8b3281 100644
> >> --- a/arch/powerpc/include/asm/cmpxchg.h
> >> +++ b/arch/powerpc/include/asm/cmpxchg.h
> >> @@ -192,7 +192,7 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned 
> >> int size)
> >>(unsigned long)_x_, sizeof(*(ptr)));
> >>  \
> >>})
> >>  
> >> -#define xchg_relaxed(ptr, x)  
> >> \
> >> +#define arch_xchg_relaxed(ptr, x) \
> >>  ({
> >> \
> >>__typeof__(*(ptr)) _x_ = (x);   \
> >>(__typeof__(*(ptr))) 

Re: [PATCH 3/3] powerpc: rewrite atomics to use ARCH_ATOMIC

2020-11-13 Thread Boqun Feng
Hi Nicholas,

On Wed, Nov 11, 2020 at 09:07:23PM +1000, Nicholas Piggin wrote:
> All the cool kids are doing it.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/atomic.h  | 681 ++---
>  arch/powerpc/include/asm/cmpxchg.h |  62 +--
>  2 files changed, 248 insertions(+), 495 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/atomic.h 
> b/arch/powerpc/include/asm/atomic.h
> index 8a55eb8cc97b..899aa2403ba7 100644
> --- a/arch/powerpc/include/asm/atomic.h
> +++ b/arch/powerpc/include/asm/atomic.h
> @@ -11,185 +11,285 @@
>  #include 
>  #include 
>  
> +#define ARCH_ATOMIC
> +
> +#ifndef CONFIG_64BIT
> +#include 
> +#endif
> +
>  /*
>   * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
>   * a "bne-" instruction at the end, so an isync is enough as a acquire 
> barrier
>   * on the platform without lwsync.
>   */
>  #define __atomic_acquire_fence() \
> - __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory")
> + asm volatile(PPC_ACQUIRE_BARRIER "" : : : "memory")
>  
>  #define __atomic_release_fence() \
> - __asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory")
> + asm volatile(PPC_RELEASE_BARRIER "" : : : "memory")
>  
> -static __inline__ int atomic_read(const atomic_t *v)
> -{
> - int t;
> +#define __atomic_pre_full_fence  smp_mb
>  
> - __asm__ __volatile__("lwz%U1%X1 %0,%1" : "=r"(t) : "m"(v->counter));
> +#define __atomic_post_full_fence smp_mb
>  

Do you need to define __atomic_{pre,post}_full_fence for PPC? IIRC, they
default to smp_mb__{before,after}_atomic(), which are smp_mb() by default
on PPC.
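
IIRC the generic code already provides, roughly:

/* include/linux/atomic.h (from memory) */
#ifndef __atomic_pre_full_fence
#define __atomic_pre_full_fence	smp_mb__before_atomic
#endif

#ifndef __atomic_post_full_fence
#define __atomic_post_full_fence	smp_mb__after_atomic
#endif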

> - return t;
> +#define arch_atomic_read(v)  __READ_ONCE((v)->counter)
> +#define arch_atomic_set(v, i)
> __WRITE_ONCE(((v)->counter), (i))
> +#ifdef CONFIG_64BIT
> +#define ATOMIC64_INIT(i) { (i) }
> +#define arch_atomic64_read(v)
> __READ_ONCE((v)->counter)
> +#define arch_atomic64_set(v, i)  
> __WRITE_ONCE(((v)->counter), (i))
> +#endif
> +
[...]
>  
> +#define ATOMIC_FETCH_OP_UNLESS_RELAXED(name, type, dtype, width, asm_op) \
> +static inline int arch_##name##_relaxed(type *v, dtype a, dtype u)   \

I don't think we have atomic_fetch_*_unless_relaxed() in the atomic APIs,
ditto for:

atomic_fetch_add_unless_relaxed()
atomic_inc_not_zero_relaxed()
atomic_dec_if_positive_relaxed()

, and we don't have the _acquire() and _release() variants for them
either. And if you don't define their fully-ordered versions (e.g.
atomic_inc_not_zero()), atomic-arch-fallback.h will use read and cmpxchg
to implement them, which I think is not what we want.
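
For example, if arch_atomic_fetch_add_unless() is left undefined, the
fallback is roughly the following read + cmpxchg loop (from memory of
atomic-arch-fallback.h), rather than a single ll/sc loop:

#ifndef arch_atomic_fetch_add_unless
static __always_inline int
arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
{
	int c = arch_atomic_read(v);

	do {
		if (unlikely(c == u))
			break;
	} while (!arch_atomic_try_cmpxchg(v, &c, c + a));

	return c;
}
#define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
#endif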

[...]
>  
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_ATOMIC_H_ */
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index cf091c4c22e5..181f7e8b3281 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -192,7 +192,7 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int 
> size)
>   (unsigned long)_x_, sizeof(*(ptr)));
>  \
>})
>  
> -#define xchg_relaxed(ptr, x) \
> +#define arch_xchg_relaxed(ptr, x)\
>  ({   \
>   __typeof__(*(ptr)) _x_ = (x);   \
>   (__typeof__(*(ptr))) __xchg_relaxed((ptr),  \
> @@ -448,35 +448,7 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned 
> long new,
>   return old;
>  }
>  
> -static __always_inline unsigned long
> -__cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
> -   unsigned int size)
> -{
> - switch (size) {
> - case 1:
> - return __cmpxchg_u8_acquire(ptr, old, new);
> - case 2:
> - return __cmpxchg_u16_acquire(ptr, old, new);
> - case 4:
> - return __cmpxchg_u32_acquire(ptr, old, new);
> -#ifdef CONFIG_PPC64
> - case 8:
> - return __cmpxchg_u64_acquire(ptr, old, new);
> -#endif
> - }
> - BUILD_BUG_ON_MSG(1, "Unsupported size for __cmpxchg_acquire");
> - return old;
> -}
> -#define cmpxchg(ptr, o, n)\
> -  ({  \
> - __typeof__(*(ptr)) _o_ = (o);\
> - __typeof__(*(ptr)) _n_ = (n);\
> - (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_,   
>  \
> - (unsigned long)_n_, sizeof(*(ptr))); \
> -  })
> -
> -

If you remove {atomic_}_cmpxchg_{,_acquire}() and use the version
provided by atomic-arch-fallback.h, then a fail cmpxchg 

Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-20 Thread Boqun Feng
On Fri, May 18, 2018 at 02:17:17PM -0400, Mathieu Desnoyers wrote:
> - On May 17, 2018, at 7:50 PM, Boqun Feng boqun.f...@gmail.com wrote:
> [...]
> >> > I think you're right. So we have to introduce callsite to rseq_syscall()
> >> > in syscall path, something like:
> >> > 
> >> > diff --git a/arch/powerpc/kernel/entry_64.S 
> >> > b/arch/powerpc/kernel/entry_64.S
> >> > index 51695608c68b..a25734a96640 100644
> >> > --- a/arch/powerpc/kernel/entry_64.S
> >> > +++ b/arch/powerpc/kernel/entry_64.S
> >> > @@ -222,6 +222,9 @@ system_call_exit:
> >> >  mtmsrd  r11,1
> >> > #endif /* CONFIG_PPC_BOOK3E */
> >> > 
> >> > +addir3,r1,STACK_FRAME_OVERHEAD
> >> > +bl  rseq_syscall
> >> > +
> >> >  ld  r9,TI_FLAGS(r12)
> >> >  li  r11,-MAX_ERRNO
> >> >  andi.
> >> >  
> >> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> >> > 
> 
> By the way, I think this is not the right spot to call rseq_syscall, because
> interrupts are disabled. I think we should move this hunk right after 
> system_call_exit.
> 

Good point.

> Would you like to implement and test an updated patch adding those calls for 
> ppc 32 and 64 ?
> 

I'd like to help, but I don't have a handy ppc environment for testing...
So I made the patch below, which has only been build-tested; hope it
could be somewhat helpful.

Regards,
Boqun

->8
Subject: [PATCH] powerpc: Add syscall detection for restartable sequences

Syscalls are not allowed inside restartable sequences, so add a call to
rseq_syscall() at the very beginning of the system call exit path for
CONFIG_DEBUG_RSEQ=y kernels. This could help us detect whether a syscall
is issued inside a restartable sequence.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/kernel/entry_32.S | 5 +
 arch/powerpc/kernel/entry_64.S | 5 +
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index eb8d01bae8c6..2f134eebe7ed 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -365,6 +365,11 @@ syscall_dotrace_cont:
blrl/* Call handler */
.globl  ret_from_syscall
 ret_from_syscall:
+#ifdef CONFIG_DEBUG_RSEQ
+   /* Check whether the syscall is issued inside a restartable sequence */
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+   bl  rseq_syscall
+#endif
mr  r6,r3
CURRENT_THREAD_INFO(r12, r1)
/* disable interrupts so current_thread_info()->flags can't change */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 2cb5109a7ea3..2e2d59bb45d0 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -204,6 +204,11 @@ system_call:   /* label this so stack 
traces look sane */
  * This is blacklisted from kprobes further below with _ASM_NOKPROBE_SYMBOL().
  */
 system_call_exit:
+#ifdef CONFIG_DEBUG_RSEQ
+   /* Check whether the syscall is issued inside a restartable sequence */
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+   bl  rseq_syscall
+#endif
/*
 * Disable interrupts so current_thread_info()->flags can't change,
 * and so that we don't get interrupted after loading SRR0/1.
-- 
2.16.2



Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-17 Thread Boqun Feng


On Thu, May 17, 2018, at 11:28 PM, Mathieu Desnoyers wrote:
> - On May 16, 2018, at 9:19 PM, Boqun Feng boqun.f...@gmail.com wrote:
> 
> > On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
> >> - On May 16, 2018, at 12:18 PM, Peter Zijlstra pet...@infradead.org 
> >> wrote:
> >> 
> >> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
> >> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> >> >> index c32a181a7cbb..ed21a777e8c6 100644
> >> >> --- a/arch/powerpc/Kconfig
> >> >> +++ b/arch/powerpc/Kconfig
> >> >> @@ -223,6 +223,7 @@ config PPC
> >> >> select HAVE_SYSCALL_TRACEPOINTS
> >> >> select HAVE_VIRT_CPU_ACCOUNTING
> >> >> select HAVE_IRQ_TIME_ACCOUNTING
> >> >> +   select HAVE_RSEQ
> >> >> select IRQ_DOMAIN
> >> >> select IRQ_FORCED_THREADING
> >> >> select MODULES_USE_ELF_RELA
> >> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> >> >> index 61db86ecd318..d3bb3aaaf5ac 100644
> >> >> --- a/arch/powerpc/kernel/signal.c
> >> >> +++ b/arch/powerpc/kernel/signal.c
> >> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
> >> >> /* Re-enable the breakpoints for the signal stack */
> >> >> thread_change_pc(tsk, tsk->thread.regs);
> >> >>  
> >> >> +   rseq_signal_deliver(tsk->thread.regs);
> >> >> +
> >> >> if (is32) {
> >> >> if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> >> >> ret = handle_rt_signal32(, oldset, tsk);
> >> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, 
> >> >> unsigned long
> >> >> thread_info_flags)
> >> >> if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> >> >> clear_thread_flag(TIF_NOTIFY_RESUME);
> >> >> tracehook_notify_resume(regs);
> >> >> +   rseq_handle_notify_resume(regs);
> >> >> }
> >> >>  
> >> >> user_enter();
> >> > 
> >> > Again no rseq_syscall().
> >> 
> >> Same question for PowerPC as for ARM:
> >> 
> >> Considering that rseq_syscall is implemented as follows:
> >> 
> >> +void rseq_syscall(struct pt_regs *regs)
> >> +{
> >> +   unsigned long ip = instruction_pointer(regs);
> >> +   struct task_struct *t = current;
> >> +   struct rseq_cs rseq_cs;
> >> +
> >> +   if (!t->rseq)
> >> +   return;
> >> +   if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
> >> +   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> >> +   force_sig(SIGSEGV, t);
> >> +}
> >> 
> >> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
> >> now used in the fast-path since KPTI), I wonder where we should call
> > 
> > So we actually detect this after the syscall takes effect, right? I
> > wonder whether this could be problematic, because "disallowing syscall"
> > in rseq areas may means the syscall won't take effect to some people, I
> > guess?
> > 
> >> this on PowerPC ?  I was under the impression that PowerPC return to
> >> userspace fast-path was not calling C code unless work flags were set,
> >> but I might be wrong.
> >> 
> > 
> > I think you're right. So we have to introduce callsite to rseq_syscall()
> > in syscall path, something like:
> > 
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 51695608c68b..a25734a96640 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -222,6 +222,9 @@ system_call_exit:
> > mtmsrd  r11,1
> > #endif /* CONFIG_PPC_BOOK3E */
> > 
> > +   addir3,r1,STACK_FRAME_OVERHEAD
> > +   bl  rseq_syscall
> > +
> > ld  r9,TI_FLAGS(r12)
> > li  r11,-MAX_ERRNO
> > andi.
> > 
> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> > 
> > But I think it's important for us to first decide where (before or after
> > the syscall) we 

Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-16 Thread Boqun Feng
On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
> - On May 16, 2018, at 12:18 PM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> >> index c32a181a7cbb..ed21a777e8c6 100644
> >> --- a/arch/powerpc/Kconfig
> >> +++ b/arch/powerpc/Kconfig
> >> @@ -223,6 +223,7 @@ config PPC
> >>select HAVE_SYSCALL_TRACEPOINTS
> >>select HAVE_VIRT_CPU_ACCOUNTING
> >>select HAVE_IRQ_TIME_ACCOUNTING
> >> +  select HAVE_RSEQ
> >>select IRQ_DOMAIN
> >>select IRQ_FORCED_THREADING
> >>select MODULES_USE_ELF_RELA
> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> >> index 61db86ecd318..d3bb3aaaf5ac 100644
> >> --- a/arch/powerpc/kernel/signal.c
> >> +++ b/arch/powerpc/kernel/signal.c
> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
> >>/* Re-enable the breakpoints for the signal stack */
> >>thread_change_pc(tsk, tsk->thread.regs);
> >>  
> >> +  rseq_signal_deliver(tsk->thread.regs);
> >> +
> >>if (is32) {
> >>if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> >>ret = handle_rt_signal32(, oldset, tsk);
> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned 
> >> long
> >> thread_info_flags)
> >>if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> >>clear_thread_flag(TIF_NOTIFY_RESUME);
> >>tracehook_notify_resume(regs);
> >> +  rseq_handle_notify_resume(regs);
> >>}
> >>  
> >>user_enter();
> > 
> > Again no rseq_syscall().
> 
> Same question for PowerPC as for ARM:
> 
> Considering that rseq_syscall is implemented as follows:
> 
> +void rseq_syscall(struct pt_regs *regs)
> +{
> +   unsigned long ip = instruction_pointer(regs);
> +   struct task_struct *t = current;
> +   struct rseq_cs rseq_cs;
> +
> +   if (!t->rseq)
> +   return;
> +   if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
> +   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> +   force_sig(SIGSEGV, t);
> +}
> 
> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
> now used in the fast-path since KPTI), I wonder where we should call

So we actually detect this after the syscall takes effect, right? I
wonder whether this could be problematic, because to some people
"disallowing syscalls" in rseq critical sections may mean the syscall
won't take effect, I guess?

> this on PowerPC ?  I was under the impression that PowerPC return to
> userspace fast-path was not calling C code unless work flags were set,
> but I might be wrong.
> 

I think you're right. So we have to introduce a callsite for
rseq_syscall() in the syscall path, something like:

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 51695608c68b..a25734a96640 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -222,6 +222,9 @@ system_call_exit:
mtmsrd  r11,1
 #endif /* CONFIG_PPC_BOOK3E */
 
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+   bl  rseq_syscall
+
ld  r9,TI_FLAGS(r12)
li  r11,-MAX_ERRNO
andi.   
r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)

But I think it's important for us to first decide where (before or after
the syscall) we do the detection.

Regards,
Boqun

> Thoughts ?
> 
> Thanks!
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-30 Thread Boqun Feng
On Fri, Jul 28, 2017 at 12:09:56PM -0700, Paul E. McKenney wrote:
> On Fri, Jul 28, 2017 at 11:41:29AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 28, 2017 at 07:55:30AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jul 28, 2017 at 08:54:16PM +0800, Boqun Feng wrote:
> 
> [ . . . ]
> 
> > > Even though Jonathan's testing indicates that it didn't fix this
> > > particular problem:
> > > 
> > > Acked-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > 
> > And while we are at it:
> > 
> > Tested-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> 
> Not because it it fixed the TREE01 issue -- it did not.  But as near
> as I can see, it didn't cause any additional issues.
> 

Understood.

I'm still working on waketorture for a test case that could trigger this
problem in the real world. My original plan was to send this out once I
could use waketorture to show the patch actually resolves some potential
bugs, but I put it out here ahead of that in case it may help.

Will send it out with your Tested-by and Acked-by and continue to work
on waketorture.

Regards,
Boqun

>   Thanx, Paul
> 




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-28 Thread Boqun Feng
On Fri, Jul 28, 2017 at 11:41:29AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 28, 2017 at 07:55:30AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 28, 2017 at 08:54:16PM +0800, Boqun Feng wrote:
> > > Hi Jonathan,
> > > 
> > > FWIW, there is wakeup-missing issue in swake_up() and swake_up_all():
> > > 
> > >   https://marc.info/?l=linux-kernel=149750022019663
> > > 
> > > and RCU begins to use swait/wake last year, so I thought this could be
> > > relevant.
> > > 
> > > Could you try the following patch and see if it works? Thanks.
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > -->8
> > > Subject: [PATCH] swait: Remove the lockless swait_active() check in
> > >  swake_up*()
> > > 
> > > Steven Rostedt reported a potential race in RCU core because of
> > > swake_up():
> > > 
> > > CPU0CPU1
> > > 
> > > __call_rcu_core() {
> > > 
> > >  spin_lock(rnp_root)
> > >  need_wake = __rcu_start_gp() {
> > >   rcu_start_gp_advanced() {
> > >gp_flags = FLAG_INIT
> > >   }
> > >  }
> > > 
> > >  rcu_gp_kthread() {
> > >swait_event_interruptible(wq,
> > > gp_flags & FLAG_INIT) {
> > 
> > So the idea is that we get the old value of ->gp_flags here, correct?
> > 

Yes.

> > >spin_lock(q->lock)
> > > 
> > > *fetch wq->task_list here! *
> > 
> > And the above fetch is really part of the swait_active() called out
> > below, right?
> > 

Right.

> > >list_add(wq->task_list, q->task_list)
> > >spin_unlock(q->lock);
> > > 
> > >*fetch old value of gp_flags here *
> > 
> > And here we fetch the old value of ->gp_flags again, this time under
> > the lock, right?
> > 

Hmm.. a bit different: this fetch is still lockless, but it happens
*after* the waiter enqueued itself. We could rely on the
spin_lock(q->lock) above to pair with a spin_unlock() from another
critical section that accesses the wait queue (typically from some
waker). But in the case Steven came up with, there is a lockless access
to the wait queue from the waker, so such a pair doesn't exist. This ends
up with the waker seeing an empty wait queue and doing nothing, while the
waiter still observes the old value after its enqueue and goes to sleep.
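
IOW, if we wanted to keep the lockless check, swake_up() would need
something like the sketch below (this is option 1) from the changelog
further down, not what the patch actually does):

void swake_up(struct swait_queue_head *q)
{
	unsigned long flags;

	/*
	 * Pairs with the smp_mb() implied by set_current_state() on the
	 * waiter side (after it enqueues itself and before it re-checks
	 * the condition): either we see the waiter on the list, or the
	 * waiter sees the updated condition.
	 */
	smp_mb();
	if (!swait_active(q))
		return;

	raw_spin_lock_irqsave(&q->lock, flags);
	swake_up_locked(q);
	raw_spin_unlock_irqrestore(&q->lock, flags);
}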

> > >  spin_unlock(rnp_root)
> > > 
> > >  rcu_gp_kthread_wake() {
> > >   swake_up(wq) {
> > >swait_active(wq) {
> > > list_empty(wq->task_list)
> > > 
> > >} * return false *
> > > 
> > >   if (condition) * false *
> > > schedule();
> > > 
> > > In this case, a wakeup is missed, which could cause the rcu_gp_kthread
> > > waits for a long time.
> > > 
> > > The reason of this is that we do a lockless swait_active() check in
> > > swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
> > > before swait_active() to provide the proper order or 2) simply remove
> > > the swait_active() in swake_up().
> > > 
> > > The solution 2 not only fixes this problem but also keeps the swait and
> > > wait API as close as possible, as wake_up() doesn't provide a full
> > > barrier and doesn't do a lockless check of the wait queue either.
> > > Moreover, there are users already using swait_active() to do their quick
> > > checks for the wait queues, so it make less sense that swake_up() and
> > > swake_up_all() do this on their own.
> > > 
> > > This patch then removes the lockless swait_active() check in swake_up()
> > > and swake_up_all().
> > > 
> > > Reported-by: Steven Rostedt <rost...@goodmis.org>
> > > Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
> > 
> > Even though Jonathan's testing indicates that it didn't fix this
> > particular problem:
> > 
> > Acked-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> 
> And while we 

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-28 Thread Boqun Feng
Hi Jonathan,

FWIW, there is wakeup-missing issue in swake_up() and swake_up_all():

https://marc.info/?l=linux-kernel=149750022019663

and RCU begins to use swait/wake last year, so I thought this could be
relevant.

Could you try the following patch and see if it works? Thanks.

Regards,
Boqun

-->8
Subject: [PATCH] swait: Remove the lockless swait_active() check in
 swake_up*()

Steven Rostedt reported a potential race in RCU core because of
swake_up():

CPU0CPU1

__call_rcu_core() {

 spin_lock(rnp_root)
 need_wake = __rcu_start_gp() {
  rcu_start_gp_advanced() {
   gp_flags = FLAG_INIT
  }
 }

 rcu_gp_kthread() {
   swait_event_interruptible(wq,
gp_flags & FLAG_INIT) {
   spin_lock(q->lock)

*fetch wq->task_list here! *

   list_add(wq->task_list, q->task_list)
   spin_unlock(q->lock);

   *fetch old value of gp_flags here *

 spin_unlock(rnp_root)

 rcu_gp_kthread_wake() {
  swake_up(wq) {
   swait_active(wq) {
list_empty(wq->task_list)

   } * return false *

  if (condition) * false *
schedule();

In this case, a wakeup is missed, which could cause the rcu_gp_kthread
to wait for a long time.

The reason for this is that we do a lockless swait_active() check in
swake_up(). To fix this, we can either 1) add an smp_mb() in swake_up()
before swait_active() to provide the proper ordering, or 2) simply remove
the swait_active() check in swake_up().

Solution 2 not only fixes this problem but also keeps the swait and
wait APIs as close as possible, as wake_up() doesn't provide a full
barrier and doesn't do a lockless check of the wait queue either.
Moreover, there are users already using swait_active() to do their quick
checks of the wait queues, so it makes less sense for swake_up() and
swake_up_all() to do this on their own.

This patch then removes the lockless swait_active() check in swake_up()
and swake_up_all().

Reported-by: Steven Rostedt <rost...@goodmis.org>
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 kernel/sched/swait.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 3d5610dcce11..2227e183e202 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -33,9 +33,6 @@ void swake_up(struct swait_queue_head *q)
 {
unsigned long flags;
 
-   if (!swait_active(q))
-   return;
-
raw_spin_lock_irqsave(&q->lock, flags);
swake_up_locked(q);
raw_spin_unlock_irqrestore(&q->lock, flags);
@@ -51,9 +48,6 @@ void swake_up_all(struct swait_queue_head *q)
struct swait_queue *curr;
LIST_HEAD(tmp);
 
-   if (!swait_active(q))
-   return;
-
raw_spin_lock_irq(&q->lock);
list_splice_init(&q->task_list, &tmp);
while (!list_empty(&tmp)) {
-- 
2.13.0



Re: [PATCH RFC 21/26] powerpc: Remove spin_unlock_wait() arch-specific definitions

2017-07-01 Thread Boqun Feng
On Thu, Jun 29, 2017 at 05:01:29PM -0700, Paul E. McKenney wrote:
> There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> and it appears that all callers could do just as well with a lock/unlock
> pair.  This commit therefore removes the underlying arch-specific
> arch_spin_unlock_wait().
> 
> Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: <linuxppc-dev@lists.ozlabs.org>
> Cc: Will Deacon <will.dea...@arm.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Alan Stern <st...@rowland.harvard.edu>
> Cc: Andrea Parri <parri.and...@gmail.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>

Acked-by: Boqun Feng <boqun.f...@gmail.com>

Regards,
Boqun

> ---
>  arch/powerpc/include/asm/spinlock.h | 33 -
>  1 file changed, 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 8c1b913de6d7..d256e448ea49 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -170,39 +170,6 @@ static inline void arch_spin_unlock(arch_spinlock_t 
> *lock)
>   lock->slock = 0;
>  }
>  
> -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> -{
> - arch_spinlock_t lock_val;
> -
> - smp_mb();
> -
> - /*
> -  * Atomically load and store back the lock value (unchanged). This
> -  * ensures that our observation of the lock value is ordered with
> -  * respect to other lock operations.
> -  */
> - __asm__ __volatile__(
> -"1:  " PPC_LWARX(%0, 0, %2, 0) "\n"
> -"stwcx. %0, 0, %2\n"
> -"bne- 1b\n"
> - : "=" (lock_val), "+m" (*lock)
> - : "r" (lock)
> - : "cr0", "xer");
> -
> - if (arch_spin_value_unlocked(lock_val))
> - goto out;
> -
> - while (lock->slock) {
> - HMT_low();
> - if (SHARED_PROCESSOR)
> - __spin_yield(lock);
> - }
> - HMT_medium();
> -
> -out:
> - smp_mb();
> -}
> -
>  /*
>   * Read-write spinlocks, allowing multiple readers
>   * but only one writer.
> -- 
> 2.5.2
> 




Re: [PATCH v8 3/6] powerpc: lib/locks.c: Add cpu yield/wake helper function

2016-12-05 Thread Boqun Feng
On Mon, Dec 05, 2016 at 10:19:23AM -0500, Pan Xinhui wrote:
> Add two corresponding helper functions to support pv-qspinlock.
> 
> For normal use, __spin_yield_cpu will confer current vcpu slices to the
> target vcpu(say, a lock holder). If target vcpu is not specified or it
> is in running state, such conferging to lpar happens or not depends.
> 
> Because hcall itself will introduce latency and a little overhead. And we
> do NOT want to suffer any latency on some cases, e.g. in interrupt handler.
> The second parameter *confer* can indicate such case.
> 
> __spin_wake_cpu is simpiler, it will wake up one vcpu regardless of its
> current vcpu state.
> 
> Signed-off-by: Pan Xinhui 
> ---
>  arch/powerpc/include/asm/spinlock.h |  4 +++
>  arch/powerpc/lib/locks.c| 59 
> +
>  2 files changed, 63 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 954099e..6426bd5 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -64,9 +64,13 @@ static inline bool vcpu_is_preempted(int cpu)
>  /* We only yield to the hypervisor if we are in shared processor mode */
>  #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
>  extern void __spin_yield(arch_spinlock_t *lock);
> +extern void __spin_yield_cpu(int cpu, int confer);
> +extern void __spin_wake_cpu(int cpu);
>  extern void __rw_yield(arch_rwlock_t *lock);
>  #else /* SPLPAR */
>  #define __spin_yield(x)barrier()
> +#define __spin_yield_cpu(x, y) barrier()
> +#define __spin_wake_cpu(x) barrier()
>  #define __rw_yield(x)  barrier()
>  #define SHARED_PROCESSOR   0
>  #endif
> diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
> index 6574626..bd872c9 100644
> --- a/arch/powerpc/lib/locks.c
> +++ b/arch/powerpc/lib/locks.c
> @@ -23,6 +23,65 @@
>  #include 
>  #include 
>  
> +/*
> + * confer our slices to a specified cpu and return. If it is in running state
> + * or cpu is -1, then we will check confer. If confer is NULL, we will return
> + * otherwise we confer our slices to lpar.
> + */
> +void __spin_yield_cpu(int cpu, int confer)
> +{
> + unsigned int holder_cpu = cpu, yield_count;

As I said at:

https://marc.info/?l=linux-kernel=147455748619343=2

@holder_cpu is not necessary and doesn't help anything.

> +
> + if (cpu == -1)
> + goto yield_to_lpar;
> +
> + BUG_ON(holder_cpu >= nr_cpu_ids);
> + yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
> +
> + /* if cpu is running, confer slices to lpar conditionally*/
> + if ((yield_count & 1) == 0)
> + goto yield_to_lpar;
> +
> + plpar_hcall_norets(H_CONFER,
> + get_hard_smp_processor_id(holder_cpu), yield_count);
> + return;
> +
> +yield_to_lpar:
> + if (confer)
> + plpar_hcall_norets(H_CONFER, -1, 0);
> +}
> +EXPORT_SYMBOL_GPL(__spin_yield_cpu);
> +
> +void __spin_wake_cpu(int cpu)
> +{
> + unsigned int holder_cpu = cpu;

And it's even wrong to call the parameter of _wake_cpu() a holder_cpu,
because it's not the current lock holder.

Regards,
Boqun

> +
> + BUG_ON(holder_cpu >= nr_cpu_ids);
> + /*
> +  * NOTE: we should always do this hcall regardless of
> +  * the yield_count of the holder_cpu.
> +  * as thers might be a case like below;
> +  *  CPU 1   CPU 2
> +  *  yielded = true
> +  * if (yielded)
> +  *  __spin_wake_cpu()
> +  *  __spin_yield_cpu()
> +  *
> +  * So we might lose a wake if we check the yield_count and
> +  * return directly if the holder_cpu is running.
> +  * IOW. do NOT code like below.
> +  *  yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
> +  *  if ((yield_count & 1) == 0)
> +  *  return;
> +  *
> +  * a PROD hcall marks the target_cpu proded, which cause the next cede
> +  * or confer called on the target_cpu invalid.
> +  */
> + plpar_hcall_norets(H_PROD,
> + get_hard_smp_processor_id(holder_cpu));
> +}
> +EXPORT_SYMBOL_GPL(__spin_wake_cpu);
> +
>  #ifndef CONFIG_QUEUED_SPINLOCKS
>  void __spin_yield(arch_spinlock_t *lock)
>  {
> -- 
> 2.4.11
> 




Re: [PATCH v8 2/6] powerpc: pSeries/Kconfig: Add qspinlock build config

2016-12-05 Thread Boqun Feng
On Mon, Dec 05, 2016 at 10:19:22AM -0500, Pan Xinhui wrote:
> pSeries/powerNV will use qspinlock from now on.
> 
> Signed-off-by: Pan Xinhui 
> ---
>  arch/powerpc/platforms/pseries/Kconfig | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/Kconfig 
> b/arch/powerpc/platforms/pseries/Kconfig
> index bec90fb..8a87d06 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig

Why here? Not arch/powerpc/platforms/Kconfig?

> @@ -23,6 +23,14 @@ config PPC_PSERIES
>   select PPC_DOORBELL
>   default y
>  
> +config ARCH_USE_QUEUED_SPINLOCKS
> + default y
> + bool "Enable qspinlock"

I think this just enables qspinlock by default for all PPC platforms. I
guess you need to put

depends on PPC_PSERIES || PPC_POWERNV

here to achieve what you mean in your commit message.

Regards,
Boqun

> + help
> +   Enabling this option will let kernel use qspinlock which is a kind of
> +   fairlock.  It has shown a good performance improvement on x86 and 
> also ppc
> +   especially in high contention cases.
> +
>  config PPC_SPLPAR
>   depends on PPC_PSERIES
>   bool "Support for shared-processor logical partitions"
> -- 
> 2.4.11
> 




Re: [PATCH v8 1/6] powerpc/qspinlock: powerpc support qspinlock

2016-12-05 Thread Boqun Feng
On Mon, Dec 05, 2016 at 10:19:21AM -0500, Pan Xinhui wrote:
> This patch add basic code to enable qspinlock on powerpc. qspinlock is
> one kind of fairlock implementation. And seen some performance improvement
> under some scenarios.
> 
> queued_spin_unlock() release the lock by just one write of NULL to the
> ::locked field which sits at different places in the two endianness
> system.
> 
> We override some arch_spin_XXX as powerpc has io_sync stuff which makes
> sure the io operations are protected by the lock correctly.
> 
> There is another special case, see commit
> 2c610022711 ("locking/qspinlock: Fix spin_unlock_wait() some more")
> 
> Signed-off-by: Pan Xinhui 
> ---
>  arch/powerpc/include/asm/qspinlock.h  | 66 
> +++
>  arch/powerpc/include/asm/spinlock.h   | 31 +--
>  arch/powerpc/include/asm/spinlock_types.h |  4 ++
>  arch/powerpc/lib/locks.c  | 59 +++
>  4 files changed, 147 insertions(+), 13 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/qspinlock.h
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> new file mode 100644
> index 000..4c89256
> --- /dev/null
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -0,0 +1,66 @@
> +#ifndef _ASM_POWERPC_QSPINLOCK_H
> +#define _ASM_POWERPC_QSPINLOCK_H
> +
> +#include 
> +
> +#define SPIN_THRESHOLD (1 << 15)
> +#define queued_spin_unlock queued_spin_unlock
> +#define queued_spin_is_locked queued_spin_is_locked
> +#define queued_spin_unlock_wait queued_spin_unlock_wait
> +
> +extern void queued_spin_unlock_wait(struct qspinlock *lock);
> +
> +static inline u8 *__qspinlock_lock_byte(struct qspinlock *lock)
> +{
> + return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
> +}
> +
> +static inline void queued_spin_unlock(struct qspinlock *lock)
> +{
> + /* release semantics is required */
> + smp_store_release(__qspinlock_lock_byte(lock), 0);
> +}
> +
> +static inline int queued_spin_is_locked(struct qspinlock *lock)
> +{
> + smp_mb();
> + return atomic_read(&lock->val);
> +}
> +
> +#include 
> +
> +/* we need override it as ppc has io_sync stuff */
> +#undef arch_spin_trylock
> +#undef arch_spin_lock
> +#undef arch_spin_lock_flags
> +#undef arch_spin_unlock
> +#define arch_spin_trylock arch_spin_trylock
> +#define arch_spin_lock arch_spin_lock
> +#define arch_spin_lock_flags arch_spin_lock_flags
> +#define arch_spin_unlock arch_spin_unlock
> +
> +static inline int arch_spin_trylock(arch_spinlock_t *lock)
> +{
> + CLEAR_IO_SYNC;
> + return queued_spin_trylock(lock);
> +}
> +
> +static inline void arch_spin_lock(arch_spinlock_t *lock)
> +{
> + CLEAR_IO_SYNC;
> + queued_spin_lock(lock);
> +}
> +
> +static inline
> +void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
> +{
> + CLEAR_IO_SYNC;
> + queued_spin_lock(lock);
> +}
> +
> +static inline void arch_spin_unlock(arch_spinlock_t *lock)
> +{
> + SYNC_IO;
> + queued_spin_unlock(lock);
> +}
> +#endif /* _ASM_POWERPC_QSPINLOCK_H */
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 8c1b913..954099e 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -60,6 +60,23 @@ static inline bool vcpu_is_preempted(int cpu)
>  }
>  #endif
>  
> +#if defined(CONFIG_PPC_SPLPAR)
> +/* We only yield to the hypervisor if we are in shared processor mode */
> +#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
> +extern void __spin_yield(arch_spinlock_t *lock);
> +extern void __rw_yield(arch_rwlock_t *lock);
> +#else /* SPLPAR */
> +#define __spin_yield(x)barrier()
> +#define __rw_yield(x)  barrier()
> +#define SHARED_PROCESSOR   0
> +#endif
> +
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> +#include 
> +#else
> +
> +#define arch_spin_relax(lock)  __spin_yield(lock)
> +
>  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>  {
>   return lock.slock == 0;
> @@ -114,18 +131,6 @@ static inline int arch_spin_trylock(arch_spinlock_t 
> *lock)
>   * held.  Conveniently, we have a word in the paca that holds this
>   * value.
>   */
> -
> -#if defined(CONFIG_PPC_SPLPAR)
> -/* We only yield to the hypervisor if we are in shared processor mode */
> -#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
> -extern void __spin_yield(arch_spinlock_t *lock);
> -extern void __rw_yield(arch_rwlock_t *lock);
> -#else /* SPLPAR */
> -#define __spin_yield(x)  barrier()
> -#define __rw_yield(x)barrier()
> -#define SHARED_PROCESSOR 0
> -#endif
> -
>  static inline void arch_spin_lock(arch_spinlock_t *lock)
>  {
>   CLEAR_IO_SYNC;
> @@ -203,6 +208,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
> *lock)
>   smp_mb();
>  }
>  
> +#endif /* !CONFIG_QUEUED_SPINLOCKS */
>  /*
>   * Read-write spinlocks, allowing 

[RFC v2] powerpc: xmon: Add address lookup for percpu symbols

2016-11-22 Thread Boqun Feng
Currently, in xmon, there is no obvious way to get the address of a
percpu symbol for a particular cpu. Having such an ability would be good
for debugging the system when percpu variables get involved.

Therefore, this patch introduces a new xmon command "lp" to look up the
address of percpu symbols. Usage of "lp" is similar to "ls", except that
a cpu number can be added to choose which cpu's variable to look up. If
no cpu number is given, the current cpu is used.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
v1 --> v2:

o   Using per_cpu_ptr() and this_cpu_ptr() instead of
per_cpu_offset() and my_cpu_offset, because the latter
are undefined on !SMP kernel.

o   Only print the address of percpu symbols, i.e. symbols
in [__per_cpu_start, __per_cpu_end)

 arch/powerpc/xmon/xmon.c | 35 ++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 760545519a0b..2747c94400a2 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -49,6 +49,7 @@
 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC64
 #include 
@@ -229,6 +230,7 @@ Commands:\n\
   fflush cache\n\
   la   lookup symbol+offset of specified address\n\
   ls   lookup address of specified symbol\n\
+  lp s [#] lookup address of percpu symbol s for current cpu, or cpu #\n\
   mexamine/change memory\n\
   mm   move a block of memory\n\
   ms   set a block of memory\n\
@@ -2943,7 +2945,8 @@ static void
 symbol_lookup(void)
 {
int type = inchar();
-   unsigned long addr;
+   unsigned long addr, cpu;
+   void __percpu *ptr = NULL;
static char tmp[64];
 
switch (type) {
@@ -2967,6 +2970,36 @@ symbol_lookup(void)
catch_memory_errors = 0;
termch = 0;
break;
+   case 'p':
+   getstring(tmp, 64);
+   if (setjmp(bus_error_jmp) == 0) {
+   catch_memory_errors = 1;
+   sync();
+   ptr = (void __percpu *)kallsyms_lookup_name(tmp);
+   sync();
+   }
+
+   if (ptr && ptr >= (void __percpu *)__per_cpu_start
+   && ptr < (void __percpu *)__per_cpu_end) {
+
+   if (scanhex(&cpu) && cpu < num_possible_cpus())
+   addr = (unsigned long)per_cpu_ptr(ptr, cpu);
+   else {
+   cpu = raw_smp_processor_id();
+   addr = (unsigned long)this_cpu_ptr(ptr);
+   }
+
+   printf("%s for cpu 0x%lx: %lx\n", tmp, cpu, addr);
+
+   } else {
+   printf("Percpu symbol '%s' not found.\n", tmp);
+   }
+
+
+
+   catch_memory_errors = 0;
+   termch = 0;
+   break;
}
 }
 
-- 
2.10.1



[RFC] powerpc: xmon: Add address lookup for percpu symbols

2016-11-17 Thread Boqun Feng
Currently, in xmon, there is no obvious way to get the address of a
percpu symbol for a particular cpu. Having such an ability would be good
for debugging the system when percpu variables get involved.

Therefore, this patch introduces a new xmon command "lp" to look up the
address of percpu symbols. Usage of "lp" is similar to "ls", except that
a cpu number can be added to choose which cpu's variable to look up. If
no cpu number is given, the current cpu is used.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/xmon/xmon.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 760545519a0b..3556966a29a5 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -229,6 +229,7 @@ Commands:\n\
   fflush cache\n\
   la   lookup symbol+offset of specified address\n\
   ls   lookup address of specified symbol\n\
+  lp s [#] lookup address of percpu symbol s for current cpu, or cpu #\n\
   mexamine/change memory\n\
   mm   move a block of memory\n\
   ms   set a block of memory\n\
@@ -2943,7 +2944,7 @@ static void
 symbol_lookup(void)
 {
int type = inchar();
-   unsigned long addr;
+   unsigned long addr, cpu, offset;
static char tmp[64];
 
switch (type) {
@@ -2967,6 +2968,31 @@ symbol_lookup(void)
catch_memory_errors = 0;
termch = 0;
break;
+   case 'p':
+   getstring(tmp, 64);
+   if (setjmp(bus_error_jmp) == 0) {
+   catch_memory_errors = 1;
+   sync();
+   addr = kallsyms_lookup_name(tmp);
+   sync();
+   }
+
+   if (scanhex() && cpu < num_possible_cpus())
+   offset = per_cpu_offset(cpu);
+   else {
+   offset = my_cpu_offset;
+   cpu = raw_smp_processor_id();
+   }
+
+   if (addr)
+   printf("%s for cpu 0x%lx: %lx\n", tmp, cpu,
+ addr + offset);
+   else
+   printf("Percpu symbol '%s' not found.\n", tmp);
+
+   catch_memory_errors = 0;
+   termch = 0;
+   break;
}
 }
 
-- 
2.10.1



Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-20 Thread Boqun Feng
On Thu, Oct 20, 2016 at 05:27:54PM -0400, Pan Xinhui wrote:
> Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
> preempted" into struct kvm_steal_time. This field tells if one vcpu is
> running or not.
> 
> It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is
> preempted. Other values means the vcpu has been preempted.
  ^
s/preempted/not preempted

And better to fix other typos in the commit log ;-)
Maybe you can try aspell? That works for me.

Regards,
Boqun

> 
> Signed-off-by: Pan Xinhui 
> ---
>  Documentation/virtual/kvm/msr.txt | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/msr.txt 
> b/Documentation/virtual/kvm/msr.txt
> index 2a71c8f..3376f13 100644
> --- a/Documentation/virtual/kvm/msr.txt
> +++ b/Documentation/virtual/kvm/msr.txt
> @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>   __u64 steal;
>   __u32 version;
>   __u32 flags;
> - __u32 pad[12];
> + __u8  preempted;
> + __u32 pad[11];
>   }
>  
>   whose data will be filled in by the hypervisor periodically. Only one
> @@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>   nanoseconds. Time during which the vcpu is idle, will not be
>   reported as steal time.
>  
> + preempted: indicate the VCPU who owns this struct is running or
> + not. Non-zero values mean the VCPU has been preempted. Zero
> + means the VCPU is not preempted. NOTE, it is always zero if the
> + the hypervisor doesn't support this field.
> +
>  MSR_KVM_EOI_EN: 0x4b564d04
>   data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
>   when disabled.  Bit 1 is reserved and must be zero.  When PV end of
> -- 
> 2.4.11
> 




Re: [PATCH v7 4/6] powerpc: lib/locks.c: Add cpu yield/wake helper function

2016-09-22 Thread Boqun Feng
Hi Xinhui,

On Mon, Sep 19, 2016 at 05:23:55AM -0400, Pan Xinhui wrote:
> Add two corresponding helper functions to support pv-qspinlock.
> 
> For normal use, __spin_yield_cpu will confer current vcpu slices to the
> target vcpu(say, a lock holder). If target vcpu is not specified or it
> is in running state, such conferging to lpar happens or not depends.
> 
> Because hcall itself will introduce latency and a little overhead. And
> we do NOT want to suffer any latency on some cases, e.g. in interrupt handler.
> The second parameter *confer* can indicate such case.
> 
> __spin_wake_cpu is simpiler, it will wake up one vcpu regardless of its
> current vcpu state.
> 
> Signed-off-by: Pan Xinhui 
> ---
>  arch/powerpc/include/asm/spinlock.h |  4 +++
>  arch/powerpc/lib/locks.c| 59 
> +
>  2 files changed, 63 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 6aef8dd..abb6b0f 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -56,9 +56,13 @@
>  /* We only yield to the hypervisor if we are in shared processor mode */
>  #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
>  extern void __spin_yield(arch_spinlock_t *lock);
> +extern void __spin_yield_cpu(int cpu, int confer);
> +extern void __spin_wake_cpu(int cpu);
>  extern void __rw_yield(arch_rwlock_t *lock);
>  #else /* SPLPAR */
>  #define __spin_yield(x)  barrier()
> +#define __spin_yield_cpu(x,y) barrier()
> +#define __spin_wake_cpu(x) barrier()
>  #define __rw_yield(x)barrier()
>  #define SHARED_PROCESSOR 0
>  #endif
> diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
> index 6574626..892df7d 100644
> --- a/arch/powerpc/lib/locks.c
> +++ b/arch/powerpc/lib/locks.c
> @@ -23,6 +23,65 @@
>  #include 
>  #include 
>  
> +/*
> + * confer our slices to a specified cpu and return. If it is already running 
> or
> + * cpu is -1, then we will check confer. If confer is NULL, we will return
> + * otherwise we confer our slices to lpar.
> + */
> +void __spin_yield_cpu(int cpu, int confer)
> +{
> + unsigned int holder_cpu = cpu, yield_count;
> +

You don't need @holder_cpu at all, do you?

> + if (cpu == -1)
> + goto yield_to_lpar;
> +
> + BUG_ON(holder_cpu >= nr_cpu_ids);
> + yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
> +
> + /* if cpu is running, confer slices to lpar conditionally*/
> + if ((yield_count & 1) == 0)
> + goto yield_to_lpar;
> +
> + plpar_hcall_norets(H_CONFER,
> + get_hard_smp_processor_id(holder_cpu), yield_count);
> + return;
> +
> +yield_to_lpar:
> + if (confer)
> + plpar_hcall_norets(H_CONFER, -1, 0);
> +}
> +EXPORT_SYMBOL_GPL(__spin_yield_cpu);
> +
> +void __spin_wake_cpu(int cpu)
> +{
> + unsigned int holder_cpu = cpu;
> +

Ditto.

Regards,
Boqun

> + BUG_ON(holder_cpu >= nr_cpu_ids);
> + /*
> +  * NOTE: we should always do this hcall regardless of
> +  * the yield_count of the holder_cpu.
> +  * as thers might be a case like below;
> +  * CPU  1   2
> +  *  yielded = true
> +  *  if (yielded)
> +  *  __spin_wake_cpu()
> +  *  __spin_yield_cpu()
> +  *
> +  * So we might lose a wake if we check the yield_count and
> +  * return directly if the holder_cpu is running.
> +  * IOW. do NOT code like below.
> +  *  yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
> +  *  if ((yield_count & 1) == 0)
> +  *  return;
> +  *
> +  * a PROD hcall marks the target_cpu proded, which cause the next cede 
> or confer
> +  * called on the target_cpu invalid.
> +  */
> + plpar_hcall_norets(H_PROD,
> + get_hard_smp_processor_id(holder_cpu));
> +}
> +EXPORT_SYMBOL_GPL(__spin_wake_cpu);
> +
>  #ifndef CONFIG_QUEUED_SPINLOCKS
>  void __spin_yield(arch_spinlock_t *lock)
>  {
> -- 
> 2.4.11
> 




[PATCH] powerpc, hotplug: Avoid to touch non-existent cpumasks.

2016-08-16 Thread Boqun Feng
We observed a kernel oops when running a PPC guest with config NR_CPUS=4
and qemu option "-smp cores=1,threads=8":

[   30.634781] Unable to handle kernel paging request for data at
address 0xc0014192eb17
[   30.636173] Faulting instruction address: 0xc003e5cc
[   30.637069] Oops: Kernel access of bad area, sig: 11 [#1]
[   30.637877] SMP NR_CPUS=4 NUMA pSeries
[   30.638471] Modules linked in:
[   30.638949] CPU: 3 PID: 27 Comm: migration/3 Not tainted
4.7.0-07963-g9714b26 #1
[   30.640059] task: c0001e29c600 task.stack: c0001e2a8000
[   30.640956] NIP: c003e5cc LR: c003e550 CTR:

[   30.642001] REGS: c0001e2ab8e0 TRAP: 0300   Not tainted
(4.7.0-07963-g9714b26)
[   30.643139] MSR: 800102803033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 
22004084  XER: 
[   30.644583] CFAR: c0009e98 DAR: c0014192eb17 DSISR: 4000 
SOFTE: 0
GPR00: c140a6b8 c0001e2abb60 c16dd300 0003
GPR04:  0004 c16e5920 0008
GPR08: 0004 c0014192eb17  0020
GPR12: c140a6c0 cc00 c00d3ea8 c0001e005680
GPR16:    
GPR20:  c0001e6b3a00  0001
GPR24: c0001ff85138 c0001ff85130 1eb6f000 0001
GPR28:  c17014e0  0018
[   30.653882] NIP [c003e5cc] __cpu_disable+0xcc/0x190
[   30.654713] LR [c003e550] __cpu_disable+0x50/0x190
[   30.655528] Call Trace:
[   30.655893] [c0001e2abb60] [c003e550] __cpu_disable+0x50/0x190 
(unreliable)
[   30.657280] [c0001e2abbb0] [c00aca0c] take_cpu_down+0x5c/0x100
[   30.658365] [c0001e2abc10] [c0163918] multi_cpu_stop+0x1a8/0x1e0
[   30.659617] [c0001e2abc60] [c0163cc0] 
cpu_stopper_thread+0xf0/0x1d0
[   30.660737] [c0001e2abd20] [c00d8d70] 
smpboot_thread_fn+0x290/0x2a0
[   30.661879] [c0001e2abd80] [c00d3fa8] kthread+0x108/0x130
[   30.662876] [c0001e2abe30] [c0009968] 
ret_from_kernel_thread+0x5c/0x74
[   30.664017] Instruction dump:
[   30.664477] 7bde1f24 38a0 787f1f24 3b61 39890008 7d204b78 7d05e214 
7d0b07b4
[   30.665642] 796b1f24 7d26582a 7d204a14 7d29f214 <7d4048a8> 7d4a3878 7d4049ad 
40c2fff4
[   30.666854] ---[ end trace 32643b7195717741 ]---

The reason for this is that in __cpu_disable(), when we try to update the
cpu_sibling_mask or cpu_core_mask of the sibling CPUs of the disabled
one, we don't check whether the current configuration actually employs
those sibling CPUs (hw threads). If a CPU is not employed by the
configuration, the percpu structures cpu_{sibling,core}_mask are not
allocated, so accessing those cpumasks results in problems like the
oops above.

This patch fixes the problem by adding an additional check in the
sibling CPU iteration code so that ids greater than or equal to
nr_cpu_ids are skipped.
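
To make the mechanics concrete, with the configuration above the old
loop does roughly the following (a sketch; the numbers are only for
illustration):

	/* cpu = 3 goes offline; threads_per_core = 8, nr_cpu_ids = 4 */
	base = cpu_first_thread_sibling(cpu);	/* base = 0 */
	for (i = 0; i < threads_per_core; i++) {
		/*
		 * i = 4..7 name CPU ids that were never set up, so their
		 * percpu cpumasks were never allocated.
		 */
		cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i));
	}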

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 25a39052bf6b..9c6f3fd58059 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -830,7 +830,7 @@ int __cpu_disable(void)
 
/* Update sibling maps */
base = cpu_first_thread_sibling(cpu);
-   for (i = 0; i < threads_per_core; i++) {
+   for (i = 0; i < threads_per_core && base + i < nr_cpu_ids; i++) {
cpumask_clear_cpu(cpu, cpu_sibling_mask(base + i));
cpumask_clear_cpu(base + i, cpu_sibling_mask(cpu));
cpumask_clear_cpu(cpu, cpu_core_mask(base + i));
-- 
2.9.0



Re: [PATCH 2/3] powerpc/spinlock: support vcpu preempted check

2016-06-27 Thread Boqun Feng
On Tue, Jun 28, 2016 at 11:39:18AM +0800, xinhui wrote:
[snip]
> > > +{
> > > + struct lppaca *lp = &lppaca_of(cpu);
> > > +
> > > + if (unlikely(!(lppaca_shared_proc(lp) ||
> > > + lppaca_dedicated_proc(lp))))
> > 
> > Do you want to detect whether we are running in a guest(ie. pseries
> > kernel) here? Then I wonder whether "machine_is(pseries)" works here.
> > 
> I tried as you said yesterday. but .h file has dependencies.
> As you said, if we add #ifdef PPC_PSERIES, this is not a big problem. only 
> powernv will be affected as they are built into same kernel img.
> 

I never said this is not a big problem ;-)

The problem here is that we only need to detect vcpu preemption in
a guest, and there could be several ways to detect whether the
kernel is running in a guest. It's worthwhile to try to find the best
one for this. Besides, it's really better to make sure we are
running out of options before you introduce something like
lppaca_dedicated_proc().

I have a feeling that yield_count is non-zero only if we are running in
a guest; if so, we can use this and save several loads. But surely we
need confirmation from the ppc maintainers.
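
If that turns out to be true, a sketch of the idea (completely untested,
and it assumes a zero yield_count really does mean "not running in a
guest") would be:

static inline bool arch_vcpu_is_preempted(int cpu)
{
	u32 yield_count = be32_to_cpu(lppaca_of(cpu).yield_count);

	/* assumption: yield_count stays zero on bare metal */
	if (!yield_count)
		return false;

	return !!(yield_count & 1);
}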

Regards,
Boqun



Re: [PATCH 2/3] powerpc/spinlock: support vcpu preempted check

2016-06-27 Thread Boqun Feng
Hi Xinhui,

On Mon, Jun 27, 2016 at 01:41:29PM -0400, Pan Xinhui wrote:
> This is to fix some holder preemption issues. Spinning at one
> vcpu which is preempted is meaningless.
> 
> Kernel need such interfaces, So lets support it.
> 
> We also should suooprt both the shared and dedicated mode.
> So add lppaca_dedicated_proc macro in lppaca.h
> 
> Suggested-by: Boqun Feng <boqun.f...@gmail.com>
> Signed-off-by: Pan Xinhui <xinhui@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/lppaca.h   |  6 ++
>  arch/powerpc/include/asm/spinlock.h | 15 +++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/lppaca.h 
> b/arch/powerpc/include/asm/lppaca.h
> index d0a2a2f..0a263d3 100644
> --- a/arch/powerpc/include/asm/lppaca.h
> +++ b/arch/powerpc/include/asm/lppaca.h
> @@ -111,12 +111,18 @@ extern struct lppaca lppaca[];
>   * we will have to transition to something better.
>   */
>  #define LPPACA_OLD_SHARED_PROC   2
> +#define LPPACA_OLD_DEDICATED_PROC  (1 << 6)
>  

I think you should describe a little bit about the magic number here,
i.e. what document/specification says this should work, and how this
works.

>  static inline bool lppaca_shared_proc(struct lppaca *l)
>  {
>   return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
>  }
>  
> +static inline bool lppaca_dedicated_proc(struct lppaca *l)
> +{
> + return !!(l->__old_status & LPPACA_OLD_DEDICATED_PROC);
> +}
> +
>  /*
>   * SLB shadow buffer structure as defined in the PAPR.  The save_area
>   * contains adjacent ESID and VSID pairs for each shadowed SLB.  The
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 523673d..ae938ee 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -52,6 +52,21 @@
>  #define SYNC_IO
>  #endif
>  
> +/* For fixing some spinning issues in a guest.
> + * kernel would check if vcpu is preempted during a spin loop.
> + * we support that.
> + */
> +#define arch_vcpu_is_preempted arch_vcpu_is_preempted
> +static inline bool arch_vcpu_is_preempted(int cpu)

This function should be guarded by #ifdef PPC_PSERIES .. #endif, right?
Because if the kernel is not compiled with guest support,
vcpu_is_preempted() should always be false, right?

> +{
> + struct lppaca *lp = &lppaca_of(cpu);
> +
> + if (unlikely(!(lppaca_shared_proc(lp) ||
> + lppaca_dedicated_proc(lp))))

Do you want to detect whether we are running in a guest (i.e. a pseries
kernel) here? Then I wonder whether "machine_is(pseries)" works here.

Regards,
Boqun

> + return false;
> + return !!(be32_to_cpu(lp->yield_count) & 1);
> +}
> +
>  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>  {
>   return lock.slock == 0;
> -- 
> 2.4.11
> 



Re: [PATCH 1/3] kernel/sched: introduce vcpu preempted check interface

2016-06-27 Thread Boqun Feng
On Mon, Jun 27, 2016 at 01:41:28PM -0400, Pan Xinhui wrote:
> this supports to fix lock holder preempted issue which run as a guest
> 
> for kernel users, we could use bool vcpu_is_preempted(int cpu) to detech
> if one vcpu is preempted or not.
> 
> The default implementation is a macrodefined by false. So compiler can
> wrap it out if arch dose not support such vcpu pteempted check.
> 
> archs can implement it by define arch_vcpu_is_preempted().
> 
> Signed-off-by: Pan Xinhui 
> ---
>  include/linux/sched.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..dc0a9c3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3293,6 +3293,15 @@ static inline void set_task_cpu(struct task_struct *p, 
> unsigned int cpu)
>  
>  #endif /* CONFIG_SMP */
>  
> +#ifdef arch_vcpu_is_preempted
> +static inline bool vcpu_is_preempted(int cpu)
> +{
> + return arch_vcpu_is_preempted(cpu);
> +}
> +#else
> +#define vcpu_is_preempted(cpu)   false
> +#endif
> +

I think you are missing Peter's comment here. We can

#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)  false
#endif

And different archs implement their own versions of vcpu_is_preempted(),
IOW, no need for an arch_vcpu_is_preempted().
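
For example, the shape being suggested is roughly the sketch below
(untested, only to illustrate the #define-override pattern; the powerpc
body is just a placeholder):

/* in an arch header, e.g. arch/powerpc/include/asm/spinlock.h: */
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
}

/* generic fallback, e.g. in include/linux/sched.h: */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif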

Regards,
Boqun

>  extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
>  extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>  
> -- 
> 2.4.11
> 



[PATCH v4] powerpc: spinlock: Fix spin_unlock_wait()

2016-06-09 Thread Boqun Feng
There is an ordering issue with spin_unlock_wait() on powerpc, because
the spin_lock primitive is an ACQUIRE and an ACQUIRE is only ordering
the load part of the operation with memory operations following it.
Therefore the following event sequence can happen:

CPU 1   CPU 2   CPU 3

==  ==
spin_unlock();
spin_lock():
  r1 = *lock; // r1 == 0;
o = object; o = READ_ONCE(object); // reordered here
object = NULL;
smp_mb();
spin_unlock_wait();
  *lock = 1;
smp_mb();
o->dead = true; < o = READ_ONCE(object); > // reordered upwards
if (o) // true
BUG_ON(o->dead); // true!!
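
Spelled out in C, the pattern in the diagram is roughly the following
(a sketch only; all names are invented and not taken from any particular
caller):

struct foo {
	bool dead;
};

static struct foo *object;
static DEFINE_SPINLOCK(lock);

/* CPU 1 */
static void unpublish(void)
{
	struct foo *o = object;

	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);	/* wait for any current lock holder */
	o->dead = true;			/* assumes no new user can see it */
}

/* CPU 2 */
static void use_object(void)
{
	struct foo *o;

	spin_lock(&lock);
	o = READ_ONCE(object);
	if (o)
		BUG_ON(o->dead);	/* can fire with the reordering above */
	spin_unlock(&lock);
}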

To fix this, we add a "nop" ll/sc loop in arch_spin_unlock_wait() on
ppc, the "nop" ll/sc loop reads the lock
value and writes it back atomically, in this way it will synchronize the
view of the lock on CPU1 with that on CPU2. Therefore in the scenario
above, either CPU2 will fail to get the lock at first or CPU1 will see
the lock acquired by CPU2, both cases will eliminate this bug. This is a
similar idea as what Will Deacon did for ARM64 in:

  d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent 
lockers")

Furthermore, if the "nop" ll/sc figures out the lock is locked, we
actually don't need to do the "nop" ll/sc trick again, we can just do a
normal load+check loop for the lock to be released, because in that
case, spin_unlock_wait() is called when someone is holding the lock, and
the store part of the "nop" ll/sc happens before the lock release of the
current lock holder:

"nop" ll/sc -> spin_unlock()

and the lock release happens before the next lock acquisition:

spin_unlock() -> spin_lock() 

which means the "nop" ll/sc happens before the next lock acquisition:

"nop" ll/sc -> spin_unlock() -> spin_lock() 

With a smp_mb() preceding spin_unlock_wait(), the store of object is
guaranteed to be observed by the next lock holder:

STORE -> smp_mb() -> "nop" ll/sc
-> spin_unlock() -> spin_lock() 

This patch therefore fixes the issue and also cleans the
arch_spin_unlock_wait() a little bit by removing superfluous memory
barriers in loops and consolidating the implementations for PPC32 and
PPC64 into one.

Suggested-by: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Reviewed-by: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
[mpe: Inline the "nop" ll/sc loop and set EH=0, munge change log]
Signed-off-by: Michael Ellerman <m...@ellerman.id.au>
---
v4 (boqun):
 - replace !arch_spin_value_unlocked() with lock->slock in the loop condition
   to avoid a bug caused by compiler optimization.

v3 (mpe):
 - Inline the ll/sc loop.
 - Change the EH on the LWARX to 0
 - Rewrite change log to cope with the fact we removed 
arch_spin_is_locked_sync()

v1-->v2:

 - Improve the commit log, suggested by Peter Zijlstra
 - Keep two smp_mb()s for safety; they could be deleted
   once all the users have been audited and fixed later.

 arch/powerpc/include/asm/spinlock.h | 38 +++--
 arch/powerpc/lib/locks.c| 16 
 2 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 523673d7583c..fa37fe93bc02 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -162,12 +162,38 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
lock->slock = 0;
 }
 
-#ifdef CONFIG_PPC64
-extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
-#else
-#define arch_spin_unlock_wait(lock) \
-   do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
-#endif
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+   arch_spinlock_t lock_val;
+
+   smp_mb();
+
+   /*
+* Atomically load and store back the lock value (unchanged). This
+* ensures that our observation of the lock value is ordered with
+* respect to other lock operations.
+*/
+   __asm__ __volatile__(
+"1:" PPC_LWARX(%0, 0, %2, 0) "\n"
+"  stwcx. %0, 0, %2\n"
+"  bne- 1b\n"
+   : "=&r" (lock_val), "+m" (*lock)
+   : "r" (lock)
+   : "cr0", "xer");
+
+   if (arch_spin_value_unlocked(lock_val))
+   goto out;
+
+   while (lock->slock) {
+   HMT_low();
+   if (SHARED_PROCESSOR)
+   __spi

Re: [PATCH v3] powerpc: spinlock: Fix spin_unlock_wait()

2016-06-09 Thread Boqun Feng
On Fri, Jun 10, 2016 at 01:25:03AM +0800, Boqun Feng wrote:
> On Thu, Jun 09, 2016 at 10:23:28PM +1000, Michael Ellerman wrote:
> > On Wed, 2016-06-08 at 15:59 +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 08, 2016 at 11:49:20PM +1000, Michael Ellerman wrote:
> > >
> > > > > Ok; what tree does this go in? I have this dependent series which I'd
> > > > > like to get sorted and merged somewhere.
> > > > 
> > > > Ah sorry, I didn't realise. I was going to put it in my next (which 
> > > > doesn't
> > > > exist yet but hopefully will early next week).
> > > > 
> > > > I'll make a topic branch with just that commit based on rc2 or rc3?
> > > 
> > > Works for me; thanks!
> >  
> > Unfortunately the patch isn't 100%.
> > 
> > It's causing some of my machines to lock up hard, which isn't surprising 
> > when
> > you look at the generated code for the non-atomic spin loop:
> > 
> >   c009af48: 7c 21 0b 78 mr  r1,r1   
> > # HMT_LOW
> >   c009af4c: 40 9e ff fc bne cr7,c009af48 
> > <.do_exit+0x6d8>
> > 
> 
> There is even no code checking for SHARED_PROCESSOR here, so I assume
> your config is !PPC_SPLPAR.
> 
> > Which is a spin loop waiting for a result in cr7, but with no comparison.
> > 
> > The problem seems to be that we did:
> > 
> > @@ -184,7 +184,7 @@ static inline void 
> > arch_spin_unlock_wait(arch_spinlock_t *lock)
> > if (arch_spin_value_unlocked(lock_val))
> > goto out;
> >  
> > -   while (lock->slock) {
> > +   while (!arch_spin_value_unlocked(*lock)) {
> > HMT_low();
> > if (SHARED_PROCESSOR)
> > __spin_yield(lock);
> > 
> 
> And as I also did an consolidation in this patch, we now share the same
> piece of arch_spin_unlock_wait(), so if !PPC_SPLPAR, the previous loop
> became:
> 
>   while (!arch_spin_value_unlocked(*lock)) {
>   HMT_low();
>   }
> 
> and given HMT_low() is not a compiler barrier. So the compiler may
> optimize out the loop..
> 
> > Which seems to be hiding the fact that lock->slock is volatile from the
> > compiler, even though arch_spin_value_unlocked() is inline. Not sure if 
> > that's
> > our bug or gcc's.
> > 
> 
> I think arch_spin_value_unlocked() is not volatile because
> arch_spin_value_unlocked() takes the value of the lock rather than the
> address of the lock as its parameter, which makes it a pure function.
> 
> To fix this we can add READ_ONCE() for the read of lock value like the
> following:
> 
>   while(!arch_spin_value_unlock(READ_ONCE(*lock))) {
>   HMT_low();
>   ...
> 
> Or you prefer to simply using lock->slock which is a volatile variable
> already?
> 
> Or maybe we can refactor the code a little like this:
> 
> static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> {
>arch_spinlock_t lock_val;
> 
>smp_mb();
> 
>/*
> * Atomically load and store back the lock value (unchanged).  This
> * ensures that our observation of the lock value is ordered with
> * respect to other lock operations.
> */
>__asm__ __volatile__(
> "1:" PPC_LWARX(%0, 0, %2, 0) "\n"
> "  stwcx. %0, 0, %2\n"
> "  bne- 1b\n"
>: "=&r" (lock_val), "+m" (*lock)
>: "r" (lock)
>: "cr0", "xer");
> 
>while (!arch_spin_value_unlocked(lock_val)) {
>HMT_low();
>if (SHARED_PROCESSOR)
>__spin_yield(lock);
> 
>lock_val = READ_ONCE(*lock);
>}
>HMT_medium();
> 
>smp_mb();
> }
> 

This version will generate the correct code for the loop if !PPC_SPLPAR:

c009fa70:   78 0b 21 7c mr  r1,r1
c009fa74:   ec 06 37 81 lwz r9,1772(r23)
c009fa78:   00 00 a9 2f cmpdi   cr7,r9,0
c009fa7c:   f4 ff 9e 40 bne cr7,c009fa70 
<do_exit+0xf0>
c009fa80:   78 13 42 7c mr  r2,r2

The reason I used arch_spin_value_unlocked() was to try to be consistent
with arch_spin_is_locked(), but most of our lock primitives use
->slock directly. So I don't see a strong reason for us to use
arch_spin_value_unlocked() here. That said, this version does save a few
lines of code and makes the logic a little clearer, I think.

Thoughts?

Regards,
Boqun



Re: [PATCH v3] powerpc: spinlock: Fix spin_unlock_wait()

2016-06-09 Thread Boqun Feng
On Thu, Jun 09, 2016 at 10:23:28PM +1000, Michael Ellerman wrote:
> On Wed, 2016-06-08 at 15:59 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 08, 2016 at 11:49:20PM +1000, Michael Ellerman wrote:
> >
> > > > Ok; what tree does this go in? I have this dependent series which I'd
> > > > like to get sorted and merged somewhere.
> > > 
> > > Ah sorry, I didn't realise. I was going to put it in my next (which 
> > > doesn't
> > > exist yet but hopefully will early next week).
> > > 
> > > I'll make a topic branch with just that commit based on rc2 or rc3?
> > 
> > Works for me; thanks!
>  
> Unfortunately the patch isn't 100%.
> 
> It's causing some of my machines to lock up hard, which isn't surprising when
> you look at the generated code for the non-atomic spin loop:
> 
>   c009af48:   7c 21 0b 78 mr  r1,r1   
> # HMT_LOW
>   c009af4c:   40 9e ff fc bne cr7,c009af48 
> <.do_exit+0x6d8>
> 

There is even no code checking for SHARED_PROCESSOR here, so I assume
your config is !PPC_SPLPAR.

> Which is a spin loop waiting for a result in cr7, but with no comparison.
> 
> The problem seems to be that we did:
> 
> @@ -184,7 +184,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
> *lock)
>   if (arch_spin_value_unlocked(lock_val))
>   goto out;
>  
> - while (lock->slock) {
> + while (!arch_spin_value_unlocked(*lock)) {
>   HMT_low();
>   if (SHARED_PROCESSOR)
>   __spin_yield(lock);
> 

And as I also did a consolidation in this patch, we now share the same
piece of arch_spin_unlock_wait(), so if !PPC_SPLPAR, the previous loop
became:

while (!arch_spin_value_unlocked(*lock)) {
HMT_low();
}

and given that HMT_low() is not a compiler barrier, the compiler may
optimize out the loop.

> Which seems to be hiding the fact that lock->slock is volatile from the
> compiler, even though arch_spin_value_unlocked() is inline. Not sure if that's
> our bug or gcc's.
> 

I think arch_spin_value_unlocked() is not volatile because
arch_spin_value_unlocked() takes the value of the lock rather than the
address of the lock as its parameter, which makes it a pure function.

To fix this we can add READ_ONCE() for the read of lock value like the
following:

while (!arch_spin_value_unlocked(READ_ONCE(*lock))) {
HMT_low();
...

Or would you prefer to simply use lock->slock, which is a volatile variable
already?

Or maybe we can refactor the code a little like this:

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
   arch_spinlock_t lock_val;

   smp_mb();

   /*
* Atomically load and store back the lock value (unchanged).  This
* ensures that our observation of the lock value is ordered with
* respect to other lock operations.
*/
   __asm__ __volatile__(
"1:" PPC_LWARX(%0, 0, %2, 0) "\n"
"  stwcx. %0, 0, %2\n"
"  bne- 1b\n"
   : "=&r" (lock_val), "+m" (*lock)
   : "r" (lock)
   : "cr0", "xer");

   while (!arch_spin_value_unlocked(lock_val)) {
   HMT_low();
   if (SHARED_PROCESSOR)
   __spin_yield(lock);

   lock_val = READ_ONCE(*lock);
   }
   HMT_medium();

   smp_mb();
}

> Will sleep on it.
> 

Bed time for me too, I will run more tests on the three proposals above
tomorrow and see how things are going.

Regards,
Boqun

> cheers
> 



Re: [v2] powerpc: spinlock: Fix spin_unlock_wait()

2016-06-05 Thread Boqun Feng
On Mon, Jun 06, 2016 at 02:52:05PM +1000, Michael Ellerman wrote:
> On Fri, 2016-03-06 at 03:49:48 UTC, Boqun Feng wrote:
> > There is an ordering issue with spin_unlock_wait() on powerpc, because
> > the spin_lock primitive is an ACQUIRE and an ACQUIRE is only ordering
> > the load part of the operation with memory operations following it.
> 
> ...
> > diff --git a/arch/powerpc/include/asm/spinlock.h 
> > b/arch/powerpc/include/asm/spinlock.h
> > index 523673d7583c..2ed893662866 100644
> > --- a/arch/powerpc/include/asm/spinlock.h
> > +++ b/arch/powerpc/include/asm/spinlock.h
> > @@ -162,12 +181,23 @@ static inline void arch_spin_unlock(arch_spinlock_t 
> > *lock)
> > lock->slock = 0;
> >  }
> >  
> > -#ifdef CONFIG_PPC64
> > -extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
> > -#else
> > -#define arch_spin_unlock_wait(lock) \
> > -   do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
> > -#endif
> > +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> > +{
> > +   smp_mb();
> > +
> > +   if (!arch_spin_is_locked_sync(lock))
> > +   goto out;
> > +
> > +   while (!arch_spin_value_unlocked(*lock)) {
> > +   HMT_low();
> > +   if (SHARED_PROCESSOR)
> > +   __spin_yield(lock);
> > +   }
> > +   HMT_medium();
> > +
> > +out:
> > +   smp_mb();
> > +}
> 
> I think this would actually be easier to follow if it was all just in one 
> routine:
> 
> static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> {
>   arch_spinlock_t lock_val;
> 
>   smp_mb();
> 
>   /*
>* Atomically load and store back the lock value (unchanged). This
>* ensures that our observation of the lock value is ordered with
>* respect to other lock operations.
>*/
>   __asm__ __volatile__(
> "1:   " PPC_LWARX(%0, 0, %2, 1) "\n"
> " stwcx. %0, 0, %2\n"
> " bne- 1b\n"
>   : "=&r" (lock_val), "+m" (*lock)
>   : "r" (lock)
>   : "cr0", "xer");
> 
>   if (arch_spin_value_unlocked(lock_val))
>   goto out;
> 
>   while (!arch_spin_value_unlocked(*lock)) {
>   HMT_low();
>   if (SHARED_PROCESSOR)
>   __spin_yield(lock);
>   }
>   HMT_medium();
> 
> out:
>   smp_mb();
> }
> 
> 
> Thoughts?
> 

Makes sense. I admit that I sort of overdesigned this by introducing
arch_spin_is_locked_sync().

This version is better, thank you!

Regards,
Boqun

> cheers



[PATCH v2] powerpc: spinlock: Fix spin_unlock_wait()

2016-06-02 Thread Boqun Feng
There is an ordering issue with spin_unlock_wait() on powerpc, because
the spin_lock primitive is an ACQUIRE and an ACQUIRE is only ordering
the load part of the operation with memory operations following it.
Therefore the following event sequence can happen:

CPU 1   CPU 2   CPU 3
==  ==
spin_unlock();
spin_lock():
  r1 = *lock; // r1 == 0;
o = object; o = READ_ONCE(object); // reordered here
object = NULL;
smp_mb();
spin_unlock_wait();
  *lock = 1;
smp_mb();
o->dead = true; < o = READ_ONCE(object); > // reordered upwards
if (o) // true
BUG_ON(o->dead); // true!!

To fix this, we add a "nop" ll/sc loop in arch_spin_unlock_wait() on
ppc (arch_spin_is_locked_sync()), the "nop" ll/sc loop reads the lock
value and writes it back atomically, in this way it will synchronize the
view of the lock on CPU1 with that on CPU2. Therefore in the scenario
above, either CPU2 will fail to get the lock at first or CPU1 will see
the lock acquired by CPU2, both cases will eliminate this bug. This is a
similar idea as what Will Deacon did for ARM64 in:

  d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent 
lockers")

Furthermore, if arch_spin_is_locked_sync() figures out the lock is
locked, we actually don't need to do the "nop" ll/sc trick again, we can
just do a normal load+check loop for the lock to be released, because in
that case, spin_unlock_wait() is called when someone is holding the
lock, and the store part of arch_spin_is_locked_sync() happens before
the lock release of the current lock holder:

arch_spin_is_locked_sync() -> spin_unlock()

and the lock release happens before the next lock acquisition:

spin_unlock() -> spin_lock() 

which means arch_spin_is_locked_sync() happens before the next lock
acquisition:

arch_spin_is_locked_sync() -> spin_unlock() -> spin_lock() 

With a smp_mb() preceding spin_unlock_wait(), the store of object is
guaranteed to be observed by the next lock holder:

STORE -> smp_mb() -> arch_spin_is_locked_sync()
-> spin_unlock() -> spin_lock() 

This patch therefore fixes the issue and also cleans the
arch_spin_unlock_wait() a little bit by removing superfluous memory
barriers in loops and consolidating the implementations for PPC32 and
PPC64 into one.

Suggested-by: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Reviewed-by: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
---
v1-->v2:

*   Improve the commit log, suggested by Peter Zijlstra

*   Keep two smp_mb()s for safety; they could be deleted
    once all the users have been audited and fixed later.


 arch/powerpc/include/asm/spinlock.h | 42 +++--
 arch/powerpc/lib/locks.c| 16 --
 2 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 523673d7583c..2ed893662866 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -64,6 +64,25 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 }
 
 /*
+ * Use a ll/sc loop to read the lock value, the STORE part of this operation is
+ * used for making later lock operation observe it.
+ */
+static inline bool arch_spin_is_locked_sync(arch_spinlock_t *lock)
+{
+   arch_spinlock_t tmp;
+
+   __asm__ __volatile__(
+"1:" PPC_LWARX(%0, 0, %2, 1) "\n"
+"  stwcx. %0, 0, %2\n"
+"  bne- 1b\n"
+   : "=&r" (tmp), "+m" (*lock)
+   : "r" (lock)
+   : "cr0", "xer");
+
+   return !arch_spin_value_unlocked(tmp);
+}
+
+/*
  * This returns the old value in the lock, so we succeeded
  * in getting the lock if the return value is 0.
  */
@@ -162,12 +181,23 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
lock->slock = 0;
 }
 
-#ifdef CONFIG_PPC64
-extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
-#else
-#define arch_spin_unlock_wait(lock) \
-   do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
-#endif
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+   smp_mb();
+
+   if (!arch_spin_is_locked_sync(lock))
+   goto out;
+
+   while (!arch_spin_value_unlocked(*lock)) {
+   HMT_low();
+   if (SHARED_PROCESSOR)
+   __spin_yield(lock);
+   }
+   HMT_medium();
+
+out:
+   smp_mb();
+}
 
 /*
  * Read-write spinlocks

[PATCH 4/4] rcutorture: Don't specify the cpu type of QEMU on PPC

2016-05-18 Thread Boqun Feng
Do not restrict the cpu type to POWER7 for QEMU as we have POWER8 now.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 tools/testing/selftests/rcutorture/bin/functions.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh 
b/tools/testing/selftests/rcutorture/bin/functions.sh
index 77fdb46cc65a..56ac202859eb 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -174,7 +174,7 @@ identify_qemu_args () {
echo -soundhw pcspk
;;
qemu-system-ppc64)
-   echo -enable-kvm -M pseries -cpu POWER7 -nodefaults
+   echo -enable-kvm -M pseries -nodefaults
echo -device spapr-vscsi
if test -n "$TORTURE_QEMU_INTERACTIVE" -a -n "$TORTURE_QEMU_MAC"
then
-- 
2.8.0


[PATCH 3/4] rcutorture: Make -soundhw a x86 specific option

2016-05-18 Thread Boqun Feng
The option "-soundhw pcspk" gives me an error on PPC as follows:

qemu-system-ppc64: ISA bus not available for pcspk

, which means this option doesn't work on ppc by default. So simply make
this an x86-specific option via identify_qemu_args().

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 tools/testing/selftests/rcutorture/bin/functions.sh  | 1 +
 tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh 
b/tools/testing/selftests/rcutorture/bin/functions.sh
index 616180153208..77fdb46cc65a 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -171,6 +171,7 @@ identify_qemu_append () {
 identify_qemu_args () {
case "$1" in
qemu-system-x86_64|qemu-system-i386)
+   echo -soundhw pcspk
;;
qemu-system-ppc64)
echo -enable-kvm -M pseries -cpu POWER7 -nodefaults
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh 
b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index 46cb6bc098bd..d7a3de26dd11 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -8,9 +8,9 @@
 #
 # Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args
 #
-# qemu-args defaults to "-enable-kvm -soundhw pcspk -nographic", along with
-#  arguments specifying the number of CPUs and other
-#  options generated from the underlying CPU architecture.
+# qemu-args defaults to "-enable-kvm -nographic", along with arguments
+#  specifying the number of CPUs and other options
+#  generated from the underlying CPU architecture.
 # boot_args defaults to value returned by the per_version_boot_params
 #  shell function.
 #
@@ -148,7 +148,7 @@ then
 fi
 
 # Generate -smp qemu argument.
-qemu_args="-enable-kvm -soundhw pcspk -nographic $qemu_args"
+qemu_args="-enable-kvm -nographic $qemu_args"
 cpu_count=`configNR_CPUS.sh $config_template`
 cpu_count=`configfrag_boot_cpus "$boot_args" "$config_template" "$cpu_count"`
 vcpus=`identify_qemu_vcpus`
-- 
2.8.0


[PATCH 2/4] rcutorture: Use vmlinux as the fallback kernel image

2016-05-18 Thread Boqun Feng
vmlinux is available for all architectures and is suitable for running
a KVM guest under QEMU; besides, we used to copy the vmlinux to $resdir
anyway. Therefore it makes sense to use it as the fallback kernel image
for rcutorture KVM tests.

This patch makes identify_boot_image() return vmlinux if
${TORTURE_BOOT_IMAGE} is not set on non-x86 architectures, and also fixes
several places that hard-code "bzImage" as $KERNEL.

This also fixes the problem that PPC doesn't have a bzImage file among
its build results.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 tools/testing/selftests/rcutorture/bin/functions.sh  | 10 --
 tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh |  5 +++--
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh 
b/tools/testing/selftests/rcutorture/bin/functions.sh
index b325470c01b3..616180153208 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -99,8 +99,9 @@ configfrag_hotplug_cpu () {
 # identify_boot_image qemu-cmd
 #
 # Returns the relative path to the kernel build image.  This will be
-# arch/<arch>/boot/bzImage unless overridden with the TORTURE_BOOT_IMAGE
-# environment variable.
+# arch/<arch>/boot/bzImage or vmlinux if bzImage is not a target for the
+# architecture, unless overridden with the TORTURE_BOOT_IMAGE environment
+# variable.
 identify_boot_image () {
if test -n "$TORTURE_BOOT_IMAGE"
then
@@ -110,11 +111,8 @@ identify_boot_image () {
qemu-system-x86_64|qemu-system-i386)
echo arch/x86/boot/bzImage
;;
-   qemu-system-ppc64)
-   echo arch/powerpc/boot/bzImage
-   ;;
*)
-   echo ""
+   echo vmlinux
;;
esac
fi
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh 
b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index 4109f306d855..46cb6bc098bd 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -96,7 +96,8 @@ if test "$base_resdir" != "$resdir" -a -f 
$base_resdir/bzImage -a -f $base_resdi
 then
# Rerunning previous test, so use that test's kernel.
QEMU="`identify_qemu $base_resdir/vmlinux`"
-   KERNEL=$base_resdir/bzImage
+   BOOT_IMAGE="`identify_boot_image $QEMU`"
+   KERNEL=$base_resdir/${BOOT_IMAGE##*/} # use the last component of 
${BOOT_IMAGE}
ln -s $base_resdir/Make*.out $resdir  # for kvm-recheck.sh
ln -s $base_resdir/.config $resdir  # for kvm-recheck.sh
 elif kvm-build.sh $config_template $builddir $T
@@ -110,7 +111,7 @@ then
if test -n "$BOOT_IMAGE"
then
cp $builddir/$BOOT_IMAGE $resdir
-   KERNEL=$resdir/bzImage
+   KERNEL=$resdir/${BOOT_IMAGE##*/}
else
echo No identifiable boot image, not running KVM, see $resdir.
echo Do the torture scripts know about your architecture?
-- 
2.8.0


[PATCH 1/4] rcutorture/doc: Add a new way to create initrd using dracut

2016-05-18 Thread Boqun Feng
Using dracut is another way to get an initramfs for kvm-based rcu
torture tests, which is more flexible than using the host's initramfs
image, because modules and binaries may be added or removed via dracut
command options. So add an example to the document, in case there are
situations where the host's initramfs can't be used.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 tools/testing/selftests/rcutorture/doc/initrd.txt | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/rcutorture/doc/initrd.txt 
b/tools/testing/selftests/rcutorture/doc/initrd.txt
index 4170e714f044..833f826d6ec2 100644
--- a/tools/testing/selftests/rcutorture/doc/initrd.txt
+++ b/tools/testing/selftests/rcutorture/doc/initrd.txt
@@ -13,6 +13,22 @@ cd initrd
 cpio -id < /tmp/initrd.img.zcat
 
 
+Another way to create an initramfs image is using "dracut"[1], which is
+available on many distros, however the initramfs dracut generates is a cpio
+archive with another cpio archive in it, so an extra step is needed to create
+the initrd directory hierarchy.
+
+Here are the commands to create a initrd directory for rcutorture using
+dracut:
+
+
+dracut --no-hostonly --no-hostonly-cmdline --module "base bash shutdown" 
/tmp/initramfs.img
+cd tools/testing/selftests/rcutorture
+mkdir initrd
+cd initrd
+/usr/lib/dracut/skipcpio /tmp/initramfs.img | zcat | cpio -id < 
/tmp/initramfs.img
+
+
 Interestingly enough, if you are running rcutorture, you don't really
 need userspace in many cases.  Running without userspace has the
 advantage of allowing you to test your kernel independently of the
@@ -89,3 +105,9 @@ while :
 do
sleep 10
 done
+
+
+References:
+[1]: https://dracut.wiki.kernel.org/index.php/Main_Page
+[2]: 
http://blog.elastocloud.org/2015/06/rapid-linux-kernel-devtest-with-qemu.html
+[3]: https://www.centos.org/forums/viewtopic.php?t=51621
-- 
2.8.0


[PATCH 0/4] rcutorture: Several fixes to run selftest scripts on PPC

2016-05-18 Thread Boqun Feng
I spent some time making tools/testing/selftests/rcutorture run on PPC;
here are some documentation updates and fixes made while I was trying.

The scripts are able to run and get results on PPC, however please
note that there are some stalls and even build errors that the tests
currently find.

As I'm certainly not an expert in qemu or bash programming, there
may be something I am missing in those patches. So tests and comments
are welcome ;-)

Regards,
Boqun

Boqun Feng (4):
  rcutorture/doc: Add a new way to create initrd using dracut
  rcutorture: Use vmlinux as the fallback kernel image
  rcutorture: Make -soundhw a x86 specific option
  rcutorture: Don't specify the cpu type of QEMU on PPC

 .../testing/selftests/rcutorture/bin/functions.sh  | 13 ++---
 .../selftests/rcutorture/bin/kvm-test-1-run.sh | 13 +++--
 tools/testing/selftests/rcutorture/doc/initrd.txt  | 22 ++
 3 files changed, 35 insertions(+), 13 deletions(-)

-- 
2.8.0


Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

2016-04-27 Thread Boqun Feng
On Wed, Apr 27, 2016 at 10:50:34PM +0800, Boqun Feng wrote:
> 
> Sorry, my bad, we can't implement cmpxchg like this.. please ignore
> this, I should really go to bed soon...
> 
> But still, we can save the "tmp" for xchg() I think.
> 

No.. we can't. Sorry for all the noise.

This patch looks good to me.

FWIW, you can add

Acked-by: Boqun Feng <boqun.f...@gmail.com>

Regards,
Boqun



Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

2016-04-27 Thread Boqun Feng
On Wed, Apr 27, 2016 at 09:58:17PM +0800, Boqun Feng wrote:
> On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
> > From: Pan Xinhui <xinhui@linux.vnet.ibm.com>
> > 
> > Implement xchg{u8,u16}{local,relaxed}, and
> > cmpxchg{u8,u16}{,local,acquire,relaxed}.
> > 
> > It works on all ppc.
> > 
> > remove volatile of first parameter in __cmpxchg_local and __cmpxchg
> > 
> > Suggested-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> > Signed-off-by: Pan Xinhui <xinhui@linux.vnet.ibm.com>
> > ---
> > change from v3:
> > rewrite in asm for the LL/SC.
> > remove volatile in __cmpxchg_local and __cmpxchg.
> > change from v2:
> > in the do{}while(), we save one load and use corresponding cmpxchg 
> > suffix.
> > Also add corresponding __cmpxchg_u32 function declaration in the 
> > __XCHG_GEN 
> > change from V1:
> > rework totally.
> > ---
> >  arch/powerpc/include/asm/cmpxchg.h | 109 
> > -
> >  1 file changed, 106 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> > b/arch/powerpc/include/asm/cmpxchg.h
> > index 44efe73..8a3735f 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -7,6 +7,71 @@
> >  #include 
> >  #include 
> >  
> > +#ifdef __BIG_ENDIAN
> > +#define BITOFF_CAL(size, off)  ((sizeof(u32) - size - off) * 
> > BITS_PER_BYTE)
> > +#else
> > +#define BITOFF_CAL(size, off)  (off * BITS_PER_BYTE)
> > +#endif
> > +
> > +#define XCHG_GEN(type, sfx, cl)\
> > +static inline u32 __xchg_##type##sfx(void *p, u32 val) \
> > +{  \
> > +   unsigned int prev, prev_mask, tmp, bitoff, off; \
> > +   \
> > +   off = (unsigned long)p % sizeof(u32);   \
> > +   bitoff = BITOFF_CAL(sizeof(type), off); \
> > +   p -= off;   \
> > +   val <<= bitoff; \
> > +   prev_mask = (u32)(type)-1 << bitoff;\
> > +   \
> > +   __asm__ __volatile__(   \
> > +"1:lwarx   %0,0,%3\n"  \
> > +"  andc%1,%0,%5\n" \
> > +"  or  %1,%1,%4\n" \
> > +   PPC405_ERR77(0,%3)  \
> > +"  stwcx.  %1,0,%3\n"  \
> > +"  bne-1b\n"   \
> > +   : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p)\
> 
> I think we can save the "tmp" here by:
> 
>   __asm__ volatile__(
> "1:   lwarx   %0,0,%2\n"
> " andc%0,%0,%4\n"
> " or  %0,%0,%3\n"
>   PPC405_ERR77(0,%2)
> " stwcx.  %0,0,%2\n"
> " bne-1b\n"
>   : "=&r" (prev), "+m" (*(u32*)p)
>   : "r" (p), "r" (val), "r" (prev_mask)
>   : "cc", cl);
> 
> right?
> 
> > +   : "r" (p), "r" (val), "r" (prev_mask)   \
> > +   : "cc", cl);\
> > +   \
> > +   return prev >> bitoff;  \
> > +}
> > +
> > +#define CMPXCHG_GEN(type, sfx, br, br2, cl)\
> > +static inline  \
> > +u32 __cmpxchg_##type##sfx(void *p, u32 old, u32 new)   \
> > +{  \
> > +   unsigned int prev, prev_mask, tmp, bitoff, off; \
> > +   \
> > +   off = (unsigned long)p % sizeof(u32);   \
> > +   bitoff = BITOFF_CAL(sizeof(type), off); \
> > +   p -= off;   \
> > +   old <<= bitoff; \
> > +   new <<= bitoff; \
> &g

Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

2016-04-27 Thread Boqun Feng
On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
> From: Pan Xinhui 
> 
> Implement xchg{u8,u16}{local,relaxed}, and
> cmpxchg{u8,u16}{,local,acquire,relaxed}.
> 
> It works on all ppc.
> 
> remove volatile of first parameter in __cmpxchg_local and __cmpxchg
> 
> Suggested-by: Peter Zijlstra (Intel) 
> Signed-off-by: Pan Xinhui 
> ---
> change from v3:
>   rewrite in asm for the LL/SC.
>   remove volatile in __cmpxchg_local and __cmpxchg.
> change from v2:
>   in the do{}while(), we save one load and use corresponding cmpxchg 
> suffix.
>   Also add corresponding __cmpxchg_u32 function declaration in the 
> __XCHG_GEN 
> change from V1:
>   rework totally.
> ---
>  arch/powerpc/include/asm/cmpxchg.h | 109 
> -
>  1 file changed, 106 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index 44efe73..8a3735f 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -7,6 +7,71 @@
>  #include 
>  #include 
>  
> +#ifdef __BIG_ENDIAN
> +#define BITOFF_CAL(size, off)((sizeof(u32) - size - off) * 
> BITS_PER_BYTE)
> +#else
> +#define BITOFF_CAL(size, off)(off * BITS_PER_BYTE)
> +#endif
> +
> +#define XCHG_GEN(type, sfx, cl)  \
> +static inline u32 __xchg_##type##sfx(void *p, u32 val)   \
> +{\
> + unsigned int prev, prev_mask, tmp, bitoff, off; \
> + \
> + off = (unsigned long)p % sizeof(u32);   \
> + bitoff = BITOFF_CAL(sizeof(type), off); \
> + p -= off;   \
> + val <<= bitoff; \
> + prev_mask = (u32)(type)-1 << bitoff;\
> + \
> + __asm__ __volatile__(   \
> +"1:  lwarx   %0,0,%3\n"  \
> +"andc%1,%0,%5\n" \
> +"or  %1,%1,%4\n" \
> + PPC405_ERR77(0,%3)  \
> +"stwcx.  %1,0,%3\n"  \
> +"bne-1b\n"   \
> + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p)\

I think we can save the "tmp" here by:

	__asm__ __volatile__(
"1: lwarx   %0,0,%2\n"
"   andc%0,%0,%4\n"
"   or  %0,%0,%3\n"
PPC405_ERR77(0,%2)
"   stwcx.  %0,0,%2\n"
"   bne-1b\n"
	: "=&r" (prev), "+m" (*(u32*)p)
: "r" (p), "r" (val), "r" (prev_mask)
: "cc", cl);

right?

> + : "r" (p), "r" (val), "r" (prev_mask)   \
> + : "cc", cl);\
> + \
> + return prev >> bitoff;  \
> +}
> +
> +#define CMPXCHG_GEN(type, sfx, br, br2, cl)  \
> +static inline\
> +u32 __cmpxchg_##type##sfx(void *p, u32 old, u32 new) \
> +{\
> + unsigned int prev, prev_mask, tmp, bitoff, off; \
> + \
> + off = (unsigned long)p % sizeof(u32);   \
> + bitoff = BITOFF_CAL(sizeof(type), off); \
> + p -= off;   \
> + old <<= bitoff; \
> + new <<= bitoff; \
> + prev_mask = (u32)(type)-1 << bitoff;\
> + \
> + __asm__ __volatile__(   \
> + br  \
> +"1:  lwarx   %0,0,%3\n"  \
> +"and %1,%0,%6\n" \
> +"cmpw0,%1,%4\n"  \
> +"bne-2f\n"   \
> +"andc%1,%0,%6\n" \
> +"or  %1,%1,%5\n" \
> + PPC405_ERR77(0,%3)  \
> +"stwcx.  %1,0,%3\n"  \
> +"bne-1b\n"   \
> + br2 \
> + "\n"\
> +"2:"   

Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

2016-04-21 Thread Boqun Feng
On Fri, Apr 22, 2016 at 09:59:22AM +0800, Pan Xinhui wrote:
> On 2016年04月21日 23:52, Boqun Feng wrote:
> > On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> >> On 2016年04月20日 22:24, Peter Zijlstra wrote:
> >>> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
> >>>
> >>>> +#define __XCHG_GEN(cmp, type, sfx, skip, v) 
> >>>> \
> >>>> +static __always_inline unsigned long
> >>>> \
> >>>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old,
> >>>> \
> >>>> + unsigned long new);
> >>>> \
> >>>> +static __always_inline u32  
> >>>> \
> >>>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new)
> >>>> \
> >>>> +{   
> >>>> \
> >>>> +int size = sizeof (type);   
> >>>> \
> >>>> +int off = (unsigned long)ptr % sizeof(u32); 
> >>>> \
> >>>> +volatile u32 *p = ptr - off;
> >>>> \
> >>>> +int bitoff = BITOFF_CAL(size, off); 
> >>>> \
> >>>> +u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff;
> >>>> \
> >>>> +u32 oldv, newv, tmp;
> >>>> \
> >>>> +u32 ret;
> >>>> \
> >>>> +oldv = READ_ONCE(*p);   
> >>>> \
> >>>> +do {
> >>>> \
> >>>> +ret = (oldv & bitmask) >> bitoff;   
> >>>> \
> >>>> +if (skip && ret != old) 
> >>>> \
> >>>> +break;  
> >>>> \
> >>>> +newv = (oldv & ~bitmask) | (new << bitoff); 
> >>>> \
> >>>> +tmp = oldv; 
> >>>> \
> >>>> +oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv);   
> >>>> \
> >>>> +} while (tmp != oldv);  
> >>>> \
> >>>> +return ret; 
> >>>> \
> >>>> +}
> >>>
> >>> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
> >>>
> >>> Why did you choose to write it entirely in C?
> >>>
> >> yes, you are right. more load/store will be done in C code.
> >> However such xchg_u8/u16 is just used by qspinlock now. and I did not see 
> >> any performance regression.
> >> So just wrote in C, for simple. :)
> >>
> >> Of course I have done xchg tests.
> >> we run code just like xchg((u8*), j++); in several threads.
> >> and the result is,
> >> [  768.374264] use time[1550072]ns in xchg_u8_asm
> > 
> > How was xchg_u8_asm() implemented, using lbarx or using a 32bit ll/sc
> > loop with shifting and masking in it?
> > 
> yes, using 32bit ll/sc loops.
> 
> looks like:
> __asm__ __volatile__(
> "1: lwarx   %0,0,%3\n"
> "   and %1,%0,%5\n"
> "   or %1,%1,%4\n"
>PPC405_ERR77(0,%2)
> "   stwcx.  %1,0,%3\n"
> "   bne-1b"
> : "=&r" (_oldv), "=&r" (tmp), "+m" (*(volatile unsigned int *)_p)
> : "r" (_p), "r" (_newv), "r" (_oldv_mask)
> : "cc", "memory");
> 

Good, so this works for all ppc ISAs too.

Given the performance benefit (maybe caused by the reason Peter
mentioned), I think we should use this as the implementation of u8/u16
{cmp}xchg for now. For POWER7 and later, we can always switch to the
lbarx/lharx version if an observable performance benefit can be achieved.
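
Just to make the comparison concrete, an lbarx/stbcx. based byte xchg
would look roughly like the sketch below (untested, relaxed ordering
only, names invented; a real version would also need the usual
acquire/release barriers and an ISA/CPU feature check):

static inline u8 xchg_u8_lbarx(u8 *p, u8 val)
{
	u8 prev;

	__asm__ __volatile__(
"1:	lbarx	%0,0,%2\n"
"	stbcx.	%1,0,%2\n"
"	bne-	1b"
	: "=&r" (prev)
	: "r" (val), "r" (p)
	: "cc", "memory");

	return prev;
}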

But the choice is left to you. After all, as you said, qspinlock is the
only user ;-)

Regards,
Boqun

> 
> > Regards,
> > Boqun
> > 
> >> [  768.377102] use time[2826802]ns in xchg_u8_c
> >>
> >> I think this is because there is one more load in C.
> >> If possible, we can move such code in asm-generic/.
> >>
> >> thanks
> >> xinhui
> >>
> 



Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

2016-04-21 Thread Boqun Feng
On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> On 2016年04月20日 22:24, Peter Zijlstra wrote:
> > On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
> > 
> >> +#define __XCHG_GEN(cmp, type, sfx, skip, v)   
> >> \
> >> +static __always_inline unsigned long  
> >> \
> >> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old,  \
> >> +   unsigned long new);\
> >> +static __always_inline u32
> >> \
> >> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new)  \
> >> +{ \
> >> +  int size = sizeof (type);   \
> >> +  int off = (unsigned long)ptr % sizeof(u32); \
> >> +  volatile u32 *p = ptr - off;\
> >> +  int bitoff = BITOFF_CAL(size, off); \
> >> +  u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff;\
> >> +  u32 oldv, newv, tmp;\
> >> +  u32 ret;\
> >> +  oldv = READ_ONCE(*p);   \
> >> +  do {\
> >> +  ret = (oldv & bitmask) >> bitoff;   \
> >> +  if (skip && ret != old) \
> >> +  break;  \
> >> +  newv = (oldv & ~bitmask) | (new << bitoff); \
> >> +  tmp = oldv; \
> >> +  oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv);   \
> >> +  } while (tmp != oldv);  \
> >> +  return ret; \
> >> +}
> > 
> > So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
> > 
> > Why did you choose to write it entirely in C?
> > 
> yes, you are right. more load/store will be done in C code.
> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any 
> performance regression.
> So just wrote in C, for simple. :)
> 
> Of course I have done xchg tests.
> we run code just like xchg((u8*), j++); in several threads.
> and the result is,
> [  768.374264] use time[1550072]ns in xchg_u8_asm

How was xchg_u8_asm() implemented, using lbarx or using a 32bit ll/sc
loop with shifting and masking in it?

Regards,
Boqun

> [  768.377102] use time[2826802]ns in xchg_u8_c
> 
> I think this is because there is one more load in C.
> If possible, we can move such code in asm-generic/.
> 
> thanks
> xinhui
> 



[PATCH powerpc/next RESEND] powerpc: spinlock: Fix spin_unlock_wait()

2016-04-19 Thread Boqun Feng
There is an ordering issue with spin_unlock_wait() on powerpc, because
the spin_lock primitive is an ACQUIRE and an ACQUIRE is only ordering
the load part of the operation with memory operations following it.
Therefore the following event sequence can happen:

CPU 1                       CPU 2                               CPU 3
=====================       =============================       ===================
                                                                spin_unlock(&lock);
                            spin_lock(&lock):
                              r1 = *lock; // r1 == 0;
o = object;                 o = READ_ONCE(object); // reordered here
object = NULL;
smp_mb();
spin_unlock_wait(&lock);
                              *lock = 1;
smp_mb();
o->dead = true;             < o = READ_ONCE(object); > // reordered upwards
                            if (o) // true
                                    BUG_ON(o->dead); // true!!

To fix this, we add a "nop" ll/sc loop in arch_spin_unlock_wait() on
ppc (arch_spin_is_locked_sync()), the "nop" ll/sc loop reads the lock
value and writes it back atomically, in this way it will synchronize the
view of the lock on CPU1 with that on CPU2. Therefore in the scenario
above, either CPU2 will fail to get the lock at first or CPU1 will see
the lock acquired by CPU2, both cases will eliminate this bug. This is a
similar idea as what Will Deacon did for ARM64 in:

"arm64: spinlock: serialise spin_unlock_wait against concurrent lockers"

Furthermore, if arch_spin_is_locked_sync() figures out the lock is
locked, we actually don't need to do the "nop" ll/sc trick again, we can
just do a normal load+check loop for the lock to be released, because in
that case, spin_unlock_wait() is called when someone is holding the
lock, and the store part of arch_spin_is_locked_sync() happens before
the unlocking of the current lock holder, which means
arch_spin_is_locked_sync() happens before the next lock acquisition.
With the smp_mb() preceding spin_unlock_wait(), the store of object is
guaranteed to be observed by the next lock holder.
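
(For illustration, the kind of caller this reasoning is about looks
roughly like the sketch below; the names are made up and this is not
part of the patch.)

/* Illustrative usage sketch only; not part of this patch. */
static DEFINE_SPINLOCK(lock);
static struct obj *object;

/* Updater: retire the object once no lock holder can still see it. */
static void retire_object(void)
{
	struct obj *o = object;

	object = NULL;
	smp_mb();			/* order the NULLing before reading the lock */
	spin_unlock_wait(&lock);	/* wait for the current lock holder, if any */
	smp_mb();			/* caller-supplied barrier, see the note below */
	o->dead = true;
}

/* Lock holder: if it sees object != NULL, it must not see o->dead == true. */
static void use_object(void)
{
	struct obj *o;

	spin_lock(&lock);
	o = READ_ONCE(object);
	if (o)
		BUG_ON(o->dead);
	spin_unlock(&lock);
}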

Please note spin_unlock_wait() on powerpc is still not an ACQUIRE after
this fix; the callers should add the necessary barriers if they want to
promote it, as all the current callers do.

This patch therefore fixes the issue and also cleans up
arch_spin_unlock_wait() a little bit by removing superfluous memory
barriers in loops and consolidating the implementations for PPC32 and
PPC64 into one.

Suggested-by: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Reviewed-by: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/spinlock.h | 48 -
 arch/powerpc/lib/locks.c| 16 -
 2 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 523673d7583c..0a517c1a751e 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -64,6 +64,25 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 }
 
 /*
+ * Use a ll/sc loop to read the lock value, the STORE part of this operation is
+ * used for making later lock operation observe it.
+ */
+static inline bool arch_spin_is_locked_sync(arch_spinlock_t *lock)
+{
+   arch_spinlock_t tmp;
+
+   __asm__ __volatile__(
+"1:" PPC_LWARX(%0, 0, %2, 1) "\n"
+"  stwcx. %0, 0, %2\n"
+"  bne- 1b\n"
+   : "=" (tmp), "+m" (*lock)
+   : "r" (lock)
+   : "cr0", "xer");
+
+   return !arch_spin_value_unlocked(tmp);
+}
+
+/*
  * This returns the old value in the lock, so we succeeded
  * in getting the lock if the return value is 0.
  */
@@ -162,12 +181,29 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
lock->slock = 0;
 }
 
-#ifdef CONFIG_PPC64
-extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
-#else
-#define arch_spin_unlock_wait(lock) \
-   do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
-#endif
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+   /*
+* Make sure previous loads and stores are observed by other cpu, this
+* pairs with the ACQUIRE barrier in lock.
+*/
+   smp_mb();
+
+   if (!arch_spin_is_locked_sync(lock))
+   return;
+
+   while (!arch_spin_value_unlocked(*lock)) {
+   HMT_low();
+   if (SHARED_PROCESSOR)
+   __spin_yield(lock);
+   }
+   HMT_medium();
+
+   /*
+* No barrier here, caller either relies on the control dependency or
+* should add a necessary barrier afterwards.
+*/
+}
 
 /*
  * Read-write spinlocks, allowing multiple readers
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index f7deebdf3365..b7b12

Re: [PATCH] arch/powerpc: use BUILD_BUG() when detect unfit {cmp}xchg, size

2016-02-23 Thread Boqun Feng
On Tue, Feb 23, 2016 at 04:45:16PM +0800, Pan Xinhui wrote:
> From: pan xinhui <xinhui@linux.vnet.ibm.com>
> 
> __xchg_called_with_bad_pointer() can't tell us what code uses {cmp}xchg
> in an incorrect way.  And no error will be reported until the link stage.
> To fix such kinds of issues in an easy way, we use BUILD_BUG() here.
> 
> Signed-off-by: pan xinhui <xinhui@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/cmpxchg.h | 19 +--
>  1 file changed, 5 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index d1a8d93..20c0a30 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Atomic exchange
> @@ -92,12 +93,6 @@ __xchg_u64_local(volatile void *p, unsigned long val)
>  }
>  #endif
>  
> -/*
> - * This function doesn't exist, so you'll get a linker error
> - * if something tries to do an invalid xchg().
> - */
> -extern void __xchg_called_with_bad_pointer(void);
> -
>  static __always_inline unsigned long
>  __xchg(volatile void *ptr, unsigned long x, unsigned int size)
>  {
> @@ -109,7 +104,7 @@ __xchg(volatile void *ptr, unsigned long x, unsigned int 
> size)
>   return __xchg_u64(ptr, x);
>  #endif
>   }
> - __xchg_called_with_bad_pointer();
> + BUILD_BUG();

Maybe we can use BUILD_BUG_ON_MSG(1, "Unsupported size for xchg"), which
could provide more information.
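
Something along these lines, keeping the existing size dispatch and only
replacing the fall-through (abridged sketch, reusing the helpers already
in this header):

static __always_inline unsigned long
__xchg(volatile void *ptr, unsigned long x, unsigned int size)
{
	switch (size) {
	case 4:
		return __xchg_u32(ptr, x);
#ifdef CONFIG_PPC64
	case 8:
		return __xchg_u64(ptr, x);
#endif
	}
	/* Caught at compile time, with a hint about what went wrong: */
	BUILD_BUG_ON_MSG(1, "Unsupported size for xchg");
	return x;
}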

With or without this verbosity:

Acked-by: Boqun Feng <boqun.f...@gmail.com>


Regards,
Boqun

>   return x;
>  }
>  
> @@ -124,7 +119,7 @@ __xchg_local(volatile void *ptr, unsigned long x, 
> unsigned int size)
>   return __xchg_u64_local(ptr, x);
>  #endif
>   }
> - __xchg_called_with_bad_pointer();
> + BUILD_BUG();
>   return x;
>  }
>  #define xchg(ptr,x)   \
> @@ -235,10 +230,6 @@ __cmpxchg_u64_local(volatile unsigned long *p, unsigned 
> long old,
>  }
>  #endif
>  
> -/* This function doesn't exist, so you'll get a linker error
> -   if something tries to do an invalid cmpxchg().  */
> -extern void __cmpxchg_called_with_bad_pointer(void);
> -
>  static __always_inline unsigned long
>  __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> @@ -251,7 +242,7 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned 
> long new,
>   return __cmpxchg_u64(ptr, old, new);
>  #endif
>   }
> - __cmpxchg_called_with_bad_pointer();
> + BUILD_BUG();
>   return old;
>  }
>  
> @@ -267,7 +258,7 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, 
> unsigned long new,
>   return __cmpxchg_u64_local(ptr, old, new);
>  #endif
>   }
> - __cmpxchg_called_with_bad_pointer();
> + BUILD_BUG();
>   return old;
>  }
>  
> -- 
> 2.5.0
> 



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
Hi Will,

On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve 
> > > > instructions in
> > > >the same thread, within that thread one set must contain the other.
> > > > 
> > > > P0      P1      P2
> > > > Rx=1    Wy=1    Wz=2
> > > > dep.    lwsync  lwsync
> > > > Ry=0    Wz=1    Wx=1
> > > > Rz=1
> > > > 
> > > > assert(!(z=2))
> > > > 
> > > >Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0             | P1            | P2            ;
> > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > >  xor r7,r6,r6   | lwsync        | lwsync        ;
> > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > >  lwz r8,0(r3)   |               |               ;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
 ^^^
Because of the fact that two reads on the same processor can't be
executed simultaneously? I feel like this is exactly something herd
missed.

> and what that means for the read of y.
> 

And the reasoning on PPC is similar, so looks like the read of z on P0
is a necessary condition for the exists clause to be forbidden.

Regards,
Boqun

> Will



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
Hi Paul,

On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > Paul E. McKenney  wrote:
> > >
> > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > The reason for this is that smp_read_barrier_depends() must order the
> > > pointer load against any subsequent read or write through a dereference
> > > of that pointer.  For example:
> > > 
> > >p = READ_ONCE(gp);
> > >smp_rmb();
> > >r1 = p->a; /* ordered by smp_rmb(). */
> > >p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > >r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > 
> > > In contrast:
> > > 
> > >p = READ_ONCE(gp);
> > >smp_read_barrier_depends();
> > >r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > >p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > >r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. 
> > > */
> > > 
> > > Again, if your hardware maintains local ordering for address
> > > and data dependencies, you can have read_barrier_depends() and
> > > smp_read_barrier_depends() be no-ops like they are for most
> > > architectures.
> > > 
> > > Does that help?
> > 
> > This is crazy! smp_rmb started out being strictly stronger than
> > smp_read_barrier_depends, when did this stop being the case?
> 
> Hello, Herbert!
> 
> It is true that most Linux kernel code relies only on the read-read
> properties of dependencies, but the read-write properties are useful.
> Admittedly relatively rarely, but useful.
> 
> The better comparison for smp_read_barrier_depends(), especially in
> its rcu_dereference*() form, is smp_load_acquire().
> 

Confused..

I recall that last time you and Linus came to the conclusion that even
on Alpha, a barrier for read->write with data dependency is unnecessary:

http://article.gmane.org/gmane.linux.kernel/2077661

And in an earlier mail of that thread, Linus made his point that
smp_read_barrier_depends() should only be used to order read->read.

So right now, are we going to extend the semantics of
smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
still only work for read->read, and assume all the architectures won't
reorder read->write with data dependency, so that the code above having
an smp_rmb() also works?

Regards,
Boqun



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
On Tue, Jan 26, 2016 at 03:29:21PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
> >  wrote:
> > >
> > > You might as well just write it as
> > >
> > > struct foo x = READ_ONCE(*ptr);
> > > x->bar = 5;
> > >
> > > because that "smp_read_barrier_depends()" does NOTHING wrt the second 
> > > write.
> > 
> > Just to clarify: on alpha it adds a memory barrier, but that memory
> > barrier is useless.
> 
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.
> 
> > On non-alpha, it is a no-op, and obviously does nothing simply because
> > it generates no code.
> > 
> > So if anybody believes that the "smp_read_barrier_depends()" does
> > something, they are *wrong*.
> 
> The other problem with smp_read_barrier_depends() is that it is often
> a pain figuring out which prior load it is supposed to apply to.
> Hence my preference for rcu_dereference*() and lockless_dereference().
> 

Because semantically speaking, rcu_dereference*() and
lockless_dereference() are CONSUME (i.e. data/address-dependent
read->read and read->write pairs are ordered), whereas
smp_read_barrier_depends() only guarantees that read->read pairs with a
data dependency are ordered, right?
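
Concretely, the difference would be (illustrative fragment, in the style
of Paul's example above):

	p = lockless_dereference(gp);
	r1 = p->a;	/* dependent load:  ordered after the load of gp */
	p->b = 42;	/* dependent store: ordered after the load of gp (CONSUME) */
	r2 = x;		/* no dependency:   may be reordered, which is fine */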

If so, maybe we need to call it out in memory-barriers.txt, for example:

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 904ee42..6b262c2 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1703,8 +1703,8 @@ There are some more advanced barrier functions:
 
 
  (*) lockless_dereference();
- This can be thought of as a pointer-fetch wrapper around the
- smp_read_barrier_depends() data-dependency barrier.
+ This is a load, and any load or store that has a data dependency on the
+ value returned by this load won't be reordered before this load.
 
  This is also similar to rcu_dereference(), but in cases where
  object lifetime is handled by some mechanism other than RCU, for


Regards,
Boqun

> > And if anybody sends out an email with that smp_read_barrier_depends()
> > in an example, they are actively just confusing other people, which is
> > even worse than just being wrong. Which is why I jumped in.
> > 
> > So stop perpetuating the myth that smp_read_barrier_depends() does
> > something here. It does not. It's a bug, and it has become this "mind
> > virus" for some people that seem to believe that it does something.
> 
> It looks like I should add words to memory-barriers.txt de-emphasizing
> smp_read_barrier_depends().  I will take a look at that.
> 
> > I had to remove this crap once from the kernel already, see commit
> > 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> > atomic*_read_ctrl()").
> > 
> > I don't want to ever see that broken construct again. And I want to
> > make sure that everybody is educated about how broken it was. I'm
> > extremely unhappy that it came up again.
> 
> Well, if it makes you feel better, that was control dependencies and this
> was data dependencies.  So it was not -exactly- the same.  ;-)
> 
> (Sorry, couldn't resist...)
> 
> > If it turns out that some architecture does actually need a barrier
> > between a read and a dependent write, then that will mean that
> > 
> >  (a) we'll have to make up a _new_ barrier, because
> > "smp_read_barrier_depends()" is not that barrier. We'll presumably
> > then have to make that new barrier part of "rcu_derefence()" and
> > friends.
> 
> Agreed.  We can worry about whether or not we replace the current
> smp_read_barrier_depends() with that new barrier when and if such
> hardware appears.
> 
> >  (b) we will have found an architecture with even worse memory
> > ordering semantics than alpha, and we'll have to stop castigating
> > alpha for being the worst memory ordering ever.
> 
> ;-) ;-) ;-)
> 
> > but I sincerely hope that we'll never find that kind of broken architecture.
> 
> Apparently at least some hardware vendors are reading memory-barriers.txt,
> so perhaps the odds of that kind of breakage have reduced.
> 
>   Thanx, Paul
> 



Re: [PATCH v2 15/32] powerpc: define __smp_xxx

2016-01-06 Thread Boqun Feng
On Wed, Jan 06, 2016 at 10:23:51PM +0200, Michael S. Tsirkin wrote:
[...]
> > > 
> > > Sorry, I don't understand - why do you have to do anything?
> > > I changed all users of smp_lwsync so they
> > > use __smp_lwsync on SMP and barrier() on !SMP.
> > > 
> > > This is exactly the current behaviour, I also tested that
> > > generated code does not change at all.
> > > 
> > > Is there a patch in your tree that conflicts with this?
> > > 
> > 
> > Because in a patchset which implements atomic relaxed/acquire/release
> > variants on PPC I use smp_lwsync(), this makes it have another user,
> > please see this mail:
> > 
> > http://article.gmane.org/gmane.linux.ports.ppc.embedded/89877
> > 
> > in definition of PPC's __atomic_op_release().
> > 
> > 
> > But I think removing smp_lwsync() is a good idea and actually I think we
> > can go further to remove __smp_lwsync() and let __smp_load_acquire and
> > __smp_store_release call __lwsync() directly, but that is another thing.
> > 
> > Anyway, I will modify my patch.
> > 
> > Regards,
> > Boqun
> 
> 
> Thanks!
> Could you send an ack then please?
> 

Sure, if you need one from me, feel free to add my ack for this patch:

Acked-by: Boqun Feng <boqun.f...@gmail.com>

Regards,
Boqun



Re: [PATCH v2 15/32] powerpc: define __smp_xxx

2016-01-05 Thread Boqun Feng
On Tue, Jan 05, 2016 at 10:51:17AM +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 05, 2016 at 09:36:55AM +0800, Boqun Feng wrote:
> > Hi Michael,
> > 
> > On Thu, Dec 31, 2015 at 09:07:42PM +0200, Michael S. Tsirkin wrote:
> > > This defines __smp_xxx barriers for powerpc
> > > for use by virtualization.
> > > 
> > > smp_xxx barriers are removed as they are
> > > defined correctly by asm-generic/barriers.h
> 
> I think this is the part that was missed in review.
> 

Yes, I realized my mistake after rereading the series. But smp_lwsync() is
not defined in asm-generic/barriers.h, right?

> > > This reduces the amount of arch-specific boiler-plate code.
> > > 
> > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > Acked-by: Arnd Bergmann <a...@arndb.de>
> > > ---
> > >  arch/powerpc/include/asm/barrier.h | 24 
> > >  1 file changed, 8 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/barrier.h 
> > > b/arch/powerpc/include/asm/barrier.h
> > > index 980ad0c..c0deafc 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -44,19 +44,11 @@
> > >  #define dma_rmb()__lwsync()
> > >  #define dma_wmb()__asm__ __volatile__ (stringify_in_c(SMPWMB) : 
> > > : :"memory")
> > >  
> > > -#ifdef CONFIG_SMP
> > > -#define smp_lwsync() __lwsync()
> > > +#define __smp_lwsync()   __lwsync()
> > >  
> > 
> > so __smp_lwsync() is always mapped to lwsync, right?
> 
> Yes.
> 
> > > -#define smp_mb() mb()
> > > -#define smp_rmb()__lwsync()
> > > -#define smp_wmb()__asm__ __volatile__ (stringify_in_c(SMPWMB) : 
> > > : :"memory")
> > > -#else
> > > -#define smp_lwsync() barrier()
> > > -
> > > -#define smp_mb() barrier()
> > > -#define smp_rmb()barrier()
> > > -#define smp_wmb()barrier()
> > > -#endif /* CONFIG_SMP */
> > > +#define __smp_mb()   mb()
> > > +#define __smp_rmb()  __lwsync()
> > > +#define __smp_wmb()  __asm__ __volatile__ (stringify_in_c(SMPWMB) : 
> > > : :"memory")
> > >  
> > >  /*
> > >   * This is a barrier which prevents following instructions from being
> > > @@ -67,18 +59,18 @@
> > >  #define data_barrier(x)  \
> > >   asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> > >  
> > > -#define smp_store_release(p, v)  
> > > \
> > > +#define __smp_store_release(p, v)
> > > \
> > >  do { 
> > > \
> > >   compiletime_assert_atomic_type(*p); \
> > > - smp_lwsync();   \
> > > + __smp_lwsync(); \
> > 
> > , therefore this will emit an lwsync no matter SMP or UP.
> 
> Absolutely. But smp_store_release (without __) will not.
> 
> Please note I did test this: for ppc code before and after
> this patch generates exactly the same binary on SMP and UP.
> 

Yes, you're right, sorry for my mistake...

> 
> > Another thing is that smp_lwsync() may have a third user(other than
> > smp_load_acquire() and smp_store_release()):
> > 
> > http://article.gmane.org/gmane.linux.ports.ppc.embedded/89877
> > 
> > I'm OK to change my patch accordingly, but do we really want
> > smp_lwsync() get involved in this cleanup? If I understand you
> > correctly, this cleanup focuses on external API like smp_{r,w,}mb(),
> > while smp_lwsync() is internal to PPC.
> > 
> > Regards,
> > Boqun
> 
> I think you missed the leading ___ :)
> 

What I meant here was that smp_lwsync() was originally internal to PPC, but
never mind ;-)

> smp_store_release is external and it needs __smp_lwsync as
> defined here.
> 
> I can duplicate some code and have smp_lwsync *not* call __smp_lwsync

You mean bringing smp_lwsync() back? Because I haven't seen you define it
in asm-generic/barriers.h in previous patches, and you just delete it in
this patch.

> but why do this? Still, if you prefer it this way,
> please let me know.
> 

I think deleting smp_lwsyn

Re: [PATCH v2 15/32] powerpc: define __smp_xxx

2016-01-05 Thread Boqun Feng
On Tue, Jan 05, 2016 at 06:16:48PM +0200, Michael S. Tsirkin wrote:
[snip]
> > > > Another thing is that smp_lwsync() may have a third user(other than
> > > > smp_load_acquire() and smp_store_release()):
> > > > 
> > > > http://article.gmane.org/gmane.linux.ports.ppc.embedded/89877
> > > > 
> > > > I'm OK to change my patch accordingly, but do we really want
> > > > smp_lwsync() get involved in this cleanup? If I understand you
> > > > correctly, this cleanup focuses on external API like smp_{r,w,}mb(),
> > > > while smp_lwsync() is internal to PPC.
> > > > 
> > > > Regards,
> > > > Boqun
> > > 
> > > I think you missed the leading ___ :)
> > > 
> > 
> > What I mean here was smp_lwsync() was originally internal to PPC, but
> > never mind ;-)
> > 
> > > smp_store_release is external and it needs __smp_lwsync as
> > > defined here.
> > > 
> > > I can duplicate some code and have smp_lwsync *not* call __smp_lwsync
> > 
> > You mean bringing smp_lwsync() back? because I haven't seen you defining
> > in asm-generic/barriers.h in previous patches and you just delete it in
> > this patch.
> > 
> > > but why do this? Still, if you prefer it this way,
> > > please let me know.
> > > 
> > 
> > I think deleting smp_lwsync() is fine, though I need to change atomic
> > variants patches on PPC because of it ;-/
> > 
> > Regards,
> > Boqun
> 
> Sorry, I don't understand - why do you have to do anything?
> I changed all users of smp_lwsync so they
> use __smp_lwsync on SMP and barrier() on !SMP.
> 
> This is exactly the current behaviour, I also tested that
> generated code does not change at all.
> 
> Is there a patch in your tree that conflicts with this?
> 

Because in a patchset which implements atomic relaxed/acquire/release
variants on PPC I use smp_lwsync(), this makes it have another user,
please see this mail:

http://article.gmane.org/gmane.linux.ports.ppc.embedded/89877

in definition of PPC's __atomic_op_release().


But I think removing smp_lwsync() is a good idea and actually I think we
can go further to remove __smp_lwsync() and let __smp_load_acquire and
__smp_store_release call __lwsync() directly, but that is another thing.
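
i.e. something like this (untested sketch):

#define __smp_store_release(p, v)					\
do {									\
	compiletime_assert_atomic_type(*p);				\
	__lwsync();							\
	WRITE_ONCE(*p, v);						\
} while (0)

#define __smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = READ_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	__lwsync();							\
	___p1;								\
})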

Anyway, I will modify my patch.

Regards,
Boqun

> 
> > > > >   WRITE_ONCE(*p, v);  
> > > > > \
> > > > >  } while (0)
> > > > >  
> > > > > -#define smp_load_acquire(p)  
> > > > > \
> > > > > +#define __smp_load_acquire(p)
> > > > > \
> > > > >  ({   
> > > > > \
> > > > >   typeof(*p) ___p1 = READ_ONCE(*p);   
> > > > > \
> > > > >   compiletime_assert_atomic_type(*p); 
> > > > > \
> > > > > - smp_lwsync();   
> > > > > \
> > > > > + __smp_lwsync(); 
> > > > > \
> > > > >   ___p1;  
> > > > > \
> > > > >  })
> > > > >  
> > > > > -- 
> > > > > MST
> > > > > 

Re: [PATCH v6 2/4] powerpc: atomic: Implement atomic{, 64}_*_return_* variants

2016-01-05 Thread Boqun Feng
Hi all,

I will resend this one to avoid a potential conflict with:

http://article.gmane.org/gmane.linux.kernel/2116880

by open coding smp_lwsync() with:

__asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory");

Regards,
Boqun



[PATCH RESEND v6 2/4] powerpc: atomic: Implement atomic{, 64}_*_return_* variants

2016-01-05 Thread Boqun Feng
On powerpc, acquire and release semantics can be achieved with
lightweight barriers("lwsync" and "ctrl+isync"), which can be used to
implement __atomic_op_{acquire,release}.

For release semantics, since we only need to ensure that all memory
accesses issued before it take effect before the -store- part of the
atomic, "lwsync" is all we need. On platforms without "lwsync", "sync"
should be used. Therefore in __atomic_op_release() we use
PPC_RELEASE_BARRIER.

For acquire semantics, "lwsync" is what we only need for the similar
reason.  However on the platform without "lwsync", we can use "isync"
rather than "sync" as an acquire barrier. Therefore in
__atomic_op_acquire() we use PPC_ACQUIRE_BARRIER, which is barrier() on
UP, "lwsync" if available and "isync" otherwise.

Implement atomic{,64}_{add,sub,inc,dec}_return_relaxed, and build other
variants with these helpers.
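
For example, atomic_add_return_acquire() built this way ends up roughly
equivalent to (illustrative sketch only):

static inline int atomic_add_return_acquire_sketch(int i, atomic_t *v)
{
	/* bare ll/sc loop, no entry/exit barriers */
	int t = atomic_add_return_relaxed(i, v);

	/* PPC_ACQUIRE_BARRIER: isync (or lwsync), paired with the trailing bne- */
	__asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");
	return t;
}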

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 147 ++
 1 file changed, 85 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 55f106e..a35c277 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,24 @@
 
 #define ATOMIC_INIT(i) { (i) }
 
+/*
+ * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
+ * on the platform without lwsync.
+ */
+#define __atomic_op_acquire(op, args...)   \
+({ \
+   typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
+   __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");\
+   __ret;  \
+})
+
+#define __atomic_op_release(op, args...)   \
+({ \
+   __asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory");\
+   op##_relaxed(args); \
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
int t;
@@ -42,27 +60,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)  
\
: "cc");\
 }  \
 
-#define ATOMIC_OP_RETURN(op, asm_op)   \
-static __inline__ int atomic_##op##_return(int a, atomic_t *v) \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
-   PPC_ATOMIC_ENTRY_BARRIER\
-"1:lwarx   %0,0,%2 # atomic_" #op "_return\n"  \
-   #asm_op " %0,%1,%0\n"   \
-   PPC405_ERR77(0,%2)  \
-"  stwcx.  %0,0,%2 \n" \
+"1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
+   #asm_op " %0,%2,%0\n"   \
+   PPC405_ERR77(0, %3) \
+"  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
-   PPC_ATOMIC_EXIT_BARRIER \
-   : "=" (t) \
+   : "=" (t), "+m" (v->counter)  \
: "r" (a), "r" (>counter)\
-   : "cc", "memory");  \
+   : "cc");\
\
return t;   \
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, 
asm_op)
+#define ATOMIC_OPS(op, asm_op) \
+   ATOMIC_OP(op, asm_op)

Re: [PATCH v2 15/32] powerpc: define __smp_xxx

2016-01-04 Thread Boqun Feng
Hi Michael,

On Thu, Dec 31, 2015 at 09:07:42PM +0200, Michael S. Tsirkin wrote:
> This defines __smp_xxx barriers for powerpc
> for use by virtualization.
> 
> smp_xxx barriers are removed as they are
> defined correctly by asm-generic/barriers.h
> 
> This reduces the amount of arch-specific boiler-plate code.
> 
> Signed-off-by: Michael S. Tsirkin 
> Acked-by: Arnd Bergmann 
> ---
>  arch/powerpc/include/asm/barrier.h | 24 
>  1 file changed, 8 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h 
> b/arch/powerpc/include/asm/barrier.h
> index 980ad0c..c0deafc 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -44,19 +44,11 @@
>  #define dma_rmb()__lwsync()
>  #define dma_wmb()__asm__ __volatile__ (stringify_in_c(SMPWMB) : : 
> :"memory")
>  
> -#ifdef CONFIG_SMP
> -#define smp_lwsync() __lwsync()
> +#define __smp_lwsync()   __lwsync()
>  

so __smp_lwsync() is always mapped to lwsync, right?

> -#define smp_mb() mb()
> -#define smp_rmb()__lwsync()
> -#define smp_wmb()__asm__ __volatile__ (stringify_in_c(SMPWMB) : : 
> :"memory")
> -#else
> -#define smp_lwsync() barrier()
> -
> -#define smp_mb() barrier()
> -#define smp_rmb()barrier()
> -#define smp_wmb()barrier()
> -#endif /* CONFIG_SMP */
> +#define __smp_mb()   mb()
> +#define __smp_rmb()  __lwsync()
> +#define __smp_wmb()  __asm__ __volatile__ (stringify_in_c(SMPWMB) : : 
> :"memory")
>  
>  /*
>   * This is a barrier which prevents following instructions from being
> @@ -67,18 +59,18 @@
>  #define data_barrier(x)  \
>   asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>  
> -#define smp_store_release(p, v)  
> \
> +#define __smp_store_release(p, v)
> \
>  do { \
>   compiletime_assert_atomic_type(*p); \
> - smp_lwsync();   \
> + __smp_lwsync(); \

, therefore this will emit an lwsync on both SMP and UP.
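
(For context, the generic wrapping in asm-generic/barrier.h after this
series looks roughly like the sketch below; simplified, the real header
guards each definition with #ifndef.)

#ifdef CONFIG_SMP
#define smp_mb()			__smp_mb()
#define smp_store_release(p, v)		__smp_store_release(p, v)
#define smp_load_acquire(p)		__smp_load_acquire(p)
#else	/* !CONFIG_SMP */
#define smp_mb()			barrier()
#define smp_store_release(p, v)						\
do {									\
	compiletime_assert_atomic_type(*p);				\
	barrier();							\
	WRITE_ONCE(*p, v);						\
} while (0)
#define smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = READ_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	barrier();							\
	___p1;								\
})
#endif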

Another thing is that smp_lwsync() may have a third user (other than
smp_load_acquire() and smp_store_release()):

http://article.gmane.org/gmane.linux.ports.ppc.embedded/89877

I'm OK to change my patch accordingly, but do we really want
smp_lwsync() to get involved in this cleanup? If I understand you
correctly, this cleanup focuses on external API like smp_{r,w,}mb(),
while smp_lwsync() is internal to PPC.

Regards,
Boqun

>   WRITE_ONCE(*p, v);  \
>  } while (0)
>  
> -#define smp_load_acquire(p)  \
> +#define __smp_load_acquire(p)
> \
>  ({   \
>   typeof(*p) ___p1 = READ_ONCE(*p);   \
>   compiletime_assert_atomic_type(*p); \
> - smp_lwsync();   \
> + __smp_lwsync(); \
>   ___p1;  \
>  })
>  
> -- 
> MST
> 

Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-27 Thread Boqun Feng
On Sun, Dec 27, 2015 at 06:53:39PM +1100, Michael Ellerman wrote:
> On Wed, 2015-12-23 at 18:54 +0800, Boqun Feng wrote:
> > On Wed, Dec 23, 2015 at 01:40:05PM +1100, Michael Ellerman wrote:
> > > On Tue, 2015-12-15 at 22:24 +0800, Boqun Feng wrote:
> > > > Hi all,
> > > > 
> > > > This is v6 of the series.
> > > > 
> > > > Link for v1: https://lkml.org/lkml/2015/8/27/798
> > > > Link for v2: https://lkml.org/lkml/2015/9/16/527
> > > > Link for v3: https://lkml.org/lkml/2015/10/12/368
> > > > Link for v4: https://lkml.org/lkml/2015/10/14/670
> > > > Link for v5: https://lkml.org/lkml/2015/10/26/141
> > > > 
> > > > 
> > > > Changes since v5:
> > > > 
> > > > *   rebase on the next branch of powerpc.
> > > > 
> > > > *   pull two fix and one testcase patches out, which are already
> > > > sent separately
> > > > 
> > > > *   some clean up or code format fixing.
> > > > 
> > > > 
> > > > Paul, Peter and Will, thank you for your comments and suggestions in 
> > > > the review
> > > > of previous versions. From this version on, This series is against the 
> > > > next
> > > > branch of powerpc tree, because most of the code touch arch/powerpc/*.
> > > 
> > > 
> > > Sorry if we already discussed this, but did we decide how we were going to
> > > merge this? There's the one patch to generic code and then three powerpc
> > > patches.
> > > 
> > > It'd make most sense for it to go via powerpc I think. Given that the 
> > > change to
> > > generic code is relatively trivial I'll plan to merge this unless someone
> > > objects.
> > > 
> > > Also it is pretty late in the -next cycle for something like this. But 
> > > AFAICS
> > > there are no users of these "atomic*relaxed" variants yet other than 
> > > arm64 code
> > > and qspinlocks, neither of which are used on powerpc. So adding them 
> > > should be
> > > pretty harmless.
> > > 
> > 
> > There is one thing we should be aware of, that is the bug:
> > 
> > http://lkml.kernel.org/r/5669d5f2.5050...@caviumnetworks.com
> > 
> > which though has been fixed by:
> > 
> > http://lkml.kernel.org/r/20151217160549.gh6...@twins.programming.kicks-ass.net
> > 
> > but the fix is not in powerpc/next right now. As this patchset makes
> > atomic_xchg_acquire a real ACQUIRE, so we will also trigger that bug if
> > this series gets merged in the next branch of powerpc tree, though
> > that's not the problem of this patchset.
> > 
> > Not sure whether this is a problem for your maintence, but just think
> > it's better to make you aware of this ;-)
> 
> Yes that's pretty important thank you :)
> 
> It's not so much that bug that's important, but the fact that I completely
> forget about the acquire/release implementations. Those are used already in
> mainline and so we don't want to add implementations this late in the cycle
> without wider testing.
> 

Understood.

> So I'll have to push this series until 4.6 so it can get some time in -next.
> Sorry!
> 

That's fine, thank you!

Regards,
Boqun



Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-23 Thread Boqun Feng
On Wed, Dec 23, 2015 at 01:40:05PM +1100, Michael Ellerman wrote:
> On Tue, 2015-12-15 at 22:24 +0800, Boqun Feng wrote:
> 
> > Hi all,
> > 
> > This is v6 of the series.
> > 
> > Link for v1: https://lkml.org/lkml/2015/8/27/798
> > Link for v2: https://lkml.org/lkml/2015/9/16/527
> > Link for v3: https://lkml.org/lkml/2015/10/12/368
> > Link for v4: https://lkml.org/lkml/2015/10/14/670
> > Link for v5: https://lkml.org/lkml/2015/10/26/141
> > 
> > 
> > Changes since v5:
> > 
> > *   rebase on the next branch of powerpc.
> > 
> > *   pull two fix and one testcase patches out, which are already
> > sent separately
> > 
> > *   some clean up or code format fixing.
> > 
> > 
> > Paul, Peter and Will, thank you for your comments and suggestions in the 
> > review
> > of previous versions. From this version on, This series is against the next
> > branch of powerpc tree, because most of the code touch arch/powerpc/*.
> 
> 
> Sorry if we already discussed this, but did we decide how we were going to
> merge this? There's the one patch to generic code and then three powerpc
> patches.
> 
> It'd make most sense for it to go via powerpc I think. Given that the change 
> to
> generic code is relatively trivial I'll plan to merge this unless someone
> objects.
> 
> Also it is pretty late in the -next cycle for something like this. But AFAICS
> there are no users of these "atomic*relaxed" variants yet other than arm64 
> code
> and qspinlocks, neither of which are used on powerpc. So adding them should be
> pretty harmless.
> 

There is one thing we should be aware of, that is the bug:

http://lkml.kernel.org/r/5669d5f2.5050...@caviumnetworks.com

which though has been fixed by:

http://lkml.kernel.org/r/20151217160549.gh6...@twins.programming.kicks-ass.net

but the fix is not in powerpc/next right now. As this patchset makes
atomic_xchg_acquire a real ACQUIRE, we will also trigger that bug if
this series gets merged into the next branch of the powerpc tree, though
that's not a problem of this patchset itself.

Not sure whether this is a problem for your maintenance, but I just
thought it's better to make you aware of this ;-)

Regards,
Boqun



Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-22 Thread Boqun Feng
On Wed, Dec 23, 2015 at 01:40:05PM +1100, Michael Ellerman wrote:
> On Tue, 2015-12-15 at 22:24 +0800, Boqun Feng wrote:
> 
> > Hi all,
> > 
> > This is v6 of the series.
> > 
> > Link for v1: https://lkml.org/lkml/2015/8/27/798
> > Link for v2: https://lkml.org/lkml/2015/9/16/527
> > Link for v3: https://lkml.org/lkml/2015/10/12/368
> > Link for v4: https://lkml.org/lkml/2015/10/14/670
> > Link for v5: https://lkml.org/lkml/2015/10/26/141
> > 
> > 
> > Changes since v5:
> > 
> > *   rebase on the next branch of powerpc.
> > 
> > *   pull two fix and one testcase patches out, which are already
> > sent separately
> > 
> > *   some clean up or code format fixing.
> > 
> > 
> > Paul, Peter and Will, thank you for your comments and suggestions in the 
> > review
> > of previous versions. From this version on, This series is against the next
> > branch of powerpc tree, because most of the code touch arch/powerpc/*.
> 
> 
> Sorry if we already discussed this, but did we decide how we were going to
> merge this? There's the one patch to generic code and then three powerpc
> patches.
> 

We might have "discussed" this ;-) As I proposed this would go to the
powerpc next in this mail:

http://marc.info/?l=linux-kernel&m=144660021417639&w=2

Regards,
Boqun

> It'd make most sense for it to go via powerpc I think. Given that the change 
> to
> generic code is relatively trivial I'll plan to merge this unless someone
> objects.
> 
> Also it is pretty late in the -next cycle for something like this. But AFAICS
> there are no users of these "atomic*relaxed" variants yet other than arm64 
> code
> and qspinlocks, neither of which are used on powerpc. So adding them should be
> pretty harmless.
> 
> cheers
> 



Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-19 Thread Boqun Feng
On Fri, Dec 18, 2015 at 09:12:50AM -0800, Davidlohr Bueso wrote:
> I've left this series testing overnight on a power7 box and so far so good,
> nothing has broken.

Davidlohr, thank you for your testing!

Regards,
Boqun



[PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-15 Thread Boqun Feng
Hi all,

This is v6 of the series.

Link for v1: https://lkml.org/lkml/2015/8/27/798
Link for v2: https://lkml.org/lkml/2015/9/16/527
Link for v3: https://lkml.org/lkml/2015/10/12/368
Link for v4: https://lkml.org/lkml/2015/10/14/670
Link for v5: https://lkml.org/lkml/2015/10/26/141


Changes since v5:

*   rebase on the next branch of powerpc.

*   pull two fix patches and one testcase patch out, which have already
been sent separately

*   some clean up or code format fixing.


Paul, Peter and Will, thank you for your comments and suggestions in the review
of previous versions. From this version on, this series is against the next
branch of the powerpc tree, because most of the code touches arch/powerpc/*.


Relaxed/acquire/release variants of atomic operations {add,sub}_return and
{cmp,}xchg are introduced by commit:

"atomics: add acquire/release/relaxed variants of some atomic operations"

and {inc,dec}_return has been introduced by commit:

"locking/asm-generic: Add _{relaxed|acquire|release}() variants for inc/dec
atomics"

By default, the generic code will implement a relaxed variant as a fully
ordered atomic operation, and a release/acquire variant as a relaxed variant
with the necessary general barrier before or after it.

On PPC, which has a weak memory model, a relaxed variant can be
implemented more cheaply than a fully ordered one. Furthermore, release
and acquire variants can be implemented with arch-specific lightweight
barriers.

Therefore this patchset implements the relaxed/acquire/release variants based
on the PPC memory model and PPC-specific barriers.
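
As a purely illustrative example (not part of this series) of where the
lighter variants pay off, a caller that only needs ordering at lock
boundaries can use the acquire/release flavours and avoid full "sync"
barriers where "lwsync"/"isync" suffice:

static inline void example_lock(atomic_t *l)
{
	/* acquire ordering: lwsync/isync on PPC instead of a full sync */
	while (atomic_cmpxchg_acquire(l, 0, 1) != 0)
		cpu_relax();
}

static inline void example_unlock(atomic_t *l)
{
	/* release ordering: lwsync on PPC instead of a full sync */
	smp_store_release(&l->counter, 0);
}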

The patchset consists of 4 parts:

1.  Allow architectures to define their own __atomic_op_*() helpers
to build other variants based on relaxed.

2.  Implement atomic{,64}_{add,sub,inc,dec}_return_* variants

3.  Implement xchg_* and atomic{,64}_xchg_* variants

4.  Implement cmpxchg_* atomic{,64}_cmpxchg_* variants


This patchset is against:

git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
commit 5f337e3e5b04b32793fd51adab438d46df99c933 

and has been tested by 0day. I also have run build and boot tests of this in
both guest (pseries) and host (powernv) environments.

Looking forward to any suggestion, question and comment ;-)

Regards,
Boqun

[PATCH v6 1/4] atomics: Allow architectures to define their own __atomic_op_* helpers

2015-12-15 Thread Boqun Feng
Some architectures may have their own special barriers for acquire, release
and fence semantics, so the general memory barriers (smp_mb__*_atomic())
in the default __atomic_op_*() may be too strong. Allow architectures
to define their own helpers, which override the default ones.
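
For example, an architecture with a lightweight acquire barrier could now
provide, in its own asm/atomic.h, something like the sketch below
(arch_acquire_barrier() is a made-up placeholder):

#define __atomic_op_acquire(op, args...)				\
({									\
	typeof(op##_relaxed(args)) __ret = op##_relaxed(args);		\
	arch_acquire_barrier();	/* hypothetical arch-specific barrier */\
	__ret;								\
})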

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 include/linux/atomic.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 301de78..5f3ee5a 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -34,20 +34,29 @@
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
  * variant is already fully ordered, no additional barriers are needed.
+ *
+ * Besides, if an arch has a special barrier for acquire/release, it could
+ * implement its own __atomic_op_* and use the same framework for building
+ * variants
  */
+#ifndef __atomic_op_acquire
 #define __atomic_op_acquire(op, args...)   \
 ({ \
typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
smp_mb__after_atomic(); \
__ret;  \
 })
+#endif
 
+#ifndef __atomic_op_release
 #define __atomic_op_release(op, args...)   \
 ({ \
smp_mb__before_atomic();\
op##_relaxed(args); \
 })
+#endif
 
+#ifndef __atomic_op_fence
 #define __atomic_op_fence(op, args...) \
 ({ \
typeof(op##_relaxed(args)) __ret;   \
@@ -56,6 +65,7 @@
smp_mb__after_atomic(); \
__ret;  \
 })
+#endif
 
 /* atomic_add_return_relaxed */
 #ifndef atomic_add_return_relaxed
-- 
2.6.4


[PATCH v6 3/4] powerpc: atomic: Implement acquire/release/relaxed variants for xchg

2015-12-15 Thread Boqun Feng
Implement xchg{,64}_relaxed and atomic{,64}_xchg_relaxed; based on these
_relaxed variants, release/acquire variants and fully ordered versions
can be built.

Note that xchg{,64}_relaxed and atomic_{,64}_xchg_relaxed are not
compiler barriers.
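
In other words (illustrative example only), without the "memory" clobber
the compiler may cache and reorder unrelated plain accesses around the
operation:

/* Illustrative only: the compiler may move the plain store across the
 * relaxed op, because the asm has no "memory" clobber. Use WRITE_ONCE()
 * or a stronger variant when that ordering matters. */
static void example(int *plain, atomic_t *flag)
{
	*plain = 1;			/* may be reordered after the xchg below */
	atomic_xchg_relaxed(flag, 1);	/* not a compiler barrier */
}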

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  2 ++
 arch/powerpc/include/asm/cmpxchg.h | 69 +-
 2 files changed, 32 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index aac9e0b..409d3cf 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -177,6 +177,7 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * __atomic_add_unless - add unless the number is a given value
@@ -444,6 +445,7 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t 
*v)
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * atomic64_add_unless - add unless the number is a given value
diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index d1a8d93..17c7e14 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -9,21 +9,20 @@
 /*
  * Atomic exchange
  *
- * Changes the memory location '*ptr' to be val and returns
+ * Changes the memory location '*p' to be val and returns
  * the previous value stored there.
  */
+
 static __always_inline unsigned long
-__xchg_u32(volatile void *p, unsigned long val)
+__xchg_u32_local(volatile void *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -31,42 +30,34 @@ __xchg_u32(volatile void *p, unsigned long val)
return prev;
 }
 
-/*
- * Atomic exchange
- *
- * Changes the memory location '*ptr' to be val and returns
- * the previous value stored there.
- */
 static __always_inline unsigned long
-__xchg_u32_local(volatile void *p, unsigned long val)
+__xchg_u32_relaxed(u32 *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-"1:lwarx   %0,0,%2 \n"
-   PPC405_ERR77(0,%2)
-"  stwcx.  %3,0,%2 \n\
-   bne-1b"
-   : "=" (prev), "+m" (*(volatile unsigned int *)p)
+"1:lwarx   %0,0,%2\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %3,0,%2\n"
+"  bne-1b"
+   : "=" (prev), "+m" (*p)
: "r" (p), "r" (val)
-   : "cc", "memory");
+   : "cc");
 
return prev;
 }
 
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
-__xchg_u64(volatile void *p, unsigned long val)
+__xchg_u64_local(volatile void *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -75,18 +66,18 @@ __xchg_u64(volatile void *p, unsigned long val)
 }
 
 static __always_inline unsigned long
-__xchg_u64_local(volatile void *p, unsigned long val)
+__xchg_u64_relaxed(u64 *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-"1:ldarx   %0,0,%2 \n"
-   PPC405_ERR77(0,%2)
-"  stdcx.  %3,0,%2 \n\
-   bne-1b"
-   : "=" (prev), "+m" (*(volatile unsigned long *)p)
+"1:ldarx   %0,0,%2\n"
+   PPC405_ERR77(0, %2)
+"  stdcx.  %3,0,%2\n"
+"  bne-1b"
+   : "=" (prev), "+m" (*p)
: "r" (p), "r" (val)
-   : "cc", "memory");
+   : "cc");
 
return prev;
 }
@@ -99,14 +90,14 @@ __xchg_u64_local(volatile void *p, unsigned long val)
 extern void __xchg_called_with_bad_pointer(void);
 
 static __always_inline unsigned long
-__xchg(volatile void *ptr, unsigned long x, unsigned 

[PATCH v6 4/4] powerpc: atomic: Implement acquire/release/relaxed variants for cmpxchg

2015-12-15 Thread Boqun Feng
Implement cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed, based on
which _release variants can be built.

To avoid superfluous barriers in _acquire variants, we implement these
operations with assembly code rather than use __atomic_op_acquire() to build
them automatically.

For the same reason, we keep the assembly implementation of fully
ordered cmpxchg operations.

However, we don't do the same for _release, because that would require
putting barriers in the middle of ll/sc loops, which is probably a bad
idea.
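
So the _release variants come from the generic builder instead, which
keeps the barrier in front of the whole loop, roughly (illustrative
expansion, see patch 2/4 of this series):

	/* cmpxchg_release(p, o, n), built via __atomic_op_release(): */
	smp_lwsync();				/* lwsync: orders prior accesses before the store */
	prev = cmpxchg_relaxed(p, o, n);	/* bare ll/sc loop, no barrier inside it */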

Note cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed are not
compiler barriers.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  10 +++
 arch/powerpc/include/asm/cmpxchg.h | 149 -
 2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 409d3cf..fb8ce7e 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -176,6 +176,11 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t 
*v)
 #define atomic_dec_return_relaxed atomic_dec_return_relaxed
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic_cmpxchg_relaxed(v, o, n) \
+   cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic_cmpxchg_acquire(v, o, n) \
+   cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
@@ -444,6 +449,11 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t 
*v)
 }
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic64_cmpxchg_relaxed(v, o, n) \
+   cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic64_cmpxchg_acquire(v, o, n) \
+   cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index 17c7e14..cae4fa8 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -181,6 +181,56 @@ __cmpxchg_u32_local(volatile unsigned int *p, unsigned 
long old,
return prev;
 }
 
+static __always_inline unsigned long
+__cmpxchg_u32_relaxed(u32 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:lwarx   %0,0,%2 # __cmpxchg_u32_relaxed\n"
+"  cmpw0,%0,%3\n"
+"  bne-2f\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %4,0,%2\n"
+"  bne-1b\n"
+"2:"
+   : "=" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc");
+
+   return prev;
+}
+
+/*
+ * cmpxchg family don't have order guarantee if cmp part fails, therefore we
+ * can avoid superfluous barriers if we use assembly code to implement
+ * cmpxchg() and cmpxchg_acquire(), however we don't do the similar for
+ * cmpxchg_release() because that will result in putting a barrier in the
+ * middle of a ll/sc loop, which is probably a bad idea. For example, this
+ * might cause the conditional store more likely to fail.
+ */
+static __always_inline unsigned long
+__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:lwarx   %0,0,%2 # __cmpxchg_u32_acquire\n"
+"  cmpw0,%0,%3\n"
+"  bne-2f\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %4,0,%2\n"
+"  bne-1b\n"
+   PPC_ACQUIRE_BARRIER
+   "\n"
+"2:"
+   : "=" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc", "memory");
+
+   return prev;
+}
+
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
 __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
@@ -224,6 +274,46 @@ __cmpxchg_u64_local(volatile unsigned long *p, unsigned 
long old,
 
return prev;
 }
+
+static __always_inline unsigned long
+__cmpxchg_u64_relaxed(u64 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:ldarx   %0,0,%2 # __cmpxchg_u64_relaxed\n"
+"  cmpd0,%0,%3\n"
+"  bne-2f\n"
+"  stdcx.  %4,0,%2\n"
+"  bne-1b\n"
+"2:"
+   : "=" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc");
+
+   return prev;
+}
+
+sta

[PATCH v6 2/4] powerpc: atomic: Implement atomic{, 64}_*_return_* variants

2015-12-15 Thread Boqun Feng
On powerpc, acquire and release semantics can be achieved with
lightweight barriers("lwsync" and "ctrl+isync"), which can be used to
implement __atomic_op_{acquire,release}.

For release semantics, since we only need to ensure that all memory
accesses issued before it take effect before the -store- part of the
atomic, "lwsync" is all we need. On platforms without "lwsync", "sync"
should be used. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is what we only need for the similar
reason.  However on the platform without "lwsync", we can use "isync"
rather than "sync" as an acquire barrier. Therefore in
__atomic_op_acquire() we use PPC_ACQUIRE_BARRIER, which is barrier() on
UP, "lwsync" if available and "isync" otherwise.

Implement atomic{,64}_{add,sub,inc,dec}_return_relaxed, and build other
variants with these helpers.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 147 ++
 1 file changed, 85 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 55f106e..aac9e0b 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,24 @@
 
 #define ATOMIC_INIT(i) { (i) }
 
+/*
+ * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, an isync is enough as an acquire barrier
+ * on platforms without lwsync.
+ */
+#define __atomic_op_acquire(op, args...)   \
+({ \
+   typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
+   __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");\
+   __ret;  \
+})
+
+#define __atomic_op_release(op, args...)   \
+({ \
+   smp_lwsync();   \
+   op##_relaxed(args); \
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
int t;
@@ -42,27 +60,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)  
\
: "cc");\
 }  \
 
-#define ATOMIC_OP_RETURN(op, asm_op)   \
-static __inline__ int atomic_##op##_return(int a, atomic_t *v) \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
-   PPC_ATOMIC_ENTRY_BARRIER\
-"1:lwarx   %0,0,%2 # atomic_" #op "_return\n"  \
-   #asm_op " %0,%1,%0\n"   \
-   PPC405_ERR77(0,%2)  \
-"  stwcx.  %0,0,%2 \n" \
+"1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
+   #asm_op " %0,%2,%0\n"   \
+   PPC405_ERR77(0, %3) \
+"  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
-   PPC_ATOMIC_EXIT_BARRIER \
-   : "=" (t) \
+   : "=" (t), "+m" (v->counter)  \
: "r" (a), "r" (>counter)\
-   : "cc", "memory");  \
+   : "cc");\
\
return t;   \
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, 
asm_op)
+#define ATOMIC_OPS(op, asm_op) \
+   ATOMIC_OP(op, asm_op)   \
+   

Re: [RESEND, tip/locking/core, v5, 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-11-03 Thread Boqun Feng
On Mon, Nov 02, 2015 at 09:22:40AM +0800, Boqun Feng wrote:
> > On Tue, Oct 27, 2015 at 11:06:52AM +0800, Boqun Feng wrote:
> > > To summerize:
> > > 
> > > patch 1(split to two), 3, 4(remove inc/dec implementation), 5, 6 sent as
> > > powerpc patches for powerpc next, patch 2(unmodified) sent as tip patch
> > > for locking/core.
> > > 
> > > Peter and Michael, this works for you both?
> > > 
> > 
> > Thoughts? ;-)
> > 
> 
> Peter and Michael, I will split patch 1 to two and send them as patches
> for powerpc next first. The rest of this can wait util we are on the
> same page of where they'd better go.
> 

I'm about to send patch 2 (adding trivial tests) as a patch for the tip
tree, and the rest of this series will be patches for powerpc next.

Will, AFAIK you are currently working on the arm64 variants, right? I
wonder whether you depend on patch 3 (allow architectures to provide
self-defined __atomic_op_*); if so, I can also send patch 3 as a patch
for the tip tree and wait until it is merged into powerpc next to send
the rest.

Thanks and Best Regards,
Boqun



[PATCH powerpc/next 1/2] powerpc: Make value-returning atomics fully ordered

2015-11-01 Thread Boqun Feng
According to memory-barriers.txt:

> Any atomic operation that modifies some state in memory and returns
> information about the state (old or new) implies an SMP-conditional
> general memory barrier (smp_mb()) on each side of the actual
> operation ...

Which means these operations should be fully ordered. However on PPC,
PPC_ATOMIC_ENTRY_BARRIER is the barrier before the actual operation,
which is currently "lwsync" if SMP=y. The leading "lwsync" cannot
guarantee fully ordered atomics, according to Paul McKenney:

https://lkml.org/lkml/2015/10/14/970

To fix this, we define PPC_ATOMIC_ENTRY_BARRIER as "sync" to guarantee
the fully-ordered semantics.

This also makes futex atomics fully ordered, which can avoid possible
memory ordering problems if userspace code relies on futex system call
for fully ordered semantics.

Cc: <sta...@vger.kernel.org> # 3.4+
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
These two are separated and split out from the patchset implementing the
powerpc atomic variants, whose link is:

https://lkml.org/lkml/2015/10/26/141

Based on next branch of powerpc tree, tested by 0day.

 arch/powerpc/include/asm/synch.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index e682a71..c508686 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -44,7 +44,7 @@ static inline void isync(void)
MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
 #define PPC_ACQUIRE_BARRIER "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER stringify_in_c(LWSYNC) "\n"
-#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(LWSYNC) "\n"
+#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(sync) "\n"
 #define PPC_ATOMIC_EXIT_BARRIER "\n" stringify_in_c(sync) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
-- 
2.6.2


Re: [RESEND, tip/locking/core, v5, 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-11-01 Thread Boqun Feng
On Fri, Oct 30, 2015 at 08:56:33AM +0800, Boqun Feng wrote:
> On Tue, Oct 27, 2015 at 11:06:52AM +0800, Boqun Feng wrote:
> > On Tue, Oct 27, 2015 at 01:33:47PM +1100, Michael Ellerman wrote:
> > > On Mon, 2015-26-10 at 10:15:36 UTC, Boqun Feng wrote:
> > > > This patch fixes two problems to make value-returning atomics and
> > > > {cmp}xchg fully ordered on PPC.
> > > 
> > > Hi Boqun,
> > > 
> > > Can you please split this into two patches. One that does the cmpxchg 
> > > change
> > > and one that changes PPC_ATOMIC_ENTRY_BARRIER.
> > > 
> > 
> > OK, make sense ;-)
> > 
> > > Also given how pervasive this change is I'd like to take it via the 
> > > powerpc
> > > next tree, so can you please send this patch (which will be two after you 
> > > split
> > > it) as powerpc patches. And the rest can go via tip?
> > > 
> > 
> > One problem is that patch 5 will remove __xchg_u32 and __xchg_64
> > entirely, which are modified in this patch(patch 1), so there will be
> > some conflicts if two branch get merged, I think.
> > 
> > Alternative way is that all this series go to powerpc next tree as most
> > of the dependent patches are already there. I just need to remove
> > inc/dec related code and resend them when appropriate. Besides, I can
> > pull patch 2 out and send it as a tip patch because it's general code
> > and no one depends on this in this series.
> > 
> > To summerize:
> > 
> > patch 1(split to two), 3, 4(remove inc/dec implementation), 5, 6 sent as
> > powerpc patches for powerpc next, patch 2(unmodified) sent as tip patch
> > for locking/core.
> > 
> > Peter and Michael, this works for you both?
> > 
> 
> Thoughts? ;-)
> 

Peter and Michael, I will split patch 1 into two and send them as patches
for powerpc next first. The rest of this can wait until we are on the
same page about where they'd best go.

Regards,
Boqun



[PATCH powerpc/next 2/2] powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered

2015-11-01 Thread Boqun Feng
According to memory-barriers.txt, xchg*, cmpxchg* and their atomic_
versions all need to be fully ordered, however they are now just
RELEASE+ACQUIRE, which are not fully ordered.

So also replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee fully ordered semantics
of atomic{,64}_{cmp,}xchg() and {cmp,}xchg(), as a complement of commit
b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics")

This patch depends on patch "powerpc: Make value-returning atomics fully
ordered" for PPC_ATOMIC_ENTRY_BARRIER definition.

Cc: <sta...@vger.kernel.org> # 3.4+
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/cmpxchg.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..d1a8d93 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
old, unsigned long new)
unsigned int prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 # __cmpxchg_u32\n\
cmpw0,%0,%3\n\
bne-2f\n"
PPC405_ERR77(0,%2)
 "  stwcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
@@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
old, unsigned long new)
unsigned long prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 # __cmpxchg_u64\n\
cmpd0,%0,%3\n\
bne-2f\n\
stdcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
-- 
2.6.2


Re: [RESEND, tip/locking/core, v5, 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-10-29 Thread Boqun Feng
On Tue, Oct 27, 2015 at 11:06:52AM +0800, Boqun Feng wrote:
> On Tue, Oct 27, 2015 at 01:33:47PM +1100, Michael Ellerman wrote:
> > On Mon, 2015-26-10 at 10:15:36 UTC, Boqun Feng wrote:
> > > This patch fixes two problems to make value-returning atomics and
> > > {cmp}xchg fully ordered on PPC.
> > 
> > Hi Boqun,
> > 
> > Can you please split this into two patches. One that does the cmpxchg change
> > and one that changes PPC_ATOMIC_ENTRY_BARRIER.
> > 
> 
> OK, make sense ;-)
> 
> > Also given how pervasive this change is I'd like to take it via the powerpc
> > next tree, so can you please send this patch (which will be two after you 
> > split
> > it) as powerpc patches. And the rest can go via tip?
> > 
> 
> One problem is that patch 5 will remove __xchg_u32 and __xchg_64
> entirely, which are modified in this patch(patch 1), so there will be
> some conflicts if two branch get merged, I think.
> 
> Alternative way is that all this series go to powerpc next tree as most
> of the dependent patches are already there. I just need to remove
> inc/dec related code and resend them when appropriate. Besides, I can
> pull patch 2 out and send it as a tip patch because it's general code
> and no one depends on this in this series.
> 
> To summerize:
> 
> patch 1(split to two), 3, 4(remove inc/dec implementation), 5, 6 sent as
> powerpc patches for powerpc next, patch 2(unmodified) sent as tip patch
> for locking/core.
> 
> Peter and Michael, this works for you both?
> 

Thoughts? ;-)

Regards,
Boqun



[PATCH tip/locking/core v5 2/6] atomics: Add test for atomic operations with _relaxed variants

2015-10-26 Thread Boqun Feng
Some atomic operations now have _relaxed/acquire/release variants; this
patch adds some trivial tests for two purposes:

1.  test the behavior of these new operations in a single-CPU
environment.

2.  get their code generated before we actually use them anywhere,
so that we can examine their assembly code.
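
For instance, RETURN_FAMILY_TEST(, add_return, +=, onestwos) below boils
down to four TEST_RETURN instances, roughly like this (a sketch; the
variable values are illustrative):

{
	atomic_t v;
	int r, v0 = 0xaaa31337, onestwos = 0x11112222;	/* illustrative */

	atomic_set(&v, v0); r = v0; r += onestwos;
	BUG_ON(atomic_add_return(onestwos, &v) != r);
	BUG_ON(atomic_read(&v) != r);

	atomic_set(&v, v0); r = v0; r += onestwos;
	BUG_ON(atomic_add_return_acquire(onestwos, &v) != r);
	BUG_ON(atomic_read(&v) != r);

	/* ...and likewise for _release and _relaxed */
}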

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 lib/atomic64_test.c | 120 ++--
 1 file changed, 79 insertions(+), 41 deletions(-)

diff --git a/lib/atomic64_test.c b/lib/atomic64_test.c
index 83c33a5b..18e422b 100644
--- a/lib/atomic64_test.c
+++ b/lib/atomic64_test.c
@@ -27,6 +27,65 @@ do { 
\
(unsigned long long)r); \
 } while (0)
 
+/*
+ * Test for an atomic operation family,
+ * @test should be a macro accepting parameters (bit, op, ...)
+ */
+
+#define FAMILY_TEST(test, bit, op, args...)\
+do {   \
+   test(bit, op, ##args);  \
+   test(bit, op##_acquire, ##args);\
+   test(bit, op##_release, ##args);\
+   test(bit, op##_relaxed, ##args);\
+} while (0)
+
+#define TEST_RETURN(bit, op, c_op, val)\
+do {   \
+   atomic##bit##_set(&v, v0);  \
+   r = v0; \
+   r c_op val; \
+   BUG_ON(atomic##bit##_##op(val, &v) != r);   \
+   BUG_ON(atomic##bit##_read(&v) != r);\
+} while (0)
+
+#define RETURN_FAMILY_TEST(bit, op, c_op, val) \
+do {   \
+   FAMILY_TEST(TEST_RETURN, bit, op, c_op, val);   \
+} while (0)
+
+#define TEST_ARGS(bit, op, init, ret, expect, args...) \
+do {   \
+   atomic##bit##_set(&v, init);\
+   BUG_ON(atomic##bit##_##op(&v, ##args) != ret);  \
+   BUG_ON(atomic##bit##_read(&v) != expect);   \
+} while (0)
+
+#define XCHG_FAMILY_TEST(bit, init, new)   \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, xchg, init, init, new, new);\
+} while (0)
+
+#define CMPXCHG_FAMILY_TEST(bit, init, new, wrong) \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, cmpxchg,\
+   init, init, new, init, new);\
+   FAMILY_TEST(TEST_ARGS, bit, cmpxchg,\
+   init, init, init, wrong, new);  \
+} while (0)
+
+#define INC_RETURN_FAMILY_TEST(bit, i) \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, inc_return, \
+   i, (i) + one, (i) + one);   \
+} while (0)
+
+#define DEC_RETURN_FAMILY_TEST(bit, i) \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, dec_return, \
+   i, (i) - one, (i) - one);   \
+} while (0)
+
 static __init void test_atomic(void)
 {
int v0 = 0xaaa31337;
@@ -45,6 +104,18 @@ static __init void test_atomic(void)
TEST(, and, &=, v1);
TEST(, xor, ^=, v1);
TEST(, andnot, &= ~, v1);
+
+   RETURN_FAMILY_TEST(, add_return, +=, onestwos);
+   RETURN_FAMILY_TEST(, add_return, +=, -one);
+   RETURN_FAMILY_TEST(, sub_return, -=, onestwos);
+   RETURN_FAMILY_TEST(, sub_return, -=, -one);
+
+   INC_RETURN_FAMILY_TEST(, v0);
+   DEC_RETURN_FAMILY_TEST(, v0);
+
+   XCHG_FAMILY_TEST(, v0, v1);
+   CMPXCHG_FAMILY_TEST(, v0, v1, onestwos);
+
 }
 
#define INIT(c) do { atomic64_set(&v, c); r = c; } while (0)
@@ -74,25 +145,10 @@ static __init void test_atomic64(void)
TEST(64, xor, ^=, v1);
TEST(64, andnot, &= ~, v1);
 
-   INIT(v0);
-   r += onestwos;
-   BUG_ON(atomic64_add_return(onestwos, &v) != r);
-   BUG_ON(v.counter != r);
-
-   INIT(v0);
-   r += -one;
-   BUG_ON(atomic64_add_return(-one, &v) != r);
-   BUG_ON(v.counter != r);
-
-   INIT(v0);
-   r -= onestwos;
-   BUG_ON(atomic64_sub_return(onestwos, &v) != r);
-   BUG_ON(v.counter != r);
-
-   INIT(v0);
-   r -= -one;
-   BUG_ON(atomic64_sub_return(-one, &v) != r);
-   BUG_ON(v.counter != r);
+   RETURN_FAMILY_TEST(64, add_return, +=, onestwos);
+   RETURN_FAMILY_TEST(64, add_return, +=, -one);
+   RETURN_FAMILY_TEST(64, sub_return, -=, onestwos);
+   RETURN_FAMILY_TEST(64, su

[PATCH tip/locking/core v5 6/6] powerpc: atomic: Implement cmpxchg{, 64}_* and atomic{, 64}_cmpxchg_* variants

2015-10-26 Thread Boqun Feng
Implement cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed, based on
which _release variants can be built.

To avoid superfluous barriers in the _acquire variants, we implement
these operations with assembly code rather than using
__atomic_op_acquire() to build them automatically.

For the same reason, we keep the assembly implementation of fully
ordered cmpxchg operations.

However, we don't do the same for _release, because that would require
putting barriers in the middle of ll/sc loops, which is probably a bad
idea.

Note cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed are not
compiler barriers.
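
As a usage sketch (not part of the patch): the _acquire flavour is what a
simple trylock wants, since ordering is only needed on success and the
failed-compare path can skip the barrier entirely.

static inline int my_trylock(atomic_t *lock)
{
	/* ACQUIRE on success orders the critical section after the lock */
	return atomic_cmpxchg_acquire(lock, 0, 1) == 0;
}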

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  10 +++
 arch/powerpc/include/asm/cmpxchg.h | 149 -
 2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 2c3d4f0..195dc85 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -176,6 +176,11 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t 
*v)
 #define atomic_dec_return_relaxed atomic_dec_return_relaxed
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic_cmpxchg_relaxed(v, o, n) \
+   cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic_cmpxchg_acquire(v, o, n) \
+   cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
@@ -444,6 +449,11 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t 
*v)
 }
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic64_cmpxchg_relaxed(v, o, n) \
+   cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic64_cmpxchg_acquire(v, o, n) \
+   cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index 17c7e14..cae4fa8 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -181,6 +181,56 @@ __cmpxchg_u32_local(volatile unsigned int *p, unsigned 
long old,
return prev;
 }
 
+static __always_inline unsigned long
+__cmpxchg_u32_relaxed(u32 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:lwarx   %0,0,%2 # __cmpxchg_u32_relaxed\n"
+"  cmpw0,%0,%3\n"
+"  bne-2f\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %4,0,%2\n"
+"  bne-1b\n"
+"2:"
+   : "=" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc");
+
+   return prev;
+}
+
+/*
+ * The cmpxchg family doesn't provide any ordering guarantee if the cmp part
+ * fails, therefore we can avoid superfluous barriers by using assembly code
+ * to implement cmpxchg() and cmpxchg_acquire(). However, we don't do the
+ * same for cmpxchg_release(), because that would result in putting a barrier
+ * in the middle of a ll/sc loop, which is probably a bad idea. For example,
+ * this might make the conditional store more likely to fail.
+ */
+static __always_inline unsigned long
+__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:lwarx   %0,0,%2 # __cmpxchg_u32_acquire\n"
+"  cmpw0,%0,%3\n"
+"  bne-2f\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %4,0,%2\n"
+"  bne-1b\n"
+   PPC_ACQUIRE_BARRIER
+   "\n"
+"2:"
+   : "=" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc", "memory");
+
+   return prev;
+}
+
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
 __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
@@ -224,6 +274,46 @@ __cmpxchg_u64_local(volatile unsigned long *p, unsigned 
long old,
 
return prev;
 }
+
+static __always_inline unsigned long
+__cmpxchg_u64_relaxed(u64 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:ldarx   %0,0,%2 # __cmpxchg_u64_relaxed\n"
+"  cmpd0,%0,%3\n"
+"  bne-2f\n"
+"  stdcx.  %4,0,%2\n"
+"  bne-1b\n"
+"2:"
+   : "=" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc");
+
+   return prev;
+}
+
+sta

[PATCH tip/locking/core v5 4/6] powerpc: atomic: Implement atomic{, 64}_*_return_* variants

2015-10-26 Thread Boqun Feng
On powerpc, acquire and release semantics can be achieved with
lightweight barriers ("lwsync" and "ctrl+isync"), which can be used to
implement __atomic_op_{acquire,release}.

For release semantics, since we only need to ensure that all memory
accesses issued before the atomic take effect before its -store- part,
"lwsync" is all we need. On platforms without "lwsync", "sync" should
be used. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is all we need, for a similar reason.
However, on platforms without "lwsync" we can use "isync" rather than
"sync" as an acquire barrier. Therefore, in __atomic_op_acquire() we use
PPC_ACQUIRE_BARRIER, which is barrier() on UP, "lwsync" if available and
"isync" otherwise.

Implement atomic{,64}_{add,sub,inc,dec}_return_relaxed, and build other
variants with these helpers.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 107 +++---
 1 file changed, 65 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 55f106e..f9c0c6c 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,24 @@
 
 #define ATOMIC_INIT(i) { (i) }
 
+/*
+ * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, an isync is enough as an acquire barrier
+ * on platforms without lwsync.
+ */
+#define __atomic_op_acquire(op, args...)   \
+({ \
+   typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
+   __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");\
+   __ret;  \
+})
+
+#define __atomic_op_release(op, args...)   \
+({ \
+   smp_lwsync();   \
+   op##_relaxed(args); \
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
int t;
@@ -42,27 +60,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)  
\
: "cc");\
 }  \
 
-#define ATOMIC_OP_RETURN(op, asm_op)   \
-static __inline__ int atomic_##op##_return(int a, atomic_t *v) \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
-   PPC_ATOMIC_ENTRY_BARRIER\
-"1:lwarx   %0,0,%2 # atomic_" #op "_return\n"  \
-   #asm_op " %0,%1,%0\n"   \
-   PPC405_ERR77(0,%2)  \
-"  stwcx.  %0,0,%2 \n" \
+"1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
+   #asm_op " %0,%2,%0\n"   \
+   PPC405_ERR77(0, %3) \
+"  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
-   PPC_ATOMIC_EXIT_BARRIER \
-   : "=" (t) \
+   : "=" (t), "+m" (v->counter)  \
: "r" (a), "r" (>counter)\
-   : "cc", "memory");  \
+   : "cc");\
\
return t;   \
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, 
asm_op)
+#define ATOMIC_OPS(op, asm_op) \
+   ATOMIC_OP(op, asm_op)   \
+   

[PATCH tip/locking/core v5 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics

2015-10-26 Thread Boqun Feng
Hi all,

This is v5 of the series.

Link for v1: https://lkml.org/lkml/2015/8/27/798
Link for v2: https://lkml.org/lkml/2015/9/16/527
Link for v3: https://lkml.org/lkml/2015/10/12/368
Link for v4: https://lkml.org/lkml/2015/10/14/670

Changes since v4:

*   define PPC_ATOMIC_ENTRY_BARRIER as "sync" (Paul E. Mckenney)

*   remove PPC-specific __atomic_op_fence().


Relaxed/acquire/release variants of atomic operations {add,sub}_return
and {cmp,}xchg are introduced by commit:

"atomics: add acquire/release/relaxed variants of some atomic operations"

and {inc,dec}_return has been introduced by commit:

"locking/asm-generic: Add _{relaxed|acquire|release}() variants for
inc/dec atomics"

Both of these are in the current locking/core branch of the tip tree.

By default, the generic code will implement a relaxed variant as a fully
ordered atomic operation, and a release/acquire variant as the relaxed
variant with the necessary general barrier before or after it.

On PPC, which has a weak memory ordering model, a relaxed variant can be
implemented more cheaply than a fully ordered one. Furthermore, release
and acquire variants can be implemented with arch-specific lightweight
barriers.
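
Concretely, for the release case the series goes from the generic
composition (patch 3) to the PPC-specific one (patch 4); shown side by
side below as a sketch, with the names suffixed only so that both can
appear together:

#define __atomic_op_release_generic(op, args...)			\
({									\
	smp_mb__before_atomic();	/* general barrier in generic code */ \
	op##_relaxed(args);						\
})

#define __atomic_op_release_ppc(op, args...)				\
({									\
	smp_lwsync();		/* "lwsync", or "sync" where unavailable */ \
	op##_relaxed(args);						\
})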

Besides, cmpxchg, xchg and their atomic_ versions are only RELEASE+ACQUIRE
rather than fully ordered in the current PPC implementation, which is
incorrect according to memory-barriers.txt. Furthermore,
PPC_ATOMIC_ENTRY_BARRIER, the leading barrier of fully ordered atomics,
should be "sync" rather than "lwsync" if SMP=y, to guarantee fully
ordered semantics.

Therefore this patchset fixes the ordering guarantees of cmpxchg, xchg and
value-returning atomics on PPC, and implements the relaxed/acquire/release
variants based on the PPC memory model and its specific barriers. Some
trivial tests for these new variants are also included in this series;
because some of these variants are not used in the kernel for now, I think
it is a good idea to at least generate the code for these variants somewhere.

The patchset consists of 6 parts:

1.  Make value-returning atomics, futex atomics, xchg and cmpxchg fully
ordered

2.  Add trivial tests for the new variants in lib/atomic64_test.c

3.  Allow architectures to define their own __atomic_op_*() helpers
to build other variants based on relaxed.

4.  Implement atomic{,64}_{add,sub,inc,dec}_return_* variants

5.  Implement xchg_* and atomic{,64}_xchg_* variants

6.  Implement cmpxchg_* atomic{,64}_cmpxchg_* variants


This patchset is based on the current locking/core branch of the tip tree;
all patches are built and boot-tested for little-endian pseries, and also
tested by 0day.


Looking forward to any suggestions, questions and comments ;-)

Regards,
Boqun

[PATCH tip/locking/core v5 3/6] atomics: Allow architectures to define their own __atomic_op_* helpers

2015-10-26 Thread Boqun Feng
Some architectures have special barriers for acquire, release and fence
semantics, so the general memory barriers (smp_mb__*_atomic()) in the
default __atomic_op_*() helpers may be too strong; allow architectures
to define their own helpers, which override the defaults.
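
As a sketch of the intended usage (mirroring the powerpc patch later in
this series), an architecture's atomic.h can supply its own helper before
the generic header is parsed, and the #ifndef guards added below make the
generic fallbacks step aside:

/* in <arch>/include/asm/atomic.h -- placeholder path, for illustration */
#define __atomic_op_acquire(op, args...)				\
({									\
	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
	__asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");	\
	__ret;								\
})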

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 include/linux/atomic.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 27e580d..947c1dc 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -43,20 +43,29 @@ static inline int atomic_read_ctrl(const atomic_t *v)
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
  * variant is already fully ordered, no additional barriers are needed.
+ *
+ * Besides, if an arch has a special barrier for acquire/release, it could
+ * implement its own __atomic_op_* and use the same framework for building
+ * variants.
  */
+#ifndef __atomic_op_acquire
 #define __atomic_op_acquire(op, args...)   \
 ({ \
typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
smp_mb__after_atomic(); \
__ret;  \
 })
+#endif
 
+#ifndef __atomic_op_release
 #define __atomic_op_release(op, args...)   \
 ({ \
smp_mb__before_atomic();\
op##_relaxed(args); \
 })
+#endif
 
+#ifndef __atomic_op_fence
 #define __atomic_op_fence(op, args...) \
 ({ \
typeof(op##_relaxed(args)) __ret;   \
@@ -65,6 +74,7 @@ static inline int atomic_read_ctrl(const atomic_t *v)
smp_mb__after_atomic(); \
__ret;  \
 })
+#endif
 
 /* atomic_add_return_relaxed */
 #ifndef atomic_add_return_relaxed
-- 
2.6.2


Re: [PATCH tip/locking/core v5 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-10-26 Thread Boqun Feng
On Mon, Oct 26, 2015 at 05:50:52PM +0800, Boqun Feng wrote:
> This patch fixes two problems to make value-returning atomics and
> {cmp}xchg fully ordered on PPC.
> 
> According to memory-barriers.txt:
> 
> > Any atomic operation that modifies some state in memory and returns
> > information about the state (old or new) implies an SMP-conditional
> > general memory barrier (smp_mb()) on each side of the actual
> > operation ...
> 
> which means these operations should be fully ordered. However on PPC,
> PPC_ATOMIC_ENTRY_BARRIER is the barrier before the actual operation,
> which is currently "lwsync" if SMP=y. The leading "lwsync" can not
> guarantee fully ordered atomics, according to Paul Mckenney:
> 
> https://lkml.org/lkml/2015/10/14/970
> 
> To fix this, we define PPC_ATOMIC_ENTRY_BARRIER as "sync" to guarantee
> the fully-ordered semantics.
> 
> This also makes futex atomics fully ordered, which can avoid possible
> memory ordering problems if userspace code relies on futex system call
> for fully ordered semantics.
> 
> Another thing to fix is that xchg, cmpxchg and their atomic{64}_
> versions are currently RELEASE+ACQUIRE, which are not fully ordered.
> 
> So also replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> __{cmp,}xchg_{u32,u64} respectively to guarantee fully ordered semantics
> of atomic{,64}_{cmp,}xchg() and {cmp,}xchg(), as a complement of commit
> b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics").
> 
> Cc: <sta...@vger.kernel.org> # 3.4+

Hmm... I used the same Cc tag as in v4; it seems my git (2.6.2) send-email
has a weird way of composing the Cc address?

I will resend this one soon, sorry ;-(

Regards,
Boqun

> Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
> ---
> 
> Michael, I also change PPC_ATOMIC_ENTRY_BARRIER as "sync" if SMP=y in this
> version , which is different from the previous one, so request for a new ack.
> Thank you ;-)
> 
>  arch/powerpc/include/asm/cmpxchg.h | 16 
>  arch/powerpc/include/asm/synch.h   |  2 +-
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index ad6263c..d1a8d93 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
>   unsigned long prev;
>  
>   __asm__ __volatile__(
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  lwarx   %0,0,%2 \n"
>   PPC405_ERR77(0,%2)
>  "stwcx.  %3,0,%2 \n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   : "=" (prev), "+m" (*(volatile unsigned int *)p)
>   : "r" (p), "r" (val)
>   : "cc", "memory");
> @@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
>   unsigned long prev;
>  
>   __asm__ __volatile__(
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  ldarx   %0,0,%2 \n"
>   PPC405_ERR77(0,%2)
>  "stdcx.  %3,0,%2 \n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   : "=" (prev), "+m" (*(volatile unsigned long *)p)
>   : "r" (p), "r" (val)
>   : "cc", "memory");
> @@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
> old, unsigned long new)
>   unsigned int prev;
>  
>   __asm__ __volatile__ (
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  lwarx   %0,0,%2 # __cmpxchg_u32\n\
>   cmpw0,%0,%3\n\
>   bne-2f\n"
>   PPC405_ERR77(0,%2)
>  "stwcx.  %4,0,%2\n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   "\n\
>  2:"
>   : "=" (prev), "+m" (*p)
> @@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
> old, unsigned long new)
>   unsigned long prev;
>  
>   __asm__ __volatile__ (
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  ldarx   %0,0,%2 # __cmpxchg_u64\n\
>   cmpd0,%0,%3\n\
>   bne-2f\n\
>   stdcx.  %4,0,%2\n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   "\n\
>  2:"
>   : "=" (prev), "+m" (*p)
> diff --git a/arch/powerpc/incl

[PATCH RESEND tip/locking/core v5 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-10-26 Thread Boqun Feng
This patch fixes two problems to make value-returning atomics and
{cmp}xchg fully ordered on PPC.

According to memory-barriers.txt:

> Any atomic operation that modifies some state in memory and returns
> information about the state (old or new) implies an SMP-conditional
> general memory barrier (smp_mb()) on each side of the actual
> operation ...

which means these operations should be fully ordered. However on PPC,
PPC_ATOMIC_ENTRY_BARRIER is the barrier before the actual operation,
which is currently "lwsync" if SMP=y. The leading "lwsync" cannot
guarantee fully ordered atomics, according to Paul McKenney:

https://lkml.org/lkml/2015/10/14/970

To fix this, we define PPC_ATOMIC_ENTRY_BARRIER as "sync" to guarantee
the fully-ordered semantics.

This also makes futex atomics fully ordered, which can avoid possible
memory ordering problems if userspace code relies on futex system call
for fully ordered semantics.

Another thing to fix is that xchg, cmpxchg and their atomic{64}_
versions are currently RELEASE+ACQUIRE, which are not fully ordered.

So also replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee fully ordered semantics
of atomic{,64}_{cmp,}xchg() and {cmp,}xchg(), as a complement of commit
b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics").

Cc: <sta...@vger.kernel.org> # 3.4+
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---

Michael, I also changed PPC_ATOMIC_ENTRY_BARRIER to "sync" if SMP=y in this
version, which is different from the previous one, so I am requesting a new ack.
Thank you ;-)

 arch/powerpc/include/asm/cmpxchg.h | 16 
 arch/powerpc/include/asm/synch.h   |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..d1a8d93 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
old, unsigned long new)
unsigned int prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 # __cmpxchg_u32\n\
cmpw0,%0,%3\n\
bne-2f\n"
PPC405_ERR77(0,%2)
 "  stwcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
@@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
old, unsigned long new)
unsigned long prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 # __cmpxchg_u64\n\
cmpd0,%0,%3\n\
bne-2f\n\
stdcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index e682a71..c508686 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -44,7 +44,7 @@ static inline void isync(void)
MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
 #define PPC_ACQUIRE_BARRIER "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER stringify_in_c(LWSYNC) "\n"
-#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(LWSYNC) "\n"
+#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(sync) "\n"
 #define PPC_ATOMIC_EXIT_BARRIER "\n" stringify_in_c(sync) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
-- 
2.6.2


[PATCH tip/locking/core v5 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-10-26 Thread Boqun Feng
This patch fixes two problems to make value-returning atomics and
{cmp}xchg fully ordered on PPC.

According to memory-barriers.txt:

> Any atomic operation that modifies some state in memory and returns
> information about the state (old or new) implies an SMP-conditional
> general memory barrier (smp_mb()) on each side of the actual
> operation ...

which means these operations should be fully ordered. However on PPC,
PPC_ATOMIC_ENTRY_BARRIER is the barrier before the actual operation,
which is currently "lwsync" if SMP=y. The leading "lwsync" cannot
guarantee fully ordered atomics, according to Paul McKenney:

https://lkml.org/lkml/2015/10/14/970

To fix this, we define PPC_ATOMIC_ENTRY_BARRIER as "sync" to guarantee
the fully-ordered semantics.

This also makes futex atomics fully ordered, which can avoid possible
memory ordering problems if userspace code relies on futex system call
for fully ordered semantics.

Another thing to fix is that xchg, cmpxchg and their atomic{64}_
versions are currently RELEASE+ACQUIRE, which are not fully ordered.

So also replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee fully ordered semantics
of atomic{,64}_{cmp,}xchg() and {cmp,}xchg(), as a complement of commit
b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics").

Cc: <sta...@vger.kernel.org> # 3.4+
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---

Michael, I also changed PPC_ATOMIC_ENTRY_BARRIER to "sync" if SMP=y in this
version, which is different from the previous one, so I am requesting a new ack.
Thank you ;-)

 arch/powerpc/include/asm/cmpxchg.h | 16 
 arch/powerpc/include/asm/synch.h   |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..d1a8d93 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
old, unsigned long new)
unsigned int prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 # __cmpxchg_u32\n\
cmpw0,%0,%3\n\
bne-2f\n"
PPC405_ERR77(0,%2)
 "  stwcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
@@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
old, unsigned long new)
unsigned long prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 # __cmpxchg_u64\n\
cmpd0,%0,%3\n\
bne-2f\n\
stdcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index e682a71..c508686 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -44,7 +44,7 @@ static inline void isync(void)
MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
 #define PPC_ACQUIRE_BARRIER "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER stringify_in_c(LWSYNC) "\n"
-#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(LWSYNC) "\n"
+#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(sync) "\n"
 #define PPC_ATOMIC_EXIT_BARRIER "\n" stringify_in_c(sync) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
-- 
2.6.2


[PATCH tip/locking/core v5 5/6] powerpc: atomic: Implement xchg_* and atomic{, 64}_xchg_* variants

2015-10-26 Thread Boqun Feng
Implement xchg_relaxed and atomic{,64}_xchg_relaxed; based on these
_relaxed variants, release/acquire variants and fully ordered versions
can be built.

Note that xchg_relaxed and atomic_{,64}_xchg_relaxed are not compiler
barriers.
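
As an illustration of that note (a sketch, not from the patch): the
relaxed asm carries no "memory" clobber, so the compiler is free to move
surrounding plain accesses across it; use a stronger variant or an
explicit barrier() when such ordering matters.

static int data;	/* illustrative variable */

void publish(atomic_t *slot, int val)
{
	data = val;
	/*
	 * xchg_relaxed() is not a compiler barrier: the store to 'data'
	 * above may be reordered past the exchange.  Use xchg(),
	 * xchg_release(), or an explicit barrier() if that matters.
	 */
	(void)atomic_xchg_relaxed(slot, val);
}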

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  2 ++
 arch/powerpc/include/asm/cmpxchg.h | 69 +-
 2 files changed, 32 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index f9c0c6c..2c3d4f0 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -177,6 +177,7 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * __atomic_add_unless - add unless the number is a given value
@@ -444,6 +445,7 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t 
*v)
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * atomic64_add_unless - add unless the number is a given value
diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index d1a8d93..17c7e14 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -9,21 +9,20 @@
 /*
  * Atomic exchange
  *
- * Changes the memory location '*ptr' to be val and returns
+ * Changes the memory location '*p' to be val and returns
  * the previous value stored there.
  */
+
 static __always_inline unsigned long
-__xchg_u32(volatile void *p, unsigned long val)
+__xchg_u32_local(volatile void *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -31,42 +30,34 @@ __xchg_u32(volatile void *p, unsigned long val)
return prev;
 }
 
-/*
- * Atomic exchange
- *
- * Changes the memory location '*ptr' to be val and returns
- * the previous value stored there.
- */
 static __always_inline unsigned long
-__xchg_u32_local(volatile void *p, unsigned long val)
+__xchg_u32_relaxed(u32 *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-"1:lwarx   %0,0,%2 \n"
-   PPC405_ERR77(0,%2)
-"  stwcx.  %3,0,%2 \n\
-   bne-1b"
-   : "=" (prev), "+m" (*(volatile unsigned int *)p)
+"1:lwarx   %0,0,%2\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %3,0,%2\n"
+"  bne-1b"
+   : "=" (prev), "+m" (*p)
: "r" (p), "r" (val)
-   : "cc", "memory");
+   : "cc");
 
return prev;
 }
 
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
-__xchg_u64(volatile void *p, unsigned long val)
+__xchg_u64_local(volatile void *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -75,18 +66,18 @@ __xchg_u64(volatile void *p, unsigned long val)
 }
 
 static __always_inline unsigned long
-__xchg_u64_local(volatile void *p, unsigned long val)
+__xchg_u64_relaxed(u64 *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-"1:ldarx   %0,0,%2 \n"
-   PPC405_ERR77(0,%2)
-"  stdcx.  %3,0,%2 \n\
-   bne-1b"
-   : "=" (prev), "+m" (*(volatile unsigned long *)p)
+"1:ldarx   %0,0,%2\n"
+   PPC405_ERR77(0, %2)
+"  stdcx.  %3,0,%2\n"
+"  bne-1b"
+   : "=" (prev), "+m" (*p)
: "r" (p), "r" (val)
-   : "cc", "memory");
+   : "cc");
 
return prev;
 }
@@ -99,14 +90,14 @@ __xchg_u64_local(volatile void *p, unsigned long val)
 extern void __xchg_called_with_bad_pointer(void);
 
 static __always_inline unsigned long
-__xchg(volatile void *ptr, unsigned long x, unsigned int size)
+__x

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-26 Thread Boqun Feng
On Mon, Oct 26, 2015 at 11:20:01AM +0900, Michael Ellerman wrote:
> 
> Sorry guys, these threads are so long I tend not to read them very actively :}
> 
> Looking at the system call path, the straight line path does not include any
> barriers. I can't see any hidden in macros either.
> 
> We also have an explicit sync in the switch_to() path, which suggests that we
> know system call is not a full barrier.
> 
> Also looking at the architecture, section 1.5 which talks about the
> synchronisation that occurs on system calls, defines nothing in terms of
> memory ordering, and includes a programming note which says "Unlike the
> Synchronize instruction, a context synchronizing operation does not affect the
> order in which storage accesses are performed.".
> 

Thank you, Michael. So IIUC, "sc" and "rfid" just imply an execution
barrier like "isync" rather than a memory barrier. So memory barriers
are needed if a system call needs a memory ordering guarantee.

Regards,
Boqun

> Whether that's actually how it's implemented I don't know, I'll see if I can
> find out.
> 
> cheers
> 



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-26 Thread Boqun Feng
On Mon, Oct 26, 2015 at 02:20:21PM +1100, Paul Mackerras wrote:
> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> > 
> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> 
> Right - interrupts and returns from interrupt are context
> synchronizing operations, which means they wait until all outstanding
> instructions have got to the point where they have reported any
> exceptions they're going to report, which means in turn that loads and
> stores have completed address translation.  But all of that doesn't
> imply anything about the visibility of the loads and stores.
> 
> There is a full barrier in the context switch path, but not in the
> system call entry/exit path.
> 

Thank you, Paul. That's much clearer now ;-)

Regards,
Boqun



Re: [RESEND, tip/locking/core, v5, 1/6] powerpc: atomic: Make _return atomics and *{cmp}xchg fully ordered

2015-10-26 Thread Boqun Feng
On Tue, Oct 27, 2015 at 01:33:47PM +1100, Michael Ellerman wrote:
> On Mon, 2015-26-10 at 10:15:36 UTC, Boqun Feng wrote:
> > This patch fixes two problems to make value-returning atomics and
> > {cmp}xchg fully ordered on PPC.
> 
> Hi Boqun,
> 
> Can you please split this into two patches. One that does the cmpxchg change
> and one that changes PPC_ATOMIC_ENTRY_BARRIER.
> 

OK, makes sense ;-)

> Also given how pervasive this change is I'd like to take it via the powerpc
> next tree, so can you please send this patch (which will be two after you 
> split
> it) as powerpc patches. And the rest can go via tip?
> 

One problem is that patch 5 will remove __xchg_u32 and __xchg_u64
entirely, which are modified in this patch (patch 1), so there will be
some conflicts if the two branches get merged, I think.

An alternative is for this whole series to go to the powerpc next tree,
as most of the dependent patches are already there. I just need to remove
the inc/dec related code and resend them when appropriate. Besides, I can
pull patch 2 out and send it as a tip patch, because it's general code
and nothing in this series depends on it.

To summarize:

patch 1 (split into two), 3, 4 (remove inc/dec implementation), 5, 6 sent
as powerpc patches for powerpc next; patch 2 (unmodified) sent as a tip
patch for locking/core.

Peter and Michael, this works for you both?

Regards,



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-25 Thread Boqun Feng
On Sat, Oct 24, 2015 at 07:53:56PM +0800, Boqun Feng wrote:
> On Sat, Oct 24, 2015 at 12:26:27PM +0200, Peter Zijlstra wrote:
> > 
> > Right, futexes are a pain; and I think we all agreed we didn't want to
> > go rely on implementation details unless we absolutely _have_ to.
> > 
> 
> Agreed.
> 
> Besides, after I have read why futex_wake_op(the caller of
> futex_atomic_op_inuser()) is introduced, I think your worries are quite
> reasonable. I thought the futex_atomic_op_inuser() only operated on
> futex related variables, but it turns out it can actually operate any
> userspace variable if userspace code likes, therefore we don't have
> control of all memory ordering guarantee of the variable. So if PPC
> doesn't provide a full barrier at user<->kernel boundries, we should
> make futex_atomic_op_inuser() fully ordered.
> 
> 
> Still looking into futex_atomic_cmpxchg_inatomic() ...
> 

I thought that the futex-related variables (the userspace variables that
the first parameter of the futex system call points to) are only accessed
via the futex system call in userspace, but that turns out not to be the
case. So the memory ordering guarantees of these variables are also out
of the kernel's control. Therefore we should make
futex_atomic_cmpxchg_inatomic() fully ordered, of course, if PPC doesn't
provide a full barrier at user<->kernel boundaries.

Regards
Boqun




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-25 Thread Boqun Feng
On Wed, Oct 21, 2015 at 12:36:38PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> 
> And it is entirely possible that there is some similar operation
> somewhere in the powerpc entry/exit code.  I would not trust myself
> to recognize it, though.
> 
> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> > 
> > But yes, if Michael or Ben could clarify this it would be good.
> > 

Michael and Ben, ping for this, thank you ;-)

Regards,
Boqun

> > Back then I talked to Ralf about what MIPS says on this, and MIPS arch
> > spec is entirely quiet on this, it allows implementations full freedom
> > IIRC.
> 
> :-) ;-) ;-)
> 
> > 
> 
>   Thanx, Paul
> 



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-24 Thread Boqun Feng
On Sat, Oct 24, 2015 at 12:26:27PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 08:07:16PM +0800, Boqun Feng wrote:
> > On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > > > 
> > > > > > > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > > > So a full barrier on one side of these operations is enough, I think.
> > > > > IOW, there is no need to strengthen these operations.
> > > > 
> > > > Do we need to also worry about other futex use cases?
> > > 
> > > Worry, always!
> > > 
> > > But yes, there is one more specific usecase, which is that of a
> > > condition variable.
> > > 
> > > When we go sleep on a futex, we might want to assume visibility of the
> > > stores done by the thread that woke us by the time we wake up.
> > > 
> > 
> > But the thing is futex atomics in PPC are already RELEASE(pc)+ACQUIRE
> > and imply a full barrier, is an RELEASE(sc) semantics really needed
> > here?
> 
> For this, no, the current code should be fine I think.
> 
> > Further more, is this condition variable visibility guaranteed by other
> > part of futex? Because in futex_wake_op:
> > 
> > futex_wake_op()
> >   ...
> >   double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
  wake_up_q(&wake_q);
> > 
> > and in futex_wait():
> > 
> > futex_wait()
> >   ...
  futex_wait_queue_me(hb, &q, to); <- schedule() here
> >   ...
> >   unqueue_me()
> > drop_futex_key_refs(>key);
> > iput()/mmdrop(); <- a full barrier
> >   
> > 
> > The RELEASE(pc) barrier pairs with the full barrier, therefore the
> > userspace wakee can observe the condition variable modification.
> 
> Right, futexes are a pain; and I think we all agreed we didn't want to
> go rely on implementation details unless we absolutely _have_ to.
> 

Agreed.

Besides, after reading why futex_wake_op() (the caller of
futex_atomic_op_inuser()) was introduced, I think your worries are quite
reasonable. I thought futex_atomic_op_inuser() only operated on
futex-related variables, but it turns out it can actually operate on any
userspace variable userspace code likes; therefore we don't have control
over all the memory ordering guarantees of the variable. So if PPC
doesn't provide a full barrier at user<->kernel boundaries, we should
make futex_atomic_op_inuser() fully ordered.


Still looking into futex_atomic_cmpxchg_inatomic() ...

> > > And.. aside from the thoughts I outlined in the email referenced above,
> > > there is always the chance people accidentally rely on the strong
> > > ordering on their x86 CPU and find things come apart when ran on their
> > > ARM/MIPS/etc..
> > > 
> > > There are a fair number of people who use the raw futex call and we have
> > > 0 visibility into many of them. The assumed and accidental ordering
> > > guarantees will forever remain a mystery.
> > > 
> > 
> > Understood. That's truely a potential problem. Considering not all the
> > architectures imply a full barrier at user<->kernel boundries, maybe we
> > can use one bit in the opcode of the futex system call to indicate
> > whether userspace treats futex as fully ordered. Like:
> > 
> > #define FUTEX_ORDER_SEQ_CST  0
> > #define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)
> 
> Not unless there's an actual performance problem with any of this.
> Futexes are painful enough as is.

Makes sense, and we still have options like modifying the userspace code
if there actually is a performance problem ;-)

Regards,
Boqun



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-22 Thread Boqun Feng
On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > I ask this because I recall Peter once brought up a discussion:
> > > > > > 
> > > > > > https://lkml.org/lkml/2015/8/26/596
> 
> > > So a full barrier on one side of these operations is enough, I think.
> > > IOW, there is no need to strengthen these operations.
> > 
> > Do we need to also worry about other futex use cases?
> 
> Worry, always!
> 
> But yes, there is one more specific usecase, which is that of a
> condition variable.
> 
> When we go sleep on a futex, we might want to assume visibility of the
> stores done by the thread that woke us by the time we wake up.
> 

But the thing is futex atomics in PPC are already RELEASE(pc)+ACQUIRE
and imply a full barrier, is an RELEASE(sc) semantics really needed
here?

Further more, is this condition variable visibility guaranteed by other
part of futex? Because in futex_wake_op:

futex_wake_op()
  ...
  double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
  wake_up_q(&wake_q);

and in futex_wait():

futex_wait()
  ...
  futex_wait_queue_me(hb, &q, to); <- schedule() here
  ...
  unqueue_me()
drop_futex_key_refs(&q->key);
iput()/mmdrop(); <- a full barrier
  

The RELEASE(pc) barrier pairs with the full barrier, therefore the
userspace wakee can observe the condition variable modification.

> 
> 
> And.. aside from the thoughts I outlined in the email referenced above,
> there is always the chance people accidentally rely on the strong
> ordering on their x86 CPU and find things come apart when ran on their
> ARM/MIPS/etc..
> 
> There are a fair number of people who use the raw futex call and we have
> 0 visibility into many of them. The assumed and accidental ordering
> guarantees will forever remain a mystery.
> 

Understood. That's truly a potential problem. Considering that not all
architectures imply a full barrier at user<->kernel boundaries, maybe we
can use one bit in the opcode of the futex system call to indicate
whether userspace treats futex as fully ordered. Like:

#define FUTEX_ORDER_SEQ_CST  0
#define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)

Therefore all existing code will run with a strong ordering version of
futex(of course, we need to check and modify kernel code first to
guarantee that), and if userspace code uses FUTEX_ORDER_RELAXED, it must
not rely on futex() for the strong ordering, and should add memory
barriers itself if necessary.

Regards,
Boqun



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Boqun Feng
On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> > > > 
> > > > Of course, if I am not missing something, then this applies also to the
> > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > If so, it would seem that I didn't think through all the possibilities
> > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > but not as the load/store itself.  :-/
> > > > 
> > > 
> > > Paul, I know this may be difficult, but could you recall why the
> > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > 
> > > I did some search, but couldn't find the discussion of that patch.
> > > 
> > > I ask this because I recall Peter once brought up a discussion:
> > > 
> > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > Peter's conclusion seems to be that we could(though didn't want to) live
> > > with futex atomics not being full barriers.
> 
> I have heard of user-level applications relying on unlock-lock being a
> full barrier.  So paranoia would argue for the full barrier.
> 

Understood.

So a full barrier on one side of these operations is enough, I think.
IOW, there is no need to strengthen these operations.

> > > Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
> > > I make PPC_ATOMIC_ENTRY_BARRIER be "sync", it will also strengthen
> > > the futex atomics; I just wonder whether such strengthening is a -fix- or
> > > not, considering that I want this patch to go to the -stable tree.
> > 
> > So Linus' argued that since we only need to order against user accesses
> > (true) and priv changes typically imply strong barriers (open) we might
> > want to allow archs to rely on those instead of mandating they have
> > explicit barriers in the futex primitives.
> > 
> > And I indeed forgot to follow up on that discussion.
> > 
> > So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> > not critical to the futex atomic implementations what extra barriers are
> > added.
> > 
> > If not; then strengthening the futex ops is indeed (probably) a good
> > thing :-)

Peter, that's probably a good thing, but I'm not that familiar with
futex right now, so I won't touch that part in this series unless it
turns out to be necessary.

Regards,
Boqun

> 
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.
> 
>   Thanx, Paul
> 



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-20 Thread Boqun Feng
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 
> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.
> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 

Paul, I know this may be difficult, but could you recall why the
__futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?

I did some search, but couldn't find the discussion of that patch.

I ask this because I recall Peter once brought up a discussion:

https://lkml.org/lkml/2015/8/26/596

Peter's conclusion seems to be that we could(though didn't want to) live
with futex atomics not being full barriers.


Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
I make PPC_ATOMIC_ENTRY_BARRIER be "sync", it will also strengthen
the futex atomics; I just wonder whether such strengthening is a -fix- or
not, considering that I want this patch to go to the -stable tree.


Of course, in the meanwhile of waiting for your answer, I will try to
figure out this by myself ;-)

Regards,
Boqun



Re: [PATCH v2] barriers: introduce smp_mb__release_acquire and update documentation

2015-10-20 Thread Boqun Feng
On Mon, Oct 19, 2015 at 12:23:24PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote:
> > This is confusing me right now. ;-)
> > 
> > Let's use a simple example for only one primitive, as I understand it,
> > if we say a primitive A is "fully ordered", we actually mean:
> > 
> > 1.  The memory operations preceding(in program order) A can't be
> > reordered after the memory operations following(in PO) A.
> > 
> > and
> > 
> > 2.  The memory operation(s) in A can't be reordered before the
> > memory operations preceding(in PO) A and after the memory
> > operations following(in PO) A.
> > 
> > If we say A is a "full barrier", we actually means:
> > 
> > 1.  The memory operations preceding(in program order) A can't be
> > reordered after the memory operations following(in PO) A.
> > 
> > and
> > 
> > 2.  The memory ordering guarantee in #1 is visible globally.
> > 
> > Is that correct? Or is "full barrier" stronger than I understand,
> > i.e. there is a third property of "full barrier":
> > 
> > 3.  The memory operation(s) in A can't be reordered before the
> > memory operations preceding(in PO) A and after the memory
> > operations following(in PO) A.
> > 
> > IOW, is "full barrier" a stronger version of "fully ordered" or not?
> 
> Yes, that was how I used it.
> 
> Now of course; the big question is do we want to promote this usage or
> come up with a different set of words describing this stuff.
> 
> I think separating the ordering from the transitivity is useful, for we
> can then talk about and specify them independently.
> 

Great idea! 

> That is, we can say:
> 
>   LOAD-ACQUIRE: orders LOAD->{LOAD,STORE}
> weak transitivity (RCpc)
> 
>   MB: orders {LOAD,STORE}->{LOAD,STORE} (fully ordered)
>   strong transitivity (RCsc)
> 

It would be helpful to have this kind of description for each
primitive mentioned in memory-barriers.txt, which, IMO, is better than
a description like the following:

"""
Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
"""

I'm assuming that the arrow "->" stands for program order, and that the
word "orders" means a primitive guarantees that some program order becomes
memory-operation order, so that the description above can be
rewritten as:

value-returning atomics:
orders {LOAD,STORE}->RmW(atomic operation)->{LOAD,STORE}
strong transitivity

much simpler and clearer for discussion and reasoning
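
And just to check my own understanding of the notation, here is a minimal
C11 message-passing sketch (my own example with made-up names, nothing
official) of what "orders LOAD->{LOAD,STORE}" buys you on the acquire
side:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;
static atomic_int flag;

static void *producer(void *arg)
{
	payload = 42;					/* plain store */
	atomic_store_explicit(&flag, 1, memory_order_release);
	return arg;
}

static void *consumer(void *arg)
{
	/* the acquire load is ordered before the later payload load */
	while (!atomic_load_explicit(&flag, memory_order_acquire))
		;
	assert(payload == 42);
	return arg;
}

int main(void)
{
	pthread_t p, c;

	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return 0;
}

Whether a chain of such pairs also orders things for a third, uninvolved
CPU is then exactly the separate weak-vs-strong transitivity question.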

Regards,
Boqun

> etc..
> 
> Also, in the above I used weak and strong transitivity, but that too is
> of course up for grabs.



Re: [PATCH v2] barriers: introduce smp_mb__release_acquire and update documentation

2015-10-20 Thread Boqun Feng
On Mon, Oct 12, 2015 at 04:30:48PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 09, 2015 at 07:33:28PM +0100, Will Deacon wrote:
> > On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
[snip]
> 
> > > > We could also include a link to the ppcmem/herd web frontends and your
> > > > lwn.net article. (ppcmem is already linked, but it's not obvious that
> > > > you can run litmus tests in your browser).
> > > 
> > > I bet that the URLs for the web frontends are not stable long term.
> > > Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly
> > > number of years, but professors do occasionally move from one institution
> > > to another.  For but one example, Susmit Sarkar is now at University
> > > of St. Andrews rather than at Cambridge.
> > > 
> > > So to make this work, we probably need to be thinking in terms of
> > > asking the researchers for permission to include their ocaml code in the
> > > Linux-kernel source tree.  I would be strongly in favor of this, actually.
> > > 
> > > Thoughts?
> > 
> > I'm extremely hesitant to import a bunch of dubiously licensed, academic
> > ocaml code into the kernel. Even if we did, who would maintain it?
> > 
> > A better solution might be to host a mirror of the code on kernel.org,
> > along with a web front-end for people to play with (the tests we're talking
> > about here do seem to run ok in my browser).
> 
> I am not too worried about how this happens, but we should avoid
> constraining the work of our academic partners.  The reason I was thinking
> in terms of in the kernel was to avoid version-synchronization issues.
> "Wait, this is Linux kernel v4.17, which means that you need to use
> version 8.3.5.1 of the tooling...  And with these four patches as well."
> 

Maybe including only the models' code (arm.cat, ppc.cat, etc.) in the
kernel, rather than the whole code base, could also solve the
version-synchronization problem to some degree, and avoid maintaining the
whole tool? I'm assuming that modifying the verifier's code, as opposed to
the models' code, is unlikely to change the result of a litmus test.

Regards,
Boqun



Re: [PATCH v2] barriers: introduce smp_mb__release_acquire and update documentation

2015-10-18 Thread Boqun Feng
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote:
[snip]
> > 
> > So lots of little confusions added up to complete fail :-{
> > 
> > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I
> > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are
> > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but
> > again not against uninvolved CPUs).
> > 
> > Which leads me to think I would like to suggest alternative rules for
> > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are
> > partly responsible for my confusion).
> 
> Yeah, sorry. I originally used the phrase "fully ordered" but changed it
> to "full barrier", which has stronger transitivity (newly understood
> definition) requirements that I didn't intend.
> 
> RELEASE -> ACQUIRE should be used for message passing between two CPUs
> and not have ordering effects on other observers unless they're part of
> the RELEASE -> ACQUIRE chain.
> 
> >  - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when
> >they operate on the same variable and the ACQUIRE reads from the
> >RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity.
> 
> Are we explicit about the difference between "fully ordered" and "full
> barrier" somewhere else, because this looks like it will confuse people.
> 

This is confusing me right now. ;-)

Let's use a simple example for only one primitive, as I understand it,
if we say a primitive A is "fully ordered", we actually mean:

1.  The memory operations preceding(in program order) A can't be
reordered after the memory operations following(in PO) A.

and

2.  The memory operation(s) in A can't be reordered before the
memory operations preceding(in PO) A and after the memory
operations following(in PO) A.

If we say A is a "full barrier", we actually means:

1.  The memory operations preceding(in program order) A can't be
reordered after the memory operations following(in PO) A.

and

2.  The memory ordering guarantee in #1 is visible globally.

Is that correct? Or is "full barrier" stronger than I understand,
i.e. there is a third property of "full barrier":

3.  The memory operation(s) in A can't be reordered before the
memory operations preceding(in PO) A and after the memory
operations following(in PO) A.

IOW, is "full barrier" a stronger version of "fully ordered" or not?
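
To put my question in (hypothetical) code form -- names and the C11
mapping are mine, and I'm deliberately not claiming what any particular
primitive guarantees here -- the difference would show up once a third,
uninvolved observer is added:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r1, r2, r3;

/* stand-in for the primitive "A" under discussion */
#define BARRIER_A() atomic_thread_fence(memory_order_seq_cst)

static void *t0(void *arg)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return arg;
}

static void *t1(void *arg)
{
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	BARRIER_A();	/* orders the load of x before the store to y on t1 */
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	return arg;
}

static void *t2(void *arg)
{
	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	r3 = atomic_load_explicit(&x, memory_order_relaxed);
	return arg;
}

int main(void)
{
	pthread_t a, b, c;

	pthread_create(&a, NULL, t0, NULL);
	pthread_create(&b, NULL, t1, NULL);
	pthread_create(&c, NULL, t2, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_join(c, NULL);

	/* the interesting outcome: r1 == 1 && r2 == 1 && r3 == 0 */
	printf("r1=%d r2=%d r3=%d\n", r1, r2, r3);
	return 0;
}

If A is only "fully ordered" in the #1/#3 sense, nothing is said about
whether t2 must also see x == 1; a "full barrier" in the #2 sense (the
ordering is visible globally) is meant to rule that outcome out.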

Regards,
Boqun

> >  - RELEASE -> ACQUIRE can be upgraded to a full barrier (including
> >transitivity) using smp_mb__release_acquire(), either before RELEASE
> >or after ACQUIRE (but consistently [*]).
> 
> Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This
> is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently
> using (for PPC only).
> 
> Stepping back a second, I believe that there are three cases:
> 
> 
>  RELEASE X -> ACQUIRE Y (same CPU)
>* Needs a barrier on TSO architectures for full ordering
> 
>  UNLOCK X -> LOCK Y (same CPU)
>* Needs a barrier on PPC for full ordering
> 
>  RELEASE X -> ACQUIRE X (different CPUs)
>  UNLOCK X -> ACQUIRE X (different CPUs)
>* Fully ordered everywhere...
>* ... but needs a barrier on PPC to become a full barrier
> 
> 



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-18 Thread Boqun Feng
On Thu, Oct 15, 2015 at 09:30:40AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 15, 2015 at 12:48:03PM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
[snip]
> > 
> > > Why not try creating a longer litmus test that requires P0's write to
> > > "a" to propagate to P1 before both processes complete?
> > > 
> > 
> > I will try to write one, but to be clear, you mean we still observe 
> > 
> > 0:r3 == 0 && a == 2 && 1:r3 == 0 
> > 
> > at the end, right? Because I understand that if P1's write to 'a'
> > doesn't override P0's, P0's write to 'a' will propagate.
> 
> Your choice.  My question is whether you can come up with a similar
> litmus test where lwsync is allowing the behavior here, but clearly
> is affecting some other aspect of ordering.
> 

Got it, though my question about the propagation of P0's write to 'a'
was originally aimed at understanding the hardware behavior (or model) in
your sequence of events ;-)

To be clear, by "some other aspect of ordering", you mean something like
a paired RELEASE+ACQUIRE scenario (i.e. P1 observes P0's write to 'a' via
a load, which means P0's write to 'a' propagates at some point), right?

If so, I haven't yet come up with one, and I think there's probably none,
so my worry about "lwsync" in other places is likely unnecessary.

Regards,
Boqun



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Boqun Feng
On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> 
> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?
> 

I ran some litmus tests and put the results here. My understanding might be
wrong, and I think Paul can explain lwsync and store->store ordering
better ;-)


When a store->lwsync->store pairs with load->lwsync->load, according to
herd, YES.

PPC W+lwsync+W-R+lwsync+R
"
  2015-10-15
  herd said (1:r1=0 /\ 1:r2=2) doesn't exist,
  so if P1 observes the write to 'b', it must also observe P0's write
  to 'a'
"
{
0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
1:r1=0; 1:r2=0; 1:r12=a; 1:r13=b;
}

 P0  | P1 ;
 stw r1, 0(r12)  | lwz r2, 0(r13) ;
 lwsync  | lwsync ;
 stw r2, 0(r13)  | lwz r1, 0(r12) ;

exists
(1:r1=0 /\ 1:r2=2)


If observation also includes "a write on one CPU -overriding- another
write on another CPU", then

when a store->lwsync->store pairs(?) with store->sync->load, according
to herd, NO(?).

PPC W+lwsync+W-W+sync+R
"
  2015-10-15
  herd said (1:r1=0 /\ b=3) exists sometimes,
  so if P1 observes P0's write to 'b' (by 'overriding' this write to
  'b'), it may not observe P0's write to 'a'.
"
{
0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
1:r1=0; 1:r2=3; 1:r12=a; 1:r13=b;
}

 P0  | P1 ;
 stw r1, 0(r12)  | stw r2, 0(r13) ;
 lwsync  | sync ;
 stw r2, 0(r13)  | lwz r1, 0(r12) ;

exists
(1:r1=0 /\ b=3)


Regards,
Boqun



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Boqun Feng
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > versions all need to imply a full barrier, however they are now just
> > RELEASE+ACQUIRE, which is not a full barrier.
> > 
> > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > 
> > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > atomic_xxx_return barrier semantics").
> > 
> > Acked-by: Michael Ellerman <m...@ellerman.id.au>
> > Cc: <sta...@vger.kernel.org> # 3.4+
> > Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
> > ---
> >  arch/powerpc/include/asm/cmpxchg.h | 16 
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> > b/arch/powerpc/include/asm/cmpxchg.h
> > index ad6263c..d1a8d93 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> > unsigned long prev;
> > 
> > __asm__ __volatile__(
> > -   PPC_RELEASE_BARRIER
> > +   PPC_ATOMIC_ENTRY_BARRIER
> 
> This looks to be the lwsync instruction.
> 
> >  "1:lwarx   %0,0,%2 \n"
> > PPC405_ERR77(0,%2)
> >  "  stwcx.  %3,0,%2 \n\
> > bne-1b"
> > -   PPC_ACQUIRE_BARRIER
> > +   PPC_ATOMIC_EXIT_BARRIER
> 
> And this looks to be the sync instruction.
> 
> > : "=&r" (prev), "+m" (*(volatile unsigned int *)p)
> > : "r" (p), "r" (val)
> > : "cc", "memory");
> 
> Hmmm...
> 
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
>   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
>stwcx. r1,r10,r12  | ;
>bne Fail0  | ;
>mr r3,r11  | ;
>Fail0: | ;
>   exists
>   (0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 

If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
version of this patch; any concerns?

Of course, I will wait to do that until we all understand this is
necessary and agree to make the change.

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.

For the value-returning RMW atomics, if the leading barrier needs to be
"sync", I will just remove my __atomic_op_fence() in patch 4, but I will
keep patch 3 unchanged for the consistency of the __atomic_op_*() macros'
definitions. Peter and Will, does that work for you both?

Regards,
Boqun

> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 
>   Thanx, Paul
> 



[PATCH tip/locking/core v4 4/6] powerpc: atomic: Implement atomic{, 64}_*_return_* variants

2015-10-14 Thread Boqun Feng
On powerpc, acquire and release semantics can be achieved with
lightweight barriers ("lwsync" and "ctrl+isync"), which can be used to
implement __atomic_op_{acquire,release}.

For release semantics, since we only need to ensure that all memory
accesses issued beforehand take effect before the -store- part of the
atomic, "lwsync" is all we need. On platforms without "lwsync", "sync"
should be used. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is all we need, for a similar reason.
However, on platforms without "lwsync", we can use "isync" rather than
"sync" as an acquire barrier. Therefore, in __atomic_op_acquire() we use
PPC_ACQUIRE_BARRIER, which is barrier() on UP, "lwsync" if available and
"isync" otherwise.

__atomic_op_fence is defined as smp_lwsync() + _relaxed +
smp_mb__after_atomic() to guarantee a full barrier.

Implement atomic{,64}_{add,sub,inc,dec}_return_relaxed, and build other
variants with these helpers.
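
Purely as an illustration of the shape of these helpers, a rough C11
analogue (my own sketch, with made-up names; the fence-to-PPC-barrier
mapping in the comments is only approximate, and none of this is part of
the patch):

#include <stdatomic.h>

int add_return_relaxed(atomic_int *v, int a)
{
	/* the bare RMW, no barriers at all */
	return atomic_fetch_add_explicit(v, a, memory_order_relaxed) + a;
}

int add_return_acquire(atomic_int *v, int a)
{
	int ret = add_return_relaxed(v, a);

	/* PPC_ACQUIRE_BARRIER: lwsync, or isync where lwsync is unavailable */
	atomic_thread_fence(memory_order_acquire);
	return ret;
}

int add_return_release(atomic_int *v, int a)
{
	/* smp_lwsync() */
	atomic_thread_fence(memory_order_release);
	return add_return_relaxed(v, a);
}

int add_return(atomic_int *v, int a)
{
	int ret;

	/* smp_lwsync() before, smp_mb__after_atomic() after: fully ordered */
	atomic_thread_fence(memory_order_release);
	ret = add_return_relaxed(v, a);
	atomic_thread_fence(memory_order_seq_cst);
	return ret;
}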

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 116 --
 1 file changed, 74 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 55f106e..ab76461 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,33 @@
 
 #define ATOMIC_INIT(i) { (i) }
 
+/*
+ * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
+ * on the platform without lwsync.
+ */
+#define __atomic_op_acquire(op, args...)   \
+({ \
+   typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
+   __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");\
+   __ret;  \
+})
+
+#define __atomic_op_release(op, args...)   \
+({ \
+   smp_lwsync();   \
+   op##_relaxed(args); \
+})
+
+#define __atomic_op_fence(op, args...) \
+({ \
+   typeof(op##_relaxed(args)) __ret;   \
+   smp_lwsync();   \
+   __ret = op##_relaxed(args); \
+   smp_mb__after_atomic(); \
+   __ret;  \
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
int t;
@@ -42,27 +69,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)  
\
: "cc");\
 }  \
 
-#define ATOMIC_OP_RETURN(op, asm_op)   \
-static __inline__ int atomic_##op##_return(int a, atomic_t *v) \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
-   PPC_ATOMIC_ENTRY_BARRIER\
-"1:lwarx   %0,0,%2 # atomic_" #op "_return\n"  \
-   #asm_op " %0,%1,%0\n"   \
-   PPC405_ERR77(0,%2)  \
-"  stwcx.  %0,0,%2 \n" \
+"1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
+   #asm_op " %0,%2,%0\n"   \
+   PPC405_ERR77(0, %3) \
+"  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
-   PPC_ATOMIC_EXIT_BARRIER \
-   : "=&r" (t) \
+   : "=&r" (t), "+m" (v->counter)  \
: "r" (a), &

[PATCH tip/locking/core v4 3/6] atomics: Allow architectures to define their own __atomic_op_* helpers

2015-10-14 Thread Boqun Feng
Some architectures may have their own special barriers for acquire, release
and fence semantics, so the general memory barriers (smp_mb__*_atomic())
in the default __atomic_op_*() helpers may be too strong; allow architectures
to define their own helpers which can override the default ones.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 include/linux/atomic.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 27e580d..947c1dc 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -43,20 +43,29 @@ static inline int atomic_read_ctrl(const atomic_t *v)
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
  * variant is already fully ordered, no additional barriers are needed.
+ *
+ * Besides, if an arch has a special barrier for acquire/release, it could
+ * implement its own __atomic_op_* and use the same framework for building
+ * variants
  */
+#ifndef __atomic_op_acquire
 #define __atomic_op_acquire(op, args...)   \
 ({ \
typeof(op##_relaxed(args)) __ret  = op##_relaxed(args); \
smp_mb__after_atomic(); \
__ret;  \
 })
+#endif
 
+#ifndef __atomic_op_release
 #define __atomic_op_release(op, args...)   \
 ({ \
smp_mb__before_atomic();\
op##_relaxed(args); \
 })
+#endif
 
+#ifndef __atomic_op_fence
 #define __atomic_op_fence(op, args...) \
 ({ \
typeof(op##_relaxed(args)) __ret;   \
@@ -65,6 +74,7 @@ static inline int atomic_read_ctrl(const atomic_t *v)
smp_mb__after_atomic(); \
__ret;  \
 })
+#endif
 
 /* atomic_add_return_relaxed */
 #ifndef atomic_add_return_relaxed
-- 
2.5.3


[PATCH tip/locking/core v4 5/6] powerpc: atomic: Implement xchg_* and atomic{, 64}_xchg_* variants

2015-10-14 Thread Boqun Feng
Implement xchg_relaxed and atomic{,64}_xchg_relaxed; based on these
_relaxed variants, release/acquire variants and fully ordered versions
can be built.

Note that xchg_relaxed and atomic_{,64}_xchg_relaxed are not compiler
barriers.
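
A tiny made-up example of what "not a compiler barrier" means in practice
(userspace C11, names are mine):

#include <stdatomic.h>

int done;
atomic_int slot;

void publish(int v)
{
	done = 1;	/* plain store */

	/* relaxed RMW with no "memory" clobber: the compiler is free to
	 * move the plain store above across it, unlike with the fully
	 * ordered xchg() */
	atomic_exchange_explicit(&slot, v, memory_order_relaxed);
}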

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  2 ++
 arch/powerpc/include/asm/cmpxchg.h | 69 +-
 2 files changed, 32 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index ab76461..1e9d526 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -186,6 +186,7 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * __atomic_add_unless - add unless the number is a given value
@@ -453,6 +454,7 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t 
*v)
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * atomic64_add_unless - add unless the number is a given value
diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index d1a8d93..17c7e14 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -9,21 +9,20 @@
 /*
  * Atomic exchange
  *
- * Changes the memory location '*ptr' to be val and returns
+ * Changes the memory location '*p' to be val and returns
  * the previous value stored there.
  */
+
 static __always_inline unsigned long
-__xchg_u32(volatile void *p, unsigned long val)
+__xchg_u32_local(volatile void *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ATOMIC_EXIT_BARRIER
: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -31,42 +30,34 @@ __xchg_u32(volatile void *p, unsigned long val)
return prev;
 }
 
-/*
- * Atomic exchange
- *
- * Changes the memory location '*ptr' to be val and returns
- * the previous value stored there.
- */
 static __always_inline unsigned long
-__xchg_u32_local(volatile void *p, unsigned long val)
+__xchg_u32_relaxed(u32 *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-"1:lwarx   %0,0,%2 \n"
-   PPC405_ERR77(0,%2)
-"  stwcx.  %3,0,%2 \n\
-   bne-1b"
-   : "=&r" (prev), "+m" (*(volatile unsigned int *)p)
+"1:lwarx   %0,0,%2\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %3,0,%2\n"
+"  bne-1b"
+   : "=&r" (prev), "+m" (*p)
: "r" (p), "r" (val)
-   : "cc", "memory");
+   : "cc");
 
return prev;
 }
 
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
-__xchg_u64(volatile void *p, unsigned long val)
+__xchg_u64_local(volatile void *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ATOMIC_EXIT_BARRIER
: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -75,18 +66,18 @@ __xchg_u64(volatile void *p, unsigned long val)
 }
 
 static __always_inline unsigned long
-__xchg_u64_local(volatile void *p, unsigned long val)
+__xchg_u64_relaxed(u64 *p, unsigned long val)
 {
unsigned long prev;
 
__asm__ __volatile__(
-"1:ldarx   %0,0,%2 \n"
-   PPC405_ERR77(0,%2)
-"  stdcx.  %3,0,%2 \n\
-   bne-1b"
-   : "=&r" (prev), "+m" (*(volatile unsigned long *)p)
+"1:ldarx   %0,0,%2\n"
+   PPC405_ERR77(0, %2)
+"  stdcx.  %3,0,%2\n"
+"  bne-1b"
+   : "=&r" (prev), "+m" (*p)
: "r" (p), "r" (val)
-   : "cc", "memory");
+   : "cc");
 
return prev;
 }
@@ -99,14 +90,14 @@ __xchg_u64_local(volatile void *p, unsigned long val)
 extern void __xchg_called_with_bad_pointer(void);
 
 static __always_inline unsigned long
-__xchg(volatile void *ptr, unsigned long x, unsigned int size)
+__x

[PATCH tip/locking/core v4 2/6] atomics: Add test for atomic operations with _relaxed variants

2015-10-14 Thread Boqun Feng
Some atomic operations now have _relaxed/acquire/release variants; this
patch adds some trivial tests for two purposes:

1.  test the behavior of these new operations in single-CPU
environment.

2.  make their code generated before we actually use them somewhere,
so that we can examine their assembly code.
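
For reference, the same stamp-out-every-suffix idea in a self-contained
userspace form (all names below are mine, purely illustrative of the
FAMILY_TEST pattern used in the patch):

#include <assert.h>
#include <stdatomic.h>

#define my_xchg(p, v)		atomic_exchange_explicit((p), (v), memory_order_seq_cst)
#define my_xchg_acquire(p, v)	atomic_exchange_explicit((p), (v), memory_order_acquire)
#define my_xchg_release(p, v)	atomic_exchange_explicit((p), (v), memory_order_release)
#define my_xchg_relaxed(p, v)	atomic_exchange_explicit((p), (v), memory_order_relaxed)

#define TEST_XCHG(op, init, new)			\
do {							\
	atomic_int v = (init);				\
	assert(op(&v, (new)) == (init));		\
	assert(atomic_load(&v) == (new));		\
} while (0)

#define XCHG_FAMILY(init, new)				\
do {							\
	TEST_XCHG(my_xchg, init, new);			\
	TEST_XCHG(my_xchg_acquire, init, new);		\
	TEST_XCHG(my_xchg_release, init, new);		\
	TEST_XCHG(my_xchg_relaxed, init, new);		\
} while (0)

int main(void)
{
	XCHG_FAMILY(100, 200);
	return 0;
}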

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 lib/atomic64_test.c | 120 ++--
 1 file changed, 79 insertions(+), 41 deletions(-)

diff --git a/lib/atomic64_test.c b/lib/atomic64_test.c
index 83c33a5b..18e422b 100644
--- a/lib/atomic64_test.c
+++ b/lib/atomic64_test.c
@@ -27,6 +27,65 @@ do { 
\
(unsigned long long)r); \
 } while (0)
 
+/*
+ * Test for a atomic operation family,
+ * @test should be a macro accepting parameters (bit, op, ...)
+ */
+
+#define FAMILY_TEST(test, bit, op, args...)\
+do {   \
+   test(bit, op, ##args);  \
+   test(bit, op##_acquire, ##args);\
+   test(bit, op##_release, ##args);\
+   test(bit, op##_relaxed, ##args);\
+} while (0)
+
+#define TEST_RETURN(bit, op, c_op, val)\
+do {   \
+   atomic##bit##_set(&v, v0);  \
+   r = v0; \
+   r c_op val; \
+   BUG_ON(atomic##bit##_##op(val, &v) != r);   \
+   BUG_ON(atomic##bit##_read(&v) != r);\
+} while (0)
+
+#define RETURN_FAMILY_TEST(bit, op, c_op, val) \
+do {   \
+   FAMILY_TEST(TEST_RETURN, bit, op, c_op, val);   \
+} while (0)
+
+#define TEST_ARGS(bit, op, init, ret, expect, args...) \
+do {   \
+   atomic##bit##_set(&v, init);\
+   BUG_ON(atomic##bit##_##op(&v, ##args) != ret);  \
+   BUG_ON(atomic##bit##_read(&v) != expect);   \
+} while (0)
+
+#define XCHG_FAMILY_TEST(bit, init, new)   \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, xchg, init, init, new, new);\
+} while (0)
+
+#define CMPXCHG_FAMILY_TEST(bit, init, new, wrong) \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, cmpxchg,\
+   init, init, new, init, new);\
+   FAMILY_TEST(TEST_ARGS, bit, cmpxchg,\
+   init, init, init, wrong, new);  \
+} while (0)
+
+#define INC_RETURN_FAMILY_TEST(bit, i) \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, inc_return, \
+   i, (i) + one, (i) + one);   \
+} while (0)
+
+#define DEC_RETURN_FAMILY_TEST(bit, i) \
+do {   \
+   FAMILY_TEST(TEST_ARGS, bit, dec_return, \
+   i, (i) - one, (i) - one);   \
+} while (0)
+
 static __init void test_atomic(void)
 {
int v0 = 0xaaa31337;
@@ -45,6 +104,18 @@ static __init void test_atomic(void)
TEST(, and, &=, v1);
TEST(, xor, ^=, v1);
TEST(, andnot, &= ~, v1);
+
+   RETURN_FAMILY_TEST(, add_return, +=, onestwos);
+   RETURN_FAMILY_TEST(, add_return, +=, -one);
+   RETURN_FAMILY_TEST(, sub_return, -=, onestwos);
+   RETURN_FAMILY_TEST(, sub_return, -=, -one);
+
+   INC_RETURN_FAMILY_TEST(, v0);
+   DEC_RETURN_FAMILY_TEST(, v0);
+
+   XCHG_FAMILY_TEST(, v0, v1);
+   CMPXCHG_FAMILY_TEST(, v0, v1, onestwos);
+
 }
 
#define INIT(c) do { atomic64_set(&v, c); r = c; } while (0)
@@ -74,25 +145,10 @@ static __init void test_atomic64(void)
TEST(64, xor, ^=, v1);
TEST(64, andnot, &= ~, v1);
 
-   INIT(v0);
-   r += onestwos;
-   BUG_ON(atomic64_add_return(onestwos, &v) != r);
-   BUG_ON(v.counter != r);
-
-   INIT(v0);
-   r += -one;
-   BUG_ON(atomic64_add_return(-one, &v) != r);
-   BUG_ON(v.counter != r);
-
-   INIT(v0);
-   r -= onestwos;
-   BUG_ON(atomic64_sub_return(onestwos, &v) != r);
-   BUG_ON(v.counter != r);
-
-   INIT(v0);
-   r -= -one;
-   BUG_ON(atomic64_sub_return(-one, &v) != r);
-   BUG_ON(v.counter != r);
+   RETURN_FAMILY_TEST(64, add_return, +=, onestwos);
+   RETURN_FAMILY_TEST(64, add_return, +=, -one);
+   RETURN_FAMILY_TEST(64, sub_return, -=, onestwos);
+   RETURN_FAMILY_TEST(64, su

[PATCH tip/locking/core v4 6/6] powerpc: atomic: Implement cmpxchg{, 64}_* and atomic{, 64}_cmpxchg_* variants

2015-10-14 Thread Boqun Feng
Implement cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed, based on
which _release variants can be built.

To avoid superfluous barriers in the _acquire variants, we implement these
operations with assembly code rather than using __atomic_op_acquire() to
build them automatically.

For the same reason, we keep the assembly implementation of fully
ordered cmpxchg operations.

However, we don't do the same for _release, because that would require
putting barriers in the middle of ll/sc loops, which is probably a bad
idea.

Note cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed are not
compiler barriers.
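
As a side note, the _release variants mentioned above come from the
generic construction; a C11 sketch of that shape (my names, not the kernel
code), which keeps the barrier outside the ll/sc retry loop:

#include <stdatomic.h>
#include <stdbool.h>

bool cmpxchg_release_sketch(atomic_int *p, int *expected, int desired)
{
	/* lwsync on PPC, placed before the loop rather than inside it */
	atomic_thread_fence(memory_order_release);

	return atomic_compare_exchange_strong_explicit(p, expected, desired,
						       memory_order_relaxed,
						       memory_order_relaxed);
}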

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  10 +++
 arch/powerpc/include/asm/cmpxchg.h | 149 -
 2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 1e9d526..e58188d 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -185,6 +185,11 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t 
*v)
 #define atomic_dec_return_relaxed atomic_dec_return_relaxed
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic_cmpxchg_relaxed(v, o, n) \
+   cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic_cmpxchg_acquire(v, o, n) \
+   cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
@@ -453,6 +458,11 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t 
*v)
 }
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic64_cmpxchg_relaxed(v, o, n) \
+   cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic64_cmpxchg_acquire(v, o, n) \
+   cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index 17c7e14..cae4fa8 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -181,6 +181,56 @@ __cmpxchg_u32_local(volatile unsigned int *p, unsigned 
long old,
return prev;
 }
 
+static __always_inline unsigned long
+__cmpxchg_u32_relaxed(u32 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:lwarx   %0,0,%2 # __cmpxchg_u32_relaxed\n"
+"  cmpw0,%0,%3\n"
+"  bne-2f\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %4,0,%2\n"
+"  bne-1b\n"
+"2:"
+   : "=&r" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc");
+
+   return prev;
+}
+
+/*
+ * cmpxchg family don't have order guarantee if cmp part fails, therefore we
+ * can avoid superfluous barriers if we use assembly code to implement
+ * cmpxchg() and cmpxchg_acquire(), however we don't do the similar for
+ * cmpxchg_release() because that will result in putting a barrier in the
+ * middle of a ll/sc loop, which is probably a bad idea. For example, this
+ * might cause the conditional store more likely to fail.
+ */
+static __always_inline unsigned long
+__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:lwarx   %0,0,%2 # __cmpxchg_u32_acquire\n"
+"  cmpw0,%0,%3\n"
+"  bne-2f\n"
+   PPC405_ERR77(0, %2)
+"  stwcx.  %4,0,%2\n"
+"  bne-1b\n"
+   PPC_ACQUIRE_BARRIER
+   "\n"
+"2:"
+   : "=&r" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc", "memory");
+
+   return prev;
+}
+
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
 __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
@@ -224,6 +274,46 @@ __cmpxchg_u64_local(volatile unsigned long *p, unsigned 
long old,
 
return prev;
 }
+
+static __always_inline unsigned long
+__cmpxchg_u64_relaxed(u64 *p, unsigned long old, unsigned long new)
+{
+   unsigned long prev;
+
+   __asm__ __volatile__ (
+"1:ldarx   %0,0,%2 # __cmpxchg_u64_relaxed\n"
+"  cmpd0,%0,%3\n"
+"  bne-2f\n"
+"  stdcx.  %4,0,%2\n"
+"  bne-1b\n"
+"2:"
+   : "=&r" (prev), "+m" (*p)
+   : "r" (p), "r" (old), "r" (new)
+   : "cc");
+
+   return prev;
+}
+
+sta

[PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
versions all need to imply a full barrier, however they are now just
RELEASE+ACQUIRE, which is not a full barrier.

So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().

This patch is a complement of commit b97021f85517 ("powerpc: Fix
atomic_xxx_return barrier semantics").

Acked-by: Michael Ellerman <m...@ellerman.id.au>
Cc: <sta...@vger.kernel.org> # 3.4+
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/powerpc/include/asm/cmpxchg.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..d1a8d93 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
old, unsigned long new)
unsigned int prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 # __cmpxchg_u32\n\
cmpw0,%0,%3\n\
bne-2f\n"
PPC405_ERR77(0,%2)
 "  stwcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=&r" (prev), "+m" (*p)
@@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
old, unsigned long new)
unsigned long prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 # __cmpxchg_u64\n\
cmpd0,%0,%3\n\
bne-2f\n\
stdcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=&r" (prev), "+m" (*p)
-- 
2.5.3


[PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics

2015-10-14 Thread Boqun Feng
Hi all,

This is v4 of the series.

Link for v1: https://lkml.org/lkml/2015/8/27/798
Link for v2: https://lkml.org/lkml/2015/9/16/527
Link for v3: https://lkml.org/lkml/2015/10/12/368

Changes since v3:

*   avoid to introduce smp_acquire_barrier__after_atomic()
(Will Deacon)

*   explain a little bit why we don't implement cmpxchg_release
with assembly code (Will Deacon)


Relaxed/acquire/release variants of atomic operations {add,sub}_return
and {cmp,}xchg are introduced by commit:

"atomics: add acquire/release/relaxed variants of some atomic operations"

and {inc,dec}_return has been introduced by commit:

"locking/asm-generic: Add _{relaxed|acquire|release}() variants for
inc/dec atomics"

Both of these are in the current locking/core branch of the tip tree.

By default, the generic code implements a relaxed variant as a fully
ordered atomic operation, and a release/acquire variant as a relaxed
variant with the necessary general barrier before or after it.

On powerpc, which has a weak memory ordering model, a relaxed variant can
be implemented more cheaply than a fully ordered one. Furthermore,
release and acquire variants can be implemented with arch-specific
lightweight barriers.
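
To give a flavour of where that matters, a made-up userspace C11 sketch
(illustration only, not from the kernel; the PPC comments are
approximate):

#include <stdatomic.h>

atomic_int nr_events;
atomic_int ready;
int payload;

void count_event(void)
{
	/* pure statistics counter: relaxed is enough, roughly just the
	 * ll/sc loop on PPC with no lwsync/sync around it */
	atomic_fetch_add_explicit(&nr_events, 1, memory_order_relaxed);
}

void publish(int v)
{
	payload = v;
	/* publish-style update: release ordering, roughly lwsync + ll/sc
	 * on PPC, no trailing sync needed */
	atomic_fetch_add_explicit(&ready, 1, memory_order_release);
}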

Besides, cmpxchg, xchg and their atomic_ versions are only RELEASE+ACQUIRE
rather than full barriers in the current PPC implementation, which is
incorrect according to memory-barriers.txt.

Therefore this patchset fixes the ordering guarantees of cmpxchg, xchg and
their atomic_ versions, and implements the relaxed/acquire/release
variants based on the powerpc memory model and its specific barriers. Some
trivial tests for these new variants are also included in this series;
because some of these variants are not used in the kernel for now, I think
it is a good idea to at least generate the code for these variants
somewhere.

The patchset consists of 6 parts:

1.  Make xchg, cmpxchg and their atomic_ versions a full barrier

2.  Add trivial tests for the new variants in lib/atomic64_test.c

3.  Allow architectures to define their own __atomic_op_*() helpers
to build other variants based on relaxed.

4.  Implement atomic{,64}_{add,sub,inc,dec}_return_* variants

5.  Implement xchg_* and atomic{,64}_xchg_* variants

6.  Implement cmpxchg_* atomic{,64}_cmpxchg_* variants


This patchset is based on the current locking/core branch of the tip tree;
all patches are built and boot-tested for little-endian pseries, and
also tested by 0day.


Looking forward to any suggestion, question and comment ;-)

Regards,
Boqun

Re: [PATCH RESEND v3 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Wed, Oct 14, 2015 at 10:06:13AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 14, 2015 at 08:51:34AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 11:10:00AM +1100, Michael Ellerman wrote:
> 
> > > Thanks for fixing this. In future you should send a patch like this as a
> > > separate patch. I've not been paying attention to it because I assumed it 
> > > was
> > 
> > Got it. However, here is the thing, in previous version, this fix
> > depends on some of other patches in this patchset. So to make this fix
> > applied cleanly, I reorder my patchset to put this patch first, and the
> > result is that some of other patches in this patchset depends on
> > this(they need to remove code modified by this patch).
> > 
> > So I guess I'd better to stop Cc stable for this one, and wait until
> > this patchset merged and send a separate patch for -stable tree. Does
> > that work for you? I think this is what Peter want to suggests me to do
> > when he asked me about this, right, Peter?
> 
> I don't think I had explicit thoughts about any of that, just that it
> might make sense to have this patch not depend on the rest such that it
> could indeed be stuffed into stable.
> 

Got that. Sorry for misunderstanding you...

> I'll leave the details up to Michael since he's PPC maintainer.

Michael and Peter, the rest of this patchset depends on commits which are
currently in the locking/core branch of the tip tree, so I would like it
queued there as a whole. Besides, I will keep this patch Cc'ed to stable in
future versions; does that work for you both?

Regards,
Boqun



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > CPU 0   CPU 1
> > > > -   -
> > > > 
> > > > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > r3 = xchg(&a, 1);   smp_mb();
> > > > r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > ""
> > > > {
> > > > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > }
> > > >  P0 | P1 ;
> > > >  stw r1,0(r2)   | stw r1,0(r12)  ;
> > > >  lwsync | sync   ;
> > > >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > >  stwcx. r1,r10,r12  | ;
> > > >  bne Fail0  | ;
> > > >  mr r3,r11  | ;
> > > >  Fail0: | ;
> > > > exists
> > > > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid it's more than that; the above litmus test also shows that
> 

I mean there will be more things we need to fix; perhaps even smp_wmb()
would need to be sync then...

Regards,
Boqun

>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?
> 
> And even:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   smp_store_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (1:r3 == 0 && a == 2) is not prohibitted
> 
> shows by:
> 
>   PPC weird-lwsync
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>stw  r1,0(r12) | lwz r3,0(r2)   ;
>   exists
>   (a=2 /\ 1:r3=0)
> 
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> 0?
> 
> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > >   CPU0CPU1
> > > 
> > >   r3 = load_locked ();
> > >   a = 2;
> > >   sync();
> > >   r3 = x;
> > >   x = 1;
> > >   lwsync();
> > >   if (!store_cond(, 1))
> > >   goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's sto

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > Suppose we have something like the following, where "a" and "x" are both
> > > initially zero:
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   r3 = xchg(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > > If xchg() is fully ordered, we should never observe both CPUs'
> > > r3 values being zero, correct?
> > > 
> > > And wouldn't this be represented by the following litmus test?
> > > 
> > >   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > >   ""
> > >   {
> > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > >   }
> > >P0 | P1 ;
> > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > >lwsync | sync   ;
> > >lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > >stwcx. r1,r10,r12  | ;
> > >bne Fail0  | ;
> > >mr r3,r11  | ;
> > >Fail0: | ;
> > >   exists
> > >   (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > 
> > > I left off P0's trailing sync because there is nothing for it to order
> > > against in this particular litmus test.  I tried adding it and verified
> > > that it has no effect.
> > > 
> > > Am I missing something here?  If not, it seems to me that you need
> > > the leading lwsync to instead be a sync.

I'm afraid it's more than that; the above litmus test also shows that

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
r3 = xchg_release(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted

in the implementation of this patchset, which should be disallowed by
the semantics of RELEASE, right?

And even:

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
smp_store_release(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

(1:r3 == 0 && a == 2) is not prohibitted

shows by:

PPC weird-lwsync
""
{
0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
}
 P0 | P1 ;
 stw r1,0(r2)   | stw r1,0(r12)  ;
 lwsync | sync   ;
 stw  r1,0(r12) | lwz r3,0(r2)   ;
exists
(a=2 /\ 1:r3=0)


Please find something I'm (or the tool is) missing, maybe we can't use
(a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
0?

And there is really something I find strange, see below.

> > 
> > So the scenario that would fail would be this one, right?
> > 
> > a = x = 0
> > 
> > CPU0CPU1
> > 
> > r3 = load_locked ();
> > a = 2;
> > sync();
> > r3 = x;
> > x = 1;
> > lwsync();
> > if (!store_cond(, 1))
> > goto again
> > 
> > 
> > Where we hoist the load way up because lwsync allows this.
> 
> That scenario would end up with a==1 rather than a==2.
> 
> > I always thought this would fail because CPU1's store to @a would fail
> > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > load and now seeing the new value (2).
> 
> The stwcx. failure was one thing that prevented a number of other
> misordering cases.  The problem is that we have to let go of the notion
> of an implicit global clock.
> 
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o Commit P0's write.  The model offers to propagate this write
>   to the coherence point and to P1, but don't do so yet.
> 
> o Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o Commit P0's lwsync.
> 
> o Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o Commit P0's stwcx. as successful.  This stores a=1.
> 
> o Commit P0's branch (not taken).
> 

So at this point, P0's write to 'a' has propagated to P1, right? But
P0's write to 'x' hasn't, even though there is an lwsync between them, right?
Doesn't the lwsync prevent this from happening?

If at this point P0's write to 'a' hasn't propagated then when?

Regards,
Boqun

> o Commit P0's final register-to-register move.
> 
> o  

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
Hi Paul,

On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
[snip]
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > o   Commit P0's branch (not taken).
> > 
> 
> So at this point, P0's write to 'a' has propagated to P1, right? But
> P0's write to 'x' hasn't, even though there is an lwsync between them, right?
> Doesn't the lwsync prevent this from happening?
> 
> If at this point P0's write to 'a' hasn't propagated then when?
> 

Hmm... I played around with ppcmem, and figured out what happens to the
propagation of P0's write to 'a':

At this point, or at some point after the store of 1 to 'a' and before the
sync on P1 finishes, the writes to 'a' reach a coherence point at which 'a'
is 2, so P0's write to 'a' "fails" and will not propagate.


I probably misunderstood the word "propagate", which actually means an
already coherent write gets seen by another CPU, right?

So my question should be:

As lwsync can order P0's write to 'a' after P0's write to 'x',
why isn't P0's write to 'x' seen by P1 after P1's write to 'a' overrides
P0's?

But ppcmem gave me the answer ;-) lwsync won't wait until P0's write to
'x' gets propagated, and if P0's write to 'a' "wins" in write coherence,
lwsync will guarantee that the propagation of 'x' happens before that of
'a'; but if P0's write to 'a' "fails", there will be no propagation of 'a'
from P0 at all. So lwsync can't do anything here.

Regards,
Boqun

> 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> > 
> > o   There is still nothing that can happen in either processor.
> > So pick the barrier propagate, then the acknowledge sync.
> > 
> > o   P1 can now execute its read from x.  Because P0's write to
> > x is still waiting to propagate to P1, this still reads
> > x=0.  Execute and commit, and we now have both r3 registers
> > equal to zero and the final value a=2.
> > 
> > o   Clean up by propagating the write to x everywhere, and
> > propagating the lwsync.
> > 
> > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > 
> > I am still not 100% confident of my litmus test.  It is quite possible
> > that I lost something in translation, but that is looking less likely.
> > 


