Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-17 Thread Anton Blanchard
 
Hi Nick,

 Ah, good to see this one come back. I also tested tbench over localhost
 btw which actually did show some speedup on the G5. 
 
 BTW. this was the last thing left:
 http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
 
 Don't know if you took a look at that again, but maybe it's worth
 looking at. Hmm, we do actually seem to be growing number of smp_mb*
 calls in core kernel too.

Interesting idea! I do worry we will get a late night visit from the
architecture police. From memory they want the complete larx, stcwx, bne,
isync sequence to guarantee an acquire barrier.

Anton
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-17 Thread Nick Piggin
On Wed, Feb 17, 2010 at 08:43:14PM +1100, Anton Blanchard wrote:
  
 Hi Nick,
 
  Ah, good to see this one come back. I also tested tbench over localhost
  btw which actually did show some speedup on the G5. 
  
  BTW. this was the last thing left:
  http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
  
  Don't know if you took a look at that again, but maybe it's worth
  looking at. Hmm, we do actually seem to be growing number of smp_mb*
  calls in core kernel too.
 
 Interesting idea! I do worry we will get a late night visit from the
 architecture police. From memory they want the complete larx, stcwx, bne,
 isync sequence to guarantee an acquire barrier.

Yes I suppose the branch can be executed non speculatively before the 
lwsync is completed. Wheras the larx/stcwx will have to complete before
the branch outcome can be known. I suppose probably not worthwhile
avoiding the full IO sync by adding yet more crap to make this work.

Thanks for putting my mind to rest though :)

I'd still be interested to know how expensive the full sync is when you
have a lot of IOs in flight.

Question, are you going to do the hint and isync-lwsync thing for
userspace as well? Too bad the kernel doesn't export synchronisation
primitives to userspace...

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-17 Thread Benjamin Herrenschmidt

 Yes I suppose the branch can be executed non speculatively before the 
 lwsync is completed. Wheras the larx/stcwx will have to complete before
 the branch outcome can be known. I suppose probably not worthwhile
 avoiding the full IO sync by adding yet more crap to make this work.
 
 Thanks for putting my mind to rest though :)
 
 I'd still be interested to know how expensive the full sync is when you
 have a lot of IOs in flight.
 
 Question, are you going to do the hint and isync-lwsync thing for
 userspace as well? Too bad the kernel doesn't export synchronisation
 primitives to userspace...

We'd love to the vdso inherits from all the dynamic patching
features of the kernel and more...

It's really a matter of getting glibc to grok it, and doing so
efficiently (ie, not just having a function point in glibc, might be a
good use of the IFUNC stuff).

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-15 Thread Olof Johansson
On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
 
 Nick Piggin discovered that lwsync barriers around locks were faster than 
 isync
 on 970. That was a long time ago and I completely dropped the ball in testing
 his patches across other ppc64 processors.
 
 Turns out the idea helps on other chips. Using a microbenchmark that
 uses a lot of threads to contend on a global pthread mutex (and therefore a
 global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
 and while I couldn't measure an improvement, there was no regression.
 
 This patch uses the lwsync patching code to replace the isyncs with lwsyncs
 on CPUs that support the instruction. We were marking POWER3 and RS64 as 
 lwsync
 capable but in reality they treat it as a full sync (ie slow). Remove the
 CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
 method.
 
 Signed-off-by: Anton Blanchard an...@samba.org

Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
significantly more expensive there. I see a 25% drop in the microbenchmark
doing pthread_lock/unlock loops on two cpus.

Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
the sync-lwsync changes definitely still can, and should, be done.


-Olof

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-15 Thread Benjamin Herrenschmidt
On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
 
 Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
 significantly more expensive there. I see a 25% drop in the microbenchmark
 doing pthread_lock/unlock loops on two cpus.
 
 Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
 the sync-lwsync changes definitely still can, and should, be done. 

So we should use a different feature bit. No biggie. If needed we can
split them more anyways.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-15 Thread Olof Johansson
On Tue, Feb 16, 2010 at 03:19:03PM +1100, Benjamin Herrenschmidt wrote:
 On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
  
  Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
  significantly more expensive there. I see a 25% drop in the microbenchmark
  doing pthread_lock/unlock loops on two cpus.
  
  Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
  the sync-lwsync changes definitely still can, and should, be done. 
 
 So we should use a different feature bit. No biggie. If needed we can
 split them more anyways.

Yeah, fine with me.


-Olof
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it

2010-02-10 Thread Nick Piggin
On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
 
 Nick Piggin discovered that lwsync barriers around locks were faster than 
 isync
 on 970. That was a long time ago and I completely dropped the ball in testing
 his patches across other ppc64 processors.
 
 Turns out the idea helps on other chips. Using a microbenchmark that
 uses a lot of threads to contend on a global pthread mutex (and therefore a
 global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
 and while I couldn't measure an improvement, there was no regression.

Ah, good to see this one come back. I also tested tbench over localhost
btw which actually did show some speedup on the G5. 

BTW. this was the last thing left:
http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html

Don't know if you took a look at that again, but maybe it's worth
looking at. Hmm, we do actually seem to be growing number of smp_mb*
calls in core kernel too.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev