Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

2015-08-17 Thread Kevin Hao
On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
 I tried a couple different benchmarks and didn't find a significant 
 difference, relative to the variability of the results running on the same 
 kernel.  A patch that claims to optimize a bit as its main purpose ought to 
 show some results. :-)

I tried to compare the execution time of these two code sequences with the
following test module:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/printk.h>

static void test1(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 10; i++) {
		lock = 0;
		cpu = 1;

		asm volatile ("\
1:	lbarx	%0,0,%2\n\
	lhz	%1,0(%3)\n\
	cmpdi	%0,0\n\
	cmpdi	cr1,%1,1\n\
	addi	%1,%1,1\n\
	bne	2f\n\
	stbcx.	%1,0,%2\n\
	bne	1b\n\
2:"
		: "=&r" (c), "=&r" (s) : "r" (&lock), "r" (&cpu)
		: "cr0", "cr1", "memory");
	}
}

static void test2(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 10; i++) {
		lock = 0;
		cpu = 1;

		asm volatile ("\
	lhz	%1,0(%3)\n\
	addi	%1,%1,1\n\
	crclr	cr1*4+eq\n\
1:	lbarx	%0,0,%2\n\
	cmpdi	%0,0\n\
	bne	2f\n\
	stbcx.	%1,0,%2\n\
	bne	1b\n\
2:"
		: "=&r" (c), "=&r" (s) : "r" (&lock), "r" (&cpu)
		: "cr0", "cr1", "memory");
	}
}

static int test_init(void)
{
	unsigned long s, e, tm1, tm2;

	__hard_irq_disable();
	/* Just for prefetch */
	test1();
	s = mftb();
	test1();
	e = mftb();
	tm1 = e - s;

	/* Just for prefetch */
	test2();
	s = mftb();
	test2();
	e = mftb();
	tm2 = e - s;
	__hard_irq_enable();

	pr_err("test1: %ld, test2: %ld, %%%ld\n", tm1, tm2,
	       (tm1 - tm2) * 100 / tm1);

	return 0;
}

static void test_exit(void)
{
	return;
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

The results:
test1: 156568, test2: 151675, %3
test1: 156604, test2: 151670, %3
test1: 156567, test2: 151684, %3
test1: 156567, test2: 151678, %3
test1: 156567, test2: 151688, %3
test1: 156570, test2: 151683, %3
test1: 156565, test2: 151675, %3
test1: 156565, test2: 151673, %3

It seems that there is indeed a ~3% gain in performance from moving the
three instructions out of the lbarx/stbcx. loop.

Thanks,
Kevin


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

2015-08-17 Thread Scott Wood
On Mon, 2015-08-17 at 19:16 +0800, Kevin Hao wrote:
 On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
  I tried a couple different benchmarks and didn't find a significant 
  difference, relative to the variability of the results running on the 
  same 
  kernel.  A patch that claims to optimize a bit as its main purpose 
  ought to 
  show some results. :-)
 
 I tried to compare the execution time of these two code sequences with the
 following test module:
 
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/printk.h>
 
 static void test1(void)
 {
   int i;
   unsigned char lock, c;
   unsigned short cpu, s;
 
   for (i = 0; i < 10; i++) {
   lock = 0;
   cpu = 1;
 
   asm volatile ("\
 1:	lbarx	%0,0,%2\n\
 	lhz	%1,0(%3)\n\
 	cmpdi	%0,0\n\
 	cmpdi	cr1,%1,1\n\

This should be either cmpdi cr1,%0,1 or crclr, not that it made much 
difference.  The test seemed to be rather sensitive to additional 
instructions inserted at the beginning of the asm statement (especially 
isync), so the initial instructions before the loop are probably pairing with 
something outside the asm.

That said, it looks like this patch at least doesn't make things worse, and 
does convert cmpdi to a more readable crclr, so I guess I'll apply it even 
though it doesn't show any measurable benefit when testing entire TLB misses 
(much less actual applications).

I suspect the point where I misunderstood the core manual was where it listed 
lbarx as having a repeat-rate of 3 cycles.  I probably assumed that that was 
because of the presync, and thus a subsequent unrelated load could execute 
partially in parallel, but it looks like the repeat rate is specifically 
talking about how long it is until the execution unit can accept any other 
instruction.

-Scott


Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

2015-08-14 Thread Kevin Hao
On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
 On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
  It makes no sense to put the instructions for calculating the lock
  value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
  loop. And when the lock is acquired by the other thread, the current
  lock value has no chance to equal with the lock value used by current
  cpu. So we can skip the comparing for these two lock values in the
  lbz/bne loop.
  
  Signed-off-by: Kevin Hao haoke...@gmail.com
  ---
   arch/powerpc/mm/tlb_low_64e.S | 10 +-
   1 file changed, 5 insertions(+), 5 deletions(-)
  
  diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
  index 765b419883f2..e4185581c5a7 100644
  --- a/arch/powerpc/mm/tlb_low_64e.S
  +++ b/arch/powerpc/mm/tlb_low_64e.S
  @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION   /* CPU_FTR_SMT */
 *
 * MAS6:IND should be already set based on MAS4
 */
  -1:   lbarx   r15,0,r11
lhz r10,PACAPACAINDEX(r13)
  - cmpdi   r15,0
  - cmpdi   cr1,r15,1   /* set cr1.eq = 0 for non-recursive */
addir10,r10,1
  + crclr   cr1*4+eq/* set cr1.eq = 0 for non-recursive */
  +1:   lbarx   r15,0,r11
  + cmpdi   r15,0
bne 2f
 
 You're optimizing the contended case at the expense of introducing stalls in 
 the uncontended case.

Before the patch, the uncontended case code sequence is:
1:  lbarx   r15,0,r11
lhz r10,PACAPACAINDEX(r13)
cmpdi   r15,0
cmpdi   cr1,r15,1   /* set cr1.eq = 0 for non-recursive */
addir10,r10,1
bne 2f
stbcx.  r10,0,r11
bne 1b


After the patch:
lhz r10,PACAPACAINDEX(r13)
addir10,r10,1
crclr   cr1*4+eq/* set cr1.eq = 0 for non-recursive */
1:  lbarx   r15,0,r11
cmpdi   r15,0
bne 2f
stbcx.  r10,0,r11
bne 1b

As you know, lbarx is a presync instruction and stbcx. is both a presync and
a postsync instruction. Putting the unnecessary instructions inside the
lbarx/stbcx. loop also serializes their execution. The execution latency of
lbarx is only 3 cycles and there are still two instructions after it.
Considering the out-of-order execution this patch enables, do you really
think the new uncontended path will become slower due to this additional
stall?

  Does it really matter if there are more instructions 
 in the loop?

I really think we should minimize the lbarx/stbcx. window, for the following
two reasons:
  - The bigger this window, the more likely the two threads conflict when
 they run into this loop simultaneously.
  - The reservation set by lbarx may be cleared by the other thread due to a
 store to the same reservation granule. The smaller the lbarx/stbcx.
 window, the less likely it is to be affected by this.

  This change just means that you'll spin in the loop for more 
 iterations (if it even does that -- I think the cycles per loop iteration 
 might be the same before and after, due to load latency and pairing) while 
 waiting for the other thread to release the lock.

Besides optimizing the contended case, it also makes the code more readable:
  - It always seemed a bit weird to calculate the lock value for the current
cpu inside the lbarx/stbcx. loop.
  - The cmpdi cr1,r15,1 is pretty confusing. It doesn't always do what the
comment says (set cr1.eq = 0). In some cases it actually sets cr1.eq = 1,
such as when running on cpu 1 while the lock is held by cpu 0. All we
need to do is clear cr1.eq unconditionally.

 
 Do you have any benchmark results for this patch?

I doubt it will show any visible difference, but anyway I will give it a try.

Thanks,
Kevin
 
 -Scott
 



Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

2015-08-14 Thread Scott Wood
On Fri, 2015-08-14 at 15:13 +0800, Kevin Hao wrote:
 On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
  On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
   It makes no sense to put the instructions for calculating the lock
   value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
   loop. And when the lock is acquired by the other thread, the current
   lock value has no chance to equal with the lock value used by current
   cpu. So we can skip the comparing for these two lock values in the
   lbz/bne loop.
   
   Signed-off-by: Kevin Hao haoke...@gmail.com
   ---
arch/powerpc/mm/tlb_low_64e.S | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
   
   diff --git a/arch/powerpc/mm/tlb_low_64e.S 
   b/arch/powerpc/mm/tlb_low_64e.S
   index 765b419883f2..e4185581c5a7 100644
   --- a/arch/powerpc/mm/tlb_low_64e.S
   +++ b/arch/powerpc/mm/tlb_low_64e.S
   @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION   /* CPU_FTR_SMT */
  *
  * MAS6:IND should be already set based on MAS4
  */
   -1:   lbarx   r15,0,r11
 lhz r10,PACAPACAINDEX(r13)
   - cmpdi   r15,0
   - cmpdi   cr1,r15,1   /* set cr1.eq = 0 for non-recursive */
 addir10,r10,1
   + crclr   cr1*4+eq/* set cr1.eq = 0 for non-recursive */
   +1:   lbarx   r15,0,r11
   + cmpdi   r15,0
 bne 2f
  
  You're optimizing the contended case at the expense of introducing stalls 
  in 
  the uncontended case.
 
 Before the patch, the uncontended case code sequence are:
 1:lbarx   r15,0,r11
   lhz r10,PACAPACAINDEX(r13)
   cmpdi   r15,0
   cmpdi   cr1,r15,1   /* set cr1.eq = 0 for non-recursive */
   addir10,r10,1
   bne 2f
   stbcx.  r10,0,r11
   bne 1b
 
 
 After the patch:
   lhz r10,PACAPACAINDEX(r13)
   addir10,r10,1
   crclr   cr1*4+eq/* set cr1.eq = 0 for non-recursive */
 1:lbarx   r15,0,r11
   cmpdi   r15,0
   bne 2f
   stbcx.  r10,0,r11
   bne 1b
 
 As you know, the lbarx is a Presync instruction and stbcx is a Presync and
 Postsync instruction.

Yes, so don't we want to move instructions after the lbarx if possible, so 
that the presync condition is achieved sooner?

  Putting the unnecessary instructions in the lbarx/stbcx
 loop also serialize these instructions execution.

Again, the common case should be that the loop executes only once.  The two 
cmpdi instructions should pair, the addi should pair with the bne, and the 
lhz should happen while waiting for the lbarx result.  My understanding of 
how to model this stuff is certainly imperfect/incomplete, so I generally try 
to confirm by testing, but I think both loops take the same number of cycles 
per iteration.

  The execution latency of
 lbarx is only 3 cycles and there are still two instructions after it.
 Considering the out of order execution optimization after this patch, do you
 really think that new uncontended path will become slower due to this
 additional stall?

The (theoretical) additional time is before the loop, not during it.

   Does it really matter if there are more instructions 
  in the loop?
 
 I really think we should minimize the window of lbarx/stbcx, for following 
 two
 reasons:
   - The bigger of this window, the more possible conflicts between the two
  threads run into this loop simultaneously.

That's more true of the total time the lock is held, not the lbarx/stbcx 
section.

   - The reservation used by lbarx may be cleared by another thread due to
  store to the same reservation granule. The smaller the window of
  lbarx/stbcx, the less possibility to be affected by this.

There's only one other thread that should be touching that reservation 
granule, and it's the one we're waiting for.

In any case, if there is a difference in loop iteration execution time, it's 
small.

  This change just means that you'll spin in the loop for more 
  iterations (if it even does that -- I think the cycles per loop iteration 
  might be the same before and after, due to load latency and pairing) 
  while 
  waiting for the other thread to release the lock.
 
 Besides the optimization for the contended case, it also make the code more
 readable with these changes:
   - It always seem a bit weird to calculate the lock value for the current
 cpu in the lbarx/stbcx loop.
   - The cmpdi   cr1,r15,1 seems pretty confusion. It doesn't always do 
 what
 the comment said (set cr1.eq = 0). In some cases, it does set the
 crq.eq = 1, such as when running on cpu 1 with lock is acquired by cpu0.
 All we need to do just clear the cr1.eq unconditionally.

We only care about cr1.eq when we break out of the loop, in which case r15 
will have been zero.  But yes, crclr is better.

  
  Do you have any benchmark results for this patch?
 
 I doubt it will get any visible difference. Anyway I will gave it a try.

I tried a couple 

[PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

2015-08-13 Thread Kevin Hao
It makes no sense to put the instructions for calculating the lock value
(cpu number + 1) and the clearing of the eq bit of cr1 inside the
lbarx/stbcx. loop. Moreover, when the lock is held by the other thread,
the current lock value can never equal the lock value used by the current
cpu, so we can skip the comparison of these two lock values in the
lbz/bne loop.

Signed-off-by: Kevin Hao haoke...@gmail.com
---
 arch/powerpc/mm/tlb_low_64e.S | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index 765b419883f2..e4185581c5a7 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -308,11 +308,11 @@ BEGIN_FTR_SECTION /* CPU_FTR_SMT */
 *
 * MAS6:IND should be already set based on MAS4
 */
-1: lbarx   r15,0,r11
lhz r10,PACAPACAINDEX(r13)
-   cmpdi   r15,0
-   cmpdi   cr1,r15,1   /* set cr1.eq = 0 for non-recursive */
addir10,r10,1
+   crclr   cr1*4+eq/* set cr1.eq = 0 for non-recursive */
+1: lbarx   r15,0,r11
+   cmpdi   r15,0
bne 2f
stbcx.  r10,0,r11
bne 1b
@@ -320,9 +320,9 @@ BEGIN_FTR_SECTION   /* CPU_FTR_SMT */
.subsection 1
 2: cmpdcr1,r15,r10 /* recursive lock due to mcheck/crit/etc? */
beq cr1,3b  /* unlock will happen if cr1.eq = 0 */
-   lbz r15,0(r11)
+10:lbz r15,0(r11)
cmpdi   r15,0
-   bne 2b
+   bne 10b
b   1b
.previous
 
-- 
2.1.0


Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes

2015-08-13 Thread Scott Wood
On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
 It makes no sense to put the instructions for calculating the lock
 value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
 loop. And when the lock is acquired by the other thread, the current
 lock value has no chance to equal with the lock value used by current
 cpu. So we can skip the comparing for these two lock values in the
 lbz/bne loop.
 
 Signed-off-by: Kevin Hao haoke...@gmail.com
 ---
  arch/powerpc/mm/tlb_low_64e.S | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)
 
 diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
 index 765b419883f2..e4185581c5a7 100644
 --- a/arch/powerpc/mm/tlb_low_64e.S
 +++ b/arch/powerpc/mm/tlb_low_64e.S
 @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION   /* CPU_FTR_SMT */
*
* MAS6:IND should be already set based on MAS4
*/
 -1:   lbarx   r15,0,r11
   lhz r10,PACAPACAINDEX(r13)
 - cmpdi   r15,0
 - cmpdi   cr1,r15,1   /* set cr1.eq = 0 for non-recursive */
   addir10,r10,1
 + crclr   cr1*4+eq/* set cr1.eq = 0 for non-recursive */
 +1:   lbarx   r15,0,r11
 + cmpdi   r15,0
   bne 2f

You're optimizing the contended case at the expense of introducing stalls in 
the uncontended case.  Does it really matter if there are more instructions 
in the loop?  This change just means that you'll spin in the loop for more 
iterations (if it even does that -- I think the cycles per loop iteration 
might be the same before and after, due to load latency and pairing) while 
waiting for the other thread to release the lock.

Do you have any benchmark results for this patch?

-Scott
