Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2013-01-09 Thread Jimi Xenidis

On Dec 18, 2012, at 10:31 AM, Peter Bergner berg...@vnet.ibm.com wrote:

 On Tue, 2012-12-18 at 07:28 -0600, Jimi Xenidis wrote:
 On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:
 Jimi, are you using an old binutils from before my patch that
 changed the operand order for these types of instructions?
 
   http://sourceware.org/ml/binutils/2009-02/msg00044.html
 
 Actually, this confused me as well, that embedded has the same instruction
 encoding but different mnemonic.
 
 The mnemonic is the same (ie, dcbtst), and yes, the encoding is the same.
 All that is different is the accepted operand ordering...and yes, it is
 very unfortunate the operand ordering is different between embedded and
 server. :(
 
 
 I was under the impression that the assembler made no instruction decisions
 based on CPU.  So your only hint would be that '0b' prefix.
 Does AS even see that?
 
 GAS definitely makes decisions based on CPU (ie, -mcpu option).  Below is
 the GAS code used in recognizing the dcbtst instruction.  This shows that
 the server operand ordering is enabled for POWER4 and later cpus while
 the embedded operand ordering is enabled for pre POWER4 cpus (yes, not
 exactly a server versus embedded trigger, but that's what we agreed on to
 mitigate breaking any old asm code out there).
 
 {"dcbtst", X(31,246), X_MASK, POWER4,     PPCNONE, {RA0, RB, CT}},
 {"dcbtst", X(31,246), X_MASK, PPC|PPCVLE, POWER4,  {CT, RA0, RB}},
 
 GAS doesn't look at how the operands are written to try and guess what
 operand ordering you are attempting to use.  Rather, it knows what ordering
 it expects and the values had better match that ordering.
 

I agree, but doesn't that mean it is impossible for the same .S file to be
compiled with both -mcpu=e500mc and -mcpu=powerpc?
So either these files have to be split into Book3S and Book3E versions, or we
use a CPP macro to get them right.
FWIW, I prefer the latter, which allows more code reuse.

-jx


 
 Peter
 
 
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2013-01-09 Thread Peter Bergner
On Wed, 2013-01-09 at 16:19 -0600, Jimi Xenidis wrote:
 I agree, but doesn't that mean it is impossible for the same .S file to be
 compiled with both -mcpu=e500mc and -mcpu=powerpc?  So either these files
 have to be split into Book3S and Book3E versions, or we use a CPP macro to
 get them right.
 FWIW, I prefer the latter, which allows more code reuse.

I agree that using a CPP macro, as we do for new instructions that some
older assemblers might not support yet, is probably the best solution.
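
A minimal sketch of what such a normalizing macro could look like, written
as plain C so it can be compiled and checked on its own. The SKETCH_* names
are hypothetical and are not the kernel's actual macros. In a real .S file
the macro would expand to bare assembler tokens rather than a string. The
two operand orders are the ones from the GAS table entries quoted in this
thread; using CONFIG_PPC_BOOK3S_64 as the switch is an assumption made for
illustration.

```c
/*
 * Illustration only: SKETCH_* names are hypothetical, not kernel macros.
 * Callers always pass (ra, rb, th); the macro emits the operand order
 * that the assembler expects for the selected CPU family.
 */

/* Server order (POWER4 and later): RA, RB, hint. */
#define SKETCH_DCBTST_SERVER(ra, rb, th)   "dcbtst " #ra "," #rb "," #th

/* Embedded order (pre-POWER4 in the GAS table): hint, RA, RB. */
#define SKETCH_DCBTST_EMBEDDED(ra, rb, th) "dcbtst " #th "," #ra "," #rb

/* Assumption: a kernel config symbol selects the variant. */
#ifdef CONFIG_PPC_BOOK3S_64
#define SKETCH_DCBTST(ra, rb, th) SKETCH_DCBTST_SERVER(ra, rb, th)
#else
#define SKETCH_DCBTST(ra, rb, th) SKETCH_DCBTST_EMBEDDED(ra, rb, th)
#endif

/* One call site, shared by both builds. */
const char *sketch_prefetch_text(void)
{
    return SKETCH_DCBTST(r0, r9, 0b01000);
}
```

With a wrapper of this shape, the call site is written once in a single
argument order and only the macro definition differs between Book3S and
Book3E builds.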

Peter




Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-18 Thread Jimi Xenidis

On Dec 17, 2012, at 5:33 AM, Anton Blanchard an...@samba.org wrote:

 
 Hi Jimi,
 
 I know this is a little late, but shouldn't these power7-specific
 thingies be in obj-$(CONFIG_PPC_BOOK3S_64)? The reason I ask is
 that my compiler pukes on dcbtst, and as I deal with that I wanted
 to point this out.
 
 I guess we could do that.

I think it is the right idea, since it is unclear that your optimizations would
actually help an embedded system where most of these cache prefetches are NOPs
and only waste decode/dispatch cycles.

 It's a bit strange your assembler is
 complaining about the dcbtst instructions since we wrap them with
 power4:

Not really; the binutils is a little old (RHEL 6.2). Unfortunately, it _is_ the
toolchain most people are using at the moment.
It will take me a while to get everyone using newer ones, since most are
scientists using the packages they get.

My suggestion was really for correctness. My current patches for BG/Q
introduce a macro replacement.
-jx


 
 .machine push
 .machine power4
 dcbt    r0,r4,0b01000
 dcbt    r0,r7,0b01010
 dcbtst  r0,r9,0b01000
 dcbtst  r0,r10,0b01010
 eieio
 dcbt    r0,r8,0b01010   /* GO */
 .machine pop
 
 Anton



Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-18 Thread Jimi Xenidis

On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:

 On Mon, 2012-12-17 at 22:33 +1100, Anton Blanchard wrote:
 Hi Jimi,
 
 I know this is a little late, but shouldn't these power7-specific
 thingies be in obj-$(CONFIG_PPC_BOOK3S_64)? The reason I ask is
 that my compiler pukes on dcbtst, and as I deal with that I wanted
 to point this out.
 
 I guess we could do that. It's a bit strange your assembler is
 complaining about the dcbtst instructions since we wrap them with
 power4:
 
 .machine push
 .machine power4
 dcbt    r0,r4,0b01000
 dcbt    r0,r7,0b01010
 dcbtst  r0,r9,0b01000
 dcbtst  r0,r10,0b01010
 eieio
 dcbt    r0,r8,0b01010   /* GO */
 .machine pop
 
 Jimi, are you using an old binutils from before my patch that
 changed the operand order for these types of instructions?
 
http://sourceware.org/ml/binutils/2009-02/msg00044.html

Actually, this confused me as well, that embedded has the same instruction 
encoding but different mnemonic.
I was under the impression that the assembler made no instruction decisions 
based on CPU.
So your only hint would be that '0b' prefix.
Does AS even see that?

If not, then without a _normalizing_ macro, I think we will need that
obj-$(CONFIG_PPC_BOOK3S_64), and .S files can never be shared between the two.

-jx


 
 Peter
 
 



RE: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-18 Thread David Laight
 dcbt    r0,r8,0b01010   /* GO */
  .machine pop
 
  Jimi, are you using an old binutils from before my patch that
  changed the operand order for these types of instructions?
 
 http://sourceware.org/ml/binutils/2009-02/msg00044.html
 
 Actually, this confused me as well, that embedded has the same
 instruction encoding but different mnemonic.

That is utterly horrid!

 I was under the impression that the assembler made no instruction decisions 
 based on CPU.
 So your only hint would be that '0b' prefix.
 Does AS even see that?

Or maybe it could see the 'r' prefix.
I know they tend to be absent, making ppc asm even more unreadable.
It isn't as though the mnemonics were designed at a time when
the source file size or difference in decode time (or code space)
would be significant.

Otherwise it is a complete recipe for disaster.

David




Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-18 Thread Peter Bergner
On Tue, 2012-12-18 at 07:28 -0600, Jimi Xenidis wrote:
 On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:
  Jimi, are you using an old binutils from before my patch that
  changed the operand order for these types of instructions?
  
 http://sourceware.org/ml/binutils/2009-02/msg00044.html
 
 Actually, this confused me as well, that embedded has the same instruction
 encoding but different mnemonic.

The mnemonic is the same (ie, dcbtst), and yes, the encoding is the same.
All that is different is the accepted operand ordering...and yes, it is
very unfortunate the operand ordering is different between embedded and
server. :(


 I was under the impression that the assembler made no instruction decisions
 based on CPU.  So your only hint would be that '0b' prefix.
 Does AS even see that?

GAS definitely makes decisions based on CPU (ie, -mcpu option).  Below is
the GAS code used in recognizing the dcbtst instruction.  This shows that
the server operand ordering is enabled for POWER4 and later cpus while
the embedded operand ordering is enabled for pre POWER4 cpus (yes, not
exactly a server versus embedded trigger, but that's what we agreed on to
mitigate breaking any old asm code out there).

{"dcbtst", X(31,246), X_MASK, POWER4,     PPCNONE, {RA0, RB, CT}},
{"dcbtst", X(31,246), X_MASK, PPC|PPCVLE, POWER4,  {CT, RA0, RB}},

GAS doesn't look at how the operands are written to try and guess what
operand ordering you are attempting to use.  Rather, it knows what ordering
it expects and the values had better match that ordering.
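
The selection rule behind those two table entries can be modelled roughly
as follows. This is a simplified sketch in C, not the binutils source: the
sk_* names and the flag values are made up for illustration. The assumption
is that an entry is usable when its flags match the selected CPU and its
deprecated mask does not.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for the CPU flag bits; values are arbitrary. */
#define SK_PPC     (1u << 0)
#define SK_POWER4  (1u << 1)
#define SK_PPCVLE  (1u << 2)

struct sk_opcode {
    const char *name;       /* mnemonic */
    unsigned flags;         /* CPUs that accept this entry */
    unsigned deprecated;    /* CPUs for which this entry is disabled */
    const char *operands;   /* expected operand order, as text */
};

/* Same mnemonic twice: only the CPU masks and operand order differ. */
static const struct sk_opcode sk_table[] = {
    { "dcbtst", SK_POWER4,          0,         "RA0, RB, CT" },
    { "dcbtst", SK_PPC | SK_PPCVLE, SK_POWER4, "CT, RA0, RB" },
};

/* Return the operand order expected for `name` on `cpu`, or NULL. */
const char *sk_operand_order(const char *name, unsigned cpu)
{
    size_t n = sizeof sk_table / sizeof sk_table[0];

    for (size_t i = 0; i < n; i++) {
        const struct sk_opcode *op = &sk_table[i];

        if (strcmp(op->name, name) == 0 &&
            (op->flags & cpu) != 0 &&
            (op->deprecated & cpu) == 0)
            return op->operands;
    }
    return NULL;
}
```

In this model the operands actually written in the source play no part in
the choice: the CPU mask alone picks the entry, and the operands are then
checked against that entry's order.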


Peter





Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-17 Thread Anton Blanchard

Hi Jimi,

 I know this is a little late, but shouldn't these power7-specific
 thingies be in obj-$(CONFIG_PPC_BOOK3S_64)? The reason I ask is
 that my compiler pukes on dcbtst, and as I deal with that I wanted
 to point this out.

I guess we could do that. It's a bit strange your assembler is
complaining about the dcbtst instructions since we wrap them with
power4:

.machine push
.machine power4
dcbt    r0,r4,0b01000
dcbt    r0,r7,0b01010
dcbtst  r0,r9,0b01000
dcbtst  r0,r10,0b01010
eieio
dcbt    r0,r8,0b01010   /* GO */
.machine pop

Anton


Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-17 Thread Peter Bergner
On Mon, 2012-12-17 at 22:33 +1100, Anton Blanchard wrote:
 Hi Jimi,
 
  I know this is a little late, but shouldn't these power7-specific
  thingies be in obj-$(CONFIG_PPC_BOOK3S_64)? The reason I ask is
  that my compiler pukes on dcbtst, and as I deal with that I wanted
  to point this out.
 
 I guess we could do that. It's a bit strange your assembler is
 complaining about the dcbtst instructions since we wrap them with
 power4:
 
 .machine push
 .machine power4
 dcbt    r0,r4,0b01000
 dcbt    r0,r7,0b01010
 dcbtst  r0,r9,0b01000
 dcbtst  r0,r10,0b01010
 eieio
 dcbt    r0,r8,0b01010   /* GO */
 .machine pop

Jimi, are you using an old binutils from before my patch that
changed the operand order for these types of instructions?

http://sourceware.org/ml/binutils/2009-02/msg00044.html

Peter




Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-12-07 Thread Jimi Xenidis

On May 31, 2012, at 1:22 AM, Anton Blanchard an...@samba.org wrote:

 
 Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
 instructions.

<snip>

 
 Index: linux-build/arch/powerpc/lib/Makefile
 ===
 --- linux-build.orig/arch/powerpc/lib/Makefile  2012-05-30 15:27:30.0 +1000
 +++ linux-build/arch/powerpc/lib/Makefile   2012-05-31 09:12:27.574372864 +1000
 @@ -17,7 +17,8 @@ obj-$(CONFIG_HAS_IOMEM) += devres.o
 obj-$(CONFIG_PPC64)   += copypage_64.o copyuser_64.o \
  memcpy_64.o usercopy_64.o mem_64.o string.o \
  checksum_wrappers_64.o hweight_64.o \
 -copyuser_power7.o string_64.o copypage_power7.o
 +copyuser_power7.o string_64.o copypage_power7.o \
 +memcpy_power7.o

Hi,
I know this is a little late, but shouldn't these power7-specific thingies be
in obj-$(CONFIG_PPC_BOOK3S_64)?
The reason I ask is that my compiler pukes on dcbtst, and as I deal with that
I wanted to point this out.

-jx


 obj-$(CONFIG_XMON)    += sstep.o ldstfp.o
 obj-$(CONFIG_KPROBES) += sstep.o ldstfp.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)  += sstep.o ldstfp.o
 Index: linux-build/arch/powerpc/lib/memcpy_64.S
 ===
 --- linux-build.orig/arch/powerpc/lib/memcpy_64.S   2012-05-30 09:39:59.0 +1000
 +++ linux-build/arch/powerpc/lib/memcpy_64.S    2012-05-31 09:12:00.093876936 +1000
 @@ -11,7 +11,11 @@
 
   .align  7
 _GLOBAL(memcpy)
 +BEGIN_FTR_SECTION
   std r3,48(r1)   /* save destination pointer for return value */
 +FTR_SECTION_ELSE
 + b   memcpy_power7
 +ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
   PPC_MTOCRF(0x01,r5)
   cmpldi  cr1,r5,16
   neg r6,r3   # LS 3 bits = # bytes to 8-byte dest bdry
 Index: linux-build/arch/powerpc/lib/memcpy_power7.S
 ===
 --- /dev/null 1970-01-01 00:00:00.0 +
 +++ linux-build/arch/powerpc/lib/memcpy_power7.S    2012-05-31 15:28:03.495781127 +1000
 @@ -0,0 +1,650 @@
 +/*
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; either version 2 of the License, or
 + * (at your option) any later version.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software
 + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 + *
 + * Copyright (C) IBM Corporation, 2012
 + *
 + * Author: Anton Blanchard an...@au.ibm.com
 + */
 +#include <asm/ppc_asm.h>
 +
 +#define STACKFRAMESIZE   256
 +#define STK_REG(i)   (112 + ((i)-14)*8)
 +
 +_GLOBAL(memcpy_power7)
 +#ifdef CONFIG_ALTIVEC
 + cmpldi  r5,16
 + cmpldi  cr1,r5,4096
 +
 + std r3,48(r1)
 +
 + blt .Lshort_copy
 + bgt cr1,.Lvmx_copy
 +#else
 + cmpldi  r5,16
 +
 + std r3,48(r1)
 +
 + blt .Lshort_copy
 +#endif
 +
 +.Lnonvmx_copy:
 + /* Get the source 8B aligned */
 + neg r6,r4
 + mtocrf  0x01,r6
 + clrldi  r6,r6,(64-3)
 +
 + bf  cr7*4+3,1f
 + lbz r0,0(r4)
 + addi    r4,r4,1
 + stb r0,0(r3)
 + addi    r3,r3,1
 +
 +1:   bf  cr7*4+2,2f
 + lhz r0,0(r4)
 + addi    r4,r4,2
 + sth r0,0(r3)
 + addi    r3,r3,2
 +
 +2:   bf  cr7*4+1,3f
 + lwz r0,0(r4)
 + addi    r4,r4,4
 + stw r0,0(r3)
 + addi    r3,r3,4
 +
 +3:   sub r5,r5,r6
 + cmpldi  r5,128
 + blt 5f
 +
 + mflr    r0
 + stdu    r1,-STACKFRAMESIZE(r1)
 + std r14,STK_REG(r14)(r1)
 + std r15,STK_REG(r15)(r1)
 + std r16,STK_REG(r16)(r1)
 + std r17,STK_REG(r17)(r1)
 + std r18,STK_REG(r18)(r1)
 + std r19,STK_REG(r19)(r1)
 + std r20,STK_REG(r20)(r1)
 + std r21,STK_REG(r21)(r1)
 + std r22,STK_REG(r22)(r1)
 + std r0,STACKFRAMESIZE+16(r1)
 +
 + srdi    r6,r5,7
 + mtctr   r6
 +
 + /* Now do cacheline (128B) sized loads and stores. */
 + .align  5
 +4:
 + ld  r0,0(r4)
 + ld  r6,8(r4)
 + ld  r7,16(r4)
 + ld  r8,24(r4)
 + ld  r9,32(r4)
 + ld  r10,40(r4)
 + ld  r11,48(r4)
 + ld  r12,56(r4)
 + ld  r14,64(r4)
 + ld  r15,72(r4)
 + ld  r16,80(r4)
 + ld  r17,88(r4)
 + ld  r18,96(r4)
 + ld  r19,104(r4)
 + ld  r20,112(r4)
 + ld  

[PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch

2012-05-31 Thread Anton Blanchard

Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
instructions.

This is a copy of the POWER7 optimised copy_to_user/copy_from_user
loop. Detailed implementation and performance details can be found in
commit a66086b8197d (powerpc: POWER7 optimised
copy_to_user/copy_from_user using VMX).

I noticed memcpy issues when profiling a RAID6 workload:

.memcpy
.async_memcpy
.async_copy_data
.__raid_run_ops
.handle_stripe
.raid5d
.md_thread

I created a simplified testcase by building a RAID6 array with 4 1GB
ramdisks (booting with brd.rd_size=1048576):

# mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]

I then timed how long it took to write to the entire array:

# dd if=/dev/zero of=/dev/md0 bs=1M

Before: 892 MB/s
After:  999 MB/s

A 12% improvement.

Signed-off-by: Anton Blanchard an...@samba.org
---

Index: linux-build/arch/powerpc/lib/Makefile
===
--- linux-build.orig/arch/powerpc/lib/Makefile  2012-05-30 15:27:30.0 +1000
+++ linux-build/arch/powerpc/lib/Makefile   2012-05-31 09:12:27.574372864 +1000
@@ -17,7 +17,8 @@ obj-$(CONFIG_HAS_IOMEM)   += devres.o
 obj-$(CONFIG_PPC64)    += copypage_64.o copyuser_64.o \
   memcpy_64.o usercopy_64.o mem_64.o string.o \
   checksum_wrappers_64.o hweight_64.o \
-  copyuser_power7.o string_64.o copypage_power7.o
+  copyuser_power7.o string_64.o copypage_power7.o \
+  memcpy_power7.o
 obj-$(CONFIG_XMON) += sstep.o ldstfp.o
 obj-$(CONFIG_KPROBES)  += sstep.o ldstfp.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += sstep.o ldstfp.o
Index: linux-build/arch/powerpc/lib/memcpy_64.S
===
--- linux-build.orig/arch/powerpc/lib/memcpy_64.S   2012-05-30 09:39:59.0 +1000
+++ linux-build/arch/powerpc/lib/memcpy_64.S    2012-05-31 09:12:00.093876936 +1000
@@ -11,7 +11,11 @@
 
    .align  7
 _GLOBAL(memcpy)
+BEGIN_FTR_SECTION
    std r3,48(r1)   /* save destination pointer for return value */
+FTR_SECTION_ELSE
+   b   memcpy_power7
+ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
    PPC_MTOCRF(0x01,r5)
    cmpldi  cr1,r5,16
    neg r6,r3   # LS 3 bits = # bytes to 8-byte dest bdry
Index: linux-build/arch/powerpc/lib/memcpy_power7.S
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-build/arch/powerpc/lib/memcpy_power7.S    2012-05-31 15:28:03.495781127 +1000
@@ -0,0 +1,650 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2012
+ *
+ * Author: Anton Blanchard an...@au.ibm.com
+ */
+#include <asm/ppc_asm.h>
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+_GLOBAL(memcpy_power7)
+#ifdef CONFIG_ALTIVEC
+   cmpldi  r5,16
+   cmpldi  cr1,r5,4096
+
+   std r3,48(r1)
+
+   blt .Lshort_copy
+   bgt cr1,.Lvmx_copy
+#else
+   cmpldi  r5,16
+
+   std r3,48(r1)
+
+   blt .Lshort_copy
+#endif
+
+.Lnonvmx_copy:
+   /* Get the source 8B aligned */
+   neg r6,r4
+   mtocrf  0x01,r6
+   clrldi  r6,r6,(64-3)
+
+   bf  cr7*4+3,1f
+   lbz r0,0(r4)
+   addi    r4,r4,1
+   stb r0,0(r3)
+   addi    r3,r3,1
+
+1: bf  cr7*4+2,2f
+   lhz r0,0(r4)
+   addi    r4,r4,2
+   sth r0,0(r3)
+   addi    r3,r3,2
+
+2: bf  cr7*4+1,3f
+   lwz r0,0(r4)
+   addi    r4,r4,4
+   stw r0,0(r3)
+   addi    r3,r3,4
+
+3: sub r5,r5,r6
+   cmpldi  r5,128
+   blt 5f
+
+   mflr    r0
+   stdu    r1,-STACKFRAMESIZE(r1)
+   std r14,STK_REG(r14)(r1)
+   std r15,STK_REG(r15)(r1)
+   std r16,STK_REG(r16)(r1)
+   std r17,STK_REG(r17)(r1)
+   std r18,STK_REG(r18)(r1)
+   std r19,STK_REG(r19)(r1)
+   std r20,STK_REG(r20)(r1)
+   std r21,STK_REG(r21)(r1)
+   std r22,STK_REG(r22)(r1)
+   std r0,STACKFRAMESIZE+16(r1)
+
+   srdi    r6,r5,7
+   mtctr   r6
+
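
For readers following the truncated patch above, here is a rough C model of
what the entry of memcpy_power7 and the "Get the source 8B aligned" step
appear to do. It is a sketch under stated assumptions, not a translation of
the full routine: the sk_* names are invented, the VMX path is assumed to be
available (the CONFIG_ALTIVEC case), and the caller is assumed to have
already checked that at least 16 bytes remain before aligning.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the three paths chosen at entry. */
enum sk_path { SK_SHORT, SK_NONVMX, SK_VMX };

/* Mirrors the two cmpldi tests: n < 16 is short, n > 4096 goes to VMX. */
enum sk_path sk_classify(size_t n)
{
    if (n < 16)
        return SK_SHORT;
    if (n > 4096)
        return SK_VMX;
    return SK_NONVMX;
}

/*
 * Copy 1, 2 and/or 4 bytes so that *src ends up 8-byte aligned.
 * The low three bits of the negated source address give the number of
 * bytes to the next 8-byte boundary, and each bit selects one copy.
 * Returns the number of bytes consumed (0..7).
 */
size_t sk_align_head(unsigned char **dst, const unsigned char **src)
{
    size_t head = (size_t)(0 - (uintptr_t)*src) & 7;

    if (head & 1) {
        memcpy(*dst, *src, 1);
        *dst += 1;
        *src += 1;
    }
    if (head & 2) {
        memcpy(*dst, *src, 2);
        *dst += 2;
        *src += 2;
    }
    if (head & 4) {
        memcpy(*dst, *src, 4);
        *dst += 4;
        *src += 4;
    }
    return head;
}
```

A caller would subtract the returned count from the remaining length, which
corresponds to the "sub r5,r5,r6" that follows the three conditional copies
in the patch.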