Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Dec 18, 2012, at 10:31 AM, Peter Bergner berg...@vnet.ibm.com wrote:

> On Tue, 2012-12-18 at 07:28 -0600, Jimi Xenidis wrote:
>> On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:
>>> Jimi, are you using an old binutils from before my patch that
>>> changed the operand order for these types of instructions?
>>>
>>> http://sourceware.org/ml/binutils/2009-02/msg00044.html
>>
>> Actually, this confused me as well, that embedded has the same
>> instruction encoding but different mnemonic.
>
> The mnemonic is the same (ie, dcbtst), and yes, the encoding is the
> same. All that is different is the accepted operand ordering...and
> yes, it is very unfortunate the operand ordering is different between
> embedded and server. :(
>
>> I was under the impression that the assembler made no instruction
>> decisions based on CPU. So your only hint would be that '0b' prefix.
>> Does AS even see that?
>
> GAS definitely makes decisions based on CPU (ie, -mcpu option). Below
> is the GAS code used in recognizing the dcbtst instruction. This shows
> that the server operand ordering is enabled for POWER4 and later cpus
> while the embedded operand ordering is enabled for pre POWER4 cpus
> (yes, not exactly a server versus embedded trigger, but that's what we
> agreed on to mitigate breaking any old asm code out there).
>
>     {"dcbtst", X(31,246), X_MASK, POWER4,     PPCNONE, {RA0, RB, CT}},
>     {"dcbtst", X(31,246), X_MASK, PPC|PPCVLE, POWER4,  {CT, RA0, RB}},
>
> GAS doesn't look at how the operands are written to try and guess what
> operand ordering you are attempting to use. Rather, it knows what
> ordering it expects and the values had better match that ordering.

I agree, but does that mean it is impossible for the same .S file to be
compiled with both -mcpu=e500mc and -mcpu=powerpc? So either these files
have to be Book3S versus Book3E --or-- we use a CPP macro to get them
right. FWIW, I prefer the latter, which allows more code reuse.

-jx

> Peter

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Wed, 2013-01-09 at 16:19 -0600, Jimi Xenidis wrote:
> I agree, but does that mean it is impossible for the same .S file to
> be compiled with both -mcpu=e500mc and -mcpu=powerpc? So either these
> files have to be Book3S versus Book3E --or-- we use a CPP macro to get
> them right. FWIW, I prefer the latter, which allows more code reuse.

I agree that using a CPP macro, like we do for new instructions that
some older assemblers might not support yet, is probably the best
solution.

Peter
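For readers following along, here is a rough sketch of what such a normalizing CPP macro could look like. It is written as plain C that builds the instruction text as a string, so it can be compiled and checked on any host; in a real .S file the macro would expand to the instruction itself. The names ASM_EMBEDDED_OPERAND_ORDER, OPERANDS_SERVER, OPERANDS_EMBEDDED, DCBT_TEXT and DCBTST_TEXT are invented for this illustration and are not taken from the kernel or from the BG/Q patches mentioned in the thread.

```c
/*
 * Sketch only: one call site, two operand orders.
 * Callers always write (RA, RB, TH); the macro reorders if needed.
 */

/* Server (POWER4 and later) order: RA, RB, TH */
#define OPERANDS_SERVER(ra, rb, th)    #ra "," #rb "," #th

/* Pre-POWER4 / embedded order: CT, RA, RB */
#define OPERANDS_EMBEDDED(ra, rb, th)  #th "," #ra "," #rb

/*
 * Hypothetical build switch. Assumption: the build system would define
 * it when assembling for an embedded -mcpu target.
 */
#ifdef ASM_EMBEDDED_OPERAND_ORDER
#define DCBT_OPERANDS OPERANDS_EMBEDDED
#else
#define DCBT_OPERANDS OPERANDS_SERVER
#endif

#define DCBT_TEXT(ra, rb, th)    "dcbt "   DCBT_OPERANDS(ra, rb, th)
#define DCBTST_TEXT(ra, rb, th)  "dcbtst " DCBT_OPERANDS(ra, rb, th)

/* Example call site, taken from the prefetch setup quoted in the thread. */
const char *example_dcbtst(void)
{
    return DCBTST_TEXT(r0, r9, 0b01000);
}
```

With the switch undefined, example_dcbtst() yields the server spelling; defining ASM_EMBEDDED_OPERAND_ORDER before the macros flips every call site at once, which is the code-reuse argument made above.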
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Dec 17, 2012, at 5:33 AM, Anton Blanchard an...@samba.org wrote:

> Hi Jimi,
>
>> I know this is a little late, but shouldn't these power7 specific
>> thingies be in obj-$(CONFIG_PPC_BOOK3S_64). The reason I ask is that
>> my compiler pukes on dcbtst and as I deal with that I wanted to point
>> this out.
>
> I guess we could do that.

I think it is the right idea since it is unclear that your optimizations
would actually help an embedded system where most of these cache
prefetches are NOPs and only waste decode/dispatch cycles.

> It's a bit strange your assembler is complaining about the dcbtst
> instructions since we wrap them with power4:

Not really, the binutils is a little old (RHEL 6.2); unfortunately it
_is_ the toolchain most people are using at the moment. It will take me
a while to get everyone using newer ones since most are scientists using
the packages they get.

My suggestion was really for correctness. My current patches for BG/Q
introduce a macro replacement.

-jx

>     .machine push
>     .machine power4
>     dcbt    r0,r4,0b01000
>     dcbt    r0,r7,0b01010
>     dcbtst  r0,r9,0b01000
>     dcbtst  r0,r10,0b01010
>     eieio
>     dcbt    r0,r8,0b01010   /* GO */
>     .machine pop
>
> Anton
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:

> On Mon, 2012-12-17 at 22:33 +1100, Anton Blanchard wrote:
>> Hi Jimi,
>>
>>> I know this is a little late, but shouldn't these power7 specific
>>> thingies be in obj-$(CONFIG_PPC_BOOK3S_64). The reason I ask is that
>>> my compiler pukes on dcbtst and as I deal with that I wanted to
>>> point this out.
>>
>> I guess we could do that. It's a bit strange your assembler is
>> complaining about the dcbtst instructions since we wrap them with
>> power4:
>>
>>     .machine push
>>     .machine power4
>>     dcbt    r0,r4,0b01000
>>     dcbt    r0,r7,0b01010
>>     dcbtst  r0,r9,0b01000
>>     dcbtst  r0,r10,0b01010
>>     eieio
>>     dcbt    r0,r8,0b01010   /* GO */
>>     .machine pop
>
> Jimi, are you using an old binutils from before my patch that
> changed the operand order for these types of instructions?
>
> http://sourceware.org/ml/binutils/2009-02/msg00044.html

Actually, this confused me as well, that embedded has the same
instruction encoding but different mnemonic. I was under the impression
that the assembler made no instruction decisions based on CPU. So your
only hint would be that '0b' prefix. Does AS even see that?

If not, then without a _normalizing_ macro, I think we will need that
obj-$(CONFIG_PPC_BOOK3S_64), and .S files can never be shared between
the two.

-jx

> Peter
RE: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
>>>     dcbt    r0,r8,0b01010   /* GO */
>>>     .machine pop
>>
>> Jimi, are you using an old binutils from before my patch that
>> changed the operand order for these types of instructions?
>>
>> http://sourceware.org/ml/binutils/2009-02/msg00044.html
>
> Actually, this confused me as well, that embedded has the same
> instruction encoding but different mnemonic.

That is utterly horrid!

> I was under the impression that the assembler made no instruction
> decisions based on CPU. So your only hint would be that '0b' prefix.
> Does AS even see that?

Or maybe see the 'r' prefix. I know they tend to be absent, making ppc
asm even more unreadable. It isn't as though the mnemonics were designed
at a time when the source file size or difference in decode time (or
code space) would be significant.

Otherwise it is a complete recipe for disaster.

David
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Tue, 2012-12-18 at 07:28 -0600, Jimi Xenidis wrote:
> On Dec 17, 2012, at 6:26 PM, Peter Bergner berg...@vnet.ibm.com wrote:
>> Jimi, are you using an old binutils from before my patch that
>> changed the operand order for these types of instructions?
>>
>> http://sourceware.org/ml/binutils/2009-02/msg00044.html
>
> Actually, this confused me as well, that embedded has the same
> instruction encoding but different mnemonic.

The mnemonic is the same (ie, dcbtst), and yes, the encoding is the
same. All that is different is the accepted operand ordering...and yes,
it is very unfortunate the operand ordering is different between
embedded and server. :(

> I was under the impression that the assembler made no instruction
> decisions based on CPU. So your only hint would be that '0b' prefix.
> Does AS even see that?

GAS definitely makes decisions based on CPU (ie, -mcpu option). Below is
the GAS code used in recognizing the dcbtst instruction. This shows that
the server operand ordering is enabled for POWER4 and later cpus while
the embedded operand ordering is enabled for pre POWER4 cpus (yes, not
exactly a server versus embedded trigger, but that's what we agreed on
to mitigate breaking any old asm code out there).

    {"dcbtst", X(31,246), X_MASK, POWER4,     PPCNONE, {RA0, RB, CT}},
    {"dcbtst", X(31,246), X_MASK, PPC|PPCVLE, POWER4,  {CT, RA0, RB}},

GAS doesn't look at how the operands are written to try and guess what
operand ordering you are attempting to use. Rather, it knows what
ordering it expects and the values had better match that ordering.

Peter
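To make the "same encoding, different operand ordering" point concrete, here is a small host-side sketch in C. It assumes the standard X-form layout (6-bit primary opcode, three 5-bit fields, 10-bit extended opcode) and mirrors the two table entries above only by position. The function names are made up for illustration, and the range checks GAS applies to each operand are not modelled.

```c
#include <stdint.h>

/* Extended opcode of dcbtst, from X(31,246) in the table entries above. */
#define DCBTST_XO 246u

/* Pack an X-form instruction word: opcode | field at bits 6-10 | RA | RB | XO. */
static uint32_t x_form(uint32_t primary, uint32_t f6_10,
                       uint32_t ra, uint32_t rb, uint32_t xo)
{
    return (primary << 26) | ((f6_10 & 0x1f) << 21) |
           ((ra & 0x1f) << 16) | ((rb & 0x1f) << 11) |
           ((xo & 0x3ff) << 1);
}

/* POWER4-and-later entry: source operands are read as RA, RB, TH. */
uint32_t dcbtst_server(uint32_t op1, uint32_t op2, uint32_t op3)
{
    return x_form(31, op3, op1, op2, DCBTST_XO);
}

/* Pre-POWER4 entry: the same three source operands are read as CT, RA, RB. */
uint32_t dcbtst_embedded(uint32_t op1, uint32_t op2, uint32_t op3)
{
    return x_form(31, op1, op2, op3, DCBTST_XO);
}
```

Under these assumptions, "dcbtst r0,r9,0b01000" in server order and "dcbtst 0b01000,r0,r9" in embedded order pack to the same word, while feeding the server spelling to the embedded entry places the values in different fields and so describes a different instruction word.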
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
Hi Jimi,

> I know this is a little late, but shouldn't these power7 specific
> thingies be in obj-$(CONFIG_PPC_BOOK3S_64). The reason I ask is that
> my compiler pukes on dcbtst and as I deal with that I wanted to point
> this out.

I guess we could do that. It's a bit strange your assembler is
complaining about the dcbtst instructions since we wrap them with
power4:

    .machine push
    .machine power4
    dcbt    r0,r4,0b01000
    dcbt    r0,r7,0b01010
    dcbtst  r0,r9,0b01000
    dcbtst  r0,r10,0b01010
    eieio
    dcbt    r0,r8,0b01010   /* GO */
    .machine pop

Anton
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On Mon, 2012-12-17 at 22:33 +1100, Anton Blanchard wrote:
> Hi Jimi,
>
>> I know this is a little late, but shouldn't these power7 specific
>> thingies be in obj-$(CONFIG_PPC_BOOK3S_64). The reason I ask is that
>> my compiler pukes on dcbtst and as I deal with that I wanted to point
>> this out.
>
> I guess we could do that. It's a bit strange your assembler is
> complaining about the dcbtst instructions since we wrap them with
> power4:
>
>     .machine push
>     .machine power4
>     dcbt    r0,r4,0b01000
>     dcbt    r0,r7,0b01010
>     dcbtst  r0,r9,0b01000
>     dcbtst  r0,r10,0b01010
>     eieio
>     dcbt    r0,r8,0b01010   /* GO */
>     .machine pop

Jimi, are you using an old binutils from before my patch that
changed the operand order for these types of instructions?

http://sourceware.org/ml/binutils/2009-02/msg00044.html

Peter
Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
On May 31, 2012, at 1:22 AM, Anton Blanchard an...@samba.org wrote:

> Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
> instructions.

snip

> Index: linux-build/arch/powerpc/lib/Makefile
> ===================================================================
> --- linux-build.orig/arch/powerpc/lib/Makefile 2012-05-30 15:27:30.0 +1000
> +++ linux-build/arch/powerpc/lib/Makefile 2012-05-31 09:12:27.574372864 +1000
> @@ -17,7 +17,8 @@ obj-$(CONFIG_HAS_IOMEM) += devres.o
>  obj-$(CONFIG_PPC64) += copypage_64.o copyuser_64.o \
>      memcpy_64.o usercopy_64.o mem_64.o string.o \
>      checksum_wrappers_64.o hweight_64.o \
> -    copyuser_power7.o string_64.o copypage_power7.o
> +    copyuser_power7.o string_64.o copypage_power7.o \
> +    memcpy_power7.o

Hi, I know this is a little late, but shouldn't these power7 specific
thingies be in obj-$(CONFIG_PPC_BOOK3S_64). The reason I ask is that my
compiler pukes on dcbtst and as I deal with that I wanted to point this
out.

-jx

>  obj-$(CONFIG_XMON) += sstep.o ldstfp.o
>  obj-$(CONFIG_KPROBES) += sstep.o ldstfp.o
>  obj-$(CONFIG_HAVE_HW_BREAKPOINT) += sstep.o ldstfp.o
> Index: linux-build/arch/powerpc/lib/memcpy_64.S
> ===================================================================
> --- linux-build.orig/arch/powerpc/lib/memcpy_64.S 2012-05-30 09:39:59.0 +1000
> +++ linux-build/arch/powerpc/lib/memcpy_64.S 2012-05-31 09:12:00.093876936 +1000
> @@ -11,7 +11,11 @@
>      .align  7
>  _GLOBAL(memcpy)
> +BEGIN_FTR_SECTION
>      std     r3,48(r1)       /* save destination pointer for return value */
> +FTR_SECTION_ELSE
> +    b       memcpy_power7
> +ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
>      PPC_MTOCRF(0x01,r5)
>      cmpldi  cr1,r5,16
>      neg     r6,r3           # LS 3 bits = # bytes to 8-byte dest bdry
> Index: linux-build/arch/powerpc/lib/memcpy_power7.S
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ linux-build/arch/powerpc/lib/memcpy_power7.S 2012-05-31 15:28:03.495781127 +1000
> @@ -0,0 +1,650 @@
> +/*
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2012
> + *
> + * Author: Anton Blanchard an...@au.ibm.com
> + */
> +#include <asm/ppc_asm.h>
> +
> +#define STACKFRAMESIZE 256
> +#define STK_REG(i)     (112 + ((i)-14)*8)
> +
> +_GLOBAL(memcpy_power7)
> +#ifdef CONFIG_ALTIVEC
> +    cmpldi  r5,16
> +    cmpldi  cr1,r5,4096
> +
> +    std     r3,48(r1)
> +
> +    blt     .Lshort_copy
> +    bgt     cr1,.Lvmx_copy
> +#else
> +    cmpldi  r5,16
> +
> +    std     r3,48(r1)
> +
> +    blt     .Lshort_copy
> +#endif
> +
> +.Lnonvmx_copy:
> +    /* Get the source 8B aligned */
> +    neg     r6,r4
> +    mtocrf  0x01,r6
> +    clrldi  r6,r6,(64-3)
> +
> +    bf      cr7*4+3,1f
> +    lbz     r0,0(r4)
> +    addi    r4,r4,1
> +    stb     r0,0(r3)
> +    addi    r3,r3,1
> +
> +1:  bf      cr7*4+2,2f
> +    lhz     r0,0(r4)
> +    addi    r4,r4,2
> +    sth     r0,0(r3)
> +    addi    r3,r3,2
> +
> +2:  bf      cr7*4+1,3f
> +    lwz     r0,0(r4)
> +    addi    r4,r4,4
> +    stw     r0,0(r3)
> +    addi    r3,r3,4
> +
> +3:  sub     r5,r5,r6
> +    cmpldi  r5,128
> +    blt     5f
> +
> +    mflr    r0
> +    stdu    r1,-STACKFRAMESIZE(r1)
> +    std     r14,STK_REG(r14)(r1)
> +    std     r15,STK_REG(r15)(r1)
> +    std     r16,STK_REG(r16)(r1)
> +    std     r17,STK_REG(r17)(r1)
> +    std     r18,STK_REG(r18)(r1)
> +    std     r19,STK_REG(r19)(r1)
> +    std     r20,STK_REG(r20)(r1)
> +    std     r21,STK_REG(r21)(r1)
> +    std     r22,STK_REG(r22)(r1)
> +    std     r0,STACKFRAMESIZE+16(r1)
> +
> +    srdi    r6,r5,7
> +    mtctr   r6
> +
> +    /* Now do cacheline (128B) sized loads and stores. */
> +    .align  5
> +4:
> +    ld      r0,0(r4)
> +    ld      r6,8(r4)
> +    ld      r7,16(r4)
> +    ld      r8,24(r4)
> +    ld      r9,32(r4)
> +    ld      r10,40(r4)
> +    ld      r11,48(r4)
> +    ld      r12,56(r4)
> +    ld      r14,64(r4)
> +    ld      r15,72(r4)
> +    ld      r16,80(r4)
> +    ld      r17,88(r4)
> +    ld      r18,96(r4)
> +    ld      r19,104(r4)
> +    ld      r20,112(r4)
> +    ld
[PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
instructions.

This is a copy of the POWER7 optimised copy_to_user/copy_from_user
loop. Detailed implementation and performance details can be found in
commit a66086b8197d (powerpc: POWER7 optimised
copy_to_user/copy_from_user using VMX).

I noticed memcpy issues when profiling a RAID6 workload:

    .memcpy
    .async_memcpy
    .async_copy_data
    .__raid_run_ops
    .handle_stripe
    .raid5d
    .md_thread

I created a simplified testcase by building a RAID6 array with 4 1GB
ramdisks (booting with brd.rd_size=1048576):

# mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]

I then timed how long it took to write to the entire array:

# dd if=/dev/zero of=/dev/md0 bs=1M

Before: 892 MB/s
After:  999 MB/s

A 12% improvement.

Signed-off-by: Anton Blanchard an...@samba.org
---

Index: linux-build/arch/powerpc/lib/Makefile
===================================================================
--- linux-build.orig/arch/powerpc/lib/Makefile 2012-05-30 15:27:30.0 +1000
+++ linux-build/arch/powerpc/lib/Makefile 2012-05-31 09:12:27.574372864 +1000
@@ -17,7 +17,8 @@ obj-$(CONFIG_HAS_IOMEM) += devres.o
 obj-$(CONFIG_PPC64) += copypage_64.o copyuser_64.o \
     memcpy_64.o usercopy_64.o mem_64.o string.o \
     checksum_wrappers_64.o hweight_64.o \
-    copyuser_power7.o string_64.o copypage_power7.o
+    copyuser_power7.o string_64.o copypage_power7.o \
+    memcpy_power7.o
 obj-$(CONFIG_XMON) += sstep.o ldstfp.o
 obj-$(CONFIG_KPROBES) += sstep.o ldstfp.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += sstep.o ldstfp.o
Index: linux-build/arch/powerpc/lib/memcpy_64.S
===================================================================
--- linux-build.orig/arch/powerpc/lib/memcpy_64.S 2012-05-30 09:39:59.0 +1000
+++ linux-build/arch/powerpc/lib/memcpy_64.S 2012-05-31 09:12:00.093876936 +1000
@@ -11,7 +11,11 @@
     .align  7
 _GLOBAL(memcpy)
+BEGIN_FTR_SECTION
     std     r3,48(r1)       /* save destination pointer for return value */
+FTR_SECTION_ELSE
+    b       memcpy_power7
+ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
     PPC_MTOCRF(0x01,r5)
     cmpldi  cr1,r5,16
     neg     r6,r3           # LS 3 bits = # bytes to 8-byte dest bdry
Index: linux-build/arch/powerpc/lib/memcpy_power7.S
===================================================================
--- /dev/null 1970-01-01 00:00:00.0 +
+++ linux-build/arch/powerpc/lib/memcpy_power7.S 2012-05-31 15:28:03.495781127 +1000
@@ -0,0 +1,650 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2012
+ *
+ * Author: Anton Blanchard an...@au.ibm.com
+ */
+#include <asm/ppc_asm.h>
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i)     (112 + ((i)-14)*8)
+
+_GLOBAL(memcpy_power7)
+#ifdef CONFIG_ALTIVEC
+    cmpldi  r5,16
+    cmpldi  cr1,r5,4096
+
+    std     r3,48(r1)
+
+    blt     .Lshort_copy
+    bgt     cr1,.Lvmx_copy
+#else
+    cmpldi  r5,16
+
+    std     r3,48(r1)
+
+    blt     .Lshort_copy
+#endif
+
+.Lnonvmx_copy:
+    /* Get the source 8B aligned */
+    neg     r6,r4
+    mtocrf  0x01,r6
+    clrldi  r6,r6,(64-3)
+
+    bf      cr7*4+3,1f
+    lbz     r0,0(r4)
+    addi    r4,r4,1
+    stb     r0,0(r3)
+    addi    r3,r3,1
+
+1:  bf      cr7*4+2,2f
+    lhz     r0,0(r4)
+    addi    r4,r4,2
+    sth     r0,0(r3)
+    addi    r3,r3,2
+
+2:  bf      cr7*4+1,3f
+    lwz     r0,0(r4)
+    addi    r4,r4,4
+    stw     r0,0(r3)
+    addi    r3,r3,4
+
+3:  sub     r5,r5,r6
+    cmpldi  r5,128
+    blt     5f
+
+    mflr    r0
+    stdu    r1,-STACKFRAMESIZE(r1)
+    std     r14,STK_REG(r14)(r1)
+    std     r15,STK_REG(r15)(r1)
+    std     r16,STK_REG(r16)(r1)
+    std     r17,STK_REG(r17)(r1)
+    std     r18,STK_REG(r18)(r1)
+    std     r19,STK_REG(r19)(r1)
+    std     r20,STK_REG(r20)(r1)
+    std     r21,STK_REG(r21)(r1)
+    std     r22,STK_REG(r22)(r1)
+    std     r0,STACKFRAMESIZE+16(r1)
+
+    srdi    r6,r5,7
+    mtctr   r6
+
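As a reading aid for the (truncated) non-VMX path above, the arithmetic behind the alignment preamble and the cacheline loop count can be sketched in C. This only restates what the neg/clrldi and sub/srdi sequences compute; the function names are invented for illustration, and the sketch assumes len is at least 16, which the earlier branch to .Lshort_copy guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * neg r6,r4 ; clrldi r6,r6,(64-3)
 * Bytes needed to advance src to the next 8-byte boundary: (-src) & 7.
 * The low three bits of this value select the 1-, 2- and 4-byte copies
 * in the preamble.
 */
size_t bytes_to_align8(uintptr_t src)
{
    return (size_t)((0 - src) & 7);
}

/*
 * sub r5,r5,r6 ; srdi r6,r5,7
 * Number of full 128-byte cachelines left once the head bytes are copied.
 * Assumes len >= 16, so the subtraction cannot underflow.
 */
size_t full_cachelines(uintptr_t src, size_t len)
{
    size_t remaining = len - bytes_to_align8(src);
    return remaining >> 7;
}
```

For example, a source address ending in 0x3 needs 5 head bytes (one byte, then one word), after which whatever is left is divided into 128-byte iterations for the loop counter.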