Re: powerpc64: 64-bit-ize memmove.S
On Sat, 27 Jun 2020 01:27:14 +0200 Christian Weisgerber wrote:

> That function simply copies as many (double)words plus a tail of
> bytes as the length argument specifies.  Neither source nor destination
> are checked for alignment, so this will happily run a loop of
> unaligned accesses, which doesn't sound very optimal.

I made a benchmark and concluded that unaligned word copies are
slower than aligned word copies, but faster than byte copies.  In
most cases, memmove.S is faster than memmove.c, but if aligned word
copies between unaligned buffers are possible, then memmove.c is
faster.

The benchmark was on a 32-bit macppc G3 with
cpu0 at mainbus0: 750 (Revision 0x202): 400 MHz: 512KB backside cache

The benchmark has 4 implementations of memmove,
 stbu => byte copy with lbzu,stbu loop
 stbx => byte copy with lbzx,stbx,addi loop
 C    => aligned word copy or byte copy (libc/string/memmove.c)
 asm  => unaligned word copy (libc/arch/powerpc/string/memmove.S)
It shows time measured by mftb (move from timebase).

1st bench: move 1 bytes up by 4 bytes, then down by 4 bytes, in
aligned buffer (offset 0).  asm wins:

$ ./bench 1 4 0
stbu	stbx	C	asm
2639	2814	792	633
2502	2814	784	628
2501	2814	783	627
2501	2814	784	626

2nd bench: unaligned buffer (offset 1), but (src & 3) == (dst & 3),
so C does aligned word copies, while asm does misaligned.  C wins:

$ ./bench 1 4 1
stbu	stbx	C	asm
2638	3006	795	961
2502	2814	786	938
2501	2814	786	939
2501	2813	785	939

3rd bench: move up then down by 5 bytes, src & 3 != dst & 3, can't
align word copies.  C does byte copies.  asm wins:

$ ./bench 1 5 0
stbu	stbx	C	asm
2675	2815	2514	809
2501	2813	2504	782
2502	2815	2504	782
2501	2814	2503	782

I think that memmove.S is probably better than memmove.c on G3.
I haven't run the bench on POWER9.
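The strategy being measured in the "C" column can be sketched in portable C. This is a hypothetical illustration (function name `copy_fwd` is made up, not the libc code): word copies are attempted only when src and dst share the same low address bits, so every word access is aligned; otherwise everything goes through the byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the alignment strategy in libc/string/memmove.c: if
 * (src & mask) == (dst & mask), byte-copy up to a word boundary and
 * then do aligned word copies; otherwise fall back to a pure byte
 * loop.  Forward direction only; hypothetical helper, not libc code.
 */
static void *
copy_fwd(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	const size_t mask = sizeof(long) - 1;

	if (((uintptr_t)d & mask) == ((uintptr_t)s & mask)) {
		/* Byte-copy until dst (and hence src) is aligned. */
		while (n > 0 && ((uintptr_t)d & mask)) {
			*d++ = *s++;
			n--;
		}
		/* Aligned word copies for the bulk. */
		while (n >= sizeof(long)) {
			memcpy(d, s, sizeof(long));	/* one aligned word */
			d += sizeof(long);
			s += sizeof(long);
			n -= sizeof(long);
		}
	}
	/* Tail bytes, or the whole buffer if alignments differ. */
	while (n-- > 0)
		*d++ = *s++;
	return dst;
}
```

The 2nd bench above is exactly the case where the `(src & 3) == (dst & 3)` test succeeds on unaligned buffers, so this path wins over memmove.S's misaligned word accesses.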
Re: powerpc64: 64-bit-ize memmove.S
George Koehler wrote:
> On Sat, 27 Jun 2020 01:27:14 +0200
> Christian Weisgerber wrote:
>
> > I'm also intrigued by this aside in the PowerPC ISA documentation:
> > | Moreover, Load with Update instructions may take longer to execute
> > | in some implementations than the corresponding pair of a non-update
> > | Load instruction and an Add instruction.
> > What does clang generate?
>
> clang likes load/store with update instructions.  For example, the
> powerpc64 kernel has /sys/lib/libkern/memcpy.c, which copies bytes:
>
> 	while (n-- > 0)
> 		*t++ = *f++;

Keep in mind that in userland we use memcpy.c, which does backwards
detection.  It's probably slower.  The original idea was to use the
C version with backwards detection until it stopped finding bugs,
but... it keeps finding bugs.
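The backwards detection mentioned here is the standard overlap check in a memmove-style copy: if the destination starts inside the source range, a forward byte loop would clobber bytes before reading them, so the copy must run from the end. A minimal sketch (hypothetical `my_memmove`, not OpenBSD's actual source):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Minimal sketch of "backwards detection": when dst lands inside
 * [src, src + n), copy from the end so overlapping bytes are read
 * before they are overwritten.  Hypothetical code for illustration.
 */
static void *
my_memmove(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (d <= s || d >= s + n) {
		while (n-- > 0)		/* forward: the lbzu/stbu-style loop */
			*d++ = *s++;
	} else {
		d += n;			/* backward: start past the end */
		s += n;
		while (n-- > 0)
			*--d = *--s;
	}
	return dst;
}
```

A memcpy that performs this check can additionally report callers that pass overlapping buffers, which is how such a C version "keeps finding bugs".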
Re: powerpc64: 64-bit-ize memmove.S
On Sat, 27 Jun 2020 01:27:14 +0200 Christian Weisgerber wrote:

> I'm also intrigued by this aside in the PowerPC ISA documentation:
> | Moreover, Load with Update instructions may take longer to execute
> | in some implementations than the corresponding pair of a non-update
> | Load instruction and an Add instruction.
> What does clang generate?

clang likes load/store with update instructions.  For example, the
powerpc64 kernel has /sys/lib/libkern/memcpy.c, which copies bytes:

	while (n-- > 0)
		*t++ = *f++;

clang uses lbzu and stbu:

memcpy:		cmpldi	r5,0x0
memcpy+0x4:	beqlr
memcpy+0x8:	addi	r4,r4,-1
memcpy+0xc:	addi	r6,r3,-1
memcpy+0x10:	mtspr	ctr,r5
memcpy+0x14:	lbzu	r5,1(r4)
memcpy+0x18:	stbu	r5,1(r6)
memcpy+0x1c:	bdnz	0x26cd0d4	(memcpy+0x14)
memcpy+0x20:	blr

> I think we should consider dropping this "optimized" memmove.S on
> both powerpc and powerpc64.

I might want to benchmark memmove.S against memmove.c to check if
those unaligned accesses are too slow.  First I would have to write
a benchmark.
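The update-form addressing in that disassembly corresponds to a pre-increment idiom in C. As a hedged illustration (hypothetical helpers; actual codegen depends on compiler version and flags), the two loop styles below express the same byte copy, and on powerpc they tend to map to the "stbu" (lbzu/stbu update-form) and "stbx" (lbzx/stbx plus addi) variants from the benchmark earlier in the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pre-adjust the pointers by -1, then use pre-increment: this is
 * the C shape that matches lbzu/stbu, where one instruction does
 * both the load/store and the address update.
 */
static void
copy_update_style(unsigned char *t, const unsigned char *f, size_t n)
{
	t--;				/* like addi r6,r3,-1 */
	f--;				/* like addi r4,r4,-1 */
	while (n-- > 0)
		*++t = *++f;		/* lbzu/stbu-shaped access */
}

/*
 * Indexed form: load/store at base+index with a separate counter
 * increment, the lbzx/stbx + addi shape.
 */
static void
copy_index_style(unsigned char *t, const unsigned char *f, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		t[i] = f[i];
}
```

The ISA note quoted above is precisely about whether the fused update form is actually faster than the separate load + add pair on a given implementation.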
Re: powerpc64: 64-bit-ize memmove.S
Christian Weisgerber:

> Well, I suggested it, so here's my attempt to switch powerpc64's
> libc memmove.S over to 64 bits:

Actually, on second thought:

That function simply copies as many (double)words plus a tail of
bytes as the length argument specifies.  Neither source nor destination
are checked for alignment, so this will happily run a loop of
unaligned accesses, which doesn't sound very optimal.

I'm also intrigued by this aside in the PowerPC ISA documentation:

| Moreover, Load with Update instructions may take longer to execute
| in some implementations than the corresponding pair of a non-update
| Load instruction and an Add instruction.

What does clang generate?

I think we should consider dropping this "optimized" memmove.S on
both powerpc and powerpc64.
-- 
Christian "naddy" Weisgerber                          na...@mips.inka.de
powerpc64: 64-bit-ize memmove.S
Well, I suggested it, so here's my attempt to switch powerpc64's
libc memmove.S over to 64 bits:

* Treat length parameter as 64 bits (size_t) instead of 32 (u_int).

* Set up the main loop to copy 64-bit doublewords instead of 32-bit
  words.

This definitely needs double and triple checking.

Things I wonder about, but didn't touch:

* Why is the memcpy entry point commented out?

* END... STRONG? WEAK? BUILTIN?

Index: memmove.S
===================================================================
RCS file: /cvs/src/lib/libc/arch/powerpc64/string/memmove.S,v
retrieving revision 1.1
diff -u -p -r1.1 memmove.S
--- memmove.S	25 Jun 2020 02:34:22 -0000	1.1
+++ memmove.S	26 Jun 2020 22:22:51 -0000
@@ -39,7 +39,7 @@
  * ==
  */
 
-#include "SYS.h"
+#include "DEFS.h"
 
 	.text
@@ -64,45 +64,45 @@ ENTRY(memmove)
 				/* start of dest		*/
 
 fwd:
-	addi	%r4, %r4, -4	/* Back up src and dst pointers	*/
-	addi	%r8, %r8, -4	/* due to auto-update of 'load'	*/
+	addi	%r4, %r4, -8	/* Back up src and dst pointers	*/
+	addi	%r8, %r8, -8	/* due to auto-update of 'load'	*/
 
-	srwi.	%r9,%r5,2	/* How many words in total cnt	*/
-	beq-	last1		/* Handle byte by byte if < 4	*/
+	srdi.	%r9,%r5,3	/* Doublewords in total count	*/
+	beq-	last1		/* Handle byte by byte if < 8	*/
 				/* bytes total			*/
 
-	mtctr	%r9		/* Count of words for loop	*/
-	lwzu	%r7, 4(%r4)	/* Preload first word		*/
+	mtctr	%r9		/* Count of dwords for loop	*/
+	ldu	%r7, 8(%r4)	/* Preload first doubleword	*/
 	b	g1
 
 g0:				/* Main loop			*/
-	lwzu	%r7, 4(%r4)	/* Load a new word		*/
-	stwu	%r6, 4(%r8)	/* Store previous word		*/
+	ldu	%r7, 8(%r4)	/* Load a new doubleword	*/
+	stdu	%r6, 8(%r8)	/* Store previous doubleword	*/
 
 g1:
 	bdz-	last		/* Dec cnt, and branch if just	*/
-				/* one word to store		*/
-	lwzu	%r6, 4(%r4)	/* Load another word		*/
-	stwu	%r7, 4(%r8)	/* Store previous word		*/
+				/* one doubleword to store	*/
+	ldu	%r6, 8(%r4)	/* Load another doubleword	*/
+	stdu	%r7, 8(%r8)	/* Store previous doubleword	*/
 	bdnz+	g0		/* Dec cnt, and loop again if	*/
-				/* more words			*/
-	mr	%r7, %r6	/* If word count -> 0, then...	*/
+				/* more doublewords		*/
+	mr	%r7, %r6	/* If dword count -> 0, then...	*/
 
 last:
-	stwu	%r7, 4(%r8)	/* ... store last word		*/
+	stdu	%r7, 8(%r8)	/* ... store last doubleword	*/
 
 last1:				/* Byte-by-byte copy		*/
-	clrlwi.	%r5,%r5,30	/* If count -> 0, then ...	*/
+	clrldi.	%r5,%r5,61	/* If count -> 0, then ...	*/
 	beqlr			/* we're done			*/
 
 	mtctr	%r5		/* else load count for loop	*/
 
-	lbzu	%r6, 4(%r4)	/* 1st byte: update addr by 4	*/
-	stbu	%r6, 4(%r8)	/* since we pre-adjusted by 4	*/
+	lbzu	%r6, 8(%r4)	/* 1st byte: update addr by 8	*/
+	stbu	%r6, 8(%r8)	/* since we pre-adjusted by 8	*/
 	bdzlr-			/* in anticipation of main loop	*/
 
 last2:
@@ -120,40 +120,40 @@ reverse:
 	add	%r4, %r4, %r5	/* Work from end to beginning	*/
 	add	%r8, %r8, %r5	/* so add count to string ptrs	*/
 
-	srwi.	%r9,%r5,2	/* Words in total count		*/
-	beq-	rlast1		/* Handle byte by byte if < 4	*/
+	srdi.	%r9,%r5,3	/* Doublewords in total count	*/
+	beq-	rlast1		/* Handle byte by byte if < 8	*/
 				/* bytes total			*/
 
-	mtctr	%r9		/* Count of words for loop	*/
+	mtctr	%r9		/* Count of dwords for loop	*/
 
-	lwzu	%r7, -4(%r4)	/* Preload first word		*/
+	ldu
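The count arithmetic behind the srwi./clrlwi. to srdi./clrldi. changes can be modeled in a few lines of C. A sketch (hypothetical `struct split` and `split64`, introduced only for clarity): the right shift extracts the (double)word count for the ctr loop, and the mask keeps the byte tail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the count splitting in memmove.S.
 * 32-bit: srwi. r9,r5,2  => words  = n >> 2
 *         clrlwi. r5,r5,30 => tail = n & 3  (keep low 2 bits)
 * 64-bit: srdi. r9,r5,3  => dwords = n >> 3
 *         clrldi. r5,r5,61 => tail = n & 7  (clearing the top 61
 *         bits of a 64-bit register keeps the low 3)
 * Hypothetical struct for illustration.
 */
struct split {
	uint64_t chunks;	/* count for the main doubleword loop */
	uint64_t tail;		/* leftover bytes for the byte loop */
};

static struct split
split64(uint64_t n)
{
	struct split s;

	s.chunks = n >> 3;	/* srdi. %r9,%r5,3 */
	s.tail = n & 7;		/* clrldi. %r5,%r5,61 */
	return s;
}
```

For example, a 29-byte move becomes 3 doublewords plus 5 tail bytes; the 32-bit version of the same split (n >> 2, n & 3) would give 7 words plus 1 byte.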