Re: [PATCH/RFC] 64 bit csum_partial_copy_generic

2008-10-16 Thread Paul Mackerras
Joel Schopp writes:

 As for the technical comments, I agree with all of them and will 
 incorporate them into the next version.

Mark Nelson is working on new memcpy and __copy_tofrom_user routines
that look like they will be simpler than the old ones as well as being
faster, particularly on Cell.  It turns out that doing unaligned
8-byte loads is faster than doing aligned loads + shifts + ors on
POWER5 and later machines.  So I suggest that you try a loop that does
say 4 ld's and 4 std's rather than worrying with all the complexity of
the shifts and ors.  On POWER3, ld and std that are not 4-byte aligned
will cause an alignment interrupt, so there I suggest we fall back to
just using lwz and stw as at present (though maybe with the loop
unrolled a bit more).  We'll be adding a feature bit to tell whether
the cpu can do unaligned 8-byte loads and stores without trapping.
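Roughly the sort of inner loop I have in mind (an untested sketch; register
choice is arbitrary, it assumes the number of 32-byte blocks is already in
the ctr, and it leaves out the head/tail handling):

1:      ld      r6,0(r3)
        ld      r7,8(r3)
        ld      r8,16(r3)
        ld      r9,24(r3)
        addi    r3,r3,32
        std     r6,0(r4)
        std     r7,8(r4)
        std     r8,16(r4)
        std     r9,24(r4)
        addi    r4,r4,32
        bdnz    1b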

Paul.


Re: [PATCH/RFC] 64 bit csum_partial_copy_generic

2008-10-10 Thread Paul Mackerras
[EMAIL PROTECTED] writes:

 The current 64 bit csum_partial_copy_generic function is based on the 32 
 bit version and never was optimized for 64 bit.  This patch takes the 64 bit 
 memcpy and adapts it to also do the sum.  It has been tested on a variety 
 of input sizes and alignments on Power5 and Power6 processors.  It gives 
 correct output for all cases tested.  It also runs 20-55% faster 
 than the implementation it replaces, depending on size, alignment, and processor.

Thanks for doing this.  A few comments below, but first, can you
clarify what your and George Fulk's roles were in producing this?  I had
the impression George had written the code, and if that's the case,
you need to put a From: George Fulk ... line as the first line of
your mail when you re-send the patch.

Also, the patch was whitespace-damaged.  All lines with an initial
space character had that space turned into two spaces.

 @@ -22,8 +22,7 @@
    * len is in words and is always >= 5.
*
* In practice len == 5, but this is not guaranteed.  So this code does not
 - * attempt to use doubleword instructions.
 - */
 + * attempt to use doubleword instructions. */

This looked better the way it was IMHO, and in any case it's unrelated
to the main point of the patch.

 + * This returns a 32 bit 1s complement sum that can be folded to 16 bits and
 + * notted to produce a 16bit tcp/ip checksum.
  ^^
complemented

   _GLOBAL(csum_partial_copy_generic)
 + std     r7,48(r1)       /* we need to save the error pointers ...*/
 + std     r8,56(r1)       /* we need to save the error pointers ...*/
 + PPC_MTOCRF      0x01,r5
 + cmpldi  cr1,r5,16
 + neg     r11,r4          # LS 3 bits = # bytes to 8-byte dest bdry
 + andi.   r11,r11,7
 + dcbt    0,r3
 + blt     cr1,.Lshort_copy
 + bne     .Ldst_unaligned
 +.Ldst_aligned:
 + andi.   r0,r3,7
 + addi    r4,r4,-16
 + bne     .Lsrc_unaligned
 + srdi    r10,r5,4        /* src and dst aligned */
 +80:      ld      r9,0(r3)
 + addi    r3,r3,-8
 + mtctr   r10
 + andi.   r5,r5,7
 + bf      cr7*4+0,2f
 + addi    r4,r4,8
 + addi    r3,r3,8
 + mr      r12,r9
 + blt     cr1,3f
 +1:
 +81:      ld      r9,8(r3)
 +82:      std     r12,8(r4)
 + adde    r6,r6,r12       /* add to checksum */

(I took out the - lines to make the new code clearer.)

At this point you're doing an adde, which will add in the current
setting of the carry bit.  However, you haven't done anything to make
sure the carry will be 0 coming into the first iteration of the loop.
That was why the old version started with an addic r0,r6,0
instruction - addic affects the carry bit, and adding 0 to something
is guaranteed to have a carry out of 0, so that clears the carry bit.
The caller isn't obligated to make sure the carry bit is clear when
calling this.
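In other words, something like the old code's carry-clearing instruction
needs to go back in before the first adde (a sketch only):

 addic   r0,r6,0         /* carry out of adding 0 is always 0, so this clears CA */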

 +2:
 +83:      ldu     r12,16(r3)
 +84:      stdu    r9,16(r4)
 + adde    r6,r6,r9        /* add to checksum */
 + bdnz    1b
 +3:
 +85:      std     r12,8(r4)
 + adde    r6,r6,r12       /* add to checksum */
   beq 3f
 + addi    r4,r4,16
 + ld      r9,8(r3)

Why doesn't this ld have a label?  It needs a label so it can be in
the exception table, since it is potentially a load from user memory.
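For example (a sketch only; the label number is arbitrary and the handler name
.Lsrc_error is a placeholder for whatever the rest of the patch uses in its
exception table entries):

 95:     ld      r9,8(r3)
 ...
         .section __ex_table,"a"
         .align  3
         .llong  95b,.Lsrc_error
         .previous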

 +.Ldo_tail:
 + bf  cr7*4+1,1f
 + rotldi  r9,r9,32
 +86:  stw r9,0(r4)
 + adde    r6,r6,r9        /* add to checksum */

This is wrong; we only want to add in the bottom 32 bits of r9.
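One way to do that (untested sketch; scratch register chosen arbitrarily):

 clrldi  r0,r9,32        /* keep only the low 32 bits */
 adde    r6,r6,r0        /* add to checksum */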

 + addi    r4,r4,4
 +1:       bf      cr7*4+2,2f
 + rotldi  r9,r9,16
 +87:      sth     r9,0(r4)
 + adde    r6,r6,r9        /* add to checksum */

And here we only want to add in the bottom 16 bits of r9.
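Again, as an untested sketch:

 clrldi  r0,r9,48        /* keep only the low 16 bits */
 adde    r6,r6,r0        /* add to checksum */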

 + addi    r4,r4,2
 +2:       bf      cr7*4+3,3f
 + rotldi  r9,r9,8
 +88:      stb     r9,0(r4)
 + adde    r6,r6,r9        /* add to checksum */

And here the bottom 8 bits, but shifted left 8 bits.
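Untested sketch (rlwinm masks the low byte and shifts it left 8 in one go):

 rlwinm  r0,r9,8,16,23   /* (r9 & 0xff) << 8 */
 adde    r6,r6,r0        /* add to checksum */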

 +3:       addze   r6,r6   /* add in final carry (unlikely with 64-bit regs) */

The unlikely comment is no longer true, since we're now doing 64-bit
loads and stores.

 + rldicl  r9,r6,32,0      /* fold 64 bit value */

I realize the rldicl was in the old code, but this would be clearer as
rotldi  r9,r6,32 (which is the same instruction).

 + add     r3,r9,r6
 + srdi    r3,r3,32
 + blr                     /* return sum */
 +
 +.Lsrc_unaligned:
 + srdi    r11,r5,3
 + addi    r5,r5,-16
 + subf    r3,r0,r3
 + srdi    r7,r5,4
 + sldi    r10,r0,3
 + cmpdi   cr6,r11,3
 + andi.   r5,r5,7
 + mtctr   r7
 + subfic  r12,r10,64

This will set the carry bit, since r10 is less than 64 here.  Probably
not what we want since we're about to do some addes.
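One carry-preserving alternative (untested sketch):

 li      r12,64
 subf    r12,r10,r12     /* 64 - r10, without touching CA */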

[snip]

 + # d0=(s0<<|s1>>) in r12, s1<< in r11, s2>> in r7, s2<< in r8, s3>> in r9
 +1:   or  r7,r7,r11
 +89:  ld  r0,8(r3)
 +90:  std r12,8(r4)
 + adde    r6,r6,r12       /* add to checksum */
 +2:   subfic  r12,r10,64  /* recreate value from r10 */

Oops, we just trashed 

Re: [PATCH/RFC] 64 bit csum_partial_copy_generic

2008-10-10 Thread Joel Schopp



Thanks for doing this.  A few comments below, but first, can you
clarify what your and George Fulk's roles were in producing this?  I had
the impression George had written the code, and if that's the case,
you need to put a From: George Fulk ... line as the first line of
your mail when you re-send the patch.
  
I wrote the patch; George wrote some very useful testcases.  So a 
Tested-by: George Fulk [EMAIL PROTECTED] line would be appropriate, 
though I'm not sure that tag ever really caught on.


As for the technical comments, I agree with all of them and will 
incorporate them into the next version.



Re: [PATCH/RFC] 64 bit csum_partial_copy_generic

2008-09-11 Thread Joel Schopp



Did you consider the other alternative?  If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one).  Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.

Or maybe the bottleneck here is purely the memory bandwidth?

I think the main bottleneck is the bandwidth/latency of memory.

When I sent the patch out I hadn't thought about eliminating the carry 
(the 'e' in adde) by working on 32 bit chunks.  So I went off and tried it 
today: converting the existing function to use plain add instead of adde 
(since it was only doing 32 bits already) came out 1.5% - 15.7% faster on 
Power5, which is nice, but was still way behind the new function in 
every testcase.  I then added one level of unrolling to that (using 2 
accumulators) and got anywhere from 59% slower to 10% faster on Power5 
depending on input.  That is quite a bit slower than I would have expected 
(I would have expected basically even), but that's what got measured.  The 
comment in the existing function indicates unrolling the loop doesn't help 
because the bdnz has zero overhead, so I guess the unrolling hurt more 
than I expected.
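For anyone following along, the shape of the idea is roughly this (an 
illustrative sketch only; register choice is arbitrary and this is not the 
exact code that was measured):

        li      r7,0            /* two independent accumulators */
        li      r8,0
1:      lwz     r9,0(r3)        /* 32-bit loads zero-extend into 64-bit regs */
        lwz     r10,4(r3)
        addi    r3,r3,8
        add     r7,r7,r9        /* plain adds, no carry dependency */
        add     r8,r8,r10
        bdnz    1b
        add     r6,r7,r8        /* combine; fold 64->32 and add carries at the end */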


In any case I have now thought about it and don't think it will work out.




Signed-off-by: Joel Schopp[EMAIL PROTECTED]


You missed a space there.

If at first you don't succeed...

Signed-off-by: Joel Schopp [EMAIL PROTECTED]


[PATCH/RFC] 64 bit csum_partial_copy_generic

2008-09-10 Thread jschopp
The current 64 bit csum_partial_copy_generic function is based on the 32 
bit version and never was optimized for 64 bit.  This patch takes the 64 bit 
memcpy and adapts it to also do the sum.  It has been tested on a variety 
of input sizes and alignments on Power5 and Power6 processors.  It gives 
correct output for all cases tested.  It also runs 20-55% faster 
than the implementation it replaces, depending on size, alignment, and processor.


I think there is still some room for improvement in the unaligned case, 
but given that it is much faster than what we have now I figured I'd send 
it out.


Signed-off-by: Joel Schopp[EMAIL PROTECTED]

Index: 2.6.26/arch/powerpc/lib/checksum_64.S
===
--- 2.6.26.orig/arch/powerpc/lib/checksum_64.S
+++ 2.6.26/arch/powerpc/lib/checksum_64.S
@@ -22,8 +22,7 @@
  * len is in words and is always >= 5.
  *
  * In practice len == 5, but this is not guaranteed.  So this code does not
- * attempt to use doubleword instructions.
- */
+ * attempt to use doubleword instructions. */
 _GLOBAL(ip_fast_csum)
lwz r0,0(r3)
    lwzu    r5,4(r3)
@@ -122,108 +121,286 @@ _GLOBAL(csum_partial)
  * to *src_err or *dst_err respectively, and (for an error on
  * src) zeroes the rest of dst.
  *
- * This code needs to be reworked to take advantage of 64 bit sum+copy.
- * However, due to tokenring halfword alignment problems this will be very
- * tricky.  For now we'll leave it until we instrument it somehow.
+ * This returns a 32 bit 1s complement sum that can be folded to 16 bits and
+ * notted to produce a 16bit tcp/ip checksum.
  *
  * csum_partial_copy_generic(r3=src, r4=dst, r5=len, r6=sum, r7=src_err, r8=dst_err)
  */
 _GLOBAL(csum_partial_copy_generic)
-   addic   r0,r6,0
-   subi    r3,r3,4
-   subi    r4,r4,4
-   srwi.   r6,r5,2
-   beq     3f      /* if we're doing < 4 bytes */
-   andi.   r9,r4,2 /* Align dst to longword boundary */
-   beq+    1f
-81:        lhz     r6,4(r3)        /* do 2 bytes to get aligned */
-   addi    r3,r3,2
-   subi    r5,r5,2
-91:        sth     r6,4(r4)
-   addi    r4,r4,2
-   addc    r0,r0,r6
-   srwi.   r6,r5,2 /* # words to do */
+   std     r7,48(r1)       /* we need to save the error pointers ...*/
+   std     r8,56(r1)       /* we need to save the error pointers ...*/
+   PPC_MTOCRF      0x01,r5
+   cmpldi  cr1,r5,16
+   neg     r11,r4          # LS 3 bits = # bytes to 8-byte dest bdry
+   andi.   r11,r11,7
+   dcbt    0,r3
+   blt     cr1,.Lshort_copy
+   bne     .Ldst_unaligned
+.Ldst_aligned:
+   andi.   r0,r3,7
+   addi    r4,r4,-16
+   bne     .Lsrc_unaligned
+   srdi    r10,r5,4        /* src and dst aligned */
+80:        ld      r9,0(r3)
+   addi    r3,r3,-8
+   mtctr   r10
+   andi.   r5,r5,7
+   bf      cr7*4+0,2f
+   addi    r4,r4,8
+   addi    r3,r3,8
+   mr      r12,r9
+   blt     cr1,3f
+1:
+81:        ld      r9,8(r3)
+82:        std     r12,8(r4)
+   adde    r6,r6,r12       /* add to checksum */
+2:
+83:        ldu     r12,16(r3)
+84:        stdu    r9,16(r4)
+   adde    r6,r6,r9        /* add to checksum */
+   bdnz    1b
+3:
+85:        std     r12,8(r4)
+   adde    r6,r6,r12       /* add to checksum */
beq 3f
-1: mtctr   r6
-82:        lwzu    r6,4(r3)        /* the bdnz has zero overhead, so it should */
-92:        stwu    r6,4(r4)        /* be unnecessary to unroll this loop */
-   adde    r0,r0,r6
-   bdnz    82b
-   andi.   r5,r5,3
-3: cmpwi   0,r5,2
-   blt+    4f
-83:        lhz     r6,4(r3)
+   addi    r4,r4,16
+   ld  r9,8(r3)
+.Ldo_tail:
+   bf  cr7*4+1,1f
+   rotldi  r9,r9,32
+86:        stw     r9,0(r4)
+   adde    r6,r6,r9        /* add to checksum */
+   addi    r4,r4,4
+1: bf      cr7*4+2,2f
+   rotldi  r9,r9,16
+87:        sth     r9,0(r4)
+   adde    r6,r6,r9        /* add to checksum */
+   addi    r4,r4,2
+2: bf      cr7*4+3,3f
+   rotldi  r9,r9,8
+88:        stb     r9,0(r4)
+   adde    r6,r6,r9        /* add to checksum */
+3: addze   r6,r6           /* add in final carry (unlikely with 64-bit regs) */
+   rldicl  r9,r6,32,0      /* fold 64 bit value */
+   add     r3,r9,r6
+   srdi    r3,r3,32
+   blr /* return sum */
+
+.Lsrc_unaligned:
+   srdi    r11,r5,3
+   addi    r5,r5,-16
+   subf    r3,r0,r3
+   srdi    r7,r5,4
+   sldi    r10,r0,3
+   cmpdi   cr6,r11,3
+   andi.   r5,r5,7
+   mtctr   r7
+   subfic  r12,r10,64
+   add r5,r5,r0
+
+   bt  cr7*4+0,0f
+
+115:   ld      r9,0(r3)        # 3+2n loads, 2+2n stores
+116:   ld  r0,8(r3)
+   sld r11,r9,r10
+117:   ldu r9,16(r3)
+   srd r7,r0,r12
+   sld r8,r0,r10
+   or  r7,r7,r11
+   blt cr6,4f
+118:   ld  r0,8(r3)
+