Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Segher Boessenkool
On Thu, May 24, 2018 at 10:18:44AM +, Christophe Leroy wrote:
> On 05/24/2018 06:20 AM, Christophe LEROY wrote:
> >On 23/05/2018 at 20:34, Segher Boessenkool wrote:
> >>On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>>The generic csum_ipv6_magic() generates a pretty bad result
> >>
> >>
> >>
> >>Please try with a more recent compiler, what you used is pretty ancient.
> >>It's not like recent compilers do great on this either, but it's not
> >>*that* bad anymore ;-)
> 
> Here is what I get with GCC 8.1
> It doesn't look much better, does it ?

There are no more mfocrf, which is a big speedup.  Other than that it is
pretty lousy still, I totally agree.  This improvement happened quite a
while ago, it's fixed in GCC 6.


Segher


Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Segher Boessenkool
On Thu, May 24, 2018 at 08:20:16AM +0200, Christophe LEROY wrote:
> On 23/05/2018 at 20:34, Segher Boessenkool wrote:
> >On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>+_GLOBAL(csum_ipv6_magic)
> >>+   lwz r8, 0(r3)
> >>+   lwz r9, 4(r3)
> >>+   lwz r10, 8(r3)
> >>+   lwz r11, 12(r3)
> >>+   addc    r0, r5, r6
> >>+   adde    r0, r0, r7
> >>+   adde    r0, r0, r8
> >>+   adde    r0, r0, r9
> >>+   adde    r0, r0, r10
> >>+   adde    r0, r0, r11
> >>+   lwz r8, 0(r4)
> >>+   lwz r9, 4(r4)
> >>+   lwz r10, 8(r4)
> >>+   lwz r11, 12(r4)
> >>+   adde    r0, r0, r8
> >>+   adde    r0, r0, r9
> >>+   adde    r0, r0, r10
> >>+   adde    r0, r0, r11
> >>+   addze   r0, r0
> >>+   rotlwi  r3, r0, 16
> >>+   add r3, r0, r3
> >>+   not r3, r3
> >>+   rlwinm  r3, r3, 16, 16, 31
> >>+   blr
> >>+EXPORT_SYMBOL(csum_ipv6_magic)
> >
> >Clustering the loads and carry insns together is pretty much the worst you
> >can do on most 32-bit CPUs.
> 
> Oh, really ? __csum_partial is written that way too.

I thought I told you about this before?  Maybe not.

> Right, now I tried interleaving the lwz and adde. I get no improvement at
> all on a 885, but I get a 15% improvement on a 8321.

It won't likely help on single-issue cores (like the one 885 has), yes.


Segher
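
For readers following the arithmetic rather than the scheduling question, here is a rough C model of what the quoted routine appears to compute. It is only a sketch: the names add_carry and csum_ipv6_magic_model are made up for illustration, the register mapping (len in r5, proto in r6, sum in r7) is read off the quoted patch, and the address words are assumed to be passed as a big-endian CPU would load them with lwz.

```c
#include <stdint.h>

/* Model of PowerPC addc/adde: add a, b and the incoming carry,
 * then store the carry out of bit 31 back into *ca. */
static uint32_t add_carry(uint32_t a, uint32_t b, unsigned *ca)
{
    uint64_t t = (uint64_t)a + b + *ca;
    *ca = (unsigned)(t >> 32);
    return (uint32_t)t;
}

/* Hypothetical C model of the posted routine (not kernel code).
 * saddr/daddr hold the four 32-bit words of each address as a
 * big-endian CPU would see them. */
static uint16_t csum_ipv6_magic_model(const uint32_t saddr[4],
                                      const uint32_t daddr[4],
                                      uint32_t len, uint8_t proto,
                                      uint32_t sum)
{
    unsigned ca = 0;                        /* addc ignores the old carry */
    uint32_t r0;
    int i;

    r0 = add_carry(len, proto, &ca);        /* addc  r0, r5, r6 */
    r0 = add_carry(r0, sum, &ca);           /* adde  r0, r0, r7 */
    for (i = 0; i < 4; i++)
        r0 = add_carry(r0, saddr[i], &ca);  /* adde with the saddr words */
    for (i = 0; i < 4; i++)
        r0 = add_carry(r0, daddr[i], &ca);  /* adde with the daddr words */
    r0 = add_carry(r0, 0, &ca);             /* addze r0, r0 */

    /* rotlwi + add + not + rlwinm: fold to 16 bits and complement */
    r0 += (r0 << 16) | (r0 >> 16);
    return (uint16_t)(~r0 >> 16);
}
```

With all-zero addresses, len 40, proto 6 and sum 0 the model returns 0xFFD1, the one's complement of 0x002E. The model says nothing about instruction scheduling, which is what the interleaving remark above is about.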


Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Christophe Leroy



On 05/24/2018 06:20 AM, Christophe LEROY wrote:
>On 23/05/2018 at 20:34, Segher Boessenkool wrote:
>>On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
>>>The generic csum_ipv6_magic() generates a pretty bad result
>>
>>Please try with a more recent compiler, what you used is pretty ancient.
>>It's not like recent compilers do great on this either, but it's not
>>*that* bad anymore ;-)

Here is what I get with GCC 8.1
It doesn't look much better, does it?


net/ipv6/ip6_checksum.o: file format elf32-powerpc


Disassembly of section .text:

 <csum_ipv6_magic>:
   0:   94 21 ff f0 stwu    r1,-16(r1)
   4:   80 04 00 00 lwz r0,0(r4)
   8:   81 64 00 04 lwz r11,4(r4)
   c:   81 04 00 08 lwz r8,8(r4)
  10:   93 e1 00 0c stw r31,12(r1)
  14:   81 43 00 00 lwz r10,0(r3)
  18:   83 e3 00 04 lwz r31,4(r3)
  1c:   81 23 00 08 lwz r9,8(r3)
  20:   81 83 00 0c lwz r12,12(r3)
  24:   7c ea 3a 14 add r7,r10,r7
  28:   7d 4a 38 10 subfc   r10,r10,r7
  2c:   7c ff 3a 14 add r7,r31,r7
  30:   81 44 00 0c lwz r10,12(r4)
  34:   7c 63 19 10 subfe   r3,r3,r3
  38:   7c 63 38 50 subf    r3,r3,r7
  3c:   7f ff 18 10 subfc   r31,r31,r3
  40:   7c e9 1a 14 add r7,r9,r3
  44:   83 e1 00 0c lwz r31,12(r1)
  48:   7c 63 19 10 subfe   r3,r3,r3
  4c:   38 21 00 10 addi    r1,r1,16
  50:   7c 63 38 50 subf    r3,r3,r7
  54:   7d 29 18 10 subfc   r9,r9,r3
  58:   7d 2c 1a 14 add r9,r12,r3
  5c:   7c 63 19 10 subfe   r3,r3,r3
  60:   7c 63 48 50 subf    r3,r3,r9
  64:   7d 8c 18 10 subfc   r12,r12,r3
  68:   7d 20 1a 14 add r9,r0,r3
  6c:   7c 63 19 10 subfe   r3,r3,r3
  70:   7c 63 48 50 subf    r3,r3,r9
  74:   7c 00 18 10 subfc   r0,r0,r3
  78:   7d 2b 1a 14 add r9,r11,r3
  7c:   7c 63 19 10 subfe   r3,r3,r3
  80:   7c 63 48 50 subf    r3,r3,r9
  84:   7d 6b 18 10 subfc   r11,r11,r3
  88:   7d 28 1a 14 add r9,r8,r3
  8c:   7c 63 19 10 subfe   r3,r3,r3
  90:   7c 63 48 50 subf    r3,r3,r9
  94:   7d 08 18 10 subfc   r8,r8,r3
  98:   7d 2a 1a 14 add r9,r10,r3
  9c:   7c 63 19 10 subfe   r3,r3,r3
  a0:   7c 63 48 50 subf    r3,r3,r9
  a4:   7d 4a 18 10 subfc   r10,r10,r3
  a8:   7d 23 2a 14 add r9,r3,r5
  ac:   7c 63 19 10 subfe   r3,r3,r3
  b0:   7c 63 48 50 subf    r3,r3,r9
  b4:   7c a5 18 10 subfc   r5,r5,r3
  b8:   7c 63 32 14 add r3,r3,r6
  bc:   7d 29 49 10 subfe   r9,r9,r9
  c0:   7d 29 18 50 subf    r9,r9,r3
  c4:   7c c6 48 10 subfc   r6,r6,r9
  c8:   7c 63 19 10 subfe   r3,r3,r3
  cc:   7c 63 48 50 subf    r3,r3,r9
  d0:   54 69 80 3e rotlwi  r9,r3,16
  d4:   7c 63 4a 14 add r3,r3,r9
  d8:   7c 63 18 f8 not r3,r3
  dc:   54 63 84 3e rlwinm  r3,r3,16,16,31
  e0:   4e 80 00 20 blr

net/ipv6/ip6_checksum.o: file format elf64-powerpc


Disassembly of section .text:

 <.csum_ipv6_magic>:
   0:   fb e1 ff f8 std r31,-8(r1)
   4:   81 43 00 00 lwz r10,0(r3)
   8:   81 83 00 04 lwz r12,4(r3)
   c:   81 23 00 08 lwz r9,8(r3)
  10:   80 03 00 0c lwz r0,12(r3)
  14:   7c e7 52 14 add r7,r7,r10
  18:   80 64 00 08 lwz r3,8(r4)
  1c:   81 04 00 00 lwz r8,0(r4)
  20:   78 ff 00 20 clrldi  r31,r7,32
  24:   7c ec 3a 14 add r7,r12,r7
  28:   81 64 00 04 lwz r11,4(r4)
  2c:   7f ea f8 50 subf    r31,r10,r31
  30:   81 44 00 0c lwz r10,12(r4)
  34:   7b ff 0f e0 rldicl  r31,r31,1,63
  38:   7c ff 3a 14 add r7,r31,r7
  3c:   eb e1 ff f8 ld  r31,-8(r1)
  40:   78 e4 00 20 clrldi  r4,r7,32
  44:   7c e9 3a 14 add r7,r9,r7
  48:   7d 8c 20 50 subf    r12,r12,r4
  4c:   79 8c 0f e0 rldicl  r12,r12,1,63
  50:   7d 8c 3a 14 add r12,r12,r7
  54:   79 87 00 20 clrldi  r7,r12,32
  58:   7d 80 62 14 add r12,r0,r12
  5c:   7d 29 38 50 subf    r9,r9,r7
  60:   79 29 0f e0 rldicl  r9,r9,1,63
  64:   7d 29 62 14 add r9,r9,r12
  68:   79 27 00 20 clrldi  r7,r9,32
  6c:   7d 28 4a 14 add r9,r8,r9
  70:   7c 00 38 50 subf    r0,r0,r7
  74:   78 00 0f e0 rldicl  r0,r0,1,63
  78:   7c 00 4a 14 add r0,r0,r9
  7c:   78 09 00 20 clrldi  r9,r0,32
  80:   7c 0b 02 14 add r0,r11,r0
  84:   7d 08 48 50 subf    r8,r8,r9
  88:   79 08 0f e0 rldicl  r8,r8,1,63
  8c:   7d 08 02 14 add r8,r8,r0
  90:   79 09 00 20 clrldi  r9,r8,32
  94:   7d 03 42 14 add r8,r3,r8
  98:   7d 2b 48 50 subf    r9,r11,r9
  9c:   79 29 0f e0 rldicl  r9,r9,1,63
  a0:   7d 29 42 14 add r9,r9,r8
  a4:   79 28 00 20 clrldi  r8,r9,32
  a8:   7d 2a 4a 14 add r9,r10,r9
  ac:   7d 03 40 50 subf    r8,r3,r8
  b0:   79 08 0f e0 rldicl  r8,r8,1,63
  b4:   7d 08 4a 14 add r8,r8,r9
  b8:   

Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Christophe LEROY



On 23/05/2018 at 20:34, Segher Boessenkool wrote:
>On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
>>The generic csum_ipv6_magic() generates a pretty bad result
>
>Please try with a more recent compiler, what you used is pretty ancient.
>It's not like recent compilers do great on this either, but it's not
>*that* bad anymore ;-)
>
>>--- a/arch/powerpc/lib/checksum_32.S
>>+++ b/arch/powerpc/lib/checksum_32.S
>>@@ -293,3 +293,36 @@ dst_error:
>>   EX_TABLE(51b, dst_error);
>> 
>> EXPORT_SYMBOL(csum_partial_copy_generic)
>>+
>>+/*
>>+ * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
>>+ *   const struct in6_addr *daddr,
>>+ *   __u32 len, __u8 proto, __wsum sum)
>>+ */
>>+
>>+_GLOBAL(csum_ipv6_magic)
>>+   lwz r8, 0(r3)
>>+   lwz r9, 4(r3)
>>+   lwz r10, 8(r3)
>>+   lwz r11, 12(r3)
>>+   addc    r0, r5, r6
>>+   adde    r0, r0, r7
>>+   adde    r0, r0, r8
>>+   adde    r0, r0, r9
>>+   adde    r0, r0, r10
>>+   adde    r0, r0, r11
>>+   lwz r8, 0(r4)
>>+   lwz r9, 4(r4)
>>+   lwz r10, 8(r4)
>>+   lwz r11, 12(r4)
>>+   adde    r0, r0, r8
>>+   adde    r0, r0, r9
>>+   adde    r0, r0, r10
>>+   adde    r0, r0, r11
>>+   addze   r0, r0
>>+   rotlwi  r3, r0, 16
>>+   add r3, r0, r3
>>+   not r3, r3
>>+   rlwinm  r3, r3, 16, 16, 31
>>+   blr
>>+EXPORT_SYMBOL(csum_ipv6_magic)
>
>Clustering the loads and carry insns together is pretty much the worst you
>can do on most 32-bit CPUs.

Oh, really? __csum_partial is written that way too.

Right, now I tried interleaving the lwz and adde. I get no improvement at
all on a 885, but I get a 15% improvement on a 8321.

Christophe

>
>Segher



Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-23 Thread Segher Boessenkool
On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> The generic csum_ipv6_magic() generates a pretty bad result



Please try with a more recent compiler, what you used is pretty ancient.
It's not like recent compilers do great on this either, but it's not
*that* bad anymore ;-)

> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -293,3 +293,36 @@ dst_error:
>   EX_TABLE(51b, dst_error);
>  
>  EXPORT_SYMBOL(csum_partial_copy_generic)
> +
> +/*
> + * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> + * const struct in6_addr *daddr,
> + * __u32 len, __u8 proto, __wsum sum)
> + */
> +
> +_GLOBAL(csum_ipv6_magic)
> + lwz r8, 0(r3)
> + lwz r9, 4(r3)
> + lwz r10, 8(r3)
> + lwz r11, 12(r3)
> + addc    r0, r5, r6
> + adde    r0, r0, r7
> + adde    r0, r0, r8
> + adde    r0, r0, r9
> + adde    r0, r0, r10
> + adde    r0, r0, r11
> + lwz r8, 0(r4)
> + lwz r9, 4(r4)
> + lwz r10, 8(r4)
> + lwz r11, 12(r4)
> + adde    r0, r0, r8
> + adde    r0, r0, r9
> + adde    r0, r0, r10
> + adde    r0, r0, r11
> + addze   r0, r0
> + rotlwi  r3, r0, 16
> + add r3, r0, r3
> + not r3, r3
> + rlwinm  r3, r3, 16, 16, 31
> + blr
> +EXPORT_SYMBOL(csum_ipv6_magic)

Clustering the loads and carry insns together is pretty much the worst you
can do on most 32-bit CPUs.


Segher


[PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-22 Thread Christophe Leroy
The generic csum_ipv6_magic() generates a pretty bad result

 <csum_ipv6_magic>: (PPC32)
   0:   81 23 00 00 lwz r9,0(r3)
   4:   81 03 00 04 lwz r8,4(r3)
   8:   7c e7 4a 14 add r7,r7,r9
   c:   7d 29 38 10 subfc   r9,r9,r7
  10:   7d 4a 51 10 subfe   r10,r10,r10
  14:   7d 27 42 14 add r9,r7,r8
  18:   7d 2a 48 50 subf    r9,r10,r9
  1c:   80 e3 00 08 lwz r7,8(r3)
  20:   7d 08 48 10 subfc   r8,r8,r9
  24:   7d 4a 51 10 subfe   r10,r10,r10
  28:   7d 29 3a 14 add r9,r9,r7
  2c:   81 03 00 0c lwz r8,12(r3)
  30:   7d 2a 48 50 subf    r9,r10,r9
  34:   7c e7 48 10 subfc   r7,r7,r9
  38:   7d 4a 51 10 subfe   r10,r10,r10
  3c:   7d 29 42 14 add r9,r9,r8
  40:   7d 2a 48 50 subf    r9,r10,r9
  44:   80 e4 00 00 lwz r7,0(r4)
  48:   7d 08 48 10 subfc   r8,r8,r9
  4c:   7d 4a 51 10 subfe   r10,r10,r10
  50:   7d 29 3a 14 add r9,r9,r7
  54:   7d 2a 48 50 subf    r9,r10,r9
  58:   81 04 00 04 lwz r8,4(r4)
  5c:   7c e7 48 10 subfc   r7,r7,r9
  60:   7d 4a 51 10 subfe   r10,r10,r10
  64:   7d 29 42 14 add r9,r9,r8
  68:   7d 2a 48 50 subf    r9,r10,r9
  6c:   80 e4 00 08 lwz r7,8(r4)
  70:   7d 08 48 10 subfc   r8,r8,r9
  74:   7d 4a 51 10 subfe   r10,r10,r10
  78:   7d 29 3a 14 add r9,r9,r7
  7c:   7d 2a 48 50 subf    r9,r10,r9
  80:   81 04 00 0c lwz r8,12(r4)
  84:   7c e7 48 10 subfc   r7,r7,r9
  88:   7d 4a 51 10 subfe   r10,r10,r10
  8c:   7d 29 42 14 add r9,r9,r8
  90:   7d 2a 48 50 subf    r9,r10,r9
  94:   7d 08 48 10 subfc   r8,r8,r9
  98:   7d 4a 51 10 subfe   r10,r10,r10
  9c:   7d 29 2a 14 add r9,r9,r5
  a0:   7d 2a 48 50 subf    r9,r10,r9
  a4:   7c a5 48 10 subfc   r5,r5,r9
  a8:   7c 63 19 10 subfe   r3,r3,r3
  ac:   7d 29 32 14 add r9,r9,r6
  b0:   7d 23 48 50 subf    r9,r3,r9
  b4:   7c c6 48 10 subfc   r6,r6,r9
  b8:   7c 63 19 10 subfe   r3,r3,r3
  bc:   7c 63 48 50 subf    r3,r3,r9
  c0:   54 6a 80 3e rotlwi  r10,r3,16
  c4:   7c 63 52 14 add r3,r3,r10
  c8:   7c 63 18 f8 not r3,r3
  cc:   54 63 84 3e rlwinm  r3,r3,16,16,31
  d0:   4e 80 00 20 blr

 <.csum_ipv6_magic>: (PPC64)
   0:   81 23 00 00 lwz r9,0(r3)
   4:   80 03 00 04 lwz r0,4(r3)
   8:   81 63 00 08 lwz r11,8(r3)
   c:   7c e7 4a 14 add r7,r7,r9
  10:   7f 89 38 40 cmplw   cr7,r9,r7
  14:   7d 47 02 14 add r10,r7,r0
  18:   7d 30 10 26 mfocrf  r9,1
  1c:   55 29 f7 fe rlwinm  r9,r9,30,31,31
  20:   7d 4a 4a 14 add r10,r10,r9
  24:   7f 80 50 40 cmplw   cr7,r0,r10
  28:   7d 2a 5a 14 add r9,r10,r11
  2c:   80 03 00 0c lwz r0,12(r3)
  30:   81 44 00 00 lwz r10,0(r4)
  34:   7d 10 10 26 mfocrf  r8,1
  38:   55 08 f7 fe rlwinm  r8,r8,30,31,31
  3c:   7d 29 42 14 add r9,r9,r8
  40:   81 04 00 04 lwz r8,4(r4)
  44:   7f 8b 48 40 cmplw   cr7,r11,r9
  48:   7d 29 02 14 add r9,r9,r0
  4c:   7d 70 10 26 mfocrf  r11,1
  50:   55 6b f7 fe rlwinm  r11,r11,30,31,31
  54:   7d 29 5a 14 add r9,r9,r11
  58:   7f 80 48 40 cmplw   cr7,r0,r9
  5c:   7d 29 52 14 add r9,r9,r10
  60:   7c 10 10 26 mfocrf  r0,1
  64:   54 00 f7 fe rlwinm  r0,r0,30,31,31
  68:   7d 69 02 14 add r11,r9,r0
  6c:   7f 8a 58 40 cmplw   cr7,r10,r11
  70:   7c 0b 42 14 add r0,r11,r8
  74:   81 44 00 08 lwz r10,8(r4)
  78:   7c f0 10 26 mfocrf  r7,1
  7c:   54 e7 f7 fe rlwinm  r7,r7,30,31,31
  80:   7c 00 3a 14 add r0,r0,r7
  84:   7f 88 00 40 cmplw   cr7,r8,r0
  88:   7d 20 52 14 add r9,r0,r10
  8c:   80 04 00 0c lwz r0,12(r4)
  90:   7d 70 10 26 mfocrf  r11,1
  94:   55 6b f7 fe rlwinm  r11,r11,30,31,31
  98:   7d 29 5a 14 add r9,r9,r11
  9c:   7f 8a 48 40 cmplw   cr7,r10,r9
  a0:   7d 29 02 14 add r9,r9,r0
  a4:   7d 70 10 26 mfocrf  r11,1
  a8:   55 6b f7 fe rlwinm  r11,r11,30,31,31
  ac:   7d 29 5a 14 add r9,r9,r11
  b0:   7f 80 48 40 cmplw   cr7,r0,r9
  b4:   7d 29 2a 14 add r9,r9,r5
  b8:   7c 10 10 26 mfocrf  r0,1
  bc:   54 00 f7 fe rlwinm  r0,r0,30,31,31
  c0:   7d 29 02 14 add r9,r9,r0
  c4:   7f 85 48 40 cmplw   cr7,r5,r9
  c8:   7c 09 32 14 add r0,r9,r6
  cc:   7d 50 10 26 mfocrf  r10,1
  d0:   55 4a f7 fe rlwinm  r10,r10,30,31,31
  d4:   7c 00 52 14 add r0,r0,r10
  d8:   7f 80 30 40 cmplw   cr7,r0,r6
  dc:   7d 30 10 26 mfocrf  r9,1
  e0:   55 29 ef fe rlwinm  r9,r9,29,31,31
  e4:   7c 09 02 14 add r0,r9,r0
  e8:   54 03 80 3e rotlwi  r3,r0,16
  ec:   7c 03 02 14 add r0,r3,r0
  f0:   7c 03 00 f8 not r3,r0
  f4:   78 63 84 22 rldicl  r3,r3,48,48
  f8:   4e 80 00 20 blr

This patch implements it in assembly
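
For context, the repeated add/subfc/subfe/subf groups in the PPC32 listing (and the cmplw/mfocrf/rlwinm groups in the PPC64 one) look like what a compiler produces when the carry has to be recovered with an unsigned compare in portable C. The snippet below is a minimal sketch of that idiom, not the kernel source verbatim; the helper name add32_with_carry is made up for illustration.

```c
#include <stdint.h>

/* One step of a 32-bit one's-complement accumulation in portable C.
 * The carry flag is not visible from C, so it is recovered by testing
 * whether the wrapped sum ended up smaller than the operand just added. */
static uint32_t add32_with_carry(uint32_t sum, uint32_t x)
{
    sum += x;
    sum += (sum < x);   /* adds 1 if the addition wrapped, else 0 */
    return sum;
}
```

Each such step has to materialise the carry in a general-purpose register before adding it back, whereas a chain of adde instructions keeps it in the carry bit the whole time.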