RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread David Laight
From: Eric Dumazet
> Sent: 05 January 2016 22:19
> To: Tom Herbert
> You might add a comment telling that the '4' comes from the length of the
> 'adcq 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate for
> 'adcq 0*8(%rdi),%rax' using only 3 bytes.
> 
> We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4-byte
> instruction and remove the nop.
> 
> 
> +   lea     20f(, %rcx, 4), %r11
> +   clc
> +   jmp     *%r11
> +
> +.align 8
> +   adcq    6*8(%rdi),%rax
> +   adcq    5*8(%rdi),%rax
> +   adcq    4*8(%rdi),%rax
> +   adcq    3*8(%rdi),%rax
> +   adcq    2*8(%rdi),%rax
> +   adcq    1*8(%rdi),%rax
> +   adcq    0*8(%rdi),%rax  // could force a 4-byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> +   nop
> +20: /* #quads % 8 jump table base */

Or move the label to the top (after the .align) and adjust the maths.
You could add a second label after the first adcq and use the
difference between them for the '4', as sketched below.
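
A minimal sketch of that idea (hypothetical local labels 71/72; this
assumes the assembler accepts a label difference in the scale position,
which conveniently also makes it reject any entry size other than
1, 2, 4 or 8):

    lea     20f(, %rcx, 72f - 71f), %r11    /* scale == table entry size */
    clc
    jmp     *%r11

    .align 8
71: adcq    6*8(%rdi),%rax
72: adcq    5*8(%rdi),%rax          /* 72f - 71f == 4 bytes */
    adcq    4*8(%rdi),%rax
    adcq    3*8(%rdi),%rax
    adcq    2*8(%rdi),%rax
    adcq    1*8(%rdi),%rax
    .byte   0x48, 0x13, 0x47, 0x00  /* adcq 0*8(%rdi),%rax in 4 bytes */
20: /* #quads % 8 jump table base */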

David



Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Eric Dumazet
On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> From: Eric Dumazet
> > Sent: 05 January 2016 22:19
> > To: Tom Herbert
> > You might add a comment telling that the '4' comes from the length of the
> > 'adcq 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate for
> > 'adcq 0*8(%rdi),%rax' using only 3 bytes.
> > 
> > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4-byte
> > instruction and remove the nop.
> > 
> > 
> > +   lea     20f(, %rcx, 4), %r11
> > +   clc
> > +   jmp     *%r11
> > +
> > +.align 8
> > +   adcq    6*8(%rdi),%rax
> > +   adcq    5*8(%rdi),%rax
> > +   adcq    4*8(%rdi),%rax
> > +   adcq    3*8(%rdi),%rax
> > +   adcq    2*8(%rdi),%rax
> > +   adcq    1*8(%rdi),%rax
> > +   adcq    0*8(%rdi),%rax  // could force a 4-byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > +   nop
> > +20: /* #quads % 8 jump table base */
> 
> Or move the label to the top (after the .align) and adjust the maths.
> You could add a second label after the first adcq and use the
> difference between them for the '4'.

Not really.

We could not use the trick if the length was 5.

Only 1, 2, 4 and 8 are supported.




RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread David Laight
From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling that the '4' comes from the length of the
> > > 'adcq 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate for
> > > 'adcq 0*8(%rdi),%rax' using only 3 bytes.
> > >
> > > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4-byte
> > > instruction and remove the nop.
> > >
> > >
> > > +   lea     20f(, %rcx, 4), %r11
> > > +   clc
> > > +   jmp     *%r11
> > > +
> > > +.align 8
> > > +   adcq    6*8(%rdi),%rax
> > > +   adcq    5*8(%rdi),%rax
> > > +   adcq    4*8(%rdi),%rax
> > > +   adcq    3*8(%rdi),%rax
> > > +   adcq    2*8(%rdi),%rax
> > > +   adcq    1*8(%rdi),%rax
> > > +   adcq    0*8(%rdi),%rax  // could force a 4-byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > > +   nop
> > > +20: /* #quads % 8 jump table base */
> >
> > Or move the label to the top (after the .align) and adjust the maths.
> > You could add a second label after the first adcq and use the
> > difference between them for the '4'.
> 
> Not really.
> 
> We could not use the trick if the length was 5.
> 
> Only 1, 2, 4 and 8 are supported.

Indeed, and 'lea 20f(, %rcx, 5), %r11' will generate an error from the
assembler.
Seems appropriate to get the assembler to verify this for you.

Assuming this code block is completely skipped for aligned lengths,
the nop isn't needed, provided the '20:' label is in the right place.

Someone also pointed out that the code is memory limited (dual add
chains making no difference), so why is it unrolled at all?

OTOH I'm sure I remember something about false dependencies on the
x86 flags register because of instructions only changing some bits.
So it might be that you can't (or couldn't) get concurrency between
instructions that update different parts of the flags register.
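
For illustration, the classic partial-flags case (hypothetical snippet;
'inc' writes OF/SF/ZF/AF/PF but leaves CF untouched, so a later CF
consumer has to merge flag bits from two different producers, which
stalled older Intel cores):

    add     %rbx, %rax  /* writes all arithmetic flags, incl. CF */
    inc     %rcx        /* writes OF/SF/ZF/AF/PF but NOT CF */
    adc     %rdx, %rax  /* reads CF: needs a partial-flags merge */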

David




Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Eric Dumazet
On Wed, 2016-01-06 at 14:49 +, David Laight wrote:

> Someone also pointed out that the code is memory limited (dual add
> chains making no difference), so why is it unrolled at all?

Because it matters if the data is already present in CPU caches.

So why not unroll if it helps in some situations?








Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Hannes Frederic Sowa

Hi Tom,

On 05.01.2016 19:41, Tom Herbert wrote:

--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,147 @@
+/* Copyright 2016 Tom Herbert 
+ *
+ * Checksum partial calculation
+ *
+ * __wsum csum_partial(const void *buff, int len, __wsum sum)
+ *
+ * Computes the checksum of a memory block at buff, length len,
+ * and adds in "sum" (32-bit)
+ *
+ * Returns a 32-bit number suitable for feeding into itself
+ * or csum_tcpudp_magic
+ *
+ * Register usage:
+ *   %rdi: argument 1, buff
+ *   %rsi: argument 2, length
+ *   %rdx: argument 3, add in value


I think you forgot to carry-add-in the %rdx register.

The assembly code replaces do_csum but not csum_partial.
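
For reference, the add-in lives in the existing C code as a thin wrapper
around do_csum() (roughly, from arch/x86/lib/csum-partial_64.c):

__wsum csum_partial(const void *buff, int len, __wsum sum)
{
	/* do_csum() checksums the buffer; the caller-supplied sum is
	 * folded in afterwards with an end-around carry */
	return (__force __wsum)add32_with_carry(do_csum(buff, len),
						(__force u32)sum);
}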

Bye,
Hannes



Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Tom Herbert
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa
 wrote:
> Hi Tom,
>
> On 05.01.2016 19:41, Tom Herbert wrote:
>>
>> --- /dev/null
>> +++ b/arch/x86/lib/csum-partial_64.S
>> @@ -0,0 +1,147 @@
>> +/* Copyright 2016 Tom Herbert 
>> + *
>> + * Checksum partial calculation
>> + *
>> + * __wsum csum_partial(const void *buff, int len, __wsum sum)
>> + *
>> + * Computes the checksum of a memory block at buff, length len,
>> + * and adds in "sum" (32-bit)
>> + *
>> + * Returns a 32-bit number suitable for feeding into itself
>> + * or csum_tcpudp_magic
>> + *
>> + * Register usage:
>> + *   %rdi: argument 1, buff
>> + *   %rsi: argument 2, length
>> + *   %rdx: argument 3, add in value
>
>
> I think you forgot to carry-add-in the %rdx register.
>
> The assembly code replaces do_csum but not csum_partial.
>
First instruction is: movl %edx, %eax  /* Initialize with initial sum argument */

Tom

> Bye,
> Hannes
>


Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Hannes Frederic Sowa

On 07.01.2016 03:36, Tom Herbert wrote:

On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa
 wrote:

Hi Tom,

On 05.01.2016 19:41, Tom Herbert wrote:


--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,147 @@
+/* Copyright 2016 Tom Herbert 
+ *
+ * Checksum partial calculation
+ *
+ * __wsum csum_partial(const void *buff, int len, __wsum sum)
+ *
+ * Computes the checksum of a memory block at buff, length len,
+ * and adds in "sum" (32-bit)
+ *
+ * Returns a 32-bit number suitable for feeding into itself
+ * or csum_tcpudp_magic
+ *
+ * Register usage:
+ *   %rdi: argument 1, buff
+ *   %rsi: argument 2, length
+ *   %rdx: argument 3, add in value



I think you forgot to carry-add-in the %rdx register.

The assembly code replaces do_csum but not csum_partial.


First instruction is: movl %edx, %eax  /* Initialize with initial sum argument */


Oops, sorry. I only grepped through the source. Meanwhile my test also
uses non-zero sums and shows that it is fine.


Bye,
Hannes




Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Andi Kleen
Tom Herbert  writes:

> Also, we don't do anything special for alignment, unaligned
> accesses on x86 do not appear to be a performance issue.

This is not true on Atom CPUs.

Also on most CPUs there is still a larger penalty when crossing
cache lines.

> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:

You forgot to state the CPU. The results likely depend heavily
on the microarchitecture.

The original C code was optimized for K8 FWIW.

Overall your assembly looks similar to the C code, except for the jump
table. A jump table has the disadvantage that it is much harder to
branch-predict, with a large penalty if it is mispredicted.

I would expect it to be slower for cases where the length
changes frequently. Did you benchmark that case?

-Andi


Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Eric Dumazet
On Wed, 2016-01-06 at 00:35 +0100, Hannes Frederic Sowa wrote:

> 
> Tom, did you have a look at whether it makes sense to add a second
> carry-addition train with the adcx/adox instructions, where adox does
> not signal the carry via the carry flag but via the overflow flag?
> These instructions should not have any dependencies on the adc
> instructions and could help the CPU parallelize the code even more
> (increased instructions per cycle).

I guess adcx/adox would (possibly) bring improvements for large areas, but
for this case the bottleneck is bringing data from memory.

Note this topic was discussed 2 years ago and no conclusive action was
taken.

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg529610.html







Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Eric Dumazet
On Tue, 2016-01-05 at 17:10 -0800, H. Peter Anvin wrote:

> Apparently "adcq.d8" will do The Right Thing for this.

Nice trick ;)




Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Eric Dumazet
On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.

Very nice!


You might add a comment telling that the '4' comes from the length of the
'adcq 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate for
'adcq 0*8(%rdi),%rax' using only 3 bytes.

We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4-byte
instruction and remove the nop.


+   lea     20f(, %rcx, 4), %r11
+   clc
+   jmp     *%r11
+
+.align 8
+   adcq    6*8(%rdi),%rax
+   adcq    5*8(%rdi),%rax
+   adcq    4*8(%rdi),%rax
+   adcq    3*8(%rdi),%rax
+   adcq    2*8(%rdi),%rax
+   adcq    1*8(%rdi),%rax
+   adcq    0*8(%rdi),%rax  // could force a 4-byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
+   nop
+20: /* #quads % 8 jump table base */


Acked-by: Eric Dumazet 




Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread H. Peter Anvin
On 01/05/2016 02:18 PM, Eric Dumazet wrote:
> On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>> conversion.
> 
> Very nice!
> 
> 
> You might add a comment telling that the '4' comes from the length of the
> 'adcq 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate for
> 'adcq 0*8(%rdi),%rax' using only 3 bytes.
> 
> We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4-byte
> instruction and remove the nop.
> 

Apparently "adcq.d8" will do The Right Thing for this.

-hpa



Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Hannes Frederic Sowa

Hi,

On 05.01.2016 19:41, Tom Herbert wrote:

Implement assembly routine for csum_partial for 64 bit x86. This
primarily speeds up checksum calculation for smaller lengths such as
those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
conversion.

This implementation is similar to csum_partial implemented in
checksum_32.S, however since we are dealing with 8 bytes at a time
there are more cases for small lengths-- for that we employ a jump
table. Also, we don't do anything special for alignment, unaligned
accesses on x86 do not appear to be a performance issue.

Testing:

Verified correctness by testing arbitrary length buffer filled with
random data. For each buffer I compared the computed checksum
using the original algorithm for each possible alignment (0-7 bytes).

Checksum performance:

Isolating old and new implementation for some common cases:

Case                    Old (nsecs)   New (nsecs)   Improvement
--------------------- ------------- ------------- -----------------------------
1400 bytes (0 align)    194.5         174.3         10%  (Big packet)
40 bytes (0 align)      13.8          5.8           57%  (IPv6 hdr common case)
8 bytes (4 align)       8.4           2.9           65%  (UDP, VXLAN in IPv4)
14 bytes (0 align)      10.6          5.8           45%  (Eth hdr)
14 bytes (4 align)      10.8          5.8           46%  (Eth hdr in IPv4)

Signed-off-by: Tom Herbert 


Also,

Acked-by: Hannes Frederic Sowa 

Tested with the same test cases as the old patch; it showed no problems
and the same improvements.
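
A user-space harness along those lines might look like the sketch below
(csum_partial_new() is a hypothetical stand-in for the routine under
test; results are folded to 16 bits before comparing, since two valid
implementations may differ in their unfolded 32-bit form):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* routine under test -- hypothetical name, linked in separately */
extern uint32_t csum_partial_new(const void *buff, int len, uint32_t sum);

/* straightforward reference: 64-bit accumulation of 16-bit LE words */
static uint32_t csum_ref(const uint8_t *p, int len, uint32_t sum)
{
	uint64_t acc = sum;

	while (len > 1) {
		acc += p[0] | (p[1] << 8);
		p += 2;
		len -= 2;
	}
	if (len)
		acc += p[0];		/* trailing byte */
	while (acc >> 32)		/* end-around carry to 32 bits */
		acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* fold a 32-bit partial checksum down to 16 bits */
static uint16_t fold16(uint32_t s)
{
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

int main(void)
{
	uint8_t buf[2048 + 8];

	for (size_t i = 0; i < sizeof(buf); i++)
		buf[i] = rand();

	for (int len = 0; len <= 2048; len++)
		for (int align = 0; align < 8; align++) {
			uint32_t r = csum_ref(buf + align, len, 0x12345678);
			uint32_t n = csum_partial_new(buf + align, len, 0x12345678);

			if (fold16(r) != fold16(n)) {
				printf("mismatch: len=%d align=%d\n", len, align);
				return 1;
			}
		}
	puts("ok");
	return 0;
}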


Tom, did you have a look at whether it makes sense to add a second
carry-addition train with the adcx/adox instructions, where adox does
not signal the carry via the carry flag but via the overflow flag?
These instructions should not have any dependencies on the adc
instructions and could help the CPU parallelize the code even more
(increased instructions per cycle).
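
A rough sketch of such a second train (hypothetical; requires an
ADX-capable CPU, and the loop control must avoid clobbering CF/OF --
dec/jnz is out, because dec writes OF):

    /* %rax: even-quad chain via CF, %r9: odd-quad chain via OF */
    xorl    %r9d, %r9d          /* zero %r9, clear CF and OF */
10: adcx    0*8(%rdi), %rax     /* add with CF, writes only CF */
    adox    1*8(%rdi), %r9      /* add with OF, writes only OF */
    adcx    2*8(%rdi), %rax
    adox    3*8(%rdi), %r9
    leaq    4*8(%rdi), %rdi     /* lea leaves the flags alone */
    /* ... flag-free loop control, e.g. lea -1(%rcx),%rcx + jrcxz ... */

    movl    $0, %r11d           /* mov writes no flags */
    adcx    %r11, %rax          /* flush pending CF into %rax */
    adox    %r11, %r9           /* flush pending OF into %r9 */
    addq    %r9, %rax           /* combine the two chains */
    adcq    $0, %rax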


Thanks,
Hannes
