RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread David Laight
From: Eric Dumazet
> Sent: 05 January 2016 22:19
> To: Tom Herbert
> You might add a comment telling the '4' comes from length of 'adcq
> 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> 'adcq 0*8(%rdi),%rax' is using 3 bytes instead.
>
> We also could use .byte 0x48,
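The length arithmetic in this message can be checked against the standard x86-64 encoding of `adc r64, r/m64` (REX.W prefix 0x48, opcode 0x13, then a ModRM byte). The sketch below is an illustration in C, not code from the patch. The array names and the `chain_entry_offset` helper are hypothetical, the layout of the unrolled chain is an assumption, and the zero-displacement 4-byte form is only one possible way to pad the 3-byte instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* adcq 6*8(%rdi),%rax: REX.W, opcode 0x13, ModRM 0x47 (rdi + disp8), disp8 = 48 */
const uint8_t adcq_disp8_48[] = { 0x48, 0x13, 0x47, 0x30 };

/* adcq 0*8(%rdi),%rax as assemblers usually emit it: ModRM 0x07, no displacement */
const uint8_t adcq_no_disp[] = { 0x48, 0x13, 0x07 };

/* Same operation with an explicit zero disp8, so it also occupies 4 bytes */
const uint8_t adcq_disp8_zero[] = { 0x48, 0x13, 0x47, 0x00 };

/*
 * Hypothetical helper: if every slot in an unrolled chain of `total`
 * adcq instructions is 4 bytes long, this is the byte offset of the
 * entry point that leaves `words` instructions still to execute.
 */
size_t chain_entry_offset(size_t total, size_t words)
{
    return (total - words) * sizeof(adcq_disp8_48);
}
```

With uniform 4-byte slots, a computed jump only needs to multiply the number of skipped words by 4, which is why the one 3-byte instruction needs a `nop` or a longer encoding.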

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Eric Dumazet
On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> From: Eric Dumazet
> > Sent: 05 January 2016 22:19
> > To: Tom Herbert
> > You might add a comment telling the '4' comes from length of 'adcq
> > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > 'adcq

RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread David Laight
From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Eric Dumazet
On Wed, 2016-01-06 at 14:49 +, David Laight wrote:
> Someone also pointed out that the code is memory limited (dual add
> chains making no difference), so why is it unrolled at all?

Because it matters if the data is already present in CPU caches.

So why not unroll if it helps in some

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Hannes Frederic Sowa
Hi Tom,

On 05.01.2016 19:41, Tom Herbert wrote:
--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,147 @@
+/* Copyright 2016 Tom Herbert
+ *
+ * Checksum partial calculation
+ *
+ * __wsum csum_partial(const void *buff, int len, __wsum sum)
+ *
+ * Computes the
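For readers unfamiliar with what a routine with this signature computes, here is a minimal portable C sketch of a ones' complement partial checksum accumulated 64 bits at a time. It is not the assembly from the patch. `csum_partial_sketch`, `add64_end_around` and `fold64_to_32` are hypothetical names, `uint32_t` stands in for the kernel's `__wsum`, and the sketch treats the buffer as starting at an even offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add with end-around carry, the C counterpart of one adc step */
uint64_t add64_end_around(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return s + (s < a);          /* s < a means the add wrapped: re-add the carry */
}

/* Fold a 64-bit ones' complement accumulator down to 32 bits */
uint32_t fold64_to_32(uint64_t v)
{
    v = (v & 0xffffffffu) + (v >> 32);
    v = (v & 0xffffffffu) + (v >> 32);   /* second pass absorbs the carry */
    return (uint32_t)v;
}

/* Assumption: uint32_t plays the role of __wsum in this user-space sketch */
uint32_t csum_partial_sketch(const void *buff, size_t len, uint32_t sum)
{
    const unsigned char *p = buff;
    uint64_t acc = sum;

    while (len >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);        /* load that is safe for unaligned buffers */
        acc = add64_end_around(acc, w);
        p += 8;
        len -= 8;
    }
    if (len) {
        uint64_t w = 0;
        memcpy(&w, p, len);      /* tail bytes followed by zero padding */
        acc = add64_end_around(acc, w);
    }
    return fold64_to_32(acc);
}
```

The result is a 32-bit partial sum in native byte order that a caller would still fold to 16 bits and complement before placing it in a header.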

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Tom Herbert
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote:
> Hi Tom,
>
> On 05.01.2016 19:41, Tom Herbert wrote:
>>
>> --- /dev/null
>> +++ b/arch/x86/lib/csum-partial_64.S
>> @@ -0,0 +1,147 @@
>> +/* Copyright 2016 Tom Herbert
>> + *
>> + *

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Hannes Frederic Sowa
On 07.01.2016 03:36, Tom Herbert wrote:
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote:
Hi Tom,

On 05.01.2016 19:41, Tom Herbert wrote:
--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,147 @@
+/* Copyright 2016 Tom Herbert

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread Andi Kleen
Tom Herbert writes:
> Also, we don't do anything special for alignment, unaligned
> accesses on x86 do not appear to be a performance issue.

This is not true on Atom CPUs.

Also on most CPUs there is still a larger penalty when crossing cache lines.

> Verified correctness
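The cache-line point can be made concrete with a small check for whether a single load straddles a line boundary. This is a sketch under stated assumptions: the 64-byte line size is assumed rather than taken from the thread, and `crosses_cache_line` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, typical for x86-64 but not guaranteed */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if a load of `size` bytes (size > 0) at `addr` spans two lines */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last = (addr + size - 1) / CACHE_LINE_BYTES;
    return first != last;
}
```

An 8-byte load at offset 56 within a line stays inside it, while the same load at offset 60 touches the next line as well.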

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Eric Dumazet
On Wed, 2016-01-06 at 00:35 +0100, Hannes Frederic Sowa wrote:
>
> Tom, did you have a look if it makes sense to add a second carry
> addition train with the adcx instruction, which does not signal carry
> via the carry flag but with the overflow flag? This instruction should
> not have any
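The idea of a second carry chain can be sketched in portable C with two independent accumulators. In the ADX extension, `adcx` propagates its carry through CF and `adox` through OF, which is what allows two chains to interleave in assembly. The code below only models the data flow and does not use those instructions. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit add with end-around carry (ones' complement addition) */
uint64_t add_end_around(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return s + (s < a);
}

/* Single dependency chain: every add waits for the previous one */
uint64_t sum_one_chain(const uint64_t *w, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = add_end_around(acc, w[i]);
    return acc;
}

/* Two chains: even and odd words accumulate independently, merged at the end */
uint64_t sum_two_chains(const uint64_t *w, size_t n)
{
    uint64_t a = 0, b = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        a = add_end_around(a, w[i]);
        b = add_end_around(b, w[i + 1]);
    }
    if (i < n)                   /* odd word count: one word left over */
        a = add_end_around(a, w[i]);
    return add_end_around(a, b);
}
```

Because ones' complement addition is associative and commutative, both functions return the same value. Whether the second chain is actually faster depends on the CPU and on whether the data is already in cache.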

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Eric Dumazet
On Tue, 2016-01-05 at 17:10 -0800, H. Peter Anvin wrote:
> Apparently "adcq.d8" will do The Right Thing for this.

Nice trick ;)

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Eric Dumazet
On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread H. Peter Anvin
On 01/05/2016 02:18 PM, Eric Dumazet wrote:
> On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum

Re: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-05 Thread Hannes Frederic Sowa
Hi,

On 05.01.2016 19:41, Tom Herbert wrote:
Implement assembly routine for csum_partial for 64 bit x86. This
primarily speeds up checksum calculation for smaller lengths such as
those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after