Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-08 Thread Tom Herbert
On Mon, Mar 7, 2016 at 5:39 PM, Linus Torvalds
 wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> It shouldn't be a factor, and you shouldn't check for it. My code was
> self-aligning, and had at most one unaligned access at the beginning
> (the data of which was then used to align the rest).
>
Yes, but the logic to do the alignment does not come for free. The
intent of these patches is really to speed up checksums over small
buffers (like the checksum over the IP header, or pulling up checksums
over protocol headers when dealing with checksum-complete). For
checksums over larger buffers, e.g. TCP/UDP checksums, we are depending
on checksum offload (there are still some cases where the host will
need to do a packet checksum, but as vendors move to providing
protocol-agnostic checksum offload those should go away). In the VXLAN
GRO path for instance, we do a checksum pull over both the UDP header
and the VXLAN header, each of which is 8 bytes. csum_partial can be
trivially implemented for a buffer of length 8 with three adcq
instructions (as in my patch). When we're using VXLAN in IPv4, both the
VXLAN and UDP headers will likely not be eight-byte aligned, but
alignment seems to only be an issue when crossing a page boundary. The
probability that an 8 byte header crosses a page boundary is already
very low, and with a little bit of code drivers could probably pretty
much guarantee that packet headers don't straddle page boundaries. So
it seems like the effort to align small buffers, assuming they don't
straddle page boundaries, provides little or no value.
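
To make that concrete, the fixed 8-byte case can be as small as a single
load plus something like the helper below (a sketch only -- the helper and
operand names are illustrative and the exact instruction sequence in the
patch may differ):

static inline unsigned long csum_add_quad(unsigned long sum,
                                          unsigned long val)
{
        /* one's complement add of one 64-bit word: add, then fold the
         * carry straight back into the running sum */
        asm("addq %[val],%[res]\n\t"
            "adcq $0,%[res]"
            : [res] "+r" (sum)
            : [val] "rm" (val));
        return sum;
}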

Tom

> Tom had a version that used that. Although now that I look back at it,
> it seems to be broken by some confusion about the one-byte alignment
> vs 8-byte alignment.
>
>  Linus


RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-08 Thread David Laight
From: Alexander Duyck 
...
> >> So the loop:
> >> 10: addc %rax,(%rdx,%rcx,8)
> >> inc %rcx
> >> jnz 10b
> >> could easily be as fast as anything that doesn't use the 'new'
> >> instructions that use the overflow flag.
> >> That loop might be measurably faster for aligned buffers.
> >
> > Tested by replacing the unrolled loop in my patch with just:
> >
> > if (len >= 8) {
> > asm("clc\n\t"
> > "0: adcq (%[src],%%rcx,8),%[res]\n\t"
> > "decl %%ecx\n\t"
> > "jge 0b\n\t"
> > "adcq $0, %[res]\n\t"
> > : [res] "=r" (result)
> > : [src] "r" (buff), "[res]" (result), "c"
> > ((len >> 3) - 1));
> > }
> >
> > This seems to be significantly slower:
> >
> > 1400 bytes: 797 nsecs vs. 202 nsecs
> > 40 bytes: 6.5 nsecs vs. 26.8 nsecs
> 
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.

The overhead from the jge and decl is probably similar to that of the adc.
The problem is that they can't be executed at the same time because they
both have dependencies on the carry flag.

Tom did some extra tests last night; a loop construct of 4 instructions
that didn't modify the flags register was twice the speed of the above.
I think there is a 3 instruction loop; add a second adc and it may well
be as fast as your 8-way unrolled version - but it is much simpler to implement.

That loop (using swab and jecxz to loop until the high 32 bits of rcx are
non-zero) could also be used with the 'add carry using overflow bit'
instructions.
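
For illustration only, one shape such a flags-free loop could take (a
sketch of the general idea rather than Tom's test code or the exact
construct above): the loop body is four instructions -- adcq, lea, jrcxz,
jmp -- and the loop control never writes EFLAGS, so the adcq carry chain
survives across iterations. It walks the buffer with a negative quad
count in rcx and assumes len >= 8.

static unsigned long csum_quads(const void *buff, unsigned long len,
                                unsigned long result)
{
        const void *end = buff + (len & ~7UL);  /* just past the last full quad */
        long cnt = -(long)(len >> 3);           /* negative count of quads */

        asm("clc\n\t"
            "0: adcq (%[end],%[cnt],8),%[res]\n\t"
            "leaq 1(%[cnt]),%[cnt]\n\t"         /* increment without writing EFLAGS */
            "jrcxz 1f\n\t"                      /* exit once the count reaches zero */
            "jmp 0b\n\t"
            "1: adcq $0,%[res]"                 /* fold the final carry */
            : [res] "+r" (result), [cnt] "+c" (cnt)
            : [end] "r" (end)
            : "memory");
        return result;
}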

David



RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-08 Thread David Laight
From: Alexander Duyck 
...
> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.

I think there are some patches going through the ppc tree to do
exactly that.

David



Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Linus Torvalds
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

It shouldn't be a factor, and you shouldn't check for it. My code was
self-aligning, and had at most one unaligned access at the beginning
(the data of which was then used to align the rest).

Tom had a version that used that. Although now that I look back at it,
it seems to be broken by some confusion about the one-byte alignment
vs 8-byte alignment.

 Linus


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck
 wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>>  wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
 On Mon, Mar 7, 2016 at 5:56 AM, David Laight  
 wrote:
> From: Alexander Duyck
>  ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address.  None of those
>> instructions affect the carry flag as this is how such loops were
>> intended to be implemented.
>>
>> I've been doing a bit of testing and that seems to work without
>> needing the adcq until after you exit the loop, but doesn't give that
>> much of a gain in speed for dropping the instruction from the
>> hot-path.  I suspect we are probably memory bottle-necked already in
>> the loop so dropping an instruction or two doesn't gain you much.
>
> Right, any superscalar architecture gives you some instructions
> 'for free' if they can execute at the same time as those on the
> critical path (in this case the memory reads and the adc).
> This is why loop unrolling can be pointless.
>
> So the loop:
> 10: addc %rax,(%rdx,%rcx,8)
> inc %rcx
> jnz 10b
> could easily be as fast as anything that doesn't use the 'new'
> instructions that use the overflow flag.
> That loop might be measurably faster for aligned buffers.

 Tested by replacing the unrolled loop in my patch with just:

 if (len >= 8) {
 asm("clc\n\t"
 "0: adcq (%[src],%%rcx,8),%[res]\n\t"
 "decl %%ecx\n\t"
 "jge 0b\n\t"
 "adcq $0, %[res]\n\t"
 : [res] "=r" (result)
 : [src] "r" (buff), "[res]" (result), "c"
 ((len >> 3) - 1));
 }

 This seems to be significantly slower:

 1400 bytes: 797 nsecs vs. 202 nsecs
 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>>
>>> You still need the loop unrolling as the decl and jge have some
>>> overhead.  You can't just get rid of it with a single call in a tight
>>> loop but it should improve things.  The gain from what I have seen
>>> ends up being minimal though.  I haven't really noticed all that much
>>> in my tests anyway.
>>>
>>> I have been doing some testing and the penalty for an unaligned
>>> checksum can get pretty big if the data-set is big enough.  I was
>>> messing around and tried doing a checksum over 32K minus some offset
>>> and was seeing a penalty of about 200 cycles per 64K frame.
>>>
>> Out of how many cycles to checksum 64K though?
>
> So the clock cycles I am seeing are ~16660 for unaligned vs 16416
> aligned.  So yeah the effect is only a 1.5% penalty for the total
> time.
>
>>> One thought I had is that we may want to look into making an inline
>>> function that we can call for compile-time defined lengths less than
>>> 64.  Maybe call it something like __csum_partial and we could then use
>>> that in place of csum_partial for all those headers that are a fixed
>>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>>> we might be able to look at taking care of alignment for csum_partial
>>> which will improve the skb_checksum() case without impacting the
>>> header pulling cases as much since that code would be inlined
>>> elsewhere.
>>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> Well I was just noticing there are a number of places we can get an
> even bigger benefit if we just bypass the need for csum_partial
> entirely.  For example the DSA code is calling csum_partial to extract
> 2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
> could catch cases like these with a __builtin_constant_p check then we
> might be able to save some significant CPU time by avoiding the
> function call entirely and just doing some inline addition on the
> input values directly.
>
Sure, we could inline a switch function for common values (0, 2, 4, 8,
14, 16, 20, 40) maybe.
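
For illustration, such an inline could look roughly like the sketch below
(the __csum_partial name and the __builtin_constant_p idea are from Alex's
mail; the specific cases and the csum_add-based bodies are only an
example, not code from the patch):

static inline __wsum __csum_partial(const void *buff, int len, __wsum sum)
{
        /* constant-length fast paths; everything else falls back to
         * the out-of-line csum_partial() */
        if (__builtin_constant_p(len)) {
                switch (len) {
                case 0:
                        return sum;
                case 2:         /* e.g. the 2-byte pull in the DSA code */
                        return csum_add(sum,
                                        (__force __wsum)*(const u16 *)buff);
                case 4:
                        return csum_add(sum,
                                        (__force __wsum)*(const u32 *)buff);
                }
        }
        return csum_partial(buff, len, sum);
}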

> - Alex


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Alexander Duyck
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>  wrote:
>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight  
>>> wrote:
 From: Alexander Duyck
  ...
> Actually probably the easiest way to go on x86 is to just replace the
> use of len with (len >> 6) and use decl or incl instead of addl or
> subl, and lea instead of addq for the buff address.  None of those
> instructions affect the carry flag as this is how such loops were
> intended to be implemented.
>
> I've been doing a bit of testing and that seems to work without
> needing the adcq until after you exit the loop, but doesn't give that
> much of a gain in speed for dropping the instruction from the
> hot-path.  I suspect we are probably memory bottle-necked already in
> the loop so dropping an instruction or two doesn't gain you much.

 Right, any superscalar architecture gives you some instructions
 'for free' if they can execute at the same time as those on the
 critical path (in this case the memory reads and the adc).
 This is why loop unrolling can be pointless.

 So the loop:
 10: addc %rax,(%rdx,%rcx,8)
 inc %rcx
 jnz 10b
 could easily be as fast as anything that doesn't use the 'new'
 instructions that use the overflow flag.
 That loop might be measurably faster for aligned buffers.
>>>
>>> Tested by replacing the unrolled loop in my patch with just:
>>>
>>> if (len >= 8) {
>>> asm("clc\n\t"
>>> "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>> "decl %%ecx\n\t"
>>> "jge 0b\n\t"
>>> "adcq $0, %[res]\n\t"
>>> : [res] "=r" (result)
>>> : [src] "r" (buff), "[res]" (result), "c"
>>> ((len >> 3) - 1));
>>> }
>>>
>>> This seems to be significantly slower:
>>>
>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>
>> You still need the loop unrolling as the decl and jge have some
>> overhead.  You can't just get rid of it with a single call in a tight
>> loop but it should improve things.  The gain from what I have seen
>> ends up being minimal though.  I haven't really noticed all that much
>> in my tests anyway.
>>
>> I have been doing some testing and the penalty for an unaligned
>> checksum can get pretty big if the data-set is big enough.  I was
>> messing around and tried doing a checksum over 32K minus some offset
>> and was seeing a penalty of about 200 cycles per 64K frame.
>>
> Out of how many cycles to checksum 64K though?

So the clock cycles I am seeing are ~16660 for unaligned vs 16416
aligned.  So yeah the effect is only a 1.5% penalty for the total
time.

>> One thought I had is that we may want to look into making an inline
>> function that we can call for compile-time defined lengths less than
>> 64.  Maybe call it something like __csum_partial and we could then use
>> that in place of csum_partial for all those headers that are a fixed
>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>> we might be able to look at taking care of alignment for csum_partial
>> which will improve the skb_checksum() case without impacting the
>> header pulling cases as much since that code would be inlined
>> elsewhere.
>>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

Well I was just noticing there are a number of places we can get an
even bigger benefit if we just bypass the need for csum_partial
entirely.  For example the DSA code is calling csum_partial to extract
2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
could catch cases like these with a __builtin_constant_p check then we
might be able to save some significant CPU time by avoiding the
function call entirely and just doing some inline addition on the
input values directly.

- Alex


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
 wrote:
> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight  wrote:
>>> From: Alexander Duyck
>>>  ...
 Actually probably the easiest way to go on x86 is to just replace the
 use of len with (len >> 6) and use decl or incl instead of addl or
 subl, and lea instead of addq for the buff address.  None of those
 instructions affect the carry flag as this is how such loops were
 intended to be implemented.

 I've been doing a bit of testing and that seems to work without
 needing the adcq until after you exit the loop, but doesn't give that
 much of a gain in speed for dropping the instruction from the
 hot-path.  I suspect we are probably memory bottle-necked already in
 the loop so dropping an instruction or two doesn't gain you much.
>>>
>>> Right, any superscalar architecture gives you some instructions
>>> 'for free' if they can execute at the same time as those on the
>>> critical path (in this case the memory reads and the adc).
>>> This is why loop unrolling can be pointless.
>>>
>>> So the loop:
>>> 10: addc %rax,(%rdx,%rcx,8)
>>> inc %rcx
>>> jnz 10b
>>> could easily be as fast as anything that doesn't use the 'new'
>>> instructions that use the overflow flag.
>>> That loop might be measurably faster for aligned buffers.
>>
>> Tested by replacing the unrolled loop in my patch with just:
>>
>> if (len >= 8) {
>> asm("clc\n\t"
>> "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>> "decl %%ecx\n\t"
>> "jge 0b\n\t"
>> "adcq $0, %[res]\n\t"
>> : [res] "=r" (result)
>> : [src] "r" (buff), "[res]" (result), "c"
>> ((len >> 3) - 1));
>> }
>>
>> This seems to be significantly slower:
>>
>> 1400 bytes: 797 nsecs vs. 202 nsecs
>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.
>
> I have been doing some testing and the penalty for an unaligned
> checksum can get pretty big if the data-set is big enough.  I was
> messing around and tried doing a checksum over 32K minus some offset
> and was seeing a penalty of about 200 cycles per 64K frame.
>
Out of how many cycles to checksum 64K though?

> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.
>
As I said previously, if alignment really is a factor then we can
check up front if a buffer crosses a page boundary and call the slow
path function (original code). I'm seeing a 1 nsec hit to add this
check.

Tom

> - Alex


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Alexander Duyck
On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
> On Mon, Mar 7, 2016 at 5:56 AM, David Laight  wrote:
>> From: Alexander Duyck
>>  ...
>>> Actually probably the easiest way to go on x86 is to just replace the
>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>> subl, and lea instead of addq for the buff address.  None of those
>>> instructions affect the carry flag as this is how such loops were
>>> intended to be implemented.
>>>
>>> I've been doing a bit of testing and that seems to work without
>>> needing the adcq until after you exit the loop, but doesn't give that
>>> much of a gain in speed for dropping the instruction from the
>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>> the loop so dropping an instruction or two doesn't gain you much.
>>
>> Right, any superscalar architecture gives you some instructions
>> 'for free' if they can execute at the same time as those on the
>> critical path (in this case the memory reads and the adc).
>> This is why loop unrolling can be pointless.
>>
>> So the loop:
>> 10: addc %rax,(%rdx,%rcx,8)
>> inc %rcx
>> jnz 10b
>> could easily be as fast as anything that doesn't use the 'new'
>> instructions that use the overflow flag.
>> That loop might be measurably faster for aligned buffers.
>
> Tested by replacing the unrolled loop in my patch with just:
>
> if (len >= 8) {
> asm("clc\n\t"
> "0: adcq (%[src],%%rcx,8),%[res]\n\t"
> "decl %%ecx\n\t"
> "jge 0b\n\t"
> "adcq $0, %[res]\n\t"
> : [res] "=r" (result)
> : [src] "r" (buff), "[res]" (result), "c"
> ((len >> 3) - 1));
> }
>
> This seems to be significantly slower:
>
> 1400 bytes: 797 nsecs vs. 202 nsecs
> 40 bytes: 6.5 nsecs vs. 26.8 nsecs

You still need the loop unrolling as the decl and jge have some
overhead.  You can't just get rid of it with a single call in a tight
loop but it should improve things.  The gain from what I have seen
ends up being minimal though.  I haven't really noticed all that much
in my tests anyway.

I have been doing some testing and the penalty for an unaligned
checksum can get pretty big if the data-set is big enough.  I was
messing around and tried doing a checksum over 32K minus some offset
and was seeing a penalty of about 200 cycles per 64K frame.

One thought I had is that we may want to look into making an inline
function that we can call for compile-time defined lengths less than
64.  Maybe call it something like __csum_partial and we could then use
that in place of csum_partial for all those headers that are a fixed
length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
we might be able to look at taking care of alignment for csum_partial
which will improve the skb_checksum() case without impacting the
header pulling cases as much since that code would be inlined
elsewhere.

- Alex


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 5:56 AM, David Laight  wrote:
> From: Alexander Duyck
>  ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address.  None of those
>> instructions affect the carry flag as this is how such loops were
>> intended to be implemented.
>>
>> I've been doing a bit of testing and that seems to work without
>> needing the adcq until after you exit the loop, but doesn't give that
>> much of a gain in speed for dropping the instruction from the
>> hot-path.  I suspect we are probably memory bottle-necked already in
>> the loop so dropping an instruction or two doesn't gain you much.
>
> Right, any superscalar architecture gives you some instructions
> 'for free' if they can execute at the same time as those on the
> critical path (in this case the memory reads and the adc).
> This is why loop unrolling can be pointless.
>
> So the loop:
> 10: addc %rax,(%rdx,%rcx,8)
> inc %rcx
> jnz 10b
> could easily be as fast as anything that doesn't use the 'new'
> instructions that use the overflow flag.
> That loop might be measurably faster for aligned buffers.

Tested by replacing the unrolled loop in my patch with just:

if (len >= 8) {
asm("clc\n\t"
"0: adcq (%[src],%%rcx,8),%[res]\n\t"
"decl %%ecx\n\t"
"jge 0b\n\t"
"adcq $0, %[res]\n\t"
: [res] "=r" (result)
: [src] "r" (buff), "[res]" (result), "c"
((len >> 3) - 1));
}

This seems to be significantly slower:

1400 bytes: 797 nsecs vs. 202 nsecs
40 bytes: 6.5 nsecs vs. 26.8 nsecs

Tom

>
> David
>


RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread David Laight
From: Alexander Duyck
 ...
> Actually probably the easiest way to go on x86 is to just replace the
> use of len with (len >> 6) and use decl or incl instead of addl or
> subl, and lea instead of addq for the buff address.  None of those
> instructions affect the carry flag as this is how such loops were
> intended to be implemented.
> 
> I've been doing a bit of testing and that seems to work without
> needing the adcq until after you exit the loop, but doesn't give that
> much of a gain in speed for dropping the instruction from the
> hot-path.  I suspect we are probably memory bottle-necked already in
> the loop so dropping an instruction or two doesn't gain you much.

Right, any superscalar architecture gives you some instructions
'for free' if they can execute at the same time as those on the
critical path (in this case the memory reads and the adc).
This is why loop unrolling can be pointless.

So the loop:
10: addc %rax,(%rdx,%rcx,8)
inc %rcx
jnz 10b
could easily be as fast as anything that doesn't use the 'new'
instructions that use the overflow flag.
That loop might be measurably faster for aligned buffers.

David



Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-04 Thread Alexander Duyck
On Fri, Mar 4, 2016 at 2:38 AM, David Laight  wrote:
> From: Linus Torvalds
>> Sent: 03 March 2016 18:44
>>
>> On Thu, Mar 3, 2016 at 8:12 AM, David Laight  wrote:
>> >
>> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
>> > without any unrolling?
>>
>> Is that actually supposed to work ok these days? jcxz used to be quite
>> slow, and is historically *never* used.
>>
>> Now, in theory, loop constructs can actually do better on branch
>> prediction etc, but Intel seems to have never really tried to push
>> them, and has instead pretty much discouraged them in favor of making
>> the normal jumps go faster (including all the instruction fusion etc)
>
> Yes, they've even added the 'adc using the overflow flag' but not made
> it possible to put that into a loop.
>
>> From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.
>
> LOOP is still slow on intel cpus (but is single clock on recentish amd ones).
>
> JCXZ is reasonable on most cpus, certainly all the ones we care about these 
> days.
> On intel cpus JCXZ is still 2 clocks, but it removes the dependency on any
> flags (which all other conditional instructions have).
> The difficulty is using it for a loop (you need JCXNZ or a fast LOOP).
> An alternative to the 'JCXZ, JMPS' pair would be to move the high bits
> of the counter into the low bits of cx so that cx would become non-zero
> on the last iteration.

Actually probably the easiest way to go on x86 is to just replace the
use of len with (len >> 6) and use decl or incl instead of addl or
subl, and lea instead of addq for the buff address.  None of those
instructions affect the carry flag as this is how such loops were
intended to be implemented.

I've been doing a bit of testing and that seems to work without
needing the adcq until after you exit the loop, but doesn't give that
much of a gain in speed for dropping the instruction from the
hot-path.  I suspect we are probably memory bottle-necked already in
the loop so dropping an instruction or two doesn't gain you much.

- Alex


RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-04 Thread David Laight
From: Linus Torvalds
> Sent: 03 March 2016 18:44
> 
> On Thu, Mar 3, 2016 at 8:12 AM, David Laight  wrote:
> >
> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> > without any unrolling?
> 
> Is that actually supposed to work ok these days? jcxz used to be quite
> slow, and is historically *never* used.
> 
> Now, in theory, loop constructs can actually do better on branch
> prediction etc, but Intel seems to have never really tried to push
> them, and has instead pretty much discouraged them in favor of making
> the normal jumps go faster (including all the instruction fusion etc)

Yes, they've even added the 'adc using the overflow flag' but not made
it possible to put that into a loop.

> From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.

LOOP is still slow on intel cpus (but is single clock on recentish amd ones).

JCXZ is reasonable on most cpus, certainly all the ones we care about these 
days.
On intel cpus JCXZ is still 2 clocks, but it removes the dependency on any
flags (which all other conditional instructions have).
The difficulty is using it for a loop (you need JCXNZ or a fast LOOP).
An alternative to the 'JCXZ, JMPS' pair would be to move the high bits
of the counter into the low bits of cx so that cx would become non-zero
on the last iteration.

David



Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-03 Thread Linus Torvalds
On Thu, Mar 3, 2016 at 8:12 AM, David Laight  wrote:
>
> Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> without any unrolling?

Is that actually supposed to work ok these days? jcxz used to be quite
slow, and is historically *never* used.

Now, in theory, loop constructs can actually do better on branch
prediction etc, but Intel seems to have never really tried to push
them, and has instead pretty much discouraged them in favor of making
the normal jumps go faster (including all the instruction fusion etc)

From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.

   Linus


RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-03 Thread David Laight
From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> + /* Main loop using 64byte blocks */
> + for (; len > 64; len -= 64, buff += 64) {
> + asm("addq 0*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> + "adcq 3*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq 6*8(%[src]),%[res]\n\t"
> + "adcq 7*8(%[src]),%[res]\n\t"
> + "adcq $0,%[res]"
> + : [res] "=r" (result)
> + : [src] "r" (buff),
> + "[res]" (result));

Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
without any unrolling?

...
> + /* Sum over any remaining bytes (< 8 of them) */
> + if (len & 0x7) {
> + unsigned long val;
> + /*
> +  * Since "len" is > 8 here we backtrack in the buffer to load
> +  * the outstanding bytes into the low order bytes of a quad and
> +  * then shift to extract the relevant bytes. By doing this we
> +  * avoid additional calls to load_unaligned_zeropad.

That comment is wrong. Maybe:
 * Read the last 8 bytes of the buffer then shift to extract
 * the required bytes.
 * This is safe because the original length was > 8 and avoids
 * any problems reading beyond the end of the valid data.
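
For reference, the construct that comment describes does roughly the
following (a sketch only -- the variable names follow the quoted patch,
but this is not the patch's exact code): load the last 8 bytes of the
buffer and shift away the bytes the quad loop has already covered.

        if (len & 0x7) {
                unsigned int shift = 8 * (8 - (len & 0x7));
                unsigned long val;

                /* safe because the original length was > 8, so
                 * buff + len - 8 cannot precede the start of the buffer */
                val = *(unsigned long *)(buff + len - 8);
                val >>= shift;  /* keep only the trailing len & 7 bytes */
                result = add64_with_carry(result, val);
        }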

David



Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Alexander Duyck
On Wed, Mar 2, 2016 at 4:40 PM, Tom Herbert  wrote:
> On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck
>  wrote:
>> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert  wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
>>> as those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>>- went back to C code with inline assembly for critical routines
>>>- implemented suggestion from Linus to deal with lengths < 8
>>> - v5
>>>- fixed register attribute add32_with_carry3
>>>- do_csum returns unsigned long
>>>- don't consider alignment at all. Rationalization is that x86
>>>  handles unaligned accesses very well except in the case that
>>>  the access crosses a page boundary which has a performance
>>>  penalty (I see about 10nsecs on my system). Drivers and the
>>>  stack go to considerable lengths to not have packets cross page
>>>  boundaries, so the case that csum_partial is called with
>>>  buffer that crosses a page boundary should be a very rare
>>>  occurrence. Not dealing with alignment is a significant
>>>  simplification.
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>             Old      New      %
>>> Len/Aln     nsecs    nsecs    Improv
>>> --------+--------+--------+----------
>>> 1400/0      195.6    181.7     3%   (Big packet)
>>
>> The interesting bit here for me would be how we handle the 1480 byte
>> values with a 2 byte alignment offset.  That would typically be our
>> case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
>> to 0 on x86.  This is also the case where we would have around a 30%
>> chance of there being a page boundary in there somewhere since if I
>> recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
>> to an even power of 2 so there is a good likelihood of us actually
>> regressing in the receive path for large packet checksumming since it
>> is about a 10ns penalty for spanning a page boundary for a single
>> read.
>>
> Yes, but the case you're considering would be that in which we need to
> perform a full packet checksum in the host-- normally we'd expect to
> have HW offload for that. But even in that case, the regression would
> not be the full 10ns since the logic to check alignment would be
> needed and that has some cost. Also, for longer checksums the cost is
> amortized so any relative regression would be smaller. Interestingly,
> on the Atom CPU it was still better for performance to ignore the
> alignment than to go through the code to handle it. For Xeon there was
> some regression. The potentially bad case would be if headers are
> split over a page boundary and we are doing checksum complete (this in
> theory could also be a problem for other accesses like if the IP
> addresses end up straddling the page boundary). I think the
> probability of that is well less than 30%, but if we really are
> worried about checksum over a page boundary, then I would put in one
> conditional to switch between the old do_csum and the new do_csum
> (call it do_fast_csum) based on buffer crossing page boundary-- so the
> cost of needing to deal with alignment in the common case is at most a
> single conditional (1 ns).

The part I would be interested in seeing is how much we gain/lose for
dealing with the alignment.  Based on the 10ns value I would expect
that we would see a 2% regression per 4K for dealing with an unaligned
page.  So the question is what do we gain by taking the 2% hit.  I
agree that the 2% hit won't occur often as long as we are only doing
headers, however we can't be myopic about how this code is to be used.
The fact is there are plenty of other paths that use it as well and it
doesn't do us much good if we make the slow path that much slower to
accelerate the fast path by 1 or 2 ns.

By any chance could you email me the code you are using to test this?
I'm assuming you are probably testing this in user-space with some
rdtsc type calls thrown in.  I want to try applying a few patches and
see what the gain/loss is for aligning the code versus not bothering
to align it.
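
(For anyone else wanting to poke at this, a minimal user-space skeleton of
the kind of rdtsc micro-benchmark being described might look like the
sketch below; it is purely illustrative -- Tom's actual ./csum tool is not
shown anywhere in this thread, and csum_partial_new() just stands in for
whichever routine is under test.)

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

extern uint32_t csum_partial_new(const void *buff, int len, uint32_t sum);

static void bench(const void *buf, int len, int iters)
{
        volatile uint32_t sink = 0;     /* keep the calls from being optimized away */
        uint64_t start, stop;
        int i;

        start = __rdtsc();
        for (i = 0; i < iters; i++)
                sink += csum_partial_new(buf, len, 0);
        stop = __rdtsc();

        printf("len %d: %.1f cycles/call (sink %u)\n",
               len, (double)(stop - start) / iters, sink);
}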

>>> 40/0         11.8      6.5    45%   (Ipv6 hdr cmn case)
>>
>> Common case for transmit maybe.  For receive on x86 this isn't IP
>> aligned so the offset would be 6.
>
> With offset 6 it gives about the same results.

Yeah, I kind of figured that.  Just stating that the common alignment is 6.

>>
>>> 8/4           8.1      3.2    60%   (UDP, VXLAN in IPv4)
>>
>> How likely is the 8/4 case in reality?  I ask because you h

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Tom Herbert
On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck
 wrote:
> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert  wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>
>> - v4
>>- went back to C code with inline assembly for critical routines
>>- implemented suggestion from Linus to deal with lengths < 8
>> - v5
>>- fixed register attribute add32_with_carry3
>>- do_csum returns unsigned long
>>- don't consider alignment at all. Rationalization is that x86
>>  handles unaligned accesses very well except in the case that
>>  the access crosses a page boundary which has a performance
>>  penalty (I see about 10nsecs on my system). Drivers and the
>>  stack go to considerable lengths to not have packets cross page
>>  boundaries, so the case that csum_partial is called with
>>  buffer that crosses a page boundary should be a very rare
>>  occurrence. Not dealing with alignment is a significant
>>  simplification.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>             Old      New      %
>> Len/Aln     nsecs    nsecs    Improv
>> --------+--------+--------+----------
>> 1400/0      195.6    181.7     3%   (Big packet)
>
> The interesting bit here for me would be how we handle the 1480 byte
> values with a 2 byte alignment offset.  That would typically be our
> case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
> to 0 on x86.  This is also the case where we would have around a 30%
> chance of there being a page boundary in there somewhere since if I
> recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
> to an even power of 2 so there is a good likelihood of us actually
> regressing in the receive path for large packet checksumming since it
> is about a 10ns penalty for spanning a page boundary for a single
> read.
>
Yes, but the case you're considering would be that in which we need to
perform a full packet checksum in the host-- normally we'd expect to
have HW offload for that. But even in that case, the regression would
not be the full 10ns since the logic to check alignment would be
needed and that has some cost. Also, for longer checksums the cost is
amortized so any relative regression would be smaller. Interestingly,
on the Atom CPU it was still better for performance to ignore the
alignment than to go through the code to handle it. For Xeon there was
some regression. The potentially bad case would be if headers are
split over a page boundary and we are doing checksum complete (this in
theory could also be a problem for other accesses like if the IP
addresses end up straddling the page boundary). I think the
probability of that is well less than 30%, but if we really are
worried about checksum over a page boundary, then I would put in one
conditional to switch between the old do_csum and the new do_csum
(call it do_fast_csum) based on buffer crossing page boundary-- so the
cost of needing to deal with alignment in the common case is at most a
single conditional (1 ns).
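
A sketch of what that single conditional could look like (do_fast_csum is
the name floated above; do_csum_slow as a label for the original code is
mine, and the page-cross test is just the obvious offset check -- this is
illustrative, not code from the patch):

static unsigned long do_csum(const void *buff, int len)
{
        /* fast path whenever the buffer sits entirely inside one page */
        if (likely(((unsigned long)buff & (PAGE_SIZE - 1)) + len <= PAGE_SIZE))
                return do_fast_csum(buff, len); /* new, alignment-agnostic code */

        return do_csum_slow(buff, len);         /* original code */
}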

>> 40/0         11.8      6.5    45%   (Ipv6 hdr cmn case)
>
> Common case for transmit maybe.  For receive on x86 this isn't IP
> aligned so the offset would be 6.

With offset 6 it gives about the same results.

>
>> 8/4           8.1      3.2    60%   (UDP, VXLAN in IPv4)
>
> How likely is the 8/4 case in reality?  I ask because you have a
> special handler for the case and as such that extra bit of code is
> going to cost you one cycle or more in all other cases.  As far as the
> checksum for VXLAN I would think you are probably looking at something
> more like 20/4 because you would likely be pulling the UDP, VXLAN, and
> inner Ethernet header to get to the inner IP header.  The only case I
> can think of where we might be working on 8 bytes would be something
> like L3 encapsulated inside of GRE.
>
>> 14/0          8.9      6.3    29%   (Eth hdr)
>> 14/4          9.5      6.3    33%   (Eth hdr in IPv4)
>> 14/3          9.6      6.3    34%   (Eth with odd align)
>> 20/0          9.1      6.8    25%   (IP hdr without options)
>> 7/1           9.1      3.9    57%   (buffer in one quad)
>> 100/0        17.4     13.6    21%   (medium-sized pkt)
>> 100/2        17.7     13.5    24%   (medium-sized pkt w/ alignment)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
>
>> Also tested on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>> Int

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Alexander Duyck
On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert  wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>
> - v4
>- went back to C code with inline assembly for critical routines
>- implemented suggestion from Linus to deal with lengths < 8
> - v5
>- fixed register attribute add32_with_carry3
>- do_csum returns unsigned long
>- don't consider alignment at all. Rationalization is that x86
>  handles unaligned accesses very well except in the case that
>  the access crosses a page boundary which has a performance
>  penalty (I see about 10nsecs on my system). Drivers and the
>  stack go to considerable lengths to not have packets cross page
>  boundaries, so the case that csum_partial is called with
>  buffer that crosses a page boundary should be a very rare
>  occurrence. Not dealing with alignment is a significant
>  simplification.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Performance:
>
> Isolating old and new implementation for some common cases:
>
>             Old      New      %
> Len/Aln     nsecs    nsecs    Improv
> --------+--------+--------+----------
> 1400/0      195.6    181.7     3%   (Big packet)

The interesting bit here for me would be how we handle the 1480 byte
values with a 2 byte alignment offset.  That would typically be our
case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
to 0 on x86.  This is also the case where we would have around a 30%
chance of there being a page boundary in there somewhere since if I
recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
to an even power of 2 so there is a good likelihood of us actually
regressing in the receive path for large packet checksumming since it
is about a 10ns penalty for spanning a page boundary for a single
read.

> 40/0         11.8      6.5    45%   (Ipv6 hdr cmn case)

Common case for transmit maybe.  For receive on x86 this isn't IP
aligned so the offset would be 6.

> 8/4           8.1      3.2    60%   (UDP, VXLAN in IPv4)

How likely is the 8/4 case in reality?  I ask because you have a
special handler for the case and as such that extra bit of code is
going to cost you one cycle or more in all other cases.  As far as the
checksum for VXLAN I would think you are probably looking at something
more like 20/4 because you would likely be pulling the UDP, VXLAN, and
inner Ethernet header to get to the inner IP header.  The only case I
can think of where we might be working on 8 bytes would be something
like L3 encapsulated inside of GRE.

> 14/0          8.9      6.3    29%   (Eth hdr)
> 14/4          9.5      6.3    33%   (Eth hdr in IPv4)
> 14/3          9.6      6.3    34%   (Eth with odd align)
> 20/0          9.1      6.8    25%   (IP hdr without options)
> 7/1           9.1      3.9    57%   (buffer in one quad)
> 100/0        17.4     13.6    21%   (medium-sized pkt)
> 100/2        17.7     13.5    24%   (medium-sized pkt w/ alignment)
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz


> Also tested on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>
> Branch  prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between the
> two scenarios (perf stat output is below), nor does either case
> show a significant number of branch misses.
>
> Interleave lengths case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
> ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1
>
>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1' (10 runs):
>
>      9,556,693,202  instructions                    ( +-  0.00% )
>      1,176,208,640  branches                        ( +-  0.00% )
>             19,487  branch-misses    #    0.00% of all branches   ( +-  6.07% )
>
>        2.049732539 seconds time elapsed
>
> Non-interleave case:
>
> perf stat --repeat

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Eric Dumazet
On Wed, 2016-03-02 at 14:18 -0800, Tom Herbert wrote:
> + asm("lea 0f(, %[slen], 4), %%r11\n\t"
> + "clc\n\t"
> + "jmpq *%%r11\n\t"
> + "adcq 7*8(%[src]),%[res]\n\t"
> + "adcq 6*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
> + "adcq 3*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 0*8(%[src]),%[res]\n\t"
> + "nop\n\t"
> + "0: adcq $0,%[res]"
> + : [res] "=r" (result)
> + : [src] "r" (buff),
> +   [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
> + : "r11");
>  


hpa mentioned we could use adcq.d8 0*8(%[src]),%[res]

to avoid the mandatory 'nop'
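
(For context on that 'nop': the lea computes the jump target assuming 4
bytes per adcq entry, and while each adcq with a non-zero 8-bit
displacement assembles to 4 bytes, the assembler drops the zero
displacement of "adcq 0*8(...)" and emits only 3, so a 1-byte pad keeps
the table stride uniform. Roughly, with %rax standing in for the real
operands:

   48 13 40 38    adcq 0x38(%rax),%rax   # 4 bytes: REX.W, opcode, ModRM, disp8
   48 13 00       adcq (%rax),%rax       # 3 bytes: zero displacement omitted
   90             nop                    # pads that last entry back to 4 bytes

Forcing the byte displacement to be emitted, as hpa suggests, makes the
zero-displacement entry 4 bytes on its own.)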





[PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Tom Herbert
This patch implements performant csum_partial for x86_64. The intent is
to speed up checksum calculation, particularly for smaller lengths such
as those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.

- v4
   - went back to C code with inline assembly for critical routines
   - implemented suggestion from Linus to deal with lengths < 8
- v5
   - fixed register attribute add32_with_carry3
   - do_csum returns unsigned long
   - don't consider alignment at all. Rationalization is that x86
 handles unaligned accesses very well except in the case that
 the access crosses a page boundary which has a performance
 penalty (I see about 10nsecs on my system). Drivers and the
 stack go to considerable lengths to not have packets cross page
 boundaries, so the case that csum_partial is called with
 buffer that crosses a page boundary should be a very rare
 occurrence. Not dealing with alignment is a significant
 simplification.

Testing:

Correctness:

Verified correctness by testing arbitrary length buffer filled with
random data. For each buffer I compared the computed checksum
using the original algorithm for each possible alignment (0-7 bytes).
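
(For readers wanting to reproduce the comparison, the reference can be as
simple as a byte-at-a-time RFC 1071-style accumulator like the sketch
below; this is illustrative only and is not the harness used for the
numbers that follow -- it folds as it goes and pairs bytes the way the
little-endian x86 loads do.)

static uint32_t csum_ref(const uint8_t *buf, size_t len, uint32_t sum)
{
        size_t i;

        for (i = 0; i < len; i++) {
                /* even offsets are the low byte of each 16-bit word on LE */
                sum += (i & 1) ? (uint32_t)buf[i] << 8 : buf[i];
                sum = (sum & 0xffff) + (sum >> 16);     /* end-around carry */
        }
        return sum;
}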

Performance:

Isolating old and new implementation for some common cases:

            Old      New      %
Len/Aln     nsecs    nsecs    Improv
--------+--------+--------+----------
1400/0      195.6    181.7     3%   (Big packet)
40/0         11.8      6.5    45%   (Ipv6 hdr cmn case)
8/4           8.1      3.2    60%   (UDP, VXLAN in IPv4)
14/0          8.9      6.3    29%   (Eth hdr)
14/4          9.5      6.3    33%   (Eth hdr in IPv4)
14/3          9.6      6.3    34%   (Eth with odd align)
20/0          9.1      6.8    25%   (IP hdr without options)
7/1           9.1      3.9    57%   (buffer in one quad)
100/0        17.4     13.6    21%   (medium-sized pkt)
100/2        17.7     13.5    24%   (medium-sized pkt w/ alignment)

Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Also tested on these with similar results:

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Intel(R) Atom(TM) CPU N450   @ 1.66GHz

Branch  prediction:

To test the effects of poor branch prediction in the jump tables I
tested checksum performance with runs for two combinations of length
and alignment. As the baseline I performed the test by doing half of
calls with the first combination, followed by using the second
combination for the second half. In the test case, I interleave the
two combinations so that in every call the length and alignment are
different to defeat the effects of branch prediction. Running several
cases, I did not see any material performance difference between the
two scenarios (perf stat output is below), nor does either case
show a significant number of branch misses.

Interleave lengths case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1

 Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 1' (10 runs):

     9,556,693,202  instructions                    ( +-  0.00% )
     1,176,208,640  branches                        ( +-  0.00% )
            19,487  branch-misses    #    0.00% of all branches   ( +-  6.07% )

       2.049732539 seconds time elapsed

Non-interleave case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
 ./csum -M new-thrash -l 100 -S 24 -a 1 -c 1

 Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 1' (10 runs):

     9,782,188,310  instructions                    ( +-  0.00% )
     1,251,286,958  branches                        ( +-  0.01% )
            18,950  branch-misses    #    0.00% of all branches   ( +- 12.74% )

       2.271789046 seconds time elapsed

Signed-off-by: Tom Herbert 
---
 arch/x86/include/asm/checksum_64.h |  21 +
 arch/x86/lib/csum-partial_64.c | 171 ++---
 2 files changed, 102 insertions(+), 90 deletions(-)

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index cd00e17..1224f7d 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
return a;
 }
 
+static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
+{
+   asm("addq %2,%0\n\t"
+   "adcq $0,%0"
+   : "=r" (a)
+   : "0" (a), "rm" (b));
+   return a;
+}
+
+static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
+unsigned int c)
+{
+   asm("addl %1,%0\n\t"
+   "adcl %2,%0\n\t"
+