RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-10 Thread David Laight
From: George Spelvin > Sent: 10 February 2016 14:44 ... > > I think the fastest loop is: > > 10: adcq 0(%rdi,%rcx,8),%rax > > inc %rcx > > jnz 10b > > That loop looks like it will have no overhead on recent cpu. > > Well, it should execute at 1 instruction/cycle. I presume you
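For reference, the loop George describes, filled out as a compilable GNU C sketch (the helper name and setup are my assumptions, not from the thread; it presumes an 8-byte-aligned buffer and a non-zero word count). The trick is that inc updates ZF for the jnz but leaves CF alone, so the adc chain survives the loop control:

    /* Count a negative index up to zero so the loop needs no compare. */
    static unsigned long csum_adc_loop(const unsigned long *buf, long quads)
    {
            unsigned long sum = 0;

            buf += quads;           /* point one past the end */
            quads = -quads;         /* negative index counts up to 0 */
            asm("clc\n"
                "1: adcq 0(%[buf],%[i],8), %[sum]\n"
                "   incq %[i]\n"            /* sets ZF, preserves CF */
                "   jnz 1b\n"
                "   adcq $0, %[sum]"        /* fold the final carry */
                : [sum] "+&r" (sum), [i] "+&r" (quads)
                : [buf] "r" (buf)
                : "memory", "cc");
            return sum;
    }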

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-10 Thread George Spelvin
David Laight wrote: > Separate renaming allows: > 1) The value to be tested without waiting for pending updates to complete. > Useful for IE and DIR. I don't quite follow. It allows the value to be tested without waiting for pending updates *of other bits* to complete. Obviously, the update of

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-10 Thread David Laight
From: George Spelvin > Sent: 10 February 2016 00:54 > To: David Laight; linux-ker...@vger.kernel.org; li...@horizon.com; > netdev@vger.kernel.org; > David Laight wrote: > > Since adcx and adox must execute in parallel I clearly need to re-remember > > how dependencies against the flags register

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-09 Thread David Laight
From: George Spelvin [mailto:li...@horizon.com] > Sent: 08 February 2016 20:13 > David Laight wrote: > > I'd need convincing that unrolling the loop like that gives any significant > > gain. > > You have a dependency chain on the carry flag so have delays between the > > 'adcq' > > instructions

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-09 Thread George Spelvin
David Laight wrote: > Since adcx and adox must execute in parallel I clearly need to re-remember > how dependencies against the flags register work. I'm sure I remember > issues with 'false dependencies' against the flags. The issue is with flags register bits that are *not* modified by an
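George's flag-subset point is what makes dual carry chains workable at all. A hedged sketch (assumes a CPU with the ADX extension, i.e. Broadwell or later, and a fixed 32-byte input; the function name is hypothetical): adcx reads and writes only CF, adox only OF, and the body is deliberately straight-line because a dec/jnz loop would clobber OF and break the adox chain:

    static unsigned long csum_adx_32(const unsigned long *buf)
    {
            unsigned long s0 = 0, s1 = 0;

            asm("testq %[s0], %[s0]\n"      /* clears both CF and OF */
                "adcxq 0*8(%[b]), %[s0]\n"  /* chain 1: uses CF only */
                "adoxq 1*8(%[b]), %[s1]\n"  /* chain 2: uses OF only */
                "adcxq 2*8(%[b]), %[s0]\n"
                "adoxq 3*8(%[b]), %[s1]\n"
                "adcxq %[s1], %[s0]\n"      /* merge the two chains */
                "movl $0, %k[s1]\n"         /* mov leaves flags intact */
                "adoxq %[s1], %[s0]\n"      /* fold the pending OF carry */
                "adcq $0, %[s0]\n"          /* fold the pending CF carry */
                : [s0] "+&r" (s0), [s1] "+&r" (s1)
                : [b] "r" (buf)
                : "memory", "cc");
            return s0;
    }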

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-08 Thread George Spelvin
David Laight wrote: > I'd need convincing that unrolling the loop like that gives any significant > gain. > You have a dependency chain on the carry flag so have delays between the > 'adcq' > instructions (these may be more significant than the memory reads from L1 > cache). If the carry chain
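If the aim is just to escape the CF dependency chain, one illustrative alternative (my own sketch, not the patch's approach) is to add 32-bit words into two independent 64-bit accumulators with plain adds; carries pile up harmlessly in the high halves, assuming well under 2^32 words:

    static unsigned long csum_32bit_words(const unsigned int *p, long words)
    {
            unsigned long s0 = 0, s1 = 0;
            long i;

            /* Two accumulators give two independent dependency chains. */
            for (i = 0; i + 1 < words; i += 2) {
                    s0 += p[i];
                    s1 += p[i + 1];
            }
            if (i < words)
                    s0 += p[i];
            return s0 + s1;         /* caller still folds this 64-bit sum */
    }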

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-05 Thread Ingo Molnar
* Tom Herbert wrote: > [...] gcc turns these switch statements into jump tables (not function > tables > which is what Ingo's example code was using). [...] So to the extent this still matters, on most x86 microarchitectures that count, jump tables and function call

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-05 Thread Ingo Molnar
* Tom Herbert wrote: > Thanks for the explanation and sample code. Expanding on your example, I > added a > switch statement to perform the function (code below). So I think your new switch() based testcase is broken in a subtle way. The problem is that in your added

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-05 Thread David Laight
From: Ingo Molnar ... > As Linus noticed, data lookup tables are the intelligent solution: if you > manage > to offload the logic into arithmetics and not affect the control flow then > that's > a big win. The inherent branching will be hidden by executing on massively > parallel arithmetics
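A hedged illustration of that point (simplified, not code from the thread): handling a 0..7-byte tail with a switch costs a jump-table dispatch, while a mask table turns it into pure data flow. Little-endian is assumed, and note the load may run past the buffer's end, exactly the hazard Linus raises further down the thread:

    #include <stdint.h>

    static const uint64_t tail_mask[8] = {
            0x0000000000000000ull, 0x00000000000000ffull,
            0x000000000000ffffull, 0x0000000000ffffffull,
            0x00000000ffffffffull, 0x000000ffffffffffull,
            0x0000ffffffffffffull, 0x00ffffffffffffffull,
    };

    /* One load plus a table lookup replaces an 8-way switch: no branch,
     * no jump table, no mispredict. */
    static uint64_t load_tail(const void *p, unsigned int len)
    {
            return *(const uint64_t *)p & tail_mask[len & 7];
    }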

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Ingo Molnar
* Ingo Molnar wrote: > s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > > > + > > + /* Check length */ > > +10: cmpl $8, %esi > > + jg 30f > > + jl 20f > > + > > + /* Exactly 8 bytes length */ > > + addl (%rdi), %eax > > + adcl 4(%rdi), %eax >

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread David Laight
From: Tom Herbert > Sent: 03 February 2016 19:19 ... > + /* Main loop */ > +50: adcq 0*8(%rdi),%rax > + adcq 1*8(%rdi),%rax > + adcq 2*8(%rdi),%rax > + adcq 3*8(%rdi),%rax > + adcq 4*8(%rdi),%rax > + adcq 5*8(%rdi),%rax > + adcq 6*8(%rdi),%rax > +

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
I missed the original email (I don't have net-devel in my mailbox), but based on Ingo's quoting have a more fundamental question: Why wasn't that done with C code instead of asm with odd numerical targets? It seems likely that the real issue is avoiding the short loops (that will cause branch

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert wrote: > > The reason I did this in assembly is precisely about your point of > having to close the carry chains with adcq $0. I do have a first > implementation in C which uses switch() to handle alignment, excess > length
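Tom's point spelled out: portable C cannot keep a carry alive across statements, so every addition must fold its own carry immediately (a generic idiom, not the patch's code), whereas the asm version keeps one long adc chain and folds once with adcq $0 at the end:

    /* Each step re-materializes the carry as a 0/1 value, lengthening
     * the dependency chain and costing an extra op per word. */
    static unsigned long add_with_fold(unsigned long sum, unsigned long word)
    {
            sum += word;
            sum += (sum < word);    /* 1 iff the add wrapped */
            return sum;
    }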

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote: > > static const unsigned long mask[9] = { > 0x0000000000000000, > 0x000000000000ff00, > 0x000000000000ffff, > 0x00000000ff00ffff, >
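A rough sketch of how a table like that gets used on the final partial word (my reading, with hypothetical names; per Linus's follow-up below, the ff00 entries serve the odd-alignment path that loads one byte early):

    /* 'mask' is the 9-entry table quoted above, indexed by the number
     * of tail bytes; bytes beyond the buffer are zeroed before the
     * word joins the ones-complement sum. */
    static unsigned long csum_tail(unsigned long sum, unsigned long word,
                                   int len)
    {
            word &= mask[len];
            sum += word;
            sum += (sum < word);    /* fold the carry */
            return sum;
    }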

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 9:09 AM, David Laight wrote: > From: Tom Herbert > ... >> > If nothing else reducing the size of this main loop may be desirable. >> > I know the newer x86 is supposed to have a loop buffer so that it can >> > basically loop on already decoded

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote: > I missed the original email (I don't have net-devel in my mailbox), > but based on Ingo's quoting have a more fundamental question: > > Why wasn't that done with C code instead of asm with odd numerical

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Alexander Duyck
On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert wrote: > On Thu, Feb 4, 2016 at 9:09 AM, David Laight wrote: >> From: Tom Herbert >> ... >>> > If nothing else reducing the size of this main loop may be desirable. >>> > I know the newer x86 is supposed

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds wrote: > sum = csum_partial_lt8(*(unsigned long *)buff, len, sum); > return rotate_by8_if_odd(sum, align); Actually, that last word-sized access to "buff" might be past the end of the buffer. The code

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds wrote: > > The "+" should be "-", of course - the point is to shift up the value > by 8 bits for odd cases, and we need to load starting one byte early > for that. The idea is that we use the byte shifter in the load
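The rotate Linus refers to is cheap to sketch in C (a guess at rotate_by8_if_odd's shape from its name and this description, not his actual code). A rotate, unlike a shift, drops no bits of the sum, and since the final 16-bit fold treats rotations by 16 as the identity, the direction does not matter:

    /* Byte-rotate the 64-bit sum iff the buffer alignment was odd,
     * compensating for the one-byte-early load described above. */
    static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long align)
    {
            if (align & 1)
                    sum = (sum << 8) | (sum >> 56);
            return sum;
    }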

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Ingo Molnar
* Tom Herbert wrote: > Implement assembly routine for csum_partial for 64 bit x86. This > primarily speeds up checksum calculation for smaller lengths such as > those that are present when doing skb_postpull_rcsum when getting > CHECKSUM_COMPLETE from device or after

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Alexander Duyck
On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote: > Implement assembly routine for csum_partial for 64 bit x86. This > primarily speeds up checksum calculation for smaller lengths such as > those that are present when doing skb_postpull_rcsum when getting >

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote: > > * Ingo Molnar wrote: > >> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> >> > + >> > + /* Check length */ >> > +10: cmpl $8, %esi >> > + jg 30f >> > + jl 20f >> > + >> > + /*

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck wrote: > On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote: >> Implement assembly routine for csum_partial for 64 bit x86. This >> primarily speeds up checksum calculation for smaller lengths such

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Alexander Duyck
On Thu, Feb 4, 2016 at 3:08 AM, David Laight wrote: > From: Tom Herbert >> Sent: 03 February 2016 19:19 > ... >> + /* Main loop */ >> +50: adcq 0*8(%rdi),%rax >> + adcq 1*8(%rdi),%rax >> + adcq 2*8(%rdi),%rax >> + adcq 3*8(%rdi),%rax >> +

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck wrote: > On Thu, Feb 4, 2016 at 3:08 AM, David Laight wrote: >> From: Tom Herbert >>> Sent: 03 February 2016 19:19 >> ... >>> + /* Main loop */ >>> +50: adcq 0*8(%rdi),%rax >>> + adcq

RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread David Laight
From: Tom Herbert ... > > If nothing else reducing the size of this main loop may be desirable. > > I know the newer x86 is supposed to have a loop buffer so that it can > > basically loop on already decoded instructions. Normally it is only > > something like 64 or 128 bytes in size though. You

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck wrote: > On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote: >> Implement assembly routine for csum_partial for 64 bit x86. This >> primarily speeds up checksum calculation for smaller lengths such

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Alexander Duyck
On Thu, Feb 4, 2016 at 11:44 AM, Tom Herbert wrote: > On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck > wrote: >> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote: >>> Implement assembly routine for csum_partial for 64

[PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-03 Thread Tom Herbert
Implement assembly routine for csum_partial for 64 bit x86. This primarily speeds up checksum calculation for smaller lengths such as those that are present when doing skb_postpull_rcsum when getting CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
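For context on what the routine ultimately returns: the wide ones-complement sum must still be folded down to 16 bits. The usual fold idiom looks like this (a standard sketch, not necessarily the patch's exact epilogue):

    /* Fold a 64-bit ones-complement sum to 16 bits; each halving adds
     * the high part into the low part, carries included. */
    static unsigned int csum_fold_64(unsigned long sum)
    {
            sum = (sum & 0xffffffffUL) + (sum >> 32);   /* 64 -> 33 bits */
            sum = (sum & 0xffffffffUL) + (sum >> 32);   /* 33 -> 32 bits */
            sum = (sum & 0xffffUL) + (sum >> 16);       /* 32 -> 17 bits */
            sum = (sum & 0xffffUL) + (sum >> 16);       /* 17 -> 16 bits */
            return (unsigned int)sum;
    }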