From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8), %rax
> > 	inc	%rcx
> > 	jnz	10b
> > That loop looks like it will have no overhead on recent CPUs.
>
> Well, it should execute at 1 instruction/cycle.
I presume you
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
> Useful for IE and DIR.
I don't quite follow. It allows the value to be tested without waiting
for pending updates *of other bits* to complete.
Obviously, the update of
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-ker...@vger.kernel.org; li...@horizon.com;
> netdev@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register
From: George Spelvin [mailto:li...@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant
> > gain.
> > You have a dependency chain on the carry flag so have delays between the
> > 'adcq'
> > instructions
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.
The issue is with flags register bits that are *not* modified by
an
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant
> gain.
> You have a dependency chain on the carry flag so have delays between the
> 'adcq'
> instructions (these may be more significant than the memory reads from l1
> cache).
If the carry chain
* Tom Herbert wrote:
> [...] gcc turns these switch statements into jump tables (not function
> tables
> which is what Ingo's example code was using). [...]
So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call
* Tom Herbert wrote:
> Thanks for the explanation and sample code. Expanding on your example, I
> added a
> switch statement to perform to function (code below).
So I think your new switch() based testcase is broken in a subtle way.
The problem is that in your added
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you
> manage
> to offload the logic into arithmetics and not affect the control flow then
> that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics
* Ingo Molnar wrote:
> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> > +
> > + /* Check length */
> > +10:	cmpl	$8, %esi
> > +	jg	30f
> > +	jl	20f
> > +
> > +	/* Exactly 8 bytes length */
> > +	addl	(%rdi), %eax
> > +	adcl	4(%rdi), %eax
>
From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> + /* Main loop */
> +50:	adcq	0*8(%rdi), %rax
> +	adcq	1*8(%rdi), %rax
> +	adcq	2*8(%rdi), %rax
> +	adcq	3*8(%rdi), %rax
> +	adcq	4*8(%rdi), %rax
> +	adcq	5*8(%rdi), %rax
> +	adcq	6*8(%rdi), %rax
> +
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:
Why wasn't that done with C code instead of asm with odd numerical targets?
It seems likely that the real issue is avoiding the short loops (that
will cause branch
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which using switch() to handle alignment, excess
> length
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
wrote:
>
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
>
On Thu, Feb 4, 2016 at 9:09 AM, David Laight wrote:
> From: Tom Herbert
> ...
>> > If nothing else reducing the size of this main loop may be desirable.
>> > I know the newer x86 is supposed to have a loop buffer so that it can
>> > basically loop on already decoded
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical
On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert wrote:
> On Thu, Feb 4, 2016 at 9:09 AM, David Laight wrote:
>> From: Tom Herbert
>> ...
>>> > If nothing else reducing the size of this main loop may be desirable.
>>> > I know the newer x86 is supposed
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);
Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load
* Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after
On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
>
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote:
>
> * Ingo Molnar wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > + /* Check length */
>> > +10:	cmpl	$8, %esi
>> > +	jg	30f
>> > +	jl	20f
>> > +
>> > + /*
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such
On Thu, Feb 4, 2016 at 3:08 AM, David Laight wrote:
> From: Tom Herbert
>> Sent: 03 February 2016 19:19
> ...
>> + /* Main loop */
>> +50:	adcq	0*8(%rdi), %rax
>> +	adcq	1*8(%rdi), %rax
>> +	adcq	2*8(%rdi), %rax
>> +	adcq	3*8(%rdi), %rax
>> +
On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck
wrote:
> On Thu, Feb 4, 2016 at 3:08 AM, David Laight wrote:
>> From: Tom Herbert
>>> Sent: 03 February 2016 19:19
>> ...
>>> + /* Main loop */
>>> +50:	adcq	0*8(%rdi), %rax
>>> +	adcq
From: Tom Herbert
...
> > If nothing else reducing the size of this main loop may be desirable.
> > I know the newer x86 is supposed to have a loop buffer so that it can
> > basically loop on already decoded instructions. Normally it is only
> > something like 64 or 128 bytes in size though. You
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such
On Thu, Feb 4, 2016 at 11:44 AM, Tom Herbert wrote:
> On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
> wrote:
>> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
>>> Implement assembly routine for csum_partial for 64
Implement assembly routine for csum_partial for 64 bit x86. This
primarily speeds up checksum calculation for smaller lengths such as
those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
conversion.