From: Eric Dumazet
> Sent: 05 January 2016 22:19
> To: Tom Herbert
> You might add a comment telling the '4' comes from length of 'adcq
> 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> 'adcq 0*8(%rdi),%rax' is using 3 bytes instead.
>
> We could also use .byte 0x48,
On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> From: Eric Dumazet
> > Sent: 05 January 2016 22:19
> > To: Tom Herbert
> > You might add a comment telling the '4' comes from length of 'adcq
> > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > 'adcq
From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and
On Wed, 2016-01-06 at 14:49 +, David Laight wrote:
> Someone also pointed out that the code is memory limited (dual add
> chains making no difference), so why is it unrolled at all?
Because it matters if the data is already present in CPU caches.
So why not unroll if it helps in some
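The unrolling-vs-memory-bound question can be modeled in plain C (a hypothetical sketch, not the patch's adcq chain): an unrolled one's-complement accumulation only buys instruction-level parallelism when the loads hit cache, which is exactly the case Eric describes.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical C model of the unrolled accumulation (the real patch is
 * an adcq chain in assembly). A 128-bit accumulator collects carries;
 * the final fold is the one's-complement end-around add. */
static uint64_t sum64_unrolled(const uint64_t *p, size_t words)
{
    unsigned __int128 acc = 0;
    size_t i = 0;

    /* Unroll by 4: the extra parallelism only shows up when the data
     * is cache-hot; a memory-bound walk hides it. */
    for (; i + 4 <= words; i += 4)
        acc += (unsigned __int128)p[i] + p[i + 1] + p[i + 2] + p[i + 3];
    for (; i < words; i++)
        acc += p[i];

    uint64_t lo = (uint64_t)acc, hi = (uint64_t)(acc >> 64);
    lo += hi;        /* fold accumulated carries back in ...       */
    if (lo < hi)
        lo++;        /* ... with the end-around carry              */
    return lo;
}
```

Since 2^64 ≡ 1 (mod 2^64 - 1), folding the high half back into the low half preserves the one's-complement sum.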
Hi Tom,
On 05.01.2016 19:41, Tom Herbert wrote:
--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,147 @@
+/* Copyright 2016 Tom Herbert
+ *
+ * Checksum partial calculation
+ *
+ * __wsum csum_partial(const void *buff, int len, __wsum sum)
+ *
+ * Computes the
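For reference, the arithmetic behind that signature is the Internet checksum's one's-complement sum (RFC 1071). A minimal portable C version, folded down to 16 bits for clarity (the kernel routine returns an unfolded 32-bit `__wsum` instead), might look like:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Minimal sketch of the one's-complement sum csum_partial computes;
 * folded to 16 bits here, unlike the kernel's unfolded partial sum. */
static uint16_t csum16(const void *buff, size_t len, uint16_t sum_in)
{
    const uint8_t *p = buff;
    uint32_t sum = sum_in;

    while (len >= 2) {
        uint16_t w;
        memcpy(&w, p, 2);   /* memcpy avoids unaligned-access UB */
        sum += w;
        p += 2;
        len -= 2;
    }
    if (len)                /* trailing odd byte */
        sum += *p;

    while (sum >> 16)       /* end-around carry fold */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```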
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote:
> Hi Tom,
>
> On 05.01.2016 19:41, Tom Herbert wrote:
>>
>> --- /dev/null
>> +++ b/arch/x86/lib/csum-partial_64.S
>> @@ -0,0 +1,147 @@
>> +/* Copyright 2016 Tom Herbert
>> + *
>> + *
On 07.01.2016 03:36, Tom Herbert wrote:
On Wed, Jan 6, 2016 at 5:52 PM, Hannes Frederic Sowa wrote:
Hi Tom,
On 05.01.2016 19:41, Tom Herbert wrote:
--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,147 @@
+/* Copyright 2016 Tom Herbert
Tom Herbert writes:
> Also, we don't do anything special for alignment; unaligned
> accesses on x86 do not appear to be a performance issue.
This is not true on Atom CPUs.
Also on most CPUs there is still a larger penalty when crossing
cache lines.
> Verified correctness
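The alignment concern is straightforward to model: a prologue can consume the misaligned head bytes so the main word loop never straddles a cache line. A hypothetical helper (names are ours, not the patch's):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical prologue helper: how many leading bytes to consume so
 * the main loop reads 8-byte-aligned words. A real checksum prologue
 * must also compensate for an odd starting offset by byte-rotating
 * the partial sum, which is omitted here. */
static size_t head_bytes(const void *p, size_t len)
{
    size_t head = (size_t)(-(uintptr_t)p & 7);
    return head < len ? head : len;
}
```

One's-complement sums are invariant under a whole-sum byte swap, which is what makes the odd-offset compensation a cheap rotate rather than a rescan.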
On Wed, 2016-01-06 at 00:35 +0100, Hannes Frederic Sowa wrote:
>
> Tom, did you have a look if it makes sense to add a second carry
> addition train with the adcx instruction, which does not signal carry
> via the carry flag but with the overflow flag? This instruction should
> not have any
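The adcx/adox idea can be modeled in C with two independent accumulators (a hypothetical sketch: the actual win requires the flag-separated instructions, since ADCX uses only CF and ADOX only OF, letting both chains issue in parallel):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the dual-carry-chain idea behind ADCX/ADOX: two
 * independent accumulators, each carrying its own overflow state,
 * so the two add chains have no data dependence on each other. */
static uint64_t sum64_dual(const uint64_t *p, size_t words)
{
    unsigned __int128 a = 0, b = 0;
    size_t i;

    for (i = 0; i + 2 <= words; i += 2) {
        a += p[i];        /* chain 1 */
        b += p[i + 1];    /* chain 2, independent of chain 1 */
    }
    if (i < words)        /* odd tail word */
        a += p[i];

    unsigned __int128 t = a + b;
    uint64_t lo = (uint64_t)t, hi = (uint64_t)(t >> 64);
    lo += hi;             /* end-around fold, as in the single chain */
    if (lo < hi)
        lo++;
    return lo;
}
```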
On Tue, 2016-01-05 at 17:10 -0800, H. Peter Anvin wrote:
> Apparently "adcq.d8" will do The Right Thing for this.
Nice trick ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at
On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or
On 01/05/2016 02:18 PM, Eric Dumazet wrote:
> On Tue, 2016-01-05 at 10:41 -0800, Tom Herbert wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum
Hi,
On 05.01.2016 19:41, Tom Herbert wrote:
Implement assembly routine for csum_partial for 64 bit x86. This
primarily speeds up checksum calculation for smaller lengths such as
those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after
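The skb_postpull_rcsum speedup mentioned here follows from one's-complement arithmetic: the pulled header's sum can be subtracted from the CHECKSUM_COMPLETE total without rescanning any data. A minimal 16-bit sketch (hypothetical name, not the kernel helper):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of why removing a pulled prefix is cheap: in one's-complement
 * arithmetic, subtraction is addition of the complement, then an
 * end-around carry fold. No bytes are re-read. */
static uint16_t csum_sub16(uint16_t whole, uint16_t pulled)
{
    uint32_t d = (uint32_t)whole + (0xffffu - pulled);
    while (d >> 16)
        d = (d & 0xffff) + (d >> 16);
    return (uint16_t)d;
}
```

Since 0xffff ≡ 0 (mod 0xffff), adding `0xffff - pulled` is congruent to subtracting `pulled`, which is the identity the kernel's csum_sub-style helpers rely on.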