Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-25 Thread Neil Horman
On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote: > If implemented properly adcx/adox should give additional speedup... that is > the whole reason for their existence. > Ok, fair enough. Unfortunately, I'm not going to be able to get my hands on a stepping of this CPU to test any

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-25 Thread Neil Horman
On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote: If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence. Ok, fair enough. Unfortunately, I'm not going to be able to get my hands on a stepping of this CPU to test any code

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > > > > Ok, so I ran the above code on a single cpu using taskset, and set irq > > affinity > > such that no interrupts (save for local ones), would occur on that cpu. > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Eric Dumazet
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > Ok, so I ran the above code on a single cpu using taskset, and set irq > affinity > such that no interrupts (save for local ones), would occur on that cpu. Note > that I had to convert csum_partial_opt to csum_partial, as the _opt

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > #define BUFSIZ_ORDER 4 > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > > static int __init csum_init_module(void) > > { > > int i; > > __wsum sum = 0; > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Doug Ledford
On 10/19/2013 04:23 AM, Ingo Molnar wrote: > > * Doug Ledford wrote: >> All prefetch operations get sent to an access queue in the memory >> controller where they compete with both other reads and writes for the >> available memory bandwidth. The optimal prefetch window is not a factor >> of

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote: > On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote: > > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > > > > > #define BUFSIZ_ORDER 4 > > > > #define

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Eric Dumazet
On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote: > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > > > #define BUFSIZ_ORDER 4 > > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > > > static int __init

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Eric Dumazet
On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote: On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: #define BUFSIZ_ORDER 4 #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) static int __init

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote: On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote: On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: #define BUFSIZ_ORDER 4 #define BUFSIZ ((2

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Doug Ledford
On 10/19/2013 04:23 AM, Ingo Molnar wrote: * Doug Ledford dledf...@redhat.com wrote: All prefetch operations get sent to an access queue in the memory controller where they compete with both other reads and writes for the available memory bandwidth. The optimal prefetch window is not a

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: #define BUFSIZ_ORDER 4 #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) static int __init csum_init_module(void) { int i; __wsum sum = 0; struct

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Eric Dumazet
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: Ok, so I ran the above code on a single cpu using taskset, and set irq affinity such that no interrupts (save for local ones), would occur on that cpu. Note that I had to convert csum_partial_opt to csum_partial, as the _opt variant

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-21 Thread Neil Horman
On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: Ok, so I ran the above code on a single cpu using taskset, and set irq affinity such that no interrupts (save for local ones), would occur on that cpu. Note that I

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-20 Thread Neil Horman
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > > > #define BUFSIZ_ORDER 4 > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > > static int __init csum_init_module(void) > > { > > int i; > > __wsum sum = 0; > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-20 Thread Neil Horman
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote: On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: #define BUFSIZ_ORDER 4 #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) static int __init csum_init_module(void) { int i; __wsum sum = 0; struct

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-19 Thread Ingo Molnar
* Doug Ledford wrote: > >> Based on these, prefetching is obviously a good improvement, but > >> not as good as parallel execution, and the winner by far is doing > >> both. > > OK, this is where I have to chime in that these tests can *not* be used > to say anything about prefetch, and

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-19 Thread Ingo Molnar
* Doug Ledford dledf...@redhat.com wrote: Based on these, prefetching is obviously a good improvement, but not as good as parallel execution, and the winner by far is doing both. OK, this is where I have to chime in that these tests can *not* be used to say anything about

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Doug Ledford
On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote: > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: >> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: >> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: >> > > attached patch brings much better results >> > > >> > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Doug Ledford
On 2013-10-17, Ingo wrote: > * Neil Horman wrote: > >> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: >> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: >> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: >> > > >> > > > So, early testing results today. I wrote

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Eric Dumazet
On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: > #define BUFSIZ_ORDER 4 > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) > static int __init csum_init_module(void) > { > int i; > __wsum sum = 0; > struct timespec start, end; > u64 time; > struct page

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote: > On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote: > > > > > > for(i=0;i<10;i++) { > > sum = csum_partial(buf+offset, PAGE_SIZE, sum); > > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE :

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Eric Dumazet
On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote: > > > for(i=0;i<10;i++) { > sum = csum_partial(buf+offset, PAGE_SIZE, sum); > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0; > } Please replace this by random accesses, and use the

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread H. Peter Anvin
If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence. Neil Horman wrote: >On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote: >> On 10/11/2013 09:51 AM, Neil Horman wrote: >> > Sébastien Dugué reported to me that devices

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
> > Your benchmark uses a single 4K page, so data is _super_ hot in cpu > caches. > ( prefetch should give no speedups, I am surprised it makes any > difference) > > Try now with 32 huge pages, to get 64 MBytes of working set. > > Because in reality we never csum_partial() data in cpu cache. >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote: > On 10/11/2013 09:51 AM, Neil Horman wrote: > > Sébastien Dugué reported to me that devices implementing ipoib (which don't > > have > > checksum offload hardware were spending a significant amount of time > > computing > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Ingo Molnar
* H. Peter Anvin wrote: > On 10/17/2013 01:41 AM, Ingo Molnar wrote: > > > > To correctly simulate the workload you'd have to: > > > > - allocate a buffer larger than your L2 cache. > > > > - to measure the effects of the prefetches you'd also have to randomize > >the individual buffer

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Ingo Molnar
* H. Peter Anvin h...@zytor.com wrote: On 10/17/2013 01:41 AM, Ingo Molnar wrote: To correctly simulate the workload you'd have to: - allocate a buffer larger than your L2 cache. - to measure the effects of the prefetches you'd also have to randomize the individual buffer

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote: On 10/11/2013 09:51 AM, Neil Horman wrote: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a significant amount of time computing checksums. We

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
Your benchmark uses a single 4K page, so data is _super_ hot in cpu caches. ( prefetch should give no speedups, I am surprised it makes any difference) Try now with 32 huge pages, to get 64 MBytes of working set. Because in reality we never csum_partial() data in cpu cache. (Unless

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread H. Peter Anvin
If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence. Neil Horman nhor...@tuxdriver.com wrote: On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote: On 10/11/2013 09:51 AM, Neil Horman wrote: Sébastien Dugué reported to me

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Eric Dumazet
On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote: for(i=0;i<10;i++) { sum = csum_partial(buf+offset, PAGE_SIZE, sum); offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0; } Please replace this by random accesses, and use the more

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Neil Horman
On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote: On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote: for(i=0;i<10;i++) { sum = csum_partial(buf+offset, PAGE_SIZE, sum); offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0; }

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Eric Dumazet
On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote: #define BUFSIZ_ORDER 4 #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2)) static int __init csum_init_module(void) { int i; __wsum sum = 0; struct timespec start, end; u64 time; struct page *page;

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Doug Ledford
On 2013-10-17, Ingo wrote: * Neil Horman nhor...@tuxdriver.com wrote: On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: So, early testing results today. I

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-18 Thread Doug Ledford
On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote: On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: attached patch brings much better results lpq83:~# ./netperf -H

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Eric Dumazet
On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote: > Seriously, though, how much does it matter? All the above seems likely > to do is to drown the signal by adding noise. I don't think so. > > If the parallel (threaded) checksumming is faster, which theory says it > should and

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread H. Peter Anvin
On 10/17/2013 01:41 AM, Ingo Molnar wrote: > > To correctly simulate the workload you'd have to: > > - allocate a buffer larger than your L2 cache. > > - to measure the effects of the prefetches you'd also have to randomize >the individual buffer positions. See how 'perf bench numa'

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Ingo Molnar
* Neil Horman wrote: > On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: > > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: > > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > > > > > > > So, early testing results today. I wrote a test module that, allocated > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Ingo Molnar
* Neil Horman nhor...@tuxdriver.com wrote: On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: So, early testing results today. I wrote a test module that,

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread H. Peter Anvin
On 10/17/2013 01:41 AM, Ingo Molnar wrote: To correctly simulate the workload you'd have to: - allocate a buffer larger than your L2 cache. - to measure the effects of the prefetches you'd also have to randomize the individual buffer positions. See how 'perf bench numa' implements a

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-17 Thread Eric Dumazet
On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote: Seriously, though, how much does it matter? All the above seems likely to do is to drown the signal by adding noise. I don't think so. If the parallel (threaded) checksumming is faster, which theory says it should and

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Eric Dumazet
On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote: > > > > So I went to reproduce these results, but was unable to (due to the fact that > I > only have a pretty jittery network to do testing across at the moment with > these devices). So instead I figured that I would go back to just

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Neil Horman
On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > > > > > So, early testing results today. I wrote a test module that allocated a > > > 4k > > > buffer, initialized

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Joe Perches
On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote: > Prefetch takes memory from L2->L1 memory > just as much as it moves it cachelines from memory to the L2 cache. Yup, mea culpa. I thought the prefetch was still to L1 like the Pentium.

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Ingo Molnar
* Joe Perches wrote: > On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote: > > * Joe Perches wrote: > > > > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Ingo Molnar
* Joe Perches j...@perches.com wrote: On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote: * Joe Perches j...@perches.com wrote: On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: On Mon, 2013-10-14 at 15:18 -0700, Eric

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Joe Perches
On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote: Prefetch takes memory from L2->L1 memory just as much as it moves it cachelines from memory to the L2 cache. Yup, mea culpa. I thought the prefetch was still to L1 like the Pentium.

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Neil Horman
On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: So, early testing results today. I wrote a test module that allocated a 4k buffer, initialized it with random

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-16 Thread Eric Dumazet
On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote: So I went to reproduce these results, but was unable to (due to the fact that I only have a pretty jittery network to do testing across at the moment with these devices). So instead I figured that I would go back to just doing

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote: > Ingo, Eric _showed_ that the prefetch is good here. > How about looking at a little optimization to the minimal > prefetch that gives that level of performance. Wait a minute, my point was to remind that main cost is the memory fetching.

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote: > > I get the csum_partial() if disabling prequeue. > > At least in the ipoib case i would consider that a misconfiguration. There is nothing you can do, if application is not blocked on recv(), but using poll()/epoll()/select(), prequeue is

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Joe Perches
On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote: > * Joe Perches wrote: > > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > > > > attached patch brings

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Andi Kleen
> I get the csum_partial() if disabling prequeue. At least in the ipoib case i would consider that a misconfiguration. "don't do this if it hurts" There may be more such problems. -Andi

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote: > And the receiver should also do the same : (ethtool -K eth0 rx off) > > 10.55%netserver [kernel.kallsyms] [k] > csum_partial_copy_generic I get the csum_partial() if disabling prequeue. echo 1

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote: > Hi Eric, > > On Tue, 15 Oct 2013 07:06:25 -0700 > Eric Dumazet wrote: > > But the csum cost is both for sender and receiver ? > > No, it was only on the receiver side that I noticed it. > Yes, as Andi said, we do the csum while

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
Hi Eric, On Tue, 15 Oct 2013 07:06:25 -0700 Eric Dumazet wrote: > On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: > > On Tue, 15 Oct 2013 15:33:36 +0200 > > Andi Kleen wrote: > > > > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR > > > > hardware > > > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: > On Tue, 15 Oct 2013 15:33:36 +0200 > Andi Kleen wrote: > > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR > > > hardware > > > where one cannot benefit from hardware offloads. > > > > Is this with sendfile? >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen wrote: > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware > > where one cannot benefit from hardware offloads. > > Is this with sendfile? Tests were done with iperf at the time without any extra funky options, and

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Andi Kleen
> indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware > where one cannot benefit from hardware offloads. Is this with sendfile? For normal send() the checksum is done in the user copy and for receiving it can be also done during the copy in most cases -Andi

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Neil Horman
On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote: > > * Andi Kleen wrote: > > > > > Neil Horman writes: > > > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > > don't have checksum offload

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Neil Horman
On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: > > > > > > * Neil Horman wrote: > > > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > > don't have

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Borislav Petkov wrote: > On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote: > > Most processors have hundreds of cachelines even in their L1 cache. > > Thousands in the L2 cache, up to hundreds of thousands. > > Also, I have this hazy memory of prefetch hints being harmful in some

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Borislav Petkov
On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote: > Most processors have hundreds of cachelines even in their L1 cache. > Thousands in the L2 cache, up to hundreds of thousands. Also, I have this hazy memory of prefetch hints being harmful in some situations:

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
Hi Neil, Andi, On Mon, 14 Oct 2013 16:25:28 -0400 Neil Horman wrote: > On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: > > Neil Horman writes: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have > > > checksum offload hardware were

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Joe Perches wrote: > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > > > attached patch brings much better results > > > > > > > > lpq83:~# ./netperf -H 7.7.8.84

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Neil Horman wrote: > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have checksum offload hardware were spending a significant amount > > > of time

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote: I get the csum_partial() if disabling prequeue. At least in the ipoib case i would consider that a misconfiguration. There is nothing you can do, if application is not blocked on recv(), but using poll()/epoll()/select(), prequeue is not

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote: Ingo, Eric _showed_ that the prefetch is good here. How about looking at a little optimization to the minimal prefetch that gives that level of performance. Wait a minute, my point was to remind that main cost is the memory fetching. Its

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Neil Horman nhor...@tuxdriver.com wrote: On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Joe Perches j...@perches.com wrote: On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: attached patch brings much better results lpq83:~# ./netperf -H 7.7.8.84 -l

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
Hi Neil, Andi, On Mon, 14 Oct 2013 16:25:28 -0400 Neil Horman nhor...@tuxdriver.com wrote: On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: Neil Horman nhor...@tuxdriver.com writes: Sébastien Dugué reported to me that devices implementing ipoib (which don't have

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Borislav Petkov
On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote: Most processors have hundreds of cachelines even in their L1 cache. Thousands in the L2 cache, up to hundreds of thousands. Also, I have this hazy memory of prefetch hints being harmful in some situations:

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Ingo Molnar
* Borislav Petkov b...@alien8.de wrote: On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote: Most processors have hundreds of cachelines even in their L1 cache. Thousands in the L2 cache, up to hundreds of thousands. Also, I have this hazy memory of prefetch hints being harmful

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Neil Horman
On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: Sébastien Dugué reported to me that devices implementing ipoib

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Neil Horman
On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote: * Andi Kleen a...@firstfloor.org wrote: Neil Horman nhor...@tuxdriver.com writes: Sébastien Dugué reported to me that devices implementing ipoib (which don't

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Andi Kleen
indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. Is this with sendfile? For normal send() the checksum is done in the user copy and for receiving it can be also done during the copy in most cases -Andi

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen a...@firstfloor.org wrote: indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. Is this with sendfile? Tests were done with iperf at the time without any extra funky

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen a...@firstfloor.org wrote: indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. Is this with sendfile?

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Sébastien Dugué
Hi Eric, On Tue, 15 Oct 2013 07:06:25 -0700 Eric Dumazet eric.duma...@gmail.com wrote: On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen a...@firstfloor.org wrote: indeed, our typical workload is connected mode IPoIB on mlx4 QDR

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote: Hi Eric, On Tue, 15 Oct 2013 07:06:25 -0700 Eric Dumazet eric.duma...@gmail.com wrote: But the csum cost is both for sender and receiver ? No, it was only on the receiver side that I noticed it. Yes, as Andi said, we do the

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Eric Dumazet
On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote: And the receiver should also do the same : (ethtool -K eth0 rx off) 10.55%netserver [kernel.kallsyms] [k] csum_partial_copy_generic I get the csum_partial() if disabling prequeue. echo 1

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Andi Kleen
I get the csum_partial() if disabling prequeue. At least in the ipoib case i would consider that a misconfiguration. "don't do this if it hurts" There may be more such problems. -Andi

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-15 Thread Joe Perches
On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote: * Joe Perches j...@perches.com wrote: On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: attached patch brings

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Joe Perches
On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > > attached patch brings much better results > > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc > > > MIGRATED TCP STREAM

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote: > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > > attached patch brings much better results > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 > > () port

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Joe Perches
On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: > attached patch brings much better results > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () > port 0 AF_INET > Recv SendSend

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > > > So, early testing results today. I wrote a test module that allocated a 4k > > buffer, initialized it with random data, and called csum_partial on it 10 > > times, recording

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: > So, early testing results today. I wrote a test module that allocated a 4k > buffer, initialized it with random data, and called csum_partial on it 10 > times, recording the time at the start and end of that loop. Results on a 2.4 >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote: > * Andi Kleen wrote: > > > Neil Horman writes: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have checksum offload hardware were spending a significant > > > amount of time computing > > > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Neil Horman
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > don't have checksum offload hardware were spending a significant amount > > of time computing checksums. We found that by

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Neil Horman
On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: > Neil Horman writes: > > > Sébastien Dugué reported to me that devices implementing ipoib (which don't > > have > > checksum offload hardware were spending a significant amount of time > > computing > > Must be an odd workload, most

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Ingo Molnar
* Andi Kleen wrote: > Neil Horman writes: > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > don't have checksum offload hardware were spending a significant > > amount of time computing > > Must be an odd workload, most TCP/UDP workloads do copy-checksum >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Ingo Molnar
* Andi Kleen a...@firstfloor.org wrote: Neil Horman nhor...@tuxdriver.com writes: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a significant amount of time computing Must be an odd workload, most TCP/UDP

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Neil Horman
On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: Neil Horman nhor...@tuxdriver.com writes: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a significant amount of time computing Must be an odd

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Neil Horman
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a significant amount of time computing checksums. We found

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote: * Andi Kleen a...@firstfloor.org wrote: Neil Horman nhor...@tuxdriver.com writes: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a significant amount

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: So, early testing results today. I wrote a test module that allocated a 4k buffer, initialized it with random data, and called csum_partial on it 10 times, recording the time at the start and end of that loop. Results on a 2.4 GHz

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Eric Dumazet
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote: On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote: So, early testing results today. I wrote a test module that allocated a 4k buffer, initialized it with random data, and called csum_partial on it 10 times, recording the time

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-14 Thread Joe Perches
On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote: attached patch brings much better results lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv SendSend Utilization
