RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-04 Thread David Laight
> > I think you need 3 instructions, move a 0, conditionally move a 1 > > then add. I suspect it won't be a win! Or, with an appropriately unrolled loop, for each word: zero %eax, cmove a 1 to %al cmove a 1 to %ah shift %eax left, cmove a 1 to %al cmove a 1 to %ah,

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote: > On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote: > > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote: > > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > > > > > > > I think it would be better if we just did

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Joe Perches
On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote: > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote: > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > > > > > I think it would be better if we just did the prefetch here > > > and re-addressed this area when AVX (or

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote: > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > > > I think it would be better if we just did the prefetch here > > and re-addressed this area when AVX (or adcx/adox) instructions were > > available > > for testing on

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Joe Perches
On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote: > I think it would be better if we just did the prefetch here > and re-addressed this area when AVX (or adcx/adox) instructions were > available > for testing on hardware. Could there be a difference if only a single software prefetch was

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 04:18:50PM -0000, David Laight wrote: > > How would you suggest replacing the jumps in this case? I agree it would be > > faster here, but I'm not sure how I would implement an increment using a > > single > > conditional move. > > I think you need 3 instructions, move a

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread David Laight
> How would you suggest replacing the jumps in this case? I agree it would be > faster here, but I'm not sure how I would implement an increment using a > single > conditional move. I think you need 3 instructions, move a 0, conditionally move a 1 then add. I suspect it won't be a win! If you
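The "move a 0, conditionally move a 1, then add" idea can be sketched in C as follows. This is only an illustration, not code from the patch: the function name `csum_add_nobranch` is hypothetical, and whether a compiler emits setc, cmov, or adc for the comparison is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: sum 64-bit words with
 * end-around carry, with no conditional branch on the carry itself. */
uint64_t csum_add_nobranch(const uint64_t *words, size_t n)
{
    uint64_t sum = 0;
    uint64_t carries = 0;

    assert(words != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        sum += words[i];
        /* (sum < words[i]) evaluates to 1 exactly when the addition
         * wrapped, so the carry is produced as a value (0 or 1) and
         * then added, rather than being jumped on. */
        carries += (sum < words[i]);
    }

    /* Fold the deferred carries back in (ones' complement addition). */
    sum += carries;
    sum += (sum < carries);
    return sum;
}
```

Deferring the carries into a separate counter keeps the loop body free of flag-dependent jumps, at the cost of the extra instructions the message above is sceptical about.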

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ben Hutchings
On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote: > On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote: > > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote: > > [...] > > > It > > > functions, but unfortunately the performance lost to the completely broken > > > branch

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote: > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote: > [...] > > It > > functions, but unfortunately the performance lost to the completely broken > > branch prediction that this inflicts makes it a non starter: > [...] > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ben Hutchings
On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote: [...] > It > functions, but unfortunately the performance lost to the completely broken > branch prediction that this inflicts makes it a non starter: [...] Conditional branches are no good but conditional moves might be worth a shot. Ben.

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Neil Horman
On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote: > > > > > > * Neil Horman wrote: > > > > > > > > etc. For such short runtimes make sure the last column displays > > > > > close to 100%,

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ingo Molnar
* Neil Horman wrote: > Prefetch and simulated adcx/adox from above: > Performance counter stats for './test.sh' (20 runs): > > 35,704,331 L1-dcache-load-misses > ( +- 0.07% ) [75.00%] > 0 L1-dcache-prefetches

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-11-01 Thread Ingo Molnar
* Neil Horman wrote: > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > > etc. For such short runtimes make sure the last column displays > > > > close to 100%, so that the PMU results become trustable. > > > > > > > > A nehalem+ PMU will

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-31 Thread Neil Horman
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote: > On 10/30/2013 07:02 AM, Neil Horman wrote: > > >That does make sense, but it then begs the question, what's the advantage of > >having multiple ALUs at all? > > There's lots of ALU operations that don't operate on the flags or >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-31 Thread Neil Horman
On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > > etc. For such short runtimes make sure the last column displays > > > close to 100%, so that the PMU results become trustable. > > > > > > A nehalem+ PMU will allow 2-4 events to be measured in

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-31 Thread Ingo Molnar
* Neil Horman wrote: > > etc. For such short runtimes make sure the last column displays > > close to 100%, so that the PMU results become trustable. > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, > > plus generics like 'cycles', 'instructions' can be added 'for

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Neil Horman
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote: > On 10/30/2013 07:02 AM, Neil Horman wrote: > > >That does make sense, but it then begs the question, what's the advantage of > >having multiple ALUs at all? > > There's lots of ALU operations that don't operate on the flags or >

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread David Laight
... > and then I also wanted to try using both xmm and ymm registers and doing > 64bit adds with 32bit numbers across multiple xmm/ymm registers as that > should parallel nicely. David, you mentioned you've tried this, how did > your experiment turn out and what was your method? I was planning

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Doug Ledford
On 10/30/2013 07:02 AM, Neil Horman wrote: That does make sense, but it then begs the question, what's the advantage of having multiple ALUs at all? There's lots of ALU operations that don't operate on the flags or other entities that can be run in parallel. If they're just going to

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Doug Ledford
On 10/30/2013 08:18 AM, David Laight wrote: /me wonders if rearranging the instructions into this order: adcq 0*8(src), res1 adcq 1*8(src), res2 adcq 2*8(src), res1 Those have to be sequenced. Using a 64bit lea to add 32bit quantities should avoid the dependencies on the flags register.

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread David Laight
> /me wonders if rearranging the instructions into this order: > adcq 0*8(src), res1 > adcq 1*8(src), res2 > adcq 2*8(src), res1 Those have to be sequenced. Using a 64bit lea to add 32bit quantities should avoid the dependencies on the flags register. However you'd need to get 3 of those active
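The point about avoiding the flags register can be illustrated in C. The sketch below is an assumption-laden illustration, not code from the thread: the names `fold64` and `csum32_two_chains` are hypothetical, and C offers no way to request `lea` specifically, so this shows only the arithmetic. Because each 32-bit word is added into a 64-bit accumulator, no carry out of the accumulator can occur for any realistic buffer length, so the two chains never need to pass a carry between additions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator down to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t acc)
{
    acc = (acc & 0xffffffffu) + (acc >> 32);  /* at most 33 bits */
    acc = (acc & 0xffffffffu) + (acc >> 32);  /* at most 32 bits */
    acc = (acc & 0xffff) + (acc >> 16);       /* at most 17 bits */
    acc = (acc & 0xffff) + (acc >> 16);       /* at most 16 bits */
    return (uint16_t)acc;
}

/* Hypothetical sketch: add 32-bit words into two independent 64-bit
 * accumulators. Assumes n is far below 2^31 so that neither
 * accumulator, nor their final sum, can overflow 64 bits. */
uint16_t csum32_two_chains(const uint32_t *words, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;
    size_t i;

    assert(words != NULL || n == 0);

    for (i = 0; i + 1 < n; i += 2) {
        acc0 += words[i];      /* chain 0: no dependency on chain 1 */
        acc1 += words[i + 1];  /* chain 1: no dependency on chain 0 */
    }
    if (i < n)
        acc0 += words[i];      /* odd trailing word */

    return fold64(acc0 + acc1);
}
```

The trade-off raised in the message still applies: each add handles only 32 bits of payload, so more than two such chains would need to be in flight to beat a single 64-bit adc chain.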

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread Neil Horman
On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote: > * Neil Horman wrote: > > 3) The run times are proportionally larger, but still indicate that > > Parallel ALU > > execution is hurting rather than helping, which is counter-intuitive. I'm > > looking into it, but thought you might

RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-30 Thread David Laight
> The parallel ALU design of this patch seems OK at first glance, but it means > that two parallel operations are both trying to set/clear both the overflow > and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either > some CPU in the past had a set of overflow/carry flags per

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Doug Ledford
* Neil Horman wrote: > 3) The run times are proportionally larger, but still indicate that Parallel > ALU > execution is hurting rather than helping, which is counter-intuitive. I'm > looking into it, but thought you might want to see these results in case > something jumped out at you So

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > So, I apologize, you were right. I was running the test.sh script > > but perf was measuring itself. [...] > > Ok, cool - one mystery less! > > > Which overall looks a lot more like I expect, save for

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > So, I apologize, you were right. I was running the test.sh script > but perf was measuring itself. [...] Ok, cool - one mystery less! > Which overall looks a lot more like I expect, save for the parallel > ALU cases. It seems here that the parallel ALU changes

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > I'm sure it worked properly on my system here, I specifically > > checked it, but I'll gladly run it again. You have to give me an > > hour as I have a meeting to run to, but I'll have results

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread David Ahern
On 10/29/13 6:52 AM, Ingo Molnar wrote: According to the perf man page, I'm supposed to be able to use -- to separate perf command line parameters from the command I want to run. And it definitely executed test.sh, I added an echo to stdout in there as a test run and observed them get captured

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > I'm sure it worked properly on my system here, I specifically > > checked it, but I'll gladly run it again. You have to give me an > > hour as I have a meeting to run to, but I'll have results

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > I'm sure it worked properly on my system here, I specifically > checked it, but I'll gladly run it again. You have to give me an > hour as I have a meeting to run to, but I'll have results shortly. So what I tried to react to was this observation of yours: > > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote: > > > > > > * Neil Horman wrote: > > > > > > > Sure it was this: > > > > for i in `seq 0 1 3` > > > > do > > > > echo $i >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > Sure it was this: > > > for i in `seq 0 1 3` > > > do > > > echo $i > /sys/module/csum_test/parameters/module_test_mode > > > taskset -c 0 perf stat --repeat 20 -C 0

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Sure it was this: > > for i in `seq 0 1 3` > > do > > echo $i > /sys/module/csum_test/parameters/module_test_mode > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- > >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > Sure it was this: > for i in `seq 0 1 3` > do > echo $i > /sys/module/csum_test/parameters/module_test_mode > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- > /root/test.sh > done >> counters.txt 2>&1 > > where test.sh is: > #!/bin/sh > echo

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Neil Horman
On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Here's my data for running the same test with taskset restricting > > execution to only cpu0. I'm not quite sure what's going on here, > > but doing so resulted in a 10x slowdown of the runtime of each

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Doug Ledford wrote: > [ Snipped a couple of really nice real-life bandwidth tests. ] > Some of my preliminary results: > > 1) Regarding the initial claim that changing the code to have two > addition chains, allowing the use of two ALUs, doubling > performance: I'm just not seeing it. I

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-29 Thread Ingo Molnar
* Neil Horman wrote: > Here's my data for running the same test with taskset restricting > execution to only cpu0. I'm not quite sure what's going on here, > but doing so resulted in a 10x slowdown of the runtime of each > iteration which I can't explain. As before however, both the >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote: > On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote: > > > > * Neil Horman wrote: > > > > > Looking at the specific cpu counters we get this: > > > > > > Base: > > > Total time: 0.179 [sec] > > > > > > Performance

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Base: > >0.093269042 seconds time elapsed > > ( +- 2.24% ) > > Prefetch (5x64): > >0.079440009 seconds time elapsed

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > Looking at the specific cpu counters we get this: > > > > Base: > > Total time: 0.179 [sec] > > > > Performance counter stats for 'perf bench sched messaging -- bash -c echo > > 1 >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Doug Ledford
On 10/26/2013 07:55 AM, Ingo Molnar wrote: > > * Doug Ledford wrote: > >>> What I was objecting to strongly here was to measure the _wrong_ >>> thing, i.e. the cache-hot case. The cache-cold case should be >>> measured in a low noise fashion, so that results are >>> representative. It's closer to

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread David Ahern
On 10/28/13 10:24 AM, Ingo Molnar wrote: The most accurate method of measurement for such single-threaded workloads is something like: taskset 0x1 perf stat -a -C 1 --repeat 20 ... this will bind your workload to CPU#0, and will do PMU measurements only there - without mixing in other

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Ingo Molnar
* Neil Horman wrote: > Looking at the specific cpu counters we get this: > > Base: > Total time: 0.179 [sec] > > Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > > /sys/module/csum_test/parameters/test_fire' (20 runs): > >1571.304618 task-clock

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Ingo Molnar
* Neil Horman wrote: > Base: >0.093269042 seconds time elapsed >( +- 2.24% ) > Prefetch (5x64): >0.079440009 seconds time elapsed >( +- 2.29% ) > Parallel ALU: >0.08777 seconds
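For readers unfamiliar with the "Prefetch" variant being timed here, a software-prefetching summation loop might look roughly like the sketch below. It is not the code that was benchmarked: the function name `sum_with_prefetch` is hypothetical, the 64-byte cache line is assumed, and reading "5x64" as "five cache lines ahead" is an interpretation of the label rather than something the preview states. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE      64                          /* assumed line size in bytes */
#define WORDS_PER_LINE  (CACHE_LINE / sizeof(uint32_t))
#define PREFETCH_WORDS  (5 * WORDS_PER_LINE)        /* assumed reading of "5x64" */

/* Hypothetical sketch: sum 32-bit words one cache line at a time,
 * issuing one prefetch per line for data five lines further on. */
uint64_t sum_with_prefetch(const uint32_t *words, size_t n)
{
    uint64_t acc = 0;
    size_t i = 0;

    assert(words != NULL || n == 0);

    while (n - i >= WORDS_PER_LINE) {
        /* Only prefetch addresses that lie inside the buffer. */
        if (n - i > PREFETCH_WORDS)
            __builtin_prefetch(&words[i + PREFETCH_WORDS]);
        for (size_t j = 0; j < WORDS_PER_LINE; j++)
            acc += words[i + j];
        i += WORDS_PER_LINE;
    }
    for (; i < n; i++)          /* tail shorter than one cache line */
        acc += words[i];
    return acc;
}
```

The prefetch is only a hint, so the result is identical with or without it; any benefit shows up solely in timing and cache-miss counters like those quoted in this thread.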

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
Ingo, et al.- Ok, sorry for the delay, here are the test results you've been asking for. First, some information about what I did. I attached the module that I ran this test with at the bottom of this email. You'll note that I started using a module parameter write patch to trigger

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Doug Ledford
On 10/26/2013 07:55 AM, Ingo Molnar wrote: * Doug Ledford dledf...@redhat.com wrote: What I was objecting to strongly here was to measure the _wrong_ thing, i.e. the cache-hot case. The cache-cold case should be measured in a low noise fashion, so that results are representative. It's

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: Looking at the specific cpu counters we get this: Base: Total time: 0.179 [sec] Performance counter stats for 'perf bench sched messaging -- bash -c echo 1

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: Base: 0.093269042 seconds time elapsed ( +- 2.24% ) Prefetch (5x64): 0.079440009 seconds time elapsed

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-28 Thread Neil Horman
On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote: On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: Looking at the specific cpu counters we get this: Base: Total time: 0.179 [sec] Performance

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-27 Thread Neil Horman
On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote: > > * Neil Horman wrote: > > > > You keep ignoring my request to calculate and account for noise of the > > > measurement. > > > > Don't confuse "ignoring" with "haven't gotten there yet". [...] > > So, instead of replying to my

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-27 Thread Ingo Molnar
* Neil Horman wrote:
> > You keep ignoring my request to calculate and account for noise of the
> > measurement.
>
> Don't confuse "ignoring" with "haven't gotten there yet". [...]
So, instead of replying to my repeated feedback with a single line mail that you plan to address it, you

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-27 Thread Ingo Molnar
* Neil Horman nhor...@tuxdriver.com wrote: You keep ignoring my request to calculate and account for noise of the measurement. Don't confuse ignoring with haven't gotten there yet. [...] So, instead of replying to my repeated feedback with a single line mail that you plan to address

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-27 Thread Neil Horman
On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: You keep ignoring my request to calculate and account for noise of the measurement. Don't confuse ignoring with haven't gotten there yet. [...] So, instead of replying to my

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Neil Horman
On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote: > > * Neil Horman wrote: > > > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: > > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > > > > > > > > > > Ok, so I ran the above code on a single cpu using

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Ingo Molnar
* Neil Horman wrote: > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: > > > > > > > > Ok, so I ran the above code on a single cpu using taskset, and set irq > > > affinity > > > such that no interrupts (save for local

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Ingo Molnar
* Doug Ledford wrote:
> > What I was objecting to strongly here was to measure the _wrong_
> > thing, i.e. the cache-hot case. The cache-cold case should be
> > measured in a low noise fashion, so that results are
> > representative. It's closer to the real usecase than any other
> >

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Ingo Molnar
* Doug Ledford dledf...@redhat.com wrote: What I was objecting to strongly here was to measure the _wrong_ thing, i.e. the cache-hot case. The cache-cold case should be measured in a low noise fashion, so that results are representative. It's closer to the real usecase than any other

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Ingo Molnar
* Neil Horman nhor...@tuxdriver.com wrote: On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: Ok, so I ran the above code on a single cpu using taskset, and set irq affinity such that no interrupts (save for

Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

2013-10-26 Thread Neil Horman
On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote: * Neil Horman nhor...@tuxdriver.com wrote: On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote: On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote: Ok, so I ran the above code on a single cpu using
