> > I think you need 3 instructions, move a 0, conditionally move a 1
> > then add. I suspect it won't be a win!
Or, with an appropriately unrolled loop, for each word:
zero %eax, cmove a 1 to %al
cmove a 1 to %ah
shift %eax left, cmove a 1 to %al
cmove a 1 to %ah, add
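For illustration, a minimal sketch of that branchless increment (my own example, not code from the thread): it assumes the condition being tested is the carry flag, that the constant 1 is pre-loaded outside the loop, and it uses a 32-bit register rather than the %al/%ah packing described above, since cmov has no 8-bit register form.

  movl  $1, %ecx        # constant 1, loaded once outside the loop
  ...
  movl  $0, %edx        # "move a 0" - mov is used rather than xor so that
                        # EFLAGS (and the carry being tested) stays untouched
  cmovc %ecx, %edx      # "conditionally move a 1" if CF is set
  addq  %rdx, %rax      # "then add" into the running checksum

That is the three instructions per word being estimated here, versus a single adc when the flags can be consumed directly.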
On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > >
> > > > I think it would be better if we just did
On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> >
> > > I think it would be better if we just did the prefetch here
> > > and re-addressed this area when AVX (or
On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
>
> > I think it would be better if we just did the prefetch here
> > and re-addressed this area when AVX (or adcx/adox) instructions were
> > available
> > for testing on
On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or adcx/adox) instructions were
> available
> for testing on hardware.
Could there be a difference if only a single software
prefetch was done
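For reference, the software prefetch under discussion is roughly of this shape (an illustrative sketch only; the 5*64 stride mirrors the "Prefetch (5x64)" variant reported later in the thread, and the register choice is assumed):

  prefetcht0 5*64(%rsi)        # hint: start pulling in the cache line about
                               # five lines ahead of the current read position
  adcq       0*8(%rsi), %rax   # checksumming continues on data that is
                               # (hopefully) already resident by now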
On Fri, Nov 01, 2013 at 04:18:50PM -, David Laight wrote:
> > How would you suggest replacing the jumps in this case? I agree it would be
> > faster here, but I'm not sure how I would implement an increment using a
> > single
> > conditional move.
>
> I think you need 3 instructions, move a
On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 03:42:46PM +, Ben Hutchings wrote:
> > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> > [...]
> > > It
> > > functions, but unfortunately the performance lost to the completely broken
> > > branch
On Fri, Nov 01, 2013 at 03:42:46PM +, Ben Hutchings wrote:
> On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> [...]
> > It
> > functions, but unfortunately the performance lost to the completely broken
> > branch prediction that this inflicts makes it a non-starter:
> [...]
>
>
On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
[...]
> It
> functions, but unfortunately the performance lost to the completely broken
> branch prediction that this inflicts makes it a non-starter:
[...]
Conditional branches are no good but conditional moves might be worth a shot.
Ben.
On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > >
> > > * Neil Horman wrote:
> > >
> > > > > etc. For such short runtimes make sure the last column displays
> > > > > close to 100%,
* Neil Horman wrote:
> Prefetch and simulated adcx/adox from above:
> Performance counter stats for './test.sh' (20 runs):
>
> 35,704,331 L1-dcache-load-misses ( +- 0.07% ) [75.00%]
> 0 L1-dcache-prefetches
* Neil Horman wrote:
> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > > etc. For such short runtimes make sure the last column displays
> > > > close to 100%, so that the PMU results become trustable.
> > > >
> > > > A nehalem+ PMU will
On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
>
> >That does make sense, but it then begs the question, what's the advantage of
> >having multiple ALUs at all?
>
> There's lots of ALU operations that don't operate on the flags or
>
On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > > etc. For such short runtimes make sure the last column displays
> > > close to 100%, so that the PMU results become trustable.
> > >
> > > A nehalem+ PMU will allow 2-4 events to be measured in
* Neil Horman wrote:
> > etc. For such short runtimes make sure the last column displays
> > close to 100%, so that the PMU results become trustable.
> >
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> > plus generics like 'cycles', 'instructions' can be added 'for
...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallel nicely. David, you mentioned you've tried this, how did
> your experiment turn out and what was your method? I was planning
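The xmm variant being floated would look roughly like this (a sketch of the idea only; SSE4.1 pmovzxdq is assumed for widening the 32-bit words, and the register choices are made up):

  pmovzxdq 0*8(%rsi), %xmm1    # widen two 32-bit words into two 64-bit lanes
  pmovzxdq 1*8(%rsi), %xmm2
  paddq    %xmm1, %xmm0        # 64-bit adds of 32-bit inputs; the independent
  paddq    %xmm2, %xmm3        # accumulators cannot overflow a 64-bit lane
  ...                          # at the end, add the lanes together and fold
                               # the 64-bit total down to 16 bits as usual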
On 10/30/2013 07:02 AM, Neil Horman wrote:
That does make sense, but it then begs the question, what's the advantage of
having multiple ALUs at all?
There's lots of ALU operations that don't operate on the flags or other
entities that can be run in parallel.
If they're just going to
On 10/30/2013 08:18 AM, David Laight wrote:
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
Those have to be sequenced.
Using a 64bit lea to add 32bit quantities should avoid the
dependencies on the flags register.
However you'd need to get 3 of those active to
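A rough sketch of that lea approach (illustrative only; it assumes the 32-bit words are loaded zero-extended into separate 64-bit accumulators, with the register names made up here). Because lea neither reads nor writes EFLAGS, the chains carry no flag dependency between them:

  movl 0*4(%rsi), %eax       # 32-bit load, zero-extended into %rax
  movl 1*4(%rsi), %ecx       # independent 32-bit load
  leaq (%r8,%rax), %r8       # res1 += word0, no flags read or written
  leaq (%r9,%rcx), %r9       # res2 += word1, free to issue in parallel
  ...
                             # afterwards each accumulator is folded:
                             # low 32 bits + high 32 bits, plus the final carry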
On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote:
> * Neil Horman wrote:
> > 3) The run times are proportionally larger, but still indicate that
> > Parallel ALU
> > execution is hurting rather than helping, which is counter-intuitive. I'm
> > looking into it, but thought you might
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU). So, either
> some CPU in the past had a set of overflow/carry flags per ALU
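The flags hazard being described, in code form (illustrative registers): even with separate accumulator registers, interleaved adc chains still share the single carry flag, so they serialize.

  adcq 0*8(%rsi), %r8    # reads CF, then writes CF/OF
  adcq 1*8(%rsi), %r9    # must wait for the CF written just above, even
                         # though %r8 and %r9 are independent registers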
* Neil Horman wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel
> ALU
> execution is hurting rather than helping, which is counter-intuitive. I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you
So
On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > So, I apologize, you were right. I was running the test.sh script
> > but perf was measuring itself. [...]
>
> Ok, cool - one mystery less!
>
> > Which overall looks a lot more like I expect, save for
* Neil Horman wrote:
> So, I apologize, you were right. I was running the test.sh script
> but perf was measuring itself. [...]
Ok, cool - one mystery less!
> Which overall looks a lot more like I expect, save for the parallel
> ALU cases. It seems here that the parallel ALU changes
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > I'm sure it worked properly on my system here, I specifically
> > checked it, but I'll gladly run it again. You have to give me an
> > hour as I have a meeting to run to, but I'll have results
On 10/29/13 6:52 AM, Ingo Molnar wrote:
According to the perf man page, I'm supposed to be able to use --
to separate perf command line parameters from the command I want
to run. And it definitely executed test.sh, I added an echo to
stdout in there as a test run and observed them get captured
On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > I'm sure it worked properly on my system here, I specificially
> > checked it, but I'll gladly run it again. You have to give me an
> > hour as I have a meeting to run to, but I'll have results
* Neil Horman wrote:
> I'm sure it worked properly on my system here, I specifically
> checked it, but I'll gladly run it again. You have to give me an
> hour as I have a meeting to run to, but I'll have results shortly.
So what I tried to react to was this observation of yours:
> > >
On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > >
> > > * Neil Horman wrote:
> > >
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i >
* Neil Horman wrote:
> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0
On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging --
> >
* Neil Horman wrote:
> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging --
> /root/test.sh
> done >> counters.txt 2>&1
>
> where test.sh is:
> #!/bin/sh
> echo
On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Here's my data for running the same test with taskset restricting
> > execution to only cpu0. I'm not quite sure what's going on here,
> > but doing so resulted in a 10x slowdown of the runtime of each
* Doug Ledford wrote:
> [ Snipped a couple of really nice real-life bandwidth tests. ]
> Some of my preliminary results:
>
> 1) Regarding the initial claim that changing the code to have two
> addition chains, allowing the use of two ALUs, doubling
> performance: I'm just not seeing it. I
* Neil Horman wrote:
> Here's my data for running the same test with taskset restricting
> execution to only cpu0. I'm not quite sure what's going on here,
> but doing so resulted in a 10x slowdown of the runtime of each
> iteration which I can't explain. As before however, both the
>
On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > Looking at the specific cpu counters we get this:
> > >
> > > Base:
> > > Total time: 0.179 [sec]
> > >
> > > Performance
On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Base:
> >0.093269042 seconds time elapsed
> > ( +- 2.24% )
> > Prefetch (5x64):
> >0.079440009 seconds time elapsed
On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Looking at the specific cpu counters we get this:
> >
> > Base:
> > Total time: 0.179 [sec]
> >
> > Performance counter stats for 'perf bench sched messaging -- bash -c echo
> > 1 >
On 10/26/2013 07:55 AM, Ingo Molnar wrote:
>
> * Doug Ledford wrote:
>
>>> What I was objecting to strongly here was to measure the _wrong_
>>> thing, i.e. the cache-hot case. The cache-cold case should be
>>> measured in a low noise fashion, so that results are
>>> representative. It's closer to
On 10/28/13 10:24 AM, Ingo Molnar wrote:
The most accurate method of measurement for such single-threaded
workloads is something like:
taskset 0x1 perf stat -a -C 1 --repeat 20 ...
this will bind your workload to CPU#0, and will do PMU measurements
only there - without mixing in other
* Neil Horman wrote:
> Looking at the specific cpu counters we get this:
>
> Base:
> Total time: 0.179 [sec]
>
> Performance counter stats for 'perf bench sched messaging -- bash -c echo 1
> > /sys/module/csum_test/parameters/test_fire' (20 runs):
>
>1571.304618 task-clock
* Neil Horman wrote:
> Base:
>0.093269042 seconds time elapsed
>( +- 2.24% )
> Prefetch (5x64):
>0.079440009 seconds time elapsed
>( +- 2.29% )
> Parallel ALU:
>0.08777 seconds
Ingo, et al.-
Ok, sorry for the delay, here are the test results you've been asking
for.
First, some information about what I did. I attached the module that I ran this
test with at the bottom of this email. You'll note that I started using a
module parameter write path to trigger
On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > > You keep ignoring my request to calculate and account for noise of the
> > > measurement.
> >
> > Don't confuse "ignoring" with "haven't gotten there yet". [...]
>
> So, instead of replying to my
* Neil Horman wrote:
> > You keep ignoring my request to calculate and account for noise of the
> > measurement.
>
> Don't confuse "ignoring" with "haven't gotten there yet". [...]
So, instead of replying to my repeated feedback with a single line mail
that you plan to address it, you
On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > >
> > > >
> > > > Ok, so I ran the above code on a single cpu using
* Neil Horman wrote:
> On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> >
> > >
> > > Ok, so I ran the above code on a single cpu using taskset, and set irq
> > > affinity
> > > such that no interrupts (save for local
* Doug Ledford wrote:
> > What I was objecting to strongly here was to measure the _wrong_
> > thing, i.e. the cache-hot case. The cache-cold case should be
> > measured in a low noise fashion, so that results are
> > representative. It's closer to the real usecase than any other
> >