Just for everyone's information, here's the updated benchmark code on
the same Phenom. The REP MOVSQ code is indeed much faster.
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping : 3
microcode :
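
For readers who have not seen the string-move variant mentioned above, here is a minimal user-space sketch of a page copy built on REP MOVSQ. It is not the kernel's copy_page and not the benchmark code from this thread; the names copy_page_sketch and copy_page_selfcheck and the 4096-byte PAGE_SIZE are assumptions made for illustration, and the inline assembly assumes GCC or Clang on x86-64 (other targets fall back to memcpy).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical helper: copy one page using REP MOVSQ where available. */
static void copy_page_sketch(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    size_t qwords = PAGE_SIZE / 8; /* 512 eight-byte moves */
    /* RDI = destination, RSI = source, RCX = count. The instruction
     * updates all three registers, hence the "+" constraints. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
#else
    memcpy(dst, src, PAGE_SIZE); /* portable fallback */
#endif
}

/* Hypothetical self-check: returns 1 if a patterned page copies intact. */
static int copy_page_selfcheck(void)
{
    static unsigned char src[PAGE_SIZE], dst[PAGE_SIZE];
    for (size_t i = 0; i < PAGE_SIZE; i++)
        src[i] = (unsigned char)(i * 31u + 7u);
    copy_page_sketch(dst, src);
    return memcmp(dst, src, PAGE_SIZE) == 0;
}
```

The whole copy is a single instruction with a count in RCX, so how fast it runs is left to the CPU's microcode rather than to the instruction scheduling of an unrolled loop.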
On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
> Right, so the benchmark shows around a 20% speedup on Bulldozer, but this is
> a microbenchmark, and before we pursue this further we need to verify
> whether this brings any palpable speedup with a real benchmark, I
> don't know, kernbench,
On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
Right, so the benchmark shows around a 20% speedup on Bulldozer, but this is
a microbenchmark, and before we pursue this further we need to verify
whether this brings any palpable speedup with a real benchmark, I
don't know, kernbench,
To: Ma, Ling
Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com;
t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com;
George Spelvin
Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging
instruction sequence and saving register
On Fri, Oct 12, 2012 at 08:04
On Fri, Oct 12, 2012 at 05:02:57PM -0400, George Spelvin wrote:
> Here are some Phenom results for that benchmark. The average time
> increases from 700 to 760 cycles (+8.6%).
I was afraid something like that would show up.
Btw, in looking at this more and IINM, we use the REP MOVSQ version on
Here are some Phenom results for that benchmark. The average time
increases from 700 to 760 cycles (+8.6%).
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping : 3
microcode : 0x183
cpu MHz
On Fri, Oct 12, 2012 at 09:07:43AM +, Ma, Ling wrote:
> > > > So is that also true for AMD CPUs?
> > > Although Bulldozer puts 32 bytes of instructions into decoupled 16-byte entry
> > > buffers, it still decodes 4 instructions per cycle, so 4 instructions
> > > will be fed into the execution units and
> >
On Fri, Oct 12, 2012 at 02:54:54PM +, Ma, Ling wrote:
> > If you can't test the CPUs that run this code, I think it's safer if you
> > add a new variant for Atom, not change the existing well tested code.
> > Otherwise you risk performance regressions on these older CPUs.
>
> I found one older
> If you can't test the CPUs that run this code, I think it's safer if you
> add a new variant for Atom, not change the existing well-tested code.
> Otherwise you risk performance regressions on these older CPUs.
I found one older machine and tested the code on it; the results between them
are
> I tested the new and original versions on core2; the patch improved performance
> by about 9%,
That's not useful because core2 doesn't use this variant, it uses the
rep string variant. Primary user is P4.
> Although core2 has an out-of-order pipeline, which weakens the instruction-sequence
> requirement,
>
> > > So is that also true for AMD CPUs?
> > Although Bulldozer puts 32 bytes of instructions into decoupled 16-byte entry
> > buffers, it still decodes 4 instructions per cycle, so 4 instructions
> > will be fed into the execution units and
> > 2 loads, 1 write will be issued per cycle.
>
> I'd be very
On Fri, Oct 12, 2012 at 03:37:50AM +, Ma, Ling wrote:
> > > Load and write operations account for about 35% and 10% respectively in
> > > most industry benchmarks. A fetched 16-byte-aligned block of code contains about
> > > 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
> > > Modern CPUs support 2
On Fri, Oct 12, 2012 at 03:37:50AM +, Ma, Ling wrote:
Load and write operations account for about 35% and 10% respectively in
most industry benchmarks. A fetched 16-byte-aligned block of code contains about
4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
Modern CPUs support 2 loads and 1
So is that also true for AMD CPUs?
Although Bulldozer puts 32 bytes of instructions into decoupled 16-byte entry
buffers, it still decodes 4 instructions per cycle, so 4 instructions
will be fed into the execution units and
2 loads, 1 write will be issued per cycle.
I'd be very interested with
I tested the new and original versions on core2; the patch improved performance
by about 9%,
That's not useful because core2 doesn't use this variant, it uses the
rep string variant. Primary user is P4.
Although core2 has an out-of-order pipeline, which weakens the instruction-sequence
requirement,
because
If you can't test the CPUs that run this code, I think it's safer if you
add a new variant for Atom, not change the existing well-tested code.
Otherwise you risk performance regressions on these older CPUs.
I found one older machine and tested the code on it; the results between them
are
On Fri, Oct 12, 2012 at 02:54:54PM +, Ma, Ling wrote:
If you can't test the CPUs that run this code, I think it's safer if you
add a new variant for Atom, not change the existing well-tested code.
Otherwise you risk performance regressions on these older CPUs.
I found one older machine,
On Fri, Oct 12, 2012 at 09:07:43AM +, Ma, Ling wrote:
So is that also true for AMD CPUs?
Although Bulldozer puts 32 bytes of instructions into decoupled 16-byte entry
buffers, it still decodes 4 instructions per cycle, so 4 instructions
will be fed into the execution units and
2 loads, 1
On Fri, Oct 12, 2012 at 05:02:57PM -0400, George Spelvin wrote:
Here are some Phenom results for that benchmark. The average time
increases from 700 to 760 cycles (+8.6%).
I was afraid something like that would show up.
Btw, in looking at this more and IINM, we use the REP MOVSQ version on
> > Load and write operations account for about 35% and 10% respectively in
> > most industry benchmarks. A fetched 16-byte-aligned block of code contains about
> > 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
> > Modern CPUs support 2 loads and 1 write per cycle, so throughput from
> > write is
On Thu, Oct 11, 2012 at 08:29:08PM +0800, ling...@intel.com wrote:
> From: Ma Ling
>
> Load and write operations account for about 35% and 10% respectively
> in most industry benchmarks. A fetched 16-byte-aligned block of code contains
> about 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
> Modern
ling...@intel.com writes:
> From: Ma Ling
>
> Load and write operations account for about 35% and 10% respectively
> in most industry benchmarks. A fetched 16-byte-aligned block of code contains
> about 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
> Modern CPUs support 2 loads and 1 write per
ling...@intel.com writes:
From: Ma Ling ling...@intel.com
Load and write operations account for about 35% and 10% respectively
in most industry benchmarks. A fetched 16-byte-aligned block of code contains
about 4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
Modern CPUs support 2 loads and 1
On Thu, Oct 11, 2012 at 08:29:08PM +0800, ling...@intel.com wrote:
From: Ma Ling ling...@intel.com
Load and write operations account for about 35% and 10% respectively
in most industry benchmarks. A fetched 16-byte-aligned block of code contains
about 4 instructions, implying 1.4 (0.35 * 4) loads, 0.4
Load and write operations account for about 35% and 10% respectively in
most industry benchmarks. A fetched 16-byte-aligned block of code contains about
4 instructions, implying 1.4 (0.35 * 4) loads and 0.4 writes.
Modern CPUs support 2 loads and 1 write per cycle, so throughput from
write is the bottleneck for
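
The per-fetch estimate in the patch description is just the instruction count per fetch block multiplied by each operation's share of the instruction mix. A small sketch of that arithmetic follows; the function name ops_per_fetch is made up for illustration, and the inputs are the 35%, 10% and 4-instruction figures quoted above. With those inputs, 0.35 * 4 works out to 1.4 loads and 0.10 * 4 to 0.4 writes per fetch block.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: expected number of operations of one kind per
 * fetch block, given that kind's share of the instruction mix and the
 * number of instructions in the block. */
static double ops_per_fetch(double fraction, int insns_per_fetch)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(insns_per_fetch > 0);
    return fraction * insns_per_fetch;
}

/* Prints the estimate for the figures quoted in the patch description. */
static void print_estimate(void)
{
    const int insns = 4; /* instructions per 16-byte fetch, as quoted */
    printf("loads per fetch:  %.2f\n", ops_per_fetch(0.35, insns));
    printf("writes per fetch: %.2f\n", ops_per_fetch(0.10, insns));
}
```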