Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread George Spelvin
Just for everyone's information, here's the updated benchmark code on the same Phenom. The REP MOVSQ code is indeed much faster.

vendor_id  : AuthenticAMD
cpu family : 16
model      : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping   : 3
microcode  :
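For readers unfamiliar with the "REP MOVSQ" variant mentioned above, here is a minimal user-space sketch of a page copy built on that instruction. It is an illustration only, not the kernel's copy_page and not the benchmark code from this thread; the names copy_page_sketch and PAGE_BYTES are hypothetical, a 4096-byte page is assumed, and the inline assembly assumes GCC or Clang on x86-64 (other targets fall back to memcpy).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096 /* assumed page size, not taken from the thread */

/* Hypothetical helper: copy one page from src to dst. */
void copy_page_sketch(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* REP MOVSQ copies RCX quadwords from [RSI] to [RDI].
     * The direction flag is clear on function entry per the ABI,
     * so the copy runs forward. */
    size_t qwords = PAGE_BYTES / 8;
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
#else
    /* Portable fallback so the sketch still builds elsewhere. */
    memcpy(dst, src, PAGE_BYTES);
#endif
}
```

Timing a call like this against an unrolled MOV loop is the kind of comparison the benchmark in this thread reports; the cycle counts depend heavily on the CPU model.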

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Ma, Ling
2 6:58 PM
> To: Ma, Ling
> Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com;
> t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com;
> George Spelvin
> Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging
> instruction sequence and saving r

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote:
> Right, so benchmark shows around 20% speedup on Bulldozer but this is
> a microbenchmark and before pursue this further, we need to verify
> whether this brings any palpable speedup with a real benchmark, I
> don't know, kernbench,

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 08:04:11PM +0200, Borislav Petkov wrote: Right, so benchmark shows around 20% speedup on Bulldozer but this is a microbenchmark and before pursue this further, we need to verify whether this brings any palpable speedup with a real benchmark, I don't know, kernbench,

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-14 Thread Ma, Ling
To: Ma, Ling
Cc: Konrad Rzeszutek Wilk; mi...@elte.hu; h...@zytor.com; t...@linutronix.de; linux-kernel@vger.kernel.org; i...@google.com; George Spelvin
Subject: Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

On Fri, Oct 12, 2012 at 08:04

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 05:02:57PM -0400, George Spelvin wrote:
> Here are some Phenom results for that benchmark. The average time
> increases from 700 to 760 cycles (+8.6%).

I was afraid something like that would show up. Btw, in looking at this more and IINM, we use the REP MOVSQ version on

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread George Spelvin
Here are some Phenom results for that benchmark. The average time increases from 700 to 760 cycles (+8.6%).

vendor_id  : AuthenticAMD
cpu family : 16
model      : 2
model name : AMD Phenom(tm) 9850 Quad-Core Processor
stepping   : 3
microcode  : 0x183
cpu MHz

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 09:07:43AM +, Ma, Ling wrote:
> > > > So is that also true for AMD CPUs?
> > > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > > will be fed into execution unit and
> >

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Andi Kleen
On Fri, Oct 12, 2012 at 02:54:54PM +, Ma, Ling wrote:
> > If you can't test the CPUs who run this code I think it's safer if you
> > add a new variant for Atom, not change the existing well tested code.
> > Otherwise you risk performance regressions on these older CPUs.
>
> I found one older

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
> If you can't test the CPUs who run this code I think it's safer if you
> add a new variant for Atom, not change the existing well tested code.
> Otherwise you risk performance regressions on these older CPUs.

I found one older machine, and tested the code on it, the results between them are

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Andi Kleen
> I tested new and original version on core2, the patch improved performance
> about 9%,

That's not useful because core2 doesn't use this variant, it uses the rep string variant. Primary user is P4.

> Although core2 is out-of-order pipeline and weaken instruction sequence
> requirement,
>

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
> > > So is that also true for AMD CPUs?
> > Although Bulldozer put 32byte instruction into decoupled 16byte entry
> > buffers, it still decode 4 instructions per cycle, so 4 instructions
> > will be fed into execution unit and
> > 2 loads ,1 write will be issued per cycle.
>
> I'd be very

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 03:37:50AM +, Ma, Ling wrote:
> > > Load and write operation occupy about 35% and 10% respectively for
> > > most industry benchmarks. Fetched 16-aligned bytes code include about
> > > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > > Modern CPU support 2

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 03:37:50AM +, Ma, Ling wrote: Load and write operation occupy about 35% and 10% respectively for most industry benchmarks. Fetched 16-aligned bytes code include about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write. Modern CPU support 2 load and 1

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
So is that also true for AMD CPUs? Although Bulldozer put 32byte instruction into decoupled 16byte entry buffers, it still decode 4 instructions per cycle, so 4 instructions will be fed into execution unit and 2 loads ,1 write will be issued per cycle. I'd be very interested with

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Andi Kleen
I tested new and original version on core2, the patch improved performance about 9%, That's not useful because core2 doesn't use this variant, it uses the rep string variant. Primary user is P4. Although core2 is out-of-order pipeline and weaken instruction sequence requirement, because

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Ma, Ling
If you can't test the CPUs who run this code I think it's safer if you add a new variant for Atom, not change the existing well tested code. Otherwise you risk performance regressions on these older CPUs. I found one older machine, and tested the code on it, the results between them are

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Andi Kleen
On Fri, Oct 12, 2012 at 02:54:54PM +, Ma, Ling wrote: If you can't test the CPUs who run this code I think it's safer if you add a new variant for Atom, not change the existing well tested code. Otherwise you risk performance regressions on these older CPUs. I found one older machine,

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 09:07:43AM +, Ma, Ling wrote: So is that also true for AMD CPUs? Although Bulldozer put 32byte instruction into decoupled 16byte entry buffers, it still decode 4 instructions per cycle, so 4 instructions will be fed into execution unit and 2 loads ,1

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-12 Thread Borislav Petkov
On Fri, Oct 12, 2012 at 05:02:57PM -0400, George Spelvin wrote: Here are some Phenom results for that benchmark. The average time increases from 700 to 760 cycles (+8.6%). I was afraid something like that would show up. Btw, in looking at this more and IINM, we use the REP MOVSQ version on

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Ma, Ling
> > Load and write operation occupy about 35% and 10% respectively for
> > most industry benchmarks. Fetched 16-aligned bytes code include about
> > 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> > Modern CPU support 2 load and 1 write per cycle, so throughput from
> > write is

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Konrad Rzeszutek Wilk
On Thu, Oct 11, 2012 at 08:29:08PM +0800, ling...@intel.com wrote:
> From: Ma Ling
>
> Load and write operation occupy about 35% and 10% respectively
> for most industry benchmarks. Fetched 16-aligned bytes code include
> about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> Modern

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Andi Kleen
ling...@intel.com writes:
> From: Ma Ling
>
> Load and write operation occupy about 35% and 10% respectively
> for most industry benchmarks. Fetched 16-aligned bytes code include
> about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write.
> Modern CPU support 2 load and 1 write per

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Andi Kleen
ling...@intel.com writes: From: Ma Ling ling...@intel.com Load and write operation occupy about 35% and 10% respectively for most industry benchmarks. Fetched 16-aligned bytes code include about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write. Modern CPU support 2 load and 1

Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Konrad Rzeszutek Wilk
On Thu, Oct 11, 2012 at 08:29:08PM +0800, ling...@intel.com wrote: From: Ma Ling ling...@intel.com Load and write operation occupy about 35% and 10% respectively for most industry benchmarks. Fetched 16-aligned bytes code include about 4 instructions, implying 1.34(0.35 * 4) load, 0.4

RE: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-11 Thread Ma, Ling
Load and write operation occupy about 35% and 10% respectively for most industry benchmarks. Fetched 16-aligned bytes code include about 4 instructions, implying 1.34(0.35 * 4) load, 0.4 write. Modern CPU support 2 load and 1 write per cycle, so throughput from write is bottleneck for
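The per-fetch estimate in the patch description can be checked with a few lines of arithmetic. The sketch below simply multiplies the quoted instruction-mix fractions (35% loads, 10% writes) by the quoted 4 instructions per 16-byte fetch; the helper name ops_per_fetch is hypothetical. Note that 0.35 * 4 works out to 1.4 loads per fetch, slightly above the 1.34 figure quoted in the message, while 0.10 * 4 gives the quoted 0.4 writes.

```c
#include <assert.h>

/* Hypothetical helper: expected number of operations of one kind
 * (loads or writes) in a single instruction fetch, given the fraction
 * of instructions of that kind and the instructions per fetch. */
double ops_per_fetch(double fraction, int insns_per_fetch)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(insns_per_fetch > 0);
    return fraction * insns_per_fetch;
}
```

With the numbers from the message, ops_per_fetch(0.35, 4) gives the load estimate and ops_per_fetch(0.10, 4) the write estimate, to be compared against the stated capacity of 2 loads and 1 write per cycle.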
