[RFC PATCH] aliworkqueue: Adaptive lock integration on multi-core platform

2016-04-14 Thread ling . ma . program
From: Ma Ling  Wire latency (RC delay) dominates modern computer performance; conventional serialized work causes serious cache-line ping-pong, and the process spends lots of time and power to complete, especially on multi-core platforms. However, if the serialized work is sent to one core and…
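
The core idea above reads like a flat-combining queue: threads publish their critical-section work and a single "server" core executes the whole batch, so the protected data stays hot in one cache. A minimal user-space sketch with C11 atomics (the ali_wq_run name and node layout are illustrative guesses, not the patch's actual API):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ali_work {
    void (*fn)(void *arg);             /* serialized critical-section body */
    void *arg;
    _Atomic(struct ali_work *) next;
    atomic_bool done;
};

struct ali_workqueue {
    _Atomic(struct ali_work *) tail;   /* last published request, or NULL */
};

/*
 * Publish a request. The thread that finds the queue empty becomes the
 * server and executes every queued request in order; the other threads
 * spin on their own node, so the shared data never ping-pongs.
 */
static void ali_wq_run(struct ali_workqueue *wq, struct ali_work *w)
{
    struct ali_work *prev, *next, *expected;

    atomic_store(&w->next, NULL);
    atomic_store(&w->done, false);
    prev = atomic_exchange(&wq->tail, w);
    if (prev) {                            /* a server is already running */
        atomic_store(&prev->next, w);      /* hand the work over ...      */
        while (!atomic_load(&w->done))
            ;                              /* ... and wait on our own node */
        return;
    }
    for (;;) {                             /* we are the server           */
        w->fn(w->arg);                     /* run the delegated work      */
        next = atomic_load(&w->next);
        if (!next) {
            expected = w;
            if (atomic_compare_exchange_strong(&wq->tail, &expected, NULL)) {
                atomic_store(&w->done, true);
                return;                    /* queue drained               */
            }
            while (!(next = atomic_load(&w->next)))
                ;                          /* successor is linking itself */
        }
        atomic_store(&w->done, true);      /* wake the submitting thread  */
        w = next;
    }
}

Each waiter spins on its own done flag, so the only cross-core traffic per request is one enqueue exchange and one wakeup store.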

Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-04-11 Thread Ling Ma
Is the performance improvement acceptable, or are there more comments on this patch? Thanks Ling 2016-04-05 11:44 GMT+08:00 Ling Ma : > Hi Longman, > >> with some modest increase in performance. That can be hard to justify. Maybe >> you should find other use cases that involve less…

Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-04-04 Thread Ling Ma
Hi Longman, > with some modest increase in performance. That can be hard to justify. Maybe > you should find other use cases that involve fewer changes, but still have > noticeable performance improvement. That will make it easier to be accepted. The attachment is for another use case with the new…

Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-03 Thread Ling Ma
…413 43584658 38097842 43235392 (tail of the per-run ORG/NEW results table) TOTAL: ORG 874675292, NEW 1005486126. So the data tell us the new mechanism can improve performance by about 15% (1005486126/874675292 ≈ 1.15), and the operation can be justified fairly. Thanks Ling 2016-02-04 5:42 GMT+08:00 Waiman Long : > On 02/02/2016

Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-03 Thread Ling Ma
…> On 02/02/2016 11:40 PM, Ling Ma wrote: >> >> Longman, >> >> The attachment includes user space code (thread.c) and a kernel >> patch (ali_work_queue.patch) based on 4.3.0-rc4; >> we replaced all original spinlock (list_lock) uses in slab.h/c with the >> n…

Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-02 Thread Ling Ma
(tail of the per-run results table, two interleaved columns) …44229887 38108158 43142689 37771900 43228168 37652536 43901042 37649114 43172690 37591314 43380004 38539678 43435592. Total: 1026910602 vs 1174810406 (ORG vs NEW, as labeled in the follow-up mail; NEW/ORG ≈ 1.14). Thanks Ling 2016-02-03 12:40 GMT+08:00 Ling Ma : > Longman, > > The attachment includes user space code (thread.c) and a kernel > patch (ali_work…

Re: [RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2016-02-02 Thread Ling Ma
version according to your comments. Thanks Ling 2016-01-19 23:36 GMT+08:00 Waiman Long : > On 01/19/2016 03:52 AM, Ling Ma wrote: >> >> Is it acceptable for performance improvement or more comments on this >> patch? >> >> Thanks >> Ling >> >> >

[RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2015-12-31 Thread ling . ma . program
From: Ma Ling  Hi ALL, Wire latency (RC delay) dominates modern computer performance; conventional serialized work causes serious cache-line ping-pong, and the process spends lots of time and power to complete, especially on multi-core platforms. However, if the serialized work is sent to one core…

Re: Improve spinlock performance by moving work to one core

2015-12-06 Thread Ling Ma
…Long : > On 11/30/2015 01:17 AM, Ling Ma wrote: >> >> Any comments? Is the patch acceptable? >> >> Thanks >> Ling >> >> > Ling, > > The core idea of your current patch hasn't changed from your previous > patch. > > My comment is…

Re: Improve spinlock performance by moving work to one core

2015-11-29 Thread Ling Ma
Any comments? Is the patch acceptable? Thanks Ling 2015-11-26 17:00 GMT+08:00 Ling Ma : > Run thread.c with a clean kernel 4.3.0-rc4; perf top -G also indicates > the cache_flusharray and cache_alloc_refill functions spend 25.6% of their time > in queued_spin_lock_slowpath in total. It means the comp…

Re: Improve spinlock performance by moving work to one core

2015-11-26 Thread Ling Ma
Run thread.c with a clean kernel 4.3.0-rc4; perf top -G also indicates the cache_flusharray and cache_alloc_refill functions spend 25.6% of their time in queued_spin_lock_slowpath in total. It means the comparison data from our spinlock-test.patch is reliable. Thanks Ling 2015-11-26 11:49 GMT+08:00 Ling Ma

Re: Improve spinlock performance by moving work to one core

2015-11-25 Thread Ling Ma
…e only modify them; other lock scenarios will continue to use the lock in qspinlock.h. We must modify the code, otherwise the operation will be hooked in the queue and never be woken up. Thanks Ling 2015-11-26 3:05 GMT+08:00 Waiman Long : > On 11/23/2015 04:41 AM, Ling Ma wrote: >> Hi

Re: Improve spinlock performance by moving work to one core

2015-11-24 Thread Ling Ma
Any comments about it? Thanks Ling 2015-11-23 17:41 GMT+08:00 Ling Ma : > Hi Longman, > > Attachments include the user space application thread.c and the kernel patch > spinlock-test.patch based on kernel 4.3.0-rc4 > > we run thread.c with the kernel patch, testing the original and new spinlo…

Re: Improve spinlock performance by moving work to one core

2015-11-23 Thread Ling Ma
…+08:00 Waiman Long : > > On 11/05/2015 11:28 PM, Ling Ma wrote: >> >> Longman >> >> Thanks for your suggestion. >> We will look for a real scenario to test; could you please introduce >> some benchmarks on spinlocks? >> >> Regards >> Ling

Re: Improve spinlock performance by moving work to one core

2015-11-05 Thread Ling Ma
Longman, thanks for your suggestion. We will look for a real scenario to test; could you please introduce some benchmarks on spinlocks? Regards Ling > > Your new spinlock code completely changes the API and the semantics of the > existing spinlock calls. That requires changes to thousands of…

Re: Improve spinlock performance by moving work to one core

2015-11-04 Thread Ling Ma
Hi All, (sent again for linux-kernel@vger.kernel.org) Spinlocks cause cache-line ping-pong between cores, and we have to spend lots of time on serialized execution. However, if we hand the serialized work to one core, it will help us save much time. In the attachment we changed code based on…

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-20 Thread Ling Ma
> > I did see some performance improvement when I used your test program on a > Haswell-EX system. It seems like the use of cmpxchg has forced the changed > memory values to be visible to other processors earlier. I also ran your > test on an older machine with Westmere-EX processors. This time, I

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-20 Thread Ling Ma
2015-10-20 17:16 GMT+08:00 Peter Zijlstra : > On Tue, Oct 20, 2015 at 11:24:02AM +0800, Ling Ma wrote: >> 2015-10-19 17:46 GMT+08:00 Peter Zijlstra : >> > On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote: >> >> From: Ma Ling >> >…

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-20 Thread Ling Ma
Ok, we will put the spinlock test into the perf bench. Thanks Ling 2015-10-20 16:48 GMT+08:00 Ingo Molnar : > > * Ling Ma wrote: > >> > So it would be nice to create a new user-space spinlock testing facility, >> > via >> > a new 'perf bench spinlock' fea

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-19 Thread Ling Ma
2015-10-19 17:46 GMT+08:00 Peter Zijlstra : > On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote: >> From: Ma Ling >> >> All load instructions can run speculatively but they have to follow >> memory order rule in multiple cores as below: >> _x = _y = 0 >> >> Processor 0

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-19 Thread Ling Ma
2015-10-20 1:18 GMT+08:00 Waiman Long : > On 10/18/2015 10:27 PM, ling.ma.prog...@gmail.com wrote: >> >> From: Ma Ling >> >> All load instructions can run speculatively but they have to follow >> memory order rule in multiple cores as below: >> _x = _y = 0 >> >> Processor 0

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-19 Thread Ling Ma
2015-10-19 17:33 GMT+08:00 Peter Zijlstra : > On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote: >> From: Ma Ling >> >> All load instructions can run speculatively but they have to follow >> memory order rule in multiple cores as below: >> _x = _y = 0 >> >> Processor 0

Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-19 Thread Ling Ma
> > So it would be nice to create a new user-space spinlock testing facility, via > a > new 'perf bench spinlock' feature or so. That way others can test and validate > your results on different hardware as well. > Attached the spinlock test module. Queued spinlock will run very slowly in user…

[RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-18 Thread ling . ma . program
From: Ma Ling  All load instructions can run speculatively, but they have to follow the memory-order rule across multiple cores, as below:

_x = _y = 0

Processor 0           Processor 1
mov r1, [_y]  //M1    mov [_x], 1  //M3
mov r2, [_x]  //M2    mov …
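
Rendered as a runnable user-space litmus test with C11 atomics (hypothetical harness; x and y stand for the _x and _y above). With sequentially consistent operations the outcome r1 == 1 && r2 == 0 is forbidden, which is exactly the ordering rule the patch leans on:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;            /* the _x and _y of the commit message */
static int r1, r2;

static void *cpu0(void *arg)
{
    r1 = atomic_load(&y);          /* M1 */
    r2 = atomic_load(&x);          /* M2 */
    return NULL;
}

static void *cpu1(void *arg)
{
    atomic_store(&x, 1);           /* M3 */
    atomic_store(&y, 1);           /* M4 */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < 100000; i++) {
        pthread_t t0, t1;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r1 == 1 && r2 == 0)    /* M1 saw M4 but M2 missed M3 */
            printf("forbidden outcome at iteration %d\n", i);
    }
    return 0;
}

On x86 the hardware keeps this promise by running loads speculatively and rolling them back on conflict, which is the rollback cost the patch title refers to.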

Re: [PATCH RFC] x86:Improve memset with general 64bit instruction

2014-04-14 Thread Ling Ma
Kernel version 3.14 shows memcpy and memset occur 19622 and 14189 times respectively, so memset is still important for us, correct? Thanks Ling 2014-04-14 6:03 GMT+08:00, Andi Kleen : > On Sun, Apr 13, 2014 at 11:11:59PM +0800, Ling Ma wrote: >> Any further comments? > > I…

Re: [PATCH RFC] x86:Improve memset with general 64bit instruction

2014-04-13 Thread Ling Ma
Any further comments? Thanks Ling 2014-04-08 22:00 GMT+08:00, Ling Ma ling.ma.prog...@gmail.com: Andi, the below is the compared result on an older machine (CPU info is attached); it shows the new code gets better performance, up to 1.6x. Bytes: ORG_TIME: NEW_TIME: ORG vs NEW: 7 0.87 0.76…

Re: [PATCH RFC] x86:Improve memset with general 64bit instruction

2014-04-08 Thread Ling Ma
(tail of the results table; columns are Bytes, ORG_TIME, NEW_TIME, ORG vs NEW) …1.61
356   1.61  1.02  1.57
601   1.78  1.22  1.45
958   2.04  1.47  1.38
1024  2.07  1.48  1.39
2048  2.80  2.21  1.26
Thanks Ling 2014-04-08 0:42 GMT+08:00, Andi Kleen : > ling.ma.prog...@gmail.com writes: > >> From: Ling Ma >> >>

Re: [PATCH RFC] x86:Improve memset with general 64bit instruction

2014-04-07 Thread Ling Ma
Append the test suite after tar and run the ./test command, please. Thanks 2014-04-07 22:50 GMT+08:00, ling.ma.prog...@gmail.com : > From: Ling Ma > > In this patch we manage to reduce branch misprediction by > avoiding branch instructions and forcing the destination to be aligned > wit…

[PATCH RFC] x86:Improve memset with general 64bit instruction

2014-04-07 Thread ling . ma . program
From: Ling Ma  In this patch we manage to reduce branch misprediction by avoiding branch instructions and forcing the destination to be aligned using general 64-bit instructions. The compared results below show we improve performance by up to 1.8x (we modified the test suite from Ondra; sent after…
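
The branch-avoidance trick is easiest to see for the small-size tail: instead of a byte loop whose trip count (and branch pattern) changes with every size, two possibly overlapping 8-byte stores cover a whole size range. A hedged C sketch of that one idea (not the patch's assembly; the 8..16 byte range is just for illustration):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Set n bytes, 8 <= n <= 16, with no data-dependent branches. */
static void memset64_small(void *dst, int c, size_t n)
{
    uint64_t v = 0x0101010101010101ULL * (uint8_t)c;  /* replicate byte c */

    memcpy(dst, &v, 8);                  /* head: bytes [0, 8) */
    /* tail: bytes [n-8, n); may overlap the head, which is harmless */
    memcpy((char *)dst + n - 8, &v, 8);
}

Varying n only moves the second store's offset; the instruction sequence, and hence the branch predictor's job, stays identical.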

Re: [PATCH V2] [x86]: Compiler Option Os is better on latest x86

2013-01-27 Thread Ling Ma
Hi Ingo, thanks for your correction. Since most 32-bit CPUs are low-end CPUs (smaller caches) and should emphasize i-cache misses more, I chose -Os for them. I will test it and send out the results ASAP. Regards Ling 2013/1/27, Ingo Molnar : > > * ling.ma.prog...@gmail.com

[PATCH V2] [x86]: Compiler Option Os is better on latest x86

2013-01-26 Thread ling . ma . program
From: Ma Ling  Currently we use -O2 as the compiler option for better performance, although it enlarges code size. In modern CPUs, larger instruction and unified caches plus sophisticated instruction prefetch weaken the instruction-cache-miss penalty, while flags such as -falign-functions, -falign-jumps,…
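
For anyone reproducing the comparison: the kernel already exposes this switch as an upstream config option, so no Makefile surgery is needed.

# .config fragment: build with -Os instead of the default -O2
CONFIG_CC_OPTIMIZE_FOR_SIZE=y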

[PATCH] [x86]: Compiler Option Os is better on latest x86

2013-01-25 Thread ling . ma . program
From: Ma Ling  Currently we use -O2 as the compiler option for better performance, although it enlarges code size. In modern CPUs, larger instruction and unified caches plus sophisticated instruction prefetch weaken the instruction-cache-miss penalty, while flags such as -falign-functions, -falign-jumps,…

Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86

2012-12-30 Thread Ling Ma
Hi Ingo, with netperf we double-checked on an older Nehalem platform too, as below. O2, NHM:
Performance counter stats for 'netperf' (3 runs):
3779.262214 task-clock        # 0.378 CPUs utilized  ( +- 0.37% )
     47,580 context-switches  # 0.013 M/sec

[Suggestion] [x86]: Compiler Option Os is better on latest x86

2012-12-25 Thread ling . ma . program
From: Ma Ling  Currently we use -O2 as the compiler option for better performance, although it enlarges code size. In modern CPUs, larger instruction and unified caches plus sophisticated instruction prefetch weaken the instruction-cache-miss penalty, while flags such as -falign-functions, -falign-jumps,…

Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line

2012-12-02 Thread Ling Ma
Hi Eric, attached is the benchmark test-cwf.c (cc -o test-cwf test-cwf.c). The result shows that when a last-level-cache (LLC) miss occurs and the CPU fetches data from memory, the critical word as the first 64-bit member of the cache line performs better (costs 158290336 cycles) than other positions (offset 0x10 costs 164100732…

Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line

2012-11-27 Thread Ling Ma
> networking patches should be sent to netdev. > > (I understand this patch is more a generic one, but at least CC netdev) Ling: OK, this is my first inet patch; I will send it to netdev later. > You give no performance numbers for this change... Ling: after I get a machine, I will send out test…

[PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line

2012-11-25 Thread ling . ma . program
From: Ma Ling  In order to reduce memory latency when a last-level-cache miss occurs, modern CPUs (e.g., x86 and ARM) introduced Critical Word First (CWF) or Early Restart (ER) to get data ASAP. With CWF, if the critical word is the first member in the cache line, memory feeds the CPU the critical word first, then fills the others…
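
In struct-layout terms, the suggestion is to pin the field that lookup code touches first at byte 0 of a cache-line-aligned structure, so CWF delivers it in the first beat. A small hedged sketch (hypothetical struct, GCC attributes):

#include <stddef.h>

struct hot_entry {
    unsigned long key;            /* examined first on a miss: offset 0 */
    unsigned long cold[7];        /* colder members fill the same line  */
} __attribute__((aligned(64)));

/* The critical word really is first in the line. */
_Static_assert(offsetof(struct hot_entry, key) == 0, "critical word first");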

[PATCH RFC V2] [x86] Optimize small size memcpy by avoding long latency from decode stage

2012-10-22 Thread ling . ma . program
From: Ma Ling  CISC code has higher instruction density, saving memory and improving the i-cache hit rate. However, decode becomes a challenge: only one multiple-uop (2~3) instruction can be decoded per cycle, and instructions producing more than 4 uops (rep movsq/b) have to be handled by the MS-ROM, the…
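
To make the decode trade-off concrete: rep movsq packs an entire copy into one microcoded instruction, dense but with MS-ROM startup latency that dominates at small sizes, while an open-coded move sequence decodes into simple single-uop instructions. A hedged GNU-C sketch of the fast-string path (helper name invented for illustration):

#include <stddef.h>

/* Copy nquads 8-byte words with the microcoded fast-string instruction. */
static inline void copy_rep_movsq(void *dst, const void *src, size_t nquads)
{
    asm volatile("rep movsq"
                 : "+D" (dst), "+S" (src), "+c" (nquads)
                 :
                 : "memory");
}

The patch's point is that below some size threshold the plain-mov path wins, because its simple uops issue immediately while rep movsq is still being fetched from the MS-ROM.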

Re: [PATCH RFC V2] [x86] Optimize small size memcpy by avoding long latency from decode stage

2012-10-22 Thread Ling Ma
Attached are the memcpy micro-benchmark, CPU info, and comparison results between rep movsq/b and memcpy on Atom and IVB. Thanks Ling 2012/10/23, ling.ma.prog...@gmail.com : > From: Ma Ling > > CISC code has higher instruction density, saving memory and > improving the i-cache hit rate. However decode becomes…

[PATCH RFC] [x86] Optimize small size memcpy by avoding long latency from decode stage

2012-10-18 Thread ling . ma . program
From: Ma Ling  CISC code has higher instruction density, saving memory and improving the i-cache hit rate. However, decode becomes a challenge: only one multiple-uop (2~3) instruction can be decoded per cycle, and instructions producing more than 4 uops (rep movsq/b) have to be handled by the MS-ROM, the…

[PATCH RFC V2 1/2] [x86] Modify comments and clean up code.

2012-10-17 Thread ling . ma
From: Ma Ling  Modern CPUs use fast-string instructions to accelerate copy performance by combining data into 128 bits, so we modify comments and code style. Signed-off-by: Ma Ling --- In this version, comments are updated per Borislav Petkov. Thanks Ling arch/x86/lib/copy_page_64.S | 120

[PATCH RFC V2 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-17 Thread ling . ma
From: Ma Ling  Load and write operations occupy about 35% and 10% respectively of most industry benchmarks. Fetched 16-byte-aligned code includes about 4 instructions, implying 1.40 (0.35 * 4) loads and 0.4 writes. Modern CPUs support 2 loads and 1 write per cycle, so throughput from writes is the bottleneck…
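
The arithmetic above says the store port is the scarce resource, so the rewrite batches loads ahead of stores. A plain-C stand-in for the copy_page_64.S scheduling idea (illustrative only; the real patch arranges this in assembly):

#include <stdint.h>

#define PAGE_SIZE 4096

static void copy_page_sketch(uint64_t *dst, const uint64_t *src)
{
    for (int i = 0; i < PAGE_SIZE / 8; i += 8) {
        /* issue eight independent loads first (two can complete per cycle) */
        uint64_t a = src[i],     b = src[i + 1];
        uint64_t c = src[i + 2], d = src[i + 3];
        uint64_t e = src[i + 4], f = src[i + 5];
        uint64_t g = src[i + 6], h = src[i + 7];

        /* then drain them through the single store port */
        dst[i]     = a; dst[i + 1] = b;
        dst[i + 2] = c; dst[i + 3] = d;
        dst[i + 4] = e; dst[i + 5] = f;
        dst[i + 6] = g; dst[i + 7] = h;
    }
}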

[PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register

2012-10-10 Thread ling . ma
From: Ma Ling  Load and write operations occupy about 35% and 10% respectively of most industry benchmarks. Fetched 16-byte-aligned code includes about 4 instructions, implying 1.40 (0.35 * 4) loads and 0.4 writes. Modern CPUs support 2 loads and 1 write per cycle, so throughput from writes is…

[PATCH RFC 1/2] [x86] Modify comments and clean up code.

2012-10-10 Thread ling . ma
From: Ma Ling  Modern CPUs use fast-string instructions to accelerate copy performance by combining data into 128 bits, so we modify comments and code style. Signed-off-by: Ma Ling --- arch/x86/lib/copy_page_64.S | 119 +-- 1 files changed, 59
