[Xen-ia64-devel] RE: [Xen-devel] unnecessary VCPU migration happens again

2006-12-21 Thread Xu, Anthony
Hi Emmanuel,

Thanks for your quick response.
I'm not familiar with the scheduler; I'll study it. :-)

I've put comments below; they may not be right. :-)

Emmanuel Ackaouy wrote on 2006-12-22 0:23:
 Hi Anthony.
 
 Based on the number of ticks on CPU0 that occurred between the
 two stat dumps, over 16 minutes elapsed during that time.
 
 During that time, 364 regular migrations occurred. These are migrations
 that happen when an idle CPU finds a runnable VCPU queued elsewhere
 on the system.

From the point of view of a credit scheduler user, these may not be regular migrations.
Because there are 4 CPUs and only 3 VCPUs,
it would be unlikely that an idle CPU finds a runnable VCPU queued elsewhere
on the system.

 
 Also during that time, 530 multi-core load balancing migrations happened.
 
 That's about one such migration every 1.86 seconds. I'm somewhat surprised
 that this costs 5% in performance of your benchmark. That said, the point
 of this code is to balance a partially idle system and not to shuffle things
 around too much so I'm happy to smooth the algorithm further to reduce the
 number of these migrations.

I'm interested in this.
I'll investigate it.

 
 I'll send another patch shortly.
Thanks again.

 
 On Dec 20, 2006, at 4:26, Xu, Anthony wrote:
 Before running KB
 
 (XEN)   migrate_queued = 169
 (XEN)   migrate_running= 213
 
 (XEN) CPU[00]  tick=117181, sort=12233, sibling=0x1, core=0x5
 
 
 After running KB
 
 (XEN)   migrate_queued = 533
 (XEN)   migrate_running= 743
 
 (XEN) CPU[00]  tick=215790, sort=42999, sibling=0x1, core=0x5



[Xen-ia64-devel] RE: [Xen-devel] unnecessary VCPU migration happens again

2006-12-19 Thread Xu, Anthony
Emmanuel Ackaouy wrote on 2006-12-19 17:00:
 On Dec 19, 2006, at 8:02, Xu, Anthony wrote:
 Can you dump the credit scheduler stat counter before and after you run
 the benchmark? (^A^A^A on the dom0/hypervisor console to switch to the
 hypervisor and then type the r key to dump scheduler info). That along
 with an idea of the elapsed time between the two stat samples would be
 handy.

Hi Emmanuel,

I got the dump scheduler info.

The environment is:
Two sockets, two cores per socket.
Dom0 is UP.
The VTI domain is 2-VCPU SMP.
There are 4 physical CPUs,
and 3 VCPUs in total.

We run a kernel build (KB) in the VTI domain:
make -j3

Before running KB

(XEN) *** Serial input - DOM0 (type 'CTRL-a' three times to switch input to Xen).
(XEN) *** Serial input - Xen (type 'CTRL-a' three times to switch input to DOM0).
(XEN) Scheduler: SMP Credit Scheduler (credit)
(XEN) info:
(XEN)   ncpus  = 4
(XEN)   master = 0
(XEN)   credit = 1200
(XEN)   credit balance = 0
(XEN)   weight = 0
(XEN)   runq_sort  = 12233
(XEN)   default-weight = 256
(XEN)   msecs per tick = 10ms
(XEN)   credits per tick   = 100
(XEN)   ticks per tslice   = 3
(XEN)   ticks per acct = 3
(XEN) idlers: 0xf
(XEN) stats:
(XEN)   schedule   = 4521191
(XEN)   acct_run   = 12233
(XEN)   acct_no_work   = 26827
(XEN)   acct_balance   = 7
(XEN)   acct_reorder   = 0
(XEN)   acct_min_credit= 0
(XEN)   acct_vcpu_active   = 4197
(XEN)   acct_vcpu_idle = 4197
(XEN)   vcpu_sleep = 16
(XEN)   vcpu_wake_running  = 140923
(XEN)   vcpu_wake_onrunq   = 0
(XEN)   vcpu_wake_runnable = 2016395
(XEN)   vcpu_wake_not_runnable = 0
(XEN)   vcpu_park  = 0
(XEN)   vcpu_unpark= 0
(XEN)   tickle_local_idler = 2016206
(XEN)   tickle_local_over  = 2
(XEN)   tickle_local_under = 39
(XEN)   tickle_local_other = 0
(XEN)   tickle_idlers_none = 0
(XEN)   tickle_idlers_some = 204
(XEN)   load_balance_idle  = 2432213
(XEN)   load_balance_over  = 132
(XEN)   load_balance_other = 0
(XEN)   steal_trylock_failed   = 12730
(XEN)   steal_peer_idle= 1197122
(XEN)   migrate_queued = 169
(XEN)   migrate_running= 213
(XEN)   dom_init   = 4
(XEN)   dom_destroy= 1
(XEN)   vcpu_init  = 9
(XEN)   vcpu_destroy   = 2
(XEN) active vcpus:
(XEN) NOW=0x01114A1FE784
(XEN) CPU[00]  tick=117181, sort=12233, sibling=0x1, core=0x5
(XEN)   run: [32767.0] pri=-64 flags=0 cpu=0
(XEN) CPU[01]  tick=117189, sort=12233, sibling=0x2, core=0xa
(XEN)   run: [32767.1] pri=-64 flags=0 cpu=1
(XEN) CPU[02]  tick=117196, sort=12233, sibling=0x4, core=0x5
(XEN)   run: [32767.2] pri=-64 flags=0 cpu=2
(XEN) CPU[03]  tick=117176, sort=12233, sibling=0x8, core=0xa
(XEN)   run: [32767.3] pri=-64 flags=0 cpu=3


After running KB

(XEN) Scheduler: SMP Credit Scheduler (credit)
(XEN) info:
(XEN)   ncpus  = 4
(XEN)   master = 0
(XEN)   credit = 1200
(XEN)   credit balance = 0
(XEN)   weight = 0
(XEN)   runq_sort  = 42999
(XEN)   default-weight = 256
(XEN)   msecs per tick = 10ms
(XEN)   credits per tick   = 100
(XEN)   ticks per tslice   = 3
(XEN)   ticks per acct = 3
(XEN) idlers: 0xf
(XEN) stats:
(XEN)   schedule   = 8233816
(XEN)   acct_run   = 42999
(XEN)   acct_no_work   = 28931
(XEN)   acct_balance   = 47
(XEN)   acct_reorder   = 0
(XEN)   acct_min_credit= 0
(XEN)   acct_vcpu_active   = 9963
(XEN)   acct_vcpu_idle = 9963
(XEN)   vcpu_sleep = 16
(XEN)   vcpu_wake_running  = 240904
(XEN)   vcpu_wake_onrunq   = 0
(XEN)   vcpu_wake_runnable = 3697995
(XEN)   vcpu_wake_not_runnable = 0
(XEN)   vcpu_park  = 0
(XEN)   vcpu_unpark= 0
(XEN)   tickle_local_idler = 3697336
(XEN)   tickle_local_over  = 18
(XEN)   tickle_local_under = 148
(XEN)   tickle_local_other = 0
(XEN)   tickle_idlers_none = 0
(XEN)   tickle_idlers_some = 669
(XEN)   load_balance_idle  = 4394421
(XEN)   load_balance_over  = 503
(XEN)   load_balance_other = 0
(XEN)   steal_trylock_failed   = 23268
(XEN)   steal_peer_idle= 3014315
(XEN)   migrate_queued = 533
(XEN)   migrate_running= 743
(XEN)   dom_init   = 4
(XEN)   dom_destroy= 1

[Xen-ia64-devel] RE: [Xen-devel] unnecessary VCPU migration happens again

2006-12-18 Thread Xu, Anthony
Emmanuel Ackaouy wrote on 2006-12-14 6:05:
 Anthony,
 
 I checked in a change to the scheduler multi-core/thread
 mechanisms in xen-unstable which should address the over
 aggressive migrations you were seeing.
 
 Can you pull that change, try your experiments again, and
 let me know how it works for you?


Hi Emmanuel,

Sorry for the late response.

I ran some performance tests based on your patch: SMP VTI kernel build
and SMP VTI LTP.

Your patch is good and eliminates the majority of the unnecessary migrations,
but some unnecessary migrations still exist. I can still see about 5% performance
degradation on the above benchmarks (KB and LTP).
In fact this patch has helped a lot (from 27% down to 5%).

I understand it is impossible to spread VCPUs over all sockets/cores
and eliminate all unnecessary migrations at the same time.

Is it possible to add an argument to scheduler_init to enable/disable
the VCPU-spreading feature?

It would be the caller's responsibility to enable/disable this feature.
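
Something like the following is what I have in mind (just a rough sketch;
the names csched_spread_vcpus and csched_set_spread_vcpus are made up for
illustration, this is not existing Xen code):

static int csched_spread_vcpus = 1;   /* default: keep current behaviour */

/* Hypothetical knob the caller of scheduler_init() could flip. */
void csched_set_spread_vcpus(int enable)
{
    csched_spread_vcpus = enable;
}

/*
 * csched_cpu_pick() would then run the multi-core preference loop only
 * when spreading is enabled:
 *
 *     if ( csched_spread_vcpus )
 *         while ( !cpus_empty(cpus) )
 *         { ... prefer the idler with the most idling neighbours ... }
 */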

BTW, I used the attached patch to disable the VCPU-spreading feature.



Thanks,
Anthony



 
 Cheers,
 Emmanuel.
 



Attachment: sched.patch

[Xen-ia64-devel] Re: [Xen-devel] unnecessary VCPU migration happens again

2006-12-13 Thread Emmanuel Ackaouy
Anthony,

I checked in a change to the scheduler multi-core/thread
mechanisms in xen-unstable which should address the over
aggressive migrations you were seeing.

Can you pull that change, try your experiments again, and
let me know how it works for you?

Cheers,
Emmanuel.



[Xen-ia64-devel] RE: [Xen-devel] unnecessary VCPU migration happens again

2006-12-08 Thread Xu, Anthony
Petersson, Mats wrote on 2006-12-07 18:52:
 -Original Message-
 From: Emmanuel Ackaouy [mailto:[EMAIL PROTECTED]
 Sent: 07 December 2006 10:38
 To: Xu, Anthony
 Cc: Petersson, Mats; [EMAIL PROTECTED]; xen-ia64-devel
 Subject: Re: [Xen-devel] unnecessary VCPU migration happens again
 Arguably, if 2 unrelated VCPUs are runnable on a dual socket
 host, it is useful to spread them across both sockets. This
 will give each VCPU more achievable bandwidth to memory.
 
 What I think you may be arguing here is that the scheduler
 is too aggressive in this action because the VCPU that blocked
 on socket 2 will wake up very shortly, negating the host-wide
 benefits of the migration when it does, while still maintaining the
 costs.

Yes, you are right: the VCPU that blocked will wake up very shortly.

As Mats mentioned, migration is expensive on the IPF platform:

1. TLB penalty
Assume a VCPU is migrated from CPU0 to CPU1.
  (1) TLB purge penalty
  The HV must purge all of CPU0's TLB entries, because if this VCPU is
  migrated back, CPU0 may contain stale TLB entries.

  IA32 doesn't have this penalty: every VCPU switch already purges the
  whole TLB, so the purge is not caused by the migration itself.

  (2) TLB warm-up penalty
  When the VCPU is migrated to CPU1, it has to warm up the TLB on CPU1.
  Both IPF and IA32 have this penalty.

2. Cache penalty
When the VCPU is migrated to CPU1, it has to warm up the cache on CPU1.
Both IPF and IA32 have this penalty.



 
 There is a tradeoff here. We could try being less aggressive
 in spreading stuff over idle sockets. It would be nice to do
 this with a greater understanding of the tradeoff though. Can
 you share more information, such as benchmark perf results,
 migration statistics, or scheduler traces?
 

I got the following basic data on the LTP benchmark.

Environment:
IPF platform,
two sockets, two cores per socket, two threads per core.
There are 8 logical CPUs.

Dom0 is UP.
The VTI domain has 4 VCPUs.

It takes 66 minutes to run LTP.

After I commented out the following code, there are no unnecessary migrations.

It then takes 48 minutes to run LTP.

The degradation is

(66-48)/66 = 27%

That's a big degradation!


/*
while ( !cpus_empty(cpus) )
{
    nxt = first_cpu(cpus);

    if ( csched_idler_compare(cpu, nxt) < 0 )
    {
        cpu = nxt;
        cpu_clear(nxt, cpus);
    }
    else if ( cpu_isset(cpu, cpu_core_map[nxt]) )
    {
        cpus_andnot(cpus, cpus, cpu_sibling_map[nxt]);
    }
    else
    {
        cpus_andnot(cpus, cpus, cpu_core_map[nxt]);
    }

    ASSERT( !cpu_isset(nxt, cpus) );
}
*/




Thanks,
Anthony


 I don't know if I've understood this right or not, but I believe the
 penalty for switching from one core (or socket) to another is higher
 on IA64 than on x86. I'm not an expert on IA64, but I remember
 someone at the Xen Summit saying something to that effect - I think
 it was something like executing a bunch of code to flush the TLB's or
 some such...
 
 
 Emmanuel.



[Xen-ia64-devel] Re: [Xen-devel] unnecessary VCPU migration happens again

2006-12-08 Thread Emmanuel Ackaouy
On Fri, Dec 08, 2006 at 04:43:12PM +0800, Xu, Anthony wrote:
 The degradation is,
 
 (66-48)/66 = 27%
 
 That's a big degradation!

Yeah. I'll throw some code together to fix this.



[Xen-ia64-devel] Re: [Xen-devel] unnecessary VCPU migration happens again

2006-12-07 Thread Emmanuel Ackaouy
On Thu, Dec 07, 2006 at 11:37:54AM +0800, Xu, Anthony wrote:
 From this logic, migration happens frequently if the number of VCPUs
 is less than the number of logical CPUs.

This logic is designed to make better use of a partially idle
system by spreading work across sockets and cores before
co-scheduling them. It won't come into play if there are no
idle execution units.

Note that __csched_running_vcpu_is_stealable() will trigger a
migration only when the end result would be strictly better
than the current situation. Once the system is balanced, it
will not bounce VCPUs around.
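
Conceptually, the check amounts to something like this (a sketch only, not
the actual __csched_running_vcpu_is_stealable() code; idle_neighbours() is
a made-up helper):

static int idle_neighbours(int cpu, cpumask_t idlers)
{
    /* Count the idle LPs in the same grouping as this CPU. */
    cpumask_t nbrs = cpu_core_map[cpu];
    cpus_and(nbrs, nbrs, idlers);
    return cpus_weight(nbrs);
}

/* Steal a running VCPU only if the destination is strictly "more idle". */
static int stealable_sketch(int dst_cpu, int src_cpu, cpumask_t idlers)
{
    return idle_neighbours(dst_cpu, idlers) > idle_neighbours(src_cpu, idlers);
}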

 What I want to highlight is:
 
 When an HVM VCPU is executing an IO operation,
 the HVM VCPU is blocked by the HV until the IO operation
 is emulated by Qemu. Then the HV wakes up the HVM VCPU.
 
 A PV VCPU, in contrast, will not be blocked by the PV driver.
 
 
 Here is a scenario.
 
 There are two sockets, two cores per socket.
 
 Assume dom0 is running on socket1 core1,
  vti1 is running on socket1 core2,
 vti2 is running on socket2 core1,
 and socket2 core2 is idle.
 
 If vti2 is blocked by an IO operation, then socket2 core1 is also idle.
 That means both cores in socket2 are idle,
 while dom0 and vti1 are running on the two cores of socket1.
 
 The scheduler will then try to spread dom0 and vti1 over the two sockets,
 so migration happens. This is not necessary.

Arguably, if 2 unrelated VCPUs are runnable on a dual socket
host, it is useful to spread them across both sockets. This
will give each VCPU more achievable bandwidth to memory.

What I think you may be arguing here is that the scheduler
is too aggressive in this action because the VCPU that blocked
on socket 2 will wake up very shortly, negating the host-wide
benefits of the migration when it does, while still maintaining
the costs.

There is a tradeoff here. We could try being less aggressive
in spreading stuff over idle sockets. It would be nice to do
this with a greater understanding of the tradeoff though. Can
you share more information, such as benchmark perf results,
migration statistics, or scheduler traces?

Emmanuel.



[Xen-ia64-devel] Re: [Xen-devel] unnecessary VCPU migration happens again

2006-12-06 Thread Emmanuel Ackaouy
Hi Anthony.

Could you send xentrace output for scheduling operations
in your setup?

Perhaps we're being a little too aggressive spreading
work across sockets. We do this on vcpu_wake right now.

I'm not sure I understand why HVM VCPUs would block
and wake more often than PV VCPUs though. Can you
explain?

If you could gather some scheduler traces and send
results, it will give us a good idea of what's going
on and why. The multi-core support is new and not
widely tested so it's possible that it is being
overly aggressive or perhaps even buggy.

Emmanuel.


On Fri, Dec 01, 2006 at 06:11:32PM +0800, Xu, Anthony wrote:
 Emmanuel,
 
 I found that unnecessary VCPU migration happens again.
 
 
 My environment is,
 
 IPF, two sockets, two cores per socket, 1 thread per core.
 
 There are 4 cores in total.
 
 There are 3 domains, all of them UP,
 so there are 3 VCPUs in total.
 
 One is domain0;
 the other two are VTI domains.
 
 I found there are lots of migrations.
 
 
 This is caused by the code segment below in function csched_cpu_pick.
 When I comment out this code segment, there is no migration in the above
 environment.
 
 
 
 I have a little analysis of this code.
 
 This code handles multi-core and multi-thread, which is very good:
 if two VCPUs run on LPs that belong to the same core, performance
 is bad, so if there are free LPs, we should let these two VCPUs run on
 different cores.
 
 This code may work well with para-domains.
 A para-domain is seldom blocked;
 it may block when the guest executes the halt instruction.
 This means that if an idle VCPU is running on an LP,
 there is no non-idle VCPU running on this LP.
 In this environment, I think the code below should work well.
 
 
 But in an HVM environment, an HVM VCPU is blocked by IO operations.
 That is to say, if an idle VCPU is running on an LP, maybe an
 HVM VCPU is blocked, and that HVM VCPU will run on this LP when
 it is woken up.
 In this environment, the code below causes unnecessary migrations.
 I think this doesn't reach the goal of this code segment.
 
 On the IPF side, migration is time-consuming, so it causes some performance
 degradation.
 
 
 I have a proposal, though it may not be a good one.
 
 We can change the meaning of an idle LP:
 
 an idle LP means an idle VCPU is running on this LP and there is no VCPU
 blocked on this LP (if such a blocked VCPU is woken up, it will run on this LP).
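 
 A rough sketch of what I mean (cpu_has_blocked_vcpu() is a made-up helper;
 this is not real Xen code):
 
 static inline int lp_is_truly_idle(unsigned int cpu, cpumask_t idlers)
 {
     /* The LP must be running its idle VCPU right now... */
     if ( !cpu_isset(cpu, idlers) )
         return 0;
 
     /* ...and no blocked VCPU may have this LP as its wake-up target. */
     return !cpu_has_blocked_vcpu(cpu);   /* hypothetical helper */
 }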
 
 
 
 --Anthony
 
 
 /*
  * In multi-core and multi-threaded CPUs, not all idle execution
  * vehicles are equal!
  *
  * We give preference to the idle execution vehicle with the most
  * idling neighbours in its grouping. This distributes work across
  * distinct cores first and guarantees we don't do something stupid
  * like run two VCPUs on co-hyperthreads while there are idle cores
  * or sockets.
  */
 while ( !cpus_empty(cpus) )
 {
     nxt = first_cpu(cpus);
 
     if ( csched_idler_compare(cpu, nxt) < 0 )
     {
         cpu = nxt;
         cpu_clear(nxt, cpus);
     }
     else if ( cpu_isset(cpu, cpu_core_map[nxt]) )
     {
         cpus_andnot(cpus, cpus, cpu_sibling_map[nxt]);
     }
     else
     {
         cpus_andnot(cpus, cpus, cpu_core_map[nxt]);
     }
 
     ASSERT( !cpu_isset(nxt, cpus) );
 }



[Xen-ia64-devel] RE: [Xen-devel] unnecessary VCPU migration happens again

2006-12-06 Thread Xu, Anthony
Hi,

Thanks for your reply. Please see embedded comments.


Petersson, Mats wrote on 2006-12-06 22:14:
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Emmanuel
 Ackaouy Sent: 06 December 2006 14:02
 To: Xu, Anthony
 Cc: [EMAIL PROTECTED]; xen-ia64-devel
 Subject: Re: [Xen-devel] unnecessary VCPU migration happens again
 
 Hi Anthony.
 
 Could you send xentrace output for scheduling operations
 in your setup?
I'm not sure xentrace works on the IPF side. I'm trying it.

 
 Perhaps we're being a little too aggressive spreading
 work across sockets. We do this on vcpu_wake right now.

I think the logic below also does spreading work.

1. In csched_load_balance, the code segment below sets the _VCPUF_migrating flag
on peer_vcpu, as the comment says:
/*
 * If we failed to find any remotely queued VCPUs to move here,
 * see if it would be more efficient to move any of the running
 * remote VCPUs over here.
 */


/* Signal the first candidate only. */
if ( !is_idle_vcpu(peer_vcpu) &&
     is_idle_vcpu(__runq_elem(spc->runq.next)->vcpu) &&
     __csched_running_vcpu_is_stealable(cpu, peer_vcpu) )
{
    set_bit(_VCPUF_migrating, &peer_vcpu->vcpu_flags);
    spin_unlock(&per_cpu(schedule_data, peer_cpu).schedule_lock);

    CSCHED_STAT_CRANK(steal_loner_signal);
    cpu_raise_softirq(peer_cpu, SCHEDULE_SOFTIRQ);
    break;
}


2. When this peer_vcpu is scheduled out, migration happens,

void context_saved(struct vcpu *prev)
{
    clear_bit(_VCPUF_running, &prev->vcpu_flags);

    if ( unlikely(test_bit(_VCPUF_migrating, &prev->vcpu_flags)) )
        vcpu_migrate(prev);
}

From this logic, migration happens frequently if the number of VCPUs
is less than the number of logical CPUs.


Anthony.



 
 I'm not sure I understand why HVM VCPUs would block
 and wake more often than PV VCPUs though. Can you
 explain?
 
 Whilst I don't know any of the facts of the original poster, I can
 tell you why HVM and PV guests have differing number of scheduling
 operations...
 
 Every time you get a IOIO/MMIO vmexit that leads to a qemu-dm
 interaction, you'll get a context switch. So for an average IDE block
 read/write (for example) on x86, you get 4-5 IOIO intercepts to send
 the command to qemu, then an interrupt is sent to the guest to
 indicate that the operation is finished, followed by a 256 x 16-bit
 IO read/write of the sector content (which is normally just one IOIO
 intercept unless the driver is stupid). This means around a dozen
 or so schedule operations to do one disk IO operation.
 
 The same operation in PV (or using PV driver in HVM guest of course)
 would require a single transaction from DomU to Dom0 and back, so only
 two schedule operations.
 
 The same problem occurs of course for other hardware devices such as
 network, keyboard, mouse, where a transaction consists of more than a
 single read or write to a single register.



What I want to highlight is:

When an HVM VCPU is executing an IO operation,
the HVM VCPU is blocked by the HV until the IO operation
is emulated by Qemu. Then the HV wakes up the HVM VCPU.

A PV VCPU, in contrast, will not be blocked by the PV driver.


Here is a scenario.

There are two sockets, two cores per socket.

Assume dom0 is running on socket1 core1,
 vti1 is running on socket1 core2,
vti2 is running on socket2 core1,
and socket2 core2 is idle.

If vti2 is blocked by an IO operation, then socket2 core1 is also idle.
That means both cores in socket2 are idle,
while dom0 and vti1 are running on the two cores of socket1.

The scheduler will then try to spread dom0 and vti1 over the two sockets,
so migration happens. This is not necessary.



 
 
 If you could gather some scheduler traces and send
 results, it will give us a good idea of what's going
 on and why. The multi-core support is new and not
 widely tested so it's possible that it is being
 overly aggressive or perhaps even buggy.
 

