Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-25 Thread Sergey Dyasli
Hi Juergen,

I've found another regression that happens only with sched-gran=core.
CentOS 5.11 (PV, CPUs: 32; RAM: 6GB) kernel hangs during suspend attempt.
The last kernel messages are:

CPU 1 offline: Remove Rx thread
CPU 2 offline: Remove Rx thread

Kernel: Linux localhost 2.6.18-398.el5xen #1 SMP Tue Sep 16 21:31:50 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

xl top shows 100% CPU utilization for the hung VM. And here's its state:

(XEN) [ 1907.976356] *** Dumping CPU14 guest state (d1v0): ***
(XEN) [ 1907.982558] [ Xen-4.13.0-8.0.6-d  x86_64  debug=y   Not tainted ]
(XEN) [ 1907.990704] CPU:14
(XEN) [ 1907.993901] RIP:e033:[]
(XEN) [ 1907.999333] RFLAGS: 0286   EM: 1   CONTEXT: pv guest (d1v0)
(XEN) [ 1908.007282] rax: 0001   rbx: 80522b80   rcx: 
(XEN) [ 1908.016203] rdx: 80522b90   rsi: 0079   rdi: 8052a980
(XEN) [ 1908.025121] rbp: 80522980   rsp: 88017106dcf8   r8:  88017106c000
(XEN) [ 1908.034040] r9:     r10:    r11: 880176fad8c0
(XEN) [ 1908.042962] r12: 0001   r13:    r14: 80522980
(XEN) [ 1908.051881] r15: 0003   cr0: 8005003b   cr4: 00142660
(XEN) [ 1908.060800] cr3: 00801d8c   cr2: 2b540097
(XEN) [ 1908.067393] fsb:    gsb: 80639000   gss: 
(XEN) [ 1908.076311] ds:    es:    fs:    gs:    ss: e02b   cs: e033
(XEN) [ 1908.084650] Guest stack trace from rsp=88017106dcf8:
(XEN) [ 1908.091147]802c3dd4 01168460 80522b90 0079
(XEN) [ 1908.100164]80522b80 80522980 0001 
(XEN) [ 1908.109179]0003 0003 802c4041 88017d68b040
(XEN) [ 1908.118197]0003 0003 0007 0003
(XEN) [ 1908.127213]ffea 8029f0ad 802c4092 8050ff90
(XEN) [ 1908.136229]80268111 0003  88017d68b040
(XEN) [ 1908.145245]802a408b  8800011d3860 fff7
(XEN) [ 1908.154263]  7fff 0001
(XEN) [ 1908.163278]   
(XEN) [ 1908.172296]0003 0003  88017f5ebce0
(XEN) [ 1908.181312] 88017f5ebcd0 803be1a6 0001
(XEN) [ 1908.190328] 803be9dd 803bea43 88017106deb0
(XEN) [ 1908.199345]80289495 0003 88017f5ebce0 88017f5ebce8
(XEN) [ 1908.208362] 88017f5ebcd0 8029f0ad 88017106dee0
(XEN) [ 1908.217379] 88017f5ebce0  88017f5ebcd0
(XEN) [ 1908.226396]8029f0ad  80233ee4 
(XEN) [ 1908.235413]88017d8687f0   
(XEN) [ 1908.244427]7fff  8800011d3860 88017f5ebcd0
(XEN) [ 1908.253444]88017f5ebc78 88017f5ce6c0 80260b2c 
(XEN) [ 1908.262461]8029f0ad 88017f5ebcd0  88017f5ce6c0



Thanks,
Sergey


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-24 Thread Juergen Gross

On 24.07.19 16:54, Sergey Dyasli wrote:

On 24/07/2019 10:13, Juergen Gross wrote:

The fix is a one-liner. :-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index f0bc5b3161..da9efb147f 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -2207,6 +2207,7 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
  if ( unlikely(!scheduler_active) )
  {
  ASSERT(is_idle_unit(prev));
+    atomic_set(&prev->next_task->rendezvous_out_cnt, 0);
  prev->rendezvous_in_cnt = 0;
  }
  }
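
(As background, a minimal stand-alone model of the rendezvous counters this
fix touches -- illustrative C only, with simplified names, not the actual
Xen implementation: each sibling cpu of a core decrements an "in" counter
and spins until all siblings have arrived at the context switch, then does
the same with an "out" counter on the way out. A counter that is never
reset leaves the siblings spinning, which is consistent with the watchdog
reports further down in this thread.)

    /* Illustrative model only (C11 atomics), not Xen's code. */
    #include <stdatomic.h>

    struct rendezvous {
        atomic_uint in_cnt;   /* siblings that still have to arrive */
        atomic_uint out_cnt;  /* siblings that still have to leave  */
    };

    /* Each sibling calls this on entry (with &r->in_cnt) and again on
     * exit (with &r->out_cnt); the last one to arrive releases the rest. */
    static void rendezvous_wait(atomic_uint *cnt)
    {
        atomic_fetch_sub(cnt, 1u);
        while (atomic_load(cnt) > 0)
            ;  /* the real code does useful work while waiting */
    }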


Even with that applied, I'm still seeing it :(


Interesting, for me it was gone.

Time for more tests and some debug code...


Juergen


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-24 Thread Sergey Dyasli
On 24/07/2019 10:13, Juergen Gross wrote:
> The fix is a one-liner. :-)
> 
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index f0bc5b3161..da9efb147f 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -2207,6 +2207,7 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
>  if ( unlikely(!scheduler_active) )
>  {
>  ASSERT(is_idle_unit(prev));
> +    atomic_set(&prev->next_task->rendezvous_out_cnt, 0);
>  prev->rendezvous_in_cnt = 0;
>  }
>  }

Even with that applied, I'm still seeing it :(

(XEN) [  311.223780] Watchdog timer detects that CPU1 is stuck!
(XEN) [  311.229413] [ Xen-4.13.0  x86_64  debug=y   Not tainted ]
(XEN) [  311.236002] CPU:1
(XEN) [  311.238774] RIP:e008:[] sched_context_switched+0x92/0x101
(XEN) [  311.246575] RFLAGS: 0202   CONTEXT: hypervisor
(XEN) [  311.252556] rax: 0002   rbx: 83081cc635b0   rcx: 0001
(XEN) [  311.260530] rdx: 83081cc63634   rsi: 83081cc8f000   rdi: 83081cc8f000
(XEN) [  311.268502] rbp: 83081cc87df0   rsp: 83081cc87dd0   r8:  
(XEN) [  311.276474] r9:  83081cc62000   r10: 83081cc62000   r11: 83081cc6b000
(XEN) [  311.284448] r12: 83081cc8f000   r13: 83081cc8f000   r14: 83081cc61e80
(XEN) [  311.292422] r15: 82d0805e2260   cr0: 8005003b   cr4: 001526e0
(XEN) [  311.300395] cr3: dd4ac000   cr2: 559b05a94048
(XEN) [  311.306288] fsb:    gsb: 8880a394   gss: 
(XEN) [  311.314262] ds: 002b   es: 002b   fs:    gs:    ss: e010   cs: e008
(XEN) [  311.321716] Xen code around  (sched_context_switched+0x92/0x101):
(XEN) [  311.329862]  85 c0 74 08 f3 90 8b 02 <85> c0 75 f8 49 8b 44 24 10 66 81 38 ff 7f 75 05
(XEN) [  311.338269] Xen stack trace from rsp=83081cc87dd0:
(XEN) [  311.343904]83081cc8f000 83081cc8f000  83081cc635b0
(XEN) [  311.351963]83081cc87e28 82d080240996 83081cc61e98 82d08060a4a8
(XEN) [  311.360022]83081cc61e98 82d08060a4a8 83081cc635b0 83081cc87e80
(XEN) [  311.368083]82d080240f7a 0001 83081cc8f000 0047588837ec
(XEN) [  311.376142]00011cc87ec0 82d0805c3a00 82d0805c3980 
(XEN) [  311.384205]82d0805d3980 82d0805e2260 83081cc87eb0 82d08024274a
(XEN) [  311.392263]0001 82d0805c3a00 0001 0001
(XEN) [  311.400324]83081cc87ec0 82d0802427bf 83081cc87ef0 82d080279a1d
(XEN) [  311.408385]83081cc8f000 83081cc8f000 0001 83081cc635b0
(XEN) [  311.416443]83081cc87df0 88809ee1ba00 88809ee1ba00 
(XEN) [  311.424504] 0005 88809ee1ba00 0246
(XEN) [  311.432563]  0001ca00 
(XEN) [  311.440625]810013aa 8203c190 deadbeefdeadf00d deadbeefdeadf00d
(XEN) [  311.448685]0100 810013aa e033 0246
(XEN) [  311.456747]c900400bfeb0 e02b beef beef
(XEN) [  311.464807]beef beef e011 83081cc8f000
(XEN) [  311.472864]00379c665d00 001526e0  
(XEN) [  311.480926]0600 
(XEN) [  311.486041] Xen call trace:
(XEN) [  311.489332][] sched_context_switched+0x92/0x101
(XEN) [  311.496266][] schedule.c#sched_context_switch+0x7f/0x160
(XEN) [  311.503980][] schedule.c#sched_slave+0x28f/0x2b5
(XEN) [  311.510999][] softirq.c#__do_softirq+0x61/0x8c
(XEN) [  311.517846][] do_softirq+0x13/0x15
(XEN) [  311.523653][] domain.c#idle_loop+0x52/0xa7
(XEN) [  311.530152]
(XEN) [  311.532144] CPU0 @ e008:82d08024334d (stop_machine.c#stopmachine_wait_state+0x19/0x24)
(XEN) [  311.540899] CPU5 @ e008:82d080243398 (stop_machine.c#stopmachine_action+0x40/0x93)
(XEN) [  311.549307] CPU3 @ e008:82d08024339e (stop_machine.c#stopmachine_action+0x46/0x93)
(XEN) [  311.557712] CPU4 @ e008:82d08024339e (stop_machine.c#stopmachine_action+0x46/0x93)
(XEN) [  311.566119] CPU7 @ e008:82d08024339e (stop_machine.c#stopmachine_action+0x46/0x93)
(XEN) [  311.574526] CPU2 @ e008:82d080243398 (stop_machine.c#stopmachine_action+0x40/0x93)
(XEN) [  311.582931] CPU6 @ e008:82d080243398 (stop_machine.c#stopmachine_action+0x40/0x93)
(XEN) [  311.591919]
(XEN) [  311.593914] 
(XEN) [  311.599374] Panic on CPU 1:
(XEN) [  311.602669] FATAL TRAP: vector = 2 (nmi)
(XEN) [  311.607088] [error_code=]
(XEN) [  311.610641] 

Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-24 Thread Juergen Gross

On 22.07.19 16:22, Sergey Dyasli wrote:

On 19/07/2019 14:57, Juergen Gross wrote:


I have now a git branch with the two problems corrected and rebased to
current staging available:

github.com/jgross1/xen.git sched-v1b


Many thanks for the branch! As for the crashes, the vcpu_sleep_sync() one
seems to be fixed now. But I can still reproduce the shutdown one.
Interestingly, it now happens only if a host has running VMs (which
are automatically powered off via PV tools):

(XEN) [  332.981355] Preparing system for ACPI S5 state.
(XEN) [  332.981419] Disabling non-boot CPUs ...
(XEN) [  337.703896] Watchdog timer detects that CPU1 is stuck!
(XEN) [  337.709532] [ Xen-4.13.0-8.0.6-d  x86_64  debug=y   Not tainted 
]
(XEN) [  337.716808] CPU:1
(XEN) [  337.719582] RIP:e008:[] 
sched_context_switched+0xaf/0x101
(XEN) [  337.727384] RFLAGS: 0202   CONTEXT: hypervisor
(XEN) [  337.733364] rax: 0002   rbx: 83081cc615b0   rcx: 
0001
(XEN) [  337.741338] rdx: 83081cc61634   rsi: 83081cc72000   rdi: 
83081cc72000
(XEN) [  337.749312] rbp: 83081cc8fdc0   rsp: 83081cc8fda0   r8:  

(XEN) [  337.757284] r9:     r10: 004d88fc535e   r11: 
004df8675ce7
(XEN) [  337.765256] r12: 83081cc72000   r13: 83081cc72000   r14: 
83081ccb0e80
(XEN) [  337.773232] r15: 83081cc615b0   cr0: 8005003b   cr4: 
001526e0
(XEN) [  337.781206] cr3: dd2a1000   cr2: 88809ed1fb80
(XEN) [  337.787100] fsb:    gsb: 8880a38c   gss: 

(XEN) [  337.795072] ds: 002b   es: 002b   fs:    gs:    ss: e010   cs: 
e008
(XEN) [  337.802525] Xen code around  
(sched_context_switched+0xaf/0x101):
(XEN) [  337.810672]  00 00 eb 18 f3 90 8b 02 <85> c0 75 f8 eb 0e 49 8b 7e 30 
48 85 ff 74 05 e8
(XEN) [  337.819080] Xen stack trace from rsp=83081cc8fda0:
(XEN) [  337.824713]83081cc72000 83081cc72000  
83081cc615b0
(XEN) [  337.832772]83081cc8fe00 82d0802404e0 0082 
83081ccb0e98
(XEN) [  337.840832]0001 83081ccb0e98 0001 
82d080602628
(XEN) [  337.848895]83081cc8fe60 82d080240aca 004d873bd669 
0001
(XEN) [  337.856952]83081cc72000 004d873bdc1c 830800ff 
82d0805bba00
(XEN) [  337.865012]82d0805bb980  83081cc8 
0001
(XEN) [  337.873072]83081cc8fe90 82d080242315 0080 
82d0805bb980
(XEN) [  337.881132]0001 82d0806026f0 83081cc8fea0 
82d08024236a
(XEN) [  337.889196]83081cc8fef0 82d08027a151 82d080242315 
00010665f000
(XEN) [  337.897256]83081cc72000 83081cc72000 83080665f000 
83081cc63000
(XEN) [  337.905313]0001 830806684000 83081cc8fd78 
88809ee08000
(XEN) [  337.913373]88809ee08000   
0003
(XEN) [  337.921434]88809ee08000 0246  

(XEN) [  337.929497]96968abe  810013aa 
8203c190
(XEN) [  337.937554]deadbeefdeadf00d deadbeefdeadf00d 0100 
810013aa
(XEN) [  337.945615]e033 0246 c900400afeb0 
e02b
(XEN) [  337.953674]beef beef beef 
beef
(XEN) [  337.961736]e011 83081cc72000 00379c66db80 
001526e0
(XEN) [  337.969797]  0600 

(XEN) [  337.977856] Xen call trace:
(XEN) [  337.981152][] sched_context_switched+0xaf/0x101
(XEN) [  337.988083][] 
schedule.c#sched_context_switch+0x72/0x151
(XEN) [  337.995796][] schedule.c#sched_slave+0x2a3/0x2b2
(XEN) [  338.002817][] softirq.c#__do_softirq+0x85/0x90
(XEN) [  338.009664][] do_softirq+0x13/0x15
(XEN) [  338.015471][] domain.c#idle_loop+0xb2/0xc9
(XEN) [  338.021970]
(XEN) [  338.023965] CPU7 @ e008:82d080242f94 
(stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [  338.032372] CPU5 @ e008:82d080242f94 
(stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [  338.040776] CPU4 @ e008:82d080242f94 
(stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [  338.049182] CPU2 @ e008:82d080242f9a 
(stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [  338.057591] CPU6 @ e008:82d080242f9a 
(stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [  338.065999] CPU3 @ e008:82d080242f9a 
(stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [  338.074406] CPU0 @ e008:82d0802532d1 
(ns16550.c#ns_read_reg+0x21/0x42)
(XEN) [  338.081773]
(XEN) [  338.083764] 
(XEN) [  338.089226] Panic on CPU 1:
(XEN) [  338.092521] FATAL TRAP: vector = 2 (nmi)
(XEN) [  338.096940] [error_code=]
(XEN) 

Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-22 Thread Sergey Dyasli
On 19/07/2019 14:57, Juergen Gross wrote:

> I have now a git branch with the two problems corrected and rebased to
> current staging available:
> 
> github.com/jgross1/xen.git sched-v1b

Many thanks for the branch! As for the crashes, the vcpu_sleep_sync() one
seems to be fixed now. But I can still reproduce the shutdown one.
Interestingly, it now happens only if a host has running VMs (which
are automatically powered off via PV tools):

(XEN) [  332.981355] Preparing system for ACPI S5 state.
(XEN) [  332.981419] Disabling non-boot CPUs ...
(XEN) [  337.703896] Watchdog timer detects that CPU1 is stuck!
(XEN) [  337.709532] [ Xen-4.13.0-8.0.6-d  x86_64  debug=y   Not tainted 
]
(XEN) [  337.716808] CPU:1
(XEN) [  337.719582] RIP:e008:[] 
sched_context_switched+0xaf/0x101
(XEN) [  337.727384] RFLAGS: 0202   CONTEXT: hypervisor
(XEN) [  337.733364] rax: 0002   rbx: 83081cc615b0   rcx: 
0001
(XEN) [  337.741338] rdx: 83081cc61634   rsi: 83081cc72000   rdi: 
83081cc72000
(XEN) [  337.749312] rbp: 83081cc8fdc0   rsp: 83081cc8fda0   r8:  

(XEN) [  337.757284] r9:     r10: 004d88fc535e   r11: 
004df8675ce7
(XEN) [  337.765256] r12: 83081cc72000   r13: 83081cc72000   r14: 
83081ccb0e80
(XEN) [  337.773232] r15: 83081cc615b0   cr0: 8005003b   cr4: 
001526e0
(XEN) [  337.781206] cr3: dd2a1000   cr2: 88809ed1fb80
(XEN) [  337.787100] fsb:    gsb: 8880a38c   gss: 

(XEN) [  337.795072] ds: 002b   es: 002b   fs:    gs:    ss: e010   cs: 
e008
(XEN) [  337.802525] Xen code around  
(sched_context_switched+0xaf/0x101):
(XEN) [  337.810672]  00 00 eb 18 f3 90 8b 02 <85> c0 75 f8 eb 0e 49 8b 7e 30 
48 85 ff 74 05 e8
(XEN) [  337.819080] Xen stack trace from rsp=83081cc8fda0:
(XEN) [  337.824713]83081cc72000 83081cc72000  
83081cc615b0
(XEN) [  337.832772]83081cc8fe00 82d0802404e0 0082 
83081ccb0e98
(XEN) [  337.840832]0001 83081ccb0e98 0001 
82d080602628
(XEN) [  337.848895]83081cc8fe60 82d080240aca 004d873bd669 
0001
(XEN) [  337.856952]83081cc72000 004d873bdc1c 830800ff 
82d0805bba00
(XEN) [  337.865012]82d0805bb980  83081cc8 
0001
(XEN) [  337.873072]83081cc8fe90 82d080242315 0080 
82d0805bb980
(XEN) [  337.881132]0001 82d0806026f0 83081cc8fea0 
82d08024236a
(XEN) [  337.889196]83081cc8fef0 82d08027a151 82d080242315 
00010665f000
(XEN) [  337.897256]83081cc72000 83081cc72000 83080665f000 
83081cc63000
(XEN) [  337.905313]0001 830806684000 83081cc8fd78 
88809ee08000
(XEN) [  337.913373]88809ee08000   
0003
(XEN) [  337.921434]88809ee08000 0246  

(XEN) [  337.929497]96968abe  810013aa 
8203c190
(XEN) [  337.937554]deadbeefdeadf00d deadbeefdeadf00d 0100 
810013aa
(XEN) [  337.945615]e033 0246 c900400afeb0 
e02b
(XEN) [  337.953674]beef beef beef 
beef
(XEN) [  337.961736]e011 83081cc72000 00379c66db80 
001526e0
(XEN) [  337.969797]  0600 

(XEN) [  337.977856] Xen call trace:
(XEN) [  337.981152][] sched_context_switched+0xaf/0x101
(XEN) [  337.988083][] 
schedule.c#sched_context_switch+0x72/0x151
(XEN) [  337.995796][] schedule.c#sched_slave+0x2a3/0x2b2
(XEN) [  338.002817][] softirq.c#__do_softirq+0x85/0x90
(XEN) [  338.009664][] do_softirq+0x13/0x15
(XEN) [  338.015471][] domain.c#idle_loop+0xb2/0xc9
(XEN) [  338.021970]
(XEN) [  338.023965] CPU7 @ e008:82d080242f94 
(stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [  338.032372] CPU5 @ e008:82d080242f94 
(stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [  338.040776] CPU4 @ e008:82d080242f94 
(stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [  338.049182] CPU2 @ e008:82d080242f9a 
(stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [  338.057591] CPU6 @ e008:82d080242f9a 
(stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [  338.065999] CPU3 @ e008:82d080242f9a 
(stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [  338.074406] CPU0 @ e008:82d0802532d1 
(ns16550.c#ns_read_reg+0x21/0x42)
(XEN) [  338.081773]
(XEN) [  338.083764] 
(XEN) [  338.089226] Panic on CPU 1:
(XEN) [  338.092521] FATAL TRAP: vector = 2 (nmi)
(XEN) [  338.096940] [error_code=]
(XEN) [  338.100491] 

Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-19 Thread Juergen Gross

On 18.07.19 17:14, Sergey Dyasli wrote:

On 18/07/2019 15:48, Juergen Gross wrote:

On 15.07.19 16:08, Sergey Dyasli wrote:

On 05/07/2019 14:56, Dario Faggioli wrote:

On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:

1) This crash is quite likely to happen:

[2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
that CPU2 is stuck!
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
8.0.6-d  x86_64  debug=y   Not tainted ]
[...]
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
[2019-07-04 18:22:47 UTC] (XEN) [
3425.505278]    [] vcpu_sleep_sync+0x50/0x71
[2019-07-04 18:22:47 UTC] (XEN) [
3425.511518]    [] vcpu_pause+0x21/0x23
[2019-07-04 18:22:47 UTC] (XEN) [
3425.517326]    []
vcpu_set_periodic_timer+0x27/0x73
[2019-07-04 18:22:47 UTC] (XEN) [
3425.524258]    [] do_vcpu_op+0x2c9/0x668
[2019-07-04 18:22:47 UTC] (XEN) [
3425.530238]    [] compat_vcpu_op+0x250/0x390
[2019-07-04 18:22:47 UTC] (XEN) [
3425.536566]    [] pv_hypercall+0x364/0x564
[2019-07-04 18:22:47 UTC] (XEN) [
3425.542719]    [] do_entry_int82+0x26/0x2d
[2019-07-04 18:22:47 UTC] (XEN) [
3425.548876]    [] entry_int82+0xbb/0xc0


Mmm... vcpu_set_periodic_timer?

What kernel is this and when does this crash happen?


Hi Dario,

I can easily reproduce this crash using a Debian 7 PV VM (2 vCPUs, 2GB RAM)
which has the following kernel:

# uname -a

Linux localhost 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux

All I need to do is suspend and resume the VM.


Happens with a more recent kernel, too.

I can easily reproduce the issue with any PV guest with more than 1 vcpu
by doing "xl save" and then "xl restore" again.

With the reproducer being available I'm now diving into the issue...


One further thing to add is that I was able to avoid the crash by reverting

xen/sched: rework and rename vcpu_force_reschedule()

which is a part of the series. This made all tests with PV guests pass.

Another question I have is do you have a git branch with core-scheduling
patches rebased on top of current staging available somewhere?


I have now a git branch with the two problems corrected and rebased to
current staging available:

github.com/jgross1/xen.git sched-v1b


Juergen


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-19 Thread Juergen Gross

On 16.07.19 17:45, Sergey Dyasli wrote:

On 05/07/2019 14:17, Sergey Dyasli wrote:

[2019-07-05 00:37:16 UTC] (XEN) [24907.482686] Watchdog timer detects that 
CPU30 is stuck!
[2019-07-05 00:37:16 UTC] (XEN) [24907.514180] [ Xen-4.13.0-8.0.6-d  x86_64 
 debug=y   Not tainted ]
[2019-07-05 00:37:16 UTC] (XEN) [24907.552070] CPU:30
[2019-07-05 00:37:16 UTC] (XEN) [24907.565281] RIP:
e008:[] sched_context_switched+0xaf/0x101
[2019-07-05 00:37:16 UTC] (XEN) [24907.601232] RFLAGS: 0202   
CONTEXT: hypervisor
[2019-07-05 00:37:16 UTC] (XEN) [24907.629998] rax: 0002   rbx: 
83202782e880   rcx: 001e
[2019-07-05 00:37:16 UTC] (XEN) [24907.669651] rdx: 83202782e904   rsi: 
832027823000   rdi: 832027823000
[2019-07-05 00:37:16 UTC] (XEN) [24907.706560] rbp: 83403cab7d20   rsp: 
83403cab7d00   r8:  
[2019-07-05 00:37:16 UTC] (XEN) [24907.743258] r9:     r10: 
0200200200200200   r11: 0100100100100100
[2019-07-05 00:37:16 UTC] (XEN) [24907.779940] r12: 832027823000   r13: 
832027823000   r14: 83202782e7b0
[2019-07-05 00:37:16 UTC] (XEN) [24907.816849] r15: 83202782e880   cr0: 
8005003b   cr4: 000426e0
[2019-07-05 00:37:16 UTC] (XEN) [24907.854125] cr3: bd8a1000   cr2: 
1851b798
[2019-07-05 00:37:16 UTC] (XEN) [24907.881483] fsb:    gsb: 
   gss: 
[2019-07-05 00:37:16 UTC] (XEN) [24907.918309] ds:    es:    fs:    
gs:    ss:    cs: e008
[2019-07-05 00:37:16 UTC] (XEN) [24907.952619] Xen code around 
 (sched_context_switched+0xaf/0x101):
[2019-07-05 00:37:16 UTC] (XEN) [24907.990277]  00 00 eb 18 f3 90 8b 02 <85> c0 
75 f8 eb 0e 49 8b 7e 30 48 85 ff 74 05 e8
[2019-07-05 00:37:16 UTC] (XEN) [24908.032393] Xen stack trace from 
rsp=83403cab7d00:
[2019-07-05 00:37:16 UTC] (XEN) [24908.061298]832027823000 
832027823000  83202782e880
[2019-07-05 00:37:16 UTC] (XEN) [24908.098529]83403cab7d60 
82d0802407c0 0082 83202782e7c8
[2019-07-05 00:37:16 UTC] (XEN) [24908.135622]001e 
83202782e7c8 001e 82d080602628
[2019-07-05 00:37:16 UTC] (XEN) [24908.172671]83403cab7dc0 
82d080240d83 df99 001e
[2019-07-05 00:37:16 UTC] (XEN) [24908.210212]832027823000 
16a62dc8c6bc 00fc 001e
[2019-07-05 00:37:16 UTC] (XEN) [24908.247181]83202782e7c8 
82d080602628 82d0805da460 001e
[2019-07-05 00:37:16 UTC] (XEN) [24908.284279]83403cab7e60 
82d080240ea4 0002802aecc5 832027823000
[2019-07-05 00:37:16 UTC] (XEN) [24908.321128]83202782e7b0 
83202782e880 83403cab7e10 82d080273b4e
[2019-07-05 00:37:16 UTC] (XEN) [24908.358308]83403cab7e10 
82d080242f7f 83403cab7e60 82d08024663a
[2019-07-05 00:37:17 UTC] (XEN) [24908.395662]83403cab7ea0 
82d0802ec32a 834000ff 82d0805bc880
[2019-07-05 00:37:17 UTC] (XEN) [24908.432376]82d0805bb980 
 83403cab7fff 001e
[2019-07-05 00:37:17 UTC] (XEN) [24908.469812]83403cab7e90 
82d080242575 0f00 82d0805bb980
[2019-07-05 00:37:17 UTC] (XEN) [24908.508373]001e 
82d0806026f0 83403cab7ea0 82d0802425ca
[2019-07-05 00:37:17 UTC] (XEN) [24908.549856]83403cab7ef0 
82d08027a601 82d080242575 001e7ffde000
[2019-07-05 00:37:17 UTC] (XEN) [24908.588022]832027823000 
832027823000 83127ffde000 83203ffe5000
[2019-07-05 00:37:17 UTC] (XEN) [24908.625217]001e 
831204092000 83403cab7d78 ffed
[2019-07-05 00:37:17 UTC] (XEN) [24908.662932]8180 
 8180 
[2019-07-05 00:37:17 UTC] (XEN) [24908.703246]818f4580 
880039118848 0e6a3c4b2698 148900db
[2019-07-05 00:37:17 UTC] (XEN) [24908.743671] 
8101e650 8185c3e0 
[2019-07-05 00:37:17 UTC] (XEN) [24908.781927] 
 beefbeef 81054eb2
[2019-07-05 00:37:17 UTC] (XEN) [24908.820986] Xen call trace:
[2019-07-05 00:37:17 UTC] (XEN) [24908.836789][] 
sched_context_switched+0xaf/0x101
[2019-07-05 00:37:17 UTC] (XEN) [24908.869916][] 
schedule.c#sched_context_switch+0x72/0x151
[2019-07-05 00:37:17 UTC] (XEN) [24908.907384][] 
schedule.c#sched_slave+0x2a3/0x2b2
[2019-07-05 00:37:17 UTC] (XEN) [24908.941241][] 
schedule.c#schedule+0x112/0x2a1
[2019-07-05 00:37:17 UTC] (XEN) [24908.973939][] 
softirq.c#__do_softirq+0x85/0x90
[2019-07-05 00:37:17 UTC] (XEN) [24909.007101][] 
do_softirq+0x13/0x15
[2019-07-05 00:37:17 UTC] (XEN) [24909.035971][] 
domain.c#idle_loop+0xad/0xc0
[2019-07-05 00:37:17 UTC] (XEN) [24909.070546]
[2019-07-05 

Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-19 Thread Juergen Gross

On 19.07.19 07:41, Juergen Gross wrote:

On 18.07.19 17:14, Sergey Dyasli wrote:

On 18/07/2019 15:48, Juergen Gross wrote:

On 15.07.19 16:08, Sergey Dyasli wrote:

On 05/07/2019 14:56, Dario Faggioli wrote:

On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:

1) This crash is quite likely to happen:

[2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
that CPU2 is stuck!
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
8.0.6-d  x86_64  debug=y   Not tainted ]
[...]
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
[2019-07-04 18:22:47 UTC] (XEN) [
3425.505278]    [] vcpu_sleep_sync+0x50/0x71
[2019-07-04 18:22:47 UTC] (XEN) [
3425.511518]    [] vcpu_pause+0x21/0x23
[2019-07-04 18:22:47 UTC] (XEN) [
3425.517326]    []
vcpu_set_periodic_timer+0x27/0x73
[2019-07-04 18:22:47 UTC] (XEN) [
3425.524258]    [] do_vcpu_op+0x2c9/0x668
[2019-07-04 18:22:47 UTC] (XEN) [
3425.530238]    [] compat_vcpu_op+0x250/0x390
[2019-07-04 18:22:47 UTC] (XEN) [
3425.536566]    [] pv_hypercall+0x364/0x564
[2019-07-04 18:22:47 UTC] (XEN) [
3425.542719]    [] do_entry_int82+0x26/0x2d
[2019-07-04 18:22:47 UTC] (XEN) [
3425.548876]    [] entry_int82+0xbb/0xc0


Mmm... vcpu_set_periodic_timer?

What kernel is this and when does this crash happen?


Hi Dario,

I can easily reproduce this crash using a Debian 7 PV VM (2 vCPUs, 
2GB RAM)

which has the following kernel:

# uname -a

Linux localhost 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux

All I need to do is suspend and resume the VM.


Happens with a more recent kernel, too.

I can easily reproduce the issue with any PV guest with more than 1 vcpu
by doing "xl save" and then "xl restore" again.

With the reproducer being available I'm now diving into the issue...


One further thing to add is that I was able to avoid the crash by 
reverting


xen/sched: rework and rename vcpu_force_reschedule()

which is a part of the series. This made all tests with PV guests pass.


Yeah, but removing this patch is just papering over a general issue.
The main problem seems to be a vcpu trying to pause another vcpu of the
same sched_unit. I already have an idea what is really happening, I just
need to verify it.


It was a different problem than I thought initially, but I've found it.

xl restore is working now. :-)


Juergen


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-18 Thread Juergen Gross

On 18.07.19 17:14, Sergey Dyasli wrote:

On 18/07/2019 15:48, Juergen Gross wrote:

On 15.07.19 16:08, Sergey Dyasli wrote:

On 05/07/2019 14:56, Dario Faggioli wrote:

On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:

1) This crash is quite likely to happen:

[2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
that CPU2 is stuck!
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
8.0.6-d  x86_64  debug=y   Not tainted ]
[...]
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
[2019-07-04 18:22:47 UTC] (XEN) [
3425.505278]    [] vcpu_sleep_sync+0x50/0x71
[2019-07-04 18:22:47 UTC] (XEN) [
3425.511518]    [] vcpu_pause+0x21/0x23
[2019-07-04 18:22:47 UTC] (XEN) [
3425.517326]    []
vcpu_set_periodic_timer+0x27/0x73
[2019-07-04 18:22:47 UTC] (XEN) [
3425.524258]    [] do_vcpu_op+0x2c9/0x668
[2019-07-04 18:22:47 UTC] (XEN) [
3425.530238]    [] compat_vcpu_op+0x250/0x390
[2019-07-04 18:22:47 UTC] (XEN) [
3425.536566]    [] pv_hypercall+0x364/0x564
[2019-07-04 18:22:47 UTC] (XEN) [
3425.542719]    [] do_entry_int82+0x26/0x2d
[2019-07-04 18:22:47 UTC] (XEN) [
3425.548876]    [] entry_int82+0xbb/0xc0


Mmm... vcpu_set_periodic_timer?

What kernel is this and when does this crash happen?


Hi Dario,

I can easily reproduce this crash using a Debian 7 PV VM (2 vCPUs, 2GB RAM)
which has the following kernel:

# uname -a

Linux localhost 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux

All I need to do is suspend and resume the VM.


Happens with a more recent kernel, too.

I can easily reproduce the issue with any PV guest with more than 1 vcpu
by doing "xl save" and then "xl restore" again.

With the reproducer being available I'm now diving into the issue...


One further thing to add is that I was able to avoid the crash by reverting

xen/sched: rework and rename vcpu_force_reschedule()

which is a part of the series. This made all tests with PV guests pass.


Yeah, but removing this patch is just papering over a general issue.
The main problem seems to be a vcpu trying to pause another vcpu of the
same sched_unit. I already have an idea what is really happening, I just
need to verify it.


Another question I have is do you have a git branch with core-scheduling
patches rebased on top of current staging available somewhere?


Only the one Dario already mentioned.


Juergen



Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-18 Thread Dario Faggioli
On Thu, 2019-07-18 at 16:14 +0100, Sergey Dyasli wrote:
> On 18/07/2019 15:48, Juergen Gross wrote:
> > 
> > I can easily reproduce the issue with any PV guest with more than 1
> > vcpu
> > by doing "xl save" and then "xl restore" again.
> > 
> > With the reproducer being available I'm now diving into the
> > issue...
> 
> One further thing to add is that I was able to avoid the crash by
> reverting
> 
>   xen/sched: rework and rename vcpu_force_reschedule()
> 
Ah, interesting!

> which is a part of the series. This made all tests with PV guests
> pass.
> 
That's good news. :-)

> Another question I have is do you have a git branch with core-
> scheduling
> patches rebased on top of current staging available somewhere?
> 
For my benchmarks, I used this:

https://github.com/jgross1/xen/tree/sched-v1-rebase

I don't know if Juergen has another one which is even more updated.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)




Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-18 Thread Sergey Dyasli
On 18/07/2019 15:48, Juergen Gross wrote:
> On 15.07.19 16:08, Sergey Dyasli wrote:
>> On 05/07/2019 14:56, Dario Faggioli wrote:
>>> On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:
 1) This crash is quite likely to happen:

 [2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
 that CPU2 is stuck!
 [2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
 8.0.6-d  x86_64  debug=y   Not tainted ]
 [...]
 [2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.505278]    [] vcpu_sleep_sync+0x50/0x71
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.511518]    [] vcpu_pause+0x21/0x23
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.517326]    []
 vcpu_set_periodic_timer+0x27/0x73
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.524258]    [] do_vcpu_op+0x2c9/0x668
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.530238]    [] compat_vcpu_op+0x250/0x390
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.536566]    [] pv_hypercall+0x364/0x564
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.542719]    [] do_entry_int82+0x26/0x2d
 [2019-07-04 18:22:47 UTC] (XEN) [
 3425.548876]    [] entry_int82+0xbb/0xc0

>>> Mmm... vcpu_set_periodic_timer?
>>>
>>> What kernel is this and when does this crash happen?
>>
>> Hi Dario,
>>
>> I can easily reproduce this crash using a Debian 7 PV VM (2 vCPUs, 2GB RAM)
>> which has the following kernel:
>>
>> # uname -a
>>
>> Linux localhost 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux
>>
>> All I need to do is suspend and resume the VM.
> 
> Happens with a more recent kernel, too.
> 
> I can easily reproduce the issue with any PV guest with more than 1 vcpu
> by doing "xl save" and then "xl restore" again.
> 
> With the reproducer being available I'm now diving into the issue...

One further thing to add is that I was able to avoid the crash by reverting

xen/sched: rework and rename vcpu_force_reschedule()

which is a part of the series. This made all tests with PV guests pass.

Another question I have is do you have a git branch with core-scheduling
patches rebased on top of current staging available somewhere?

Thanks,
Sergey


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-18 Thread Juergen Gross

On 15.07.19 16:08, Sergey Dyasli wrote:

On 05/07/2019 14:56, Dario Faggioli wrote:

On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:

1) This crash is quite likely to happen:

[2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
that CPU2 is stuck!
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
8.0.6-d  x86_64  debug=y   Not tainted ]
[...]
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
[2019-07-04 18:22:47 UTC] (XEN) [
3425.505278][] vcpu_sleep_sync+0x50/0x71
[2019-07-04 18:22:47 UTC] (XEN) [
3425.511518][] vcpu_pause+0x21/0x23
[2019-07-04 18:22:47 UTC] (XEN) [
3425.517326][]
vcpu_set_periodic_timer+0x27/0x73
[2019-07-04 18:22:47 UTC] (XEN) [
3425.524258][] do_vcpu_op+0x2c9/0x668
[2019-07-04 18:22:47 UTC] (XEN) [
3425.530238][] compat_vcpu_op+0x250/0x390
[2019-07-04 18:22:47 UTC] (XEN) [
3425.536566][] pv_hypercall+0x364/0x564
[2019-07-04 18:22:47 UTC] (XEN) [
3425.542719][] do_entry_int82+0x26/0x2d
[2019-07-04 18:22:47 UTC] (XEN) [
3425.548876][] entry_int82+0xbb/0xc0


Mmm... vcpu_set_periodic_timer?

What kernel is this and when does this crash happen?


Hi Dario,

I can easily reproduce this crash using a Debian 7 PV VM (2 vCPUs, 2GB RAM)
which has the following kernel:

# uname -a

Linux localhost 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux

All I need to do is suspend and resume the VM.


Happens with a more recent kernel, too.

I can easily reproduce the issue with any PV guest with more than 1 vcpu
by doing "xl save" and then "xl restore" again.
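
(For completeness, that reproducer amounts to something like the following;
the domain name and checkpoint path are hypothetical:)

    xl save debian-pv /tmp/debian-pv.save
    xl restore /tmp/debian-pv.save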

With the reproducer being available I'm now diving into the issue...


Juergen


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-16 Thread Sergey Dyasli
On 05/07/2019 14:17, Sergey Dyasli wrote:
> [2019-07-05 00:37:16 UTC] (XEN) [24907.482686] Watchdog timer detects that 
> CPU30 is stuck!
> [2019-07-05 00:37:16 UTC] (XEN) [24907.514180] [ Xen-4.13.0-8.0.6-d  
> x86_64  debug=y   Not tainted ]
> [2019-07-05 00:37:16 UTC] (XEN) [24907.552070] CPU:30
> [2019-07-05 00:37:16 UTC] (XEN) [24907.565281] RIP:
> e008:[] sched_context_switched+0xaf/0x101
> [2019-07-05 00:37:16 UTC] (XEN) [24907.601232] RFLAGS: 0202   
> CONTEXT: hypervisor
> [2019-07-05 00:37:16 UTC] (XEN) [24907.629998] rax: 0002   rbx: 
> 83202782e880   rcx: 001e
> [2019-07-05 00:37:16 UTC] (XEN) [24907.669651] rdx: 83202782e904   rsi: 
> 832027823000   rdi: 832027823000
> [2019-07-05 00:37:16 UTC] (XEN) [24907.706560] rbp: 83403cab7d20   rsp: 
> 83403cab7d00   r8:  
> [2019-07-05 00:37:16 UTC] (XEN) [24907.743258] r9:     r10: 
> 0200200200200200   r11: 0100100100100100
> [2019-07-05 00:37:16 UTC] (XEN) [24907.779940] r12: 832027823000   r13: 
> 832027823000   r14: 83202782e7b0
> [2019-07-05 00:37:16 UTC] (XEN) [24907.816849] r15: 83202782e880   cr0: 
> 8005003b   cr4: 000426e0
> [2019-07-05 00:37:16 UTC] (XEN) [24907.854125] cr3: bd8a1000   cr2: 
> 1851b798
> [2019-07-05 00:37:16 UTC] (XEN) [24907.881483] fsb:    gsb: 
>    gss: 
> [2019-07-05 00:37:16 UTC] (XEN) [24907.918309] ds:    es:    fs:  
>   gs:    ss:    cs: e008
> [2019-07-05 00:37:16 UTC] (XEN) [24907.952619] Xen code around 
>  (sched_context_switched+0xaf/0x101):
> [2019-07-05 00:37:16 UTC] (XEN) [24907.990277]  00 00 eb 18 f3 90 8b 02 <85> 
> c0 75 f8 eb 0e 49 8b 7e 30 48 85 ff 74 05 e8
> [2019-07-05 00:37:16 UTC] (XEN) [24908.032393] Xen stack trace from 
> rsp=83403cab7d00:
> [2019-07-05 00:37:16 UTC] (XEN) [24908.061298]832027823000 
> 832027823000  83202782e880
> [2019-07-05 00:37:16 UTC] (XEN) [24908.098529]83403cab7d60 
> 82d0802407c0 0082 83202782e7c8
> [2019-07-05 00:37:16 UTC] (XEN) [24908.135622]001e 
> 83202782e7c8 001e 82d080602628
> [2019-07-05 00:37:16 UTC] (XEN) [24908.172671]83403cab7dc0 
> 82d080240d83 df99 001e
> [2019-07-05 00:37:16 UTC] (XEN) [24908.210212]832027823000 
> 16a62dc8c6bc 00fc 001e
> [2019-07-05 00:37:16 UTC] (XEN) [24908.247181]83202782e7c8 
> 82d080602628 82d0805da460 001e
> [2019-07-05 00:37:16 UTC] (XEN) [24908.284279]83403cab7e60 
> 82d080240ea4 0002802aecc5 832027823000
> [2019-07-05 00:37:16 UTC] (XEN) [24908.321128]83202782e7b0 
> 83202782e880 83403cab7e10 82d080273b4e
> [2019-07-05 00:37:16 UTC] (XEN) [24908.358308]83403cab7e10 
> 82d080242f7f 83403cab7e60 82d08024663a
> [2019-07-05 00:37:17 UTC] (XEN) [24908.395662]83403cab7ea0 
> 82d0802ec32a 834000ff 82d0805bc880
> [2019-07-05 00:37:17 UTC] (XEN) [24908.432376]82d0805bb980 
>  83403cab7fff 001e
> [2019-07-05 00:37:17 UTC] (XEN) [24908.469812]83403cab7e90 
> 82d080242575 0f00 82d0805bb980
> [2019-07-05 00:37:17 UTC] (XEN) [24908.508373]001e 
> 82d0806026f0 83403cab7ea0 82d0802425ca
> [2019-07-05 00:37:17 UTC] (XEN) [24908.549856]83403cab7ef0 
> 82d08027a601 82d080242575 001e7ffde000
> [2019-07-05 00:37:17 UTC] (XEN) [24908.588022]832027823000 
> 832027823000 83127ffde000 83203ffe5000
> [2019-07-05 00:37:17 UTC] (XEN) [24908.625217]001e 
> 831204092000 83403cab7d78 ffed
> [2019-07-05 00:37:17 UTC] (XEN) [24908.662932]8180 
>  8180 
> [2019-07-05 00:37:17 UTC] (XEN) [24908.703246]818f4580 
> 880039118848 0e6a3c4b2698 148900db
> [2019-07-05 00:37:17 UTC] (XEN) [24908.743671] 
> 8101e650 8185c3e0 
> [2019-07-05 00:37:17 UTC] (XEN) [24908.781927] 
>  beefbeef 81054eb2
> [2019-07-05 00:37:17 UTC] (XEN) [24908.820986] Xen call trace:
> [2019-07-05 00:37:17 UTC] (XEN) [24908.836789][] 
> sched_context_switched+0xaf/0x101
> [2019-07-05 00:37:17 UTC] (XEN) [24908.869916][] 
> schedule.c#sched_context_switch+0x72/0x151
> [2019-07-05 00:37:17 UTC] (XEN) [24908.907384][] 
> schedule.c#sched_slave+0x2a3/0x2b2
> [2019-07-05 00:37:17 UTC] (XEN) [24908.941241][] 
> schedule.c#schedule+0x112/0x2a1
> [2019-07-05 00:37:17 UTC] (XEN) [24908.973939][] 
> softirq.c#__do_softirq+0x85/0x90
> [2019-07-05 00:37:17 UTC] (XEN) [24909.007101][] 
> do_softirq+0x13/0x15
> [2019-07-05 

Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-15 Thread Sergey Dyasli
On 05/07/2019 14:56, Dario Faggioli wrote:
> On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:
>> 1) This crash is quite likely to happen:
>>
>> [2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
>> that CPU2 is stuck!
>> [2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
>> 8.0.6-d  x86_64  debug=y   Not tainted ]
>> [...]
>> [2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.505278][] vcpu_sleep_sync+0x50/0x71
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.511518][] vcpu_pause+0x21/0x23
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.517326][]
>> vcpu_set_periodic_timer+0x27/0x73
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.524258][] do_vcpu_op+0x2c9/0x668
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.530238][] compat_vcpu_op+0x250/0x390
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.536566][] pv_hypercall+0x364/0x564
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.542719][] do_entry_int82+0x26/0x2d
>> [2019-07-04 18:22:47 UTC] (XEN) [
>> 3425.548876][] entry_int82+0xbb/0xc0
>>
> Mmm... vcpu_set_periodic_timer?
> 
> What kernel is this and when does this crash happen?

Hi Dario,

I can easily reproduce this crash using a Debian 7 PV VM (2 vCPUs, 2GB RAM)
which has the following kernel:

# uname -a

Linux localhost 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux

All I need to do is suspend and resume the VM.

Thanks,
Sergey


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-11 Thread Dario Faggioli
On Tue, 2019-05-28 at 12:32 +0200, Juergen Gross wrote:
> Add support for core- and socket-scheduling in the Xen hypervisor.
> 
> [...]
>
> I have done some very basic performance testing: on a 4 cpu system
> (2 cores with 2 threads each) I did a "make -j 4" for building the
> Xen
> hypervisor. This test has been run on dom0, once with no other
> guest active and once with another guest with 4 vcpus running the
> same
> test. The results are (always elapsed time, system time, user time):
> 
> sched-gran=cpu,    no other guest: 116.10 177.65 207.84
> sched-gran=core,   no other guest: 114.04 175.47 207.45
> sched-gran=cpu,    other guest:    202.30 334.21 384.63
> sched-gran=core,   other guest:    207.24 293.04 371.37
> 
> The performance tests have been performed with credit2, the other
> schedulers are tested only briefly to be able to create a domain in a
> cpupool.
> 
I have done some more, and I'd like to report the results here.

For those who are attending the Xen-Project Dev Summit and have seen
Juergen's talk about core-scheduling, these are the numbers he had in
his slides.

There are quite a few numbers, and there are multiple ways to show them.

We arranged them in two different ways, and I'm showing both. Since
it's quite likely that the result will render poorly in mail clients,
here are the links to view them in a browser or download the text files:

http://xenbits.xen.org/people/dariof/benchmarks/results/2019/07-July/xen/core-sched/mmtests/boxes/wayrath/summary.txt

http://xenbits.xen.org/people/dariof/benchmarks/results/2019/07-July/xen/core-sched/mmtests/boxes/wayrath/summary-5columns.txt

They're the same numbers in the two files, but
'summary-5columns.txt' has them arranged in tables that show, for each
combination of benchmark and configuration, the differences between the
various options (i.e., no core-scheduling, core-scheduling not used,
core-scheduling in use, etc).

The 'summary.txt' file, contains some more data (such as the results of
runs done on baremetal), arranged in different tables. It also contains
some of my thoughts and analysis about what the numbers tells us.

It's quite hard to come up with a concise summary, as results vary a
lot on a case-by-case basis, and there are a few things that need
to be investigated more.

I'll try anyway, but please, if you are interested in the subject, do
have a look at the numbers themselves, even if there's a lot of them:

- Overhead: the cost of having this patch series applied, and not 
  using core-scheduling, seems acceptable to me. In most cases, the 
  overhead is within the noise margin of the measurements. There are a 
  couple of benchmarks where this is not the case. But that means we 
  can go and try to figure out why this happens only there, and, 
  potentially, optimize and tune.

- PV vs. HVM: there seem to be some differences, in some of the 
  results, for different types of guest (well, for PV I used dom0). In 
  general, HVM seems to behave a little worse, i.e., suffers from more 
  overhead and perf degradation, but this is not the case for all 
  benchmarks, so it's hard to tell whether it's something specific or 
  an actual trend.
  I don't have the numbers for proper PV guests and for PVH. I expect 
  the former to be close to dom0 numbers and the latter to HVM 
  numbers, but I'll try to do those runs as well (as soon as the 
  testbox is free again).

- HT vs. noHT: even before considering core-scheduling at all, the 
  debate is still open about whether or not Hyperthreading helps in the 
  first place. These numbers show that this very much depends on the 
  workload and on the load, which is no big surprise.
  It is quite a solid trend, however, that when load is high (look, 
  for instance, at runs that saturate the CPU, or at oversubscribed 
  runs), Hyperthreading lets us achieve better results.

- Regressions versus no core-scheduling: this happens, as it could 
  have been expected. It does not happen 100% of the time, and 
  mileage may vary, but in most benchmarks and in most configurations, 
  we do regress.

- Core-scheduling vs. no-Hyperthreading: this is again a mixed bag. 
  There are cases where things are faster in one setup, and cases 
  where it is the other one that wins. Especially in the non 
  overloaded case.

- Core-scheduling and overloading: when more vCPUs than pCPUs are used 
  (and there is actual overload, i.e., the vCPUs actually generate 
  more load than there are pCPUs to satisfy), core-scheduling shows 
  pretty decent performance. This is easy to see, comparing core-
  scheduling with no-Hyperthreading, in the overloaded cases. In most 
  benchmarks, both configurations perform worse than default, but 
  core-scheduling regresses a lot less than no-Hyperthreading. And 
  this, I think, is quite important!

- Core-scheduling and HT-aware scheduling: currently, the scheduler 
  tends to spread vCPUs among cores. That is, if we have 2 vCPUs and 2 
  cores with two threads each, the 

Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-05 Thread Dario Faggioli
On Fri, 2019-07-05 at 14:17 +0100, Sergey Dyasli wrote:
> 1) This crash is quite likely to happen:
> 
> [2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects
> that CPU2 is stuck!
> [2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-
> 8.0.6-d  x86_64  debug=y   Not tainted ]
> [...]
> [2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.505278][] vcpu_sleep_sync+0x50/0x71
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.511518][] vcpu_pause+0x21/0x23
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.517326][]
> vcpu_set_periodic_timer+0x27/0x73
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.524258][] do_vcpu_op+0x2c9/0x668
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.530238][] compat_vcpu_op+0x250/0x390
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.536566][] pv_hypercall+0x364/0x564
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.542719][] do_entry_int82+0x26/0x2d
> [2019-07-04 18:22:47 UTC] (XEN) [
> 3425.548876][] entry_int82+0xbb/0xc0
>
Mmm... vcpu_set_periodic_timer?

What kernel is this and when does this crash happen?

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
---
<> (Raistlin Majere)




Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-05 Thread Juergen Gross

On 05.07.19 15:17, Sergey Dyasli wrote:

Hi Juergen,

I did some testing of this series (with sched-gran=core) and am posting a couple of
crash backtraces here for your information.

Additionally, resuming a Debian 7 guest after suspend is broken.

I will be able to provide any additional information only after XenSummit :)


Thanks for the reports!

I will look at this after XenSummit. :-)


Juergen


Re: [Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-07-05 Thread Sergey Dyasli
Hi Juergen,

I did some testing of this series (with sched-gran=core) and am posting a couple of
crash backtraces here for your information.

Additionally, resuming a Debian 7 guest after suspend is broken.

I will be able to provide any additional information only after XenSummit :)

1) This crash is quite likely to happen:

[2019-07-04 18:22:46 UTC] (XEN) [ 3425.220660] Watchdog timer detects that CPU2 
is stuck!
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.226293] [ Xen-4.13.0-8.0.6-d  x86_64 
 debug=y   Not tainted ]
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.233576] CPU:2
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.236348] RIP:
e008:[] vcpu_sleep_sync+0x50/0x71
[2019-07-04 18:22:46 UTC] (XEN) [ 3425.243458] RFLAGS: 0202   
CONTEXT: hypervisor (d34v0)
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.250129] rax: 0001   rbx: 
8305f29e6000   rcx: 8305f29e6128
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.258101] rdx:    rsi: 
0296   rdi: 8308066f9128
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.266076] rbp: 8308066f7cb8   rsp: 
8308066f7ca8   r8:  deadf00d
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.274052] r9:  deadf00d   r10: 
   r11: 
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.282026] r12:    r13: 
8305f29e6000   r14: 
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.289994] r15: 0003   cr0: 
8005003b   cr4: 001526e0
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.297970] cr3: 0005f2de3000   cr2: 
c012ae78
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.303864] fsb: 04724000   gsb: 
c52c4a20   gss: 
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.311836] ds: 007b   es: 007b   fs: 00d8   
gs: 00e0   ss:    cs: e008
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.319290] Xen code around 
 (vcpu_sleep_sync+0x50/0x71):
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.326744]  ec 01 00 00 09 d0 48 98 <48> 0b 
83 20 01 00 00 74 09 80 bb 07 01 00 00 00
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.335152] Xen stack trace from 
rsp=8308066f7ca8:
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.340783]82d0802aede4 
8305f29e6000 8308066f7cc8 82d080208370
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.348844]8308066f7ce8 
82d08023e25d 0001 8305f33f
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.356904]8308066f7d58 
82d080209682 031c63c966ad ed601000
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.364963]92920063 
0009 8305f33f 0001
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.373024]0292 
82d080242ee2 0001 8305f29e6000
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.381084] 
8305f33f 8308066f7e28 82d08024f970
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.389144]8305f33f00d4 
000c 8305f33f deadf00d
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.397207]8308066f7da8 
82d0802b3754 82d080209d46 82d08020b6e7
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.405262]8308066f7e28 
82d08020c658 0002ec86be74 0002
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.413325]8305c33d8300 
8305f33f00d4  000c0008
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.421383]0009 
83081cca1000 82d08038835a 8308066f7ef8
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.429445]8306a2b11000 
deadf00d 0180 0003
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.437503]8308066f7ec8 
82d080383964 82d08038835a 82d7
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.445565]82d1 
82d0 82d0deadf00d 82d0deadf00d
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.453624]82d08038835a 
82d08038834e 82d08038835a 82d08038834e
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.461683]82d08038835a 
82d08038834e 82d08038835a 8308066f7ef8
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.469744] 
  
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.477803]8308066f7ee8 
82d080385644 82d08038835a 8306a2b11000
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.485865]7cf7f99080e7 
82d08038839b  
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.493923] 
 0001 0007
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.501989] Xen call trace:
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.505278][] 
vcpu_sleep_sync+0x50/0x71
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.511518][] 
vcpu_pause+0x21/0x23
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.517326][] 
vcpu_set_periodic_timer+0x27/0x73
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.524258][] 
do_vcpu_op+0x2c9/0x668
[2019-07-04 18:22:47 UTC] (XEN) [ 3425.530238][] 
compat_vcpu_op+0x250/0x390
[2019-07-04 18:22:47 UTC] (XEN) [ 

[Xen-devel] [PATCH 00/60] xen: add core scheduling support

2019-05-28 Thread Juergen Gross
Add support for core- and socket-scheduling in the Xen hypervisor.

Via boot parameter sched-gran=core (or sched-gran=socket)
it is possible to change the scheduling granularity from cpu (the
default) to either whole cores or even sockets.
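
For reference, this is just an extra option on the hypervisor command
line; e.g. on a GRUB2-based system something like the following
(illustrative snippet, the variable name and regeneration command depend
on the distribution):

    # /etc/default/grub (illustrative), then regenerate grub.cfg
    GRUB_CMDLINE_XEN_DEFAULT="sched-gran=core"

Using sched-gran=socket instead selects socket granularity; omitting the
option keeps the default per-cpu granularity.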

All logical cpus (threads) of the core or socket are always scheduled
together. This means that on a core always vcpus of the same domain
will be active, and those vcpus will always be scheduled at the same
time.

This is achieved by switching the scheduler to no longer see vcpus as
the primary object to schedule, but "schedule units". Each schedule
unit consists of as many vcpus as each core has threads on the current
system. The vcpu->unit relation is fixed.

I have done some very basic performance testing: on a 4 cpu system
(2 cores with 2 threads each) I did a "make -j 4" for building the Xen
hypervisor. This test has been run on dom0, once with no other
guest active and once with another guest with 4 vcpus running the same
test. The results are (always elapsed time, system time, user time):

sched-gran=cpu,    no other guest: 116.10 177.65 207.84
sched-gran=core,   no other guest: 114.04 175.47 207.45
sched-gran=cpu,    other guest:    202.30 334.21 384.63
sched-gran=core,   other guest:    207.24 293.04 371.37

The performance tests have been performed with credit2, the other
schedulers are tested only briefly to be able to create a domain in a
cpupool.

Cpupools have been moderately tested (cpu add/remove, create, destroy,
move domain).

Cpu on-/offlining has been moderately tested, too.

The complete patch series is available under:

  git://github.com/jgross1/xen/ sched-v1
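
(I.e., plain git usage to fetch it, for example:

    git clone -b sched-v1 git://github.com/jgross1/xen.git
)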

Changes in V1:
- cpupools are working now
- cpu on-/offlining working now
- all schedulers working now
- renamed "items" to "units"
- introduction of "idle scheduler"
- several new patches (see individual patches, mostly splits of
  former patches or cpupool and cpu on-/offlining support)
- all review comments addressed
- some minor changes (see individual patches)

Changes in RFC V2:
- ARM is building now
- HVM domains are working now
- idling will always be done with idle_vcpu active
- other small changes see individual patches

Juergen Gross (60):
  xen/sched: only allow schedulers with all mandatory functions
available
  xen/sched: add inline wrappers for calling per-scheduler functions
  xen/sched: let sched_switch_sched() return new lock address
  xen/sched: use new sched_unit instead of vcpu in scheduler interfaces
  xen/sched: alloc struct sched_unit for each vcpu
  xen/sched: move per-vcpu scheduler private data pointer to sched_unit
  xen/sched: build a linked list of struct sched_unit
  xen/sched: introduce struct sched_resource
  xen/sched: let pick_cpu return a scheduler resource
  xen/sched: switch schedule_data.curr to point at sched_unit
  xen/sched: move per cpu scheduler private data into struct
sched_resource
  xen/sched: switch vcpu_schedule_lock to unit_schedule_lock
  xen/sched: move some per-vcpu items to struct sched_unit
  xen/sched: add scheduler helpers hiding vcpu
  xen/sched: add domain pointer to struct sched_unit
  xen/sched: add id to struct sched_unit
  xen/sched: rename scheduler related perf counters
  xen/sched: switch struct task_slice from vcpu to sched_unit
  xen/sched: add is_running indicator to struct sched_unit
  xen/sched: make null scheduler vcpu agnostic.
  xen/sched: make rt scheduler vcpu agnostic.
  xen/sched: make credit scheduler vcpu agnostic.
  xen/sched: make credit2 scheduler vcpu agnostic.
  xen/sched: make arinc653 scheduler vcpu agnostic.
  xen: add sched_unit_pause_nosync() and sched_unit_unpause()
  xen: let vcpu_create() select processor
  xen/sched: use sched_resource cpu instead smp_processor_id in
schedulers
  xen/sched: switch schedule() from vcpus to sched_units
  xen/sched: switch sched_move_irqs() to take sched_unit as parameter
  xen: switch from for_each_vcpu() to for_each_sched_unit()
  xen/sched: add runstate counters to struct sched_unit
  xen/sched: rework and rename vcpu_force_reschedule()
  xen/sched: Change vcpu_migrate_*() to operate on schedule unit
  xen/sched: move struct task_slice into struct sched_unit
  xen/sched: add code to sync scheduling of all vcpus of a sched unit
  xen/sched: introduce unit_runnable_state()
  xen/sched: add support for multiple vcpus per sched unit where missing
  x86: make loading of GDT at context switch more modular
  x86: optimize loading of GDT at context switch
  xen/sched: modify cpupool_domain_cpumask() to be an unit mask
  xen/sched: support allocating multiple vcpus into one sched unit
  xen/sched: add a scheduler_percpu_init() function
  xen/sched: add a percpu resource index
  xen/sched: add fall back to idle vcpu when scheduling unit
  xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware
  xen/sched: carve out freeing sched_unit memory into dedicated function
  xen/sched: move per-cpu variable scheduler to struct sched_resource
  xen/sched: