Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure

2020-02-13 Thread Jürgen Groß

On 13.02.20 15:19, Sergey Dyasli wrote:
> On 12/02/2020 12:24, Jürgen Groß wrote:
>> On 12.02.20 12:21, Sergey Dyasli wrote:
>>> Hi Juergen,
>>>
>>> Recently our testing has found a host crash which is reproducible.
>>> Do you have any idea what might be going on here?
>>
>> Oh, nice catch!
>>
>> The problem is that get_cpu_idle_time() is calling vcpu_runstate_get()
>> for an idle vcpu. This is fragile as idle vcpus are sometimes assigned
>> temporarily to normal scheduling units, thus the ASSERT() in the unlock
>> function is failing when the assignment of the idle vcpu is modified
>> under the feet of vcpu_runstate_get() and the unit it has been assigned
>> to before is already scheduled on another cpu.
>>
>> The patch is rather easy, though. Can you try it, please?
>
> Thank you for the patch! I put it into testing yesterday and it looks
> good so far. It also seems that the issue is well understood and the
> patch should go into the main tree.

Just wanted to make sure it really fixes your problem. :-)


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure

2020-02-13 Thread Sergey Dyasli
On 12/02/2020 12:24, Jürgen Groß wrote:
> On 12.02.20 12:21, Sergey Dyasli wrote:
>> Hi Juergen,
>>
>> Recently our testing has found a host crash which is reproducible.
>> Do you have any idea what might be going on here?
>
> Oh, nice catch!
>
> The problem is that get_cpu_idle_time() is calling vcpu_runstate_get()
> for an idle vcpu. This is fragile as idle vcpus are sometimes assigned
> temporarily to normal scheduling units, thus the ASSERT() in the unlock
> function is failing when the assignment of the idle vcpu is modified
> under the feet of vcpu_runstate_get() and the unit it has been assigned
> to before is already scheduled on another cpu.
>
> The patch is rather easy, though. Can you try it, please?

Thank you for the patch! I put it into testing yesterday and it looks
good so far. It also seems that the issue is well understood and the
patch should go into the main tree.

--
Sergey


Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure

2020-02-12 Thread Jürgen Groß

On 12.02.20 12:21, Sergey Dyasli wrote:
> Hi Juergen,
>
> Recently our testing has found a host crash which is reproducible.
> Do you have any idea what might be going on here?


Oh, nice catch!

The problem is that get_cpu_idle_time() is calling vcpu_runstate_get()
for an idle vcpu. This is fragile as idle vcpus are sometimes assigned
temporarily to normal scheduling units, thus the ASSERT() in the unlock
function is failing when the assignment of the idle vcpu is modified
under the feet of vcpu_runstate_get() and the unit it has been assigned
to before is already scheduled on another cpu.
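
The failure mode described above can be sketched in a few lines of C. This is a single-threaded model under stated assumptions, not Xen code: `struct lock`, `struct unit`, `struct vcpu`, `lock_via_vcpu()`, `unlock_check_via_vcpu()` and `simulate()` are hypothetical stand-ins, and the reassignment that another cpu would perform is simply done inline between the "lock" and "unlock" steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the Xen structures. */
struct lock { int id; };
struct unit { struct lock *schedule_lock; };
struct vcpu { struct unit *sched_unit; };

/* "Lock" step: pick the lock of the unit the vcpu points at right now. */
static struct lock *lock_via_vcpu(const struct vcpu *v)
{
    assert(v != NULL && v->sched_unit != NULL);
    return v->sched_unit->schedule_lock;
}

/*
 * "Unlock" step: compare the lock taken earlier with the lock of the unit
 * the vcpu points at *now*. The vcpu pointer is dereferenced a second time,
 * which is what makes this pattern fragile.
 */
static bool unlock_check_via_vcpu(const struct lock *held,
                                  const struct vcpu *v)
{
    return held == v->sched_unit->schedule_lock;
}

/* Returns true if the unlock-time check would pass. */
static bool simulate(bool reassign_between)
{
    struct lock lock_a = { 1 }, lock_b = { 2 };
    struct unit idle_unit = { &lock_a }, guest_unit = { &lock_b };
    struct vcpu idle = { &idle_unit };

    struct lock *held = lock_via_vcpu(&idle);

    /* In the real case another cpu would change the assignment. */
    if ( reassign_between )
        idle.sched_unit = &guest_unit;

    return unlock_check_via_vcpu(held, &idle);
}
```

In this model `simulate(false)` passes the check, while `simulate(true)` fails it because the unit is looked up through the vcpu a second time after the assignment has changed.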

The patch is rather easy, though. Can you try it, please?


Juergen
From 0236aee221409fa826a81395f2f3e8b15d5128de Mon Sep 17 00:00:00 2001
From: Juergen Gross 
To: xen-devel@lists.xenproject.org
Cc: George Dunlap 
Cc: Dario Faggioli 
Date: Wed, 12 Feb 2020 13:04:16 +0100
Subject: [PATCH] xen/sched: fix get_cpu_idle_time() with core scheduling

get_cpu_idle_time() is calling vcpu_runstate_get() for an idle vcpu.
With core scheduling active this is fragile, as idle vcpus are assigned
to other scheduling units temporarily, and that assignment is changed
in some cases without holding the scheduling lock, and
vcpu_runstate_get() is using v->sched_unit as parameter for
unit_schedule_[un]lock_irq(), resulting in an ASSERT() triggering in
unlock in case v->sched_unit has changed meanwhile.

Fix that by using a local unit variable holding the correct unit.

Signed-off-by: Juergen Gross 
---
 xen/common/sched/core.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 2e43f8029f..de5a6b1a57 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -308,17 +308,26 @@ void vcpu_runstate_get(const struct vcpu *v,
 {
     spinlock_t *lock;
     s_time_t delta;
+    struct sched_unit *unit;
 
     rcu_read_lock(&sched_res_rculock);
 
-    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(v->sched_unit);
+    /*
+     * Be careful in case of an idle vcpu: the assignment to a unit might
+     * change even with the scheduling lock held, so be sure to use the
+     * correct unit for locking in order to avoid triggering an ASSERT() in
+     * the unlock function.
+     */
+    unit = is_idle_vcpu(v) ? get_sched_res(v->processor)->sched_unit_idle
+                           : v->sched_unit;
+    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(unit);
     memcpy(runstate, &v->runstate, sizeof(*runstate));
     delta = NOW() - runstate->state_entry_time;
     if ( delta > 0 )
         runstate->time[runstate->state] += delta;
 
     if ( unlikely(lock != NULL) )
-        unit_schedule_unlock_irq(lock, v->sched_unit);
+        unit_schedule_unlock_irq(lock, unit);
 
     rcu_read_unlock(&sched_res_rculock);
 }
-- 
2.16.4
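
The idea behind the patch, choosing the unit once and using that same local value for both lock and unlock, can be illustrated with a small standalone C model. Everything here is an assumption made for illustration: the types, the `idle_unit_of_cpu[]` table (standing in for the per-cpu idle unit that the patch reaches through get_sched_res()), `unit_for_locking()` and `simulate_fixed()` are hypothetical names, and the concurrent reassignment is modelled by a plain assignment in a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the Xen structures. */
struct lock { int id; };
struct unit { struct lock *schedule_lock; };
struct vcpu {
    struct unit *sched_unit;
    bool is_idle;
    unsigned int processor;
};

#define MODEL_NR_CPUS 2

/* Stand-in for the per-cpu idle unit; an assumption of this sketch. */
static struct unit *idle_unit_of_cpu[MODEL_NR_CPUS];

/* Pick the unit once: idle vcpus use the idle unit of their cpu. */
static struct unit *unit_for_locking(const struct vcpu *v)
{
    assert(v != NULL && v->processor < MODEL_NR_CPUS);
    return v->is_idle ? idle_unit_of_cpu[v->processor] : v->sched_unit;
}

/* Returns true if the unlock-time check passes despite a reassignment. */
static bool simulate_fixed(void)
{
    struct lock lock_a = { 1 }, lock_b = { 2 };
    struct unit idle_unit = { &lock_a }, guest_unit = { &lock_b };
    struct vcpu idle = { &idle_unit, true, 0 };

    idle_unit_of_cpu[0] = &idle_unit;

    /* Snapshot the unit in a local and lock through it. */
    struct unit *unit = unit_for_locking(&idle);
    struct lock *held = unit->schedule_lock;

    /* The vcpu is temporarily moved to a guest unit in the meantime. */
    idle.sched_unit = &guest_unit;

    /* The unlock check uses the local, not the vcpu's current pointer. */
    return held == unit->schedule_lock;
}
```

Because the unlock-time comparison uses the local `unit` rather than the vcpu's current pointer, the reassignment in the middle no longer affects the check in this model.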


[Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure

2020-02-12 Thread Sergey Dyasli
Hi Juergen,

Recently our testing has found a host crash which is reproducible.
Do you have any idea what might be going on here?

(XEN) [175654.165126] Assertion 'lock == get_sched_res(i->res->master_cpu)->schedule_lock' failed at ...ild/BUILD/xen-4.13.1/xen/include/xen/sched-if.h:269
(XEN) [175654.165133] [ Xen-4.13.1-9.0.3-d  x86_64  debug=y   Not tainted ]
(XEN) [175654.165136] CPU:28
(XEN) [175654.165138] RIP:e008:[] vcpu_runstate_get+0x11e/0x14f
(XEN) [175654.165146] RFLAGS: 00010083   CONTEXT: hypervisor (d0v4)
(XEN) [175654.165151] rax: 83403ff0d340   rbx: 83807cc97ac8   rcx: 0006
(XEN) [175654.165154] rdx: 006fbf942000   rsi: 83400f8e1cd8   rdi: 107898e2
(XEN) [175654.165158] rbp: 83807cc97ab8   rsp: 83807cc97a88   r8:  deadbeefdeadf00d
(XEN) [175654.165160] r9:  deadbeefdeadf00d   r10:    r11:
(XEN) [175654.165164] r12: 83400fa6f000   r13: 83400f8c9778   r14: 82d0805c8008
(XEN) [175654.165167] r15: 832e30854ae0   cr0: 80050033   cr4: 00362660
(XEN) [175654.165170] cr3: 002130811000   cr2: 88817f50b728
(XEN) [175654.165172] fsb: 7f40a40da740   gsb: 88831d30   gss:
(XEN) [175654.165175] ds:    es:    fs:    gs:    ss: e010   cs: e008
(XEN) [175654.165179] Xen code around  (vcpu_runstate_get+0x11e/0x14f):
(XEN) [175654.165181]  04 10 4c 3b 68 10 74 02 <0f> 0b 4c 89 ef e8 7e 5d 00 00 48 8d 05 41 9d 38
(XEN) [175654.165192] Xen stack trace from rsp=83807cc97a88:
(XEN) [175654.165194]83807cc97aa8 83400fa75a60  
83807cc97da0
(XEN) [175654.165199]0230 83807cc97fff 83807cc97af8 
82d08023d41f
(XEN) [175654.165204]0001 9fc1ac1cb2f4 4840c423acdc 
5780e7f9735a
(XEN) [175654.165207]  83807cc97c98 
82d0802ea9f7
(XEN) [175654.165211] 9fc1ac1c6b99 00050007 
83807cc97c10
(XEN) [175654.165215]83807cc97bb0 0020  

(XEN) [175654.165251]   

(XEN) [175654.165254]   

(XEN) [175654.165258]82d0805c8038 82d0805c74a0  
aa00
(XEN) [175654.165263]   

(XEN) [175654.165266]   

(XEN) [175654.165269]   

(XEN) [175654.165273]   

(XEN) [175654.165276]   

(XEN) [175654.165279]   

(XEN) [175654.165283]83400f813000 83807cc97d98  
82d0805cda80
(XEN) [175654.165287]0230 83807cc97fff 83807cc97cc8 
82d08026d99b
(XEN) [175654.165291]83807cc97ef8 83400f813000 82d0805cda80 
0230
(XEN) [175654.165295]83807cc97e48 82d080244573 7f40a40e6000 
0206
(XEN) [175654.165300]82004006c000   
82e08a815e80
(XEN) [175654.165304] Xen call trace:
(XEN) [175654.165306][] R vcpu_runstate_get+0x11e/0x14f
(XEN) [175654.165310][] F get_cpu_idle_time+0x4d/0x53
(XEN) [175654.165315][] F pmstat_get_cx_stat+0x82/0x8e7
(XEN) [175654.165319][] F do_get_pm_info+0x27b/0x2d4
(XEN) [175654.165322][] F do_sysctl+0x633/0x14e0
(XEN) [175654.165327][] F pv_hypercall+0x1f5/0x567
(XEN) [175654.165330][] F lstar_enter+0x112/0x120
(XEN) [175654.165332]
(XEN) [175654.550916]
(XEN) [175654.553243] 
(XEN) [175654.559449] Panic on CPU 28:
(XEN) [175654.563328] Assertion 'lock == get_sched_res(i->res->master_cpu)->schedule_lock' failed at ...ild/BUILD/xen-4.13.1/xen/include/xen/sched-if
(XEN) [175654.581847]
(XEN) [175654.584173] Reboot in five seconds...
(XEN) [175654.588925] Executing kexec image on cpu28
(XEN) [175654.594987] Shot down all CPUs


The state of the sibling was:


  PCPU 29 Host state:
RIP:e008:[] Ring 0
RFLAGS: 00040002  AC IOPL0

rax: 83400f8c91e4   rbx: 001d   rcx: 83400f8c91f4
rdx: 83400f8c9104   rsi: 83400f8c9094   rdi: 0004
rbp: 83807cc89f28   rsp: 83807cc89f28   r8:  
r9:     r10:    r11: 
r12:    r13:    r14: 83807cc8
r15: 

cr0: 80050033   PG