Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
On 13.02.20 15:19, Sergey Dyasli wrote:
> On 12/02/2020 12:24, Jürgen Groß wrote:
>> On 12.02.20 12:21, Sergey Dyasli wrote:
>>> Hi Juergen,
>>>
>>> Recently our testing has found a host crash which is reproducible.
>>> Do you have any idea what might be going on here?
>>
>> Oh, nice catch!
>>
>> The problem is that get_cpu_idle_time() is calling vcpu_runstate_get()
>> for an idle vcpu. This is fragile as idle vcpus are sometimes assigned
>> temporarily to normal scheduling units, thus the ASSERT() in the unlock
>> function is failing when the assignment of the idle vcpu is modified
>> under the feet of vcpu_runstate_get() and the unit it has been assigned
>> to before is already scheduled on another cpu.
>>
>> The patch is rather easy, though. Can you try it, please?
>
> Thank you for the patch! I put it into testing yesterday and it looks
> good so far. It also seems that the issue is well understood and the
> patch should go into the main tree.

Just wanted to make sure it really fixes your problem. :-)

Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
On 12/02/2020 12:24, Jürgen Groß wrote:
> On 12.02.20 12:21, Sergey Dyasli wrote:
>> Hi Juergen,
>>
>> Recently our testing has found a host crash which is reproducible.
>> Do you have any idea what might be going on here?
>
> Oh, nice catch!
>
> The problem is that get_cpu_idle_time() is calling vcpu_runstate_get()
> for an idle vcpu. This is fragile as idle vcpus are sometimes assigned
> temporarily to normal scheduling units, thus the ASSERT() in the unlock
> function is failing when the assignment of the idle vcpu is modified
> under the feet of vcpu_runstate_get() and the unit it has been assigned
> to before is already scheduled on another cpu.
>
> The patch is rather easy, though. Can you try it, please?

Thank you for the patch! I put it into testing yesterday and it looks
good so far. It also seems that the issue is well understood and the
patch should go into the main tree.

--
Sergey
Re: [Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
On 12.02.20 12:21, Sergey Dyasli wrote:
> Hi Juergen,
>
> Recently our testing has found a host crash which is reproducible.
> Do you have any idea what might be going on here?

Oh, nice catch!

The problem is that get_cpu_idle_time() is calling vcpu_runstate_get()
for an idle vcpu. This is fragile as idle vcpus are sometimes assigned
temporarily to normal scheduling units, thus the ASSERT() in the unlock
function is failing when the assignment of the idle vcpu is modified
under the feet of vcpu_runstate_get() and the unit it has been assigned
to before is already scheduled on another cpu.

The patch is rather easy, though. Can you try it, please?

Juergen

From 0236aee221409fa826a81395f2f3e8b15d5128de Mon Sep 17 00:00:00 2001
From: Juergen Gross
To: xen-devel@lists.xenproject.org
Cc: George Dunlap
Cc: Dario Faggioli
Date: Wed, 12 Feb 2020 13:04:16 +0100
Subject: [PATCH] xen/sched: fix get_cpu_idle_time() with core scheduling

get_cpu_idle_time() is calling vcpu_runstate_get() for an idle vcpu.
With core scheduling active this is fragile, as idle vcpus are assigned
to other scheduling units temporarily, and that assignment is changed
in some cases without holding the scheduling lock, and
vcpu_runstate_get() is using v->sched_unit as parameter for
unit_schedule_[un]lock_irq(), resulting in an ASSERT() triggering in
unlock in case v->sched_unit has changed meanwhile.

Fix that by using a local unit variable holding the correct unit.

Signed-off-by: Juergen Gross
---
 xen/common/sched/core.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 2e43f8029f..de5a6b1a57 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -308,17 +308,26 @@ void vcpu_runstate_get(const struct vcpu *v,
 {
     spinlock_t *lock;
     s_time_t delta;
+    struct sched_unit *unit;
 
     rcu_read_lock(&sched_res_rculock);
 
-    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(v->sched_unit);
+    /*
+     * Be careful in case of an idle vcpu: the assignment to a unit might
+     * change even with the scheduling lock held, so be sure to use the
+     * correct unit for locking in order to avoid triggering an ASSERT() in
+     * the unlock function.
+     */
+    unit = is_idle_vcpu(v) ? get_sched_res(v->processor)->sched_unit_idle
+                           : v->sched_unit;
+    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(unit);
     memcpy(runstate, &v->runstate, sizeof(*runstate));
     delta = NOW() - runstate->state_entry_time;
     if ( delta > 0 )
         runstate->time[runstate->state] += delta;
 
     if ( unlikely(lock != NULL) )
-        unit_schedule_unlock_irq(lock, v->sched_unit);
+        unit_schedule_unlock_irq(lock, unit);
 
     rcu_read_unlock(&sched_res_rculock);
 }
-- 
2.16.4
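
The failure mode described in the commit message can be illustrated outside of Xen with a small model. The sketch below is only an illustration under stated assumptions: every `toy_*` name is hypothetical and none of it is Xen code. The concurrent reassignment of the idle vcpu is simulated by a plain assignment between lock and unlock, and the unlock helper returns whether the check corresponding to the ASSERT() would have passed instead of crashing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for the Xen types. */
struct toy_lock { bool held; };
struct toy_unit { struct toy_lock *schedule_lock; };
struct toy_vcpu {
    bool is_idle;
    struct toy_unit *sched_unit;      /* may be changed by another CPU */
    struct toy_unit *sched_unit_idle; /* fixed idle unit of the CPU */
};

static struct toy_lock *toy_unit_lock(struct toy_unit *unit)
{
    unit->schedule_lock->held = true;
    return unit->schedule_lock;
}

/* Returns whether the lock handed in belongs to the unit handed in. */
static bool toy_unit_unlock(struct toy_lock *lock, struct toy_unit *unit)
{
    bool matches = (lock == unit->schedule_lock);

    lock->held = false;
    return matches;
}

/*
 * Old pattern: v->sched_unit is read once for locking and again for
 * unlocking. "moved_to" simulates the reassignment happening in between.
 */
static bool toy_get_reread(struct toy_vcpu *v, struct toy_unit *moved_to)
{
    struct toy_lock *lock = toy_unit_lock(v->sched_unit);

    v->sched_unit = moved_to;
    return toy_unit_unlock(lock, v->sched_unit);
}

/*
 * Patched pattern: the unit is chosen once, kept in a local variable,
 * and the same unit is used for both lock and unlock.
 */
static bool toy_get_local(struct toy_vcpu *v, struct toy_unit *moved_to)
{
    struct toy_unit *unit = v->is_idle ? v->sched_unit_idle : v->sched_unit;
    struct toy_lock *lock = toy_unit_lock(unit);

    v->sched_unit = moved_to;
    return toy_unit_unlock(lock, unit);
}
```

With an idle vcpu that is moved from its idle unit to a guest unit, `toy_get_reread()` reports a mismatch while `toy_get_local()` does not, which mirrors the reason the patch passes the local `unit` to both `unit_schedule_lock_irq()` and `unit_schedule_unlock_irq()`.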
[Xen-devel] Core Scheduling "lock == schedule_lock" assertion failure
Hi Juergen,

Recently our testing has found a host crash which is reproducible.
Do you have any idea what might be going on here?

(XEN) [175654.165126] Assertion 'lock == get_sched_res(i->res->master_cpu)->schedule_lock' failed at ...ild/BUILD/xen-4.13.1/xen/include/xen/sched-if.h:269
(XEN) [175654.165133] [ Xen-4.13.1-9.0.3-d x86_64 debug=y Not tainted ]
(XEN) [175654.165136] CPU:28
(XEN) [175654.165138] RIP:e008:[] vcpu_runstate_get+0x11e/0x14f
(XEN) [175654.165146] RFLAGS: 00010083 CONTEXT: hypervisor (d0v4)
(XEN) [175654.165151] rax: 83403ff0d340 rbx: 83807cc97ac8 rcx: 0006
(XEN) [175654.165154] rdx: 006fbf942000 rsi: 83400f8e1cd8 rdi: 107898e2
(XEN) [175654.165158] rbp: 83807cc97ab8 rsp: 83807cc97a88 r8: deadbeefdeadf00d
(XEN) [175654.165160] r9: deadbeefdeadf00d r10: r11:
(XEN) [175654.165164] r12: 83400fa6f000 r13: 83400f8c9778 r14: 82d0805c8008
(XEN) [175654.165167] r15: 832e30854ae0 cr0: 80050033 cr4: 00362660
(XEN) [175654.165170] cr3: 002130811000 cr2: 88817f50b728
(XEN) [175654.165172] fsb: 7f40a40da740 gsb: 88831d30 gss:
(XEN) [175654.165175] ds: es: fs: gs: ss: e010 cs: e008
(XEN) [175654.165179] Xen code around (vcpu_runstate_get+0x11e/0x14f):
(XEN) [175654.165181] 04 10 4c 3b 68 10 74 02 <0f> 0b 4c 89 ef e8 7e 5d 00 00 48 8d 05 41 9d 38
(XEN) [175654.165192] Xen stack trace from rsp=83807cc97a88:
(XEN) [175654.165194]    83807cc97aa8 83400fa75a60 83807cc97da0
(XEN) [175654.165199]    0230 83807cc97fff 83807cc97af8 82d08023d41f
(XEN) [175654.165204]    0001 9fc1ac1cb2f4 4840c423acdc 5780e7f9735a
(XEN) [175654.165207]    83807cc97c98 82d0802ea9f7
(XEN) [175654.165211]    9fc1ac1c6b99 00050007 83807cc97c10
(XEN) [175654.165215]    83807cc97bb0 0020
(XEN) [175654.165251]
(XEN) [175654.165254]
(XEN) [175654.165258]    82d0805c8038 82d0805c74a0 aa00
(XEN) [175654.165263]
(XEN) [175654.165266]
(XEN) [175654.165269]
(XEN) [175654.165273]
(XEN) [175654.165276]
(XEN) [175654.165279]
(XEN) [175654.165283]    83400f813000 83807cc97d98 82d0805cda80
(XEN) [175654.165287]    0230 83807cc97fff 83807cc97cc8 82d08026d99b
(XEN) [175654.165291]    83807cc97ef8 83400f813000 82d0805cda80 0230
(XEN) [175654.165295]    83807cc97e48 82d080244573 7f40a40e6000 0206
(XEN) [175654.165300]    82004006c000 82e08a815e80
(XEN) [175654.165304] Xen call trace:
(XEN) [175654.165306]    [] R vcpu_runstate_get+0x11e/0x14f
(XEN) [175654.165310]    [] F get_cpu_idle_time+0x4d/0x53
(XEN) [175654.165315]    [] F pmstat_get_cx_stat+0x82/0x8e7
(XEN) [175654.165319]    [] F do_get_pm_info+0x27b/0x2d4
(XEN) [175654.165322]    [] F do_sysctl+0x633/0x14e0
(XEN) [175654.165327]    [] F pv_hypercall+0x1f5/0x567
(XEN) [175654.165330]    [] F lstar_enter+0x112/0x120
(XEN) [175654.165332]
(XEN) [175654.550916]
(XEN) [175654.553243]
(XEN) [175654.559449] Panic on CPU 28:
(XEN) [175654.563328] Assertion 'lock == get_sched_res(i->res->master_cpu)->schedule_lock' failed at ...ild/BUILD/xen-4.13.1/xen/include/xen/sched-if
(XEN) [175654.581847]
(XEN) [175654.584173] Reboot in five seconds...
(XEN) [175654.588925] Executing kexec image on cpu28
(XEN) [175654.594987] Shot down all CPUs

The state of the sibling was:

PCPU 29 Host state:
RIP:e008:[] Ring 0
RFLAGS: 00040002 AC IOPL0
rax: 83400f8c91e4 rbx: 001d rcx: 83400f8c91f4
rdx: 83400f8c9104 rsi: 83400f8c9094 rdi: 0004
rbp: 83807cc89f28 rsp: 83807cc89f28 r8:
r9: r10: r11:
r12: r13: r14: 83807cc8
r15: cr0: 80050033 PG