On 16.11.23 17:48, George Dunlap wrote:
On Tue, Nov 14, 2023 at 1:30 PM Juergen Gross <[email protected]> wrote:

When moving a domain out of a cpupool running with the credit2
scheduler and having multiple run-queues, the following ASSERT() can
be observed:

(XEN) Xen call trace:
(XEN)    [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
(XEN)    [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
(XEN)    [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
(XEN)    [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
(XEN)    [<ffff82d040206513>] S domain_kill+0xa5/0x116
(XEN)    [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
(XEN)    [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
(XEN)    [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
(XEN)    [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at 
common/sched/credit2.c:1159
(XEN) ****************************************

This is happening as sched_move_domain() is setting a different cpu
for a scheduling unit without telling the scheduler. When this unit is
removed from the scheduler, the ASSERT() will trigger.

In non-debug builds the result is usually a clobbered pointer, leading
to another crash a short time later.

Fix that by swapping the two involved actions (setting another cpu and
removing the unit from the scheduler).

Cc: Henry Wang <[email protected]>
Link: https://github.com/Dasharo/dasharo-issues/issues/488
Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with 
different granularity")
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- added Link: (reporter didn't want to be added by name)
---
  xen/common/sched/core.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 12deefa745..e9f7486197 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -738,12 +738,13 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
      new_p = cpumask_first(d->cpupool->cpu_valid);

There's a comment just above here which should probably be updated;
something like "Remove all units from the old scheduler, and
temporarily move them to the same processor to make locking easier
when moving the new units to nwe processors."

With that change:

Reviewed-by: George Dunlap <[email protected]>

I could change that on check-if you'd like.

Yes, please (with s/nwe/new/).


I take it at this point this is just for the 4.19 branch, and this
will be a candidate for backport to 4.18.1?

Correct.


Juergen

Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature

Reply via email to