Hi Mykola,
On 05/03/2025 09:11, Mykola Kvach wrote:
If we call disable_nonboot_cpus on ARM64 with system_state set
to SYS_STATE_suspend, the following assertion will be triggered:
```
(XEN) [ 25.582712] Disabling non-boot CPUs ...
(XEN) [ 25.587032] Assertion '!in_irq() && (local_irq_is_enabled() ||
num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
[...]
(XEN) [ 25.975069] Xen call trace:
(XEN) [ 25.978353] [<00000a000022e098>] xfree+0x130/0x1a4 (PC)
(XEN) [ 25.984314] [<00000a000022e08c>] xfree+0x124/0x1a4 (LR)
(XEN) [ 25.990276] [<00000a00002747d4>] release_irq+0xe4/0xe8
(XEN) [ 25.996152] [<00000a0000278588>] time.c#cpu_time_callback+0x44/0x60
(XEN) [ 26.003150] [<00000a000021d678>] notifier_call_chain+0x7c/0xa0
(XEN) [ 26.009717] [<00000a00002018e0>]
cpu.c#cpu_notifier_call_chain+0x24/0x48
(XEN) [ 26.017148] [<00000a000020192c>] cpu.c#_take_cpu_down+0x28/0x34
(XEN) [ 26.023801] [<00000a0000201944>] cpu.c#take_cpu_down+0xc/0x18
(XEN) [ 26.030281] [<00000a0000225c5c>]
stop_machine.c#stopmachine_action+0xbc/0xe4
(XEN) [ 26.038057] [<00000a00002264bc>]
tasklet.c#do_tasklet_work+0xb8/0x100
(XEN) [ 26.045229] [<00000a00002268a4>] do_tasklet+0x68/0xb0
(XEN) [ 26.051018] [<00000a000026e120>] domain.c#idle_loop+0x7c/0x194
(XEN) [ 26.057585] [<00000a0000277e30>] start_secondary+0x21c/0x220
(XEN) [ 26.063978] [<00000a0000361258>] 00000a0000361258
```
This happens because before invoking take_cpu_down via the stop_machine_run
function on the target CPU, stop_machine_run requests
the STOPMACHINE_DISABLE_IRQ state on that CPU. Releasing memory in
the release_irq function then triggers the assertion:
/*
* Heap allocations may need TLB flushes which may require IRQs to be
* enabled (except when only 1 PCPU is online).
*/
#define ASSERT_ALLOC_CONTEXT()
This patch introduces a new tasklet to perform the CPU_DYING call chain for
a particular CPU. However, we cannot call take_cpu_down from the tasklet
because the __cpu_disable function disables local IRQs, causing the system
to crash inside spin_lock_irq, which is called after the tasklet function
invocation inside do_tasklet_work:
void _spin_lock_irq(spinlock_t *lock)
{
ASSERT(local_irq_is_enabled());
To resolve this, take_cpu_down is split into two parts. The first part triggers
the CPU_DYING call chain, while the second part, __cpu_disable, is invoked from
stop_machine_run.
Rather than modifying common code, have you considered allocating from
the IRQ action from the percpu area? This would also reduce the number
of possible failure when bringup up a pCPU.
Cheers,
--
Julien Grall