Re: [PATCH v2] x86/hyperv: Fix kexec panic/hang issues
On Tue, Jan 05, 2021 at 04:39:38PM +0000, Michael Kelley wrote:
> From: Wei Liu Sent: Tuesday, January 5, 2021 5:04 AM
> >
> > On Mon, Dec 21, 2020 at 10:55:41PM -0800, Dexuan Cui wrote:
> > > Currently the kexec kernel can panic or hang due to 2 causes:
> > >
> > > 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
> > > old VP Assist Pages when the kexec kernel runs. The same issue is fixed
> > > for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the
> > > VP assist page for hibernation"). Now fix it for kexec.
> > >
> > > 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
> > > are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
> > > between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
> > > can still try to access the hypercall page and cause a panic. The workaround
> > > "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliable. Move
> > > hyperv_cleanup() to a better place.
> > >
> > > Signed-off-by: Dexuan Cui
> >
> > The code looks a bit intrusive. On the other hand, this does sound like
> > something that needs backporting for older stable kernels.
> >
> > On a more practical note, I need to decide whether to take it via
> > hyperv-fixes or hyperv-next. What do you think?
> >
> I'd like to see this in hyperv-fixes and backported to older stable kernels.
> In its current form, the kexec path in a Hyper-V guest has multiple problems
> that make it unreliable, so the downside risk of taking these fixes is minimal
> while the upside benefit is considerable.

Applied to hyperv-fixes.

Wei.
RE: [PATCH v2] x86/hyperv: Fix kexec panic/hang issues
From: Wei Liu Sent: Tuesday, January 5, 2021 5:04 AM
>
> On Mon, Dec 21, 2020 at 10:55:41PM -0800, Dexuan Cui wrote:
> > Currently the kexec kernel can panic or hang due to 2 causes:
> >
> > 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
> > old VP Assist Pages when the kexec kernel runs. The same issue is fixed
> > for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the
> > VP assist page for hibernation"). Now fix it for kexec.
> >
> > 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
> > are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
> > between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
> > can still try to access the hypercall page and cause a panic. The workaround
> > "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliable. Move
> > hyperv_cleanup() to a better place.
> >
> > Signed-off-by: Dexuan Cui
>
> The code looks a bit intrusive. On the other hand, this does sound like
> something that needs backporting for older stable kernels.
>
> On a more practical note, I need to decide whether to take it via
> hyperv-fixes or hyperv-next. What do you think?
>

I'd like to see this in hyperv-fixes and backported to older stable kernels.
In its current form, the kexec path in a Hyper-V guest has multiple problems
that make it unreliable, so the downside risk of taking these fixes is minimal
while the upside benefit is considerable.

Michael
Re: [PATCH v2] x86/hyperv: Fix kexec panic/hang issues
On Mon, Dec 21, 2020 at 10:55:41PM -0800, Dexuan Cui wrote:
> Currently the kexec kernel can panic or hang due to 2 causes:
>
> 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
> old VP Assist Pages when the kexec kernel runs. The same issue is fixed
> for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the
> VP assist page for hibernation"). Now fix it for kexec.
>
> 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
> are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
> between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
> can still try to access the hypercall page and cause a panic. The workaround
> "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliable. Move
> hyperv_cleanup() to a better place.
>
> Signed-off-by: Dexuan Cui

The code looks a bit intrusive. On the other hand, this does sound like
something that needs backporting for older stable kernels.

On a more practical note, I need to decide whether to take it via
hyperv-fixes or hyperv-next. What do you think?

Wei.
RE: [PATCH v2] x86/hyperv: Fix kexec panic/hang issues
From: Dexuan Cui Sent: Monday, December 21, 2020 10:56 PM
>
> Currently the kexec kernel can panic or hang due to 2 causes:
>
> 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
> old VP Assist Pages when the kexec kernel runs. The same issue is fixed
> for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the
> VP assist page for hibernation"). Now fix it for kexec.
>
> 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
> are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
> between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
> can still try to access the hypercall page and cause a panic. The workaround
> "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliable. Move
> hyperv_cleanup() to a better place.
>
> Signed-off-by: Dexuan Cui
> ---
>
> Changes in v2:
>   Improved the commit log as Michael Kelley suggested.
>   No change to v1 otherwise.
>
>  arch/x86/hyperv/hv_init.c       |  4 ++++
>  arch/x86/include/asm/mshyperv.h |  2 ++
>  arch/x86/kernel/cpu/mshyperv.c  | 18 ++++++++++++++++++
>  drivers/hv/vmbus_drv.c          |  2 --
>  4 files changed, 24 insertions(+), 2 deletions(-)
>

Reviewed-by: Michael Kelley
[PATCH v2] x86/hyperv: Fix kexec panic/hang issues
Currently the kexec kernel can panic or hang due to 2 causes:

1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
old VP Assist Pages when the kexec kernel runs. The same issue is fixed
for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the
VP assist page for hibernation"). Now fix it for kexec.

2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
can still try to access the hypercall page and cause a panic. The workaround
"hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliable. Move
hyperv_cleanup() to a better place.

Signed-off-by: Dexuan Cui
---

Changes in v2:
  Improved the commit log as Michael Kelley suggested.
  No change to v1 otherwise.

 arch/x86/hyperv/hv_init.c       |  4 ++++
 arch/x86/include/asm/mshyperv.h |  2 ++
 arch/x86/kernel/cpu/mshyperv.c  | 18 ++++++++++++++++++
 drivers/hv/vmbus_drv.c          |  2 --
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..4638a52d8eae 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -26,6 +27,8 @@
 #include
 #include
 
+int hyperv_init_cpuhp;
+
 void *hv_hypercall_pg;
 EXPORT_SYMBOL_GPL(hv_hypercall_pg);
 
@@ -401,6 +404,7 @@ void __init hyperv_init(void)
 
 	register_syscore_ops(&hv_syscore_ops);
 
+	hyperv_init_cpuhp = cpuhp;
 	return;
 
 remove_cpuhp_state:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..30f76b966857 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -74,6 +74,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 
 #if IS_ENABLED(CONFIG_HYPERV)
+extern int hyperv_init_cpuhp;
+
 extern void *hv_hypercall_pg;
 
 extern void __percpu **hyperv_pcpu_input_arg;
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f628e3dc150f..43b54bef5448 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -135,14 +135,32 @@ static void hv_machine_shutdown(void)
 {
 	if (kexec_in_progress && hv_kexec_handler)
 		hv_kexec_handler();
+
+	/*
+	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
+	 * corrupts the old VP Assist Pages and can crash the kexec kernel.
+	 */
+	if (kexec_in_progress && hyperv_init_cpuhp > 0)
+		cpuhp_remove_state(hyperv_init_cpuhp);
+
+	/* The function calls stop_other_cpus(). */
 	native_machine_shutdown();
+
+	/* Disable the hypercall page when there is only 1 active CPU. */
+	if (kexec_in_progress)
+		hyperv_cleanup();
 }
 
 static void hv_machine_crash_shutdown(struct pt_regs *regs)
 {
 	if (hv_crash_handler)
 		hv_crash_handler(regs);
+
+	/* The function calls crash_smp_send_stop(). */
 	native_machine_crash_shutdown(regs);
+
+	/* Disable the hypercall page when there is only 1 active CPU. */
+	hyperv_cleanup();
 }
 #endif /* CONFIG_KEXEC_CORE */
 #endif /* CONFIG_HYPERV */
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 502f8cd95f6d..d491fdcee61f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2550,7 +2550,6 @@ static void hv_kexec_handler(void)
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
 	mb();
 	cpuhp_remove_state(hyperv_cpuhp_online);
-	hyperv_cleanup();
 };
 
 static void hv_crash_handler(struct pt_regs *regs)
@@ -2566,7 +2565,6 @@ static void hv_crash_handler(struct pt_regs *regs)
 	cpu = smp_processor_id();
 	hv_stimer_cleanup(cpu);
 	hv_synic_disable_regs(cpu);
-	hyperv_cleanup();
 };
 
 static int hv_synic_suspend(void)
-- 
2.19.1