** Description changed:
- Create VM on Azure using Canonical UbuntuServer 14.04.5-LTS latest using
- size Standard DS2 v2
+ [Impact]
- Install kexec-tools kdump-tools makedumpfile, configure boot kernel
- parameter crashkernel=512M
+ * When kdumping on Trusty/4.15 in an Azure instance, we quite
+ frequently observe a stall in the kdump kernel: it gets blocked, and
+ soon we see a stack trace like the following:
- Run sed -i 's/USE_KDUMP=0/USE_KDUMP=1/g' /etc/default/kdump-tools and
- sed -i 's/LOAD_KEXEC=true/LOAD_KEXEC=false/g' /etc/default/kexec
+ [ 65.452007] INFO: rcu_sched detected stalls on CPUs/tasks:
+ [ 65.456004] 1-...!: (0 ticks this GP) idle=488/0/0 softirq=1/1 fqs=0
+ [ 65.456004] (detected by 0, t=15002 jiffies, g=707, c=706, q=8457)
+ [ 65.456004] rcu_sched kthread starved for 15002 jiffies! g707 c706 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
- Reboot VM, make sure crash kernel memory reserved successfully
+ * Using the Azure serial console, we collected sysrq-w output when
+ the issue happened:
+ [ 529.515013] sysrq: Show Blocked State
+ [ 529.517730] task PC stack pid father
+ [ 529.519006] kworker/u4:2 D 0 94 2 0x80000000
+ [ 529.519006] Workqueue: events_unbound fsnotify_mark_destroy_workfn
+ [ 529.519006] Call Trace:
+ [ 529.519006] __schedule+0x292/0x880
+ [ 529.519006] schedule+0x36/0x80
+ [ 529.519006] schedule_timeout+0x1d5/0x2f0
+ [ 529.519006] ? check_preempt_wakeup+0x162/0x260
+ [ 529.519006] wait_for_completion+0xa5/0x110
+ [ 529.519006] ? wake_up_q+0x80/0x80
+ [ 529.519006] __synchronize_srcu.part.14+0x67/0x80
+ [ 529.519006] ? trace_raw_output_rcu_utilization+0x50/0x50
+ [ 529.519006] ? __switch_to_asm+0x41/0x70
+ [ 529.519006] synchronize_srcu+0xd1/0xd6
+ [ 529.519006] fsnotify_mark_destroy_workfn+0x6d/0xc0
+ [ 529.519006] process_one_work+0x14e/0x390
+ [ 529.519006] worker_thread+0x1cc/0x3d0
+ [ 529.519006] kthread+0x105/0x140
+ [ 529.519006] ? max_active_store+0x60/0x60
+ [ 529.519006] ? kthread_bind+0x20/0x20
+ [ 529.519006] ret_from_fork+0x35/0x40
+ [ 529.519006] udevadm D 0 544 1 0x00000000
+ [ 529.519006] Call Trace:
+ [ 529.519006] __schedule+0x292/0x880
+ [ 529.519006] schedule+0x36/0x80
+ [ 529.519006] schedule_timeout+0x1d5/0x2f0
+ [ 529.519006] ? try_to_wake_up+0x4a/0x460
+ [ 529.519006] ? try_to_wake_up+0x4a/0x460
+ [ 529.519006] wait_for_completion+0xa5/0x110
+ [ 529.519006] ? wake_up_q+0x80/0x80
+ [ 529.519006] __flush_work.isra.29+0x119/0x1b0
+ [ 529.519006] ? destroy_worker+0x90/0x90
+ [ 529.519006] flush_delayed_work+0x3f/0x50
+ [ 529.519006] fsnotify_wait_marks_destroyed+0x15/0x20
+ [ 529.519006] fsnotify_destroy_group+0x4e/0xc0
+ [ 529.519006] inotify_release+0x1e/0x50
+ [ 529.519006] __fput+0xea/0x220
+ [ 529.519006] ____fput+0xe/0x10
+ [ 529.519006] task_work_run+0x8c/0xb0
+ [ 529.519006] exit_to_usermode_loop+0x70/0xa9
+ [ 529.519006] do_syscall_64+0x1b5/0x1e0
+ [ 529.519006] entry_SYSCALL_64_after_hwframe+0x41/0xa6
+ [ 529.519006] dhclient D 0 573 572 0x00000000
+ [ 529.519006] Call Trace:
+ [ 529.519006] __schedule+0x292/0x880
+ [ 529.519006] schedule+0x36/0x80
+ [ 529.519006] schedule_timeout+0x1d5/0x2f0
+ [ 529.519006] ? aa_profile_af_perm+0xb4/0xf0
+ [ 529.519006] wait_for_completion+0xa5/0x110
+ [ 529.519006] ? wake_up_q+0x80/0x80
+ [ 529.519006] __wait_rcu_gp+0x123/0x150
+ [ 529.519006] synchronize_sched+0x4e/0x60
+ [ 529.519006] ? __call_rcu+0x2f0/0x2f0
+ [ 529.519006] ? trace_raw_output_rcu_utilization+0x50/0x50
+ [ 529.519006] synchronize_net+0x1c/0x30
+ [ 529.519006] __unregister_prot_hook+0xcd/0xf0
+ [ 529.519006] packet_do_bind+0x1bd/0x250
+ [ 529.519006] packet_bind+0x2f/0x50
+ [ 529.519006] SYSC_bind+0xd8/0x110
+ [ 529.519006] ? sock_alloc_file+0x91/0x130
+ [ 529.519006] SyS_bind+0xe/0x10
+ [ 529.519006] do_syscall_64+0x80/0x1e0
+ [ 529.519006] entry_SYSCALL_64_after_hwframe+0x41/0xa6
- Trigger a crash by run echo 1 > /proc/sys/kernel/sysrq, then echo c >
- /proc/sysrq-trigger, VM will reboot automatically and after reboot,
- check crash dump is generated under /var/crash
+ * Bisecting mainline kernels, we found that v4.17-rc1 did not
+ reproduce the issue, whereas v4.16 did. A fine-grained git bisect
+ then led us to the fix: the following patch, when backported to a
+ problematic version, resolves the issue: d8e462e19305 ("Drivers: hv:
+ vmbus: Implement Direct Mode for stimer0")
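+ For reference, a rough sketch of the reverse bisect (hunting the
+ commit that fixes the issue rather than the one introducing it),
+ with the build/boot/test steps elided:
+
+   $ git bisect start --term-new=fixed --term-old=broken
+   $ git bisect fixed v4.17-rc1    # kdump completes on this tag
+   $ git bisect broken v4.16       # kdump stalls on this tag
+   # for each candidate git checks out: build, boot, run the kdump
+   # test, then mark the result:
+   $ git bisect fixed     # the kdump kernel did not stall
+   $ git bisect broken    # the kdump kernel stalled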
- Install linux-azure kernel
+ * In Azure/Hyper-V, before the aforementioned commit, timer
+ interrupts were delivered to the guest through a vmbus message, the
+ communication mechanism between Hyper-V guests and the hypervisor.
+ With the patch, a check is made (through an MSR-like mechanism) and,
+ if the hypervisor supports it, a direct timer IRQ mechanism is put
+ in place instead of the vmbus channel.
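+ On a patched kernel, a quick sanity check that direct mode is active
+ (assuming the stimer0 interrupt accounting the same patch adds to
+ /proc/interrupts) is:
+
+   $ grep HVS /proc/interrupts
+   # an "HVS: ... Hyper-V stimer0 interrupts" line with increasing
+   # counters indicates direct-mode timer interrupts; if it is
+   # absent, stimer0 interrupts are still delivered as vmbus messages.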
- Enable private-ppa canonical-kernel-esm
+ * Our theory is that in the kdump kernel, especially due to its
+ single-CPU nature, the vmbus-messaged timer IRQ can interfere with
+ scheduling and create a deadlock condition, which is what we observe
+ in the stack traces. Hence, we propose backporting this patch to the
+ azure 4.15 kernels.
- Install kernel linux-azure, reboot VM, kernel version 4.15.0-1084-azure
+ * Our hypothesis for why this wasn't observed in Bionic is that
+ changes in the kdump mechanism make it start earlier, which might
+ hide the scheduling issue. A lack of testing could also explain why
+ it was never observed there.
- Trigger a crash by run echo 1 > /proc/sys/kernel/sysrq, then echo c >
- /proc/sysrq-trigger, VM entered into a hung status.
+ [Test Case]
- Attach whole serial console
- [ 65.452007] INFO: rcu_sched detected stalls on CPUs/tasks:
- [ 65.456004] 1-...!: (0 ticks this GP) idle=488/0/0 softirq=1/1 fqs=0
- [ 65.456004] (detected by 0, t=15002 jiffies, g=707, c=706, q=8457)
- [ 65.456004] rcu_sched kthread starved for 15002 jiffies! g707 c706 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
- [ 360.020026] INFO: rcu_sched detected stalls on CPUs/tasks:
- [ 360.024015] 1-...!: (10 GPs behind) idle=b34/0/0 softirq=1/1 fqs=1
- [ 360.024015] (detected by 0, t=15002 jiffies, g=717, c=716, q=6429)
- [ 360.024015] rcu_sched kthread starved for 15000 jiffies! g717 c716 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
- [ 420.048010] INFO: rcu_sched detected stalls on CPUs/tasks:
- [ 420.052006] 1-...!: (0 ticks this GP) idle=f94/0/0 softirq=1/1 fqs=0
- [ 420.052006] (detected by 0, t=15002 jiffies, g=718, c=717, q=6429)
- [ 420.052006] rcu_sched kthread starved for 15002 jiffies! g718 c717 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
+ * Perform a kdump on Trusty/4.15; after a few rounds, you should see
+ the RCU stack trace above. With the patch, no issue is observed, and
+ we were able to perform ~30 kdumps in a row. A rough reproducer
+ follows.
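+ The reproducer below uses the same setup as the original report
+ (crashkernel=512M on the kernel command line, kexec-tools and
+ kdump-tools installed):
+
+   $ sudo sed -i 's/USE_KDUMP=0/USE_KDUMP=1/' /etc/default/kdump-tools
+   $ sudo reboot
+   # after reboot, confirm the crash kernel is loaded:
+   $ cat /sys/kernel/kexec_crash_loaded
+   1
+   # trigger the crash:
+   $ sudo sh -c 'echo 1 > /proc/sys/kernel/sysrq'
+   $ sudo sh -c 'echo c > /proc/sysrq-trigger'
+   # on an affected kernel, the kdump kernel stalls with the RCU
+   # trace above; on a fixed kernel, the VM reboots and a dump
+   # appears under /var/crash.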
+
+ [Where problems could occur]
+
+ * Since the patch touches IRQ code (although restricted to Hyper-V),
+ it is a somewhat invasive change. It could lead to issues in the IRQ
+ subsystem, although the patch has been upstream since 2018 and no
+ follow-up fixes for it were found (we rely on Fixes: tags to spot
+ those).
** Also affects: linux-azure (Ubuntu Bionic)
Importance: Undecided
Status: New
** Changed in: linux-azure (Ubuntu)
Assignee: Guilherme G. Piccoli (gpiccoli) => (unassigned)
** Changed in: linux-azure (Ubuntu)
Assignee: (unassigned) => Marcelo Cerri (mhcerri)
** Changed in: linux-azure (Ubuntu Bionic)
Assignee: (unassigned) => Marcelo Cerri (mhcerri)
** Changed in: linux-azure (Ubuntu Bionic)
Status: New => In Progress
** Changed in: linux-azure (Ubuntu)
Importance: Undecided => Medium
** Changed in: linux-azure (Ubuntu Bionic)
Importance: Undecided => Medium
https://bugs.launchpad.net/bugs/1882623
Title:
VM enter into hung status after triggering a crash