Logs:

[  594.291317] ACPI: Hardware changed while hibernated, success doubtful!
[  594.411609] BUG: kernel NULL pointer dereference, address: 00000000000001f4
[  594.424658] #PF: supervisor write access in kernel mode
[  594.424660] #PF: error_code(0x0002) - not-present page
[  594.424661] PGD 0 P4D 0
[  594.424665] Oops: 0002 [#1] SMP PTI
[  594.424668] CPU: 3 PID: 362 Comm: systemd-timesyn Not tainted 5.8.0-1036-aws #38~20.04.1-Ubuntu
[  594.424669] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  594.424675] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40
[...]
[  594.424692] Call Trace:
[  594.424699]  xennet_start_xmit+0x158/0x570

Looking at the assembly, this is where it fails:

ffffffff8182d8e0 <xennet_start_xmit>:
[...]
ffffffff8182da0a: 48 83 f8 0d             cmp    $0xd,%rax
ffffffff8182da0e: 0f 86 46 03 00 00       jbe    ffffffff8182dd5a <xennet_start_xmit+0x47a>

### pahole --hex -C netfront_queue usr/lib/debug/boot/vmlinux-5.8.0-1035-aws |grep lock
###### spinlock_t tx_lock; /* 0x1f4 0x4 */

ffffffff8182da14: 49 8d 86 f4 01 00 00    lea    0x1f4(%r14),%rax   # <-- %rax = &queue->tx_lock
ffffffff8182da1b: 45 8b 4c 24 70          mov    0x70(%r12),%r9d
ffffffff8182da20: 45 2b 4c 24 74          sub    0x74(%r12),%r9d
ffffffff8182da25: 48 89 c7                mov    %rax,%rdi          # <-- %rdi = %rax
ffffffff8182da28: 44 89 4d 94             mov    %r9d,-0x6c(%rbp)
ffffffff8182da2c: 48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
ffffffff8182da33: e8 38 52 36 00          callq  ffffffff81b92c70 <_raw_spin_lock_irqsave>   # <-- OOPS here
[...]

Correlating this with the code, we have the following in C:

static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
[...]
        len = skb_headlen(skb);

        spin_lock_irqsave(&queue->tx_lock, flags); // <<< HERE
        if (unlikely(!netif_carrier_ok(dev) ||
[...]

It turns out that queue itself is NULL: the faulting address, 0x1f4, is exactly the offset of tx_lock inside struct netfront_queue, so &queue->tx_lock evaluates to 0x1f4 and the write in _raw_spin_lock_irqsave faults.

What is interesting is the message:

[  594.291317] ACPI: Hardware changed while hibernated, success doubtful!

So it means the instance resumed on a different compute node than the one it hibernated on. I'm still not 100% sure why that would cause such an OOPS, but I have 2 ideas to either prevent it or validate that hypothesis:

(a) Unload the Xen network driver with modprobe right before hibernation and load it again in the last stage of wake-up. I've done that myself in hibernation scripts. This would likely prevent the issue.

(b) For testing personnel: maybe you're able to "lock" the testing so that sleep/wake-up *always* happens on the same compute node/instance. If that prevents the issue, we're sure that the difference between the nodes from sleep to wake-up is triggering some memory corruption in the device queue, which is somehow propagated to kernel memory.
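
As a side note on the faulting address: the small C sketch below shows why taking the address of a member through a NULL struct pointer yields the member's offset, which is how 0x1f4 shows up in the oops. The struct here is a hypothetical stand-in, not the real struct netfront_queue; only the tx_lock offset (0x1f4) and size (0x4) are taken from the pahole output above, and the helper name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct netfront_queue. Only the offset (0x1f4)
 * and size (4 bytes) of tx_lock come from the pahole output in this report;
 * the leading fields are replaced by an opaque placeholder.
 */
struct mock_netfront_queue {
    unsigned char before_lock[0x1f4]; /* placeholder for the real leading fields */
    uint32_t tx_lock;                 /* stands in for the 4-byte spinlock_t */
};

/* Compile-time check that the mock reproduces the offset reported by pahole. */
_Static_assert(offsetof(struct mock_netfront_queue, tx_lock) == 0x1f4,
               "mock layout must place tx_lock at 0x1f4");

/*
 * Mirrors what "lea 0x1f4(%r14),%rax" computes: the queue pointer value plus
 * the offset of tx_lock. Uses integer arithmetic so that no invalid pointer
 * is ever formed or dereferenced.
 */
static inline uintptr_t tx_lock_address(uintptr_t queue)
{
    return queue + offsetof(struct mock_netfront_queue, tx_lock);
}
```

Calling tx_lock_address(0) models the case where %r14 (queue) is 0 and gives 0x1f4, the address reported in the "BUG: kernel NULL pointer dereference" line. Any valid queue pointer would instead give an address 0x1f4 bytes past the start of that queue.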

--
https://bugs.launchpad.net/bugs/1934424
Title: kernel NULL pointer dereference during xen hibernation