Logs:
[  594.291317] ACPI: Hardware changed while hibernated, success doubtful!
[  594.411609] BUG: kernel NULL pointer dereference, address: 00000000000001f4
[  594.424658] #PF: supervisor write access in kernel mode
[  594.424660] #PF: error_code(0x0002) - not-present page
[  594.424661] PGD 0 P4D 0 
[  594.424665] Oops: 0002 [#1] SMP PTI
[  594.424668] CPU: 3 PID: 362 Comm: systemd-timesyn Not tainted 5.8.0-1036-aws #38~20.04.1-Ubuntu
[  594.424669] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  594.424675] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40
[...]
[  594.424692] Call Trace:
[  594.424699]  xennet_start_xmit+0x158/0x570
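
As a reading aid for the two "#PF:" lines above, here is a minimal sketch
that decodes the low three bits of an x86 page-fault error code. The
struct and function names (pf_info, decode_pf_error) are made up for this
illustration and are not kernel APIs; only the bit meanings are
architectural.

```c
#include <stdbool.h>

/* Low three bits of the x86 page-fault error code. */
#define PF_PRESENT 0x1UL /* 0: page not present, 1: protection violation */
#define PF_WRITE   0x2UL /* 0: read access,      1: write access */
#define PF_USER    0x4UL /* 0: supervisor mode,  1: user mode */

/* Hypothetical helper type, for illustration only. */
struct pf_info {
    bool present;
    bool write;
    bool user;
};

/* Split an error code into the three flags defined above. */
static struct pf_info decode_pf_error(unsigned long code)
{
    struct pf_info info = {
        .present = (code & PF_PRESENT) != 0,
        .write   = (code & PF_WRITE) != 0,
        .user    = (code & PF_USER) != 0,
    };
    return info;
}
```

For the error code in this oops, decode_pf_error(0x0002) gives
present = false, write = true, user = false, which matches "supervisor
write access" and "not-present page" in the log.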

Looking at the assembly, this is where it fails:

ffffffff8182d8e0 <xennet_start_xmit>:
[...]
ffffffff8182da0a:       48 83 f8 0d             cmp    $0xd,%rax
ffffffff8182da0e:       0f 86 46 03 00 00       jbe    ffffffff8182dd5a <xennet_start_xmit+0x47a>

### pahole --hex -C netfront_queue  usr/lib/debug/boot/vmlinux-5.8.0-1035-aws | grep lock
###### spinlock_t                 tx_lock;              /* 0x1f4   0x4 */

ffffffff8182da14:       49 8d 86 f4 01 00 00    lea    0x1f4(%r14),%rax # <-- %rax = &queue->tx_lock
ffffffff8182da1b:       45 8b 4c 24 70          mov    0x70(%r12),%r9d
ffffffff8182da20:       45 2b 4c 24 74          sub    0x74(%r12),%r9d
ffffffff8182da25:       48 89 c7                mov    %rax,%rdi # <-- %rdi = %rax
ffffffff8182da28:       44 89 4d 94             mov    %r9d,-0x6c(%rbp)
ffffffff8182da2c:       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
ffffffff8182da33:       e8 38 52 36 00          callq  ffffffff81b92c70 <_raw_spin_lock_irqsave> # <-- OOPS here
[...]

By correlating with the code, we have this in C:

static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
[...]
len = skb_headlen(skb);

spin_lock_irqsave(&queue->tx_lock, flags); // <<< HERE

if (unlikely(!netif_carrier_ok(dev) ||
[...]

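It turns out that queue itself is NULL: the faulting address (0x1f4) is
exactly the offset of tx_lock, so &queue->tx_lock evaluates to 0x1f4.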
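
To make the address arithmetic concrete, here is a minimal user-space
sketch. struct mock_queue and tx_lock_address() are hypothetical
stand-ins, not the real struct netfront_queue; the only thing taken from
the bug is the 0x1f4 offset and 4-byte size reported by pahole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct netfront_queue: the padding replaces
 * the real fields, so that tx_lock sits at offset 0x1f4 as pahole shows. */
struct mock_queue {
    unsigned char pad[0x1f4];
    uint32_t tx_lock; /* 4 bytes, like spinlock_t in the pahole output */
};

/* Compile-time check that the mock layout matches the pahole offset. */
static_assert(offsetof(struct mock_queue, tx_lock) == 0x1f4,
              "tx_lock must sit at offset 0x1f4");

/* Mirrors "lea 0x1f4(%r14),%rax": base pointer value plus member offset. */
static uintptr_t tx_lock_address(uintptr_t queue_base)
{
    return queue_base + offsetof(struct mock_queue, tx_lock);
}
```

With a base of 0, tx_lock_address(0) returns 0x1f4, the same value as the
faulting address in the oops. That is what one would expect if %r14
(queue) held NULL when the lea executed.
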
What is interesting is the message:

[  594.291317] ACPI: Hardware changed while hibernated, success doubtful!


So the instance resumed from hibernation on a different compute node than
the one it went to sleep on. I'm still not 100% sure why that would cause
such an oops, but I have two ideas to either prevent it or validate that
hypothesis:

(a) Unload the Xen network driver with modprobe right before hibernation
and load it again in the last stage of wake-up. I've done that myself in
the hibernation scripts. This would likely prevent the issue.

(b) For testing personnel: maybe you're able to pin the test so that it
*always* sleeps and wakes up on the same compute node/instance. If that
prevents the issue, it strongly suggests that the difference between the
nodes from sleep to wake-up is triggering some memory corruption in the
device queue, which is somehow propagated to kernel memory.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1934424

Title:
  kernel NULL pointer dereference during xen hibernation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1934424/+subscriptions

