Looking at both traces, the crash consistently happens inside 
run_timer_softirq(), and from the offset I would guess we are in the inlined 
__run_timers(). Another noteworthy detail is the value of RAX: it is 
LIST_POISON2, the value written into the pprev field of a struct hlist_node 
when the node is removed from its list, to mark that pointer as invalid.
So I would guess something modified the list of pending timers (these exist 
per CPU) while softirq processing was working on it. The hard part is saying 
what did. I am not sure a dump will help, since in races like this the clues 
often vanish right after the damage is done.
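
To make the LIST_POISON2 reasoning concrete, here is a minimal userspace 
sketch of how an hlist-style delete poisons the node. It is a model, not the 
kernel source: the names hnode, hhead, model_add_head, model_del and 
model_is_poisoned are made up for illustration, and the poison constants are 
stand-ins (the real values depend on kernel version and architecture).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the kernel's hlist_node / hlist_head (names are hypothetical). */
struct hnode {
    struct hnode *next;
    struct hnode **pprev;   /* address of the pointer that points at this node */
};

struct hhead {
    struct hnode *first;
};

/* Stand-in poison values; the kernel's real constants differ by version and arch. */
#define MODEL_POISON1 ((struct hnode *)(uintptr_t)0x100)
#define MODEL_POISON2 ((struct hnode **)(uintptr_t)0x200)

/* Insert n at the front of the list headed by h. */
static inline void model_add_head(struct hnode *n, struct hhead *h)
{
    assert(n != NULL && h != NULL);
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n, then poison its pointers so later misuse is recognisable. */
static inline void model_del(struct hnode *n)
{
    struct hnode *next = n->next;
    struct hnode **pprev = n->pprev;

    *pprev = next;          /* a second delete of n would write through the poison here */
    if (next)
        next->pprev = pprev;

    n->next = MODEL_POISON1;
    n->pprev = MODEL_POISON2;
}

/* True if n looks like it was already removed from a list. */
static inline int model_is_poisoned(const struct hnode *n)
{
    return n->pprev == MODEL_POISON2;
}
```

In this model, anything that still holds a pointer to a node after model_del() 
and tries to unlink it again would load the poison value from pprev and 
dereference it. That would be consistent with seeing the poison in a register 
at the time of the fault, though the sketch says nothing about who did the 
first delete.
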
I would tentatively suspect the xen-netfront area, given that, as far as I can 
tell, this has not happened on bare-metal servers and, from the description, 
mostly seems to affect high-traffic instances.
Would it be possible to volunteer one affected instance and try mainline 
kernels (https://wiki.ubuntu.com/Kernel/MainlineBuilds) between 3.19 and 4.2 
(that is, 4.0 and 4.1) and/or later ones (4.3, maybe 4.4)? The 4.0 and 4.1 
kernels would give a smaller delta to search for what broke things, while 4.3 
or 4.4 would show whether it has since been fixed by a change that was not 
identified as a stable patch.

https://bugs.launchpad.net/bugs/1534345

Title:
  Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced
  Networking