Looking at both traces, this appears to happen consistently inside run_timer_softirq(), and from the offset I would guess we are in the inlined __run_timers. Another noteworthy detail is the value of RAX: it is LIST_POISON2, which is used to mark the (hlist_node *)->pprev pointer as invalid. So I would guess something modified the list of pending timers (those exist per CPU) while softirq processing was working on them. The problem is saying what. I am not sure a dump will help, as in races like this the clues often disappear right after causing the problem.

I would tentatively suspect the xen-netfront area, given that, as far as I can tell, this has not happened on bare-metal servers and, from the description, seems to affect high-traffic instances.

Would it be possible to volunteer one affected instance and try mainline kernels (https://wiki.ubuntu.com/Kernel/MainlineBuilds) between 3.19 and 4.2 (4.0, 4.1) and/or later ones (4.3, maybe 4.4)? The 4.0 and 4.1 kernels would give a smaller delta to look at for what broke things; 4.3 or 4.4 would show whether it has already been fixed upstream without the fix being identified as a stable patch.
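To illustrate why LIST_POISON2 in a register points at a stale list node, here is a minimal userspace sketch modelled on the generic hlist helpers in include/linux/list.h. It is not the timer code itself. The poison constants are assumptions for x86_64 with the default CONFIG_ILLEGAL_POINTER_VALUE on kernels of this era, and hlist_node_is_poisoned() and demo_stale_walker() are hypothetical helpers added only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: x86_64 values with the default CONFIG_ILLEGAL_POINTER_VALUE;
 * other architectures and later kernels use different constants. */
#define POISON_POINTER_DELTA 0xdead000000000000ULL
#define LIST_POISON1 ((void *)(uintptr_t)(0x100 + POISON_POINTER_DELTA))
#define LIST_POISON2 ((void *)(uintptr_t)(0x200 + POISON_POINTER_DELTA))

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Insert n at the front of the list headed by h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n and poison its pointers, as the generic hlist_del() does. */
static void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
    n->next = LIST_POISON1;
    n->pprev = LIST_POISON2;
}

/* Hypothetical helper: true if n was already unlinked by hlist_del(). */
static int hlist_node_is_poisoned(const struct hlist_node *n)
{
    return n->pprev == (struct hlist_node **)LIST_POISON2;
}

/* Hypothetical scenario: a walker keeps a pointer to a node while another
 * path deletes it. Returns 1 if the walker's node is now poisoned. */
static int demo_stale_walker(void)
{
    struct hlist_head pending = { NULL };
    struct hlist_node a, b;
    struct hlist_node *walker;

    hlist_add_head(&a, &pending);
    hlist_add_head(&b, &pending);

    walker = &a;        /* the walker is about to process a */
    hlist_del(&a);      /* a concurrent path removes it first */

    /* Writing through walker->pprev here would fault in the kernel. */
    return hlist_node_is_poisoned(walker);
}
```

Calling demo_stale_walker() from a main() shows the state a second deletion would run into: dereferencing the poisoned pprev is what would leave LIST_POISON2 in a register at the time of the fault.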
https://bugs.launchpad.net/bugs/1534345
Title: Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced Networking
