Public bug reported:

== Comment: #0 - SRIKANTH AITHAL <bssrika...@in.ibm.com> - 2019-02-20 23:42:23 ==

---Problem Description---
While running KVM guests, we are observing numad crashes on the host.

Contact Information = srikanth/bssrika...@in.ibm.com

---uname output---
Linux ltcgen6 4.15.0-1016-ibm-gt #18-Ubuntu SMP Thu Feb 7 16:58:31 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = witherspoon

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1. Check the status of numad; if it is stopped, start it
2. Start a KVM guest
3. Run some memory tests inside the guest

On the host, after a few minutes, we see numad crashing. I had enabled debug logging for numad and see the messages below in numad.log before it crashes:

8870669: PID 88781: (qemu-system-ppc), Threads 6, MBs_size 15871, MBs_used 11262, CPUs_used 400, Magnitude 4504800, Nodes: 0,8
Thu Feb 21 00:12:10 2019: PICK NODES FOR: PID: 88781, CPUs 470, MBs 18671
Thu Feb 21 00:12:10 2019: PROCESS_MBs[0]: 9201
Thu Feb 21 00:12:10 2019: Node[0]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[1]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[2]: mem: 1878026 cpu: 4666
Thu Feb 21 00:12:10 2019: Node[3]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[4]: mem: 0 cpu: 6
Thu Feb 21 00:12:10 2019: Node[5]: mem: 2194058 cpu: 4728
Thu Feb 21 00:12:10 2019: Totmag[0]: 94112134
Thu Feb 21 00:12:10 2019: Totmag[1]: 109211855
Thu Feb 21 00:12:10 2019: Totmag[2]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[3]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[4]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[5]: 2990058
Thu Feb 21 00:12:10 2019: best_node_ix: 1
Thu Feb 21 00:12:10 2019: Node: 8 Dist: 10 Magnitude: 10373506224
Thu Feb 21 00:12:10 2019: Node: 0 Dist: 40 Magnitude: 8762869316
Thu Feb 21 00:12:10 2019: Node: 253 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 254 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 252 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 255 Dist: 80 Magnitude: 0
Thu Feb 21 00:12:10 2019: MBs: 18671, CPUs: 470
Thu Feb 21 00:12:10 2019: Assigning resources from node 5
Thu Feb 21 00:12:10 2019: Node[0]: mem: 2007348 cpu: 1908
Thu Feb 21 00:12:10 2019: MBs: 0, CPUs: 0
Thu Feb 21 00:12:10 2019: Assigning resources from node 2
Thu Feb 21 00:12:10 2019: Process 88781 already 100 percent localized to target nodes.
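
For anyone sifting through a longer numad.log, the candidate-node lines ("Node: N Dist: D Magnitude: M") in the excerpt above can be pulled out with a short script. This is only a sketch: the regular expression is inferred from the excerpt in this report, not from numad's source, and `parse_node_lines` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log excerpt in this report (an assumption,
# not taken from numad's source). It matches "Node: 8 Dist: 10 Magnitude: ..."
# and deliberately does not match the "Node[0]: mem: ... cpu: ..." lines.
NODE_LINE = re.compile(r"Node: (\d+) Dist: (\d+) Magnitude: (\d+)")


def parse_node_lines(log_text):
    """Return a list of (node_id, distance, magnitude) integer tuples."""
    return [tuple(int(g) for g in m.groups())
            for m in NODE_LINE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "Thu Feb 21 00:12:10 2019: Node: 8 Dist: 10 Magnitude: 10373506224\n"
        "Thu Feb 21 00:12:10 2019: Node: 0 Dist: 40 Magnitude: 8762869316\n"
        "Thu Feb 21 00:12:10 2019: Node: 253 Dist: 80 Magnitude: 0\n"
    )
    for node_id, dist, magnitude in parse_node_lines(sample):
        print(node_id, dist, magnitude)
```

To scan a whole file, pass its contents to the helper, for example `parse_node_lines(open(path).read())`, where `path` points at wherever numad.log lives on the host.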

In syslog we see signal 11:

[88726.086144] numad[88879]: unhandled signal 11 at 000000e38fe72688 nip 0000782ce4dcac20 lr 0000782ce4dcf85c code 1

Stack trace output:
 no

Oops output:
 no

System Dump Info:
  The system was configured to capture a dump, however a dump was not produced.

*Additional Instructions for srikanth/bssrika...@in.ibm.com:
-Attach sysctl -a output to the bug.

== Comment: #2 - SRIKANTH AITHAL <bssrika...@in.ibm.com> - 2019-02-20 23:44:38 ==

== Comment: #3 - SRIKANTH AITHAL <bssrika...@in.ibm.com> - 2019-02-20 23:48:20 ==
I was using stressapptest to run a memory workload inside the guest:
`stressapptest -s 200`

== Comment: #5 - Brian J. King <bjki...@us.ibm.com> - 2019-03-08 09:17:29 ==
Any update on this?

== Comment: #6 - Leonardo Bras Soares Passos <leona...@ibm.com> - 2019-03-08 11:59:16 ==
Yes, I have been working on this for a while.

After a suggestion from @lagarcia, I tested the bug on the same machine, booted on the default kernel (4.15.0-45-generic), and also booted the VM with the same generic kernel.

The result is that the bug also happens with 4.15.0-45-generic. So it may not be caused by the changes included in kernel 4.15.0-1016.18-fix1-ibm-gt.

A few things I noticed that may be useful for solving this bug:
- I had a very hard time reproducing the bug with a numad instance that was started at boot. If I restart it, or stop/start it, the bug reproduces much more easily.
- I debugged numad using gdb and found that it is getting a segfault in _int_malloc(), from glibc.

Attached is an occurrence of the bug, captured while numad was running under gdb.
(systemctl start numad ; gdb /usr/bin/numad $NUMAD_PID)

== Comment: #7 - Leonardo Bras Soares Passos <leona...@ibm.com> - 2019-03-08 12:00:00 ==

== Comment: #8 - Leonardo Bras Soares Passos <leona...@ibm.com> - 2019-03-11 17:04:25 ==
I reverted the whole system to vanilla Ubuntu Bionic, and booted on the 4.15.0-45-generic kernel.
Linux ltcgen6 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:27:02 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Then I booted the guest, also on 4.15.0-45-generic.
Linux ubuntu 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:27:02 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

I tried to reproduce the error, and I was able to.
This probably means the bug was not introduced by the changes to qemu/kernel, and that it is present in the current Ubuntu repository.

The next step should be deeper debugging of numad, in order to identify why it is getting a segfault.

** Affects: numad (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New

** Tags: architecture-ppc64le bugnameltc-175673 severity-high targetmilestone-inin---

** Tags added: architecture-ppc64le bugnameltc-175673 severity-high targetmilestone-inin---

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1832915

Title:
  numad crashes while running kvm guest

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/numad/+bug/1832915/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs