Public bug reported:

== Comment: #0 - SRIKANTH AITHAL <bssrika...@in.ibm.com> - 2019-02-20 23:42:23 ==
---Problem Description---
While running KVM guests, we observe numad crashing on the host.
 
Contact Information = srikanth/bssrika...@in.ibm.com 
 
---uname output---
Linux ltcgen6 4.15.0-1016-ibm-gt #18-Ubuntu SMP Thu Feb 7 16:58:31 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = witherspoon 
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
1. Check the status of numad; if it is stopped, start it.
2. Start a KVM guest.
3. Run some memory tests inside the guest.
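
Step 1 above can be scripted. The following is only a minimal sketch, assuming numad runs as a systemd unit named `numad` (as the later `systemctl start numad` comment suggests). The helper name `ensure_service_running` and its `run` parameter are hypothetical and exist only so the logic can be exercised without touching a real system. Steps 2 and 3 depend on the guest setup and are not covered.

```python
import subprocess


def ensure_service_running(unit="numad", run=subprocess.run):
    """Start `unit` through systemctl if it is not already active.

    Returns True if the unit had to be started, False if it was
    already active. `run` is injectable for testing (hypothetical
    parameter, defaults to subprocess.run).
    """
    # `systemctl is-active --quiet UNIT` exits with 0 only when the unit is active.
    is_active = run(["systemctl", "is-active", "--quiet", unit]).returncode == 0
    if is_active:
        return False
    # check=True makes subprocess.run raise CalledProcessError if the start fails.
    run(["systemctl", "start", unit], check=True)
    return True
```

Calling `ensure_service_running()` as root before starting the guest covers step 1. Since restarting numad is reported to make the crash easier to hit, a stop/start cycle may be the more useful variant when trying to reproduce.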

On the host, after a few minutes, we see numad crash. I had enabled debug
logging for numad and see the messages below in numad.log before it crashes:

8870669: PID 88781: (qemu-system-ppc), Threads  6, MBs_size  15871, MBs_used  11262, CPUs_used  400, Magnitude 4504800, Nodes: 0,8
Thu Feb 21 00:12:10 2019: PICK NODES FOR:  PID: 88781,  CPUs 470,  MBs 18671
Thu Feb 21 00:12:10 2019: PROCESS_MBs[0]: 9201
Thu Feb 21 00:12:10 2019:     Node[0]: mem: 0  cpu: 6
Thu Feb 21 00:12:10 2019:     Node[1]: mem: 0  cpu: 6
Thu Feb 21 00:12:10 2019:     Node[2]: mem: 1878026  cpu: 4666
Thu Feb 21 00:12:10 2019:     Node[3]: mem: 0  cpu: 6
Thu Feb 21 00:12:10 2019:     Node[4]: mem: 0  cpu: 6
Thu Feb 21 00:12:10 2019:     Node[5]: mem: 2194058  cpu: 4728
Thu Feb 21 00:12:10 2019: Totmag[0]: 94112134
Thu Feb 21 00:12:10 2019: Totmag[1]: 109211855
Thu Feb 21 00:12:10 2019: Totmag[2]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[3]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[4]: 2990058
Thu Feb 21 00:12:10 2019: Totmag[5]: 2990058
Thu Feb 21 00:12:10 2019: best_node_ix: 1
Thu Feb 21 00:12:10 2019: Node: 8  Dist: 10  Magnitude: 10373506224
Thu Feb 21 00:12:10 2019: Node: 0  Dist: 40  Magnitude: 8762869316
Thu Feb 21 00:12:10 2019: Node: 253  Dist: 80  Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 254  Dist: 80  Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 252  Dist: 80  Magnitude: 0
Thu Feb 21 00:12:10 2019: Node: 255  Dist: 80  Magnitude: 0
Thu Feb 21 00:12:10 2019: MBs: 18671,  CPUs: 470
Thu Feb 21 00:12:10 2019: Assigning resources from node 5
Thu Feb 21 00:12:10 2019:     Node[0]: mem: 2007348  cpu: 1908
Thu Feb 21 00:12:10 2019: MBs: 0,  CPUs: 0
Thu Feb 21 00:12:10 2019: Assigning resources from node 2
Thu Feb 21 00:12:10 2019: Process 88781 already 100 percent localized to target nodes.


In syslog we see signal 11 (SIGSEGV):
[88726.086144] numad[88879]: unhandled signal 11 at 000000e38fe72688 nip 0000782ce4dcac20 lr 0000782ce4dcf85c code 1
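
To compare several crashes, the fields of the kernel line above can be pulled apart mechanically. This is a minimal sketch: the regular expression is written against the single line quoted in this report, and the function name `parse_unhandled_signal` is hypothetical. On powerpc, `nip` is the address of the faulting instruction and `lr` is the link register. For SIGSEGV, code 1 corresponds to SEGV_MAPERR (address not mapped), as far as I know.

```python
import re

# Matches the powerpc "unhandled signal" message as quoted in this report.
# All numeric fields except pid and signal are treated as hexadecimal.
SEGV_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: unhandled signal (?P<sig>\d+) "
    r"at (?P<addr>[0-9a-f]+) nip (?P<nip>[0-9a-f]+) "
    r"lr (?P<lr>[0-9a-f]+) code (?P<code>[0-9a-f]+)"
)


def parse_unhandled_signal(line):
    """Return the fields of an 'unhandled signal' line as a dict, or None."""
    # Collapse whitespace first, since mail clients may wrap the line.
    match = SEGV_RE.search(" ".join(line.split()))
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        "signal": int(match.group("sig")),
        "fault_addr": int(match.group("addr"), 16),
        "nip": int(match.group("nip"), 16),
        "lr": int(match.group("lr"), 16),
        "code": int(match.group("code"), 16),
    }


line = (
    "[88726.086144] numad[88879]: unhandled signal 11 at 000000e38fe72688 nip "
    "0000782ce4dcac20 lr 0000782ce4dcf85c code 1"
)
info = parse_unhandled_signal(line)
print(info["comm"], info["pid"], info["signal"], hex(info["fault_addr"]))
```

Subtracting the load address of the mapped library (from `/proc/<pid>/maps` of a live numad) from `nip` would give an offset that can be looked up in the library's symbols. That step is not shown because the mapping is not part of this report.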


 
Stack trace output:
 no
 
Oops output:
 no
 
System Dump Info:
  The system was configured to capture a dump, however a dump was not produced.
 
*Additional Instructions for srikanth/bssrika...@in.ibm.com: 
-Attach sysctl -a output to the bug.

== Comment: #2 - SRIKANTH AITHAL <bssrika...@in.ibm.com> - 2019-02-20 23:44:38 ==


== Comment: #3 - SRIKANTH AITHAL <bssrika...@in.ibm.com> - 2019-02-20 23:48:20 ==
I was using stressapptest to run a memory workload inside the guest:
`stressapptest -s 200`

== Comment: #5 - Brian J. King <bjki...@us.ibm.com> - 2019-03-08 09:17:29 ==
Any update on this?

== Comment: #6 - Leonardo Bras Soares Passos <leona...@ibm.com> - 2019-03-08 11:59:16 ==
Yes, I have been working on this for a while.

Following a suggestion from @lagarcia, I tested the bug on the same machine, 
with the host booted on the default kernel (4.15.0-45-generic) and the VM 
booted on the same generic kernel. 
The bug also happens with 4.15.0-45-generic, so it may not be caused by the 
changes included in kernel 4.15.0-1016.18-fix1-ibm-gt.

A few things I noticed that may help in solving this bug:
- I had a very hard time reproducing the bug with the numad instance started 
at boot. If I restart it, or stop and start it, the bug reproduces much more 
easily.
- I debugged numad using gdb and found that it segfaults in _int_malloc(), 
from glibc.

Attached is an occurrence of the bug, captured while numad was running under gdb.
(systemctl start numad ; gdb /usr/bin/numad $NUMAD_PID)

== Comment: #7 - Leonardo Bras Soares Passos <leona...@ibm.com> - 2019-03-08 12:00:00 ==


== Comment: #8 - Leonardo Bras Soares Passos <leona...@ibm.com> - 2019-03-11 17:04:25 ==
I reverted the whole system to vanilla Ubuntu Bionic and booted the 
4.15.0-45-generic kernel.
Linux ltcgen6 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:27:02 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Then I booted the guest, also on 4.15.0-45-generic.
Linux ubuntu 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:27:02 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

I tried to reproduce the error, and I was able to.
This probably means the bug was not introduced by the qemu/kernel changes, 
and that it is present in the packages currently in the Ubuntu repository.

The next step should be to debug numad more deeply, in order to identify
why it segfaults.

** Affects: numad (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New


** Tags: architecture-ppc64le bugnameltc-175673 severity-high targetmilestone-inin---

https://bugs.launchpad.net/bugs/1832915

Title:
  numad crashes while running kvm guest
