I was trying to recreate this on x86 with a 128G guest and 64 CPUs.
I see numad action:

Thu Jul 18 10:51:22 2019: Advising pid 13197 (qemu-system-x86) move from nodes (0-1) to nodes (1)
Thu Jul 18 10:51:23 2019: PID 13197 moved to node(s) 1 in 0.19 seconds

Running stressapptest [1] on the host and in the guest for a while triggered
more of those moves, without crashes (as expected).

Restarting numad did not break it on this system.
A shutdown seems to do a re-evaluation and then go on as usual:
Thu Jul 18 11:00:54 2019: Shutting down numad
Thu Jul 18 11:00:54 2019: Registering numad version 20150602 PID 15629
Thu Jul 18 11:01:01 2019: Advising pid 15500 (stressapptest) move from nodes (0-1) to nodes (0-1)
Thu Jul 18 11:01:01 2019: PID 15500 moved to node(s) 0-1 in 0.0 seconds
Thu Jul 18 11:01:06 2019: Advising pid 13197 (qemu-system-x86) move from nodes (0-1) to nodes (0-1)
Thu Jul 18 11:01:06 2019: PID 13197 moved to node(s) 0-1 in 0.0 seconds
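
To compare such logs across systems, the "moved" lines can be condensed
to one "PID nodes" pair per event. This is only a rough sketch: the
function name summarize_numad_moves is made up for illustration, and it
assumes the log text is piped in on stdin in the same format as the
lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of numad): read numad log text on stdin
# and print "<pid> <nodes>" for every "PID <n> moved to node(s) <x>" line.
summarize_numad_moves() {
    awk '/ PID [0-9]+ moved to node\(s\) / {
        pid = ""; nodes = ""
        for (i = 1; i < NF; i++) {
            if ($i == "PID")     pid   = $(i + 1)   # field after "PID"
            if ($i == "node(s)") nodes = $(i + 1)   # field after "node(s)"
        }
        print pid, nodes
    }'
}
```

Usage would be something like piping the numad log file (wherever it is
written on the system in question) into summarize_numad_moves.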


So the assumption for now is that this is either ppc64el specific or
even specific to our particular P9 (dradis).

Lowering importance as it seems not to be a general issue.
I'll ping Frank to ask whether he wants to reverse-mirror this to IBM.

[1]: https://github.com/stressapptest/stressapptest/releases

** Changed in: numad (Ubuntu)
   Importance: Undecided => Low

** Changed in: numad (Ubuntu)
       Status: New => Confirmed

** Description changed:

  While verifying bug 1832915 I found "by accident" that this crash (at
  least on our Power 9 box) seems to happen often.
  
  Case:
  - huge kvm guest running
  - restart numad
  => Numad crashes.
+ 
+ 
+ Steps to recreate:
+ 1. deploy P9 Bionic (or later) system
+ 2. install uvtool
+    $ apt install uvtool-libvirt
+ 3. log out & in to get permissions right
+ 4. sync images
+    $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=ppc64el label=daily release=eoan
+ 5. install and manually start numad
+    $ apt install numad
+    $ systemctl start numad
+ 6. spawn guest
+    $ uvt-kvm create --memory $((1024*64)) --cpu 64 --password ubuntu eoan arch=ppc64el release=eoan label=daily
+ 7. restart numad
+    $ systemctl restart numad
  
  The crash seems related to some re-init of a static structure:
  
  stack trace ---
  #0  tcache_get (tc_idx=<optimized out>) at malloc.c:2950
          e = 0x9a5ddc1950
          e = <optimized out>
          __PRETTY_FUNCTION__ = "tcache_get"
  #1  __GI___libc_malloc (bytes=16) at malloc.c:3058
          ar_ptr = <optimized out>
          victim = <optimized out>
          hook = <optimized out>
          tbytes = <optimized out>
          tc_idx = <optimized out>
          __PRETTY_FUNCTION__ = "__libc_malloc"
  #2  0x0000009a300279a0 in ?? ()
  No symbol table info available.
  #3  0x0000009a3002cad8 in ?? ()
  No symbol table info available.
  #4  0x0000009a30023794 in ?? ()
  No symbol table info available.
  #5  0x00007a6150998278 in generic_start_main (main=0x9a30022a00, argc=<optimized out>, argv=0x7fffe93a7828, auxvec=0x7fffe93a7880, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
          self = 0x7a6150dc38d0
          result = <optimized out>
          unwind_buf = {cancel_jmp_buf = {{jmp_buf = {8465053667230565969, 134558384812288, 8465057470262718529, 0 <repeats 13 times>, 134558387008032, 0, 134558387008040, 662230455376, 0, 2449962883098869759, 0 <repeats 42 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7fffe93a7700, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = -382044416}}}
          not_first_call = <optimized out>
  #6  0x00007a6150998484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
  No locals.
  #7  0x0000000000000000 in ?? ()
  No symbol table info available.
  --- source code stack trace ---
  #0  tcache_get (tc_idx=<optimized out>) at malloc.c:2950
    [Error: malloc.c was not found in source tree]
  #1  __GI___libc_malloc (bytes=16) at malloc.c:3058
    [Error: malloc.c was not found in source tree]
  #2  0x0000009a300279a0 in ?? ()
  #3  0x0000009a3002cad8 in ?? ()
  #4  0x0000009a30023794 in ?? ()
  #5  0x00007a6150998278 in generic_start_main (main=0x9a30022a00, argc=<optimized out>, argv=0x7fffe93a7828, auxvec=0x7fffe93a7880, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
    [Error: libc-start.c was not found in source tree]
  #6  0x00007a6150998484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
    [Error: libc-start.c was not found in source tree]
  #7  0x0000000000000000 in ?? ()
  
  I thought at first this would be related to my debug rebuilds, but it
- seems to appear as-is.
+ seems to appear as-is in the version as it is in the Ubuntu Archive.
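
The steps to recreate from the description can be strung together
roughly as below. This is a sketch, not a tested reproducer: the run and
recreate helpers and the DRY_RUN variable are made up for illustration,
the commands themselves are copied from the steps, and the "log out &
in" step cannot be scripted, so a real run may need to be split at that
point. By default it only prints what it would execute.

```shell
#!/bin/sh
# Sketch of the recreate steps. DRY_RUN, run and recreate are
# illustrative names, not part of uvtool or numad.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them
# and needs sufficient privileges for apt and systemctl.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

recreate() {
    run apt install uvtool-libvirt &&
    # Manual step here: log out & in to get permissions right.
    run uvt-simplestreams-libvirt --verbose sync \
        --source http://cloud-images.ubuntu.com/daily \
        arch=ppc64el label=daily release=eoan &&
    run apt install numad &&
    run systemctl start numad &&
    run uvt-kvm create --memory $((1024*64)) --cpu 64 --password ubuntu \
        eoan arch=ppc64el release=eoan label=daily &&
    run systemctl restart numad
}

recreate
```

With DRY_RUN=0 the last command is the numad restart that triggers the
crash on the affected P9 system.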

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1836913

Title:
  crash (on ppc64) when restarting numad while huge guest is active

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/numad/+bug/1836913/+subscriptions
