I was trying to recreate this on x86 with a 128G guest and 64 CPUs. I see numad action:
Thu Jul 18 10:51:22 2019: Advising pid 13197 (qemu-system-x86) move from nodes (0-1) to nodes (1)
Thu Jul 18 10:51:23 2019: PID 13197 moved to node(s) 1 in 0.19 seconds

Running stressapptest [1] in the host and the guest for a while triggered more of those, without crashes (as expected).

Restarting numad did not break it on this system. A shutdown seems to do a re-evaluation and then go on as usual:
Thu Jul 18 11:00:54 2019: Shutting down numad
Thu Jul 18 11:00:54 2019: Registering numad version 20150602 PID 15629
Thu Jul 18 11:01:01 2019: Advising pid 15500 (stressapptest) move from nodes (0-1) to nodes (0-1)
Thu Jul 18 11:01:01 2019: PID 15500 moved to node(s) 0-1 in 0.0 seconds
Thu Jul 18 11:01:06 2019: Advising pid 13197 (qemu-system-x86) move from nodes (0-1) to nodes (0-1)
Thu Jul 18 11:01:06 2019: PID 13197 moved to node(s) 0-1 in 0.0 seconds

So the assumption for now is that this is either ppc64el specific or even specific to our particular P9 (dradis).
Lowering importance as it seems not to be a general issue.
I'll ping Frank to ask if he wants to reverse mirror that to IBM.

[1]: https://github.com/stressapptest/stressapptest/releases

** Changed in: numad (Ubuntu)
   Importance: Undecided => Low

** Changed in: numad (Ubuntu)
       Status: New => Confirmed

** Description changed:

  while verifying bug 1832915 I found "by accident" that this crash (at
  least on our power 9 box) seems to happen often.

  Case:
  - huge kvm guest running
  - restart numad
  => Numad crashes.
+ 
+ 
+ Steps to recreate:
+ 1. deploy P9 Bionic (or later) system
+ 2. install uvtool
+    $ apt install uvtool-libvirt
+ 3. log out & in to get permissions right
+ 4. sync images
+    $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=ppc64el label=daily release=eoan
+ 5. install and manually start numad
+    $ apt install numad
+    $ systemctl start numad
+ 6. spawn guest
+    $ uvt-kvm create --memory $((1024*64)) --cpu 64 --password ubuntu eoan arch=ppc64el release=eoan label=daily
+ 7. restart numad
+    $ systemctl restart numad

  The crash seems related to some re-init of a static structure:

  --- stack trace ---
  #0  tcache_get (tc_idx=<optimized out>) at malloc.c:2950
          e = 0x9a5ddc1950
          e = <optimized out>
          __PRETTY_FUNCTION__ = "tcache_get"
  #1  __GI___libc_malloc (bytes=16) at malloc.c:3058
          ar_ptr = <optimized out>
          victim = <optimized out>
          hook = <optimized out>
          tbytes = <optimized out>
          tc_idx = <optimized out>
          __PRETTY_FUNCTION__ = "__libc_malloc"
  #2  0x0000009a300279a0 in ?? ()
  No symbol table info available.
  #3  0x0000009a3002cad8 in ?? ()
  No symbol table info available.
  #4  0x0000009a30023794 in ?? ()
  No symbol table info available.
  #5  0x00007a6150998278 in generic_start_main (main=0x9a30022a00, argc=<optimized out>, argv=0x7fffe93a7828, auxvec=0x7fffe93a7880, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
          self = 0x7a6150dc38d0
          result = <optimized out>
          unwind_buf = {cancel_jmp_buf = {{jmp_buf = {8465053667230565969, 134558384812288, 8465057470262718529, 0 <repeats 13 times>, 134558387008032, 0, 134558387008040, 662230455376, 0, 2449962883098869759, 0 <repeats 42 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7fffe93a7700, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = -382044416}}}
          not_first_call = <optimized out>
  #6  0x00007a6150998484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
  No locals.
  #7  0x0000000000000000 in ?? ()
  No symbol table info available.

  --- source code stack trace ---
  #0  tcache_get (tc_idx=<optimized out>) at malloc.c:2950
    [Error: malloc.c was not found in source tree]
  #1  __GI___libc_malloc (bytes=16) at malloc.c:3058
    [Error: malloc.c was not found in source tree]
  #2  0x0000009a300279a0 in ?? ()
  #3  0x0000009a3002cad8 in ?? ()
  #4  0x0000009a30023794 in ?? ()
  #5  0x00007a6150998278 in generic_start_main (main=0x9a30022a00, argc=<optimized out>, argv=0x7fffe93a7828, auxvec=0x7fffe93a7880, init=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:308
    [Error: libc-start.c was not found in source tree]
  #6  0x00007a6150998484 in __libc_start_main (argc=<optimized out>, argv=<optimized out>, ev=<optimized out>, auxvec=<optimized out>, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116
    [Error: libc-start.c was not found in source tree]
  #7  0x0000000000000000 in ?? ()

  I thought at first this would be related to my debug rebuilds, but it
- seems to appear as-is.
+ seems to appear as-is in the version as it is in the Ubuntu Archive.

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1836913

Title:
  crash (on ppc64) when restarting numad while huge guest is active

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/numad/+bug/1836913/+subscriptions
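
For comparing runs across systems (for example the x86 run against the P9 box), the "moved" lines in the numad log can be condensed into one line per migration. The helper below is only a sketch: the function name numad_moves is hypothetical, and it assumes the log lines have exactly the shape quoted in this report ("PID <pid> moved to node(s) <nodes> in <secs> seconds").

```shell
#!/bin/sh
# numad_moves (hypothetical helper): read numad log text on stdin and
# print "<pid> <nodes> <seconds>" for every completed move.
# Assumes lines shaped like the ones quoted in this report, e.g.
#   Thu Jul 18 10:51:23 2019: PID 13197 moved to node(s) 1 in 0.19 seconds
numad_moves() {
    # -n plus the trailing p prints only lines that matched the pattern;
    # "Advising ..." and other log lines are dropped silently.
    sed -n 's/^.*: PID \([0-9][0-9]*\) moved to node(s) \([0-9,-][0-9,-]*\) in \([0-9.][0-9.]*\) seconds.*$/\1 \2 \3/p'
}
```

It could be fed with something like "numad_moves < /var/log/numad.log", assuming numad writes its log to that path on the system in question. Whether the daemon survived the restart can be checked separately with "systemctl is-active numad".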
