Hi,
the stack shows that it core dumps inside initgroups(), which is a libc
function. I suspect that although you run slurmctld as root, SlurmUser in
slurm.conf is not root. Try setting it to root, or better yet, don't run
as root at all.
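If it helps, here is a rough sketch (not anything Slurm ships) of how one might check which SlurmUser a given slurm.conf selects and whether that account resolves on the head node. The names parse_slurm_user and user_resolves are made up for illustration. It assumes SlurmUser defaults to root when omitted and that the key is matched case-insensitively. A crash in strlen() underneath initgroups() would be consistent with a missing user name, but that is only a guess from the backtrace.

```python
import pwd
import re


def parse_slurm_user(conf_text):
    """Return the SlurmUser value found in slurm.conf text.

    Falls back to "root", assumed here to be the default when the
    option is omitted. If the option appears more than once, the
    last occurrence wins (an assumption of this sketch).
    """
    user = "root"
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # strip trailing comments
        match = re.search(r"(?i)\bSlurmUser\s*=\s*(\S+)", line)
        if match:
            user = match.group(1)
    return user


def user_resolves(name):
    """Return True if the name can be found in the passwd database."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False
```

Pass it the contents of the slurm.conf that slurmctld actually reads. If user_resolves() returns False for the name reported by parse_slurm_user(), fix the account or the SlurmUser setting before starting the daemon again.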
On 05/28/2014 05:51 AM, Vsevolod Nikonorov wrote:
Hello,
I ran into a problem: when started with -D, slurmctld segfaults for no
apparent reason:
[root@head ~]# /usr/local/slurm/sbin/slurmctld -D
slurmctld: Resetting MaxJobCnt from 4294967294 to 4294942092 (MaxJobId -
FirstJobId + 1)
slurmctld: Resetting MaxJobCnt from 4294967294 to 4294942092 (MaxJobId -
FirstJobId + 1)
Segmentation fault (core dumped)
I tried to strace it:
open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 5
read(5, "65536\n", 31) = 6
close(5) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
(this is the last operation before the segfault occurs; I can
provide the full trace if necessary).
Also, I tried to run it under gdb:
[root@head ~]# gdb --core=core.12468 --directory=/root/slurm-14.03.3-2
/usr/local/slurm/sbin/slurmctld
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/slurm/sbin/slurmctld...done.
[New Thread 12468]
Reading symbols from /lib64/libdl.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
warning: no loadable sections found in added symbol-file system-supplied
DSO at 0x7fff76fa0000
Core was generated by `/usr/local/slurm/sbin/slurmctld -D'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000003455678f40 in strlen () from /lib64/libc.so.6
(gdb) bt full
#0 0x0000003455678f40 in strlen () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000034557067fd in __nscd_getgrouplist () from /lib64/libc.so.6
No symbol table info available.
#2 0x00000034556974c4 in internal_getgrouplist () from /lib64/libc.so.6
No symbol table info available.
#3 0x000000345569760a in initgroups () from /lib64/libc.so.6
No symbol table info available.
#4 0x00000000004321f1 in _become_slurm_user (argc=2,
argv=0x7fff76e467f8) at controller.c:2314
No locals.
#5 main (argc=2, argv=0x7fff76e467f8) at controller.c:302
cnt = <value optimized out>
error_code = <value optimized out>
i = <value optimized out>
thread_attr = {__size =
"`f\344v\001\000\000\000\210@\271\177B+\000\000`\023\270\177B+\000\000pg\344v\377\177\000\000\000\020\270\177B+\000\000+oA\000\000\000\000\000U\000\000\000\064\000\000",
__align = 6289647200}
stat_buf = {st_dev = 0, st_ino = 47564610605920, st_nlink =
140735188068160, st_mode = 1994680048, st_uid = 32767, st_gid =
4131212846, pad0 = 0, st_rdev = 140735188068104, st_size = 0, st_blksize
= 224766497538, st_blocks = 0,
st_atim = {tv_sec = 47564610683016, tv_nsec = 223338299393},
st_mtim = {tv_sec = 0, tv_nsec = 223338299393}, st_ctim = {tv_sec =
224771820320, tv_nsec = 140735188067936}, __unused = {0, 85, 194}}
config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK,
node = WRITE_LOCK, partition = WRITE_LOCK}
callbacks = {acct_full = 0, dbd_fail = 0, dbd_resumed = 0,
db_fail = 0xf0b2ff, db_resumed = 0xc2}
dir_name = <value optimized out>
Remembering some previous problems, I suspect that an uninitialised
variable in some structure (one that represents an option omitted from
slurm.conf) may be causing this. Could someone please give me some
hints?
Thanks!
--
Thanks,
/David/Bigagli
www.schedmd.com