You might consider rebuilding with compiler optimization disabled; we have seen that help with similar symptoms.
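
In case it is useful, here is a rough sketch of such a rebuild. It assumes a standard autoconf build where CFLAGS can be overridden on the configure command line. The function name is made up for illustration, and the example paths are guesses taken from the gdb command line and binary path in your report.

```shell
# Sketch: rebuild Slurm with compiler optimization turned off.
# rebuild_unoptimized is a hypothetical helper name, not part of Slurm.
# The body runs in a subshell, so the caller's working directory is untouched.
rebuild_unoptimized() (
    src_dir=$1   # unpacked source tree (assumed: /root/slurm-14.03.3-2)
    prefix=$2    # install prefix (assumed: /usr/local/slurm)
    cd "$src_dir" || exit 1
    # -O0 disables optimization; -g keeps debug symbols for gdb.
    ./configure --prefix="$prefix" CFLAGS="-O0 -g" || exit 1
    # Clean first so that objects from the optimized build are not reused.
    make clean && make && make install
)

# Example call (paths are assumptions, adjust to your setup):
#   rebuild_unoptimized /root/slurm-14.03.3-2 /usr/local/slurm
```

With an unoptimized build, the locals that gdb currently reports as `<value optimized out>` should show up in `bt full`, which may make the failing initgroups() call easier to inspect.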
On Wed, May 28, 2014 at 8:52 AM, Vsevolod Nikonorov <[email protected]> wrote:
>
> Hello,
>
> I ran into a problem: when started with -D, slurmctld segfaults without
> visible reasons:
>
> [root@head ~]# /usr/local/slurm/sbin/slurmctld -D
> slurmctld: Resetting MaxJobCnt from 4294967294 to 4294942092 (MaxJobId - FirstJobId + 1)
> slurmctld: Resetting MaxJobCnt from 4294967294 to 4294942092 (MaxJobId - FirstJobId + 1)
> Segmentation fault (core dumped)
>
> I tried to strace it:
>
> open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 5
> read(5, "65536\n", 31) = 6
> close(5) = 0
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV (core dumped) +++
>
> (this is the last operation after which the segfault occurs, I can
> provide the full trace if necessary).
>
> Also, I tried to run it under gdb:
>
> [root@head ~]# gdb --core=core.12468 --directory=/root/slurm-14.03.3-2 /usr/local/slurm/sbin/slurmctld
> GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/local/slurm/sbin/slurmctld...done.
> [New Thread 12468]
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
> [Thread debugging using libthread_db enabled]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libnss_files.so.2
>
> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff76fa0000
> Core was generated by `/usr/local/slurm/sbin/slurmctld -D'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x0000003455678f40 in strlen () from /lib64/libc.so.6
> (gdb) bt full
> #0 0x0000003455678f40 in strlen () from /lib64/libc.so.6
>     No symbol table info available.
> #1 0x00000034557067fd in __nscd_getgrouplist () from /lib64/libc.so.6
>     No symbol table info available.
> #2 0x00000034556974c4 in internal_getgrouplist () from /lib64/libc.so.6
>     No symbol table info available.
> #3 0x000000345569760a in initgroups () from /lib64/libc.so.6
>     No symbol table info available.
> #4 0x00000000004321f1 in _become_slurm_user (argc=2, argv=0x7fff76e467f8) at controller.c:2314
>     No locals.
> #5 main (argc=2, argv=0x7fff76e467f8) at controller.c:302
>     cnt = <value optimized out>
>     error_code = <value optimized out>
>     i = <value optimized out>
>     thread_attr = {__size = "`f\344v\001\000\000\000\210@\271\177B+\000\000`\023\270\177B+\000\000pg\344v\377\177\000\000\000\020\270\177B+\000\000+oA\000\000\000\000\000U\000\000\000\064\000\000",
>       __align = 6289647200}
>     stat_buf = {st_dev = 0, st_ino = 47564610605920, st_nlink = 140735188068160,
>       st_mode = 1994680048, st_uid = 32767, st_gid = 4131212846, pad0 = 0,
>       st_rdev = 140735188068104, st_size = 0, st_blksize = 224766497538, st_blocks = 0,
>       st_atim = {tv_sec = 47564610683016, tv_nsec = 223338299393},
>       st_mtim = {tv_sec = 0, tv_nsec = 223338299393},
>       st_ctim = {tv_sec = 224771820320, tv_nsec = 140735188067936}, __unused = {0, 85, 194}}
>     config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK}
>     callbacks = {acct_full = 0, dbd_fail = 0, dbd_resumed = 0, db_fail = 0xf0b2ff, db_resumed = 0xc2}
>     dir_name = <value optimized out>
>
> Remembering some previous problems I suspect that some uninitialised
> variable in some structure (which represents some omitted option in
> slurmd.conf) may cause such effect. Could someone please give me some hints?
>
> Thanks!
>
> --
> Никоноров Всеволод Дмитриевич, ОИТТиС, НИКИЭТ
>
> Vsevolod D. Nikonorov, JSC NIKET
>
