You might consider rebuilding Slurm with compiler optimization disabled; we
saw that help with similar symptoms.
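
A minimal sketch of a rebuild without optimization, assuming the source tree
and install prefix that appear in the gdb session quoted below
(/root/slurm-14.03.3-2 and /usr/local/slurm). The options of the original
configure run are not known, so add back whatever was used there. Passing
CFLAGS to configure is the generic autoconf mechanism, not something specific
to Slurm, and configure may append flags of its own:

```shell
# Assumed paths, taken from the quoted gdb session; adjust to your setup.
cd /root/slurm-14.03.3-2

# Ask for debug info and no optimization. Only --prefix is inferred from the
# path of the installed binary; any other original options must be re-added.
./configure --prefix=/usr/local/slurm CFLAGS="-g -O0"

# Rebuild from scratch so no optimized objects are reused, then reinstall.
make clean
make
make install
```

When several -O options end up on the compiler command line, gcc uses the
last one, so it is worth checking the compile lines that make prints. With an
unoptimized build, the locals that gdb currently shows as
<value optimized out> should become readable in a fresh core.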


On Wed, May 28, 2014 at 8:52 AM, Vsevolod Nikonorov <[email protected]>
wrote:

>
> Hello,
>
> I ran into a problem: when started with -D, slurmctld segfaults for no
> apparent reason:
>
> [root@head ~]# /usr/local/slurm/sbin/slurmctld -D
> slurmctld: Resetting MaxJobCnt from 4294967294 to 4294942092 (MaxJobId -
> FirstJobId + 1)
> slurmctld: Resetting MaxJobCnt from 4294967294 to 4294942092 (MaxJobId -
> FirstJobId + 1)
> Segmentation fault (core dumped)
>
>
>
> I tried to strace it:
>
> open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 5
> read(5, "65536\n", 31)                  = 6
> close(5)                                = 0
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV (core dumped) +++
>
>
> (these are the last operations before the segfault occurs; I can
> provide the full trace if necessary).
>
>
>
>
> Also, I tried to run it under gdb:
>
> [root@head ~]# gdb --core=core.12468 --directory=/root/slurm-14.03.3-2
> /usr/local/slurm/sbin/slurmctld
> GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/local/slurm/sbin/slurmctld...done.
> [New Thread 12468]
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libpthread.so.0...(no debugging symbols
> found)...done.
> [Thread debugging using libthread_db enabled]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/libc.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libnss_files.so.2
>
> warning: no loadable sections found in added symbol-file system-supplied
> DSO at 0x7fff76fa0000
> Core was generated by `/usr/local/slurm/sbin/slurmctld -D'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000003455678f40 in strlen () from /lib64/libc.so.6
> (gdb) bt full
> #0  0x0000003455678f40 in strlen () from /lib64/libc.so.6
> No symbol table info available.
> #1  0x00000034557067fd in __nscd_getgrouplist () from /lib64/libc.so.6
> No symbol table info available.
> #2  0x00000034556974c4 in internal_getgrouplist () from /lib64/libc.so.6
> No symbol table info available.
> #3  0x000000345569760a in initgroups () from /lib64/libc.so.6
> No symbol table info available.
> #4  0x00000000004321f1 in _become_slurm_user (argc=2, argv=0x7fff76e467f8)
> at controller.c:2314
> No locals.
> #5  main (argc=2, argv=0x7fff76e467f8) at controller.c:302
>         cnt = <value optimized out>
>         error_code = <value optimized out>
>         i = <value optimized out>
>         thread_attr = {__size = "`f\344v\001\000\000\000\210@\271\177B+\000\000`\023\270\177B+\000\000pg\344v\377\177\000\000\000\020\270\177B+\000\000+oA\000\000\000\000\000U\000\000\000\064\000\000", __align = 6289647200}
>         stat_buf = {st_dev = 0, st_ino = 47564610605920, st_nlink =
> 140735188068160, st_mode = 1994680048, st_uid = 32767, st_gid = 4131212846,
> pad0 = 0, st_rdev = 140735188068104, st_size = 0, st_blksize =
> 224766497538, st_blocks = 0,
>           st_atim = {tv_sec = 47564610683016, tv_nsec = 223338299393},
> st_mtim = {tv_sec = 0, tv_nsec = 223338299393}, st_ctim = {tv_sec =
> 224771820320, tv_nsec = 140735188067936}, __unused = {0, 85, 194}}
>         config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node =
> WRITE_LOCK, partition = WRITE_LOCK}
>         callbacks = {acct_full = 0, dbd_fail = 0, dbd_resumed = 0, db_fail
> = 0xf0b2ff, db_resumed = 0xc2}
>         dir_name = <value optimized out>
>
>
>
>
>
> Based on some previous problems, I suspect that an uninitialised field in
> some structure (one that corresponds to an option omitted from
> slurmd.conf) may cause this. Could someone please give me some hints?
>
> Thanks!
>
>
>
>
> --
> Никоноров Всеволод Дмитриевич, ОИТТиС, НИКИЭТ
>
> Vsevolod D. Nikonorov, JSC NIKET
>
