Anyone? This is blocking job submission from one of our main nodes. Any ideas on what might cause this, or on how to debug it further, are welcome.
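
One hypothesis that could be tested, assuming (unconfirmed) that conf->crypto_type is still NULL when read_config.c:2836 runs on the SL5.7 build: passing a NULL pointer to strcmp() is undefined behaviour and typically segfaults with glibc, which would match frame #0 of the backtrace quoted below. Running "print conf->crypto_type" in gdb at the crash may show whether the pointer is 0x0, unless the value has been optimized out. The sketch below is not Slurm code; struct demo_conf, safe_streq() and uses_openssl_crypto() are hypothetical names used only to illustrate a NULL-safe version of that kind of comparison.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the config structure; the real Slurm
 * structure has many more fields. */
struct demo_conf {
    const char *crypto_type;   /* may be NULL if never set */
};

/* Hypothetical helper, not part of Slurm: returns 1 if both strings
 * are non-NULL and equal, 0 otherwise. Never hands NULL to strcmp(). */
static int safe_streq(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return 0;
    return strcmp(a, b) == 0;
}

/* NULL-safe version of the kind of test seen at read_config.c:2836. */
static int uses_openssl_crypto(const struct demo_conf *conf)
{
    if (conf == NULL)
        return 0;
    return safe_streq(conf->crypto_type, "crypto/openssl");
}
```

With an unset crypto_type the guarded check simply reports "not openssl" instead of crashing. If the pointer does turn out to be NULL, the open question would be why the default is not applied on SL5.7 when it is on SL6.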

On 07.05.2014, at 11:42, Mario Kadastik <[email protected]> wrote:

> Hi,
>
> yesterday I upgraded Slurm from 2.5.3 to 14.11 pre-1 (i.e. the current git
> clone yesterday). The installation went just fine after I updated the spec
> file to the proper contents (rpmbuild doesn't like spaces in the spec file
> Name: etc definitions so having "See META file" will break the rpmbuild).
> However the one SL5.7 node we have and that worked just fine with slurm 2.5.3
> now segfaults for every slurm command.
>
> Here's the backtrace:
>
> # gdb /usr/bin/squeue
> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/bin/squeue...done.
> (gdb) run
> Starting program: /usr/bin/squeue
> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffb000
> [Thread debugging using libthread_db enabled]
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000000459159 in _validate_and_set_defaults (file_name=<value optimized out>) at read_config.c:2836
> 2836 if ((strcmp(conf->crypto_type, "crypto/openssl") == 0) &&
> (gdb) bt
> #0 0x0000000000459159 in _validate_and_set_defaults (file_name=<value optimized out>) at read_config.c:2836
> #1 _init_slurm_conf (file_name=<value optimized out>) at read_config.c:2475
> #2 0x000000000045b6fd in slurm_conf_init (file_name=0x0) at read_config.c:2528
> #3 0x000000000042419d in main (argc=1, argv=0x0) at squeue.c:78
> (gdb)
>
> We use AuthType auth/munge so I'm not quite sure why it segfaults on the
> conf->crypto_type comparison and even more I cannot fathom why it does it
> only on the SL5.7 node while it works just fine on all the SL6 nodes. I used
> the same tarball on SL6 and SL5 to create the RPMs using rpmbuild -ta
> slurm-14-11.pre1.tgz.
>
> Ideas are welcome as the SL5.7 node is one of the main user nodes where they
> create code and submit to cluster so it has to work even though the full rest
> of the cluster works fine. The config btw is shared over NFS so it is
> identical on all nodes.
>
> Mario Kadastik, PhD
> Senior researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not why
> we do it"
> -- Richard P. Feynman

Mario Kadastik, PhD
Senior researcher

---
"Physics is like sex, sure it may have practical reasons, but that's not why
we do it"
-- Richard P. Feynman
