Did this ever get resolved? I am finding that slurmctld is dumping core on OSX (slurm version 2-2-7-1):
omega:slurm-2-2-7-1 rw$ sudo ./sbin/slurmctld -D
slurmctld: error: Configured MailProg is invalid
slurmctld: slurmctld version 2.2.7 started on cluster (null)
slurmctld: error: read_slurm_conf: default partition not set.
slurmctld: error: Could not open node state file /tmp/node_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: error: Could not open reservation state file /tmp/resv_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
slurmctld: No reservation state file (/tmp/resv_state.old) to recover
slurmctld: Recovered state of 0 reservations
slurmctld: error: Could not open trigger state file /tmp/trigger_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
slurmctld: error: Incomplete trigger data checkpoint file
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Reinitializing job accounting state
Segmentation fault (core dumped)

omega:slurm-2-2-7-1 rw$ cat etc/slurm.conf
ControlMachine=omega
NodeName=omega
PartitionName=basic

On May 20, 5:12 pm, "Mark A. Grondona" <[email protected]> wrote:
> On Fri, 20 May 2011 10:28:01 -0700, Danny Auble <[email protected]> wrote:
> > Tyler, what you have found is the exact reason this code exists.
>
> > Those variables exist in the slurmctld but not in anything else. Since all
> > programs load the plugins the same way we define them there and when the
> > slurmctld loads things the symbols are overwritten with the real ones.
>
> Perhaps a better idea would be to export a function[1] or set of
> functions to access global variables that might be required by plugins.
> (though if you are doing that often you might rethink that strategy and
> export more useful functions to plugins). I think plugins are loaded
> with RTLD_LAZY, so the symbol is only resolved if you try to call it.
> This weak-import/symbol interpositioning usage seems fragile.
>
> mark
>
> [1] One idea might be to have a single function
>
>   void * get_slurm_global_symbol (const char *name)
>
> A quick implementation of this function might use dlsym()
> to try getting the symbol from a global namespace. A better
> implementation would contain a list of supported symbols
> by name, and return NULL if an unsupported symbol is accessed
> from a plugin.
>
> mark
>
> > There is even a nice comment that explains this...
>
> > /* These are defined here so when we link with something other than
> >  * the slurmctld we will have these symbols defined. They will get
> >  * overwritten when linking with the slurmctld.
> >  */
>
> > Danny
>
> > On Friday, May 20, 2011 09:30:47 AM Tyler Strickland wrote:
> > > Danny,
>
> > > I've traced the error in starting slurmd down to the dlopen line in
> > > src/common/plugin.c in the plugin_load_from_file function (line 176).
> > > What's strange is that both slurmd and slurmctld load plugins in the
> > > same way - via the slurm_select_init function in slurmd/slurmd.c and
> > > slurmctld/controller.c. Note that I re-added the part_list and job_list
> > > variables to select_cons_res.c as extern Lists - making them extern
> > > seems to have had the same effect as removing them altogether.
>
> > > Here's the error:
>
> > > May 20 12:27:14 head slurmd[78007]: error: plugin_load_from_file:
> > > dlopen(/usr/local/lib/slurm/select_linear.so):
> > > dlopen(/usr/local/lib/slurm/select_linear.so, 1): Symbol not found: _part_list\n
> > > Referenced from: /usr/local/lib/slurm/select_linear.so\n
> > > Expected in: dynamic lookup
> > > May 20 12:27:14 head slurmd[78007]: error: Couldn't load specified
> > > plugin name for select/linear: Dlopen of plugin file failed
> > > May 20 12:27:14 head slurmd[78007]: fatal: Can't find plugin for
> > > select/linear
>
> > > I'm not sure why one program can access the data without any issues and
> > > another can't. Very strange.
>
> > > --Tyler
>
> > > On 05/19/2011 10:19 PM, Danny Auble wrote:
> > > > Hey Tyler,
>
> > > > I don't think you can call this one solved yet since your patch
> > > > probably is creating a host of other problems you aren't aware of yet.
> > > > The slurmd being just one of them. I am guessing quite a few of the
> > > > user tools won't work either.
> > > > You may be on the right track though, perhaps there is something other
> > > > than a weak import needed in the APPLE section.
>
> > > > Danny
>
> > > > On Thursday, May 19, 2011 07:12:25 PM Tyler Strickland wrote:
> > > >> Jon, Danny, and Moe,
>
> > > >> After several hours of scouring through the code and trying to find out
> > > >> why it wasn't working I finally hit upon something - if I comment out
> > > >> the __APPLE__ section in select_cons_res.c AND the part_list and
> > > >> job_list declarations, I can get slurmctld to start. Unfortunately,
> > > >> that change kills slurmd - and in such a manner that it dies with exit
> > > >> code 01, nothing in the log, and nothing printed anywhere - not a single
> > > >> clue to its death. Arggh. One step closer and one step further away.
>
> > > >> Tyler
>
> > > >> On 05/16/2011 05:27 PM, Jon Bringhurst wrote:
> > > >>> This might have something to do with the __APPLE__ weak imports in
> > > >>> src/plugins/select/cons_res/select_cons_res.c.
>
> > > >>> Chaos master HEAD doesn't seem to get this on my OS X 10.6 install.
> > > >>> Unfortunately I don't have anything running 10.5 available to debug
> > > >>> this one. :\
>
> > > >>> -Jon
>
> > > >>> On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
>
> > > >>>> Here's the result of recompiling with --enable-debug:
>
> > > >>>> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
> > > >>>> Assertion failed: (l != NULL), function list_count, file list.c,
> > > >>>> line 351.
> > > >>>> Abort trap
>
> > > >>>> And here's the gdb output:
> > > >>>> (gdb) run -Dvv
> > > >>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> > > >>>> Reading symbols for shared libraries ++. done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries .. done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Assertion failed: (l != NULL), function list_count, file list.c,
> > > >>>> line 351.
>
> > > >>>> Program received signal SIGABRT, Aborted.
> > > >>>> 0x94630e42 in __kill ()
> > > >>>> (gdb) bt full
> > > >>>> #0 0x94630e42 in __kill ()
> > > >>>> No symbol table info available.
> > > >>>> #1 0x94630e34 in kill$UNIX2003 ()
> > > >>>> No symbol table info available.
> > > >>>> #2 0x946a323a in raise ()
> > > >>>> No symbol table info available.
> > > >>>> #3 0x946af679 in abort ()
> > > >>>> No symbol table info available.
> > > >>>> #4 0x946a43db in __assert_rtn ()
> > > >>>> No symbol table info available.
> > > >>>> #5 0x00087abd in list_count ()
> > > >>>> No symbol table info available.
> > > >>>> #6 0x003b5ade in _create_part_data ()
> > > >>>> No symbol table info available.
> > > >>>> #7 0x003b8dd9 in select_p_node_init ()
> > > >>>> No symbol table info available.
> > > >>>> #8 0x000a9796 in select_g_node_init ()
> > > >>>> No symbol table info available.
> > > >>>> #9 0x00059153 in read_slurm_conf ()
> > > >>>> No symbol table info available.
> > > >>>> #10 0x0000a3ec in main ()
> > > >>>> No symbol table info available.
>
> > > >>>> Tyler
>
> > > >>>> On 05/16/2011 11:43 AM, Auble, Danny wrote:
> > > >>>>> Could you configure with the --with-debug option and recompile? In
> > > >>>>> any case. This appears to be a wild goose chase. Could you also
> > > >>>>> try to compile against the latest trunk in the git repo on github?
> > > >>>>> It has other places fixed in headers to make sure we don't miss
> > > >>>>> one in the future.
>
> > > >>>>> Danny
>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: [email protected]
> > > >>>>>> [mailto:[email protected]] On Behalf Of Tyler
> > > >>>>>> Strickland
> > > >>>>>> Sent: Friday, May 13, 2011 12:03 PM
> > > >>>>>> To: [email protected]
> > > >>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>
> > > >>>>>> Here's the full gdb output. What might cause slurm to not be able to
> > > >>>>>> access the memory?
>
> > > >>>>>> (gdb) run -Dvv
> > > >>>>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> > > >>>>>> Reading symbols for shared libraries ++. done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries .. done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
>
> > > >>>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
> > > >>>>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
> > > >>>>>> 0x945cab7e in pthread_mutex_lock ()
> > > >>>>>> (gdb) bt full
> > > >>>>>> #0 0x945cab7e in pthread_mutex_lock ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #1 0x00079eda in list_count ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #2 0x00337e0e in _create_part_data ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #3 0x0033b109 in select_p_node_init ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #4 0x00096ee9 in select_g_node_init ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #5 0x000504e3 in read_slurm_conf ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #6 0x0000a768 in main ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> (gdb)
>
> > > >>>>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
> > > >>>>>>> Could you run it in gdb and get the backtrace?
>
> > > >>>>>>> gdb slurmctld
> > > >>>>>>> (gdb) run -Dvv
> > > >>>>>>> ...crash...
> > > >>>>>>> (gdb) bt full
>
> > > >>>>>>> That might give us something.
>
> > > >>>>>>> Danny
>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: [email protected]
> > > >>>>>>>> [mailto:[email protected]] On Behalf Of Tyler
> > > >>>>>>>> Strickland
> > > >>>>>>>> Sent: Friday, May 13, 2011 11:33 AM
> > > >>>>>>>> To: [email protected]
> > > >>>>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>
> > > >>>>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
> > > >>>>>>>> about doing that? One of the past list posts said to run
> > > >>>>>>>> 'ulimit -c unlimited' followed by slurmctld -D, after which the
> > > >>>>>>>> core dump would be placed in the current directory (/tmp).
> > > >>>>>>>> Unfortunately, nothing is to be found in the folder after the crash.
>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Tyler
>
> > > >>>>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:...
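
For anyone landing on this thread later, here is a rough sketch of the accessor Mark describes in footnote [1] of the quoted discussion, using the "list of supported symbols by name" variant rather than dlsym(). This is not code from the SLURM tree. The lookup table, the plugin_get_list() helper, and the void * stand-ins for part_list and job_list are all assumptions, made only so the example compiles on its own.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the controller's globals. In slurmctld these
 * would be the real part_list and job_list; plain void pointers are used
 * here so the sketch builds without any SLURM headers. */
void *part_list = NULL;
void *job_list = NULL;

/* One entry per global that plugins are allowed to look up by name. */
struct exported_symbol {
	const char *name;
	void *addr;
};

static const struct exported_symbol exported_symbols[] = {
	{ "part_list", &part_list },
	{ "job_list", &job_list },
};

/* Return the address of a supported global, or NULL if the name is not
 * in the table (the behaviour suggested in the footnote). */
void *get_slurm_global_symbol(const char *name)
{
	size_t i;

	if (name == NULL)
		return NULL;
	for (i = 0; i < sizeof(exported_symbols) / sizeof(exported_symbols[0]); i++) {
		if (strcmp(exported_symbols[i].name, name) == 0)
			return exported_symbols[i].addr;
	}
	return NULL;
}

/* Hypothetical plugin-side helper: fetch the current value of a list
 * global, or NULL when the host program does not export that name. */
void *plugin_get_list(const char *name)
{
	void **slot = get_slurm_global_symbol(name);

	return slot ? *slot : NULL;
}
```

A plugin would call plugin_get_list("part_list") and check the result for NULL before passing it to something like list_count(), rather than referencing the global directly. Whether this would actually avoid the dlopen failure on OS X depends on where the accessor itself is defined and how the plugin resolves it, which is untested here.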
