Robert, when you run this under gdb, type 'bt' after it crashes. If you have a
core file, you can run 'gdb slurmctld core' and then type 'bt'. That should give
you some of what one would need to figure out where this is happening.

Danny

On Tuesday June 28 2011 8:53:16 PM you wrote:
> I just used gdb to run it (output is below). I believe this is what Danny
> intended me to do; if not, please let me know how I can provide information
> useful for debugging. I have examined the __APPLE__-specific sections of
> slurm, but I did not come up with any answers.
>
> Thank you very much!
> --Robert
>
>
> (20:49)rw@omega:~/slurm-2-2-7-1$ sudo gdb ./sbin/slurmctld
> Password:
> GNU gdb 6.3.50-20050815 (Apple version gdb-1515) (Sat Jan 15 08:33:48 UTC 2011)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done
>
> (gdb) run -Dvv
> Starting program: /Users/rw/slurm-2-2-7-1/sbin/slurmctld -Dvv
> Reading symbols for shared libraries +. done
> slurmctld: error: Configured MailProg is invalid
> Reading symbols for shared libraries . done
> slurmctld: Accounting storage NOT INVOKED plugin loaded
> slurmctld: slurmctld version 2.2.7 started on cluster (null)
> Reading symbols for shared libraries .. done
> slurmctld: Munge cryptographic signature plugin loaded
> Reading symbols for shared libraries . done
> Reading symbols for shared libraries . done
> slurmctld: preempt/none loaded
> Reading symbols for shared libraries . done
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> Reading symbols for shared libraries . done
> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
> slurmctld: debug: No backup controller to shutdown
> Reading symbols for shared libraries . done
> slurmctld: switch NONE plugin loaded
> Reading symbols for shared libraries . done
> slurmctld: topology NONE plugin loaded
> slurmctld: debug: No DownNodes
> Reading symbols for shared libraries . done
> Reading symbols for shared libraries . done
> slurmctld: sched: Built-in scheduler plugin loaded
> slurmctld: error: read_slurm_conf: default partition not set.
> slurmctld: error: Could not open node state file /tmp/node_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
> slurmctld: No node state file (/tmp/node_state.old) to recover
> slurmctld: error: Incomplete node data checkpoint file
> slurmctld: Recovered state of 0 nodes
> slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
> slurmctld: No job state file (/tmp/job_state.old) to recover
> slurmctld: debug: Updating partition uid access list
> slurmctld: error: Could not open reservation state file /tmp/resv_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
> slurmctld: No reservation state file (/tmp/resv_state.old) to recover
> slurmctld: Recovered state of 0 reservations
> slurmctld: error: Could not open trigger state file /tmp/trigger_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
> slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
> slurmctld: error: Incomplete trigger data checkpoint file
> slurmctld: State of 0 triggers recovered
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: Reinitializing job accounting state
>
> Program received signal EXC_BAD_ACCESS, Could not access memory.
> Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000028
> 0x00007fff883cf777 in pthread_mutex_lock ()
> (gdb)
>
>
> On Fri, Jun 24, 2011 at 10:10 AM, Danny Auble <[email protected]> wrote:
>
> > Do you happen to have a backtrace of the core?
> >
> > > > > Did this ever get resolved? I am finding that slurmctld is dumping
> > > > > core on OSX (slurm version 2-2-7-1):
> > > > >
> > > > > omega:slurm-2-2-7-1 rw$ sudo ./sbin/slurmctld -D
> > > > > slurmctld: error: Configured MailProg is invalid
> > > > > slurmctld: slurmctld version 2.2.7 started on cluster (null)
> > > > > slurmctld: error: read_slurm_conf: default partition not set.
> > > > > slurmctld: error: Could not open node state file /tmp/node_state: No such file or directory
> > > > > slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
> > > > > slurmctld: No node state file (/tmp/node_state.old) to recover
> > > > > slurmctld: error: Incomplete node data checkpoint file
> > > > > slurmctld: Recovered state of 0 nodes
> > > > > slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
> > > > > slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
> > > > > slurmctld: No job state file (/tmp/job_state.old) to recover
> > > > > slurmctld: error: Could not open reservation state file /tmp/resv_state: No such file or directory
> > > > > slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
> > > > > slurmctld: No reservation state file (/tmp/resv_state.old) to recover
> > > > > slurmctld: Recovered state of 0 reservations
> > > > > slurmctld: error: Could not open trigger state file /tmp/trigger_state: No such file or directory
> > > > > slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
> > > > > slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
> > > > > slurmctld: error: Incomplete trigger data checkpoint file
> > > > > slurmctld: read_slurm_conf: backup_controller not specified.
> > > > > slurmctld: Reinitializing job accounting state
> > > > > Segmentation fault (core dumped)
> > > > >
> > > > > omega:slurm-2-2-7-1 rw$ cat etc/slurm.conf
> > > > > ControlMachine=omega
> > > > > NodeName=omega
> > > > > PartitionName=basic
> > > > >
> > > > > On May 20, 5:12 pm, "Mark A. Grondona" <[email protected]> wrote:
> > > > > > On Fri, 20 May 2011 10:28:01 -0700, Danny Auble <[email protected]> wrote:
> > > > > > > Tyler, what you have found is the exact reason this code exists.
> > > > > > >
> > > > > > > Those variables exist in the slurmctld but not in anything else.
> > > > > > > Since all programs load the plugins the same way we define them
> > > > > > > there and when the slurmctld loads things the symbols are
> > > > > > > overwritten with the real ones.
> > > > > >
> > > > > > Perhaps a better idea would be to export a function[1] or set of
> > > > > > functions to access global variables that might be required by plugins.
> > > > > > (though if you are doing that often you might rethink that strategy and
> > > > > > export more useful functions to plugins). I think plugins are loaded
> > > > > > with RTLD_LAZY, so the symbol is only resolved if you try to call it.
> > > > > > This weak-import/symbol interpositioning usage seems fragile.
> > > > > >
> > > > > > mark
> > > > > >
> > > > > > [1] One idea might be to have a single function
> > > > > >
> > > > > > void * get_slurm_global_symbol (const char *name)
> > > > > >
> > > > > > A quick implementation of this function might use dlsym()
> > > > > > to try getting the symbol from a global namespace. A better
> > > > > > implementation would contain a list of supported symbols
> > > > > > by name, and return NULL if an unsupported symbol is accessed
> > > > > > from a plugin.
> > > > > >
> > > > > > mark
> > > > > >
> > > > > > > There is even a nice comment that explains this...
> > > > > > >
> > > > > > > /* These are defined here so when we link with something other than
> > > > > > >  * the slurmctld we will have these symbols defined. They will get
> > > > > > >  * overwritten when linking with the slurmctld.
> > > > > > >  */
> > > > > > >
> > > > > > > Danny
> > > > > > >
> > > > > > > On Friday, May 20, 2011 09:30:47 AM Tyler Strickland wrote:
> > > > > > > > Danny,
> > > > > > > >
> > > > > > > > I've traced the error in starting slurmd down to the dlopen line in
> > > > > > > > src/common/plugin.c in the plugin_load_from_file function (line 176).
> > > > > > > > What's strange is that both slurmd and slurmctld load plugins in the
> > > > > > > > same way - via the slurm_select_init function in slurmd/slurmd.c and
> > > > > > > > slurmctld/controller.c. Note that I re-added the part_list and job_list
> > > > > > > > variables to select_cons_res.c as extern Lists - making them extern
> > > > > > > > seems to have had the same effect as removing them altogether.
> > > > > > > >
> > > > > > > > Here's the error:
> > > > > > > >
> > > > > > > > May 20 12:27:14 head slurmd[78007]: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_linear.so): dlopen(/usr/local/lib/slurm/select_linear.so, 1): Symbol not found: _part_list\n Referenced from: /usr/local/lib/slurm/select_linear.so\n Expected in: dynamic lookup
> > > > > > > > May 20 12:27:14 head slurmd[78007]: error: Couldn't load specified plugin name for select/linear: Dlopen of plugin file failed
> > > > > > > > May 20 12:27:14 head slurmd[78007]: fatal: Can't find plugin for select/linear
> > > > > > > >
> > > > > > > > I'm not sure why one program can access the data without any issues and
> > > > > > > > another can't. Very strange.
> > > > > > > >
> > > > > > > > --Tyler
> > > > > > > >
> > > > > > > > On 05/19/2011 10:19 PM, Danny Auble wrote:
> > > > > > > > > Hey Tyler,
> > > > > > > > >
> > > > > > > > > I don't think you can call this one solved yet since your patch
> > > > > > > > > probably is creating a host of other problems you aren't aware of
> > > > > > > > > yet. The slurmd being just one of them. I am guessing quite a few
> > > > > > > > > of the user tools won't work either.
> > > > > > > > > You may be on the right track though, perhaps there is something
> > > > > > > > > other than a weak import needed in the APPLE section.
> > > > > > > > >
> > > > > > > > > Danny
> > > > > > > > >
> > > > > > > > > On Thursday, May 19, 2011 07:12:25 PM Tyler Strickland wrote:
> > > > > > > > >> Jon, Danny, and Moe,
> > > > > > > > >>
> > > > > > > > >> After several hours of scouring through the code and trying to find out
> > > > > > > > >> why it wasn't working I finally hit upon something - if I comment out
> > > > > > > > >> the __APPLE__ section in select_cons_res.c AND the part_list and
> > > > > > > > >> job_list declarations, I can get slurmctld to start.
> > > > > > > > >> Unfortunately, that change kills slurmd - and in such a manner that it
> > > > > > > > >> dies with exit code 01, nothing in the log, and nothing printed
> > > > > > > > >> anywhere - not a single clue to its death. Arggh. One step closer and
> > > > > > > > >> one step further away.
> > > > > > > > >>
> > > > > > > > >> Tyler
> > > > > > > > >>
> > > > > > > > >> On 05/16/2011 05:27 PM, Jon Bringhurst wrote:
> > > > > > > > >>> This might have something to do with the __APPLE__ weak imports
> > > > > > > > >>> in src/plugins/select/cons_res/select_cons_res.c.
> > > > > > > > >>>
> > > > > > > > >>> Chaos master HEAD doesn't seem to get this on my OS X 10.6
> > > > > > > > >>> install. Unfortunately I don't have anything running 10.5
> > > > > > > > >>> available to debug this one. :\
> > > > > > > > >>>
> > > > > > > > >>> -Jon
> > > > > > > > >>>
> > > > > > > > >>> On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
> > > > > > > > >>>
> > > > > > > > >>>> Here's the result of recompiling with --enable-debug:
> > > > > > > > >>>>
> > > > > > > > >>>> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
> > > > > > > > >>>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> > > > > > > > >>>> Abort trap
> > > > > > > > >>>>
> > > > > > > > >>>> And here's the gdb output:
> > > > > > > > >>>> (gdb) run -Dvv
> > > > > > > > >>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> > > > > > > > >>>> Reading symbols for shared libraries ++. done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries .. done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Reading symbols for shared libraries . done
> > > > > > > > >>>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> > > > > > > > >>>>
> > > > > > > > >>>> Program received signal SIGABRT, Aborted.
> > > > > > > > >>>> 0x94630e42 in __kill ()
> > > > > > > > >>>> (gdb) bt full
> > > > > > > > >>>> #0 0x94630e42 in __kill ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #1 0x94630e34 in kill$UNIX2003 ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #2 0x946a323a in raise ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #3 0x946af679 in abort ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #4 0x946a43db in __assert_rtn ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #5 0x00087abd in list_count ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #6 0x003b5ade in _create_part_data ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #7 0x003b8dd9 in select_p_node_init ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #8 0x000a9796 in select_g_node_init ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #9 0x00059153 in read_slurm_conf ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>> #10 0x0000a3ec in main ()
> > > > > > > > >>>> No symbol table info available.
> > > > > > > > >>>>
> > > > > > > > >>>> Tyler
> > > > > > > > >>>>
> > > > > > > > >>>> On 05/16/2011 11:43 AM, Auble, Danny wrote:
> > > > > > > > >>>>> Could you configure with the --with-debug option and
> > > > > > > > >>>>> recompile? In any case. This appears to be a wild goose chase.
> > > > > > > > >>>>> Could you also try to compile against the latest trunk in the
> > > > > > > > >>>>> git repo on github? It has other places fixed in headers to
> > > > > > > > >>>>> make sure we don't miss one in the future.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Danny
> > > > > > > > >>>>>
> > > > > > > > >>>>>> -----Original Message-----
> > > > > > > > >>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Tyler Strickland
> > > > > > > > >>>>>> Sent: Friday, May 13, 2011 12:03 PM
> > > > > > > > >>>>>> To: [email protected]
> > > > > > > > >>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Here's the full gdb output. What might cause slurm to not be able to
> > > > > > > > >>>>>> access the memory?
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> (gdb) run -Dvv
> > > > > > > > >>>>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> > > > > > > > >>>>>> Reading symbols for shared libraries ++. done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries .. done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>> Reading symbols for shared libraries . done
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
> > > > > > > > >>>>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
> > > > > > > > >>>>>> 0x945cab7e in pthread_mutex_lock ()
> > > > > > > > >>>>>> (gdb) bt full
> > > > > > > > >>>>>> #0 0x945cab7e in pthread_mutex_lock ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> #1 0x00079eda in list_count ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> #2 0x00337e0e in _create_part_data ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> #3 0x0033b109 in select_p_node_init ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> #4 0x00096ee9 in select_g_node_init ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> #5 0x000504e3 in read_slurm_conf ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> #6 0x0000a768 in main ()
> > > > > > > > >>>>>> No symbol table info available.
> > > > > > > > >>>>>> (gdb)
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
> > > > > > > > >>>>>>> Could you run it in gdb and get the backtrace?
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> gdb slurmctld
> > > > > > > > >>>>>>> (gdb) run -Dvv
> > > > > > > > >>>>>>> ...crash...
> > > > > > > > >>>>>>> (gdb) bt full
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> That might give us something.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Danny
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>> -----Original Message-----
> > > > > > > > >>>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Tyler Strickland
> > > > > > > > >>>>>>>> Sent: Friday, May 13, 2011 11:33 AM
> > > > > > > > >>>>>>>> To: [email protected]
> > > > > > > > >>>>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
> > > > > > > > >>>>>>>> about doing that? One of the past list posts said to run 'ulimit -c
> > > > > > > > >>>>>>>> unlimited' followed by slurmctld -D, after which the core dump would be
> > > > > > > > >>>>>>>> placed in the current directory (/tmp). Unfortunately, nothing is to be
> > > > > > > > >>>>>>>> found in the folder after the crash.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> Thanks,
> > > > > > > > >>>>>>>> Tyler
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:...
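
Mark's footnote [1] in the thread describes a single accessor, get_slurm_global_symbol(), whose "better implementation" keeps a list of supported symbols by name and returns NULL for anything else. A minimal sketch of that idea in C follows. Everything apart from the function name and signature is an assumption made for illustration: the table layout is hypothetical, and part_list and job_list are plain void pointers standing in for the controller's real List globals, not Slurm's actual declarations.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins (assumption) for the globals that live only in slurmctld.
 * In the real daemon these would be List objects, not void pointers. */
static void *part_list = NULL;
static void *job_list  = NULL;

/* One entry per symbol that plugins are allowed to look up (hypothetical). */
struct global_symbol {
    const char *name;  /* name a plugin asks for */
    void      **addr;  /* address of the global, so later updates are visible */
};

static const struct global_symbol supported_symbols[] = {
    { "part_list", &part_list },
    { "job_list",  &job_list  },
};

/* Return the address of a supported global, or NULL if the name is
 * NULL or not in the table. */
void *get_slurm_global_symbol(const char *name)
{
    size_t i;
    size_t n = sizeof(supported_symbols) / sizeof(supported_symbols[0]);

    if (name == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        if (strcmp(supported_symbols[i].name, name) == 0)
            return supported_symbols[i].addr;
    }
    return NULL;
}
```

Under this sketch a plugin would call get_slurm_global_symbol("part_list"), check both the returned pointer and the value it points to for NULL, and only then hand the list to something like list_count(). A daemon that does not own those globals (slurmd, for example) would get NULL back and could skip the code path, rather than failing at dlopen() time or crashing inside pthread_mutex_lock() as in the backtraces above.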
