Did this ever get resolved? I am finding that slurmctld is dumping
core on OSX (slurm version 2-2-7-1):

omega:slurm-2-2-7-1 rw$ sudo ./sbin/slurmctld  -D
slurmctld: error: Configured MailProg is invalid
slurmctld: slurmctld version 2.2.7 started on cluster (null)
slurmctld: error: read_slurm_conf: default partition not set.
slurmctld: error: Could not open node state file /tmp/node_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: error: Could not open reservation state file /tmp/resv_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
slurmctld: No reservation state file (/tmp/resv_state.old) to recover
slurmctld: Recovered state of 0 reservations
slurmctld: error: Could not open trigger state file /tmp/trigger_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
slurmctld: error: Incomplete trigger data checkpoint file
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Reinitializing job accounting state
Segmentation fault (core dumped)

omega:slurm-2-2-7-1 rw$ cat etc/slurm.conf
ControlMachine=omega
NodeName=omega
PartitionName=basic


On May 20, 5:12 pm, "Mark A. Grondona" <[email protected]> wrote:
> On Fri, 20 May 2011 10:28:01 -0700, Danny Auble <[email protected]> wrote:
> > Tyler, what you have found is the exact reason this code exists.
>
> > Those variables exist in the slurmctld but not in anything else.  Since all 
> > programs load the plugins the same way we define them there and when the 
> > slurmctld loads things the symbols are overwritten with the real ones.
>
> Perhaps a better idea would be to export a function[1] or set of
> functions to access global variables that might be required by plugins.
> (though if you are doing that often you might rethink that strategy and
> export more useful functions to plugins). I think plugins are loaded
> with RTLD_LAZY, so the symbol is only resolved if you try to call it.
> This weak-import/symbol interpositioning usage seems fragile.
>
> mark
>
> [1] One idea might be to have a single function
>
>  void * get_slurm_global_symbol (const char *name)
>
> A quick implementation of this function might use dlsym()
> to try getting the symbol from a global namespace. A better
> implementation would contain a list of supported symbols
> by name, and return NULL if an unsupported symbol is accessed
> from a plugin.
>
> mark
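
To make the table-based variant in Mark's footnote concrete, here is a minimal C sketch. Only the function name and signature come from his message. The table type, the demo_ variables, and the choice of which symbols to list are assumptions made for illustration and are not taken from the Slurm source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for controller globals. In slurmctld these would
 * be the real part_list and job_list; here they are dummy objects. */
static int demo_part_list;
static int demo_job_list;

/* One entry per symbol that plugins are allowed to look up by name. */
struct global_symbol {
	const char *name;
	void *addr;
};

static const struct global_symbol supported_symbols[] = {
	{ "part_list", &demo_part_list },
	{ "job_list",  &demo_job_list  },
};

/* Return the address registered for name, or NULL if the symbol is not
 * in the supported list (or name itself is NULL). */
void *get_slurm_global_symbol(const char *name)
{
	size_t i;
	size_t n = sizeof(supported_symbols) / sizeof(supported_symbols[0]);

	if (name == NULL)
		return NULL;
	for (i = 0; i < n; i++) {
		if (strcmp(supported_symbols[i].name, name) == 0)
			return supported_symbols[i].addr;
	}
	return NULL;
}
```

With something along these lines, a plugin would call get_slurm_global_symbol("part_list") and handle a NULL return, so the plugin itself carries no undefined data symbol that has to resolve at dlopen time.
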
>
> > There is even a nice comment that explains this...
>
> > /* These are defined here so when we link with something other than
> >  * the slurmctld we will have these symbols defined.  They will get
> >  * overwritten when linking with the slurmctld.
> >  */
>
> > Danny
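
The placeholder idea described in that comment can be sketched roughly as follows, assuming GCC or Clang. The names demo_part_list and demo_part_list_available are hypothetical, and the actual declarations in select_cons_res.c may look different. The guard function is an addition for illustration: it shows one way a plugin could avoid passing a NULL placeholder on to code that asserts l != NULL, as list_count does in the backtraces quoted further down.

```c
#include <stddef.h>

/* Hypothetical stand-in for a controller-owned global such as part_list.
 * The weak attribute (a GCC/Clang extension) lets a strong definition of
 * the same name elsewhere in the link take precedence over this one. */
void *demo_part_list __attribute__((weak)) = NULL;

/* Plugin-side helper (hypothetical): report whether a real list is
 * present, so callers can skip work instead of dereferencing NULL. */
int demo_part_list_available(void)
{
	return demo_part_list != NULL;
}
```

How the placeholder and the real symbol are reconciled when a plugin is loaded at run time depends on the platform's dynamic linker, which is the part of this approach the thread is struggling with on OS X.
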
>
> > On Friday, May 20, 2011 09:30:47 AM Tyler Strickland wrote:
> > > Danny,
>
> > > I've traced the error in starting slurmd down to the dlopen line in
> > > src/common/plugin.c in the plugin_load_from_file function (line 176).
> > > What's strange is that both slurmd and slurmctld load plugins in the
> > > same way - via the slurm_select_init function in slurmd/slurmd.c and
> > > slurmctld/controller.c.  Note that I re-added the part_list and job_list
> > > variables to select_cons_res.c as extern Lists - making them extern
> > > seems to have had the same effect as removing them altogether.
>
> > > Here's the error:
>
> > > > May 20 12:27:14 head slurmd[78007]: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_linear.so): dlopen(/usr/local/lib/slurm/select_linear.so, 1): Symbol not found: _part_list\n  Referenced from: /usr/local/lib/slurm/select_linear.so\n  Expected in: dynamic lookup
> > > > May 20 12:27:14 head slurmd[78007]: error: Couldn't load specified plugin name for select/linear: Dlopen of plugin file failed
> > > > May 20 12:27:14 head slurmd[78007]: fatal: Can't find plugin for select/linear
>
> > > I'm not sure why one program can access the data without any issues and
> > > another can't.  Very strange.
>
> > > --Tyler
>
> > > On 05/19/2011 10:19 PM, Danny Auble wrote:
> > > > Hey Tyler,
>
> > > > I don't think you can call this one solved yet since your patch 
> > > > probably is creating a host of other problems you aren't aware of yet.  
> > > > The slurmd being just one of them.  I am guessing quite a few of the 
> > > > user tools won't work either.
> > > > You may be on the right track though, perhaps there is something other 
> > > > than a weak import needed in the APPLE section.
>
> > > > Danny
>
> > > > On Thursday, May 19, 2011 07:12:25 PM Tyler Strickland wrote:
> > > >> Jon, Danny, and Moe,
>
> > > >> After several hours of scouring through the code and trying to find out
> > > >> why it wasn't working I finally hit upon something - if I comment out
> > > >> the __APPLE__ section in select_cons_res.c AND the part_list and
> > > >> job_list declarations, I can get slurmctld to start.  Unfortunately,
> > > >> that change kills slurmd - and in such a manner that it dies with exit
> > > > >> code 01, nothing in the log, and nothing printed anywhere - not a single
> > > > >> clue to its death.  Arggh.  One step closer and one step further away.
>
> > > >> Tyler
>
> > > >> On 05/16/2011 05:27 PM, Jon Bringhurst wrote:
> > > >>> This might have something to do with the __APPLE__ weak imports in 
> > > >>> src/plugins/select/cons_res/select_cons_res.c.
>
> > > >>> Chaos master HEAD doesn't seem to get this on my OS X 10.6 install. 
> > > >>> Unfortunately I don't have anything running 10.5 available to debug 
> > > >>> this one. :\
>
> > > >>> -Jon
>
> > > >>> On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
>
> > > >>>> Here's the result of recompiling with --enable-debug:
>
> > > >>>> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
> > > > >>>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> > > >>>> Abort trap
>
> > > >>>> And here's the gdb output:
> > > >>>> (gdb) run -Dvv
> > > >>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> > > >>>> Reading symbols for shared libraries ++. done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries .. done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > >>>> Reading symbols for shared libraries . done
> > > > >>>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
>
> > > >>>> Program received signal SIGABRT, Aborted.
> > > >>>> 0x94630e42 in __kill ()
> > > >>>> (gdb) bt full
> > > >>>> #0  0x94630e42 in __kill ()
> > > >>>> No symbol table info available.
> > > >>>> #1  0x94630e34 in kill$UNIX2003 ()
> > > >>>> No symbol table info available.
> > > >>>> #2  0x946a323a in raise ()
> > > >>>> No symbol table info available.
> > > >>>> #3  0x946af679 in abort ()
> > > >>>> No symbol table info available.
> > > >>>> #4  0x946a43db in __assert_rtn ()
> > > >>>> No symbol table info available.
> > > >>>> #5  0x00087abd in list_count ()
> > > >>>> No symbol table info available.
> > > >>>> #6  0x003b5ade in _create_part_data ()
> > > >>>> No symbol table info available.
> > > >>>> #7  0x003b8dd9 in select_p_node_init ()
> > > >>>> No symbol table info available.
> > > >>>> #8  0x000a9796 in select_g_node_init ()
> > > >>>> No symbol table info available.
> > > >>>> #9  0x00059153 in read_slurm_conf ()
> > > >>>> No symbol table info available.
> > > >>>> #10 0x0000a3ec in main ()
> > > >>>> No symbol table info available.
>
> > > >>>> Tyler
>
> > > >>>> On 05/16/2011 11:43 AM, Auble, Danny wrote:
> > > > >>>>> Could you configure with the --enable-debug option and recompile?  In 
> > > > >>>>> any case, this appears to be a wild goose chase.  Could you also 
> > > > >>>>> try to compile against the latest trunk in the git repo on github? 
> > > > >>>>> It has other places fixed in headers to make sure we don't miss 
> > > > >>>>> one in the future.
>
> > > >>>>> Danny
>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: [email protected] 
> > > >>>>>> [mailto:[email protected]] On Behalf Of Tyler
> > > >>>>>> Strickland
> > > >>>>>> Sent: Friday, May 13, 2011 12:03 PM
> > > >>>>>> To: [email protected]
> > > >>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>
> > > > >>>>>> Here's the full gdb output.  What might cause slurm to not be able to
> > > > >>>>>> access the memory?
>
> > > >>>>>> (gdb) run -Dvv
> > > >>>>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> > > >>>>>> Reading symbols for shared libraries ++. done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries .. done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
> > > >>>>>> Reading symbols for shared libraries . done
>
> > > >>>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
> > > >>>>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
> > > >>>>>> 0x945cab7e in pthread_mutex_lock ()
> > > >>>>>> (gdb) bt full
> > > >>>>>> #0  0x945cab7e in pthread_mutex_lock ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #1  0x00079eda in list_count ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #2  0x00337e0e in _create_part_data ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #3  0x0033b109 in select_p_node_init ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #4  0x00096ee9 in select_g_node_init ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #5  0x000504e3 in read_slurm_conf ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> #6  0x0000a768 in main ()
> > > >>>>>> No symbol table info available.
> > > >>>>>> (gdb)
>
> > > >>>>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
> > > > >>>>>>> Could you run it in gdb and get the backtrace?
>
> > > >>>>>>> gdb slurmctld
> > > >>>>>>> (gdb) run -Dvv
> > > >>>>>>> ...crash...
> > > >>>>>>> (gdb) bt full
>
> > > >>>>>>> That might give us something.
>
> > > >>>>>>> Danny
>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: [email protected] 
> > > >>>>>>>> [mailto:[email protected]] On Behalf Of Tyler
> > > >>>>>>>> Strickland
> > > >>>>>>>> Sent: Friday, May 13, 2011 11:33 AM
> > > >>>>>>>> To: [email protected]
> > > >>>>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>
> > > > >>>>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
> > > > >>>>>>>> about doing that?  One of the past list posts said to run 'ulimit -c
> > > > >>>>>>>> unlimited' followed by slurmctld -D, after which the core dump would be
> > > > >>>>>>>> placed in the current directory (/tmp).  Unfortunately, nothing is to be
> > > > >>>>>>>> found in the folder after the crash.
>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Tyler
>
> > > >>>>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:...
>