Tyler, what you have found is the exact reason this code exists. Those variables exist in the slurmctld but not in anything else. Since all programs load the plugins the same way, we define them in the plugin as well, and when the slurmctld loads the plugin those symbols are overwritten with the real ones.
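To make the failure mode concrete, here is a minimal standalone sketch, not Slurm code. It assumes (one plausible reading of the backtraces quoted below) that the plugin ended up using its own fallback copy of part_list, which was never replaced by the slurmctld's and so stayed NULL. The names demo_list, demo_list_count and demo_part_count are made up for illustration; only part_list and the (l != NULL) check come from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Slurm's List type; the real one is more involved. */
struct demo_list {
    int count;
};
typedef struct demo_list *List;

/* Fallback definition, playing the role of the plugin's own copy.
 * As a file-scope object without an initializer it starts out NULL,
 * and it stays NULL unless the host program's definition replaces it. */
List part_list;

/* Mirrors the check that fired in the thread: "(l != NULL)" in list_count. */
int demo_list_count(List l)
{
    assert(l != NULL);
    return l->count;
}

/* Defensive variant (an assumption, not what Slurm does): report -1
 * instead of aborting when the global was never populated. */
int demo_part_count(void)
{
    if (part_list == NULL)
        return -1;
    return demo_list_count(part_list);
}
```

Calling demo_list_count(part_list) before anything has assigned part_list aborts on the assertion, which matches the shape of the --enable-debug output quoted below; demo_part_count() only shows how a caller could detect the unpopulated fallback.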
There is even a nice comment that explains this...

/* These are defined here so when we link with something other than
 * the slurmctld we will have these symbols defined. They will get
 * overwritten when linking with the slurmctld. */

Danny

On Friday, May 20, 2011 09:30:47 AM Tyler Strickland wrote:
> Danny,
>
> I've traced the error in starting slurmd down to the dlopen line in
> src/common/plugin.c in the plugin_load_from_file function (line 176).
> What's strange is that both slurmd and slurmctld load plugins in the
> same way - via the slurm_select_init function in slurmd/slurmd.c and
> slurmctld/controller.c. Note that I re-added the part_list and job_list
> variables to select_cons_res.c as extern Lists - making them extern
> seems to have had the same effect as removing them altogether.
>
> Here's the error:
>
> May 20 12:27:14 head slurmd[78007]: error: plugin_load_from_file: dlopen(/usr/local/lib/slurm/select_linear.so): dlopen(/usr/local/lib/slurm/select_linear.so, 1): Symbol not found: _part_list\n Referenced from: /usr/local/lib/slurm/select_linear.so\n Expected in: dynamic lookup
> May 20 12:27:14 head slurmd[78007]: error: Couldn't load specified plugin name for select/linear: Dlopen of plugin file failed
> May 20 12:27:14 head slurmd[78007]: fatal: Can't find plugin for select/linear
>
> I'm not sure why one program can access the data without any issues and
> another can't. Very strange.
>
> --Tyler
>
> On 05/19/2011 10:19 PM, Danny Auble wrote:
> > Hey Tyler,
> >
> > I don't think you can call this one solved yet since your patch probably is
> > creating a host of other problems you aren't aware of yet. The slurmd
> > being just one of them. I am guessing quite a few of the user tools won't
> > work either.
> > You may be on the right track though, perhaps there is something other than
> > a weak import needed in the APPLE section.
> >
> > Danny
> >
> > On Thursday, May 19, 2011 07:12:25 PM Tyler Strickland wrote:
> >> Jon, Danny, and Moe,
> >>
> >> After several hours of scouring through the code and trying to find out
> >> why it wasn't working I finally hit upon something - if I comment out
> >> the __APPLE__ section in select_cons_res.c AND the part_list and
> >> job_list declarations, I can get slurmctld to start. Unfortunately,
> >> that change kills slurmd - and in such a manner that it dies with exit
> >> code 01, nothing in the log, and nothing printed anywhere - not a single
> >> clue to its death. Arggh. One step closer and one step further away.
> >>
> >> Tyler
> >>
> >> On 05/16/2011 05:27 PM, Jon Bringhurst wrote:
> >>> This might have something to do with the __APPLE__ weak imports in
> >>> src/plugins/select/cons_res/select_cons_res.c.
> >>>
> >>> Chaos master HEAD doesn't seem to get this on my OS X 10.6 install.
> >>> Unfortunately I don't have anything running 10.5 available to debug this
> >>> one. :\
> >>>
> >>> -Jon
> >>>
> >>> On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
> >>>
> >>>> Here's the result of recompiling with --enable-debug:
> >>>>
> >>>> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
> >>>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> >>>> Abort trap
> >>>>
> >>>> And here's the gdb output:
> >>>> (gdb) run -Dvv
> >>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> >>>> Reading symbols for shared libraries ++. done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries .. done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> >>>>
> >>>> Program received signal SIGABRT, Aborted.
> >>>> 0x94630e42 in __kill ()
> >>>> (gdb) bt full
> >>>> #0 0x94630e42 in __kill ()
> >>>> No symbol table info available.
> >>>> #1 0x94630e34 in kill$UNIX2003 ()
> >>>> No symbol table info available.
> >>>> #2 0x946a323a in raise ()
> >>>> No symbol table info available.
> >>>> #3 0x946af679 in abort ()
> >>>> No symbol table info available.
> >>>> #4 0x946a43db in __assert_rtn ()
> >>>> No symbol table info available.
> >>>> #5 0x00087abd in list_count ()
> >>>> No symbol table info available.
> >>>> #6 0x003b5ade in _create_part_data ()
> >>>> No symbol table info available.
> >>>> #7 0x003b8dd9 in select_p_node_init ()
> >>>> No symbol table info available.
> >>>> #8 0x000a9796 in select_g_node_init ()
> >>>> No symbol table info available.
> >>>> #9 0x00059153 in read_slurm_conf ()
> >>>> No symbol table info available.
> >>>> #10 0x0000a3ec in main ()
> >>>> No symbol table info available.
> >>>>
> >>>> Tyler
> >>>>
> >>>> On 05/16/2011 11:43 AM, Auble, Danny wrote:
> >>>>> Could you configure with the --with-debug option and recompile? In any
> >>>>> case, this appears to be a wild goose chase. Could you also try to
> >>>>> compile against the latest trunk in the git repo on github? It has
> >>>>> other places fixed in headers to make sure we don't miss one in the
> >>>>> future.
> >>>>>
> >>>>> Danny
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Tyler Strickland
> >>>>>> Sent: Friday, May 13, 2011 12:03 PM
> >>>>>> To: [email protected]
> >>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>
> >>>>>> Here's the full gdb output. What might cause slurm to not be able to
> >>>>>> access the memory?
> >>>>>>
> >>>>>> (gdb) run -Dvv
> >>>>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> >>>>>> Reading symbols for shared libraries ++. done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries .. done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>> Reading symbols for shared libraries . done
> >>>>>>
> >>>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
> >>>>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
> >>>>>> 0x945cab7e in pthread_mutex_lock ()
> >>>>>> (gdb) bt full
> >>>>>> #0 0x945cab7e in pthread_mutex_lock ()
> >>>>>> No symbol table info available.
> >>>>>> #1 0x00079eda in list_count ()
> >>>>>> No symbol table info available.
> >>>>>> #2 0x00337e0e in _create_part_data ()
> >>>>>> No symbol table info available.
> >>>>>> #3 0x0033b109 in select_p_node_init ()
> >>>>>> No symbol table info available.
> >>>>>> #4 0x00096ee9 in select_g_node_init ()
> >>>>>> No symbol table info available.
> >>>>>> #5 0x000504e3 in read_slurm_conf ()
> >>>>>> No symbol table info available.
> >>>>>> #6 0x0000a768 in main ()
> >>>>>> No symbol table info available.
> >>>>>> (gdb)
> >>>>>>
> >>>>>>
> >>>>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
> >>>>>>> Could you run it in gdb and get the backtrace?
> >>>>>>>
> >>>>>>> gdb slurmctld
> >>>>>>> (gdb) run -Dvv
> >>>>>>> ...crash...
> >>>>>>> (gdb) bt full
> >>>>>>>
> >>>>>>>
> >>>>>>> That might give us something.
> >>>>>>>
> >>>>>>> Danny
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Tyler Strickland
> >>>>>>>> Sent: Friday, May 13, 2011 11:33 AM
> >>>>>>>> To: [email protected]
> >>>>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>>>
> >>>>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
> >>>>>>>> about doing that? One of the past list posts said to run 'ulimit -c
> >>>>>>>> unlimited' followed by slurmctld -D, after which the core dump would be
> >>>>>>>> placed in the current directory (/tmp). Unfortunately, nothing is to be
> >>>>>>>> found in the folder after the crash.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tyler
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:
> >>>>>>>>> If you can get a core file on SIGBUS and generate a backtrace, that
> >>>>>>>>> may help.
> >>>>>>>>> ________________________________________
> >>>>>>>>> From: [email protected] [[email protected]] On Behalf Of Tyler Strickland [[email protected]]
> >>>>>>>>> Sent: Friday, May 13, 2011 10:42 AM
> >>>>>>>>> To: [email protected]
> >>>>>>>>> Subject: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>>>>
> >>>>>>>>> All,
> >>>>>>>>>
> >>>>>>>>> After the fun with getting SLURM compiled last night, I've finally
> >>>>>>>>> succeeded at getting it installed. slurmd starts up fine but slurmctld
> >>>>>>>>> doesn't - and there are no errors indicating why. When I try to run it
> >>>>>>>>> with -D the words "Bus Error" are printed and the log appears much
> >>>>>>>>> like the one below.
> >>>>>>>>>
> >>>>>>>>> The logfile for "slurmd -cvvvvvvvvv"
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Tyler
> >>>>>>>>>
> >>>>>>>>> [2011-05-13T13:39:29] pidfile not locked, assuming no running daemon
> >>>>>>>>> [2011-05-13T13:39:29] debug: sched: slurmctld starting
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_none.so
> >>>>>>>>> [2011-05-13T13:39:29] Accounting storage NOT INVOKED plugin loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug3: not enforcing associations and no list was given so we are giving a blank list
> >>>>>>>>> [2011-05-13T13:39:29] debug2: No Assoc usage file (/var/lib/slurm/slurmctld/assoc_usage) to recover
> >>>>>>>>> [2011-05-13T13:39:29] slurmctld version 2.2.5 started on cluster cluster
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
> >>>>>>>>> [2011-05-13T13:39:29] Munge cryptographic signature plugin loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/select_cons_res.so
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
> >>>>>>>>> [2011-05-13T13:39:29] preempt/none loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_none.so
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] Checkpoint plugin loaded: checkpoint/none
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
> >>>>>>>>> [2011-05-13T13:39:29] Job accounting gather NOT_INVOKED plugin loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug: No backup controller to shutdown
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
> >>>>>>>>> [2011-05-13T13:39:29] switch NONE plugin loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
> >>>>>>>>> [2011-05-13T13:39:29] topology NONE plugin loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug: No DownNodes
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
> >>>>>>>>> [2011-05-13T13:39:29] sched: Backfill scheduler plugin loaded
> >>>>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>>>> [2011-05-13T13:39:29] debug: No job state file (/var/lib/slurm/slurmctld/job_state) to recover
> >>>>>>>>> [2011-05-13T13:39:29] cons_res: select_p_node_init
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>> · · · · — · · — — —
> >>> Jon O. Bringhurst
> >>> High Performance Computing Systems - http://lanl.gov
> >>>
> >>> Email: [email protected] | Office: +1 505 667 9337 | Blog: http://bringhurst.org
> >>> Schedule: B
> >>>
> >>>
> >>
> >>
> >
> >
