Hey Tyler, I don't think you can call this one solved yet, since your patch is probably creating a host of other problems you aren't aware of yet, slurmd being just one of them. I am guessing quite a few of the user tools won't work either. You may be on the right track though; perhaps something other than a weak import is needed in the __APPLE__ section.

Danny

On Thursday, May 19, 2011 07:12:25 PM Tyler Strickland wrote:
> Jon, Danny, and Moe,
>
> After several hours of scouring through the code and trying to find out
> why it wasn't working I finally hit upon something - if I comment out
> the __APPLE__ section in select_cons_res.c AND the part_list and
> job_list declarations, I can get slurmctld to start. Unfortunately,
> that change kills slurmd - and in such a manner that it dies with exit
> code 01, nothing in the log, and nothing printed anywhere - not a single
> clue to its death. Arggh. One step closer and one step further away.
>
> Tyler
>
> On 05/16/2011 05:27 PM, Jon Bringhurst wrote:
> > This might have something to do with the __APPLE__ weak imports in
> > src/plugins/select/cons_res/select_cons_res.c.
> >
> > Chaos master HEAD doesn't seem to get this on my OS X 10.6 install.
> > Unfortunately I don't have anything running 10.5 available to debug this
> > one. :\
> >
> > -Jon
> >
> > On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
> >
> >> Here's the result of recompiling with --enable-debug:
> >>
> >> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
> >> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> >> Abort trap
> >>
> >> And here's the gdb output:
> >> (gdb) run -Dvv
> >> Starting program: /usr/local/sbin/slurmctld -Dvv
> >> Reading symbols for shared libraries ++. done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries .. done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> >>
> >> Program received signal SIGABRT, Aborted.
> >> 0x94630e42 in __kill ()
> >> (gdb) bt full
> >> #0 0x94630e42 in __kill ()
> >> No symbol table info available.
> >> #1 0x94630e34 in kill$UNIX2003 ()
> >> No symbol table info available.
> >> #2 0x946a323a in raise ()
> >> No symbol table info available.
> >> #3 0x946af679 in abort ()
> >> No symbol table info available.
> >> #4 0x946a43db in __assert_rtn ()
> >> No symbol table info available.
> >> #5 0x00087abd in list_count ()
> >> No symbol table info available.
> >> #6 0x003b5ade in _create_part_data ()
> >> No symbol table info available.
> >> #7 0x003b8dd9 in select_p_node_init ()
> >> No symbol table info available.
> >> #8 0x000a9796 in select_g_node_init ()
> >> No symbol table info available.
> >> #9 0x00059153 in read_slurm_conf ()
> >> No symbol table info available.
> >> #10 0x0000a3ec in main ()
> >> No symbol table info available.
> >>
> >> Tyler
> >>
> >> On 05/16/2011 11:43 AM, Auble, Danny wrote:
> >>> Could you configure with the --with-debug option and recompile? In any
> >>> case. This appears to be a wild goose chase. Could you also try to
> >>> compile against the latest trunk in the git repo on github? It has
> >>> other places fixed in headers to make sure we don't miss one in the
> >>> future.
> >>>
> >>> Danny
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected]
> >>>> [mailto:[email protected]] On Behalf Of Tyler
> >>>> Strickland
> >>>> Sent: Friday, May 13, 2011 12:03 PM
> >>>> To: [email protected]
> >>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>
> >>>> Here's the full gdb output. What might cause slurm to not be able to
> >>>> access the memory?
> >>>>
> >>>> (gdb) run -Dvv
> >>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> >>>> Reading symbols for shared libraries ++. done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries .. done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>>
> >>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
> >>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
> >>>> 0x945cab7e in pthread_mutex_lock ()
> >>>> (gdb) bt full
> >>>> #0 0x945cab7e in pthread_mutex_lock ()
> >>>> No symbol table info available.
> >>>> #1 0x00079eda in list_count ()
> >>>> No symbol table info available.
> >>>> #2 0x00337e0e in _create_part_data ()
> >>>> No symbol table info available.
> >>>> #3 0x0033b109 in select_p_node_init ()
> >>>> No symbol table info available.
> >>>> #4 0x00096ee9 in select_g_node_init ()
> >>>> No symbol table info available.
> >>>> #5 0x000504e3 in read_slurm_conf ()
> >>>> No symbol table info available.
> >>>> #6 0x0000a768 in main ()
> >>>> No symbol table info available.
> >>>> (gdb)
> >>>>
> >>>>
> >>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
> >>>>> Could you run it in gdb and get the backtrace?
> >>>>>
> >>>>> gdb slurmctld
> >>>>> (gdb) run -Dvv
> >>>>> ...crash...
> >>>>> (gdb) bt full
> >>>>>
> >>>>>
> >>>>> That might give us something.
> >>>>>
> >>>>> Danny
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [email protected]
> >>>>>> [mailto:[email protected]] On Behalf Of Tyler
> >>>>>> Strickland
> >>>>>> Sent: Friday, May 13, 2011 11:33 AM
> >>>>>> To: [email protected]
> >>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>
> >>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
> >>>>>> about doing that? One of the past list posts said to run 'ulimit -c
> >>>>>> unlimited' followed by slurmctld -D, after which the core dump would be
> >>>>>> placed in the current directory (/tmp). Unfortunately, nothing is to be
> >>>>>> found in the folder after the crash.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Tyler
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:
> >>>>>>> If you can get a core file on SIGBUS and generate a backtrace, that
> >>>>>>> may help.
> >>>>>>> ________________________________________
> >>>>>>> From: [email protected] [[email protected]]
> >>>>>>> On Behalf Of Tyler Strickland [[email protected]]
> >>>>>>> Sent: Friday, May 13, 2011 10:42 AM
> >>>>>>> To: [email protected]
> >>>>>>> Subject: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>>
> >>>>>>> All,
> >>>>>>>
> >>>>>>> After the fun with getting SLURM compiled last night, I've finally
> >>>>>>> succeeded at getting it installed. slurmd starts up fine but
> >>>>>>> slurmctld doesn't - and there are no errors indicating why. When I
> >>>>>>> try to run it with -D the words "Bus Error" are printed and the log
> >>>>>>> appears much like the one below.
> >>>>>>>
> >>>>>>> The logfile for "slurmd -cvvvvvvvvv"
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Tyler
> >>>>>>>
> >>>>>>> [2011-05-13T13:39:29] pidfile not locked, assuming no running daemon
> >>>>>>> [2011-05-13T13:39:29] debug: sched: slurmctld starting
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/accounting_storage_none.so
> >>>>>>> [2011-05-13T13:39:29] Accounting storage NOT INVOKED plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: not enforcing associations and no list
> >>>>>>> was given so we are giving a blank list
> >>>>>>> [2011-05-13T13:39:29] debug2: No Assoc usage file
> >>>>>>> (/var/lib/slurm/slurmctld/assoc_usage) to recover
> >>>>>>> [2011-05-13T13:39:29] slurmctld version 2.2.5 started on cluster cluster
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/crypto_munge.so
> >>>>>>> [2011-05-13T13:39:29] Munge cryptographic signature plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/select_cons_res.so
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/preempt_none.so
> >>>>>>> [2011-05-13T13:39:29] preempt/none loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/checkpoint_none.so
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] Checkpoint plugin loaded: checkpoint/none
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/jobacct_gather_none.so
> >>>>>>> [2011-05-13T13:39:29] Job accounting gather NOT_INVOKED plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug: No backup controller to shutdown
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/switch_none.so
> >>>>>>> [2011-05-13T13:39:29] switch NONE plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/topology_none.so
> >>>>>>> [2011-05-13T13:39:29] topology NONE plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug: No DownNodes
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/jobcomp_none.so
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/sched_backfill.so
> >>>>>>> [2011-05-13T13:39:29] sched: Backfill scheduler plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug: No job state file
> >>>>>>> (/var/lib/slurm/slurmctld/job_state) to recover
> >>>>>>> [2011-05-13T13:39:29] cons_res: select_p_node_init
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>
> >
> >
> > · · · · — · · — — —
> > Jon O. Bringhurst
> > High Performance Computing Systems - http://lanl.gov
> >
> > Email: [email protected] | Office: +1 505 667 9337 | Blog:
> > http://bringhurst.org
> > Schedule: B
> >
> >
> >
