Although this is a shot in the dark, try to apply the following patch and see if it changes anything:
https://gist.github.com/975422

-Jon

On May 16, 2011, at 3:27 PM, Jon Bringhurst wrote:

> This might have something to do with the __APPLE__ weak imports in
> src/plugins/select/cons_res/select_cons_res.c.
>
> Chaos master HEAD doesn't seem to get this on my OS X 10.6 install.
> Unfortunately I don't have anything running 10.5 available to debug this one. :\
>
> -Jon
>
> On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
>
>> Here's the result of recompiling with --enable-debug:
>>
>> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
>> Abort trap
>>
>> And here's the gdb output:
>> (gdb) run -Dvv
>> Starting program: /usr/local/sbin/slurmctld -Dvv
>> Reading symbols for shared libraries ++. done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries .. done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
>>
>> Program received signal SIGABRT, Aborted.
>> 0x94630e42 in __kill ()
>> (gdb) bt full
>> #0 0x94630e42 in __kill ()
>> No symbol table info available.
>> #1 0x94630e34 in kill$UNIX2003 ()
>> No symbol table info available.
>> #2 0x946a323a in raise ()
>> No symbol table info available.
>> #3 0x946af679 in abort ()
>> No symbol table info available.
>> #4 0x946a43db in __assert_rtn ()
>> No symbol table info available.
>> #5 0x00087abd in list_count ()
>> No symbol table info available.
>> #6 0x003b5ade in _create_part_data ()
>> No symbol table info available.
>> #7 0x003b8dd9 in select_p_node_init ()
>> No symbol table info available.
>> #8 0x000a9796 in select_g_node_init ()
>> No symbol table info available.
>> #9 0x00059153 in read_slurm_conf ()
>> No symbol table info available.
>> #10 0x0000a3ec in main ()
>> No symbol table info available.
>>
>> Tyler
>>
>> On 05/16/2011 11:43 AM, Auble, Danny wrote:
>>> Could you configure with the --with-debug option and recompile? In any
>>> case, this appears to be a wild goose chase. Could you also try to
>>> compile against the latest trunk in the git repo on github? It has other
>>> places fixed in headers to make sure we don't miss one in the future.
>>>
>>> Danny
>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Tyler Strickland
>>>> Sent: Friday, May 13, 2011 12:03 PM
>>>> To: [email protected]
>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>>>>
>>>> Here's the full gdb output. What might cause slurm to not be able to
>>>> access the memory?
>>>>
>>>> (gdb) run -Dvv
>>>> Starting program: /usr/local/sbin/slurmctld -Dvv
>>>> Reading symbols for shared libraries ++. done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries .. done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>>
>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
>>>> 0x945cab7e in pthread_mutex_lock ()
>>>> (gdb) bt full
>>>> #0 0x945cab7e in pthread_mutex_lock ()
>>>> No symbol table info available.
>>>> #1 0x00079eda in list_count ()
>>>> No symbol table info available.
>>>> #2 0x00337e0e in _create_part_data ()
>>>> No symbol table info available.
>>>> #3 0x0033b109 in select_p_node_init ()
>>>> No symbol table info available.
>>>> #4 0x00096ee9 in select_g_node_init ()
>>>> No symbol table info available.
>>>> #5 0x000504e3 in read_slurm_conf ()
>>>> No symbol table info available.
>>>> #6 0x0000a768 in main ()
>>>> No symbol table info available.
>>>> (gdb)
>>>>
>>>>
>>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
>>>>> Could you run it in gdb and get the backtrace?
>>>>>
>>>>> gdb slurmctld
>>>>> (gdb) run -Dvv
>>>>> ...crash...
>>>>> (gdb) bt full
>>>>>
>>>>>
>>>>> That might give us something.
>>>>>
>>>>> Danny
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Tyler Strickland
>>>>>> Sent: Friday, May 13, 2011 11:33 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>>>>>>
>>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
>>>>>> about doing that? One of the past list posts said to run
>>>>>> 'ulimit -c unlimited' followed by slurmctld -D, after which the core
>>>>>> dump would be placed in the current directory (/tmp). Unfortunately,
>>>>>> nothing is to be found in the folder after the crash.
>>>>>>
>>>>>> Thanks,
>>>>>> Tyler
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:
>>>>>>> If you can get a core file on SIGBUS and generate a backtrace, that may
>>>>>>> help.
>>>>>>> ________________________________________
>>>>>>> From: [email protected] [[email protected]] On Behalf Of Tyler Strickland [[email protected]]
>>>>>>> Sent: Friday, May 13, 2011 10:42 AM
>>>>>>> To: [email protected]
>>>>>>> Subject: [slurm-dev] slurmctld not starting on OSX 10.5
>>>>>>>
>>>>>>> All,
>>>>>>>
>>>>>>> After the fun with getting SLURM compiled last night, I've finally
>>>>>>> succeeded at getting it installed.
>>>>>>> slurmd starts up fine but slurmctld doesn't - and there are no errors
>>>>>>> indicating why. When I try to run it with -D the words "Bus Error" are
>>>>>>> printed and the log appears much like the one below.
>>>>>>>
>>>>>>> The logfile for "slurmd -cvvvvvvvvv"
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tyler
>>>>>>>
>>>>>>> [2011-05-13T13:39:29] pidfile not locked, assuming no running daemon
>>>>>>> [2011-05-13T13:39:29] debug: sched: slurmctld starting
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_none.so
>>>>>>> [2011-05-13T13:39:29] Accounting storage NOT INVOKED plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: not enforcing associations and no list was given so we are giving a blank list
>>>>>>> [2011-05-13T13:39:29] debug2: No Assoc usage file (/var/lib/slurm/slurmctld/assoc_usage) to recover
>>>>>>> [2011-05-13T13:39:29] slurmctld version 2.2.5 started on cluster cluster
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
>>>>>>> [2011-05-13T13:39:29] Munge cryptographic signature plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/select_cons_res.so
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
>>>>>>> [2011-05-13T13:39:29] preempt/none loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_none.so
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] Checkpoint plugin loaded: checkpoint/none
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
>>>>>>> [2011-05-13T13:39:29] Job accounting gather NOT_INVOKED plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug: No backup controller to shutdown
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
>>>>>>> [2011-05-13T13:39:29] switch NONE plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
>>>>>>> [2011-05-13T13:39:29] topology NONE plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug: No DownNodes
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
>>>>>>> [2011-05-13T13:39:29] sched: Backfill scheduler plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug: No job state file (/var/lib/slurm/slurmctld/job_state) to recover
>>>>>>> [2011-05-13T13:39:29] cons_res: select_p_node_init
>>>>>>>
>
> · · · · — · · — — —
> Jon O. Bringhurst
> High Performance Computing Systems - http://lanl.gov
>
> Email: [email protected] | Office: +1 505 667 9337 | Blog: http://bringhurst.org
> Schedule: B
>

· · · · — · · — — —
Jon O. Bringhurst
High Performance Computing Systems - http://lanl.gov

Email: [email protected] | Office: +1 505 667 9337 | Blog: http://bringhurst.org
Schedule: B
