Hey Tyler,

I don't think you can call this one solved yet, since your patch is probably 
creating a host of other problems you aren't aware of yet, slurmd being just 
one of them.  I am guessing quite a few of the user tools won't work either.
You may be on the right track, though; perhaps something other than a weak 
import is needed in the __APPLE__ section.
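
For anyone following along, the general shape of a weak import is roughly the
sketch below.  It is only an illustration of the pattern, not the code in
select_cons_res.c: demo_list, demo_part_list, DEMO_WEAK_REF and both helper
functions are made-up names, and it assumes GCC or Clang.  The idea is that a
plugin declares a symbol it expects the host program to provide, and a missing
symbol shows up as a NULL address that has to be checked before use.  An
unchecked NULL list would be consistent with what both backtraces show in
list_count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a list type; NOT the real SLURM List. */
struct demo_list {
    int count;
};

/* Pick the weak-reference attribute for the platform (assumes GCC/Clang). */
#ifdef __APPLE__
#  define DEMO_WEAK_REF __attribute__((weak_import))
#else
#  define DEMO_WEAK_REF __attribute__((weak))
#endif

/*
 * Weak reference to a symbol that the host program may or may not export.
 * demo_part_list is an illustrative name only.
 */
extern struct demo_list *demo_part_list DEMO_WEAK_REF;

/* Returns 1 if the symbol was actually bound at link/load time, else 0. */
int demo_symbol_is_bound(void)
{
    return &demo_part_list != NULL;
}

/*
 * Returns the list's count, or -1 when the symbol is unbound or the
 * pointer it holds is NULL, instead of dereferencing a bad address.
 */
int demo_safe_count(void)
{
    if (!demo_symbol_is_bound() || demo_part_list == NULL)
        return -1;
    return demo_part_list->count;
}
```

Built into a standalone program on Linux with nothing defining
demo_part_list, demo_safe_count() should return -1 rather than crash.  On OS X
the linker also has to be told that the symbol may legitimately be undefined,
so treat this as a sketch of the idea rather than a drop-in fix.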

Danny

On Thursday, May 19, 2011 07:12:25 PM Tyler Strickland wrote:
> Jon, Danny, and Moe,
> 
> After several hours of scouring through the code and trying to find out 
> why it wasn't working I finally hit upon something - if I comment out 
> the __APPLE__ section in select_cons_res.c AND the part_list and 
> job_list declarations, I can get slurmctld to start.  Unfortunately, 
> that change kills slurmd - and in such a manner that it dies with exit 
> code 01, nothing in the log, and nothing printed anywhere - not a single 
> clue to its death.  Arggh.  One step closer and one step further away.
> 
> Tyler
> 
> On 05/16/2011 05:27 PM, Jon Bringhurst wrote:
> > This might have something to do with the __APPLE__ weak imports in 
> > src/plugins/select/cons_res/select_cons_res.c.
> >
> > Chaos master HEAD doesn't seem to get this on my OS X 10.6 install. 
> > Unfortunately I don't have anything running 10.5 available to debug this 
> > one. :\
> >
> > -Jon
> >
> > On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
> >
> >> Here's the result of recompiling with --enable-debug:
> >>
> >> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
> >> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> >> Abort trap
> >>
> >> And here's the gdb output:
> >> (gdb) run -Dvv
> >> Starting program: /usr/local/sbin/slurmctld -Dvv
> >> Reading symbols for shared libraries ++. done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries .. done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Reading symbols for shared libraries . done
> >> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
> >>
> >> Program received signal SIGABRT, Aborted.
> >> 0x94630e42 in __kill ()
> >> (gdb) bt full
> >> #0  0x94630e42 in __kill ()
> >> No symbol table info available.
> >> #1  0x94630e34 in kill$UNIX2003 ()
> >> No symbol table info available.
> >> #2  0x946a323a in raise ()
> >> No symbol table info available.
> >> #3  0x946af679 in abort ()
> >> No symbol table info available.
> >> #4  0x946a43db in __assert_rtn ()
> >> No symbol table info available.
> >> #5  0x00087abd in list_count ()
> >> No symbol table info available.
> >> #6  0x003b5ade in _create_part_data ()
> >> No symbol table info available.
> >> #7  0x003b8dd9 in select_p_node_init ()
> >> No symbol table info available.
> >> #8  0x000a9796 in select_g_node_init ()
> >> No symbol table info available.
> >> #9  0x00059153 in read_slurm_conf ()
> >> No symbol table info available.
> >> #10 0x0000a3ec in main ()
> >> No symbol table info available.
> >>
> >> Tyler
> >>
> >> On 05/16/2011 11:43 AM, Auble, Danny wrote:
> >>> Could you configure with the --enable-debug option and recompile?  In any 
> >>> case, this appears to be a wild goose chase.  Could you also try to 
> >>> compile against the latest trunk in the git repo on GitHub?  It has 
> >>> other places fixed in headers to make sure we don't miss one in the 
> >>> future.
> >>>
> >>> Danny
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected] 
> >>>> [mailto:[email protected]] On Behalf Of Tyler
> >>>> Strickland
> >>>> Sent: Friday, May 13, 2011 12:03 PM
> >>>> To: [email protected]
> >>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>
> >>>> Here's the full gdb output.  What might cause slurm to not be able to
> >>>> access the memory?
> >>>>
> >>>> (gdb) run -Dvv
> >>>> Starting program: /usr/local/sbin/slurmctld -Dvv
> >>>> Reading symbols for shared libraries ++. done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries .. done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>> Reading symbols for shared libraries . done
> >>>>
> >>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
> >>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
> >>>> 0x945cab7e in pthread_mutex_lock ()
> >>>> (gdb) bt full
> >>>> #0  0x945cab7e in pthread_mutex_lock ()
> >>>> No symbol table info available.
> >>>> #1  0x00079eda in list_count ()
> >>>> No symbol table info available.
> >>>> #2  0x00337e0e in _create_part_data ()
> >>>> No symbol table info available.
> >>>> #3  0x0033b109 in select_p_node_init ()
> >>>> No symbol table info available.
> >>>> #4  0x00096ee9 in select_g_node_init ()
> >>>> No symbol table info available.
> >>>> #5  0x000504e3 in read_slurm_conf ()
> >>>> No symbol table info available.
> >>>> #6  0x0000a768 in main ()
> >>>> No symbol table info available.
> >>>> (gdb)
> >>>>
> >>>>
> >>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
> >>>>> Could you run it in gdb and get the backtrace?
> >>>>>
> >>>>> gdb slurmctld
> >>>>> (gdb) run -Dvv
> >>>>> ...crash...
> >>>>> (gdb) bt full
> >>>>>
> >>>>>
> >>>>> That might give us something.
> >>>>>
> >>>>> Danny
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [email protected] 
> >>>>>> [mailto:[email protected]] On Behalf Of Tyler
> >>>>>> Strickland
> >>>>>> Sent: Friday, May 13, 2011 11:33 AM
> >>>>>> To: [email protected]
> >>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>
> >>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
> >>>>>> about doing that?  One of the past list posts said to run 'ulimit -c
> >>>>>> unlimited' followed by slurmctld -D, after which the core dump would be
> >>>>>> placed in the current directory (/tmp).  Unfortunately, nothing is to
> >>>>>> be found in the folder after the crash.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Tyler
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:
> >>>>>>> If you can get a core file on SIGBUS and generate a backtrace, that 
> >>>>>>> may help.
> >>>>>>> ________________________________________
> >>>>>>> From: [email protected] [[email protected]]
> >>>>>>> On Behalf Of Tyler Strickland [[email protected]]
> >>>>>>> Sent: Friday, May 13, 2011 10:42 AM
> >>>>>>> To: [email protected]
> >>>>>>> Subject: [slurm-dev] slurmctld not starting on OSX 10.5
> >>>>>>>
> >>>>>>> All,
> >>>>>>>
> >>>>>>> After the fun with getting SLURM compiled last night, I've finally
> >>>>>>> succeeded at getting it installed.  slurmd starts up fine but
> >>>>>>> slurmctld doesn't - and there are no errors indicating why. When I try
> >>>>>>> to run it with -D the words "Bus Error" are printed and the log appears
> >>>>>>> much like the one below.
> >>>>>>>
> >>>>>>> The logfile for "slurmd -cvvvvvvvvv"
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Tyler
> >>>>>>>
> >>>>>>> [2011-05-13T13:39:29] pidfile not locked, assuming no running daemon
> >>>>>>> [2011-05-13T13:39:29] debug:  sched: slurmctld starting
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/accounting_storage_none.so
> >>>>>>> [2011-05-13T13:39:29] Accounting storage NOT INVOKED plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: not enforcing associations and no list was
> >>>>>>> given so we are giving a blank list
> >>>>>>> [2011-05-13T13:39:29] debug2: No Assoc usage file
> >>>>>>> (/var/lib/slurm/slurmctld/assoc_usage) to recover
> >>>>>>> [2011-05-13T13:39:29] slurmctld version 2.2.5 started on cluster cluster
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/crypto_munge.so
> >>>>>>> [2011-05-13T13:39:29] Munge cryptographic signature plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/select_cons_res.so
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/preempt_none.so
> >>>>>>> [2011-05-13T13:39:29] preempt/none loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/checkpoint_none.so
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] Checkpoint plugin loaded: checkpoint/none
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/jobacct_gather_none.so
> >>>>>>> [2011-05-13T13:39:29] Job accounting gather NOT_INVOKED plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug:  No backup controller to shutdown
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/switch_none.so
> >>>>>>> [2011-05-13T13:39:29] switch NONE plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/topology_none.so
> >>>>>>> [2011-05-13T13:39:29] topology NONE plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug:  No DownNodes
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/jobcomp_none.so
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
> >>>>>>> /usr/local/lib/slurm/sched_backfill.so
> >>>>>>> [2011-05-13T13:39:29] sched: Backfill scheduler plugin loaded
> >>>>>>> [2011-05-13T13:39:29] debug3: Success.
> >>>>>>> [2011-05-13T13:39:29] debug:  No job state file
> >>>>>>> (/var/lib/slurm/slurmctld/job_state) to recover
> >>>>>>> [2011-05-13T13:39:29] cons_res: select_p_node_init
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>
> >
> >
> > · · · · — · · — — —
> > Jon O. Bringhurst
> > High Performance Computing Systems - http://lanl.gov
> >
> > Email: [email protected]  | Office: +1 505 667 9337 | Blog: 
> > http://bringhurst.org
> > Schedule: B
> >
> >
> 
> 
