Although this is a shot in the dark, try to apply the following patch and see 
if it changes anything:

https://gist.github.com/975422

-Jon

On May 16, 2011, at 3:27 PM, Jon Bringhurst wrote:

> This might have something to do with the __APPLE__ weak imports in 
> src/plugins/select/cons_res/select_cons_res.c.
> 
> Chaos master HEAD doesn't seem to get this on my OS X 10.6 install. 
> Unfortunately I don't have anything running 10.5 available to debug this one. 
> :\
> 
> -Jon
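As background for the weak-import theory above: a weakly imported symbol may legitimately be absent at load time, in which case its address evaluates to NULL and the code has to check before using it. The sketch below shows the idea with the GCC/Clang `weak` attribute on an ELF platform such as Linux; Apple's toolchain spells the attribute `weak_import` and needs its own linker settings, so this will not link unchanged on OS X. optional_hook and call_hook_or_default are hypothetical names, and none of this is taken from select_cons_res.c.

```c
#include <assert.h>
#include <stddef.h>

/* Weak reference: the link succeeds even though no definition exists.
 * The symbol is deliberately left undefined in this sketch. */
extern int optional_hook(void) __attribute__((weak));

/* Call the hook only if the loader actually resolved it;
 * otherwise return the caller-supplied fallback value. */
int call_hook_or_default(int fallback)
{
    if (optional_hook)
        return optional_hook();
    return fallback;
}
```

Code that skips the check and uses an unresolved weak symbol directly would fault, which is the kind of failure the theory above has in mind.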
> 
> On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
> 
>> Here's the result of recompiling with --enable-debug:
>> 
>> cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
>> Abort trap
>> 
>> And here's the gdb output:
>> (gdb) run -Dvv
>> Starting program: /usr/local/sbin/slurmctld -Dvv
>> Reading symbols for shared libraries ++. done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries .. done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Reading symbols for shared libraries . done
>> Assertion failed: (l != NULL), function list_count, file list.c, line 351.
>> 
>> Program received signal SIGABRT, Aborted.
>> 0x94630e42 in __kill ()
>> (gdb) bt full
>> #0  0x94630e42 in __kill ()
>> No symbol table info available.
>> #1  0x94630e34 in kill$UNIX2003 ()
>> No symbol table info available.
>> #2  0x946a323a in raise ()
>> No symbol table info available.
>> #3  0x946af679 in abort ()
>> No symbol table info available.
>> #4  0x946a43db in __assert_rtn ()
>> No symbol table info available.
>> #5  0x00087abd in list_count ()
>> No symbol table info available.
>> #6  0x003b5ade in _create_part_data ()
>> No symbol table info available.
>> #7  0x003b8dd9 in select_p_node_init ()
>> No symbol table info available.
>> #8  0x000a9796 in select_g_node_init ()
>> No symbol table info available.
>> #9  0x00059153 in read_slurm_conf ()
>> No symbol table info available.
>> #10 0x0000a3ec in main ()
>> No symbol table info available.
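For readers comparing this trace with the EXC_BAD_ACCESS one quoted further down: both are consistent with list_count() being handed a NULL list. A build with assertions enabled stops at the check, while a build without them carries on and faults when it touches the list. The sketch below illustrates that difference with a toy list type; demo_list, demo_list_count and demo_count_or_zero are hypothetical names, and the guard is only an illustration, not a fix taken from SLURM.

```c
#include <assert.h>
#include <stddef.h>

/* Toy list header; only the item count matters for this sketch. */
struct demo_list {
    int count;
};

/* Same style of check as the one reported in the trace: abort on a
 * NULL list when assertions are compiled in (that is, without NDEBUG). */
int demo_list_count(const struct demo_list *l)
{
    assert(l != NULL);
    return l->count;
}

/* Hypothetical caller-side guard: treat a missing list as empty
 * instead of passing NULL down to demo_list_count(). */
int demo_count_or_zero(const struct demo_list *l)
{
    return (l != NULL) ? demo_list_count(l) : 0;
}
```

A guard like this only hides the symptom; the open question in this thread is why the list is NULL in the first place.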
>> 
>> Tyler
>> 
>> On 05/16/2011 11:43 AM, Auble, Danny wrote:
>>> Could you configure with the --with-debug option and recompile?  In any 
>>> case, this appears to be a wild goose chase.  Could you also try to 
>>> compile against the latest trunk in the git repo on github?  It has other 
>>> places fixed in headers to make sure we don't miss one in the future.
>>> 
>>> Danny
>>> 
>>>> -----Original Message-----
>>>> From: [email protected] 
>>>> [mailto:[email protected]] On Behalf Of Tyler
>>>> Strickland
>>>> Sent: Friday, May 13, 2011 12:03 PM
>>>> To: [email protected]
>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>>>> 
>>>> Here's the full gdb output.  What might cause slurm to not be able to
>>>> access the memory?
>>>> 
>>>> (gdb) run -Dvv
>>>> Starting program: /usr/local/sbin/slurmctld -Dvv
>>>> Reading symbols for shared libraries ++. done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries .. done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> Reading symbols for shared libraries . done
>>>> 
>>>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>>>> Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
>>>> 0x945cab7e in pthread_mutex_lock ()
>>>> (gdb) bt full
>>>> #0  0x945cab7e in pthread_mutex_lock ()
>>>> No symbol table info available.
>>>> #1  0x00079eda in list_count ()
>>>> No symbol table info available.
>>>> #2  0x00337e0e in _create_part_data ()
>>>> No symbol table info available.
>>>> #3  0x0033b109 in select_p_node_init ()
>>>> No symbol table info available.
>>>> #4  0x00096ee9 in select_g_node_init ()
>>>> No symbol table info available.
>>>> #5  0x000504e3 in read_slurm_conf ()
>>>> No symbol table info available.
>>>> #6  0x0000a768 in main ()
>>>> No symbol table info available.
>>>> (gdb)
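One hedged reading of the fault address above: 0x00000014 is what a 32-bit build would touch if list_count() were handed a NULL list and then tried to lock a mutex stored 20 bytes into the list structure. The sketch below assumes a header of four 32-bit pointers followed by an int count and then the mutex; the struct and field names are illustrative stand-ins and are not copied from list.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit list header. Fixed-width fields
 * emulate 4-byte pointers so the arithmetic is the same on any host. */
struct demo_list32 {
    uint32_t head;        /* would be a pointer to the first node     */
    uint32_t tail;        /* would be a pointer to the tail link      */
    uint32_t iterators;   /* would be a pointer to active iterators   */
    uint32_t del_fn;      /* would be a destructor function pointer   */
    int32_t  count;       /* number of items in the list              */
    uint32_t mutex_word;  /* first word of the embedded mutex         */
};

/* Address that locking the mutex touches for a list based at `base`. */
uint32_t mutex_address(uint32_t base)
{
    return base + (uint32_t)offsetof(struct demo_list32, mutex_word);
}

/* With a NULL (zero) base, the assumed layout gives the reported address. */
void check_reported_address(void)
{
    assert(mutex_address(0) == 0x14);
}
```

If the real layout matches this assumption, the bus error and the later assertion failure describe the same NULL list seen from two different builds.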
>>>> 
>>>> 
>>>> On 05/13/2011 02:36 PM, Auble, Danny wrote:
>>>>> Could you run it in gdb and get the backtrace?
>>>>> 
>>>>> gdb slurmctld
>>>>> (gdb) run -Dvv
>>>>> ...crash...
>>>>> (gdb) bt full
>>>>> 
>>>>> 
>>>>> That might give us something.
>>>>> 
>>>>> Danny
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: [email protected] 
>>>>>> [mailto:[email protected]] On Behalf Of Tyler
>>>>>> Strickland
>>>>>> Sent: Friday, May 13, 2011 11:33 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
>>>>>> 
>>>>>> At the risk (OK, guarantee) of showing my ignorance, how might I go
>>>>>> about doing that?  One of the past list posts said to run 'ulimit -c
>>>>>> unlimited' followed by slurmctld -D, after which the core dump would be
>>>>>> placed in the current directory (/tmp).  Unfortunately, nothing is to be
>>>>>> found in the folder after the crash.
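A note for anyone retrying this: `ulimit -c unlimited` only affects the shell it is run in and that shell's children, and on Mac OS X core files are normally written to /cores rather than the current directory, so that directory is worth checking. As a rough programmatic equivalent, a process can raise its own soft core-size limit to the hard limit with the POSIX getrlimit/setrlimit calls before it crashes. enable_core_dumps is a hypothetical helper name and is not part of SLURM.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft RLIMIT_CORE to the current hard limit.
 * Returns 0 on success and -1 on failure (errno set by the failing call). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;   /* soft limit may always go up to hard */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit itself is 0, this succeeds but still produces no core file; raising the hard limit needs extra privileges.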
>>>>>> 
>>>>>> Thanks,
>>>>>> Tyler
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 05/13/2011 02:14 PM, Jette, Moe wrote:
>>>>>>> If you can get a core file on SIGBUS and generate a backtrace, that may 
>>>>>>> help.
>>>>>>> ________________________________________
>>>>>>> From: [email protected] [[email protected]] 
>>>>>>> On Behalf Of Tyler Strickland [[email protected]]
>>>>>>> Sent: Friday, May 13, 2011 10:42 AM
>>>>>>> To: [email protected]
>>>>>>> Subject: [slurm-dev] slurmctld not starting on OSX 10.5
>>>>>>> 
>>>>>>> All,
>>>>>>> 
>>>>>>> After the fun with getting SLURM compiled last night, I've finally
>>>>>>> succeeded at getting it installed.  slurmd starts up fine but slurmctld
>>>>>>> doesn't - and there are no errors indicating why. When I try to run it
>>>>>>> with -D the words "Bus Error" are printed and the log appears much
>>>>>>> like the one below.
>>>>>>> 
>>>>>>> The logfile for "slurmd -cvvvvvvvvv"
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Tyler
>>>>>>> 
>>>>>>> [2011-05-13T13:39:29] pidfile not locked, assuming no running daemon
>>>>>>> [2011-05-13T13:39:29] debug:  sched: slurmctld starting
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/accounting_storage_none.so
>>>>>>> [2011-05-13T13:39:29] Accounting storage NOT INVOKED plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: not enforcing associations and no list was
>>>>>>> given so we are giving a blank list
>>>>>>> [2011-05-13T13:39:29] debug2: No Assoc usage file
>>>>>>> (/var/lib/slurm/slurmctld/assoc_usage) to recover
>>>>>>> [2011-05-13T13:39:29] slurmctld version 2.2.5 started on cluster cluster
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/crypto_munge.so
>>>>>>> [2011-05-13T13:39:29] Munge cryptographic signature plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/select_cons_res.so
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/preempt_none.so
>>>>>>> [2011-05-13T13:39:29] preempt/none loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/checkpoint_none.so
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] Checkpoint plugin loaded: checkpoint/none
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/jobacct_gather_none.so
>>>>>>> [2011-05-13T13:39:29] Job accounting gather NOT_INVOKED plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug:  No backup controller to shutdown
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/switch_none.so
>>>>>>> [2011-05-13T13:39:29] switch NONE plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/topology_none.so
>>>>>>> [2011-05-13T13:39:29] topology NONE plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug:  No DownNodes
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/jobcomp_none.so
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug3: Trying to load plugin
>>>>>>> /usr/local/lib/slurm/sched_backfill.so
>>>>>>> [2011-05-13T13:39:29] sched: Backfill scheduler plugin loaded
>>>>>>> [2011-05-13T13:39:29] debug3: Success.
>>>>>>> [2011-05-13T13:39:29] debug:  No job state file
>>>>>>> (/var/lib/slurm/slurmctld/job_state) to recover
>>>>>>> [2011-05-13T13:39:29] cons_res: select_p_node_init
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 
> · · · · — · · — — —
> Jon O. Bringhurst
> High Performance Computing Systems - http://lanl.gov
> 
> Email: [email protected]  | Office: +1 505 667 9337 | Blog: http://bringhurst.org
> Schedule: B
> 
> 

