Jon,
One additional note: I changed from select/cons_res to select/linear in
the configuration file just to see if the problem is with
select_cons_res and still had the crash. Here's the trace:
cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
Assertion failed: (l != NULL), function list_iterator_create, file
list.c, line 698.
Abort trap
cgrc-xs11:~ root# gdb /usr/local/sbin/slurmctld
GNU gdb 6.3.50-20050815 (Apple version gdb-967) (Tue Jul 14 02:11:58 UTC
2009)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-apple-darwin"...Reading symbols for
shared libraries ... done
(gdb) run -Dvv
Starting program: /usr/local/sbin/slurmctld -Dvv
Reading symbols for shared libraries ++. done
Reading symbols for shared libraries . done
Reading symbols for shared libraries .. done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Assertion failed: (l != NULL), function list_iterator_create, file
list.c, line 698.
Program received signal SIGABRT, Aborted.
0x94630e42 in __kill ()
(gdb) bt full
#0 0x94630e42 in __kill ()
No symbol table info available.
#1 0x94630e34 in kill$UNIX2003 ()
No symbol table info available.
#2 0x946a323a in raise ()
No symbol table info available.
#3 0x946af679 in abort ()
No symbol table info available.
#4 0x946a43db in __assert_rtn ()
No symbol table info available.
#5 0x00089447 in list_iterator_create ()
No symbol table info available.
#6 0x003b8591 in _init_node_cr ()
No symbol table info available.
#7 0x003ba776 in select_p_reconfigure ()
No symbol table info available.
#8 0x000aaa0a in select_g_reconfigure ()
No symbol table info available.
#9 0x000594ef in read_slurm_conf ()
No symbol table info available.
#10 0x0000a3ec in main ()
No symbol table info available.
Thanks for your help,
Tyler
On 05/16/2011 08:32 PM, Tyler Strickland wrote:
Sorry Jon, no change with the patch.
--Tyler
On 05/16/2011 05:39 PM, Jon Bringhurst wrote:
Although this is a shot in the dark, try to apply the following patch
and see if it changes anything:
https://gist.github.com/975422
-Jon
On May 16, 2011, at 3:27 PM, Jon Bringhurst wrote:
This might have something to do with the __APPLE__ weak imports in
src/plugins/select/cons_res/select_cons_res.c.
Chaos master HEAD doesn't seem to get this on my OS X 10.6 install.
Unfortunately I don't have anything running 10.5 available to debug
this one. :\
-Jon
On May 16, 2011, at 2:57 PM, Tyler Strickland wrote:
Here's the result of recompiling with --enable-debug:
cgrc-xs11:~ root# /usr/local/sbin/slurmctld -Dvv
Assertion failed: (l != NULL), function list_count, file list.c,
line 351.
Abort trap
And here's the gdb output:
(gdb) run -Dvv
Starting program: /usr/local/sbin/slurmctld -Dvv
Reading symbols for shared libraries ++. done
Reading symbols for shared libraries . done
Reading symbols for shared libraries .. done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Assertion failed: (l != NULL), function list_count, file list.c,
line 351.
Program received signal SIGABRT, Aborted.
0x94630e42 in __kill ()
(gdb) bt full
#0 0x94630e42 in __kill ()
No symbol table info available.
#1 0x94630e34 in kill$UNIX2003 ()
No symbol table info available.
#2 0x946a323a in raise ()
No symbol table info available.
#3 0x946af679 in abort ()
No symbol table info available.
#4 0x946a43db in __assert_rtn ()
No symbol table info available.
#5 0x00087abd in list_count ()
No symbol table info available.
#6 0x003b5ade in _create_part_data ()
No symbol table info available.
#7 0x003b8dd9 in select_p_node_init ()
No symbol table info available.
#8 0x000a9796 in select_g_node_init ()
No symbol table info available.
#9 0x00059153 in read_slurm_conf ()
No symbol table info available.
#10 0x0000a3ec in main ()
No symbol table info available.
Tyler
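For comparison, both backtraces above reach a list function with a NULL list from inside the select plugin while read_slurm_conf() is running; only the entry point differs (select_g_reconfigure in one, select_g_node_init in the other). A minimal sketch for pulling the call chain out of pasted "bt full" output follows. The helper names call_chain and after_marker are made up for this example, and the frames are copied from the traces in this thread.

```python
import re

# Matches gdb frame lines such as "#5 0x00089447 in list_iterator_create ()".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def call_chain(bt_text):
    """Return the function names in a gdb backtrace, innermost frame first."""
    chain = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            chain.append(match.group(1))
    return chain


def after_marker(chain, marker="__assert_rtn"):
    """Drop the libc frames up to and including the assert handler, if present."""
    if marker in chain:
        return chain[chain.index(marker) + 1:]
    return chain


# Frames copied from the two assertion traces in this thread.
TRACE_LINEAR = """\
#4 0x946a43db in __assert_rtn ()
#5 0x00089447 in list_iterator_create ()
#6 0x003b8591 in _init_node_cr ()
#7 0x003ba776 in select_p_reconfigure ()
#8 0x000aaa0a in select_g_reconfigure ()
#9 0x000594ef in read_slurm_conf ()
#10 0x0000a3ec in main ()
"""

TRACE_CONS_RES = """\
#4 0x946a43db in __assert_rtn ()
#5 0x00087abd in list_count ()
#6 0x003b5ade in _create_part_data ()
#7 0x003b8dd9 in select_p_node_init ()
#8 0x000a9796 in select_g_node_init ()
#9 0x00059153 in read_slurm_conf ()
#10 0x0000a3ec in main ()
"""

for name, text in (("linear", TRACE_LINEAR), ("cons_res", TRACE_CONS_RES)):
    # Print the chain outermost-first so the shared prefix is easy to see.
    print(name, " -> ".join(reversed(after_marker(call_chain(text)))))
```

Printed outermost-first, both chains start with main -> read_slurm_conf and then diverge into the select plugin, which suggests looking at what read_slurm_conf() hands to the plugin rather than at either plugin alone.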
On 05/16/2011 11:43 AM, Auble, Danny wrote:
Could you configure with the --enable-debug option and recompile? In
any case, this appears to be a wild goose chase. Could you also try
to compile against the latest trunk in the git repo on github? It
has other places fixed in headers to make sure we don't miss one in
the future.
Danny
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Tyler
Strickland
Sent: Friday, May 13, 2011 12:03 PM
To: [email protected]
Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
Here's the full gdb output. What might cause slurm to not be able to
access the memory?
(gdb) run -Dvv
Starting program: /usr/local/sbin/slurmctld -Dvv
Reading symbols for shared libraries ++. done
Reading symbols for shared libraries . done
Reading symbols for shared libraries .. done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000014
0x945cab7e in pthread_mutex_lock ()
(gdb) bt full
#0 0x945cab7e in pthread_mutex_lock ()
No symbol table info available.
#1 0x00079eda in list_count ()
No symbol table info available.
#2 0x00337e0e in _create_part_data ()
No symbol table info available.
#3 0x0033b109 in select_p_node_init ()
No symbol table info available.
#4 0x00096ee9 in select_g_node_init ()
No symbol table info available.
#5 0x000504e3 in read_slurm_conf ()
No symbol table info available.
#6 0x0000a768 in main ()
No symbol table info available.
(gdb)
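On the question of why the memory could not be accessed: a fault address as small as 0x00000014 is what one would typically expect when a struct member is read through a NULL pointer, because the address is then just the member's offset from zero. That fits list_count() being handed a NULL list, which is also what the debug build's assertion (l != NULL) elsewhere in this thread reports. The sketch below only illustrates the arithmetic. HypotheticalList and its field layout are invented so that the numbers line up; they are not the real structure from list.c, and the real 0x14 may equally come from an offset inside the pthread mutex itself.

```python
import ctypes


class HypotheticalList(ctypes.Structure):
    # Invented layout for illustration only; NOT the real struct from list.c.
    # Fixed-width 32-bit fields mimic pointers on a 32-bit (i386) build.
    _fields_ = [
        ("head", ctypes.c_uint32),
        ("tail", ctypes.c_uint32),
        ("iterators", ctypes.c_uint32),
        ("free_fn", ctypes.c_uint32),
        ("count", ctypes.c_uint32),
        ("mutex", ctypes.c_uint32 * 11),
    ]


def member_address(base, struct_type, member):
    """Address the CPU would touch for base->member."""
    return base + getattr(struct_type, member).offset


NULL = 0
fault = member_address(NULL, HypotheticalList, "mutex")
print("access through NULL would fault at", hex(fault))
```

With a valid base pointer the same offset lands inside the allocated struct; with NULL it lands in the unmapped first page, which the kernel reports as a protection failure.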
On 05/13/2011 02:36 PM, Auble, Danny wrote:
Could you run it in gdb and get the backtrace?
gdb slurmctld
(gdb) run -Dvv
...crash...
(gdb) bt full
That might give us something.
Danny
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Tyler
Strickland
Sent: Friday, May 13, 2011 11:33 AM
To: [email protected]
Subject: Re: [slurm-dev] slurmctld not starting on OSX 10.5
At the risk (OK, guarantee) of showing my ignorance, how might I go
about doing that? One of the past list posts said to run 'ulimit -c
unlimited' followed by slurmctld -D, after which the core dump would be
placed in the current directory (/tmp). Unfortunately, nothing is to be
found in the folder after the crash.
Thanks,
Tyler
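A note on the missing core file: 'ulimit -c unlimited' only affects the shell it is typed in and the processes started from it, so the daemon has to be launched from that same shell. Also, as far as I know, Mac OS X writes core files to /cores by default rather than to the current directory, so that is worth checking before concluding that no core was produced. The sketch below shows the same limit being raised programmatically and applied to a child process. The helper names raise_core_limit and run_with_cores are made up for this example.

```python
import resource
import subprocess


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit.

    Returns the resulting (soft, hard) pair. Raising the soft limit up to
    the hard limit does not require root.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def run_with_cores(argv):
    """Run argv with the raised limit applied in the child just before exec.

    Returns the child's exit status; a negative value means it was killed
    by that signal number.
    """
    return subprocess.run(argv, preexec_fn=raise_core_limit).returncode
```

For example, run_with_cores(["/usr/local/sbin/slurmctld", "-D"]) would start the controller in the foreground with the raised limit. If the hard limit itself is 0, no core will be written regardless, and that has to be changed by an administrator.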
On 05/13/2011 02:14 PM, Jette, Moe wrote:
If you can get a core file on SIGBUS and generate a backtrace,
that may help.
________________________________________
From: [email protected]
[[email protected]] On Behalf Of Tyler
Strickland
[[email protected]]
Sent: Friday, May 13, 2011 10:42 AM
To: [email protected]
Subject: [slurm-dev] slurmctld not starting on OSX 10.5
All,
After the fun with getting SLURM compiled last night, I've finally
succeeded at getting it installed. slurmd starts up fine but slurmctld
doesn't, and there are no errors indicating why. When I try to run it
with -D, the words "Bus Error" are printed and the log looks much
like the one below.
The logfile for "slurmd -cvvvvvvvvv"
Thanks,
Tyler
[2011-05-13T13:39:29] pidfile not locked, assuming no running daemon
[2011-05-13T13:39:29] debug: sched: slurmctld starting
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_none.so
[2011-05-13T13:39:29] Accounting storage NOT INVOKED plugin loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug3: not enforcing associations and no list was given so we are giving a blank list
[2011-05-13T13:39:29] debug2: No Assoc usage file (/var/lib/slurm/slurmctld/assoc_usage) to recover
[2011-05-13T13:39:29] slurmctld version 2.2.5 started on cluster cluster
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
[2011-05-13T13:39:29] Munge cryptographic signature plugin loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/select_cons_res.so
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
[2011-05-13T13:39:29] preempt/none loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_none.so
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] Checkpoint plugin loaded: checkpoint/none
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
[2011-05-13T13:39:29] Job accounting gather NOT_INVOKED plugin loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug: No backup controller to shutdown
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
[2011-05-13T13:39:29] switch NONE plugin loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
[2011-05-13T13:39:29] topology NONE plugin loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug: No DownNodes
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
[2011-05-13T13:39:29] sched: Backfill scheduler plugin loaded
[2011-05-13T13:39:29] debug3: Success.
[2011-05-13T13:39:29] debug: No job state file (/var/lib/slurm/slurmctld/job_state) to recover
[2011-05-13T13:39:29] cons_res: select_p_node_init
· · · · — · · — — —
Jon O. Bringhurst
High Performance Computing Systems - http://lanl.gov
Email: [email protected] | Office: +1 505 667 9337 | Blog:
http://bringhurst.org
Schedule: B