Hi Danny,
We're still seeing a few issues with our Blue Gene/P on SLURM 2.3.3 and
now that I've dug through a bit of the bluegene select plugin code I've
got a few questions that I hope you can answer. Actually, just one:
_dynamically_request() (in bg_job_place.c) calls create_dynamic_block()
on line 794. Is it normal for the list of bg_records returned by
create_dynamic_block() to have a bg_record with a bg_block_id of
"(null)", ionode_str of "(null)" and no job running (job_running of -1)?
I ask because that's what we're seeing. This null block gets added to
the block_list on line 815 and is then selected by _find_matching_block()
as the ideal place to run a pending job. It is passed back up through
_find_best_block_match() to submit_job(), which decides that it can start
the pending job on the null block. In reality the block is bogus, and
when backfill tries to start the job we get this:
debug3: backfill: Failed to start JobId=33190: Requested nodes are busy
(I think that's what's happening anyway)
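To be precise about what I mean by the "null block", here is a minimal standalone sketch of the condition. The struct and function names (record_view, is_unnamed_idle_record) are hypothetical stand-ins for illustration only; they are not the real bg_record_t or anything in the plugin, and I'm assuming the "(null)" strings in the log correspond to NULL pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the three bg_record fields mentioned above;
 * the real record in the bluegene plugin has many more members. */
struct record_view {
	const char *bg_block_id;  /* assumed NULL when the log shows "(null)" */
	const char *ionode_str;   /* assumed NULL when the log shows "(null)" */
	int job_running;          /* -1 is the "no job running" value we see */
};

/* True for a record with no block id, no ionode string and no job. */
static bool is_unnamed_idle_record(const struct record_view *rec)
{
	assert(rec != NULL);
	return rec->bg_block_id == NULL
		&& rec->ionode_str == NULL
		&& rec->job_running == -1;
}
```

This only describes the record we're seeing; whether such a record is legitimate (for example a block that has not been created yet) is exactly what I'm asking.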
The following patch:
---
src/plugins/select/bluegene/bg_job_place.c | 1 +
1 file changed, 1 insertion(+)
Index: slurm-2.3.3/src/plugins/select/bluegene/bg_job_place.c
===================================================================
--- slurm-2.3.3.orig/src/plugins/select/bluegene/bg_job_place.c
+++ slurm-2.3.3/src/plugins/select/bluegene/bg_job_place.c
@@ -821,6 +821,7 @@ static int _dynamically_request(List blo
preempted blocks have time
to clear out.
*/
+ info("MARK: _dynamically_request: we're about to add bg_record to block_list with bg_block_id of %s, with ionodes %s, job_running of %d, reason of %s", bg_record->bg_block_id, bg_record->ionode_str, bg_record->job_running, bg_record->reason);
list_append(block_list, bg_record);
(*blocks_added) = 1;
} else {
gives the following in the slurmctld.log file (along with some other
patches that I've got to print out other things like the number of
bg_records in the block_list and what bg_records we're testing):
[2012-02-24T16:08:30] MARK: _dynamically_request: we're about to add
bg_record to block_list with bg_block_id of (null), with ionodes (null),
job_running of -1, reason of (null)
[2012-02-24T16:08:30] _find_best_block_match: before calling
_find_matching_block, block_list has 19 blocks
[2012-02-24T16:08:30] number of blocks to check: 19 state 513 asking for
2048-2048 cpus
[2012-02-24T16:08:30] MARK: we've checking bg_block_id of
RMP24Fe064657709, with ionodes 0, job_running of -1, reason of (null)
[2012-02-24T16:08:30] block RMP24Fe064657709 CPU count (256) not suitable
[2012-02-24T16:08:30] MARK: we've checking bg_block_id of
RMP24Fe130528293, with ionodes 5, job_running of -1, reason of (null)
[2012-02-24T16:08:30] block RMP24Fe130528293 CPU count (256) not suitable
[2012-02-24T16:08:30] MARK: we've checking bg_block_id of
RMP18Fe120532215, with ionodes 0-1, job_running of -1, reason of (null)
[2012-02-24T16:08:30] block RMP18Fe120532215 CPU count (512) not suitable
[2012-02-24T16:08:30] MARK: we've checking bg_block_id of (null), with
ionodes (null), job_running of -1, reason of (null)
[2012-02-24T16:08:30] we found one! (null)
[2012-02-24T16:08:30] debug: _find_best_block_match (null) <bgp000>
[2012-02-24T16:08:30] debug: 513 can start unassigned job 33190 at
1330060110 on bgp000
[2012-02-24T16:08:30] debug3: backfill: Failed to start JobId=33190:
Requested nodes are busy
[2012-02-24T16:08:30] backfill: completed testing 1 jobs, usec=69940
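A side note on the "(null)" strings in the log: I assume they come from NULL char pointers being handed to %s in my debug line. glibc prints "(null)" in that case, but the C standard leaves it undefined, so a more careful version of the debug output would substitute the string explicitly. A minimal sketch, where STR_OR_NULL and format_mark_fields are hypothetical names of my own and not part of SLURM:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of SLURM: never hand a NULL pointer to %s. */
#define STR_OR_NULL(s) ((s) != NULL ? (s) : "(null)")

/* Hypothetical helper: format the same four fields as the MARK line into
 * buf. Returns what snprintf() returns (the untruncated length). */
static int format_mark_fields(char *buf, size_t len,
			      const char *bg_block_id, const char *ionode_str,
			      int job_running, const char *reason)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
			"bg_block_id of %s, with ionodes %s, "
			"job_running of %d, reason of %s",
			STR_OR_NULL(bg_block_id), STR_OR_NULL(ionode_str),
			job_running, STR_OR_NULL(reason));
}
```

The output is the same as what the log shows on our system; the only difference is that it no longer depends on how the C library treats a NULL argument.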
At the time this log was taken, the system had 18 blocks, one fewer than
the 19 reported for block_list above:
~> sinfo -b -h | wc -l
18
~> sinfo -b
BG_BLOCK          MIDPLANES  OWNER     STATE  CONNECTION  USE
RMP24Fe064657709  bgp000     slurm     Ready  Small       COPROCESSOR
RMP24Fe064657829  bgp000     mbuskes   Ready  Small       COPROCESSOR
RMP22Fe231504417  bgp000     mbuskes   Ready  Small       COPROCESSOR
RMP22Fe231504545  bgp000     bwgoudey  Ready  Small       COPROCESSOR
RMP24Fe130528174  bgp000     bwgoudey  Ready  Small       COPROCESSOR
RMP24Fe130528293  bgp000     slurm     Ready  Small       COPROCESSOR
RMP23Fe025220481  bgp000     christen  Ready  Small       COPROCESSOR
RMP18Fe120532215  bgp001     slurm     Ready  Small       COPROCESSOR
RMP19Fe174227862  bgp001     bwgoudey  Ready  Small       COPROCESSOR
RMP22Fe105038431  bgp001     bwgoudey  Ready  Small       COPROCESSOR
RMP22Fe105038550  bgp001     mbuskes   Ready  Small       COPROCESSOR
RMP24Fe125557160  bgp001     bwgoudey  Ready  Small       COPROCESSOR
RMP24Fe125557297  bgp001     bwgoudey  Ready  Small       COPROCESSOR
RMP24Fe100736197  bgp011     bwgoudey  Ready  Small       COPROCESSOR
RMP24Fe064622640  bgp011     christen  Ready  Small       COPROCESSOR
RMP24Fe064622766  bgp011     mike      Ready  Small       COPROCESSOR
RMP24Fe110947199  bgp011     christen  Ready  Small       COPROCESSOR
RMP08Fe140222251  bgp010     aooi      Ready  Torus       COPROCESSOR
Any help would be greatly appreciated.
Thanks!
Mark