Hi Danny,
One further update:
So I'm looking through the backfill code now, and I'm wondering: should
this branch below (from line 724 of _attempt_backfill() in backfill.c)
ever be taken?
} else if (rc != SLURM_SUCCESS) {
/* Planned to start job, but something bad
* happended. */
job_ptr->start_time = 0;
break;
where we're testing the result of:
int rc = _start_job(job_ptr, resv_bitmap);
(just for a bit of context)
From the comment I'm guessing that if we're taking this branch
something is going wrong and so we don't run any further backfill tests
for the remaining pending jobs.
With this patch (the last hunk is the important one):
---
src/plugins/sched/backfill/backfill.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
===================================================================
--- slurm-2.3.3a6dca4.orig/src/plugins/sched/backfill/backfill.c
+++ slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
@@ -500,6 +500,9 @@ static int _attempt_backfill(void)
filter_root = true;
job_queue = build_job_queue(true);
+
+ debug("MARK backfill: have to look at %d jobs", list_count(job_queue));
+
if (list_count(job_queue) <= 1) {
debug("backfill: no jobs to backfill");
list_destroy(job_queue);
@@ -692,8 +695,13 @@ static int _attempt_backfill(void)
if (j != SLURM_SUCCESS) {
job_ptr->time_limit = orig_time_limit;
job_ptr->start_time = 0;
+
+ debug("MARK backfill: %d isn't runnable right now, move
on to the
next one", job_ptr->job_id);
+
continue; /* not runable */
}
+ else
+ debug("MARK backfill: %d IS runnable right now, let's
run it",
job_ptr->job_id);
if (start_res > job_ptr->start_time) {
job_ptr->start_time = start_res;
@@ -722,6 +730,7 @@ static int _attempt_backfill(void)
job_ptr->start_time = 0;
continue;
} else if (rc != SLURM_SUCCESS) {
+ debug("MARK backfill: We're about to break out
of the backfill loop");
/* Planned to start job, but something bad
* happended. */
job_ptr->start_time = 0;
I get the following in the log (important line highlighted with *******):
[2012-02-25T22:37:31] =========================================
[2012-02-25T22:37:31] Begin:2012-02-25T22:37:31 End:2012-02-26T12:58:00
Nodes:bgp[000x011]
[2012-02-25T22:37:31] Begin:2012-02-26T12:58:00 End:2012-02-26T20:10:00
Nodes:bgp[010x011]
[2012-02-25T22:37:31] Begin:2012-02-26T20:10:00 End:2012-02-26T22:37:31
Nodes:
[2012-02-25T22:37:31] =========================================
[2012-02-25T22:37:31] backfill test for job 33258
[2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc
28 (account vr0126) which has usage of 187
14286 and is currently reserving 3121949, with a limit of 113195280
[2012-02-25T22:37:31] debug2: backfill: entering _try_sched for job 33258.
[2012-02-25T22:37:31] debug: bluegene:submit_job: 33258 mode=514
Connection=Torus Reboot=no Rotate=yes Geometry=0x
0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-25T22:37:31] debug3: ComputNodeImage=default
[2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
[2012-02-25T22:37:31] debug3: MloaderImage=default
[2012-02-25T22:37:31] debug2: 1 can't allocate
[2012-02-25T22:37:31] debug2: name = 000
[2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
[2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1
start is 000
[2012-02-25T22:37:31] debug: _find_best_block_match (null) <bgp000>
[2012-02-25T22:37:31] debug: 514 can start unassigned job 33258 at
1330169851 on bgp000
[2012-02-25T22:37:31] debug2: backfill: finished _try_sched for job 33258.
[2012-02-25T22:37:31] debug: MARK backfill: 33258 IS runnable right
now, let's run it
[2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc
28 (account vr0126) which has usage of 187
14286 and is currently reserving 3121949, with a limit of 113195280
[2012-02-25T22:37:31] debug2: found 4 usable nodes from config
containing bgp[000x011]
[2012-02-25T22:37:31] debug3: _pick_best_nodes: job 33258 idle_nodes 0
share_nodes 4
[2012-02-25T22:37:31] debug: bluegene:submit_job: 33258 mode=512
Connection=Torus Reboot=no Rotate=yes Geometry=0x
0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-25T22:37:31] debug3: ComputNodeImage=default
[2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
[2012-02-25T22:37:31] debug3: MloaderImage=default
[2012-02-25T22:37:31] debug: bluegene:submit_job: 33258 mode=513
Connection=Torus Reboot=no Rotate=yes Geometry=0x
0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-25T22:37:31] debug3: ComputNodeImage=default
[2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
[2012-02-25T22:37:31] debug3: MloaderImage=default
[2012-02-25T22:37:31] debug2: 1 can't allocate
[2012-02-25T22:37:31] debug2: name = 000
[2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
[2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1
start is 000
[2012-02-25T22:37:31] debug: _find_best_block_match (null) <bgp000>
[2012-02-25T22:37:31] debug: 513 can start unassigned job 33258 at
1330169851 on bgp000
[2012-02-25T22:37:31] backfill: Failed to start JobId=33258 on
bgp[000x011]: Requested nodes are busy
*******[2012-02-25T22:37:31] debug: MARK backfill: We're about to break
out of the backfill loop
[2012-02-25T22:37:31] backfill: completed testing 3 jobs, usec=368180
With the queue looking like this:
tambo:~ # squeue -o "%.7i %.9P %.8j %.2t %.10M %.6D %R"
JOBID PARTITION NAME ST TIME NODES BP_LIST(REASON)
33295 main bgtest.4 PD 0:00 1024 (Priority)
33288 main bgtest.4 PD 0:00 1024 (Resources)
33258 main logistic PD 0:00 512 (Priority)
33259 main logistic PD 0:00 512 (Priority)
33261 main logistic PD 0:00 512 (Priority)
33294 main bgtest.2 PD 0:00 512 (Priority)
33264 main sbatch_p PD 0:00 128 (Priority)
33265 main sbatch_p PD 0:00 128 (Priority)
33266 main sbatch_p PD 0:00 128 (Priority)
33293 main bgtest.1 PD 0:00 256 (Priority)
33272 main sbatch_p PD 0:00 128 (Priority)
33292 main bgtest.5 PD 0:00 128 (Priority)
33274 main sbatch_p PD 0:00 64 (Priority)
33280 main sbatch_p PD 0:00 64 (Priority)
33281 main bgtest.6 PD 0:00 64 (Priority)
33289 main bgtest.6 PD 0:00 64 (Priority)
33290 main bgtest.1 PD 0:00 64 (Priority)
33291 main bgtest.2 PD 0:00 64 (Priority)
33297 main sbatch_s PD 0:00 64 (Priority)
33303 main tambo PD 0:00 128 (Priority)
33307 main PV1W_Pen PD 0:00 128 (Priority)
33310 main sbatch_s PD 0:00 64 (Priority)
33314 main sbatch_s PD 0:00 64 (Priority)
33225 main sbatch_s R 9:40:49 512 bgp001
33254 main 20pi_Re1 R 1:29:05 512 bgp010
33224 main sbatch_s R 13:37:11 512 bgp011
33263 main sbatch_p R 12:03:33 128 bgp000[6-7]
33309 main sbatch_s R 3:57:37 64 bgp000[3]
33313 main sbatch_s R 3:34:10 64 bgp000[2]
33312 main sbatch_s R 3:44:21 64 bgp000[4]
33311 main sbatch_s R 3:47:25 64 bgp000[1]
So there's clearly more than three pending jobs that need to be backfill
tested, but after the third job that we test we break out of the while
loop in _attempt_backfill() because of that branch being taken. I'm
still digging further to see why this happens but I figured I'd share
what I'd found so far.
Any thoughts?
Many thanks!
Mark
On 25/02/12 10:36, Mark Nelson wrote:
>
> Many thanks for your quick response on this Danny!
>
<snip>
>> This line...
>>
>> [2012-02-24T16:08:30] debug: 513 can start unassigned job 33190 at
>> 1330060110 on bgp000
>>
>> states the job is unassigned to a block.
>
> When I convert 1330060110 to a human readable date format I get: Fri 24
> Feb 2012 16:08:30 EST GMT+11 (which is "right now" for when this was
> actually happening). I assumed this meant that SLURM thought that it
> could start the job 33190 "right now" - at 1330060110... What does that
> timestamp (1330060110) mean on that line?
>
>>
>> The job is not assigned to this future made block, or any other block,
>> that is why it doesn't run. As you point out there are a bunch of jobs
>> in the way of it, that is why it doesn't start or why the block doesn't
>> get a bg_block_id.
>>
>> If you read the code just a bit further, in the bg_job_place.c
>> submit_job() function you will see this future block is not added to the
>> main block list because of the null bg_block_id.
>
> Oh yes, the block of code on 1667. I'm not sure why, but I had assumed
> that bg_block_id was pointing to a string with contents "(null)" (which
> sounds incredibly stupid now ;) ) so I thought that branch wasn't taken.
> Ooops! Clearly need more coffee, or something ;)
>
> There is definite funkiness going on with our Blue Gene: looking at the
> machine this morning an entire midplane had drained and was filled with
> small blocks, none of which were used, but a midplane sized job wasn't
> running there even though it was pending and had the highest priority.
> Removing the blocks on the midplane in sview allowed the midplane-sized
> job to start running on that now block-less midplane.
>
> Clearly I need to spend more time with this code to work out what's
> going on.
>
>> Hope this helps.
>
> It does - it all helps put the pieces together ;)
>
> Thanks!
> Mark