It looks as though the problem is caused by something not setting the
job_ptr of the bg_record that gets returned to submit_job() of the
usable-in-the-future NULL block (bg_block_id is NULL). And that means
that in submit_job(), nothing is overriding starttime (which is set to
time(NULL) at the beginning of the fuction) - neither the outer if, or
the else if of this block is taken:
if (bg_record->job_ptr
&& bg_record->job_ptr->end_time) {
if (bg_record->job_ptr->end_time <= starttime)
starttime += 5;
else
starttime = bg_record->job_ptr->end_time;
} else if (bg_record->job_running == BLOCK_ERROR_STATE)
starttime = INFINITE;
which means that this line (1583):
job_ptr->start_time = starttime;
causes submit_job() to try to run the job "now" which is why the
backfill is failing, which I identified in the last e-mail:
[2012-02-29T07:44:45] backfill: beginning
[2012-02-29T07:44:45] debug: MARK backfill: have to look at 14 jobs
[2012-02-29T07:44:45] =========================================
[2012-02-29T07:44:45] Begin:2012-02-29T07:44:45 End:2012-03-01T07:44:45
Nodes:bgp[000x011]
[2012-02-29T07:44:45] =========================================
[2012-02-29T07:44:45] backfill test for job 33394
[2012-02-29T07:44:45] acct_job_limits: job 33394 wants 2826240 from
assoc 629 (account vr0210) which has usage of 22841856 and is currently
reserving 0, with a limit of 94211280
[2012-02-29T07:44:45] debug2: backfill: entering _try_sched for job 33394.
[2012-02-29T07:44:45] debug: MARK backfill: 4 is bit_set_count of
avail_bitmap
[2012-02-29T07:44:45] debug: MARK submit_job: job 33394 has start_time
of 0 upon entry to submit_job
[2012-02-29T07:44:45] debug: bluegene:submit_job: 33394 mode=514
Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned
nodes=1-1-1
[2012-02-29T07:44:45] debug3: ComputNodeImage=default
[2012-02-29T07:44:45] debug3: RamDiskIoLoadImage=default
[2012-02-29T07:44:45] debug3: MloaderImage=default
[2012-02-29T07:44:45] debug2: 1 can't allocate
[2012-02-29T07:44:45] debug2: name = 001
[2012-02-29T07:44:45] debug3: process_nodes: 001 is included in this block
[2012-02-29T07:44:45] debug3: process_nodes: geo = 111 mp count is 1
start is 001
[2012-02-29T07:44:45] debug: _find_best_block_match (null) <bgp001>
[2012-02-29T07:44:45] debug: MARK submit_job: we've got a block now:
job 33394 has start_time of 0
[2012-02-29T07:44:45] debug: MARK submit_job: job_ptr of bg_record is NULL
[2012-02-29T07:44:45] debug: MARK submit_job: we're about to set the
start time of job 33394 to be 1330461885
[2012-02-29T07:44:45] debug: 514 can start unassigned job 33394 at
1330461885 on bgp001
[2012-02-29T07:44:45] debug: MARK _try_sched: select_g_job_test
succeeded for not sharing nodes
[2012-02-29T07:44:45] debug2: backfill: finished _try_sched for job 33394.
[2012-02-29T07:44:45] debug: MARK backfill: 33394 IS runnable right
now, let's run it
(the timestamp 1330461885 is Wed 29 Feb 2012 07:44:45)
The above logging was produced with this patch:
---
Index: slurm-2.3.3a6dca4/src/plugins/select/bluegene/bg_job_place.c
===================================================================
--- slurm-2.3.3a6dca4.orig/src/plugins/select/bluegene/bg_job_place.c
+++ slurm-2.3.3a6dca4/src/plugins/select/bluegene/bg_job_place.c
@@ -1418,6 +1418,8 @@ extern int submit_job(struct job_record
int avail_cpus = num_unused_cpus;
int dim = 0;
+ debug("MARK submit_job: job %d has start_time of %ld upon entry to
submit_job", job_ptr->job_id, job_ptr->start_time);
+
for (dim=0; dim<SYSTEM_DIMENSIONS; dim++)
conn_type[dim] = (uint16_t)NO_VAL;
if (preemptee_candidates && preemptee_job_list
@@ -1558,6 +1560,8 @@ extern int submit_job(struct job_record
}
if (rc == SLURM_SUCCESS) {
+ debug("MARK submit_job: we've got a block now: job %d has
start_time
of %ld", job_ptr->job_id, job_ptr->start_time);
+
if (!bg_record)
fatal("we got a success, but no block back");
/* Here we see if there is a job running since
@@ -1567,18 +1571,32 @@ extern int submit_job(struct job_record
* past or current time) we add 5 seconds to
* it so we don't use the block immediately.
*/
+
+ if (bg_record->job_ptr == NULL)
+ debug("MARK submit_job: job_ptr of bg_record is NULL");
+ else
+ debug("MARK submit_job: job_ptr of bg_record is for
jobid of %d",
bg_record->job_ptr->job_id);
+
if (bg_record->job_ptr
&& bg_record->job_ptr->end_time) {
- if (bg_record->job_ptr->end_time <= starttime)
+ if (bg_record->job_ptr->end_time <= starttime) {
starttime += 5;
- else
+ debug("MARK submit_job: starttime is now %ld
because:
bg_record->job_ptr->end_time <= starttime", starttime);
+ }
+ else {
starttime = bg_record->job_ptr->end_time;
+ debug("MARK submit_job: starttime is now %ld
from the else branch",
starttime);
+ }
} else if (bg_record->job_running == BLOCK_ERROR_STATE)
starttime = INFINITE;
/* make sure the job is eligible to run */
- if (job_ptr->details->begin_time > starttime)
+ if (job_ptr->details->begin_time > starttime) {
starttime = job_ptr->details->begin_time;
+ debug("MARK submit_job: starttime is now %ld set from
the
job_details' begin_time", starttime);
+ }
+
+ debug("MARK submit_job: we're about to set the start time of
job %d
to be %ld", job_ptr->job_id, starttime);
job_ptr->start_time = starttime;
Still digging, but I figured I'd share what I'd found so far ;)
Thanks!
Mark
On 25/02/12 23:02, Mark Nelson wrote:
>
> Hi Danny,
>
> One further update:
>
> So I'm looking through the backfill code now, and I'm wondering: should
> this branch below (from line 724 of _attempt_backfill() in backfill.c)
> ever be taken?
>
> } else if (rc != SLURM_SUCCESS) {
> /* Planned to start job, but something bad
> * happended. */
> job_ptr->start_time = 0;
> break;
>
> where we're testing the result of:
> int rc = _start_job(job_ptr, resv_bitmap);
> (just for a bit of context)
>
> From the comment I'm guessing that if we're taking this branch
> something is going wrong and so we don't run any further backfill tests
> for the remaining pending jobs.
>
> With this patch (the last hunk is the important one):
>
> ---
> src/plugins/sched/backfill/backfill.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> Index: slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
> ===================================================================
> --- slurm-2.3.3a6dca4.orig/src/plugins/sched/backfill/backfill.c
> +++ slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
> @@ -500,6 +500,9 @@ static int _attempt_backfill(void)
> filter_root = true;
>
> job_queue = build_job_queue(true);
> +
> + debug("MARK backfill: have to look at %d jobs", list_count(job_queue));
> +
> if (list_count(job_queue)<= 1) {
> debug("backfill: no jobs to backfill");
> list_destroy(job_queue);
> @@ -692,8 +695,13 @@ static int _attempt_backfill(void)
> if (j != SLURM_SUCCESS) {
> job_ptr->time_limit = orig_time_limit;
> job_ptr->start_time = 0;
> +
> + debug("MARK backfill: %d isn't runnable right now, move
> on to the
> next one", job_ptr->job_id);
> +
> continue; /* not runable */
> }
> + else
> + debug("MARK backfill: %d IS runnable right now, let's
> run it",
> job_ptr->job_id);
>
> if (start_res> job_ptr->start_time) {
> job_ptr->start_time = start_res;
> @@ -722,6 +730,7 @@ static int _attempt_backfill(void)
> job_ptr->start_time = 0;
> continue;
> } else if (rc != SLURM_SUCCESS) {
> + debug("MARK backfill: We're about to break out
> of the backfill loop");
> /* Planned to start job, but something bad
> * happended. */
> job_ptr->start_time = 0;
>
> I get the following in the log (important line highlighted with *******):
>
> [2012-02-25T22:37:31] =========================================
> [2012-02-25T22:37:31] Begin:2012-02-25T22:37:31 End:2012-02-26T12:58:00
> Nodes:bgp[000x011]
> [2012-02-25T22:37:31] Begin:2012-02-26T12:58:00 End:2012-02-26T20:10:00
> Nodes:bgp[010x011]
> [2012-02-25T22:37:31] Begin:2012-02-26T20:10:00 End:2012-02-26T22:37:31
> Nodes:
> [2012-02-25T22:37:31] =========================================
> [2012-02-25T22:37:31] backfill test for job 33258
> [2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc
> 28 (account vr0126) which has usage of 187
> 14286 and is currently reserving 3121949, with a limit of 113195280
> [2012-02-25T22:37:31] debug2: backfill: entering _try_sched for job 33258.
> [2012-02-25T22:37:31] debug: bluegene:submit_job: 33258 mode=514
> Connection=Torus Reboot=no Rotate=yes Geometry=0x
> 0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-25T22:37:31] debug3: ComputNodeImage=default
> [2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
> [2012-02-25T22:37:31] debug3: MloaderImage=default
> [2012-02-25T22:37:31] debug2: 1 can't allocate
> [2012-02-25T22:37:31] debug2: name = 000
> [2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
> [2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1
> start is 000
> [2012-02-25T22:37:31] debug: _find_best_block_match (null)<bgp000>
> [2012-02-25T22:37:31] debug: 514 can start unassigned job 33258 at
> 1330169851 on bgp000
> [2012-02-25T22:37:31] debug2: backfill: finished _try_sched for job 33258.
> [2012-02-25T22:37:31] debug: MARK backfill: 33258 IS runnable right
> now, let's run it
> [2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc
> 28 (account vr0126) which has usage of 187
> 14286 and is currently reserving 3121949, with a limit of 113195280
> [2012-02-25T22:37:31] debug2: found 4 usable nodes from config
> containing bgp[000x011]
> [2012-02-25T22:37:31] debug3: _pick_best_nodes: job 33258 idle_nodes 0
> share_nodes 4
> [2012-02-25T22:37:31] debug: bluegene:submit_job: 33258 mode=512
> Connection=Torus Reboot=no Rotate=yes Geometry=0x
> 0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-25T22:37:31] debug3: ComputNodeImage=default
> [2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
> [2012-02-25T22:37:31] debug3: MloaderImage=default
> [2012-02-25T22:37:31] debug: bluegene:submit_job: 33258 mode=513
> Connection=Torus Reboot=no Rotate=yes Geometry=0x
> 0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-25T22:37:31] debug3: ComputNodeImage=default
> [2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
> [2012-02-25T22:37:31] debug3: MloaderImage=default
> [2012-02-25T22:37:31] debug2: 1 can't allocate
> [2012-02-25T22:37:31] debug2: name = 000
> [2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
> [2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1
> start is 000
> [2012-02-25T22:37:31] debug: _find_best_block_match (null)<bgp000>
> [2012-02-25T22:37:31] debug: 513 can start unassigned job 33258 at
> 1330169851 on bgp000
> [2012-02-25T22:37:31] backfill: Failed to start JobId=33258 on
> bgp[000x011]: Requested nodes are busy
> *******[2012-02-25T22:37:31] debug: MARK backfill: We're about to break
> out of the backfill loop
> [2012-02-25T22:37:31] backfill: completed testing 3 jobs, usec=368180
>
> With the queue looking like this:
> tambo:~ # squeue -o "%.7i %.9P %.8j %.2t %.10M %.6D %R"
> JOBID PARTITION NAME ST TIME NODES BP_LIST(REASON)
> 33295 main bgtest.4 PD 0:00 1024 (Priority)
> 33288 main bgtest.4 PD 0:00 1024 (Resources)
> 33258 main logistic PD 0:00 512 (Priority)
> 33259 main logistic PD 0:00 512 (Priority)
> 33261 main logistic PD 0:00 512 (Priority)
> 33294 main bgtest.2 PD 0:00 512 (Priority)
> 33264 main sbatch_p PD 0:00 128 (Priority)
> 33265 main sbatch_p PD 0:00 128 (Priority)
> 33266 main sbatch_p PD 0:00 128 (Priority)
> 33293 main bgtest.1 PD 0:00 256 (Priority)
> 33272 main sbatch_p PD 0:00 128 (Priority)
> 33292 main bgtest.5 PD 0:00 128 (Priority)
> 33274 main sbatch_p PD 0:00 64 (Priority)
> 33280 main sbatch_p PD 0:00 64 (Priority)
> 33281 main bgtest.6 PD 0:00 64 (Priority)
> 33289 main bgtest.6 PD 0:00 64 (Priority)
> 33290 main bgtest.1 PD 0:00 64 (Priority)
> 33291 main bgtest.2 PD 0:00 64 (Priority)
> 33297 main sbatch_s PD 0:00 64 (Priority)
> 33303 main tambo PD 0:00 128 (Priority)
> 33307 main PV1W_Pen PD 0:00 128 (Priority)
> 33310 main sbatch_s PD 0:00 64 (Priority)
> 33314 main sbatch_s PD 0:00 64 (Priority)
> 33225 main sbatch_s R 9:40:49 512 bgp001
> 33254 main 20pi_Re1 R 1:29:05 512 bgp010
> 33224 main sbatch_s R 13:37:11 512 bgp011
> 33263 main sbatch_p R 12:03:33 128 bgp000[6-7]
> 33309 main sbatch_s R 3:57:37 64 bgp000[3]
> 33313 main sbatch_s R 3:34:10 64 bgp000[2]
> 33312 main sbatch_s R 3:44:21 64 bgp000[4]
> 33311 main sbatch_s R 3:47:25 64 bgp000[1]
>
>
> So there's clearly more than three pending jobs that need to be backfill
> tested, but after the third job that we test we break out of the while
> loop in _attempt_backfill() because of that branch being taken. I'm
> still digging further to see why this happens but I figured I'd share
> what I'd found so far.
>
> Any thoughts?
>
> Many thanks!
> Mark
>
> On 25/02/12 10:36, Mark Nelson wrote:
>>
>> Many thanks for your quick response on this Danny!
>>
>
> <snip>
>
>>> This line...
>>>
>>> [2012-02-24T16:08:30] debug: 513 can start unassigned job 33190 at
>>> 1330060110 on bgp000
>>>
>>> states the job is unassigned to a block.
>>
>> When I convert 1330060110 to a human readable date format I get: Fri 24
>> Feb 2012 16:08:30 EST GMT+11 (which is "right now" for when this was
>> actually happening). I assumed this meant that SLURM thought that it
>> could start the job 33190 "right now" - at 1330060110... What does that
>> timestamp (1330060110) mean on that line?
>>
>>>
>>> The job is not assigned to this future made block, or any other block,
>>> that is why it doesn't run. As you point out there are a bunch of jobs
>>> in the way of it, that is why it doesn't start or why the block doesn't
>>> get a bg_block_id.
>>>
>>> If you read the code just a bit further, in the bg_job_place.c
>>> submit_job() function you will see this future block is not added to the
>>> main block list because of the null bg_block_id.
>>
>> Oh yes, the block of code on 1667. I'm not sure why, but I had assumed
>> that bg_block_id was pointing to a string with contents "(null)" (which
>> sounds incredibly stupid now ;) ) so I thought that branch wasn't taken.
>> Ooops! Clearly need more coffee, or something ;)
>>
>> There is definite funkiness going on with our Blue Gene: looking at the
>> machine this morning an entire midplane had drained and was filled with
>> small blocks, none of which were used, but a midplane sized job wasn't
>> running there even though it was pending and had the highest priority.
>> Removing the blocks on the midplane in sview allowed the midplane-sized
>> job to start running on that now block-less midplane.
>>
>> Clearly I need to spend more time with this code to work out what's
>> going on.
>>
>>> Hope this helps.
>>
>> It does - it all helps put the pieces together ;)
>>
>> Thanks!
>> Mark