[slurm-dev] Re: create_dynamic_block() returns bg_record with bg_block_id of "(null)". Is this normal?

Mark Nelson Tue, 28 Feb 2012 17:28:06 -0800

It looks as though the problem is caused by something not setting the 
job_ptr of the bg_record that gets returned to submit_job() of the 
usable-in-the-future NULL block (bg_block_id is NULL). And that means 
that in submit_job(), nothing is overriding starttime (which is set to 
time(NULL) at the beginning of the fuction) - neither the outer if, or 
the else if of this block is taken:


                if (bg_record->job_ptr
                    && bg_record->job_ptr->end_time) {
                        if (bg_record->job_ptr->end_time <= starttime)
                                starttime += 5;
                        else
                                starttime = bg_record->job_ptr->end_time;
                } else if (bg_record->job_running == BLOCK_ERROR_STATE)
                        starttime = INFINITE;

which means that this line (1583):
job_ptr->start_time = starttime;

causes submit_job() to try to run the job "now" which is why the 
backfill is failing, which I identified in the last e-mail:
[2012-02-29T07:44:45] backfill: beginning
[2012-02-29T07:44:45] debug:  MARK backfill: have to look at 14 jobs
[2012-02-29T07:44:45] =========================================
[2012-02-29T07:44:45] Begin:2012-02-29T07:44:45 End:2012-03-01T07:44:45 
Nodes:bgp[000x011]
[2012-02-29T07:44:45] =========================================
[2012-02-29T07:44:45] backfill test for job 33394
[2012-02-29T07:44:45] acct_job_limits: job 33394 wants 2826240 from 
assoc 629 (account vr0210) which has usage of 22841856 and is currently 
reserving 0, with a limit of 94211280
[2012-02-29T07:44:45] debug2: backfill: entering _try_sched for job 33394.
[2012-02-29T07:44:45] debug:  MARK backfill: 4 is bit_set_count of 
avail_bitmap
[2012-02-29T07:44:45] debug:  MARK submit_job: job 33394 has start_time 
of 0 upon entry to submit_job
[2012-02-29T07:44:45] debug:  bluegene:submit_job: 33394 mode=514 
Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned 
nodes=1-1-1
[2012-02-29T07:44:45] debug3: ComputNodeImage=default
[2012-02-29T07:44:45] debug3: RamDiskIoLoadImage=default
[2012-02-29T07:44:45] debug3: MloaderImage=default
[2012-02-29T07:44:45] debug2: 1 can't allocate
[2012-02-29T07:44:45] debug2: name = 001
[2012-02-29T07:44:45] debug3: process_nodes: 001 is included in this block
[2012-02-29T07:44:45] debug3: process_nodes: geo = 111 mp count is 1 
start is 001
[2012-02-29T07:44:45] debug:  _find_best_block_match (null) <bgp001>
[2012-02-29T07:44:45] debug:  MARK submit_job: we've got a block now: 
job 33394 has start_time of 0
[2012-02-29T07:44:45] debug:  MARK submit_job: job_ptr of bg_record is NULL
[2012-02-29T07:44:45] debug:  MARK submit_job: we're about to set the 
start time of job 33394 to be 1330461885
[2012-02-29T07:44:45] debug:  514 can start unassigned job 33394 at 
1330461885 on bgp001
[2012-02-29T07:44:45] debug:  MARK _try_sched: select_g_job_test 
succeeded for not sharing nodes
[2012-02-29T07:44:45] debug2: backfill: finished _try_sched for job 33394.
[2012-02-29T07:44:45] debug:  MARK backfill: 33394 IS runnable right 
now, let's run it

(the timestamp 1330461885 is Wed 29 Feb 2012 07:44:45)

The above logging was produced with this patch:
---

Index: slurm-2.3.3a6dca4/src/plugins/select/bluegene/bg_job_place.c
===================================================================
--- slurm-2.3.3a6dca4.orig/src/plugins/select/bluegene/bg_job_place.c
+++ slurm-2.3.3a6dca4/src/plugins/select/bluegene/bg_job_place.c
@@ -1418,6 +1418,8 @@ extern int submit_job(struct job_record
        int avail_cpus = num_unused_cpus;
        int dim = 0;

+       debug("MARK submit_job: job %d has start_time of %ld upon entry to 
submit_job", job_ptr->job_id, job_ptr->start_time);
+
        for (dim=0; dim<SYSTEM_DIMENSIONS; dim++)
                conn_type[dim] = (uint16_t)NO_VAL;
        if (preemptee_candidates && preemptee_job_list
@@ -1558,6 +1560,8 @@ extern int submit_job(struct job_record
        }

        if (rc == SLURM_SUCCESS) {
+               debug("MARK submit_job: we've got a block now: job %d has 
start_time 
of %ld", job_ptr->job_id, job_ptr->start_time);
+
                if (!bg_record)
                        fatal("we got a success, but no block back");
                /* Here we see if there is a job running since
@@ -1567,18 +1571,32 @@ extern int submit_job(struct job_record
                 * past or current time) we add 5 seconds to
                 * it so we don't use the block immediately.
                 */
+
+               if (bg_record->job_ptr == NULL)
+                       debug("MARK submit_job: job_ptr of bg_record is NULL");
+               else
+                       debug("MARK submit_job: job_ptr of bg_record is for 
jobid of %d", 
bg_record->job_ptr->job_id);
+
                if (bg_record->job_ptr
                    && bg_record->job_ptr->end_time) {
-                       if (bg_record->job_ptr->end_time <= starttime)
+                       if (bg_record->job_ptr->end_time <= starttime) {
                                starttime += 5;
-                       else
+                               debug("MARK submit_job: starttime is now %ld 
because: 
bg_record->job_ptr->end_time <= starttime", starttime);
+                       }
+                       else {
                                starttime = bg_record->job_ptr->end_time;
+                               debug("MARK submit_job: starttime is now %ld 
from the else branch", 
starttime);
+                       }
                } else if (bg_record->job_running == BLOCK_ERROR_STATE)
                        starttime = INFINITE;

                /* make sure the job is eligible to run */
-               if (job_ptr->details->begin_time > starttime)
+               if (job_ptr->details->begin_time > starttime) {
                        starttime = job_ptr->details->begin_time;
+                       debug("MARK submit_job: starttime is now %ld set from 
the 
job_details' begin_time", starttime);
+               }
+
+               debug("MARK submit_job: we're about to set the start time of 
job %d 
to be %ld", job_ptr->job_id, starttime);

                job_ptr->start_time = starttime;


Still digging, but I figured I'd share what I'd found so far ;)

Thanks!
Mark

On 25/02/12 23:02, Mark Nelson wrote:
>
> Hi Danny,
>
> One further update:
>
> So I'm looking through the backfill code now, and I'm wondering: should
> this branch below (from line 724 of _attempt_backfill() in backfill.c)
> ever be taken?
>
>                       } else if (rc != SLURM_SUCCESS) {
>                               /* Planned to start job, but something bad
>                                * happended. */
>                               job_ptr->start_time = 0;        
>                               break;
>
> where we're testing the result of:
> int rc = _start_job(job_ptr, resv_bitmap);
> (just for a bit of context)
>
>   From the comment I'm guessing that if we're taking this branch
> something is going wrong and so we don't run any further backfill tests
> for the remaining pending jobs.
>
> With this patch (the last hunk is the important one):
>
> ---
>    src/plugins/sched/backfill/backfill.c |    9 +++++++++
>    1 file changed, 9 insertions(+)
>
> Index: slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
> ===================================================================
> --- slurm-2.3.3a6dca4.orig/src/plugins/sched/backfill/backfill.c
> +++ slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
> @@ -500,6 +500,9 @@ static int _attempt_backfill(void)
>               filter_root = true;
>
>       job_queue = build_job_queue(true);
> +
> +     debug("MARK backfill: have to look at %d jobs", list_count(job_queue));
> +
>       if (list_count(job_queue)<= 1) {
>               debug("backfill: no jobs to backfill");
>               list_destroy(job_queue);
> @@ -692,8 +695,13 @@ static int _attempt_backfill(void)
>               if (j != SLURM_SUCCESS) {
>                       job_ptr->time_limit = orig_time_limit;
>                       job_ptr->start_time = 0;        
> +
> +                     debug("MARK backfill: %d isn't runnable right now, move 
> on to the
> next one", job_ptr->job_id);
> +
>                       continue;       /* not runable */
>               }
> +             else
> +                     debug("MARK backfill: %d IS runnable right now, let's 
> run it",
> job_ptr->job_id);
>
>               if (start_res>  job_ptr->start_time) {
>                       job_ptr->start_time = start_res;
> @@ -722,6 +730,7 @@ static int _attempt_backfill(void)
>                               job_ptr->start_time = 0;        
>                               continue;
>                       } else if (rc != SLURM_SUCCESS) {
> +                             debug("MARK backfill: We're about to break out 
> of the backfill loop");
>                               /* Planned to start job, but something bad
>                                * happended. */
>                               job_ptr->start_time = 0;        
>
> I get the following in the log (important line highlighted with *******):
>
> [2012-02-25T22:37:31] =========================================
> [2012-02-25T22:37:31] Begin:2012-02-25T22:37:31 End:2012-02-26T12:58:00
> Nodes:bgp[000x011]
> [2012-02-25T22:37:31] Begin:2012-02-26T12:58:00 End:2012-02-26T20:10:00
> Nodes:bgp[010x011]
> [2012-02-25T22:37:31] Begin:2012-02-26T20:10:00 End:2012-02-26T22:37:31
> Nodes:
> [2012-02-25T22:37:31] =========================================
> [2012-02-25T22:37:31] backfill test for job 33258
> [2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc
> 28 (account vr0126) which has usage of 187
> 14286 and is currently reserving 3121949, with a limit of 113195280
> [2012-02-25T22:37:31] debug2: backfill: entering _try_sched for job 33258.
> [2012-02-25T22:37:31] debug:  bluegene:submit_job: 33258 mode=514
> Connection=Torus Reboot=no Rotate=yes Geometry=0x
> 0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-25T22:37:31] debug3: ComputNodeImage=default
> [2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
> [2012-02-25T22:37:31] debug3: MloaderImage=default
> [2012-02-25T22:37:31] debug2: 1 can't allocate
> [2012-02-25T22:37:31] debug2: name = 000
> [2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
> [2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1
> start is 000
> [2012-02-25T22:37:31] debug:  _find_best_block_match (null)<bgp000>
> [2012-02-25T22:37:31] debug:  514 can start unassigned job 33258 at
> 1330169851 on bgp000
> [2012-02-25T22:37:31] debug2: backfill: finished _try_sched for job 33258.
> [2012-02-25T22:37:31] debug:  MARK backfill: 33258 IS runnable right
> now, let's run it
> [2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc
> 28 (account vr0126) which has usage of 187
> 14286 and is currently reserving 3121949, with a limit of 113195280
> [2012-02-25T22:37:31] debug2: found 4 usable nodes from config
> containing bgp[000x011]
> [2012-02-25T22:37:31] debug3: _pick_best_nodes: job 33258 idle_nodes 0
> share_nodes 4
> [2012-02-25T22:37:31] debug:  bluegene:submit_job: 33258 mode=512
> Connection=Torus Reboot=no Rotate=yes Geometry=0x
> 0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-25T22:37:31] debug3: ComputNodeImage=default
> [2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
> [2012-02-25T22:37:31] debug3: MloaderImage=default
> [2012-02-25T22:37:31] debug:  bluegene:submit_job: 33258 mode=513
> Connection=Torus Reboot=no Rotate=yes Geometry=0x
> 0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-25T22:37:31] debug3: ComputNodeImage=default
> [2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
> [2012-02-25T22:37:31] debug3: MloaderImage=default
> [2012-02-25T22:37:31] debug2: 1 can't allocate
> [2012-02-25T22:37:31] debug2: name = 000
> [2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
> [2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1
> start is 000
> [2012-02-25T22:37:31] debug:  _find_best_block_match (null)<bgp000>
> [2012-02-25T22:37:31] debug:  513 can start unassigned job 33258 at
> 1330169851 on bgp000
> [2012-02-25T22:37:31] backfill: Failed to start JobId=33258 on
> bgp[000x011]: Requested nodes are busy
> *******[2012-02-25T22:37:31] debug:  MARK backfill: We're about to break
> out of the backfill loop
> [2012-02-25T22:37:31] backfill: completed testing 3 jobs, usec=368180
>
> With the queue looking like this:
> tambo:~ # squeue -o "%.7i %.9P %.8j %.2t %.10M %.6D %R"
>     JOBID PARTITION     NAME ST       TIME  NODES BP_LIST(REASON)
>     33295      main bgtest.4 PD       0:00   1024 (Priority)
>     33288      main bgtest.4 PD       0:00   1024 (Resources)
>     33258      main logistic PD       0:00    512 (Priority)
>     33259      main logistic PD       0:00    512 (Priority)
>     33261      main logistic PD       0:00    512 (Priority)
>     33294      main bgtest.2 PD       0:00    512 (Priority)
>     33264      main sbatch_p PD       0:00    128 (Priority)
>     33265      main sbatch_p PD       0:00    128 (Priority)
>     33266      main sbatch_p PD       0:00    128 (Priority)
>     33293      main bgtest.1 PD       0:00    256 (Priority)
>     33272      main sbatch_p PD       0:00    128 (Priority)
>     33292      main bgtest.5 PD       0:00    128 (Priority)
>     33274      main sbatch_p PD       0:00     64 (Priority)
>     33280      main sbatch_p PD       0:00     64 (Priority)
>     33281      main bgtest.6 PD       0:00     64 (Priority)
>     33289      main bgtest.6 PD       0:00     64 (Priority)
>     33290      main bgtest.1 PD       0:00     64 (Priority)
>     33291      main bgtest.2 PD       0:00     64 (Priority)
>     33297      main sbatch_s PD       0:00     64 (Priority)
>     33303      main    tambo PD       0:00    128 (Priority)
>     33307      main PV1W_Pen PD       0:00    128 (Priority)
>     33310      main sbatch_s PD       0:00     64 (Priority)
>     33314      main sbatch_s PD       0:00     64 (Priority)
>     33225      main sbatch_s  R    9:40:49    512 bgp001
>     33254      main 20pi_Re1  R    1:29:05    512 bgp010
>     33224      main sbatch_s  R   13:37:11    512 bgp011
>     33263      main sbatch_p  R   12:03:33    128 bgp000[6-7]
>     33309      main sbatch_s  R    3:57:37     64 bgp000[3]
>     33313      main sbatch_s  R    3:34:10     64 bgp000[2]
>     33312      main sbatch_s  R    3:44:21     64 bgp000[4]
>     33311      main sbatch_s  R    3:47:25     64 bgp000[1]
>
>
> So there's clearly more than three pending jobs that need to be backfill
> tested, but after the third job that we test we break out of the while
> loop in _attempt_backfill() because of that branch being taken. I'm
> still digging further to see why this happens but I figured I'd share
> what I'd found so far.
>
> Any thoughts?
>
> Many thanks!
> Mark
>
> On 25/02/12 10:36, Mark Nelson wrote:
>>
>> Many thanks for your quick response on this Danny!
>>
>
> <snip>
>
>>> This line...
>>>
>>> [2012-02-24T16:08:30] debug:  513 can start unassigned job 33190 at
>>> 1330060110 on bgp000
>>>
>>> states the job is unassigned to a block.
>>
>> When I convert 1330060110 to a human readable date format I get: Fri 24
>> Feb 2012 16:08:30 EST GMT+11 (which is "right now" for when this was
>> actually happening). I assumed this meant that SLURM thought that it
>> could start the job 33190 "right now" - at 1330060110... What does that
>> timestamp (1330060110) mean on that line?
>>
>>>
>>> The job is not assigned to this future made block, or any other block,
>>> that is why it doesn't run.  As you point out there are a bunch of jobs
>>> in the way of it, that is why it doesn't start or why the block doesn't
>>> get a bg_block_id.
>>>
>>> If you read the code just a bit further, in the bg_job_place.c
>>> submit_job() function you will see this future block is not added to the
>>> main block list because of the null bg_block_id.
>>
>> Oh yes, the block of code on 1667. I'm not sure why, but I had assumed
>> that bg_block_id was pointing to a string with contents "(null)" (which
>> sounds incredibly stupid now ;) ) so I thought that branch wasn't taken.
>> Ooops! Clearly need more coffee, or something ;)
>>
>> There is definite funkiness going on with our Blue Gene: looking at the
>> machine this morning an entire midplane had drained and was filled with
>> small blocks, none of which were used, but a midplane sized job wasn't
>> running there even though it was pending and had the highest priority.
>> Removing the blocks on the midplane in sview allowed the midplane-sized
>> job to start running on that now block-less midplane.
>>
>> Clearly I need to spend more time with this code to work out what's
>> going on.
>>
>>> Hope this helps.
>>
>> It does - it all helps put the pieces together ;)
>>
>> Thanks!
>> Mark

[slurm-dev] Re: create_dynamic_block() returns bg_record with bg_block_id of "(null)". Is this normal?

Reply via email to