[slurm-dev] Re: create_dynamic_block() returns bg_record with bg_block_id of "(null)". Is this normal?

Mark Nelson Sat, 25 Feb 2012 03:48:11 -0800

Hi Danny,

One further update:


So I'm looking through the backfill code now, and I'm wondering: should 
this branch below (from line 724 of _attempt_backfill() in backfill.c) 
ever be taken?

                        } else if (rc != SLURM_SUCCESS) {
                                /* Planned to start job, but something bad
                                 * happended. */
                                job_ptr->start_time = 0;        
                                break;

where we're testing the result of:
int rc = _start_job(job_ptr, resv_bitmap);
(just for a bit of context)

 From the comment I'm guessing that if we're taking this branch 
something is going wrong and so we don't run any further backfill tests 
for the remaining pending jobs.

With this patch (the last hunk is the important one):

---
  src/plugins/sched/backfill/backfill.c |    9 +++++++++
  1 file changed, 9 insertions(+)

Index: slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
===================================================================
--- slurm-2.3.3a6dca4.orig/src/plugins/sched/backfill/backfill.c
+++ slurm-2.3.3a6dca4/src/plugins/sched/backfill/backfill.c
@@ -500,6 +500,9 @@ static int _attempt_backfill(void)
                filter_root = true;

        job_queue = build_job_queue(true);
+
+       debug("MARK backfill: have to look at %d jobs", list_count(job_queue));
+
        if (list_count(job_queue) <= 1) {
                debug("backfill: no jobs to backfill");
                list_destroy(job_queue);
@@ -692,8 +695,13 @@ static int _attempt_backfill(void)
                if (j != SLURM_SUCCESS) {
                        job_ptr->time_limit = orig_time_limit;
                        job_ptr->start_time = 0;        
+
+                       debug("MARK backfill: %d isn't runnable right now, move 
on to the 
next one", job_ptr->job_id);
+
                        continue;       /* not runable */
                }
+               else
+                       debug("MARK backfill: %d IS runnable right now, let's 
run it", 
job_ptr->job_id);

                if (start_res > job_ptr->start_time) {
                        job_ptr->start_time = start_res;
@@ -722,6 +730,7 @@ static int _attempt_backfill(void)
                                job_ptr->start_time = 0;        
                                continue;
                        } else if (rc != SLURM_SUCCESS) {
+                               debug("MARK backfill: We're about to break out 
of the backfill loop");
                                /* Planned to start job, but something bad
                                 * happended. */
                                job_ptr->start_time = 0;        

I get the following in the log (important line highlighted with *******):

[2012-02-25T22:37:31] =========================================
[2012-02-25T22:37:31] Begin:2012-02-25T22:37:31 End:2012-02-26T12:58:00 
Nodes:bgp[000x011]
[2012-02-25T22:37:31] Begin:2012-02-26T12:58:00 End:2012-02-26T20:10:00 
Nodes:bgp[010x011]
[2012-02-25T22:37:31] Begin:2012-02-26T20:10:00 End:2012-02-26T22:37:31 
Nodes:
[2012-02-25T22:37:31] =========================================
[2012-02-25T22:37:31] backfill test for job 33258
[2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc 
28 (account vr0126) which has usage of 187
14286 and is currently reserving 3121949, with a limit of 113195280
[2012-02-25T22:37:31] debug2: backfill: entering _try_sched for job 33258.
[2012-02-25T22:37:31] debug:  bluegene:submit_job: 33258 mode=514 
Connection=Torus Reboot=no Rotate=yes Geometry=0x
0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-25T22:37:31] debug3: ComputNodeImage=default
[2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
[2012-02-25T22:37:31] debug3: MloaderImage=default
[2012-02-25T22:37:31] debug2: 1 can't allocate
[2012-02-25T22:37:31] debug2: name = 000
[2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
[2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1 
start is 000
[2012-02-25T22:37:31] debug:  _find_best_block_match (null) <bgp000>
[2012-02-25T22:37:31] debug:  514 can start unassigned job 33258 at 
1330169851 on bgp000
[2012-02-25T22:37:31] debug2: backfill: finished _try_sched for job 33258.
[2012-02-25T22:37:31] debug:  MARK backfill: 33258 IS runnable right 
now, let's run it
[2012-02-25T22:37:31] acct_job_limits: job 33258 wants 256000 from assoc 
28 (account vr0126) which has usage of 187
14286 and is currently reserving 3121949, with a limit of 113195280
[2012-02-25T22:37:31] debug2: found 4 usable nodes from config 
containing bgp[000x011]
[2012-02-25T22:37:31] debug3: _pick_best_nodes: job 33258 idle_nodes 0 
share_nodes 4
[2012-02-25T22:37:31] debug:  bluegene:submit_job: 33258 mode=512 
Connection=Torus Reboot=no Rotate=yes Geometry=0x
0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-25T22:37:31] debug3: ComputNodeImage=default
[2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
[2012-02-25T22:37:31] debug3: MloaderImage=default
[2012-02-25T22:37:31] debug:  bluegene:submit_job: 33258 mode=513 
Connection=Torus Reboot=no Rotate=yes Geometry=0x
0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-25T22:37:31] debug3: ComputNodeImage=default
[2012-02-25T22:37:31] debug3: RamDiskIoLoadImage=default
[2012-02-25T22:37:31] debug3: MloaderImage=default
[2012-02-25T22:37:31] debug2: 1 can't allocate
[2012-02-25T22:37:31] debug2: name = 000
[2012-02-25T22:37:31] debug3: process_nodes: 000 is included in this block
[2012-02-25T22:37:31] debug3: process_nodes: geo = 111 mp count is 1 
start is 000
[2012-02-25T22:37:31] debug:  _find_best_block_match (null) <bgp000>
[2012-02-25T22:37:31] debug:  513 can start unassigned job 33258 at 
1330169851 on bgp000
[2012-02-25T22:37:31] backfill: Failed to start JobId=33258 on 
bgp[000x011]: Requested nodes are busy
*******[2012-02-25T22:37:31] debug:  MARK backfill: We're about to break 
out of the backfill loop
[2012-02-25T22:37:31] backfill: completed testing 3 jobs, usec=368180

With the queue looking like this:
tambo:~ # squeue -o "%.7i %.9P %.8j %.2t %.10M %.6D %R"
   JOBID PARTITION     NAME ST       TIME  NODES BP_LIST(REASON)
   33295      main bgtest.4 PD       0:00   1024 (Priority)
   33288      main bgtest.4 PD       0:00   1024 (Resources)
   33258      main logistic PD       0:00    512 (Priority)
   33259      main logistic PD       0:00    512 (Priority)
   33261      main logistic PD       0:00    512 (Priority)
   33294      main bgtest.2 PD       0:00    512 (Priority)
   33264      main sbatch_p PD       0:00    128 (Priority)
   33265      main sbatch_p PD       0:00    128 (Priority)
   33266      main sbatch_p PD       0:00    128 (Priority)
   33293      main bgtest.1 PD       0:00    256 (Priority)
   33272      main sbatch_p PD       0:00    128 (Priority)
   33292      main bgtest.5 PD       0:00    128 (Priority)
   33274      main sbatch_p PD       0:00     64 (Priority)
   33280      main sbatch_p PD       0:00     64 (Priority)
   33281      main bgtest.6 PD       0:00     64 (Priority)
   33289      main bgtest.6 PD       0:00     64 (Priority)
   33290      main bgtest.1 PD       0:00     64 (Priority)
   33291      main bgtest.2 PD       0:00     64 (Priority)
   33297      main sbatch_s PD       0:00     64 (Priority)
   33303      main    tambo PD       0:00    128 (Priority)
   33307      main PV1W_Pen PD       0:00    128 (Priority)
   33310      main sbatch_s PD       0:00     64 (Priority)
   33314      main sbatch_s PD       0:00     64 (Priority)
   33225      main sbatch_s  R    9:40:49    512 bgp001
   33254      main 20pi_Re1  R    1:29:05    512 bgp010
   33224      main sbatch_s  R   13:37:11    512 bgp011
   33263      main sbatch_p  R   12:03:33    128 bgp000[6-7]
   33309      main sbatch_s  R    3:57:37     64 bgp000[3]
   33313      main sbatch_s  R    3:34:10     64 bgp000[2]
   33312      main sbatch_s  R    3:44:21     64 bgp000[4]
   33311      main sbatch_s  R    3:47:25     64 bgp000[1]


So there's clearly more than three pending jobs that need to be backfill 
tested, but after the third job that we test we break out of the while 
loop in _attempt_backfill() because of that branch being taken. I'm 
still digging further to see why this happens but I figured I'd share 
what I'd found so far.

Any thoughts?

Many thanks!
Mark

On 25/02/12 10:36, Mark Nelson wrote:
>
> Many thanks for your quick response on this Danny!
>

<snip>

>> This line...
>>
>> [2012-02-24T16:08:30] debug:  513 can start unassigned job 33190 at
>> 1330060110 on bgp000
>>
>> states the job is unassigned to a block.
>
> When I convert 1330060110 to a human readable date format I get: Fri 24
> Feb 2012 16:08:30 EST GMT+11 (which is "right now" for when this was
> actually happening). I assumed this meant that SLURM thought that it
> could start the job 33190 "right now" - at 1330060110... What does that
> timestamp (1330060110) mean on that line?
>
>>
>> The job is not assigned to this future made block, or any other block,
>> that is why it doesn't run.  As you point out there are a bunch of jobs
>> in the way of it, that is why it doesn't start or why the block doesn't
>> get a bg_block_id.
>>
>> If you read the code just a bit further, in the bg_job_place.c
>> submit_job() function you will see this future block is not added to the
>> main block list because of the null bg_block_id.
>
> Oh yes, the block of code on 1667. I'm not sure why, but I had assumed
> that bg_block_id was pointing to a string with contents "(null)" (which
> sounds incredibly stupid now ;) ) so I thought that branch wasn't taken.
> Ooops! Clearly need more coffee, or something ;)
>
> There is definite funkiness going on with our Blue Gene: looking at the
> machine this morning an entire midplane had drained and was filled with
> small blocks, none of which were used, but a midplane sized job wasn't
> running there even though it was pending and had the highest priority.
> Removing the blocks on the midplane in sview allowed the midplane-sized
> job to start running on that now block-less midplane.
>
> Clearly I need to spend more time with this code to work out what's
> going on.
>
>> Hope this helps.
>
> It does - it all helps put the pieces together ;)
>
> Thanks!
> Mark

[slurm-dev] Re: create_dynamic_block() returns bg_record with bg_block_id of "(null)". Is this normal?

Reply via email to