On 02/23/12 21:27, Mark Nelson wrote:
> Hi Danny,
>
> We're still seeing a few issues with our Blue Gene/P on SLURM 2.3.3 
> and now that I've dug through a bit of the bluegene select plugin code 
> I've got a few questions that I hope you can answer. Actually, just one:
>
> _dynamically_request() (in bg_job_place.c) calls 
> create_dynamic_block() on line 794. Is it normal for the list of 
> bg_records returned by create_dynamic_block() to have a bg_record with 
> a bg_block_id of "(null)", ionode_str of "(null)" and no job running 
> (job_running of -1)?
>
Yes (very common as you have noticed).  This points to a block that can 
be made in the future.  It isn't a real block.
> I ask because that's what we're seeing, and it's this special null 
> block that then gets added to the block_list on line 815, which then 
> gets elected by _find_matching_block() as the ideal place to run a 
> pending job. It then gets cascaded up through find_best_block_match() 
> to submit_job() which then decides that it can start this pending job 
> on the null block. But in reality the block is bogus and when the job 
> tries to get backfilled we get this:
> debug3: backfill: Failed to start JobId=33190: Requested nodes are busy
> (I think that's what's happening anyway)
The code is doing exactly what it is supposed to do.  Don't get caught up 
in debug3 messages; they are debug3 for a reason ;).

This line...

[2012-02-24T16:08:30] debug:  513 can start unassigned job 33190 at 
1330060110 on bgp000

states the job is unassigned to a block.
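
For reading that line: the number after "at" is a start time in seconds 
since the Unix epoch. A minimal sketch for decoding it, using only 
standard C; the helper name sketch_format_utc is made up for this 
example and is not part of SLURM:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper (not SLURM code): formats an epoch timestamp as
 * UTC in ISO 8601 form. Returns the number of characters written, or 0
 * on failure. */
static size_t sketch_format_utc(time_t when, char *buf, size_t len)
{
    assert(buf != NULL);

    const struct tm *tm = gmtime(&when);
    if (tm == NULL)
        return 0;

    return strftime(buf, len, "%Y-%m-%dT%H:%M:%S", tm);
}
```

Passing 1330060110 yields 2012-02-24T05:08:30 UTC. If the slurmctld log 
is written in a UTC+11 timezone (an assumption; the thread does not say), 
that is the same instant as the log line's own 16:08:30 timestamp.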

The job is not assigned to this future block, or to any other block, 
which is why it doesn't run.  As you point out, there are a bunch of jobs 
in the way of it; that is why it doesn't start and why the block doesn't 
get a bg_block_id.

If you read the code just a bit further, in the bg_job_place.c 
submit_job() function, you will see this future block is not added to the 
main block list because of the null bg_block_id.
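
As a rough illustration of that idea only (this is not the actual 
submit_job() code), the check amounts to treating a NULL bg_block_id as 
"placeholder, do not keep". The struct and function names below are 
hypothetical, trimmed-down stand-ins, and only the fields mentioned in 
this thread are modeled:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the plugin's bg_record (not the real
 * definition). */
typedef struct {
    const char *bg_block_id; /* NULL for a block that does not exist yet */
    const char *ionode_str;  /* NULL for the placeholder in the log above */
    int job_running;         /* -1 in the log output: no job running */
} sketch_bg_record_t;

/* Returns 1 if the record describes a block that exists, 0 if it is
 * only a placeholder for a block that could be made in the future. */
static int sketch_is_real_block(const sketch_bg_record_t *rec)
{
    assert(rec != NULL);
    return rec->bg_block_id != NULL;
}

/* Counts the records that would qualify for the main block list under
 * this simplified rule. */
static size_t sketch_count_real_blocks(const sketch_bg_record_t *recs,
                                       size_t n)
{
    size_t real = 0;

    for (size_t i = 0; i < n; i++) {
        if (sketch_is_real_block(&recs[i]))
            real++;
    }
    return real;
}
```

Under that rule, a list holding 18 named blocks plus one NULL-id record 
counts 18 real blocks, which would be consistent with your log showing 
19 entries in block_list while sinfo -b reports 18.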

Hope this helps.

Danny

>
> The following patch:
>
> ---
>  src/plugins/select/bluegene/bg_job_place.c |    1 +
>  1 file changed, 1 insertion(+)
>
> Index: slurm-2.3.3/src/plugins/select/bluegene/bg_job_place.c
> ===================================================================
> --- slurm-2.3.3.orig/src/plugins/select/bluegene/bg_job_place.c
> +++ slurm-2.3.3/src/plugins/select/bluegene/bg_job_place.c
> @@ -821,6 +821,7 @@ static int _dynamically_request(List blo
>                         preempted blocks have time
>                         to clear out.
>                      */
> +                    info("MARK: _dynamically_request: we're about to 
> add bg_record to block_list with bg_block_id of %s, with ionodes %s, 
> job_running of %d, reason of %s", bg_record->bg_block_id, 
> bg_record->ionode_str, bg_record->job_running, bg_record->reason);
>                      list_append(block_list, bg_record);
>                      (*blocks_added) = 1;
>                  } else {
>
> Gives the following in the slurmctld.log file (along with some other 
> patches that I've got to print out other things like the number of 
> bg_records in the block_list and what bg_records we're testing):
> [2012-02-24T16:08:30] MARK: _dynamically_request: we're about to add 
> bg_record to block_list with bg_block_id of (null), with ionodes 
> (null), job_running of -1, reason of (null)
> [2012-02-24T16:08:30] _find_best_block_match: before calling 
> _find_matching_block, block_list has 19 blocks
> [2012-02-24T16:08:30] number of blocks to check: 19 state 513 asking 
> for 2048-2048 cpus
> [2012-02-24T16:08:30] MARK: we've checking bg_block_id of 
> RMP24Fe064657709, with ionodes 0, job_running of -1, reason of (null)
> [2012-02-24T16:08:30] block RMP24Fe064657709 CPU count (256) not suitable
> [2012-02-24T16:08:30] MARK: we've checking bg_block_id of 
> RMP24Fe130528293, with ionodes 5, job_running of -1, reason of (null)
> [2012-02-24T16:08:30] block RMP24Fe130528293 CPU count (256) not suitable
> [2012-02-24T16:08:30] MARK: we've checking bg_block_id of 
> RMP18Fe120532215, with ionodes 0-1, job_running of -1, reason of (null)
> [2012-02-24T16:08:30] block RMP18Fe120532215 CPU count (512) not suitable
> [2012-02-24T16:08:30] MARK: we've checking bg_block_id of (null), with 
> ionodes (null), job_running of -1, reason of (null)
> [2012-02-24T16:08:30] we found one! (null)
> [2012-02-24T16:08:30] debug:  _find_best_block_match (null) <bgp000>
> [2012-02-24T16:08:30] debug:  513 can start unassigned job 33190 at 
> 1330060110 on bgp000
> [2012-02-24T16:08:30] debug3: backfill: Failed to start JobId=33190: 
> Requested nodes are busy
> [2012-02-24T16:08:30] backfill: completed testing 1 jobs, usec=69940
>
> At the point in time that this was taken, system has 18 blocks:
> ~> sinfo -b -h | wc -l
> 18
> ~> sinfo -b
> BG_BLOCK         MIDPLANES       OWNER    STATE    CONNECTION USE
> RMP24Fe064657709 bgp000          slurm    Ready    Small      COPROCESSOR
> RMP24Fe064657829 bgp000          mbuskes  Ready    Small      COPROCESSOR
> RMP22Fe231504417 bgp000          mbuskes  Ready    Small      COPROCESSOR
> RMP22Fe231504545 bgp000          bwgoudey Ready    Small      COPROCESSOR
> RMP24Fe130528174 bgp000          bwgoudey Ready    Small      COPROCESSOR
> RMP24Fe130528293 bgp000          slurm    Ready    Small      COPROCESSOR
> RMP23Fe025220481 bgp000          christen Ready    Small      COPROCESSOR
> RMP18Fe120532215 bgp001          slurm    Ready    Small      COPROCESSOR
> RMP19Fe174227862 bgp001          bwgoudey Ready    Small      COPROCESSOR
> RMP22Fe105038431 bgp001          bwgoudey Ready    Small      COPROCESSOR
> RMP22Fe105038550 bgp001          mbuskes  Ready    Small      COPROCESSOR
> RMP24Fe125557160 bgp001          bwgoudey Ready    Small      COPROCESSOR
> RMP24Fe125557297 bgp001          bwgoudey Ready    Small      COPROCESSOR
> RMP24Fe100736197 bgp011          bwgoudey Ready    Small      COPROCESSOR
> RMP24Fe064622640 bgp011          christen Ready    Small      COPROCESSOR
> RMP24Fe064622766 bgp011          mike     Ready    Small      COPROCESSOR
> RMP24Fe110947199 bgp011          christen Ready    Small      COPROCESSOR
> RMP08Fe140222251 bgp010          aooi     Ready    Torus      COPROCESSOR
>
> Any help would be greatly appreciated.
>
> Thanks!
> Mark
