Many thanks for your quick response on this, Danny!

On 25/02/12 02:14, Danny Auble wrote:
>
> On 02/23/12 21:27, Mark Nelson wrote:
>> Hi Danny,
>>
>> We're still seeing a few issues with our Blue Gene/P on SLURM 2.3.3
>> and now that I've dug through a bit of the bluegene select plugin code
>> I've got a few questions that I hope you can answer. Actually, just one:
>>
>> _dynamically_request() (in bg_job_place.c) calls
>> create_dynamic_block() on line 794. Is it normal for the list of
>> bg_records returned by create_dynamic_block() to have a bg_record with
>> a bg_block_id of "(null)", ionode_str of "(null)" and no job running
>> (job_running of -1)?
>>
> Yes (very common as you have noticed). This points to a block that can
> be made in the future. It isn't a real block.
>> I ask because that's what we're seeing, and it's this special null
>> block that then gets added to the block_list on line 815, which then
>> gets elected by _find_matching_block() as the ideal place to run a
>> pending job. It then gets cascaded up through find_best_block_match()
>> to submit_job() which then decides that it can start this pending job
>> on the null block. But in reality the block is bogus and when the job
>> tries to get backfilled we get this:
>> debug3: backfill: Failed to start JobId=33190: Requested nodes are busy
>> (I think that's what's happening anyway)
> The code is doing exactly what it is supposed to do. Don't get caught up
> in debug3 messages, they are debug3 for a reason ;).
>
> This line...
>
> [2012-02-24T16:08:30] debug: 513 can start unassigned job 33190 at
> 1330060110 on bgp000
>
> states the job is unassigned to a block.
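
For reference, the epoch value in that debug line can be decoded with a few lines of Python. This is only a sketch: the helper name `epoch_to_local` is made up for illustration, and the UTC+11 offset is an assumption inferred from the `[2012-02-24T16:08:30]` prefix on the same log line.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster clock runs at UTC+11, inferred from the
# "[2012-02-24T16:08:30]" prefix on the quoted log line.
LOCAL_TZ = timezone(timedelta(hours=11))


def epoch_to_local(epoch_seconds, tz=LOCAL_TZ):
    """Convert seconds since the Unix epoch to an aware datetime in tz."""
    return datetime.fromtimestamp(epoch_seconds, tz=tz)


start = epoch_to_local(1330060110)
print(start.isoformat())  # 2012-02-24T16:08:30+11:00
```

Under that assumption the value decodes to the same instant as the log line's own prefix.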

When I convert 1330060110 to a human-readable date format I get:
Fri 24 Feb 2012 16:08:30 EST GMT+11 (which is "right now" for when
this was actually happening). I assumed this meant that SLURM thought
that it could start the job 33190 "right now" - at 1330060110... What
does that timestamp (1330060110) mean on that line?

>
> The job is not assigned to this future made block, or any other block,
> that is why it doesn't run. As you point out there are a bunch of jobs
> in the way of it, that is why it doesn't start or why the block doesn't
> get a bg_block_id.
>
> If you read the code just a bit further, in the bg_job_place.c
> submit_job() function you will see this future block is not added to the
> main block list because of the null bg_block_id.

Oh yes, the block of code on line 1667. I'm not sure why, but I had
assumed that bg_block_id was pointing to a string with contents "(null)"
(which sounds incredibly stupid now ;) ) so I thought that branch wasn't
taken. Oops! Clearly need more coffee, or something ;)

There is definite funkiness going on with our Blue Gene: looking at the
machine this morning, an entire midplane had drained and was filled with
small blocks, none of which were used, but a midplane-sized job wasn't
running there even though it was pending and had the highest priority.
Removing the blocks on the midplane in sview allowed the midplane-sized
job to start running on that now block-less midplane. Clearly I need to
spend more time with this code to work out what's going on.

> Hope this helps.

It does - it all helps put the pieces together ;)

Thanks!
Mark
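
PS: to make the "(null)" mix-up concrete, here is a small Python model of it. It is not the plugin code. `BgRecord`, `describe` and `real_blocks` are hypothetical names, the block ID "block-A" is made up, and the only grounded parts are the field names from this thread and the fact that glibc's printf renders a NULL `%s` argument as "(null)".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BgRecord:
    # Hypothetical stand-in for bg_record; only fields named in this thread.
    bg_block_id: Optional[str]
    ionode_str: Optional[str]
    job_running: int = -1


def describe(rec: BgRecord) -> str:
    # Mimic a C-style log line: a missing string is shown as "(null)",
    # so the log alone cannot tell None apart from the text "(null)".
    block = "(null)" if rec.bg_block_id is None else rec.bg_block_id
    return f"block={block} job_running={rec.job_running}"


def real_blocks(records: List[BgRecord]) -> List[BgRecord]:
    # Keep only records with an actual ID; a future block (None) is skipped.
    return [r for r in records if r.bg_block_id is not None]


future = BgRecord(bg_block_id=None, ionode_str=None)         # block not made yet
misread = BgRecord(bg_block_id="(null)", ionode_str=None)    # what I had assumed
existing = BgRecord(bg_block_id="block-A", ionode_str="0-3", job_running=1)

print(describe(future) == describe(misread))  # True: identical log output
print(len(real_blocks([future, existing])))   # 1: the future block is dropped
```

The two records log identically, but only the one with a genuinely missing ID is filtered out, which is the branch I had wrongly assumed wasn't taken.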
