It looks like SLURM is trying to create a block to run the job now (hence the timestamp encoded in the block name), but the block is never created successfully (it overlaps other running jobs/blocks), and that's what gets it caught in this state:
slurmctld.log:

[2012-02-06T10:41:28] debug: sched: Running job scheduler
[2012-02-06T10:41:28] acct_job_limits: job 32263 wants 5898240 from assoc 24 (account vr0069) which has usage of 37835669 and is currently reserving 0, with a limit of 50159760
[2012-02-06T10:41:28] debug2: found 4 usable nodes from config containing bgp[000x011]
[2012-02-06T10:41:28] debug3: _pick_best_nodes: job 32263 idle_nodes 0 share_nodes 4
[2012-02-06T10:41:28] debug: bluegene:submit_job: 32263 mode=512 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-06T10:41:28] debug3: ComputNodeImage=default
[2012-02-06T10:41:28] debug3: RamDiskIoLoadImage=default
[2012-02-06T10:41:28] debug3: MloaderImage=default
[2012-02-06T10:41:28] debug2: 1 can't allocate
[2012-02-06T10:41:28] debug2: name = 010
[2012-02-06T10:41:28] debug3: process_nodes: 010 is included in this block
[2012-02-06T10:41:28] debug3: process_nodes: geo = 111 mp count is 1 start is 010
[2012-02-06T10:41:28] debug2: adding block
[2012-02-06T10:41:28] debug2: done adding
[2012-02-06T10:41:28] Record: BlockID:RMP06Fe104128150 Nodes:bgp010 Conn:Torus
[2012-02-06T10:41:28] debug: _find_best_block_match none found
[2012-02-06T10:41:28] debug: bluegene:submit_job: 32263 mode=513 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
[2012-02-06T10:41:28] debug3: ComputNodeImage=default
[2012-02-06T10:41:28] debug3: RamDiskIoLoadImage=default
[2012-02-06T10:41:28] debug3: MloaderImage=default
[2012-02-06T10:41:28] debug2: 1 can't allocate
[2012-02-06T10:41:28] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-02-06T10:41:28] debug2: name = 010
[2012-02-06T10:41:28] debug3: process_nodes: 010 is included in this block
[2012-02-06T10:41:28] debug3: process_nodes: geo = 111 mp count is 1 start is 010
[2012-02-06T10:41:28] debug: _find_best_block_match (null) <bgp010>
[2012-02-06T10:41:28] debug: 513 can start unassigned job 32263 at 1328485288 on bgp010
[2012-02-06T10:41:28] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-02-06T10:41:28] debug: _post_block_free: It appears this block RMP06Fe104128150 isn't in the main list anymore.
[2012-02-06T10:41:28] debug3: sched: JobId=32263. State=PENDING. Reason=Resources. Priority=10277. Partition=main.

bridgeapi.log:

<Feb 06 10:41:28.150657> BRIDGE (Debug): rm_add_partition() - Called
<Feb 06 10:41:28.150694> BRIDGE (Info) : rm_add_partition() - Partition id set to RMP06Fe104128150
<Feb 06 10:41:28.150755> BRIDGE (Debug): compute_relative_BP_positions() - The relative positions are: (0 0 0)
<Feb 06 10:41:28.224266> BRIDGE (Debug): rm_add_partition() - Completed Successfully
<Feb 06 10:41:28.258047> BRIDGE (Debug): rm_remove_partition() - Called
<Feb 06 10:41:28.284313> BRIDGE (Debug): rm_remove_partition() - Completed Successfully

It keeps doing this repeatedly, trying to create RMP06Fe104828401, RMP06Fe104928448, RMP06Fe105028149, RMP06Fe105128860...

Thanks!
Mark.

On 06/02/12 10:15, Mark Nelson wrote:
> Hi All,
>
> We moved from SLURM 2.2.7 to 2.3.3 on our two-rack Blue Gene/P last
> week in order to use the multiple front-end node facility. So far it's
> been great (we did the upgrade without losing any running jobs at all),
> so first of all, many thanks for everyone's hard work.
>
> As the subject suggests, though, we're seeing some issues where
> backfill of a midplane-sized job fails because SLURM seems to lose
> track of blocks (or gets confused by phantom blocks), and the jobs
> that should be backfilled after it are never tested and able to run.
> This effectively drains the machine, although I can get backfill
> working again by holding the midplane-sized job and letting the others
> queued up behind it jump in at an appropriate place.
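(Incidentally, a quick way to confirm the add/remove churn: count matched rm_add_partition()/rm_remove_partition() pairs in bridgeapi.log. This is just a sketch; the heredoc below is a stand-in sample, so point LOG at the real bridgeapi.log, wherever your site keeps it, to use it for real.)

```shell
# Sketch: tally rm_add_partition()/rm_remove_partition() pairs in a
# bridgeapi-style log.  The sample lines below are a stand-in for the
# real bridgeapi.log (timestamps on the second pair are illustrative).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
<Feb 06 10:41:28.150694> BRIDGE (Info) : rm_add_partition() - Partition id set to RMP06Fe104128150
<Feb 06 10:41:28.284313> BRIDGE (Debug): rm_remove_partition() - Completed Successfully
<Feb 06 10:48:28.401000> BRIDGE (Info) : rm_add_partition() - Partition id set to RMP06Fe104828401
<Feb 06 10:48:28.500000> BRIDGE (Debug): rm_remove_partition() - Completed Successfully
EOF
# Equal, climbing counts on every scheduler pass = phantom-block churn.
adds=$(grep -c 'rm_add_partition() - Partition id set to RMP' "$LOG")
removes=$(grep -c 'rm_remove_partition() - Completed Successfully' "$LOG")
echo "phantom blocks created: $adds, removed: $removes"
rm -f "$LOG"
```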
>
> Here's a snippet of the slurmctld.log file:
>
> [2012-02-06T09:10:42] backfill: beginning
> [2012-02-06T09:10:42] =========================================
> [2012-02-06T09:10:42] Begin:2012-02-06T09:10:42 End:2012-02-07T09:10:42 Nodes:bgp[000x011]
> [2012-02-06T09:10:42] =========================================
> [2012-02-06T09:10:42] backfill test for job 32263
> [2012-02-06T09:10:42] acct_job_limits: job 32263 wants 5898240 from assoc 24 (account vr0069) which has usage of 37835669 and is currently reserving 4815650, with a limit of 50159760
> [2012-02-06T09:10:42] debug2: backfill: entering _try_sched for job 32263.
> [2012-02-06T09:10:42] debug: bluegene:submit_job: 32263 mode=514 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-06T09:10:42] debug3: ComputNodeImage=default
> [2012-02-06T09:10:42] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:42] debug3: MloaderImage=default
> [2012-02-06T09:10:42] debug2: 1 can't allocate
> [2012-02-06T09:10:42] debug2: name = 000
> [2012-02-06T09:10:42] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:42] debug: _find_best_block_match (null) <bgp000>
> [2012-02-06T09:10:42] debug: 514 can start unassigned job 32263 at 1328479842 on bgp000
> [2012-02-06T09:10:42] debug2: backfill: finished _try_sched for job 32263.
> [2012-02-06T09:10:42] acct_job_limits: job 32263 wants 5898240 from assoc 24 (account vr0069) which has usage of 37835669 and is currently reserving 4815650, with a limit of 50159760
> [2012-02-06T09:10:42] debug2: found 4 usable nodes from config containing bgp[000x011]
> [2012-02-06T09:10:42] debug3: _pick_best_nodes: job 32263 idle_nodes 0 share_nodes 4
> [2012-02-06T09:10:42] debug: bluegene:submit_job: 32263 mode=512 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-06T09:10:42] debug3: ComputNodeImage=default
> [2012-02-06T09:10:42] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:42] debug3: MloaderImage=default
> [2012-02-06T09:10:42] debug2: 1 can't allocate
> [2012-02-06T09:10:42] debug2: name = 000
> [2012-02-06T09:10:42] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:42] debug2: adding block
> [2012-02-06T09:10:42] debug2: done adding
> [2012-02-06T09:10:42] Record: BlockID:RMP06Fe091042072 Nodes:bgp000 Conn:Torus
> [2012-02-06T09:10:42] debug: _find_best_block_match none found
> [2012-02-06T09:10:42] debug: bluegene:submit_job: 32263 mode=513 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-06T09:10:42] debug3: ComputNodeImage=default
> [2012-02-06T09:10:42] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:42] debug3: MloaderImage=default
> [2012-02-06T09:10:42] debug2: 1 can't allocate
> [2012-02-06T09:10:42] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
> [2012-02-06T09:10:42] debug2: name = 000
> [2012-02-06T09:10:42] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:42] debug: _find_best_block_match (null) <bgp000>
> [2012-02-06T09:10:42] debug: 513 can start unassigned job 32263 at 1328479842 on bgp000
> [2012-02-06T09:10:42] debug: _track_freeing_blocks: Freed them all for job 4294967294
> [2012-02-06T09:10:42] debug: _post_block_free: It appears this block RMP06Fe091042072 isn't in the main list anymore.
> [2012-02-06T09:10:42] debug3: backfill: Failed to start JobId=32263: Requested nodes are busy
> [2012-02-06T09:10:42] backfill: completed testing 1 jobs, usec=207282
> [2012-02-06T09:10:44] debug3: Processing RPC: REQUEST_NODE_INFO from uid=589
> [2012-02-06T09:10:44] debug3: _slurm_rpc_dump_nodes, size=512 usec=396
> [2012-02-06T09:10:44] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=589
> [2012-02-06T09:10:44] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:44] debug2: Processing RPC: REQUEST_BLOCK_INFO from uid=589
> [2012-02-06T09:10:49] debug2: Performing purge of old job records
> [2012-02-06T09:10:49] debug: sched: Running job scheduler
> [2012-02-06T09:10:49] acct_job_limits: job 32263 wants 5898240 from assoc 24 (account vr0069) which has usage of 37835669 and is currently reserving 4815650, with a limit of 50159760
> [2012-02-06T09:10:49] debug2: found 4 usable nodes from config containing bgp[000x011]
> [2012-02-06T09:10:49] debug3: _pick_best_nodes: job 32263 idle_nodes 0 share_nodes 4
> [2012-02-06T09:10:49] debug: bluegene:submit_job: 32263 mode=512 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-06T09:10:49] debug3: ComputNodeImage=default
> [2012-02-06T09:10:49] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:49] debug3: MloaderImage=default
> [2012-02-06T09:10:49] debug2: 1 can't allocate
> [2012-02-06T09:10:49] debug2: name = 000
> [2012-02-06T09:10:49] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:49] debug2: adding block
> [2012-02-06T09:10:49] debug3: Processing RPC: REQUEST_NODE_INFO from uid=589
> [2012-02-06T09:10:49] debug2: done adding
> [2012-02-06T09:10:49] Record: BlockID:RMP06Fe091049750 Nodes:bgp000 Conn:Torus
> [2012-02-06T09:10:49] debug: _find_best_block_match none found
> [2012-02-06T09:10:49] debug: bluegene:submit_job: 32263 mode=513 Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned nodes=1-1-1
> [2012-02-06T09:10:49] debug3: ComputNodeImage=default
> [2012-02-06T09:10:49] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:49] debug3: MloaderImage=default
> [2012-02-06T09:10:49] debug2: 1 can't allocate
> [2012-02-06T09:10:49] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
> [2012-02-06T09:10:49] debug2: name = 000
> [2012-02-06T09:10:49] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:49] debug: _find_best_block_match (null) <bgp000>
> [2012-02-06T09:10:49] debug: 513 can start unassigned job 32263 at 1328479849 on bgp000
> [2012-02-06T09:10:49] debug: _track_freeing_blocks: Freed them all for job 4294967294
> [2012-02-06T09:10:49] debug: _post_block_free: It appears this block RMP06Fe091049750 isn't in the main list anymore.
> [2012-02-06T09:10:49] debug3: sched: JobId=32263. State=PENDING. Reason=Resources. Priority=10268. Partition=main.
> [2012-02-06T09:10:49] debug3: sched: JobId=32264. State=PENDING. Reason=Priority. Priority=10197. Partition=main.
> [2012-02-06T09:10:49] debug3: sched: JobId=32266. State=PENDING. Reason=Priority. Priority=10147. Partition=main.
> [2012-02-06T09:10:49] debug3: sched: JobId=32265. State=PENDING. Reason=Resources. Priority=167. Partition=filler.
> [2012-02-06T09:10:49] debug2: Node select info for set all hasn't changed since 1328479842
> [2012-02-06T09:10:49] debug3: _slurm_rpc_dump_nodes, size=512 usec=46129
> [2012-02-06T09:10:49] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=589
> [2012-02-06T09:10:49] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:49] debug2: Processing RPC: REQUEST_BLOCK_INFO from uid=589
> [2012-02-06T09:10:49] debug2: Node select info hasn't changed since 1328479842
> [2012-02-06T09:10:49] debug3: _slurm_rpc_block_info: Data has not changed since time specified
> [2012-02-06T09:10:51] debug2: Running _test_down_switches
> [2012-02-06T09:10:51] debug2: Running _test_down_nodes
> [2012-02-06T09:10:54] debug2: Processing RPC: REQUEST_SHARE_INFO from uid=3562
> [2012-02-06T09:10:54] debug2: _slurm_rpc_get_shares usec=1103
> [2012-02-06T09:10:54] debug3: Processing RPC: REQUEST_NODE_INFO from uid=589
> [2012-02-06T09:10:54] debug3: _slurm_rpc_dump_nodes, size=512 usec=747
> [2012-02-06T09:10:54] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=589
> [2012-02-06T09:10:54] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:54] debug2: Processing RPC: REQUEST_BLOCK_INFO from uid=589
> [2012-02-06T09:10:59] debug3: Processing RPC: REQUEST_NODE_INFO from uid=589
> [2012-02-06T09:10:59] debug2: Node select info for set all hasn't changed since 1328479849
> [2012-02-06T09:10:59] debug3: _slurm_rpc_dump_nodes, size=512 usec=383
> [2012-02-06T09:10:59] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=589
> [2012-02-06T09:10:59] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:59] debug2: Processing RPC: REQUEST_BLOCK_INFO from uid=589
> [2012-02-06T09:10:59] debug2: Node select info hasn't changed since 1328479849
> [2012-02-06T09:10:59] debug3: _slurm_rpc_block_info: Data has not changed since time specified
> [2012-02-06T09:11:04] debug3: Processing RPC: REQUEST_NODE_INFO from uid=589
> [2012-02-06T09:11:04] debug2: Node select info for set all hasn't changed since 1328479849
> [2012-02-06T09:11:04] debug3: _slurm_rpc_dump_nodes, no change
> [2012-02-06T09:11:04] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=589
> [2012-02-06T09:11:04] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:11:04] debug2: Processing RPC: REQUEST_BLOCK_INFO from uid=589
> [2012-02-06T09:11:04] debug2: Node select info hasn't changed since 1328479849
> [2012-02-06T09:11:04] debug3: _slurm_rpc_block_info: Data has not changed since time specified
> [2012-02-06T09:11:06] debug2: Testing job time limits and checkpoints
> [2012-02-06T09:11:09] debug3: Processing RPC: REQUEST_NODE_INFO from uid=589
> [2012-02-06T09:11:09] debug2: Node select info for set all hasn't changed since 1328479849
> [2012-02-06T09:11:09] debug3: _slurm_rpc_dump_nodes, no change
> [2012-02-06T09:11:09] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=589
> [2012-02-06T09:11:09] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:11:09] debug2: Processing RPC: REQUEST_BLOCK_INFO from uid=589
> [2012-02-06T09:11:09] debug2: Node select info hasn't changed since 1328479849
> [2012-02-06T09:11:09] debug3: _slurm_rpc_block_info: Data has not changed since time specified
>
> I've also attached our slurm.conf and bluegene.conf files. The only
> change made to these files after the upgrade was adding the
> "FRONTEND NODES" section to the slurm.conf file.
>
> I can't see anything in the bridgeapi.log file that sheds any light on
> this, although the verbosity was set to level 0. I've just upped the
> logging level to 2, so that may give us something...
>
> Has anyone ever seen anything like this?
>
> Many thanks!
> Mark Nelson.
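P.S. For anyone following along, the logging bump mentioned above is the bridge API verbosity knob in bluegene.conf (parameter names as I understand them from the bluegene.conf man page; the log file path is site-specific):

```
# bluegene.conf -- raise bridge API logging from 0 to 2
BridgeAPILogFile=/var/log/slurm/bridgeapi.log   # site-specific path
BridgeAPIVerbose=2
```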
