It looks like SLURM is trying to create a block to run the job now (hence 
the timestamp encoded in the block name), but the block isn't being created 
successfully (it overlaps other running jobs/blocks), and that's what 
gets it stuck in this state:

slurmctld.log:
[2012-02-06T10:41:28] debug:  sched: Running job scheduler
[2012-02-06T10:41:28] acct_job_limits: job 32263 wants 5898240 from 
assoc 24 (account vr0069) which has usage of 37835669 and is currently 
reserving 0, with a limit of 50159760
[2012-02-06T10:41:28] debug2: found 4 usable nodes from config 
containing bgp[000x011]
[2012-02-06T10:41:28] debug3: _pick_best_nodes: job 32263 idle_nodes 0 
share_nodes 4
[2012-02-06T10:41:28] debug:  bluegene:submit_job: 32263 mode=512 
Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned 
nodes=1-1-1
[2012-02-06T10:41:28] debug3: ComputNodeImage=default
[2012-02-06T10:41:28] debug3: RamDiskIoLoadImage=default
[2012-02-06T10:41:28] debug3: MloaderImage=default
[2012-02-06T10:41:28] debug2: 1 can't allocate
[2012-02-06T10:41:28] debug2: name = 010
[2012-02-06T10:41:28] debug3: process_nodes: 010 is included in this block
[2012-02-06T10:41:28] debug3: process_nodes: geo = 111 mp count is 1 
start is 010
[2012-02-06T10:41:28] debug2: adding block
[2012-02-06T10:41:28] debug2: done adding
[2012-02-06T10:41:28] Record: BlockID:RMP06Fe104128150 Nodes:bgp010 
Conn:Torus
[2012-02-06T10:41:28] debug:  _find_best_block_match none found
[2012-02-06T10:41:28] debug:  bluegene:submit_job: 32263 mode=513 
Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned 
nodes=1-1-1
[2012-02-06T10:41:28] debug3: ComputNodeImage=default
[2012-02-06T10:41:28] debug3: RamDiskIoLoadImage=default
[2012-02-06T10:41:28] debug3: MloaderImage=default
[2012-02-06T10:41:28] debug2: 1 can't allocate
[2012-02-06T10:41:28] debug:  _track_freeing_blocks: Going to free 1 for 
job 4294967294
[2012-02-06T10:41:28] debug2: name = 010
[2012-02-06T10:41:28] debug3: process_nodes: 010 is included in this block
[2012-02-06T10:41:28] debug3: process_nodes: geo = 111 mp count is 1 
start is 010
[2012-02-06T10:41:28] debug:  _find_best_block_match (null) <bgp010>
[2012-02-06T10:41:28] debug:  513 can start unassigned job 32263 at 
1328485288 on bgp010
[2012-02-06T10:41:28] debug:  _track_freeing_blocks: Freed them all for 
job 4294967294
[2012-02-06T10:41:28] debug:  _post_block_free: It appears this block 
RMP06Fe104128150 isn't in the main list anymore.
[2012-02-06T10:41:28] debug3: sched: JobId=32263. State=PENDING. 
Reason=Resources. Priority=10277. Partition=main.

bridgeapi.log:
<Feb 06 10:41:28.150657> BRIDGE (Debug): rm_add_partition() - Called
<Feb 06 10:41:28.150694> BRIDGE (Info) : rm_add_partition() - Partition 
id set to RMP06Fe104128150
<Feb 06 10:41:28.150755> BRIDGE (Debug): compute_relative_BP_positions() 
- The relative positions are: (0 0 0)
<Feb 06 10:41:28.224266> BRIDGE (Debug): rm_add_partition() - Completed 
Successfully
<Feb 06 10:41:28.258047> BRIDGE (Debug): rm_remove_partition() - Called
<Feb 06 10:41:28.284313> BRIDGE (Debug): rm_remove_partition() - 
Completed Successfully

and it keeps repeating this cycle, trying to create RMP06Fe104828401, 
RMP06Fe104928448, RMP06Fe105028149, RMP06Fe105128860...
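
For what it's worth, the block id seems to encode its creation time: 
RMP06Fe104128150 lines up with the rm_add_partition() call at 
Feb 06 10:41:28.150 in the bridgeapi log, which is how I'm reading the 
timestamp on the block. A rough sketch of that decoding (the 
"RMP" + day + two-letter month + HHMMSSmmm layout is my guess from these 
ids, not documented behaviour):

```python
import re

def parse_block_id(block_id):
    """Best-effort decode of a SLURM-generated Blue Gene block id such as
    RMP06Fe104128150, guessed as 'RMP' + DD + Mo + HHMMSSmmm."""
    m = re.match(r"RMP(\d{2})([A-Z][a-z])(\d{2})(\d{2})(\d{2})(\d{3})$",
                 block_id)
    if not m:
        return None
    day, mon, hh, mm, ss, ms = m.groups()
    return f"{mon} {day} {hh}:{mm}:{ss}.{ms}"

print(parse_block_id("RMP06Fe104128150"))  # -> Fe 06 10:41:28.150
```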

Thanks!
Mark.


On 06/02/12 10:15, Mark Nelson wrote:
> Hi All,
>
> We moved from SLURM 2.2.7 to 2.3.3 on our two-rack Blue Gene/P last week
> in order to use the multiple front-end node facility. So far it's been
> great (did the upgrade without losing any running jobs at all). So first
> many thanks for everyone's hard work.
>
> As the subject suggests though, we're seeing some issues where backfill
> of a midplane-sized job will fail because SLURM seems to lose track of
> blocks (or gets confused with phantom blocks) and the jobs that should
> be backfilled after it are never tested and able to run. This
> effectively drains the machine, although I can get backfill working
> again by holding the midplane-sized job and letting the others queued up
> behind it jump in an appropriate place.
>
> Here's a snippet of the slurmctld.log file:
> [2012-02-06T09:10:42] backfill: beginning
> [2012-02-06T09:10:42] =========================================
> [2012-02-06T09:10:42] Begin:2012-02-06T09:10:42 End:2012-02-07T09:10:42
> Nodes:bgp[000x011]
> [2012-02-06T09:10:42] =========================================
> [2012-02-06T09:10:42] backfill test for job 32263
> [2012-02-06T09:10:42] acct_job_limits: job 32263 wants 5898240 from
> assoc 24 (account vr0069) which has usage of 37835669 and is currently
> reserving 4815650, with a limit of 50159760
> [2012-02-06T09:10:42] debug2: backfill: entering _try_sched for job 32263.
> [2012-02-06T09:10:42] debug: bluegene:submit_job: 32263 mode=514
> Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned
> nodes=1-1-1
> [2012-02-06T09:10:42] debug3: ComputNodeImage=default
> [2012-02-06T09:10:42] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:42] debug3: MloaderImage=default
> [2012-02-06T09:10:42] debug2: 1 can't allocate
> [2012-02-06T09:10:42] debug2: name = 000
> [2012-02-06T09:10:42] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:42] debug: _find_best_block_match (null) <bgp000>
> [2012-02-06T09:10:42] debug: 514 can start unassigned job 32263 at
> 1328479842 on bgp000
> [2012-02-06T09:10:42] debug2: backfill: finished _try_sched for job 32263.
> [2012-02-06T09:10:42] acct_job_limits: job 32263 wants 5898240 from
> assoc 24 (account vr0069) which has usage of 37835669 and is currently
> reserving 4815650, with a limit of 50159760
> [2012-02-06T09:10:42] debug2: found 4 usable nodes from config
> containing bgp[000x011]
> [2012-02-06T09:10:42] debug3: _pick_best_nodes: job 32263 idle_nodes 0
> share_nodes 4
> [2012-02-06T09:10:42] debug: bluegene:submit_job: 32263 mode=512
> Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned
> nodes=1-1-1
> [2012-02-06T09:10:42] debug3: ComputNodeImage=default
> [2012-02-06T09:10:42] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:42] debug3: MloaderImage=default
> [2012-02-06T09:10:42] debug2: 1 can't allocate
> [2012-02-06T09:10:42] debug2: name = 000
> [2012-02-06T09:10:42] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:42] debug2: adding block
> [2012-02-06T09:10:42] debug2: done adding
> [2012-02-06T09:10:42] Record: BlockID:RMP06Fe091042072 Nodes:bgp000
> Conn:Torus
> [2012-02-06T09:10:42] debug: _find_best_block_match none found
> [2012-02-06T09:10:42] debug: bluegene:submit_job: 32263 mode=513
> Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned
> nodes=1-1-1
> [2012-02-06T09:10:42] debug3: ComputNodeImage=default
> [2012-02-06T09:10:42] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:42] debug3: MloaderImage=default
> [2012-02-06T09:10:42] debug2: 1 can't allocate
> [2012-02-06T09:10:42] debug: _track_freeing_blocks: Going to free 1 for
> job 4294967294
> [2012-02-06T09:10:42] debug2: name = 000
> [2012-02-06T09:10:42] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:42] debug: _find_best_block_match (null) <bgp000>
> [2012-02-06T09:10:42] debug: 513 can start unassigned job 32263 at
> 1328479842 on bgp000
> [2012-02-06T09:10:42] debug: _track_freeing_blocks: Freed them all for
> job 4294967294
> [2012-02-06T09:10:42] debug: _post_block_free: It appears this block
> RMP06Fe091042072 isn't in the main list anymore.
> [2012-02-06T09:10:42] debug3: backfill: Failed to start JobId=32263:
> Requested nodes are busy
> [2012-02-06T09:10:42] backfill: completed testing 1 jobs, usec=207282
> [2012-02-06T09:10:44] debug3: Processing RPC: REQUEST_NODE_INFO from
> uid=589
> [2012-02-06T09:10:44] debug3: _slurm_rpc_dump_nodes, size=512 usec=396
> [2012-02-06T09:10:44] debug2: Processing RPC: REQUEST_PARTITION_INFO
> uid=589
> [2012-02-06T09:10:44] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:44] debug2: Processing RPC: REQUEST_BLOCK_INFO from
> uid=589
> [2012-02-06T09:10:49] debug2: Performing purge of old job records
> [2012-02-06T09:10:49] debug: sched: Running job scheduler
> [2012-02-06T09:10:49] acct_job_limits: job 32263 wants 5898240 from
> assoc 24 (account vr0069) which has usage of 37835669 and is currently
> reserving 4815650, with a limit of 50159760
> [2012-02-06T09:10:49] debug2: found 4 usable nodes from config
> containing bgp[000x011]
> [2012-02-06T09:10:49] debug3: _pick_best_nodes: job 32263 idle_nodes 0
> share_nodes 4
> [2012-02-06T09:10:49] debug: bluegene:submit_job: 32263 mode=512
> Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned
> nodes=1-1-1
> [2012-02-06T09:10:49] debug3: ComputNodeImage=default
> [2012-02-06T09:10:49] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:49] debug3: MloaderImage=default
> [2012-02-06T09:10:49] debug2: 1 can't allocate
> [2012-02-06T09:10:49] debug2: name = 000
> [2012-02-06T09:10:49] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:49] debug2: adding block
> [2012-02-06T09:10:49] debug3: Processing RPC: REQUEST_NODE_INFO from
> uid=589
> [2012-02-06T09:10:49] debug2: done adding
> [2012-02-06T09:10:49] Record: BlockID:RMP06Fe091049750 Nodes:bgp000
> Conn:Torus
> [2012-02-06T09:10:49] debug: _find_best_block_match none found
> [2012-02-06T09:10:49] debug: bluegene:submit_job: 32263 mode=513
> Connection=Torus Reboot=no Rotate=yes Geometry=0x0x0 Block_ID=unassigned
> nodes=1-1-1
> [2012-02-06T09:10:49] debug3: ComputNodeImage=default
> [2012-02-06T09:10:49] debug3: RamDiskIoLoadImage=default
> [2012-02-06T09:10:49] debug3: MloaderImage=default
> [2012-02-06T09:10:49] debug2: 1 can't allocate
> [2012-02-06T09:10:49] debug: _track_freeing_blocks: Going to free 1 for
> job 4294967294
> [2012-02-06T09:10:49] debug2: name = 000
> [2012-02-06T09:10:49] debug3: process_nodes: 000 is included in this block
> [2012-02-06T09:10:49] debug: _find_best_block_match (null) <bgp000>
> [2012-02-06T09:10:49] debug: 513 can start unassigned job 32263 at
> 1328479849 on bgp000
> [2012-02-06T09:10:49] debug: _track_freeing_blocks: Freed them all for
> job 4294967294
> [2012-02-06T09:10:49] debug: _post_block_free: It appears this block
> RMP06Fe091049750 isn't in the main list anymore.
> [2012-02-06T09:10:49] debug3: sched: JobId=32263. State=PENDING.
> Reason=Resources. Priority=10268. Partition=main.
> [2012-02-06T09:10:49] debug3: sched: JobId=32264. State=PENDING.
> Reason=Priority. Priority=10197. Partition=main.
> [2012-02-06T09:10:49] debug3: sched: JobId=32266. State=PENDING.
> Reason=Priority. Priority=10147. Partition=main.
> [2012-02-06T09:10:49] debug3: sched: JobId=32265. State=PENDING.
> Reason=Resources. Priority=167. Partition=filler.
> [2012-02-06T09:10:49] debug2: Node select info for set all hasn't
> changed since 1328479842
> [2012-02-06T09:10:49] debug3: _slurm_rpc_dump_nodes, size=512 usec=46129
> [2012-02-06T09:10:49] debug2: Processing RPC: REQUEST_PARTITION_INFO
> uid=589
> [2012-02-06T09:10:49] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:49] debug2: Processing RPC: REQUEST_BLOCK_INFO from
> uid=589
> [2012-02-06T09:10:49] debug2: Node select info hasn't changed since
> 1328479842
> [2012-02-06T09:10:49] debug3: _slurm_rpc_block_info: Data has not
> changed since time specified
> [2012-02-06T09:10:51] debug2: Running _test_down_switches
> [2012-02-06T09:10:51] debug2: Running _test_down_nodes
> [2012-02-06T09:10:54] debug2: Processing RPC: REQUEST_SHARE_INFO from
> uid=3562
> [2012-02-06T09:10:54] debug2: _slurm_rpc_get_shares usec=1103
> [2012-02-06T09:10:54] debug3: Processing RPC: REQUEST_NODE_INFO from
> uid=589
> [2012-02-06T09:10:54] debug3: _slurm_rpc_dump_nodes, size=512 usec=747
> [2012-02-06T09:10:54] debug2: Processing RPC: REQUEST_PARTITION_INFO
> uid=589
> [2012-02-06T09:10:54] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:54] debug2: Processing RPC: REQUEST_BLOCK_INFO from
> uid=589
> [2012-02-06T09:10:59] debug3: Processing RPC: REQUEST_NODE_INFO from
> uid=589
> [2012-02-06T09:10:59] debug2: Node select info for set all hasn't
> changed since 1328479849
> [2012-02-06T09:10:59] debug3: _slurm_rpc_dump_nodes, size=512 usec=383
> [2012-02-06T09:10:59] debug2: Processing RPC: REQUEST_PARTITION_INFO
> uid=589
> [2012-02-06T09:10:59] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:10:59] debug2: Processing RPC: REQUEST_BLOCK_INFO from
> uid=589
> [2012-02-06T09:10:59] debug2: Node select info hasn't changed since
> 1328479849
> [2012-02-06T09:10:59] debug3: _slurm_rpc_block_info: Data has not
> changed since time specified
> [2012-02-06T09:11:04] debug3: Processing RPC: REQUEST_NODE_INFO from
> uid=589
> [2012-02-06T09:11:04] debug2: Node select info for set all hasn't
> changed since 1328479849
> [2012-02-06T09:11:04] debug3: _slurm_rpc_dump_nodes, no change
> [2012-02-06T09:11:04] debug2: Processing RPC: REQUEST_PARTITION_INFO
> uid=589
> [2012-02-06T09:11:04] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:11:04] debug2: Processing RPC: REQUEST_BLOCK_INFO from
> uid=589
> [2012-02-06T09:11:04] debug2: Node select info hasn't changed since
> 1328479849
> [2012-02-06T09:11:04] debug3: _slurm_rpc_block_info: Data has not
> changed since time specified
> [2012-02-06T09:11:06] debug2: Testing job time limits and checkpoints
> [2012-02-06T09:11:09] debug3: Processing RPC: REQUEST_NODE_INFO from
> uid=589
> [2012-02-06T09:11:09] debug2: Node select info for set all hasn't
> changed since 1328479849
> [2012-02-06T09:11:09] debug3: _slurm_rpc_dump_nodes, no change
> [2012-02-06T09:11:09] debug2: Processing RPC: REQUEST_PARTITION_INFO
> uid=589
> [2012-02-06T09:11:09] debug2: _slurm_rpc_dump_partitions, no change
> [2012-02-06T09:11:09] debug2: Processing RPC: REQUEST_BLOCK_INFO from
> uid=589
> [2012-02-06T09:11:09] debug2: Node select info hasn't changed since
> 1328479849
> [2012-02-06T09:11:09] debug3: _slurm_rpc_block_info: Data has not
> changed since time specified
>
> I've also attached our slurm.conf and bluegene.conf files. The only
> change that was made to these files after the upgrade was adding the
> "FRONTEND NODES" section to the slurm.conf file.
>
> I can't see anything in the bridgeapi.log file that sheds any light on
> this, although the verbosity was set to level 0. I've just upped the
> logging level to 2 on this so that may give us something...
>
> Has anyone ever seen anything like this?
>
> Many thanks!
> Mark Nelson.