On 08/30/2011 07:00 PM, Danny Auble wrote:
Tim, could you raise BridgeAPIVerbose to 3 or so in bluegene.conf and post
/var/log/slurm/bridgeapi.log? That would be helpful. This seems rather
strange, though.
I am guessing it happens every time? For what it is worth, the crash is in the
IBM code, so we can't do much about it. It seems strange that a bunch of
these blocks get made and then the last one fails with a bunch of errors
about the block.
Yep, every time. The bridgeapi.log corresponding to that is:
http://scorec.rpi.edu/~wickbt/bridgeapi.log
Have you tried doing a clean start? Perhaps there is something wrong with the
state load from 2.2 to 2.3?
The logs you're seeing are from:
/bgl/local/slurm/sbin/slurmctld -B -c -D -vvvvv
I've now rerun after removing /var/spool/slurm/* as well; the error does
change somewhat:
slurmctld: RMP30Au192651051 not found in the state file, adding
slurmctld: debug3: Block RMP30Au192651051 is in state Free
slurmctld: RMP30Au192651043 not found in the state file, adding
slurmctld: debug3: Block RMP30Au192651043 is in state Free
slurmctld: RMP30Au192651034 not found in the state file, adding
slurmctld: debug3: Block RMP30Au192651034 is in state Free
slurmctld: RMP30Au192651028 not found in the state file, adding
slurmctld: debug3: Block RMP30Au192651028 is in state Free
slurmctld: RMP30Au192651011 not found in the state file, adding
slurmctld: debug3: Block RMP30Au192651011 is in state Free
slurmctld: removing all current blocks (clean start)
slurmctld: debug: _track_freeing_blocks: Going to free 12 for job 4294967294
slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
slurmctld: error: Requesting small block with 0 mps, needs to be 1.
slurmctld: fatal: Error, could not create the static blocks
bridgeapi.log corresponding to this run is:
http://scorec.rpi.edu/~wickbt/bridgeapi-clean.log
full slurmctld log:
http://scorec.rpi.edu/~wickbt/slurmctldlog-clean
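
For anyone triaging logs like the one above, here is a minimal sketch that pulls the error/fatal lines out of slurmctld output and tallies the failing bridge calls by field. The function name `summarize`, the regular expressions, and the `SAMPLE` text are illustrative assumptions based only on the line format seen in this excerpt; they are not part of SLURM.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt in this thread:
#   slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
LINE_RE = re.compile(r"^slurmctld: (error|fatal): (.*)$")
BRIDGE_RE = re.compile(r"^(bridge_\w+)\((\w+)\): (.*)$")


def summarize(log_text):
    """Return (bridge_failures, other) for the error/fatal lines in log_text.

    bridge_failures counts (call, field, reason) tuples; other is a list of
    (level, message) pairs that did not look like a bridge call.
    """
    bridge_failures = Counter()
    other = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip debug/info lines
        level, msg = m.groups()
        b = BRIDGE_RE.match(msg)
        if b:
            bridge_failures[b.groups()] += 1
        else:
            other.append((level, msg))
    return bridge_failures, other


# Lines copied from the clean-start run quoted above.
SAMPLE = """\
slurmctld: removing all current blocks (clean start)
slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
slurmctld: error: Requesting small block with 0 mps, needs to be 1.
slurmctld: fatal: Error, could not create the static blocks
"""

if __name__ == "__main__":
    failures, other = summarize(SAMPLE)
    for (call, field, reason), count in failures.items():
        print(f"{call}({field}): {reason} x{count}")
    for level, msg in other:
        print(f"{level}: {msg}")
```

Running the same function over the full log from both runs would make it easy to compare which bridge calls fail before and after the clean start.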
- Tim
On Tuesday August 30 2011 6:53:17 PM you wrote:
Hey guys -
We were about to start testing out 2.3.0-rc2 on our 1-rack BG/L @ RPI,
but have not been able to launch slurmctld.
I've poked around and haven't found an obvious cause yet, although I can
see that the block creation code has been changed a decent amount
compared to 2.2 to make room for BG/Q.
The crash is:
slurmctld: Record: BlockID:RMP30Au165517092 Nodes:bp000[2] Conn:Small
slurmctld: debug2: adding block
slurmctld: debug2: done adding
slurmctld: Record: BlockID:RMP30Au165517102 Nodes:bp000[1] Conn:Small
slurmctld: debug2: adding block
slurmctld: debug2: done adding
slurmctld: Record: BlockID:RMP30Au165517115 Nodes:bp000[0] Conn:Small
slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
slurmctld: error: Requesting small block with 0 mps, needs to be 1.
slurmctld: fatal: Error, could not create the static blocks
-CLI INVALID HANDLE-----
cliRC = -2
line = 242
file = TxObject.cc
slurmctld: error: bridge_get_block_info(RMP08Fe113120123): Internal error
Segmentation fault
The full debug output is
http://scorec.rpi.edu/~wickbt/slurmctld-crash-2.3.0b2
Our slurm.conf is http://scorec.rpi.edu/~wickbt/slurm.conf
Our bluegene.conf is http://scorec.rpi.edu/~wickbt/bluegene.conf
As an added challenge, it does *not* crash under the BG/L emulation
mode... I suspect this narrows it down to some mishandling of the
bg_record struct before the call into _pre_allocate()?
Any ideas?
thanks,
- Tim
--
Tim Wickberg
Senior System Administrator
Office of Research / SCOREC, Rensselaer Polytechnic Institute