Tim,

Try these patches. I believe they will fix your problems.

This one deals with an unpacking issue when going from 2.2 -> 2.3:
https://github.com/SchedMD/slurm/commit/ea01a57fb3325492b34a5b9365de733da5b549f5

This one should fix the problem you are experiencing. You should only have seen it when doing a clean start.
https://github.com/SchedMD/slurm/commit/5e5ff72d1aec555e18e93e7221780665cb348e57

Let me know if you see anything else. Real BGL machines are not as easy for us to access anymore.

Danny

On Tuesday August 30 2011 8:14:36 PM you wrote:
> On 08/30/2011 07:00 PM, Danny Auble wrote:
> > Tim, could you up your debug to 3 or so on the BridgeAPIVerbose in the
> > bluegene.conf and post the /var/log/slurm/bridgeapi.log that would be helpful,
> > but this seems rather strange.
> >
> > I am guessing it happens every time? For what it is worth, the crash is in
> > the IBM stuff so we can't do much about it. It seems strange you get a bunch
> > of these blocks made and then the last one doesn't get made with a bunch of
> > errors about the block.
>
> Yep, every time. Bridgeapi.log corresponding to that is
> http://scorec.rpi.edu/~wickbt/bridgeapi.log
>
> > Have you tried doing a clean start? Perhaps there is something wrong with
> > the state load from 2.2 to 2.3?
>
> The logs you're seeing are from:
>
> /bgl/local/slurm/sbin/slurmctld -B -c -D -vvvvv
>
> I've rerun now after removing /var/spool/slurm/* as well; the error does
> change some:
>
> slurmctld: RMP30Au192651051 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651051 is in state Free
> slurmctld: RMP30Au192651043 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651043 is in state Free
> slurmctld: RMP30Au192651034 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651034 is in state Free
> slurmctld: RMP30Au192651028 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651028 is in state Free
> slurmctld: RMP30Au192651011 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651011 is in state Free
> slurmctld: removing all current blocks (clean start)
> slurmctld: debug: _track_freeing_blocks: Going to free 12 for job
> 4294967294
> slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
> slurmctld: error: Requesting small block with 0 mps, needs to be 1.
> slurmctld: fatal: Error, could not create the static blocks
>
> bridgeapi.log corresponding to this run is:
> http://scorec.rpi.edu/~wickbt/bridgeapi-clean.log
> full slurmctld log:
> http://scorec.rpi.edu/~wickbt/slurmctldlog-clean
>
> - Tim
>
> > On Tuesday August 30 2011 6:53:17 PM you wrote:
> >> Hey guys -
> >>
> >> We were about to start testing out 2.3.0-rc2 on our 1-rack BG/L @ RPI,
> >> but have not been able to launch slurmctld.
> >>
> >> I've poked around and haven't found an obvious cause yet, although I can
> >> see that the block creation code has been changed a decent amount
> >> compared to 2.2 to make room for BG/Q.
> >>
> >> The crash is:
> >>
> >> slurmctld: Record: BlockID:RMP30Au165517092 Nodes:bp000[2] Conn:Small
> >> slurmctld: debug2: adding block
> >> slurmctld: debug2: done adding
> >> slurmctld: Record: BlockID:RMP30Au165517102 Nodes:bp000[1] Conn:Small
> >> slurmctld: debug2: adding block
> >> slurmctld: debug2: done adding
> >> slurmctld: Record: BlockID:RMP30Au165517115 Nodes:bp000[0] Conn:Small
> >> slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
> >> slurmctld: error: Requesting small block with 0 mps, needs to be 1.
> >> slurmctld: fatal: Error, could not create the static blocks
> >>
> >> -CLI INVALID HANDLE-----
> >> cliRC = -2
> >> line = 242
> >> file = TxObject.cc
> >> slurmctld: error: bridge_get_block_info(RMP08Fe113120123): Internal error
> >> Segmentation fault
> >>
> >>
> >> The full debug output is
> >> http://scorec.rpi.edu/~wickbt/slurmctld-crash-2.3.0b2
> >>
> >> Our slurm.conf is http://scorec.rpi.edu/~wickbt/slurm.conf
> >> Our bluegene.conf is http://scorec.rpi.edu/~wickbt/bluegene.conf
> >>
> >> As an added challenge, it does *not* crash under the BG/L emulation
> >> mode... I suspect this narrows it down to some potential mishandling of
> >> the bg_record struct before the call in to _pre_allocate() ?
> >>
> >> Any ideas?
> >>
> >> thanks,
> >> - Tim
> >>
> >> --
> >> Tim Wickberg
> >> [email protected]
> >> Senior System Administrator
> >> Office of Research / SCOREC, Rensselaer Polytechnic Institute
> >>
>
