Tim,

Try these patches.  I believe they will fix your problems.

This one deals with an unpacking issue when going from 2.2 -> 2.3
https://github.com/SchedMD/slurm/commit/ea01a57fb3325492b34a5b9365de733da5b549f5

This one should fix the problem you are experiencing.  You should only have
seen it when doing a clean start.
https://github.com/SchedMD/slurm/commit/5e5ff72d1aec555e18e93e7221780665cb348e57
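
In case it helps, here is a minimal sketch of one way to pull both commits into a source tree. It assumes curl and GNU patch are installed and relies on GitHub serving a commit as a plain patch when ".patch" is appended to the commit URL. SRC_DIR, the temporary file name, and the helper function names are hypothetical, and I have not verified that either commit applies cleanly on top of 2.3.0-rc2, hence the dry run first.

```shell
#!/bin/sh
# Sketch only: SRC_DIR is a hypothetical path to an unpacked slurm source tree.
SRC_DIR="${SRC_DIR:-$HOME/slurm-2.3.0-rc2}"

# GitHub serves a commit as a plain patch when ".patch" is appended to its URL.
patch_url() {
    printf '%s.patch\n' "$1"
}

# Download one commit and apply it to SRC_DIR.
apply_commit() {
    tmp="${TMPDIR:-/tmp}/slurm-fix.$$.patch"
    curl -fsSL "$(patch_url "$1")" -o "$tmp" || return 1
    # -p1 strips the leading a/ and b/ path components. The dry run comes
    # first so a patch that does not apply leaves the tree untouched.
    (cd "$SRC_DIR" &&
        patch -p1 --dry-run < "$tmp" > /dev/null &&
        patch -p1 < "$tmp")
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage (needs network access; rebuild and reinstall slurm afterwards):
# apply_commit https://github.com/SchedMD/slurm/commit/ea01a57fb3325492b34a5b9365de733da5b549f5
# apply_commit https://github.com/SchedMD/slurm/commit/5e5ff72d1aec555e18e93e7221780665cb348e57
```

The two usage lines are left commented out so that sourcing the script does not touch the network or the tree by accident.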

Let me know if you see anything else.  We no longer have easy access to real
BG/L machines.

Danny

On Tuesday August 30 2011 8:14:36 PM you wrote:
> On 08/30/2011 07:00 PM, Danny Auble wrote:
> > Tim, could you raise BridgeAPIVerbose to 3 or so in bluegene.conf and
> > post /var/log/slurm/bridgeapi.log?  That would be helpful, but this seems
> > rather strange.
> >
> > I am guessing it happens every time?  For what it is worth, the crash is in
> > the IBM stuff so we can't do much about it.  It seems strange you get a
> > bunch of these blocks made and then the last one doesn't get made with a
> > bunch of errors about the block.
> 
> Yep, every time. Bridgeapi.log corresponding to that is 
> http://scorec.rpi.edu/~wickbt/bridgeapi.log
> 
> > Have you tried doing a clean start?  Perhaps there is something wrong with 
> > the state load from 2.2 to 2.3?
> 
> The logs you're seeing are from:
> 
> /bgl/local/slurm/sbin/slurmctld -B -c -D -vvvvv
> 
> I've rerun now after removing /var/spool/slurm/* as well; the error does 
> change some:
> 
> slurmctld: RMP30Au192651051 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651051 is in state Free
> slurmctld: RMP30Au192651043 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651043 is in state Free
> slurmctld: RMP30Au192651034 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651034 is in state Free
> slurmctld: RMP30Au192651028 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651028 is in state Free
> slurmctld: RMP30Au192651011 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651011 is in state Free
> slurmctld: removing all current blocks (clean start)
> slurmctld: debug:  _track_freeing_blocks: Going to free 12 for job 4294967294
> slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
> slurmctld: error: Requesting small block with 0 mps, needs to be 1.
> slurmctld: fatal: Error, could not create the static blocks
> 
> bridgeapi.log corresponding to this run is:
> http://scorec.rpi.edu/~wickbt/bridgeapi-clean.log
> full slurmctld log:
> http://scorec.rpi.edu/~wickbt/slurmctldlog-clean
> 
> - Tim
> 
> > On Tuesday August 30 2011 6:53:17 PM you wrote:
> >> Hey guys -
> >>
> >> We were about to start testing out 2.3.0-rc2 on our 1-rack BG/L @ RPI,
> >> but have not been able to launch slurmctld.
> >>
> >> I've poked around and haven't found an obvious cause yet, although I can
> >> see that the block creation code has been changed a decent amount
> >> compared to 2.2 to make room for BG/Q.
> >>
> >> The crash is:
> >>
> >> slurmctld: Record: BlockID:RMP30Au165517092 Nodes:bp000[2] Conn:Small
> >> slurmctld: debug2: adding block
> >> slurmctld: debug2: done adding
> >> slurmctld: Record: BlockID:RMP30Au165517102 Nodes:bp000[1] Conn:Small
> >> slurmctld: debug2: adding block
> >> slurmctld: debug2: done adding
> >> slurmctld: Record: BlockID:RMP30Au165517115 Nodes:bp000[0] Conn:Small
> >> slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
> >> slurmctld: error: Requesting small block with 0 mps, needs to be 1.
> >> slurmctld: fatal: Error, could not create the static blocks
> >>
> >> -CLI INVALID HANDLE-----
> >>     cliRC = -2
> >>     line  = 242
> >>     file  = TxObject.cc
> >> slurmctld: error: bridge_get_block_info(RMP08Fe113120123): Internal error
> >> Segmentation fault
> >>
> >>
> >> The full debug output is
> >> http://scorec.rpi.edu/~wickbt/slurmctld-crash-2.3.0b2
> >>
> >> Our slurm.conf is http://scorec.rpi.edu/~wickbt/slurm.conf
> >> Our bluegene.conf is http://scorec.rpi.edu/~wickbt/bluegene.conf
> >>
> >> As an added challenge, it does *not* crash under the BG/L emulation
> >> mode... I suspect this narrows it down to some potential mishandling of
> >> the bg_record struct before the call into _pre_allocate()?
> >>
> >> Any ideas?
> >>
> >> thanks,
> >> - Tim
> >>
> >> --
> >> Tim Wickberg
> >> [email protected]
> >> Senior System Administrator
> >> Office of Research / SCOREC, Rensselaer Polytechnic Institute
> >>
> 
