Tim,
Try the attached patch and let me know whether it fixes the problem. If it
does, we will probably follow up with a more complete fix.
Danny
> On 08/30/2011 07:00 PM, Danny Auble wrote:
> > Tim, could you raise BridgeAPIVerbose in bluegene.conf to 3 or so and post
> > the resulting /var/log/slurm/bridgeapi.log? That would be helpful; this
> > seems rather strange.
> >
> > I am guessing it happens every time? For what it is worth, the crash is in
> > the IBM stuff, so we can't do much about it. It seems strange that a bunch
> > of these blocks get made and then the last one doesn't, with a bunch of
> > errors about the block.
>
> Yep, every time. Bridgeapi.log corresponding to that is
> http://scorec.rpi.edu/~wickbt/bridgeapi.log
>
> > Have you tried doing a clean start? Perhaps there is something wrong with
> > the state load from 2.2 to 2.3?
>
> The logs you're seeing are from:
>
> /bgl/local/slurm/sbin/slurmctld -B -c -D -vvvvv
>
> I've rerun now after removing /var/spool/slurm/* as well; the error does
> change some:
>
> slurmctld: RMP30Au192651051 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651051 is in state Free
> slurmctld: RMP30Au192651043 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651043 is in state Free
> slurmctld: RMP30Au192651034 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651034 is in state Free
> slurmctld: RMP30Au192651028 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651028 is in state Free
> slurmctld: RMP30Au192651011 not found in the state file, adding
> slurmctld: debug3: Block RMP30Au192651011 is in state Free
> slurmctld: removing all current blocks (clean start)
> slurmctld: debug: _track_freeing_blocks: Going to free 12 for job 4294967294
> slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
> slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
> slurmctld: error: Requesting small block with 0 mps, needs to be 1.
> slurmctld: fatal: Error, could not create the static blocks
>
> bridgeapi.log corresponding to this run is:
> http://scorec.rpi.edu/~wickbt/bridgeapi-clean.log
> full slurmctld log:
> http://scorec.rpi.edu/~wickbt/slurmctldlog-clean
>
> - Tim
>
> > On Tuesday August 30 2011 6:53:17 PM you wrote:
> >> Hey guys -
> >>
> >> We were about to start testing out 2.3.0-rc2 on our 1-rack BG/L @ RPI,
> >> but have not been able to launch slurmctld.
> >>
> >> I've poked around and haven't found an obvious cause yet, although I can
> >> see that the block creation code has been changed a decent amount
> >> compared to 2.2 to make room for BG/Q.
> >>
> >> The crash is:
> >>
> >> slurmctld: Record: BlockID:RMP30Au165517092 Nodes:bp000[2] Conn:Small
> >> slurmctld: debug2: adding block
> >> slurmctld: debug2: done adding
> >> slurmctld: Record: BlockID:RMP30Au165517102 Nodes:bp000[1] Conn:Small
> >> slurmctld: debug2: adding block
> >> slurmctld: debug2: done adding
> >> slurmctld: Record: BlockID:RMP30Au165517115 Nodes:bp000[0] Conn:Small
> >> slurmctld: error: bridge_set_data(RM_PartitionBlrtsImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionLinuxImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionRamdiskImg): Invalid input
> >> slurmctld: error: bridge_set_data(RM_PartitionMloaderImg): Invalid input
> >> slurmctld: error: Requesting small block with 0 mps, needs to be 1.
> >> slurmctld: fatal: Error, could not create the static blocks
> >>
> >> -CLI INVALID HANDLE-----
> >> cliRC = -2
> >> line = 242
> >> file = TxObject.cc
> >> slurmctld: error: bridge_get_block_info(RMP08Fe113120123): Internal error
> >> Segmentation fault
> >>
> >>
> >> The full debug output is
> >> http://scorec.rpi.edu/~wickbt/slurmctld-crash-2.3.0b2
> >>
> >> Our slurm.conf is http://scorec.rpi.edu/~wickbt/slurm.conf
> >> Our bluegene.conf is http://scorec.rpi.edu/~wickbt/bluegene.conf
> >>
> >> As an added challenge, it does *not* crash under the BG/L emulation
> >> mode... I suspect this narrows it down to some potential mishandling of
> >> the bg_record struct before the call in to _pre_allocate() ?
> >>
> >> Any ideas?
> >>
> >> thanks,
> >> - Tim
> >>
> >> --
> >> Tim Wickberg
> >> [email protected]
> >> Senior System Administrator
> >> Office of Research / SCOREC, Rensselaer Polytechnic Institute
> >>
>
>
diff --git a/src/plugins/select/bluegene/select_bluegene.c b/src/plugins/select/bluegene/select_bluegene.c
index 1aff0b8..b0c0129 100644
--- a/src/plugins/select/bluegene/select_bluegene.c
+++ b/src/plugins/select/bluegene/select_bluegene.c
@@ -322,17 +322,17 @@ static bg_record_t *_translate_info_2_record(block_info_t *block_info)
used_bitmap = bit_alloc(node_record_count);
ionode_bitmap = bit_alloc(bg_conf->ionodes_per_mp);
- if (bg_recover && (inx2bitstr(mp_bitmap, block_info->mp_inx) == -1))
+ if ((inx2bitstr(mp_bitmap, block_info->mp_inx) == -1) && bg_recover)
fatal("Job state recovered incompatible with "
"bluegene.conf. mp=%u",
node_record_count);
- if (bg_recover
- && (inx2bitstr(used_bitmap, block_info->mp_used_inx) == -1))
+ if ((inx2bitstr(used_bitmap, block_info->mp_used_inx) == -1)
+ && bg_recover)
fatal("Job state recovered incompatible with "
"bluegene.conf. used=%u",
node_record_count);
- if (bg_recover
- && (inx2bitstr(ionode_bitmap, block_info->ionode_inx) == -1))
+ if ((inx2bitstr(ionode_bitmap, block_info->ionode_inx) == -1)
+ && bg_recover)
fatal("Job state recovered incompatible with "
"bluegene.conf. ionodes=%u",
bg_conf->ionodes_per_mp);
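
A note on why swapping the operands of && in the patch above can matter: C
evaluates the left operand first and skips the right one when the left is
false. With bg_recover on the left, inx2bitstr() would not run at all on a
clean start, so the bitmaps would stay empty; that reading is an inference from
the diff, not something confirmed in the thread. The sketch below is a minimal
illustration of that short-circuit behavior only. fill_map(), count_set(),
translate_flag_first() and translate_call_first() are hypothetical names, and
fill_map() takes a plain -1 terminated index list, which is a simplification
and not the real inx2bitstr() input format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAP_BITS 8

/* Hypothetical stand-in for a call that both fills an output bitmap and
 * reports errors. Sets map[i] for each index in the -1 terminated list.
 * Returns 0 on success, -1 if an index is out of range. */
static int fill_map(bool map[MAP_BITS], const int *inx)
{
    assert(map != NULL && inx != NULL);
    for (size_t i = 0; inx[i] != -1; i++) {
        if (inx[i] < 0 || inx[i] >= MAP_BITS)
            return -1;
        map[inx[i]] = true;
    }
    return 0;
}

/* Count how many entries of the bitmap are set. */
static int count_set(const bool map[MAP_BITS])
{
    int n = 0;
    for (size_t i = 0; i < MAP_BITS; i++)
        if (map[i])
            n++;
    return n;
}

/* Flag on the left: when recover is false, fill_map() is never called,
 * so the bitmap is left untouched. */
static int translate_flag_first(bool recover, const int *inx,
                                bool map[MAP_BITS])
{
    if (recover && (fill_map(map, inx) == -1))
        return -1;
    return count_set(map);
}

/* Call on the left: fill_map() always runs; recover only decides whether
 * a failure is treated as an error. */
static int translate_call_first(bool recover, const int *inx,
                                bool map[MAP_BITS])
{
    if ((fill_map(map, inx) == -1) && recover)
        return -1;
    return count_set(map);
}
```

With recover set to false and the index list {1, 3, 5, -1}, the flag-first
variant leaves the bitmap empty and returns 0, while the call-first variant
fills it and returns 3. With recover set to true the two variants behave the
same.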