sstat doesn't work on any bluegene system. I'll disable it on those systems in future versions.
If you could open a bug on this at http://bugs.SchedMD.com that would be helpful.

Thanks,
Danny

Carl Schmidtmann <[email protected]> wrote:
>
>We are seeing a strange error from sstat on our BGQ. It is claiming the
>host bg0001 (or bg0000) is unknown. Node definitions in BGQ are not
>real nodes that can be looked up or connected to since everything is
>handled through I/O nodes and blocks which are not configured in the
>slurm files.
>
>Is there something I am missing in the config files?
>
>Config files and error message below. Slurm 2.4.3 on BGQ single rack
>system. I plan to upgrade to 2.4.4 after SC12.
>
>Thanks,
>Carl
>
>[cschmid7_local@bgqsn]$ sstat -j 313846
>JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS
>MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask
>AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks
>------------ ---------- -------------- -------------- ----------
>---------- ---------- ---------- ---------- -------- ------------
>-------------- ---------- ---------- ---------- ---------- ----------
>--------
>sstat: error: Unable to resolve "bg0001": Unknown host
>sstat: error: fwd_tree_thread: can't find address for host bg0001,
>check slurm.conf
>sstat: error: slurm_job_step_stat: unknown return given from bg0001:
>9001 rc = Communication connection failure
>sstat: error: problem getting step_layout for 313846.0: Communication
>connection failure
>
>[cschmid7_local@bgqsn]$ grep -v '^#' slurm.conf
>ControlMachine=bgqsn
>AuthType=auth/munge
>CacheGroups=0
>CryptoType=crypto/munge
>Epilog=/usr/local/slurm/current/sbin/epilog.bash
>JobSubmitPlugins=ur_cnode
>Licenses=cnode*992
>MpiDefault=none
>ProctrackType=proctrack/pgid
>Prolog=/usr/local/slurm/current/sbin/prolog.bash
>ReturnToService=1
>SlurmctldPidFile=/var/run/slurmctld.pid
>SlurmctldPort=6817
>SlurmdPidFile=/var/run/slurmd.pid
>SlurmdPort=6818
>SlurmdSpoolDir=/var/slurm/state/slurmd
>SlurmUser=slurm
>StateSaveLocation=/var/slurm/state
>SwitchType=switch/none
>TaskPlugin=task/none
>InactiveLimit=0
>KillWait=30
>MinJobAge=300
>SlurmctldTimeout=120
>SlurmdTimeout=300
>Waittime=0
>FastSchedule=1
>SchedulerType=sched/backfill
>SchedulerPort=7321
>SelectType=select/bluegene
>PriorityType=priority/multifactor
>PriorityDecayHalfLife=1-0
>PriorityMaxAge=1-0
>PriorityWeightAge=10000
>PriorityWeightPartition=30000
>PriorityWeightQOS=10000
>AccountingStorageHost=bgqsn
>AccountingStorageLoc=slurmacct
>AccountingStorageType=accounting_storage/slurmdbd
>AccountingStoreJobComment=YES
>ClusterName=ur_bgq
>JobCompHost=localhost
>JobCompLoc=slurmdb
>JobCompType=jobcomp/slurmdbd
>JobAcctGatherFrequency=30
>JobAcctGatherType=jobacct_gather/none
>SlurmctldDebug=3
>SlurmctldLogFile=/var/log/slurm/slurmctld.log
>SlurmdDebug=3
>SlurmdLogFile=/var/log/slurm/slurmd.log
>SlurmSchedLogFile=/var/log/slurm/slurmsched.log
>SlurmSchedLogLevel=3
>FrontendName=bgqsn State=UNKNOWN
>NodeName=bg[0000x0001] CPUs=1024 State=UNKNOWN
>PartitionName=DEFAULT State=UP DefaultTime=00:00:05 Shared=force
>PartitionName=debug Nodes=bg[0000x0001] MaxTime=00:60:00 MaxNodes=32
>MinNodes=1 Priority=10000
>PartitionName=standard Nodes=bg[0000x0001] Default=YES MaxTime=24:00:00
>MaxNodes=256 MinNodes=1 Priority=5000
>PartitionName=large Nodes=bg[0000x0001] MaxTime=12:00:00 MaxNodes=1024
>MinNodes=512 Priority=1000
>PartitionName=reserved Nodes=bg[0000x0001] MaxTime=48:00:00
>MaxNodes=1024 MinNodes=4 Priority=32000 ReqResv=Yes
>PartitionName=system Nodes=bg[0000x0001] MaxTime=48:00:00 MaxNodes=1024
>MinNodes=4 Priority=32000 RootOnly=Yes
>
>[cschmid7_local@bgqsn]$ grep -v '^#' bluegene.conf
>MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
>IONodesPerMP=8
>MaxBlockInError=0
>BridgeAPILogFile=/var/log/slurm/bridgeapi.log
>BridgeAPIVerbose=2
>BasePartitionNodeCnt=512
>NodeCardNodeCnt=32
>AllowSubBlockAllocations=yes
>LayoutMode=DYNAMIC
>
>
>--
>Carl Schmidtmann
>Center for Integrated Research Computing
>University of Rochester
