sstat doesn't work on any BlueGene system.  I'll disable it there in future versions.

If you could open a bug for this at http://bugs.SchedMD.com, that would be
helpful.
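
Until sstat is disabled there, a site could guard against the error with a
small wrapper. The following is only a rough sketch, not anything shipped with
Slurm: the function names is_bluegene_conf and safe_sstat are hypothetical, and
the fallback path /etc/slurm/slurm.conf is an assumption (set SLURM_CONF to
wherever your slurm.conf actually lives).

```shell
# Hypothetical helper: succeeds (returns 0) if the given slurm.conf
# selects the bluegene plugin, i.e. SelectType=select/bluegene.
is_bluegene_conf() {
    grep -Eiq '^[[:space:]]*SelectType[[:space:]]*=[[:space:]]*select/bluegene' "$1"
}

# Hypothetical wrapper: refuses to run sstat on a bluegene configuration.
# The default config path below is an assumption; override it via SLURM_CONF.
safe_sstat() {
    conf="${SLURM_CONF:-/etc/slurm/slurm.conf}"
    if is_bluegene_conf "$conf"; then
        echo "sstat is not supported with select/bluegene; skipping" >&2
        return 1
    fi
    sstat "$@"
}
```

With the slurm.conf quoted below, "safe_sstat -j 313846" would print the
notice and return 1 instead of trying to resolve bg0001.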

Thanks,
Danny

Carl Schmidtmann <[email protected]> wrote:

>
>We are seeing a strange error from sstat on our BGQ. It is claiming the
>host bg0001 (or bg0000) is unknown. Node definitions in BGQ are not
>real nodes that can be looked up or connected to since everything is
>handled through I/O nodes and blocks which are not configured in the
>slurm files.
>
>Is there something I am missing in the config files?
>
>Config files and error message below. Slurm 2.4.3 on BGQ single rack
>system. I plan to upgrade to 2.4.4 after SC12.
>
>Thanks,
>Carl
>
>[cschmid7_local@bgqsn]$ sstat -j 313846
>JobID  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS
>MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask
>AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks
>------------ ---------- -------------- -------------- ----------
>---------- ---------- ---------- ---------- -------- ------------
>-------------- ---------- ---------- ---------- ---------- ----------
>--------
>sstat: error: Unable to resolve "bg0001": Unknown host
>sstat: error: fwd_tree_thread: can't find address for host bg0001,
>check slurm.conf
>sstat: error: slurm_job_step_stat: unknown return given from bg0001:
>9001 rc = Communication connection failure
>sstat: error: problem getting step_layout for 313846.0: Communication
>connection failure
>
>[cschmid7_local@bgqsn]$ grep -v '^#' slurm.conf
>ControlMachine=bgqsn
>AuthType=auth/munge
>CacheGroups=0
>CryptoType=crypto/munge
>Epilog=/usr/local/slurm/current/sbin/epilog.bash
>JobSubmitPlugins=ur_cnode
>Licenses=cnode*992
>MpiDefault=none
>ProctrackType=proctrack/pgid
>Prolog=/usr/local/slurm/current/sbin/prolog.bash
>ReturnToService=1
>SlurmctldPidFile=/var/run/slurmctld.pid
>SlurmctldPort=6817
>SlurmdPidFile=/var/run/slurmd.pid
>SlurmdPort=6818
>SlurmdSpoolDir=/var/slurm/state/slurmd
>SlurmUser=slurm
>StateSaveLocation=/var/slurm/state
>SwitchType=switch/none
>TaskPlugin=task/none
>InactiveLimit=0
>KillWait=30
>MinJobAge=300
>SlurmctldTimeout=120
>SlurmdTimeout=300
>Waittime=0
>FastSchedule=1
>SchedulerType=sched/backfill
>SchedulerPort=7321
>SelectType=select/bluegene
>PriorityType=priority/multifactor
>PriorityDecayHalfLife=1-0
>PriorityMaxAge=1-0
>PriorityWeightAge=10000
>PriorityWeightPartition=30000
>PriorityWeightQOS=10000
>AccountingStorageHost=bgqsn
>AccountingStorageLoc=slurmacct
>AccountingStorageType=accounting_storage/slurmdbd
>AccountingStoreJobComment=YES
>ClusterName=ur_bgq
>JobCompHost=localhost
>JobCompLoc=slurmdb
>JobCompType=jobcomp/slurmdbd
>JobAcctGatherFrequency=30
>JobAcctGatherType=jobacct_gather/none
>SlurmctldDebug=3
>SlurmctldLogFile=/var/log/slurm/slurmctld.log
>SlurmdDebug=3
>SlurmdLogFile=/var/log/slurm/slurmd.log
>SlurmSchedLogFile=/var/log/slurm/slurmsched.log
>SlurmSchedLogLevel=3
>FrontendName=bgqsn State=UNKNOWN
>NodeName=bg[0000x0001] CPUs=1024 State=UNKNOWN
>PartitionName=DEFAULT State=UP DefaultTime=00:00:05 Shared=force
>PartitionName=debug Nodes=bg[0000x0001] MaxTime=00:60:00 MaxNodes=32
>MinNodes=1 Priority=10000
>PartitionName=standard Nodes=bg[0000x0001] Default=YES MaxTime=24:00:00
>MaxNodes=256 MinNodes=1 Priority=5000
>PartitionName=large Nodes=bg[0000x0001] MaxTime=12:00:00 MaxNodes=1024
>MinNodes=512 Priority=1000
>PartitionName=reserved Nodes=bg[0000x0001] MaxTime=48:00:00
>MaxNodes=1024 MinNodes=4 Priority=32000 ReqResv=Yes
>PartitionName=system Nodes=bg[0000x0001] MaxTime=48:00:00 MaxNodes=1024
>MinNodes=4 Priority=32000 RootOnly=Yes
>
>[cschmid7_local@bgqsn]$ grep -v '^#' bluegene.conf
>MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
>IONodesPerMP=8
>MaxBlockInError=0
>BridgeAPILogFile=/var/log/slurm/bridgeapi.log
>BridgeAPIVerbose=2
>BasePartitionNodeCnt=512
>NodeCardNodeCnt=32
>AllowSubBlockAllocations=yes
>LayoutMode=DYNAMIC
>
>
>--
>Carl Schmidtmann
>Center for Integrated Research Computing
>University of Rochester
