The troubleshooting guide online may help:
http://slurm.schedmd.com/troubleshoot.html#nodes
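
One check in the spirit of that guide is to confirm, from the controller, that each node's slurmd port accepts TCP connections at all. The following is only a minimal Python sketch, assuming the node names and SlurmdPort=6818 from the slurm.conf quoted below; the helper names port_open and check_nodes are made up for illustration and are not part of Slurm.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_nodes(nodes, port=6818, timeout=3.0):
    """Probe every node and return a dict of node name -> reachable (bool)."""
    return {node: port_open(node, port, timeout) for node in nodes}
```

Run on the controller, check_nodes([f"node{i:02d}" for i in range(1, 7)]) would show which nodes accept connections on 6818. A False result only says the TCP connection failed; it does not say why (firewall, wrong address, slurmd not running).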

Quoting Felix Willenborg <[email protected]>:

Hi there,

First of all, I'm kind of new to Slurm, so hopefully I have simply missed something very basic here.

I'm trying to set up a system of six to seven nodes with homogeneous hardware as Slurm nodes. The nodes are connected via InfiniBand. As a controller, I have a system whose hardware specification differs a little. To keep munge.key and slurm.conf identical on all systems I use Salt. So far so good.

The problem I get is that no node responds to the master when "sinfo" is run on the controller. "scontrol ping", however, reports on every node that the primary controller is up, which is really confusing. Another thing that seems weird: when I watch the controller's log file, it says the node is found when slurmd on the node is restarted, and after approximately one minute the connection is lost again.

I checked pretty much everything that came to mind, such as possibly blocked ports or wrongly set user/group rights. Maybe you have an idea; I have run out of them. Also, here is the (anonymized) slurm.conf as well as the slurmctld.log and the slurmd.log of one node. I'm looking forward to some help!

Best wishes,
Felix Willenborg

slurm.conf
------------------------------------------------------------------------------------------------------------------------------------------------------------
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=erica
ControlAddr=***.***.***.***
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=gpu
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=1
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=7200
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm-llnl/accounting
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
#NodeName=node[01-06] CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=node01 NodeAddr=***.***.***.51 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=node02 NodeAddr=***.***.***.52 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=node03 NodeAddr=***.***.***.53 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=node04 NodeAddr=***.***.***.54 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=node05 NodeAddr=***.***.***.55 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
NodeName=node06 NodeAddr=***.***.***.56 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
PartitionName=dft default=YES Nodes=node[01-06] MaxTime=INFINITE State=UP
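
Since each node is given an explicit NodeAddr above, it can help to list the name-to-address pairs from slurm.conf so they can be compared with the addresses that slurmctld and slurmd report in their logs. This is a rough Python sketch, not a full slurm.conf parser: it assumes one definition per line, the exact keyword spelling used in this file, and no hostlist expressions such as node[01-06]. The function name node_addresses is hypothetical.

```python
def node_addresses(conf_text):
    """Map NodeName -> NodeAddr for every uncommented NodeName line.

    Simplified on purpose: keywords are matched case-sensitively and
    hostlist ranges like node[01-06] are not expanded.
    """
    mapping = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line.startswith("NodeName="):
            continue
        # Split "Key=Value Key=Value ..." into a dictionary.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        # NodeAddr is optional; None means Slurm falls back to the node name.
        mapping[fields["NodeName"]] = fields.get("NodeAddr")
    return mapping
```

Feeding it the contents of /etc/slurm-llnl/slurm.conf on the controller and on a compute node gives two dictionaries that should be identical if the file really is the same everywhere.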


slurmctld.log
------------------------------------------------------------------------------------------------------------------------------------------------------------
[2015-03-16T15:39:54.813] debug:  sched: slurmctld starting
[2015-03-16T15:39:54.817] error: Configured MailProg is invalid
[2015-03-16T15:39:54.817] debug3: Trying to load plugin /usr/lib/slurm/accounting_storage_filetxt.so
[2015-03-16T15:39:54.817] debug2: slurmdb_init() called
[2015-03-16T15:39:54.817] Accounting storage FileTxt plugin loaded
[2015-03-16T15:39:54.818] debug3: Success.
[2015-03-16T15:39:54.818] debug3: not enforcing associations and no list was given so we are giving a blank list
[2015-03-16T15:39:54.818] debug3: Version in assoc_mgr_state header is 1
[2015-03-16T15:39:54.818] slurmctld version 2.6.5 started on cluster cluster
[2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/crypto_munge.so
[2015-03-16T15:39:54.818] Munge cryptographic signature plugin loaded
[2015-03-16T15:39:54.818] debug3: Success.
[2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
[2015-03-16T15:39:54.818] Consumable Resources (CR) Node Selection plugin loaded with argument 20
[2015-03-16T15:39:54.818] debug3: Success.
[2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/preempt_none.so
[2015-03-16T15:39:54.818] preempt/none loaded
[2015-03-16T15:39:54.818] debug3: Success.
[2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/checkpoint_none.so
[2015-03-16T15:39:54.818] debug3: Success.
[2015-03-16T15:39:54.818] Checkpoint plugin loaded: checkpoint/none
[2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/jobacct_gather_linux.so
[2015-03-16T15:39:54.818] Job accounting gather LINUX plugin loaded
[2015-03-16T15:39:54.818] debug3: Success.
[2015-03-16T15:39:54.819] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2015-03-16T15:39:54.819] debug3: Trying to load plugin /usr/lib/slurm/ext_sensors_none.so
[2015-03-16T15:39:54.819] ExtSensors NONE plugin loaded
[2015-03-16T15:39:54.819] debug3: Success.
[2015-03-16T15:39:54.819] debug:  No backup controller to shutdown
[2015-03-16T15:39:54.819] debug3: Trying to load plugin /usr/lib/slurm/switch_none.so
[2015-03-16T15:39:54.819] switch NONE plugin loaded
[2015-03-16T15:39:54.819] debug3: Success.
[2015-03-16T15:39:54.819] debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
[2015-03-16T15:39:54.820] debug3: Trying to load plugin /usr/lib/slurm/topology_none.so
[2015-03-16T15:39:54.820] topology NONE plugin loaded
[2015-03-16T15:39:54.820] debug3: Success.
[2015-03-16T15:39:54.827] debug:  No DownNodes
[2015-03-16T15:39:54.827] debug3: Trying to load plugin /usr/lib/slurm/jobcomp_none.so
[2015-03-16T15:39:54.827] debug3: Success.
[2015-03-16T15:39:54.827] debug3: Trying to load plugin /usr/lib/slurm/sched_backfill.so
[2015-03-16T15:39:54.827] sched: Backfill scheduler plugin loaded
[2015-03-16T15:39:54.827] debug3: Success.
[2015-03-16T15:39:54.828] debug3: Version string in node_state header is VER006
[2015-03-16T15:39:54.828] Recovered state of 6 nodes
[2015-03-16T15:39:54.828] debug3: Version string in job_state header is VER014
[2015-03-16T15:39:54.828] debug3: Job id in job_state header is 42
[2015-03-16T15:39:54.828] debug3: Set job_id_sequence to 42
[2015-03-16T15:39:54.828] Recovered information about 0 jobs
[2015-03-16T15:39:54.828] cons_res: select_p_node_init
[2015-03-16T15:39:54.828] cons_res: preparing for 1 partitions
[2015-03-16T15:39:54.828] debug:  Updating partition uid access list
[2015-03-16T15:39:54.828] debug3: Version string in resv_state header is VER004
[2015-03-16T15:39:54.828] Recovered state of 0 reservations
[2015-03-16T15:39:54.828] State of 0 triggers recovered
[2015-03-16T15:39:54.828] read_slurm_conf: backup_controller not specified.
[2015-03-16T15:39:54.828] cons_res: select_p_reconfigure
[2015-03-16T15:39:54.828] cons_res: select_p_node_init
[2015-03-16T15:39:54.828] cons_res: preparing for 1 partitions
[2015-03-16T15:39:54.828] Running as primary controller
[2015-03-16T15:39:54.829] debug3: Trying to load plugin /usr/lib/slurm/priority_basic.so
[2015-03-16T15:39:54.829] debug:  Priority BASIC plugin loaded
[2015-03-16T15:39:54.829] debug3: Success.
[2015-03-16T15:39:54.830] debug3: _slurmctld_rpc_mgr pid = 30521
[2015-03-16T15:39:54.830] debug3: _slurmctld_background pid = 30521
[2015-03-16T15:39:54.830] debug:  power_save module disabled, SuspendTime < 0
[2015-03-16T15:39:54.830] debug2: slurmctld listening on 0.0.0.0:6817
[2015-03-16T15:39:57.832] debug: Spawning registration agent for node[01-06] 6 hosts
[2015-03-16T15:39:57.832] debug2: Spawning RPC agent for msg_type 1001
[2015-03-16T15:39:57.837] debug2: got 1 threads to send out
[2015-03-16T15:39:57.840] debug3: Tree sending to node01
[2015-03-16T15:39:57.841] debug3: Tree sending to node02
[2015-03-16T15:39:57.842] debug3: Tree sending to node03
[2015-03-16T15:39:57.843] debug3: Tree sending to node04
[2015-03-16T15:39:57.844] debug3: Tree sending to node05
[2015-03-16T15:39:57.844] debug2: Tree head got back 0 looking for 6
[2015-03-16T15:39:57.844] debug3: Tree sending to node06
[2015-03-16T15:39:58.989] debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
[2015-03-16T15:39:58.989] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-03-16T15:39:58.989] debug3: Success.
[2015-03-16T15:39:58.990] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2015-03-16T15:39:58.990] debug: validate_node_specs: node node01 registered with 0 jobs
[2015-03-16T15:39:58.990] debug2: _slurm_rpc_node_registration complete for node01 usec=69
[2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
[2015-03-16T15:40:02.845] debug2: Error connecting slurm stream socket at ***.***.***.***52:6818: Connection timed out
[2015-03-16T15:40:02.845] debug3: problems with node01
[2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
[2015-03-16T15:40:02.845] debug2: Error connecting slurm stream socket at ***.***.***.54:6818: Connection timed out
[2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
[2015-03-16T15:40:02.845] debug2: Error connecting slurm stream socket at ***.***.***.56:6818: Connection timed out
[2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
[2015-03-16T15:40:02.846] debug2: Error connecting slurm stream socket at ***.***.***.53:6818: Connection timed out
[2015-03-16T15:40:02.846] debug2: Tree head got back 1
[2015-03-16T15:40:02.846] debug2: _slurm_connect poll timeout: Connection timed out
[2015-03-16T15:40:02.846] debug2: Error connecting slurm stream socket at ***.***.***.57:6818: Connection timed out
[2015-03-16T15:40:02.846] debug3: problems with node06
[2015-03-16T15:40:02.846] debug3: problems with node03
[2015-03-16T15:40:02.846] debug3: problems with node05
[2015-03-16T15:40:02.846] debug3: problems with node02
[2015-03-16T15:40:02.846] debug2: _slurm_connect poll timeout: Connection timed out
[2015-03-16T15:40:02.846] debug2: Error connecting slurm stream socket at ***.***.***.55:6818: Connection timed out
[2015-03-16T15:40:02.846] debug2: Tree head got back 2
[2015-03-16T15:40:02.846] debug3: problems with node04
[2015-03-16T15:40:02.846] debug2: Tree head got back 3
[2015-03-16T15:40:02.846] debug2: Tree head got back 4
[2015-03-16T15:40:02.846] debug2: Tree head got back 5
[2015-03-16T15:40:02.846] debug2: Tree head got back 5
[2015-03-16T15:40:02.846] debug2: Tree head got back 6
[2015-03-16T15:40:02.846] agent/is_node_resp: node:node01 rpc:1001 : Communication connection failure
[2015-03-16T15:40:02.846] agent/is_node_resp: node:node06 rpc:1001 : Communication connection failure
[2015-03-16T15:40:02.846] agent/is_node_resp: node:node03 rpc:1001 : Communication connection failure
[2015-03-16T15:40:02.846] agent/is_node_resp: node:node02 rpc:1001 : Communication connection failure
[2015-03-16T15:40:02.846] agent/is_node_resp: node:node04 rpc:1001 : Communication connection failure
[2015-03-16T15:40:02.846] agent/is_node_resp: node:node05 rpc:1001 : Communication connection failure
[2015-03-16T15:40:03.113] debug: node_not_resp: node node01 responded since msg sent
[2015-03-16T15:40:03.833] error: Nodes node[01-06] not responding
[2015-03-16T15:40:24.835] debug2: Testing job time limits and checkpoints
[2015-03-16T15:40:53.000] debug:  backfill: beginning
[2015-03-16T15:40:53.000] debug:  backfill: no jobs to backfill
[2015-03-16T15:40:54.838] debug2: Testing job time limits and checkpoints
[2015-03-16T15:40:54.838] debug2: Performing purge of old job records
[2015-03-16T15:40:54.838] debug:  sched: Running job scheduler
[2015-03-16T15:41:24.842] debug2: Testing job time limits and checkpoints
[2015-03-16T15:41:54.845] debug2: Testing job time limits and checkpoints
[2015-03-16T15:41:54.845] debug2: Performing purge of old job records
[2015-03-16T15:41:54.845] debug:  sched: Running job scheduler
[2015-03-16T15:42:24.848] debug2: Testing job time limits and checkpoints
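
With debug level 7 the controller log gets long, so the connection failures can be pulled out and counted per address. The sketch below is an illustration only: it assumes one log entry per line and the exact "Error connecting slurm stream socket at ADDR:PORT" wording seen in the log above (slurmctld 2.6.5); other versions may word this differently. The names CONNECT_ERROR and failed_endpoints are hypothetical.

```python
import re
from collections import Counter

# Matches the debug2 line slurmctld writes when it cannot reach a slurmd.
CONNECT_ERROR = re.compile(
    r"Error connecting slurm stream socket at (\S+):(\d+): (.*)"
)


def failed_endpoints(log_text):
    """Count connection errors per (address, port) pair found in the log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CONNECT_ERROR.search(line)
        if match:
            addr, port, _reason = match.groups()
            counts[(addr, int(port))] += 1
    return counts
```

The resulting addresses are the ones slurmctld actually tried, which makes it easy to check them against the NodeAddr values each node is expected to listen on.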


slurmd.log
------------------------------------------------------------------------------------------------------------------------------------------------------------
[2015-03-16T15:39:58.984] debug3: Trying to load plugin /usr/lib/slurm/topology_none.so
[2015-03-16T15:39:58.984] topology NONE plugin loaded
[2015-03-16T15:39:58.984] debug3: Success.
[2015-03-16T15:39:58.984] Gathering cpu frequency information for 12 cpus
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 0, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 1, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 2, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 3, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 4, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 5, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 6, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 7, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 8, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 9, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 10, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 11, reset freq: 1200000, reset governor: ondemand
[2015-03-16T15:39:58.985] debug3: NodeName    = node01
[2015-03-16T15:39:58.985] debug3: TopoAddr    = node01
[2015-03-16T15:39:58.985] debug3: TopoPattern = node
[2015-03-16T15:39:58.985] debug3: CacheGroups = 0
[2015-03-16T15:39:58.985] debug3: Confile     = `/etc/slurm-llnl/slurm.conf'
[2015-03-16T15:39:58.985] debug3: Debug       = 7
[2015-03-16T15:39:58.985] debug3: CPUs        = 12 (CF: 12, HW: 12)
[2015-03-16T15:39:58.985] debug3: Boards      = 1  (CF:  1, HW:  1)
[2015-03-16T15:39:58.985] debug3: Sockets     = 2  (CF:  2, HW:  2)
[2015-03-16T15:39:58.985] debug3: Cores       = 6  (CF:  6, HW:  6)
[2015-03-16T15:39:58.985] debug3: Threads     = 1  (CF:  1, HW:  1)
[2015-03-16T15:39:58.985] debug3: UpTime      = 1734749 = 20-01:52:29
[2015-03-16T15:39:58.985] debug3: Block Map   = 0,1,2,3,4,5,6,7,8,9,10,11
[2015-03-16T15:39:58.985] debug3: Inverse Map = 0,1,2,3,4,5,6,7,8,9,10,11
[2015-03-16T15:39:58.985] debug3: RealMemory  = 128910
[2015-03-16T15:39:58.985] debug3: TmpDisk     = 210195
[2015-03-16T15:39:58.985] debug3: Epilog      = `(null)'
[2015-03-16T15:39:58.985] debug3: Logfile = `/var/log/slurm-llnl/slurmd.log'
[2015-03-16T15:39:58.985] debug3: HealthCheck = `(null)'
[2015-03-16T15:39:58.985] debug3: NodeName    = node01
[2015-03-16T15:39:58.985] debug3: NodeAddr    = ***.***.***.52
[2015-03-16T15:39:58.985] debug3: Port        = 6818
[2015-03-16T15:39:58.985] debug3: Prolog      = `(null)'
[2015-03-16T15:39:58.985] debug3: TmpFS       = `/tmp'
[2015-03-16T15:39:58.985] debug3: Public Cert = `(null)'
[2015-03-16T15:39:58.985] debug3: Slurmstepd  = `/usr/sbin/slurmstepd'
[2015-03-16T15:39:58.985] debug3: Spool Dir   = `/var/lib/slurm-llnl/slurmd'
[2015-03-16T15:39:58.985] debug3: Pid File = `/var/run/slurm-llnl/slurmd.pid'
[2015-03-16T15:39:58.985] debug3: Slurm UID   = 64030
[2015-03-16T15:39:58.985] debug3: TaskProlog  = `(null)'
[2015-03-16T15:39:58.985] debug3: TaskEpilog  = `(null)'
[2015-03-16T15:39:58.985] debug3: TaskPluginParam = 0
[2015-03-16T15:39:58.985] debug3: Use PAM     = 0
[2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/proctrack_pgid.so
[2015-03-16T15:39:58.985] debug3: Success.
[2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/task_none.so
[2015-03-16T15:39:58.985] task NONE plugin loaded
[2015-03-16T15:39:58.985] debug3: Success.
[2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
[2015-03-16T15:39:58.985] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-03-16T15:39:58.985] debug3: Success.
[2015-03-16T15:39:58.985] debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
[2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/crypto_munge.so
[2015-03-16T15:39:58.985] Munge cryptographic signature plugin loaded
[2015-03-16T15:39:58.985] debug3: Success.
[2015-03-16T15:39:58.985] debug3: initializing slurmd spool directory
[2015-03-16T15:39:58.985] debug3: slurmd initialization successful
[2015-03-16T15:39:58.986] Warning: Core limit is only 0 KB
[2015-03-16T15:39:58.986] slurmd version 2.6.5 started
[2015-03-16T15:39:58.986] debug3: finished daemonize
[2015-03-16T15:39:58.986] debug3: Trying to load plugin /usr/lib/slurm/jobacct_gather_linux.so
[2015-03-16T15:39:58.986] Job accounting gather LINUX plugin loaded
[2015-03-16T15:39:58.986] debug3: Success.
[2015-03-16T15:39:58.986] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2015-03-16T15:39:58.986] debug3: Trying to load plugin /usr/lib/slurm/switch_none.so
[2015-03-16T15:39:58.987] switch NONE plugin loaded
[2015-03-16T15:39:58.987] debug3: Success.
[2015-03-16T15:39:58.987] debug3: successfully opened slurm listen port ***.***.***.52:6818
[2015-03-16T15:39:58.987] slurmd started on Mon, 16 Mar 2015 15:39:58 +0100
[2015-03-16T15:39:58.987] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=128910 TmpDisk=210195 Uptime=1734749
[2015-03-16T15:39:58.987] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_energy_none.so
[2015-03-16T15:39:58.987] AcctGatherEnergy NONE plugin loaded
[2015-03-16T15:39:58.987] debug3: Success.
[2015-03-16T15:39:58.987] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_profile_none.so
[2015-03-16T15:39:58.987] AcctGatherProfile NONE plugin loaded
[2015-03-16T15:39:58.987] debug3: Success.
[2015-03-16T15:39:58.987] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_infiniband_none.so
[2015-03-16T15:39:58.988] AcctGatherInfiniband NONE plugin loaded
[2015-03-16T15:39:58.988] debug3: Success.
[2015-03-16T15:39:58.988] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_filesystem_none.so
[2015-03-16T15:39:58.988] AcctGatherFilesystem NONE plugin loaded
[2015-03-16T15:39:58.988] debug3: Success.
[2015-03-16T15:39:58.988] debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
