Felix, I would suggest you look in the munged log for errors and make sure time is sync'd across all your nodes.
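A minimal sketch of the time check in Python, assuming passwordless ssh from the controller to every node and that `date +%s` works there. The node names come from the quoted slurm.conf. The 5-second tolerance and the helper names are illustrative assumptions, not munge or Slurm settings:

```python
import subprocess
import time

# Node names taken from the quoted slurm.conf; adjust for your cluster.
NODES = [f"node{i:02d}" for i in range(1, 7)]

# Assumed tolerance in seconds; an arbitrary illustrative value.
TOLERANCE = 5.0


def remote_epoch(node):
    """Read a node's clock with `ssh <node> date +%s` (assumes passwordless ssh)."""
    result = subprocess.run(
        ["ssh", node, "date", "+%s"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return float(result.stdout.strip())


def skewed_nodes(offsets, tolerance=TOLERANCE):
    """Return the sorted names of nodes whose clock offset exceeds the tolerance."""
    return sorted(n for n, off in offsets.items() if abs(off) > tolerance)


def main():
    offsets = {}
    for node in NODES:
        # Offset = remote clock minus local clock, sampled right after the ssh
        # call, so ssh latency is included and the result is only approximate.
        offsets[node] = remote_epoch(node) - time.time()
    for node, off in sorted(offsets.items()):
        print(f"{node}: {off:+.1f} s")
    bad = skewed_nodes(offsets)
    print("skewed:", ", ".join(bad) if bad else "none")


if __name__ == "__main__":
    main()
```

Munge credentials are time-stamped, so a large offset between hosts can cause them to be rejected. Any node reported as skewed is worth checking first.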
Trevor

> On Mar 17, 2015, at 5:31 AM, Felix Willenborg <[email protected]> wrote:
>
> Hi there,
>
> first of all, I'm kind of new to Slurm, so hopefully I have just missed
> something very basic here.
>
> I'm trying to set up a system of six to seven nodes with homogeneous
> hardware as Slurm nodes. The nodes are connected via InfiniBand. As a
> controller, I have a system whose hardware specification differs a little.
> To keep munge.key and slurm.conf identical on all systems I use Salt. So far
> so good.
>
> The problem is that no node responds to the master when "sinfo" is run on
> the controller, although "scontrol ping" reports on every node that the
> primary controller is up, which is really confusing. Another thing that
> seems weird: when I watch the controller's log file, it says that the node
> is found when slurmd on the node is restarted, and after approximately one
> minute the connection is lost again.
>
> I checked pretty much everything that came to mind, like possibly blocked
> ports or wrongly set user/group rights. Maybe you have an idea; I have run
> out of them. Also, here is the (anonymized) slurm.conf as well as the
> slurmctld.log and the slurmd.log of one node. I'm looking forward to some
> help!
>
> Best wishes,
> Felix Willenborg
>
> slurm.conf
> ------------------------------------------------------------
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=erica
> ControlAddr=***.***.***.***
> #BackupController=
> #BackupAddr=
> #
> AuthType=auth/munge
> CacheGroups=0
> #CheckpointType=checkpoint/none
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> #Epilog=
> #EpilogSlurmctld=
> #FirstJobId=1
> #MaxJobId=999999
> #GresTypes=gpu
> #GroupUpdateForce=0
> #GroupUpdateTime=600
> #JobCheckpointDir=/var/slurm/checkpoint
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> #JobFileAppend=0
> #JobRequeue=1
> #JobSubmitPlugins=1
> #KillOnBadExit=0
> #Licenses=foo*4,bar
> #MailProg=/bin/mail
> #MaxJobCount=5000
> #MaxStepCount=40000
> #MaxTasksPerNode=128
> MpiDefault=none
> #MpiParams=ports=#-#
> #PluginDir=
> #PlugStackConfig=
> #PrivateData=jobs
> ProctrackType=proctrack/pgid
> #Prolog=
> #PrologSlurmctld=
> #PropagatePrioProcess=0
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> ReturnToService=1
> #SallocDefaultCommand=
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #SrunEpilog=
> #SrunProlog=
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> #TaskEpilog=
> TaskPlugin=task/none
> #TaskPluginParam=
> #TaskProlog=
> #TopologyPlugin=topology/tree
> #TmpFs=/tmp
> #TrackWCKey=no
> #TreeWidth=
> #UnkillableStepProgram=
> #UsePAM=1
> #
> #
> # TIMERS
> #BatchStartTimeout=10
> #CompleteWait=0
> #EpilogMsgTime=2000
> #GetEnvTimeout=2
> #HealthCheckInterval=0
> #HealthCheckProgram=
> InactiveLimit=0
> KillWait=30
> #MessageTimeout=10
> #ResvOverRun=0
> MinJobAge=300
> #OverTimeLimit=0
> SlurmctldTimeout=120
> SlurmdTimeout=7200
> #UnkillableStepTimeout=60
> #VSizeFactor=0
> Waittime=0
> #
> #
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> #
> #
> # JOB PRIORITY
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityCalcPeriod=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> AccountingStorageLoc=/var/log/slurm-llnl/accounting
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/filetxt
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=7
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=7
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #SlurmSchedLogFile=
> #SlurmSchedLogLevel=
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> #NodeName=node[01-06] CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node01 NodeAddr=***.***.***.51 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node02 NodeAddr=***.***.***.52 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node03 NodeAddr=***.***.***.53 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node04 NodeAddr=***.***.***.54 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node05 NodeAddr=***.***.***.55 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> NodeName=node06 NodeAddr=***.***.***.56 CPUs=12 RealMemory=128910 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
> PartitionName=dft default=YES Nodes=node[01-06] MaxTime=INFINITE State=UP
>
>
> slurmctld.log
> ------------------------------------------------------------
> [2015-03-16T15:39:54.813] debug: sched: slurmctld starting
> [2015-03-16T15:39:54.817] error: Configured MailProg is invalid
> [2015-03-16T15:39:54.817] debug3: Trying to load plugin /usr/lib/slurm/accounting_storage_filetxt.so
> [2015-03-16T15:39:54.817] debug2: slurmdb_init() called
> [2015-03-16T15:39:54.817] Accounting storage FileTxt plugin loaded
> [2015-03-16T15:39:54.818] debug3: Success.
> [2015-03-16T15:39:54.818] debug3: not enforcing associations and no list was given so we are giving a blank list
> [2015-03-16T15:39:54.818] debug3: Version in assoc_mgr_state header is 1
> [2015-03-16T15:39:54.818] slurmctld version 2.6.5 started on cluster cluster
> [2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/crypto_munge.so
> [2015-03-16T15:39:54.818] Munge cryptographic signature plugin loaded
> [2015-03-16T15:39:54.818] debug3: Success.
> [2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> [2015-03-16T15:39:54.818] Consumable Resources (CR) Node Selection plugin loaded with argument 20
> [2015-03-16T15:39:54.818] debug3: Success.
> [2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/preempt_none.so
> [2015-03-16T15:39:54.818] preempt/none loaded
> [2015-03-16T15:39:54.818] debug3: Success.
> [2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/checkpoint_none.so
> [2015-03-16T15:39:54.818] debug3: Success.
> [2015-03-16T15:39:54.818] Checkpoint plugin loaded: checkpoint/none
> [2015-03-16T15:39:54.818] debug3: Trying to load plugin /usr/lib/slurm/jobacct_gather_linux.so
> [2015-03-16T15:39:54.818] Job accounting gather LINUX plugin loaded
> [2015-03-16T15:39:54.818] debug3: Success.
> [2015-03-16T15:39:54.819] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
> [2015-03-16T15:39:54.819] debug3: Trying to load plugin /usr/lib/slurm/ext_sensors_none.so
> [2015-03-16T15:39:54.819] ExtSensors NONE plugin loaded
> [2015-03-16T15:39:54.819] debug3: Success.
> [2015-03-16T15:39:54.819] debug: No backup controller to shutdown
> [2015-03-16T15:39:54.819] debug3: Trying to load plugin /usr/lib/slurm/switch_none.so
> [2015-03-16T15:39:54.819] switch NONE plugin loaded
> [2015-03-16T15:39:54.819] debug3: Success.
> [2015-03-16T15:39:54.819] debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
> [2015-03-16T15:39:54.820] debug3: Trying to load plugin /usr/lib/slurm/topology_none.so
> [2015-03-16T15:39:54.820] topology NONE plugin loaded
> [2015-03-16T15:39:54.820] debug3: Success.
> [2015-03-16T15:39:54.827] debug: No DownNodes
> [2015-03-16T15:39:54.827] debug3: Trying to load plugin /usr/lib/slurm/jobcomp_none.so
> [2015-03-16T15:39:54.827] debug3: Success.
> [2015-03-16T15:39:54.827] debug3: Trying to load plugin /usr/lib/slurm/sched_backfill.so
> [2015-03-16T15:39:54.827] sched: Backfill scheduler plugin loaded
> [2015-03-16T15:39:54.827] debug3: Success.
> [2015-03-16T15:39:54.828] debug3: Version string in node_state header is VER006
> [2015-03-16T15:39:54.828] Recovered state of 6 nodes
> [2015-03-16T15:39:54.828] debug3: Version string in job_state header is VER014
> [2015-03-16T15:39:54.828] debug3: Job id in job_state header is 42
> [2015-03-16T15:39:54.828] debug3: Set job_id_sequence to 42
> [2015-03-16T15:39:54.828] Recovered information about 0 jobs
> [2015-03-16T15:39:54.828] cons_res: select_p_node_init
> [2015-03-16T15:39:54.828] cons_res: preparing for 1 partitions
> [2015-03-16T15:39:54.828] debug: Updating partition uid access list
> [2015-03-16T15:39:54.828] debug3: Version string in resv_state header is VER004
> [2015-03-16T15:39:54.828] Recovered state of 0 reservations
> [2015-03-16T15:39:54.828] State of 0 triggers recovered
> [2015-03-16T15:39:54.828] read_slurm_conf: backup_controller not specified.
> [2015-03-16T15:39:54.828] cons_res: select_p_reconfigure
> [2015-03-16T15:39:54.828] cons_res: select_p_node_init
> [2015-03-16T15:39:54.828] cons_res: preparing for 1 partitions
> [2015-03-16T15:39:54.828] Running as primary controller
> [2015-03-16T15:39:54.829] debug3: Trying to load plugin /usr/lib/slurm/priority_basic.so
> [2015-03-16T15:39:54.829] debug: Priority BASIC plugin loaded
> [2015-03-16T15:39:54.829] debug3: Success.
> [2015-03-16T15:39:54.830] debug3: _slurmctld_rpc_mgr pid = 30521
> [2015-03-16T15:39:54.830] debug3: _slurmctld_background pid = 30521
> [2015-03-16T15:39:54.830] debug: power_save module disabled, SuspendTime < 0
> [2015-03-16T15:39:54.830] debug2: slurmctld listening on 0.0.0.0:6817
> [2015-03-16T15:39:57.832] debug: Spawning registration agent for node[01-06] 6 hosts
> [2015-03-16T15:39:57.832] debug2: Spawning RPC agent for msg_type 1001
> [2015-03-16T15:39:57.837] debug2: got 1 threads to send out
> [2015-03-16T15:39:57.840] debug3: Tree sending to node01
> [2015-03-16T15:39:57.841] debug3: Tree sending to node02
> [2015-03-16T15:39:57.842] debug3: Tree sending to node03
> [2015-03-16T15:39:57.843] debug3: Tree sending to node04
> [2015-03-16T15:39:57.844] debug3: Tree sending to node05
> [2015-03-16T15:39:57.844] debug2: Tree head got back 0 looking for 6
> [2015-03-16T15:39:57.844] debug3: Tree sending to node06
> [2015-03-16T15:39:58.989] debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> [2015-03-16T15:39:58.989] auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [2015-03-16T15:39:58.989] debug3: Success.
> [2015-03-16T15:39:58.990] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
> [2015-03-16T15:39:58.990] debug: validate_node_specs: node node01 registered with 0 jobs
> [2015-03-16T15:39:58.990] debug2: _slurm_rpc_node_registration complete for node01 usec=69
> [2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
> [2015-03-16T15:40:02.845] debug2: Error connecting slurm stream socket at ***.***.***.***52:6818: Connection timed out
> [2015-03-16T15:40:02.845] debug3: problems with node01
> [2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
> [2015-03-16T15:40:02.845] debug2: Error connecting slurm stream socket at ***.***.***.54:6818: Connection timed out
> [2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
> [2015-03-16T15:40:02.845] debug2: Error connecting slurm stream socket at ***.***.***.56:6818: Connection timed out
> [2015-03-16T15:40:02.845] debug2: _slurm_connect poll timeout: Connection timed out
> [2015-03-16T15:40:02.846] debug2: Error connecting slurm stream socket at ***.***.***.53:6818: Connection timed out
> [2015-03-16T15:40:02.846] debug2: Tree head got back 1
> [2015-03-16T15:40:02.846] debug2: _slurm_connect poll timeout: Connection timed out
> [2015-03-16T15:40:02.846] debug2: Error connecting slurm stream socket at ***.***.***.57:6818: Connection timed out
> [2015-03-16T15:40:02.846] debug3: problems with node06
> [2015-03-16T15:40:02.846] debug3: problems with node03
> [2015-03-16T15:40:02.846] debug3: problems with node05
> [2015-03-16T15:40:02.846] debug3: problems with node02
> [2015-03-16T15:40:02.846] debug2: _slurm_connect poll timeout: Connection timed out
> [2015-03-16T15:40:02.846] debug2: Error connecting slurm stream socket at ***.***.***.55:6818: Connection timed out
> [2015-03-16T15:40:02.846] debug2: Tree head got back 2
> [2015-03-16T15:40:02.846] debug3: problems with node04
> [2015-03-16T15:40:02.846] debug2: Tree head got back 3
> [2015-03-16T15:40:02.846] debug2: Tree head got back 4
> [2015-03-16T15:40:02.846] debug2: Tree head got back 5
> [2015-03-16T15:40:02.846] debug2: Tree head got back 5
> [2015-03-16T15:40:02.846] debug2: Tree head got back 6
> [2015-03-16T15:40:02.846] agent/is_node_resp: node:node01 rpc:1001 : Communication connection failure
> [2015-03-16T15:40:02.846] agent/is_node_resp: node:node06 rpc:1001 : Communication connection failure
> [2015-03-16T15:40:02.846] agent/is_node_resp: node:node03 rpc:1001 : Communication connection failure
> [2015-03-16T15:40:02.846] agent/is_node_resp: node:node02 rpc:1001 : Communication connection failure
> [2015-03-16T15:40:02.846] agent/is_node_resp: node:node04 rpc:1001 : Communication connection failure
> [2015-03-16T15:40:02.846] agent/is_node_resp: node:node05 rpc:1001 : Communication connection failure
> [2015-03-16T15:40:03.113] debug: node_not_resp: node node01 responded since msg sent
> [2015-03-16T15:40:03.833] error: Nodes node[01-06] not responding
> [2015-03-16T15:40:24.835] debug2: Testing job time limits and checkpoints
> [2015-03-16T15:40:53.000] debug: backfill: beginning
> [2015-03-16T15:40:53.000] debug: backfill: no jobs to backfill
> [2015-03-16T15:40:54.838] debug2: Testing job time limits and checkpoints
> [2015-03-16T15:40:54.838] debug2: Performing purge of old job records
> [2015-03-16T15:40:54.838] debug: sched: Running job scheduler
> [2015-03-16T15:41:24.842] debug2: Testing job time limits and checkpoints
> [2015-03-16T15:41:54.845] debug2: Testing job time limits and checkpoints
> [2015-03-16T15:41:54.845] debug2: Performing purge of old job records
> [2015-03-16T15:41:54.845] debug: sched: Running job scheduler
> [2015-03-16T15:42:24.848] debug2: Testing job time limits and checkpoints
>
>
> slurmd.log
> ------------------------------------------------------------
> [2015-03-16T15:39:58.984] debug3: Trying to load plugin /usr/lib/slurm/topology_none.so
> [2015-03-16T15:39:58.984] topology NONE plugin loaded
> [2015-03-16T15:39:58.984] debug3: Success.
> [2015-03-16T15:39:58.984] Gathering cpu frequency information for 12 cpus
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 0, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 1, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 2, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 3, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 4, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 5, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.984] debug: cpu_freq_init: cpu 6, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 7, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 8, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 9, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 10, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.985] debug: cpu_freq_init: cpu 11, reset freq: 1200000, reset governor: ondemand
> [2015-03-16T15:39:58.985] debug3: NodeName = node01
> [2015-03-16T15:39:58.985] debug3: TopoAddr = node01
> [2015-03-16T15:39:58.985] debug3: TopoPattern = node
> [2015-03-16T15:39:58.985] debug3: CacheGroups = 0
> [2015-03-16T15:39:58.985] debug3: Confile = `/etc/slurm-llnl/slurm.conf'
> [2015-03-16T15:39:58.985] debug3: Debug = 7
> [2015-03-16T15:39:58.985] debug3: CPUs = 12 (CF: 12, HW: 12)
> [2015-03-16T15:39:58.985] debug3: Boards = 1 (CF: 1, HW: 1)
> [2015-03-16T15:39:58.985] debug3: Sockets = 2 (CF: 2, HW: 2)
> [2015-03-16T15:39:58.985] debug3: Cores = 6 (CF: 6, HW: 6)
> [2015-03-16T15:39:58.985] debug3: Threads = 1 (CF: 1, HW: 1)
> [2015-03-16T15:39:58.985] debug3: UpTime = 1734749 = 20-01:52:29
> [2015-03-16T15:39:58.985] debug3: Block Map = 0,1,2,3,4,5,6,7,8,9,10,11
> [2015-03-16T15:39:58.985] debug3: Inverse Map = 0,1,2,3,4,5,6,7,8,9,10,11
> [2015-03-16T15:39:58.985] debug3: RealMemory = 128910
> [2015-03-16T15:39:58.985] debug3: TmpDisk = 210195
> [2015-03-16T15:39:58.985] debug3: Epilog = `(null)'
> [2015-03-16T15:39:58.985] debug3: Logfile = `/var/log/slurm-llnl/slurmd.log'
> [2015-03-16T15:39:58.985] debug3: HealthCheck = `(null)'
> [2015-03-16T15:39:58.985] debug3: NodeName = node01
> [2015-03-16T15:39:58.985] debug3: NodeAddr = ***.***.***.52
> [2015-03-16T15:39:58.985] debug3: Port = 6818
> [2015-03-16T15:39:58.985] debug3: Prolog = `(null)'
> [2015-03-16T15:39:58.985] debug3: TmpFS = `/tmp'
> [2015-03-16T15:39:58.985] debug3: Public Cert = `(null)'
> [2015-03-16T15:39:58.985] debug3: Slurmstepd = `/usr/sbin/slurmstepd'
> [2015-03-16T15:39:58.985] debug3: Spool Dir = `/var/lib/slurm-llnl/slurmd'
> [2015-03-16T15:39:58.985] debug3: Pid File = `/var/run/slurm-llnl/slurmd.pid'
> [2015-03-16T15:39:58.985] debug3: Slurm UID = 64030
> [2015-03-16T15:39:58.985] debug3: TaskProlog = `(null)'
> [2015-03-16T15:39:58.985] debug3: TaskEpilog = `(null)'
> [2015-03-16T15:39:58.985] debug3: TaskPluginParam = 0
> [2015-03-16T15:39:58.985] debug3: Use PAM = 0
> [2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/proctrack_pgid.so
> [2015-03-16T15:39:58.985] debug3: Success.
> [2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/task_none.so
> [2015-03-16T15:39:58.985] task NONE plugin loaded
> [2015-03-16T15:39:58.985] debug3: Success.
> [2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
> [2015-03-16T15:39:58.985] auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [2015-03-16T15:39:58.985] debug3: Success.
> [2015-03-16T15:39:58.985] debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
> [2015-03-16T15:39:58.985] debug3: Trying to load plugin /usr/lib/slurm/crypto_munge.so
> [2015-03-16T15:39:58.985] Munge cryptographic signature plugin loaded
> [2015-03-16T15:39:58.985] debug3: Success.
> [2015-03-16T15:39:58.985] debug3: initializing slurmd spool directory
> [2015-03-16T15:39:58.985] debug3: slurmd initialization successful
> [2015-03-16T15:39:58.986] Warning: Core limit is only 0 KB
> [2015-03-16T15:39:58.986] slurmd version 2.6.5 started
> [2015-03-16T15:39:58.986] debug3: finished daemonize
> [2015-03-16T15:39:58.986] debug3: Trying to load plugin /usr/lib/slurm/jobacct_gather_linux.so
> [2015-03-16T15:39:58.986] Job accounting gather LINUX plugin loaded
> [2015-03-16T15:39:58.986] debug3: Success.
> [2015-03-16T15:39:58.986] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
> [2015-03-16T15:39:58.986] debug3: Trying to load plugin /usr/lib/slurm/switch_none.so
> [2015-03-16T15:39:58.987] switch NONE plugin loaded
> [2015-03-16T15:39:58.987] debug3: Success.
> [2015-03-16T15:39:58.987] debug3: successfully opened slurm listen port ***.***.***.52:6818
> [2015-03-16T15:39:58.987] slurmd started on Mon, 16 Mar 2015 15:39:58 +0100
> [2015-03-16T15:39:58.987] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=128910 TmpDisk=210195 Uptime=1734749
> [2015-03-16T15:39:58.987] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_energy_none.so
> [2015-03-16T15:39:58.987] AcctGatherEnergy NONE plugin loaded
> [2015-03-16T15:39:58.987] debug3: Success.
> [2015-03-16T15:39:58.987] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_profile_none.so
> [2015-03-16T15:39:58.987] AcctGatherProfile NONE plugin loaded
> [2015-03-16T15:39:58.987] debug3: Success.
> [2015-03-16T15:39:58.987] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_infiniband_none.so
> [2015-03-16T15:39:58.988] AcctGatherInfiniband NONE plugin loaded
> [2015-03-16T15:39:58.988] debug3: Success.
> [2015-03-16T15:39:58.988] debug3: Trying to load plugin /usr/lib/slurm/acct_gather_filesystem_none.so
> [2015-03-16T15:39:58.988] AcctGatherFilesystem NONE plugin loaded
> [2015-03-16T15:39:58.988] debug3: Success.
> [2015-03-16T15:39:58.988] debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)

--
Trevor
