Sorry for the delay, I was trying to fix it but it is still not working. The node is always down. The master machine is also the compute machine: it's a single server that I use for that, 1 node and 12 CPUs.
In the log below I see this line:

[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure

Here below is my slurm.conf file:

ControlMachine=linuxcluster
AuthType=auth/munge
CryptoType=crypto/munge
MailProg=/usr/bin/mail
MpiDefault=none
PluginDir=/usr/local/lib/slurm
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
AccountingStorageHost=linuxcluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=linuxcluster
JobCompType=jobcomp/none
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctrl.log
SlurmdDebug=5
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=12
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP
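As a cross-check, slurmd itself can report the node definition it detects on this machine; the output is in slurm.conf NodeName= syntax and can be compared against the NodeName line above. This is only a sketch and assumes slurmd was installed to /usr/local/sbin like the other daemons:

# Print the hardware configuration slurmd detects (CPUs, sockets, cores, memory) and exit
/usr/local/sbin/slurmd -C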
slurmctrl.log:

[2017-11-30T09:24:28.025] debug: Log file re-opened
[2017-11-30T09:24:28.025] debug: sched: slurmctld starting
[2017-11-30T09:24:28.025] slurmctld version 17.11.0 started on cluster linuxcluster
[2017-11-30T09:24:28.026] Munge cryptographic signature plugin loaded
[2017-11-30T09:24:28.026] Consumable Resources (CR) Node Selection plugin loaded with argument 1
[2017-11-30T09:24:28.026] preempt/none loaded
[2017-11-30T09:24:28.026] debug: Checkpoint plugin loaded: checkpoint/none
[2017-11-30T09:24:28.026] debug: AcctGatherEnergy NONE plugin loaded
[2017-11-30T09:24:28.026] debug: AcctGatherProfile NONE plugin loaded
[2017-11-30T09:24:28.026] debug: AcctGatherInterconnect NONE plugin loaded
[2017-11-30T09:24:28.026] debug: AcctGatherFilesystem NONE plugin loaded
[2017-11-30T09:24:28.026] debug: Job accounting gather cgroup plugin loaded
[2017-11-30T09:24:28.026] ExtSensors NONE plugin loaded
[2017-11-30T09:24:28.026] debug: switch NONE plugin loaded
[2017-11-30T09:24:28.026] debug: power_save module disabled, SuspendTime < 0
[2017-11-30T09:24:28.026] debug: No backup controller to shutdown
[2017-11-30T09:24:28.026] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2017-11-30T09:24:28.027] debug: Munge authentication plugin loaded
[2017-11-30T09:24:28.030] debug: slurmdbd: Sent PersistInit msg
[2017-11-30T09:24:28.030] slurmdbd: recovered 0 pending RPCs
[2017-11-30T09:24:28.429] debug: Reading slurm.conf file: /usr/local/etc/slurm.conf
[2017-11-30T09:24:28.430] layouts: no layout to initialize
[2017-11-30T09:24:28.430] topology NONE plugin loaded
[2017-11-30T09:24:28.430] debug: No DownNodes
[2017-11-30T09:24:28.435] debug: Log file re-opened
[2017-11-30T09:24:28.435] sched: Backfill scheduler plugin loaded
[2017-11-30T09:24:28.435] route default plugin loaded
[2017-11-30T09:24:28.435] layouts: loading entities/relations information
[2017-11-30T09:24:28.435] debug: layouts: 1/1 nodes in hash table, rc=0
[2017-11-30T09:24:28.435] debug: layouts: loading stage 1
[2017-11-30T09:24:28.435] debug: layouts: loading stage 1.1 (restore state)
[2017-11-30T09:24:28.435] debug: layouts: loading stage 2
[2017-11-30T09:24:28.435] debug: layouts: loading stage 3
[2017-11-30T09:24:28.435] Recovered state of 1 nodes
[2017-11-30T09:24:28.435] Down nodes: linuxcluster
[2017-11-30T09:24:28.435] Recovered JobID=15 State=0x4 NodeCnt=0 Assoc=6
[2017-11-30T09:24:28.435] Recovered information about 1 jobs
[2017-11-30T09:24:28.435] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] debug: Updating partition uid access list
[2017-11-30T09:24:28.436] Recovered state of 0 reservations
[2017-11-30T09:24:28.436] State of 0 triggers recovered
[2017-11-30T09:24:28.436] _preserve_plugins: backup_controller not specified
[2017-11-30T09:24:28.436] cons_res: select_p_reconfigure
[2017-11-30T09:24:28.436] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] Running as primary controller
[2017-11-30T09:24:28.436] debug: No BackupController, not launching heartbeat.
[2017-11-30T09:24:28.436] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-30T09:24:28.677] debug: No feds to retrieve from state
[2017-11-30T09:24:28.757] debug: Priority BASIC plugin loaded
[2017-11-30T09:24:28.758] No parameter for mcs plugin, default values set
[2017-11-30T09:24:28.758] mcs: MCSParameters = (null). ondemand set.
[2017-11-30T09:24:28.758] debug: mcs none plugin loaded
[2017-11-30T09:24:28.758] debug: power_save mode not enabled
[2017-11-30T09:24:31.761] debug: Spawning registration agent for linuxcluster1 hosts
[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2017-11-30T09:24:58.435] debug: backfill: beginning
[2017-11-30T09:24:58.435] debug: backfill: no jobs to backfill
[2017-11-30T09:25:28.435] debug: backfill: beginning
[2017-11-30T09:25:28.436] debug: backfill: no jobs to backfill
[2017-11-30T09:25:28.830] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2017-11-30T09:25:28.830] debug: sched: Running job scheduler
[2017-11-30T09:25:58.436] debug: backfill: beginning
[2017-11-30T09:25:58.436] debug: backfill: no jobs to backfill

ubuntu@linuxcluster:/home/dvi/$ ps -ef | grep slurm
slurm 11388 1 0 09:24 ? 00:00:00 /usr/local/sbin/slurmdbd
slurm 11430 1 0 09:24 ? 00:00:00 /usr/local/sbin/slurmctld

Any idea?
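For what it's worth, slurmd itself does not show up in that ps output, only slurmdbd and slurmctld. A minimal way to check it by hand, assuming slurmd is also installed in /usr/local/sbin like the other daemons and is not managed by a systemd unit, would be something like:

# Run slurmd in the foreground with verbose logging to see why it fails to start (if it does)
sudo /usr/local/sbin/slurmd -D -vvv

# Once it registers with slurmctld, clear the leftover DOWN state and check the node
sudo scontrol update nodename=linuxcluster state=resume
scontrol show node linuxcluster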
On Wed, Nov 29, 2017 at 18:21, Le Biot, Pierre-Marie <pierre-marie.leb...@hpe.com> wrote:

> Hello David,
>
> So linuxcluster is the Head node and also a Compute node?
>
> Is slurmd running?
>
> What does /var/log/slurm/slurmd.log say?
>
> Regards,
> Pierre-Marie Le Biot
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On Behalf Of* david vilanova
> *Sent:* Wednesday, November 29, 2017 4:33 PM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] slurm conf with single machine with multi cores.
>
> Hi,
>
> I have updated the slurm.conf as follows:
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> NodeName=linuxcluster CPUs=2
> PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP
>
> Still get testq node in down status??? Any idea?
>
> Below log from db and controller:
>
> ==> /var/log/slurm/slurmctrl.log <==
> [2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster linuxcluster
> [2017-11-29T16:28:30.850] error: SelectType specified more than once, latest value used
> [2017-11-29T16:28:30.851] layouts: no layout to initialize
> [2017-11-29T16:28:30.855] layouts: loading entities/relations information
> [2017-11-29T16:28:30.855] Recovered state of 1 nodes
> [2017-11-29T16:28:30.855] Down nodes: linuxcluster
> [2017-11-29T16:28:30.855] Recovered information about 0 jobs
> [2017-11-29T16:28:30.855] cons_res: select_p_node_init
> [2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
> [2017-11-29T16:28:30.856] Recovered state of 0 reservations
> [2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not specified
> [2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
> [2017-11-29T16:28:30.856] cons_res: select_p_node_init
> [2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
> [2017-11-29T16:28:30.856] Running as primary controller
> [2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
> [2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
> [2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
> [2017-11-29T16:29:31.169] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>
> David
>
> On Wed, Nov 29, 2017 at 15:59, Steffen Grunewald <steffen.grunew...@aei.mpg.de> wrote:
>
> Hi David,
>
> On Wed, 2017-11-29 at 14:45:06 +0000, david vilanova wrote:
> > Hello,
> > I have installed the latest 17.11 release and my node is shown as down.
> > I have a single physical server with 12 cores so not sure the conf below is correct?? Can you help??
> >
> > In slurm.conf the node is configured as follows:
> >
> > NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1 ThreadsPerCore=1 Feature=local
>
> 12 Sockets? Certainly not... 12 Cores per socket, yes.
> (IIRC CPUs shouldn't be specified if the detailed topology is given.
> You may try CPUs=12 and drop the details.)
>
> > PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
> ^^ typo?
>
> Cheers,
> Steffen
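For reference, a node definition along the lines Steffen suggests, assuming this box is a single socket with 12 cores, no hyper-threading, and keeping the RealMemory value from the earlier message, could look like this in slurm.conf:

# Either give the CPU count alone...
NodeName=linuxcluster CPUs=12 RealMemory=991 State=UNKNOWN
# ...or spell out the topology and drop CPUs
# NodeName=linuxcluster Sockets=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=991 State=UNKNOWN
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP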