Hi there, I'm setting up a small cluster composed of 4 blades with 32 physical cores and 750 GB of RAM each (so a total of 128 cores and approximately 3 TB of RAM). A CentOS 7 VM is running on each blade. The Slurm controller service is up and running on one of the blades, and the daemon service has been installed on each of the four blades (up and running as well).
A few days ago, I submitted a job using the MIRA assembler (multithreaded) on 60 cores and it worked well, using all the resources I had allocated to the job. At that point only 2 blades (including the one hosting the controller) were running, and the job completed successfully, using 60 cores when needed.

The problem appeared when I added the last 2 blades: no matter how many cores I allocate to a job, it now runs on a maximum of 32 cores (the number of physical cores per node). I tried with 60, 90 and 120 cores, but according to the CentOS system monitor, MIRA seems to use at most 32 cores: all the cores of one node, but none from the other nodes that were allocated.

Is it possible that there is a communication issue between the nodes? (All of them appear available in the output of sinfo, though.) I tried restarting the different services (controller/slaves), but it doesn't seem to help.

I would be grateful if someone could give me a hint on how to solve this issue.

Many thanks in advance,
Pierre

Here is the *slurm.conf* information:

```
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hpc-srvbio-03
ControlAddr=192.168.12.12
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758
NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10
```
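In case it matters, my submission script looks roughly like the sketch below. The job name, the manifest file name, and the MIRA invocation are illustrative placeholders; the relevant part is that I request the cores with an `--ntasks` line like this one:

```bash
#!/bin/bash
#SBATCH --job-name=mira_assembly   # illustrative name
#SBATCH --partition=HPC_test
#SBATCH --ntasks=60                # how I request 60 cores (also tried 90 and 120)

# Placeholder invocation: the real manifest file and MIRA options differ
mira assembly_manifest.conf
```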

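To rule out a communication problem, these are the standard Slurm checks I know of (so far I have only looked at the sinfo output; the node name and job ID below are placeholders):

```bash
# Is the controller reachable from this node?
scontrol ping

# Per-node state as the scheduler sees it
sinfo -N -l

# Configured vs. detected resources on one compute node
scontrol show node hpc-srvbio-01

# Which nodes and how many CPUs were actually allocated to a job
# (12345 is a placeholder job ID)
squeue -j 12345 -o "%i %T %D %C %N"
```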