Hi there,

I'm setting up a small cluster composed of 4 blades with 32 (physical)
cores and 750 GB RAM each (so a total of 128 cores and approx 3 TB RAM). A
CentOS 7 VM is running on each blade.
The Slurm controller service is up and running on one of the blades, and the
daemon service has been installed on each of the four blades (up and
running as well).

A few days ago, I submitted a job using the MIRA assembler (multithreaded)
on 60 cores and it worked well, using all the resources I had allocated to
the job. At that point, only 2 blades (including the one hosting the
controller) were running, and the job completed successfully, using 60
cores when needed.
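For context, a multithreaded job like this is typically requested as a single task with many CPUs. The actual submission script is not shown above, so the one below is only a hypothetical sketch; the job name, partition, and MIRA manifest file are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=mira_assembly   # placeholder name
#SBATCH --partition=HPC_test
#SBATCH --ntasks=1                 # one process: MIRA is shared-memory multithreaded
#SBATCH --cpus-per-task=60         # threads for that single process

# Placeholder invocation: adjust the manifest file to your project.
mira manifest.conf
```

Note that with --ntasks=1 the job is a single process, and the operating system can only schedule its threads on the cores of the one node where that process runs, regardless of how many cores the allocation spans.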

The problem appeared when I added the last 2 blades: no matter how many
cores I allocate to a job, it now runs on a maximum of 32 cores (the number
of physical cores per node).
I tried with 60, 90, and 120 cores, but according to the CentOS system
monitor, MIRA seems to use at most 32 cores (all the cores of one node, but
none of the others that were allocated). Could there be a communication
issue between the nodes? (All of them appear available in the output of the
sinfo command.)
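To check whether the nodes and the controller are talking to each other, a few standard Slurm commands can help. This is a generic diagnostic sketch using the hostnames from the slurm.conf below:

```shell
# From a compute node: is the controller reachable?
scontrol ping

# Detailed per-node states (look for DOWN, DRAIN, or NOT_RESPONDING)
sinfo -N -l
scontrol show node hpc-srvbio-04

# On each blade: are the daemons running, and what do their logs say?
systemctl status slurmd
systemctl status slurmctld   # on the controller blade only
```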

I tried restarting the different services (controller and compute daemons),
but that doesn't seem to help.

I would be grateful if someone could give me a hint on how to solve this
issue,

Many thanks in advance,
Pierre

Here is the *slurm.conf* information:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hpc-srvbio-03
ControlAddr=192.168.12.12
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758
NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10
