Can you post how you submitted the job?
MIRA on 60 cores needs MPI in your case; multithreading works without it, but only within a single node.
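
For comparison, the submission side would look roughly like this (the MIRA command line and the core counts are placeholders, adjust them to your actual call):

#!/bin/bash
# Threads-only job: everything has to fit on ONE node (32 cores here).
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
mira manifest.conf    # placeholder command; its threads stay on this node

# To use cores on several nodes you need MPI ranks instead, e.g.
#   #SBATCH --ntasks=60
#   srun ./some_mpi_binary
# srun (or mpirun) is what starts processes on the other nodes; a plain
# multithreaded binary launched from the batch script never leaves node 1.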

BTW, your config says CPUs=31. Was that generated without incrementing the index, or is it intended?
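
If it wasn't intended, the node line would normally match the physical core count, e.g. (only the value changed from your config; socket/core details depend on your hardware):

NodeName=DEFAULT CPUs=32 RealMemory=750000 TmpDisk=36758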

On 4 February 2016 18:02:15 CET, Pierre Schneeberger 
<pierre.schneeber...@gmail.com> wrote:
>Hi there,
>
>I'm setting up a small cluster composed of 4 blades with 32 (physical)
>cores and 750 GB RAM each (so a total of 128 cores and approx. 3 TB RAM).
>A CentOS 7 VM is running on each blade.
>The Slurm controller service is up and running on one of the blades, and
>the daemon service has been installed on each of the four blades (up and
>running as well).
>
>A few days ago, I submitted a job using the MIRA assembler (multithreaded)
>on 60 cores and it worked well, using all the resources I allocated to
>the job. At that point, only 2 blades (including the one with the
>controller) were running, and the job completed successfully, using 60
>cores when needed.
>
>The problem appeared when I added the last 2 blades: it seems that no
>matter how many cores I allocate to a job, it now runs on a maximum of
>32 cores (the number of physical cores per node).
>I tried it with 60, 90 and 120 cores, but according to the CentOS system
>monitor MIRA seems to use at most 32 cores (all the cores of one node,
>but none of the others that were allocated). Is it possible that there
>is a communication issue between the nodes? (All of them appear
>available with the sinfo command, though.)
>
>I tried restarting the different services (controller/slaves), but it
>doesn't seem to help.
>
>I would be grateful if someone could give me a hint on how to solve
>this issue.
>
>Many thanks in advance,
>Pierre
>
>Here is the *slurm.conf* information:
>
># slurm.conf file generated by configurator easy.html.
># Put this file on all nodes of your cluster.
># See the slurm.conf man page for more information.
>#
>ControlMachine=hpc-srvbio-03
>ControlAddr=192.168.12.12
>#
>#MailProg=/bin/mail
>MpiDefault=none
>#MpiParams=ports=#-#
>ProctrackType=proctrack/pgid
>ReturnToService=1
>SlurmctldPidFile=/var/run/slurmctld.pid
>#SlurmctldPort=6817
>SlurmdPidFile=/var/run/slurmd.pid
>#SlurmdPort=6818
>SlurmdSpoolDir=/var/spool/slurmd
>SlurmUser=root
>#SlurmdUser=root
>StateSaveLocation=/var/spool/slurmctld
>SwitchType=switch/none
>TaskPlugin=task/none
>#
>#
># TIMERS
>#KillWait=30
>#MinJobAge=300
>#SlurmctldTimeout=120
>#SlurmdTimeout=300
>#
>#
># SCHEDULING
>FastSchedule=1
>SchedulerType=sched/backfill
>#SchedulerPort=7321
>SelectType=select/linear
>#
>#
># LOGGING AND ACCOUNTING
>AccountingStorageType=accounting_storage/filetxt
>ClusterName=cluster
>#JobAcctGatherFrequency=30
>JobAcctGatherType=jobacct_gather/none
>#SlurmctldDebug=3
>#SlurmctldLogFile=
>#SlurmdDebug=3
>#SlurmdLogFile=
>#
>#
># COMPUTE NODES
>#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
>PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
>NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758
>NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
>NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
>NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
>NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10

--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.HTML
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
