Can you post how you submitted the job? MIRA on 60 cores needs MPI in your case; multi-threading works without it.
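For comparison, this is roughly what I would expect the two submission styles to look like. It is only a minimal sketch: the partition name is taken from your config, but the MIRA call, the MPI program and the core counts are placeholders, not your actual commands.

#!/bin/bash
# multithreaded.sbatch -- shared-memory run, confined to ONE node
#SBATCH --partition=HPC_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32        # cannot exceed the cores of a single blade (assuming 32 are really exposed)
mira manifest.conf                # placeholder MIRA call; its threads stay on this one node

#!/bin/bash
# mpi.sbatch -- only an MPI-aware program can be spread across several nodes
#SBATCH --partition=HPC_test
#SBATCH --ntasks=60               # 60 ranks; Slurm will place them on two or more nodes
#SBATCH --cpus-per-task=1
srun ./some_mpi_program           # or mpirun, depending on your MPI setup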
BTW, your config says 31 CPUs. Was that generated without incrementing the index, or is it intended?

On 4 February 2016 18:02:15 CET, Pierre Schneeberger <pierre.schneeber...@gmail.com> wrote:
>Hi there,
>
>I'm setting up a small cluster composed of 4 blades with 32 (physical)
>cores and 750 GB RAM each (so a total of 128 cores and approx. 3 TB RAM).
>A CentOS 7 VM is running on each blade.
>The Slurm controller service is up and running on one of the blades, and
>the daemon service has been installed on each of the four blades (up and
>running as well).
>
>A few days ago, I submitted a job using the MIRA assembler (multithreaded)
>on 60 cores and it worked well, using all the resources I allocated to the
>job. At that point, only 2 blades (including the one with the controller)
>were running, and the job was completed successfully using 60 cores when
>needed.
>
>The problem appeared when I added the last 2 blades: it seems that no
>matter how many resources (cores) I allocate to a job, it now runs on a
>maximum of 32 cores (the number of physical cores per node).
>I tried it with 60, 90 and 120 cores, but according to the CentOS system
>monitor MIRA seems to use at most 32 cores (all cores from one node, but
>none of the others that were allocated). Is it possible that there is a
>communication issue between the nodes? (All of them seem available when
>using the sinfo command.)
>
>I tried to restart the different services (controller/slaves), but it
>doesn't seem to help.
>
>I would be grateful if someone could give me a hint on how to solve this
>issue.
>
>Many thanks in advance,
>Pierre
>
>Here is the *slurm.conf* information:
>
># slurm.conf file generated by configurator easy.html.
># Put this file on all nodes of your cluster.
># See the slurm.conf man page for more information.
>#
>ControlMachine=hpc-srvbio-03
>ControlAddr=192.168.12.12
>#
>#MailProg=/bin/mail
>MpiDefault=none
>#MpiParams=ports=#-#
>ProctrackType=proctrack/pgid
>ReturnToService=1
>SlurmctldPidFile=/var/run/slurmctld.pid
>#SlurmctldPort=6817
>SlurmdPidFile=/var/run/slurmd.pid
>#SlurmdPort=6818
>SlurmdSpoolDir=/var/spool/slurmd
>SlurmUser=root
>#SlurmdUser=root
>StateSaveLocation=/var/spool/slurmctld
>SwitchType=switch/none
>TaskPlugin=task/none
>#
>#
># TIMERS
>#KillWait=30
>#MinJobAge=300
>#SlurmctldTimeout=120
>#SlurmdTimeout=300
>#
>#
># SCHEDULING
>FastSchedule=1
>SchedulerType=sched/backfill
>#SchedulerPort=7321
>SelectType=select/linear
>#
>#
># LOGGING AND ACCOUNTING
>AccountingStorageType=accounting_storage/filetxt
>ClusterName=cluster
>#JobAcctGatherFrequency=30
>JobAcctGatherType=jobacct_gather/none
>#SlurmctldDebug=3
>#SlurmctldLogFile=
>#SlurmdDebug=3
>#SlurmdLogFile=
>#
>#
># COMPUTE NODES
>#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
>PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
>NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758
>NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
>NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
>NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
>NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10

--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.HTML
vox: +49 3641 9 44323 | fax: +49 3641 9 44321