Dear Benjamin,

Many thanks for your answer. The blades all have 32 cores, but I left one core free since there is a VM running on each of them; I don't really know if that helps, but I wanted to be on the safe side :)
I submitted the job with sbatch and the following script:

#!/bin/bash
#SBATCH -n 80    # number of cores
#SBATCH -o /mnt/nfs/bio/HPC_related_material/Jobs_STDOUT_logs/slurm.%N.%j.out    # STDOUT
#SBATCH -e /mnt/nfs/bio/HPC_related_material/Jobs_STDERR_logs/slurm.%N.%j.err    # STDERR

perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl

And the MIRA manifest file (I don't know if you have experience with this assembler?) is written so that the software should use the total number of allocated cores:

project = 36
job = genome,denovo,accurate
readgroup = Illumina_Paired_End_files
autopairing
data = Stool_R1_36_NEB_Ultra.fastq Stool_R2_36_NEB_Ultra.fastq
technology = solexa
parameters = -GENERAL:number_of_threads=80 -GENERAL:mps=0 -SK:mchr=2048 -SK:mhpr=1000 -AS:sep=on -AS:nop=4 -NW:cmrnl=warn -OUT:rtd=on -OUT:orc=off -OUT:orm=off -OUT:ora=off -OUT:ort=off -NW:cnfs=warn

Do you see anything here that could be the cause of the issue?

Best regards,
Pierre

2016-02-04 19:41 GMT+01:00 Benjamin Redling <benjamin.ra...@uni-jena.de>:

> Can you post how you submitted the job?
> Mira on 60 cores needs MPI in your case. Multithreading works without it.
>
> BTW, your config says 31 CPUs. Generated without incr index, or intended?
>
> On 4 February 2016 18:02:15 CET, Pierre Schneeberger <
> pierre.schneeber...@gmail.com> wrote:
> >Hi there,
> >
> >I'm setting up a small cluster composed of 4 blades with 32 (physical)
> >cores and 750 GB RAM each (so a total of 128 cores and approx. 3 TB
> >RAM). A CentOS 7 VM is running on each blade.
> >The Slurm controller service is up and running on one of the blades,
> >and the daemon service has been installed on each of the four blades
> >(up and running as well).
> >
> >A few days ago, I submitted a job using the MIRA assembler
> >(multithreaded) on 60 cores and it worked well, using all the
> >resources I allocated to the job.
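For what it's worth: since MIRA's multithreading is shared-memory only, its threads cannot reach cores on other nodes without an MPI build, so a single-node submission that takes the thread count from what Slurm actually granted might behave more predictably than hard-coding 80 in the manifest. This is only a sketch, not your actual script — the shortened log paths, the `-n 31` value, and the echo/wrapper logic are my assumptions; `-N`, `-n`, and `SLURM_CPUS_ON_NODE` are standard Slurm:

```shell
#!/bin/bash
# Sketch: pin the whole job to one node so the allocation matches what a
# purely multithreaded program can actually use. Log paths shortened for
# the example; adjust -n to the cores you want on that node.
#SBATCH -N 1                  # one node only
#SBATCH -n 31                 # cores on that node (config advertises CPUs=31)
#SBATCH -o slurm.%N.%j.out    # STDOUT
#SBATCH -e slurm.%N.%j.err    # STDERR

# Derive the thread count from what Slurm granted on this node instead of
# hard-coding it in the manifest; fall back to nproc when run outside Slurm.
THREADS="${SLURM_CPUS_ON_NODE:-$(nproc)}"
echo "would run MIRA with -GENERAL:number_of_threads=${THREADS}"
# hypothetical: pass the count through to the wrapper script
# perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl "${THREADS}"
```

The point of the wrapper logic is that the manifest's `number_of_threads=80` can then be replaced by a value that never exceeds one node's cores, whatever the allocation is.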
> >At that point, only 2 blades (including the one with the controller)
> >were running, and the job completed successfully using 60 cores when
> >needed.
> >
> >The problem appeared when I added the last 2 blades: it now seems
> >that no matter how many cores I allocate to a job, it runs on a
> >maximum of 32 cores (the number of physical cores per node).
> >I tried with 60, 90 and 120 cores, but according to the CentOS system
> >monitor MIRA seems to use at most 32 cores (all the cores of one
> >node, but none of the others that were allocated). Is it possible
> >that there is a communication issue between the nodes? (Although all
> >of them appear available with the sinfo command.)
> >
> >I tried restarting the different services (controller/slaves) but it
> >doesn't seem to help.
> >
> >I would be grateful if someone could give me a hint on how to solve
> >this issue.
> >
> >Many thanks in advance,
> >Pierre
> >
> >Here is the slurm.conf information:
> >
> ># slurm.conf file generated by configurator easy.html.
> ># Put this file on all nodes of your cluster.
> ># See the slurm.conf man page for more information.
> >#
> >ControlMachine=hpc-srvbio-03
> >ControlAddr=192.168.12.12
> >#
> >#MailProg=/bin/mail
> >MpiDefault=none
> >#MpiParams=ports=#-#
> >ProctrackType=proctrack/pgid
> >ReturnToService=1
> >SlurmctldPidFile=/var/run/slurmctld.pid
> >#SlurmctldPort=6817
> >SlurmdPidFile=/var/run/slurmd.pid
> >#SlurmdPort=6818
> >SlurmdSpoolDir=/var/spool/slurmd
> >SlurmUser=root
> >#SlurmdUser=root
> >StateSaveLocation=/var/spool/slurmctld
> >SwitchType=switch/none
> >TaskPlugin=task/none
> >#
> >#
> ># TIMERS
> >#KillWait=30
> >#MinJobAge=300
> >#SlurmctldTimeout=120
> >#SlurmdTimeout=300
> >#
> >#
> ># SCHEDULING
> >FastSchedule=1
> >SchedulerType=sched/backfill
> >#SchedulerPort=7321
> >SelectType=select/linear
> >#
> >#
> ># LOGGING AND ACCOUNTING
> >AccountingStorageType=accounting_storage/filetxt
> >ClusterName=cluster
> >#JobAcctGatherFrequency=30
> >JobAcctGatherType=jobacct_gather/none
> >#SlurmctldDebug=3
> >#SlurmctldLogFile=
> >#SlurmdDebug=3
> >#SlurmdLogFile=
> >#
> >#
> ># COMPUTE NODES
> >#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
> >PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
> >NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758
> >NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
> >NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
> >NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
> >NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10

> --
> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.HTML
> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
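As an aside on the quoted slurm.conf: the NodeName=DEFAULT line appears after the PartitionName line, and per the slurm.conf man page a DEFAULT entry applies only to the node definitions that follow it, so putting it first (and defining nodes before the partition that references them) keeps the intent unambiguous. A sketch of that block with the original hostnames and addresses, assuming CPUs=31 is intended (State=UNKNOWN is my addition, taken from the commented-out line):

```
# COMPUTE NODES (sketch; DEFAULT first so it applies to every node below)
NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758 State=UNKNOWN
NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10
PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
```

None of this changes the core symptom, though: with SelectType=select/linear the job can still be granted several whole nodes, but a threads-only program will only ever run on the first of them.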