Dear Benjamin,

Many thanks for your answer.
All of the blades do have 32 cores, but I left one free on each since a
CentOS 7 VM is running on them. I don't know whether that really helps, but
I wanted to be on the safe side :)

I submitted the job with sbatch using the following script:

#!/bin/bash
#SBATCH -n 80 # number of cores
#SBATCH -o /mnt/nfs/bio/HPC_related_material/Jobs_STDOUT_logs/slurm.%N.%j.out # STDOUT
#SBATCH -e /mnt/nfs/bio/HPC_related_material/Jobs_STDERR_logs/slurm.%N.%j.err # STDERR
perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl
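
Following up on your MPI remark: since MIRA's multithreading can only use
the cores of a single node, I imagine a single-node variant of the header
would look roughly like the sketch below (untested on my side; the -c value
is just taken from the CPUs=31 in our node definitions):

#!/bin/bash
#SBATCH -N 1   # one node only, since MIRA threads cannot span nodes
#SBATCH -n 1   # a single task: the perl wrapper
#SBATCH -c 31  # all CPUs configured on one node
#SBATCH -o /mnt/nfs/bio/HPC_related_material/Jobs_STDOUT_logs/slurm.%N.%j.out # STDOUT
#SBATCH -e /mnt/nfs/bio/HPC_related_material/Jobs_STDERR_logs/slurm.%N.%j.err # STDERR
perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl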

The MIRA manifest file (I don't know whether you have experience with this
assembler?) is written so that the software should use the total number of
allocated cores:

project = 36
job = genome,denovo,accurate
readgroup = Illumina_Paired_End_files
autopairing
data = Stool_R1_36_NEB_Ultra.fastq Stool_R2_36_NEB_Ultra.fastq
technology = solexa
parameters = -GENERAL:number_of_threads=80 -GENERAL:mps=0 -SK:mchr=2048 -SK:mhpr=1000 -AS:sep=on -AS:nop=4 -NW:cmrnl=warn -OUT:rtd=on -OUT:orc=off -OUT:orm=off -OUT:ora=off -OUT:ort=off -NW:cnfs=warn
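
One idea I may try (just a sketch, based on the SLURM_CPUS_ON_NODE
environment variable that Slurm sets inside a job): generate the manifest
from a template so the thread count matches what was actually granted on
the node, instead of hard-coding 80. The file names below are made up:

#!/bin/bash
# CPUs Slurm granted on this node; fall back to 32 when run outside a job
THREADS=${SLURM_CPUS_ON_NODE:-32}
# write the real manifest with the thread count substituted in
sed "s/number_of_threads=80/number_of_threads=${THREADS}/" \
    manifest.template > manifest.conf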

Do you see anything here that could be the cause of the issue?


Best regards,
Pierre


2016-02-04 19:41 GMT+01:00 Benjamin Redling <benjamin.ra...@uni-jena.de>:

>
> Can you post how you submitted the job?
> Mira on 60 cores needs MPI in your case. Multithreading works without it.
>
> BTW, your config says 31 CPUs. Generated without incr index, or intended?
>
> On 4 February 2016 at 18:02:15 CET, Pierre Schneeberger <
> pierre.schneeber...@gmail.com> wrote:
> >Hi there,
> >
> >I'm setting up a small cluster composed of 4 blades, each with 32
> >(physical) cores and 750 GB RAM (so a total of 128 cores and approx.
> >3 TB RAM). A CentOS 7 VM is running on each blade.
> >The Slurm controller service is up and running on one of the blades,
> >and the daemon service has been installed on each of the four blades
> >(up and running as well).
> >
> >A few days ago, I submitted a job using the MIRA assembler
> >(multithreaded) on 60 cores and it worked well, using all the
> >resources I had allocated to the job. At that point only 2 blades
> >(including the one with the controller) were running, and the job
> >completed successfully, using 60 cores when needed.
> >
> >The problem appeared when I added the last 2 blades: it now seems
> >that no matter how many resources (cores) I allocate to a job, it
> >runs on a maximum of 32 cores (the number of physical cores per
> >node). I tried with 60, 90 and 120 cores, but MIRA, according to the
> >CentOS system monitor, seems to use at most 32 cores (all cores from
> >one node but none from the others that were allocated). Is it
> >possible that there is a communication issue between the nodes?
> >(Although all of them appear available with the sinfo command.)
> >
> >I tried to restart the different services (controller/slaves) but it
> >doesn't seem to help.
> >
> >I would be grateful if someone could give me a hint on how to solve
> >this issue,
> >
> >Many thanks in advance,
> >Pierre
> >
> >Here is the *slurm.conf* information:
> >
> ># slurm.conf file generated by configurator easy.html.
> ># Put this file on all nodes of your cluster.
> ># See the slurm.conf man page for more information.
> >#
> >ControlMachine=hpc-srvbio-03
> >ControlAddr=192.168.12.12
> >#
> >#MailProg=/bin/mail
> >MpiDefault=none
> >#MpiParams=ports=#-#
> >ProctrackType=proctrack/pgid
> >ReturnToService=1
> >SlurmctldPidFile=/var/run/slurmctld.pid
> >#SlurmctldPort=6817
> >SlurmdPidFile=/var/run/slurmd.pid
> >#SlurmdPort=6818
> >SlurmdSpoolDir=/var/spool/slurmd
> >SlurmUser=root
> >#SlurmdUser=root
> >StateSaveLocation=/var/spool/slurmctld
> >SwitchType=switch/none
> >TaskPlugin=task/none
> >#
> >#
> ># TIMERS
> >#KillWait=30
> >#MinJobAge=300
> >#SlurmctldTimeout=120
> >#SlurmdTimeout=300
> >#
> >#
> ># SCHEDULING
> >FastSchedule=1
> >SchedulerType=sched/backfill
> >#SchedulerPort=7321
> >SelectType=select/linear
> >#
> >#
> ># LOGGING AND ACCOUNTING
> >AccountingStorageType=accounting_storage/filetxt
> >ClusterName=cluster
> >#JobAcctGatherFrequency=30
> >JobAcctGatherType=jobacct_gather/none
> >#SlurmctldDebug=3
> >#SlurmctldLogFile=
> >#SlurmdDebug=3
> >#SlurmdLogFile=
> >#
> >#
> ># COMPUTE NODES
> >#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
> >PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2]
> >Default=YES MaxTime=INFINITE State=UP
> >NodeName=DEFAULT CPUs=31 RealMemory=750000 TmpDisk=36758
> >NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
> >NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
> >NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
> >NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10
>
> --
> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.HTML
> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
>
