Of course, -N 1 is wrong, since it requests more CPUs than are available on one node. Sorry, I didn't read your mail to the end.

Try with:

srun -n 25 -m plane=20
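For example, the whole batch script could look like this (a minimal, untested sketch: it reuses the thin partition and the --nodes/--ntasks values from your script below, and the job name is made up). -m is the short form of --distribution, and plane=<size> hands tasks out to the allocated nodes in blocks of <size>, so plane=20 should place 20 of the 25 tasks on the first node and the remaining 5 on the second:

#!/bin/bash
#SBATCH --job-name=plane_test
#SBATCH --partition=thin
#SBATCH --nodes=2
#SBATCH --ntasks=40

# plane=20 distributes the step's 25 tasks in blocks of 20 per node:
# tasks 0-19 on the first node, tasks 20-24 on the second.
srun -n 25 -m plane=20 hostname

Piping the output through "sort | uniq -c" should then show 20 lines for one hostname and 5 for the other, instead of the 13/12 split you are seeing.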
On 04/04/2014 13:57, Joan Arbona wrote:
> Not working, it just says that more processors were requested than permitted:
>
> srun: error: Unable to create job step: More processors requested than permitted
>
> Thanks
>
> On 04/04/14 13:50, Mehdi Denou wrote:
>> Try with:
>> srun -N 1 -n 25
>>
>> On 04/04/2014 13:47, Joan Arbona wrote:
>>> Excuse me, I confused "Nodes" with "Tasks". When I wrote "Nodes" in the
>>> last e-mail I meant "tasks".
>>>
>>> Let me explain it again with an example:
>>>
>>> My cluster has 2 nodes with 20 processors/node. I want to allocate all
>>> 40 processors and both nodes in sbatch. Then I have to execute a job
>>> step with srun on a subset of 25 processors. I want SLURM to fill the
>>> maximum number of nodes completely: that is, to use all 20 processors
>>> of the first node and 5 of the second one.
>>>
>>> If I execute an sbatch script like this:
>>>
>>> #!/bin/bash
>>> [...]
>>> #SBATCH --nodes=2
>>> #SBATCH --ntasks=40
>>>
>>> srun -n25 hostname
>>>
>>> it does not work: it executes 12 hostname tasks on the first node and
>>> 13 on the second one, when it should execute 20 on the first one and 5
>>> on the second one.
>>>
>>> Thanks and sorry for the confusion,
>>> Joan
>>>
>>> On 04/04/14 13:22, Mehdi Denou wrote:
>>>> It's a little bit confusing:
>>>>
>>>>> When in sbatch I specify that I want to allocate 25 nodes and I execute
>>>>
>>>> So it means -N 25.
>>>> For example, if you want to allocate 40 nodes and then execute srun on
>>>> 25 of them:
>>>>
>>>> #!/bin/bash
>>>> #SBATCH -N 40
>>>>
>>>> srun -N 25 hostname
>>>>
>>>> -n is the number of tasks (the number of system processes).
>>>> -N or --nodes is the number of nodes.
>>>>
>>>> If you don't specify -n, it is set to 1 by default.
>>>>
>>>> On 04/04/2014 11:24, Joan Arbona wrote:
>>>>> Thanks for the answer. No luck anyway.
>>>>> When in sbatch I specify that I want to allocate 25 nodes and I execute
>>>>> srun without parameters it works. However, if I specify that I want to
>>>>> allocate 40 nodes and then execute srun selecting only 25 of them, it
>>>>> does not work.
>>>>>
>>>>> That is:
>>>>>
>>>>> ---
>>>>>
>>>>> 1.
>>>>> #!/bin/bash
>>>>> [...]
>>>>> #SBATCH --nodes=2
>>>>> #SBATCH --ntasks=25
>>>>>
>>>>> srun hostname
>>>>>
>>>>> -> Works, but we don't want it because we need srun to select a subset
>>>>> of the requested nodes.
>>>>>
>>>>> ---
>>>>>
>>>>> 2.
>>>>> #!/bin/bash
>>>>> [...]
>>>>> #SBATCH --nodes=2
>>>>> #SBATCH --ntasks=40
>>>>>
>>>>> srun -n25 hostname
>>>>>
>>>>> -> Doesn't work: it executes half of the processes on the first node
>>>>> and the other half on the second. I also tried removing --nodes=2.
>>>>>
>>>>> ---
>>>>>
>>>>> It seems to be the way sbatch influences srun. Is there any way to see
>>>>> which parameters the sbatch call transfers to srun?
>>>>>
>>>>> Thanks,
>>>>> Joan
>>>>>
>>>>> On 04/04/14 10:54, Mehdi Denou wrote:
>>>>>> Hello,
>>>>>>
>>>>>> You should take a look at the parameter --mincpus.
>>>>>>
>>>>>> On 04/04/2014 10:22, Joan Arbona wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> We have a cluster with 40 nodes and 20 cores per node, and we are
>>>>>>> trying to distribute job steps executed with sbatch "in blocks".
>>>>>>> That means we want to fill the maximum number of nodes and, if the
>>>>>>> number of tasks is not a multiple of 20, to have only one node with
>>>>>>> not all of its cores busy. For example, if we executed a task on 25
>>>>>>> cores, we would have node 1 with all 20 cores reserved and node 2
>>>>>>> with only 5 cores reserved.
>>>>>>>
>>>>>>> If we execute
>>>>>>>
>>>>>>> srun -n25 -pthin hostname
>>>>>>>
>>>>>>> it works fine and produces the following output:
>>>>>>>
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner118
>>>>>>> foner119
>>>>>>> foner119
>>>>>>> foner119
>>>>>>> foner119
>>>>>>> foner119
>>>>>>>
>>>>>>> However, when we execute this in an sbatch script it does not work
>>>>>>> at all. I have tried it with all possible configurations I know and
>>>>>>> with all useful parameters. Instead it executes 13 processes on the
>>>>>>> first node and 12 processes on the second node.
>>>>>>>
>>>>>>> This is our sbatch script:
>>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> #SBATCH --job-name=prova_joan
>>>>>>> #SBATCH --partition=thin
>>>>>>> #SBATCH --output=WRFJobName-%j.out
>>>>>>> #SBATCH --error=WRFJobName-%j.err
>>>>>>> #SBATCH --nodes=2
>>>>>>> #SBATCH --ntasks=40
>>>>>>>
>>>>>>> srun -n25 --exclusive hostname &
>>>>>>>
>>>>>>> wait
>>>>>>>
>>>>>>> I have already tried removing the --exclusive and the &, without
>>>>>>> success.
>>>>>>>
>>>>>>> To sum up, the question is: what is the way to group the tasks of
>>>>>>> job steps so that they fill as many nodes as possible with sbatch?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joan
>>>>>>>
>>>>>>> PS: Attaching slurm.conf:
>>>>>>>
>>>>>>> ##################BEGIN SLURM.CONF#######################
>>>>>>> ClusterName=foner
>>>>>>> ControlMachine=foner1,foner2
>>>>>>> ControlAddr=slurm-server
>>>>>>> #BackupController=
>>>>>>> #BackupAddr=
>>>>>>> #
>>>>>>> SlurmUser=slurm
>>>>>>> #SlurmdUser=root
>>>>>>> SlurmctldPort=6817
>>>>>>> SlurmdPort=6818
>>>>>>> AuthType=auth/munge
>>>>>>> CryptoType=crypto/munge
>>>>>>> JobCredentialPrivateKey=/etc/slurm/private.key
>>>>>>> JobCredentialPublicCertificate=/etc/slurm/public.key
>>>>>>> StateSaveLocation=/SLURM
>>>>>>> SlurmdSpoolDir=/var/log/slurm/spool_slurmd/
>>>>>>> SwitchType=switch/none
>>>>>>> MpiDefault=none
>>>>>>> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
>>>>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>>>>> #ProctrackType=proctrack/pgid
>>>>>>> ProctrackType=proctrack/linuxproc
>>>>>>> TaskPlugin=task/affinity
>>>>>>> TaskPluginParam=Cpusets
>>>>>>> #PluginDir=
>>>>>>> CacheGroups=0
>>>>>>> #FirstJobId=
>>>>>>> ReturnToService=0
>>>>>>> #MaxJobCount=
>>>>>>> #PlugStackConfig=
>>>>>>> #PropagatePrioProcess=
>>>>>>> #PropagateResourceLimits=
>>>>>>> #PropagateResourceLimitsExcept=
>>>>>>> #Prolog=/data/scripts/prolog_ctld.sh
>>>>>>> #Prolog=
>>>>>>> Epilog=/data/scripts/epilog.sh
>>>>>>> #SrunProlog=
>>>>>>> #SrunEpilog=
>>>>>>> #TaskProlog=
>>>>>>> #TaskEpilog=
>>>>>>> #TaskPlugin=
>>>>>>> #TrackWCKey=no
>>>>>>> #TreeWidth=50
>>>>>>> #TmpFS=
>>>>>>> #UsePAM=
>>>>>>> #UsePAM=1
>>>>>>> #
>>>>>>> # TIMERS
>>>>>>> SlurmctldTimeout=300
>>>>>>> SlurmdTimeout=300
>>>>>>> InactiveLimit=0
>>>>>>> MinJobAge=300
>>>>>>> KillWait=30
>>>>>>> Waittime=0
>>>>>>> #
>>>>>>> # SCHEDULING
>>>>>>> SchedulerType=sched/backfill
>>>>>>> #SchedulerAuth=
>>>>>>> #SchedulerPort=
>>>>>>> #SchedulerRootFilter=
>>>>>>> #SelectType=select/linear
>>>>>>> SelectType=select/cons_res
>>>>>>> SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
>>>>>>> FastSchedule=1
>>>>>>> PriorityType=priority/multifactor
>>>>>>> #PriorityDecayHalfLife=14-0
>>>>>>> #PriorityUsageResetPeriod=14-0
>>>>>>> PriorityWeightFairshare=0
>>>>>>> PriorityWeightAge=0
>>>>>>> PriorityWeightPartition=0
>>>>>>> PriorityWeightJobSize=0
>>>>>>> PriorityWeightQOS=1000
>>>>>>> #PriorityMaxAge=1-0
>>>>>>> #
>>>>>>> # LOGGING
>>>>>>> SlurmctldDebug=5
>>>>>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>>>>> SlurmdDebug=5
>>>>>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>>>>> JobCompType=jobcomp/none
>>>>>>> #JobCompLoc=
>>>>>>> #
>>>>>>> # ACCOUNTING
>>>>>>> #JobAcctGatherType=jobacct_gather/linux
>>>>>>> #JobAcctGatherFrequency=30
>>>>>>> #
>>>>>>> #AccountingStorageType=accounting_storage/slurmdbd
>>>>>>> ##AccountingStorageHost=slurm-server
>>>>>>> #AccountingStorageLoc=
>>>>>>> #AccountingStoragePass=
>>>>>>> #AccountingStorageUser=
>>>>>>> #
>>>>>>> AccountingStorageEnforce=qos
>>>>>>> AccountingStorageLoc=slurm_acct_db
>>>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>>>> AccountingStoragePort=8544
>>>>>>> AccountingStorageUser=root
>>>>>>> #AccountingStoragePass=slurm
>>>>>>> AccountingStorageHost=slurm-server
>>>>>>> # ACCT_GATHER
>>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>>> JobAcctGatherFrequency=60
>>>>>>> #AcctGatherEnergyType=acct_gather_energy/rapl
>>>>>>> #AcctGatherNodeFreq=30
>>>>>>>
>>>>>>> # Memory
>>>>>>> #DefMemPerCPU=1024 # 1GB
>>>>>>> #MaxMemPerCPU=3072 # 3GB
>>>>>>>
>>>>>>> # COMPUTE NODES
>>>>>>> NodeName=foner[11-14] Procs=20 RealMemory=258126 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
>>>>>>>
>>>>>>> NodeName=foner[101-142] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=64398 State=UNKNOWN
>>>>>>>
>>>>>>> PartitionName=thin Nodes=foner[103-142] Shared=NO PreemptMode=CANCEL State=UP MaxTime=4320 MinNodes=2
>>>>>>> PartitionName=thin_test Nodes=foner[101,102] Default=YES Shared=NO PreemptMode=CANCEL State=UP MaxTime=60 MaxNodes=1
>>>>>>> PartitionName=fat Nodes=foner[11-14] Shared=NO PreemptMode=CANCEL State=UP MaxTime=4320 MaxNodes=1
>>>>>>>
>>>>>>> ##################END SLURM.CONF#######################
>
> --

--
Mehdi Denou
International HPC support
+336 45 57 66 56
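To recap the -n / -N distinction from the thread, a minimal sketch (untested; it assumes the same 2-node, 20-cores-per-node layout described above):

#!/bin/bash
#SBATCH --nodes=2    # -N: how many nodes to allocate
#SBATCH --ntasks=40  # -n: how many tasks (system processes) to start

# Without an explicit -m/--distribution option, srun balances the 25
# tasks across the two allocated nodes (the roughly even 13/12 split
# reported above); -m plane=20 packs them 20 + 5 instead.
srun -n 25 hostname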