Hi Carles,
On 02/27/2012 11:28 PM, Carles Fenoy wrote:
We are running Slurm 2.3.2 and found no issues when two epilogs run
simultaneously on the same node.
We are performing tests on 2.3.3.
What do you do inside your epilogs?
Nothing special: database queries + several rsyncs.
I don't understand what you mean by increasing the NumCPUs.
I mean this one:
[taras@ts-sl5slurm ~]$ date
Tue Feb 28 11:59:45 CET 2012
[taras@ts-sl5slurm ~]$ scontrol show job 1587
JobId=1587 Name=job.3-us-west-2-upload
UserId=taras(1002) GroupId=taras(1002)
Priority=4294901104 Account=(null) QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2012-02-28T11:53:08 EligibleTime=2012-02-28T11:53:08
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=ts-sl5slurm:25574
ReqNodeList=ts-sl5slurm ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/tmp/upload-_bF-OA.job
WorkDir=/home/taras/cmsub
[taras@ts-sl5slurm ~]$ squeue -o '%.7i %.28j %.6t %.20R'
JOBID NAME ST NODELIST(REASON)
1586 job.2-eu-west-1-download CG ts-sl5slurm
1587 job.3-us-west-2-upload PD (Resources)
[taras@ts-sl5slurm ~]$ date
Tue Feb 28 12:01:30 CET 2012
[taras@ts-sl5slurm ~]$ scontrol show job 1587
JobId=1587 Name=job.3-us-west-2-upload
UserId=taras(1002) GroupId=taras(1002)
Priority=4294901104 Account=(null) QOS=normal
JobState=COMPLETING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2012-02-28T11:53:08 EligibleTime=2012-02-28T11:53:08
StartTime=2012-02-28T12:01:18 EndTime=2012-02-28T12:01:18
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=ts-sl5slurm:25574
ReqNodeList=ts-sl5slurm ExcNodeList=(null)
NodeList=ts-sl5slurm
BatchHost=ts-sl5slurm
NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=1:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/tmp/upload-_bF-OA.job
WorkDir=/home/taras/cmsub
So, at first jobs 1586 and 1587 both had NumCPUs=1. Then these jobs
finished in parallel on node ts-sl5slurm, their NumCPUs was automatically
increased to the maximum value (12 CPUs), and the state of 1587 was
changed to PD because all CPUs are allocated to the epilog of job
1586.
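To make the difference between the two 'scontrol show job 1587' snapshots easier to see, here is a minimal sketch that parses the KEY=VALUE output and lists only the fields that changed. The helper names (parse_scontrol, changed_fields) are made up for illustration, and the sketch assumes no value contains spaces, which holds for the excerpts above but not for every job.

```python
def parse_scontrol(text):
    """Turn `scontrol show job` output into a dict of KEY -> VALUE strings.

    Assumption: fields are whitespace-separated KEY=VALUE tokens and no
    value contains spaces (true for the excerpts used here).
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip anything that is not KEY=VALUE
            fields[key] = value
    return fields


def changed_fields(before, after):
    """Return {key: (old, new)} for every field that differs."""
    a, b = parse_scontrol(before), parse_scontrol(after)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Excerpts copied from the two snapshots of job 1587 above.
BEFORE = """JobId=1587 Name=job.3-us-west-2-upload
JobState=PENDING Reason=Priority Dependency=(null)
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:*:*"""

AFTER = """JobId=1587 Name=job.3-us-west-2-upload
JobState=COMPLETING Reason=None Dependency=(null)
NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=1:*:*"""

for key, (old, new) in changed_fields(BEFORE, AFTER).items():
    print(f"{key}: {old} -> {new}")
```

On these excerpts the only fields that differ are JobState, NumCPUs and Reason, which matches the behaviour described above.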
Can you post your configuration?
slurm.conf:
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/cm/shared/apps/slurm/current/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/2.3.3/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=2
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
SlurmctldTimeout=5
SlurmdTimeout=5
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
FastSchedule=0
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=7
SlurmdLogFile=/var/log/slurmd
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
GresTypes=gpu,eu-west-1,us-west-2
SchedulerType=sched/backfill
ControlMachine=ts-sl5slurm
ControlAddr=ts-sl5slurm
NodeName=node001
NodeName=cnode001-eu-west-1 Feature=eu-west-1
NodeName=ts-sl5slurm
NodeName=cnode001-us-west-2 Feature=us-west-2
PartitionName=defq Nodes=cnode001-eu-west-1,cnode001-us-west-2,node001,ts-sl5slurm Default=YES MinNodes=1 MaxNodes=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO
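For anyone who wants to check quickly which settings are actually present in a configuration like the one above, here is a rough sketch that splits slurm.conf-style text into global settings and NodeName/PartitionName lines. The function name parse_slurm_conf is hypothetical. The sketch assumes each definition sits on a single line and that keys are spelled exactly as shown (the real parser is case-insensitive, this one is not).

```python
def parse_slurm_conf(text):
    """Split slurm.conf-style text into (global settings, entity lines).

    Assumptions: one definition per line, KEY=VALUE tokens separated by
    whitespace, keys matched case-sensitively.
    """
    settings, entities = {}, []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        pairs = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        first_key = line.split("=", 1)[0]
        if first_key in ("NodeName", "PartitionName"):
            entities.append(pairs)  # node and partition definitions
        else:
            settings.update(pairs)  # plain global parameters
    return settings, entities


# A few lines taken from the configuration posted above.
SAMPLE = """\
ProctrackType=proctrack/pgid
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
FastSchedule=0
SchedulerType=sched/backfill
NodeName=ts-sl5slurm
PartitionName=defq Nodes=cnode001-eu-west-1,cnode001-us-west-2,node001,ts-sl5slurm Default=YES Shared=NO
"""

settings, entities = parse_slurm_conf(SAMPLE)
print(settings["Epilog"])          # path of the epilog script
print(settings.get("SelectType"))  # None: not set in the posted file
print(entities[1]["Shared"])       # Shared value of partition defq
```

A lookup that returns None, as for SelectType here, only means the parameter is not set in the file, so the daemon falls back to its built-in default for it.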
Best regards,
Carles Fenoy
On 24/02/2012 17:51, "Taras Shapovalov" <[email protected]> wrote:
Hi,
Can nobody answer my question?
--
Best regards,
Taras
On Tue, Feb 21, 2012 at 5:44 PM, Taras Shapovalov <[email protected]> wrote:
Dear developers,
If we set up an epilog script (the Epilog parameter in slurm.conf), then
epilogs of different jobs are executed serially, one by one (if they run
on the same node).
As I understand it, after a job finishes its NumCPUs value is
automatically increased to the maximum number of CPU cores on the node.
Therefore, the other epilogs have to wait. Also, the administrator cannot
change NumCPUs with 'scontrol update' during the epilog stage to allow
other jobs (which are in their epilog stages) to run their epilogs too.
Is it possible to allow several epilogs to execute in parallel on the
same host?
--
Best regards,
Taras
--
Best regards,
Taras