Hi Carles,

On 02/27/2012 11:28 PM, Carles Fenoy wrote:

We are running slurm 2.3.2 and found no issues when two epilogs run simultaneously on the same node.

We are performing tests on 2.3.3.

What do you do inside your epilogs?

Nothing special: database queries + several rsyncs.

I don't understand what you mean by increasing the NumCPUs.

I mean this one:

[taras@ts-sl5slurm ~]$ date
Tue Feb 28 11:59:45 CET 2012

[taras@ts-sl5slurm ~]$ scontrol show job 1587
JobId=1587 Name=job.3-us-west-2-upload
   UserId=taras(1002) GroupId=taras(1002)
   Priority=4294901104 Account=(null) QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2012-02-28T11:53:08 EligibleTime=2012-02-28T11:53:08
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=ts-sl5slurm:25574
   ReqNodeList=ts-sl5slurm ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/upload-_bF-OA.job
   WorkDir=/home/taras/cmsub

[taras@ts-sl5slurm ~]$ squeue -o '%.7i %.28j %.6t %.20R'
  JOBID                         NAME     ST     NODELIST(REASON)
   1586     job.2-eu-west-1-download     CG          ts-sl5slurm
   1587       job.3-us-west-2-upload     PD          (Resources)

[taras@ts-sl5slurm ~]$ date
Tue Feb 28 12:01:30 CET 2012

[taras@ts-sl5slurm ~]$ scontrol show job 1587
JobId=1587 Name=job.3-us-west-2-upload
   UserId=taras(1002) GroupId=taras(1002)
   Priority=4294901104 Account=(null) QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2012-02-28T11:53:08 EligibleTime=2012-02-28T11:53:08
   StartTime=2012-02-28T12:01:18 EndTime=2012-02-28T12:01:18
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=ts-sl5slurm:25574
   ReqNodeList=ts-sl5slurm ExcNodeList=(null)
   NodeList=ts-sl5slurm
   BatchHost=ts-sl5slurm
   NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=1:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/upload-_bF-OA.job
   WorkDir=/home/taras/cmsub


So, at first jobs 1586 and 1587 both had NumCPUs=1. Then these jobs finished in parallel on node ts-sl5slurm, their NumCPUs values were automatically increased to the maximum (12 CPUs), and the state of 1587 was changed to PD because all CPUs were allocated to the epilog of job 1586.
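To make the change between the two snapshots above easier to see, here is a minimal Python sketch that compares two `scontrol show job` outputs field by field. The helper names (`parse_scontrol_job`, `changed_fields`) are made up for illustration, the sample text is an abbreviated copy of the output above, and the parser assumes that no field value contains spaces.

```python
# Sketch only: compares two `scontrol show job` snapshots.
# Assumption: every field is a whitespace-separated KEY=VALUE token,
# which holds for the output above but not for values containing spaces.

BEFORE = """JobId=1587 Name=job.3-us-west-2-upload
   JobState=PENDING Reason=Priority Dependency=(null)
   StartTime=Unknown EndTime=Unknown
   Partition=defq AllocNode:Sid=ts-sl5slurm:25574
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:*:*"""

AFTER = """JobId=1587 Name=job.3-us-west-2-upload
   JobState=COMPLETING Reason=None Dependency=(null)
   StartTime=2012-02-28T12:01:18 EndTime=2012-02-28T12:01:18
   Partition=defq AllocNode:Sid=ts-sl5slurm:25574
   NodeList=ts-sl5slurm
   BatchHost=ts-sl5slurm
   NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=1:*:*"""


def parse_scontrol_job(text):
    """Return a dict mapping each field name to its value (as strings)."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not KEY=VALUE
            fields[key] = value
    return fields


def changed_fields(before, after):
    """Return {field: (old, new)} for every field whose value differs.

    A field missing from one snapshot is reported with None on that side.
    """
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


if __name__ == "__main__":
    diff = changed_fields(parse_scontrol_job(BEFORE), parse_scontrol_job(AFTER))
    for field, (old, new) in diff.items():
        print(f"{field}: {old} -> {new}")
```

Run on the two snapshots, this reports NumCPUs going from 1 to 12 together with the JobState change, while unchanged fields such as Partition are left out.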

Can you post your configuration?

slurm.conf:

SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/cm/shared/apps/slurm/current/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/2.3.3/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=2
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
SlurmctldTimeout=5
SlurmdTimeout=5
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
FastSchedule=0
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=7
SlurmdLogFile=/var/log/slurmd
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
GresTypes=gpu,eu-west-1,us-west-2
SchedulerType=sched/backfill
ControlMachine=ts-sl5slurm
ControlAddr=ts-sl5slurm
NodeName=node001
NodeName=cnode001-eu-west-1 Feature=eu-west-1
NodeName=ts-sl5slurm
NodeName=cnode001-us-west-2 Feature=us-west-2
PartitionName=defq Nodes=cnode001-eu-west-1,cnode001-us-west-2,node001,ts-sl5slurm Default=YES MinNodes=1 MaxNodes=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO
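For reference, a small Python sketch that pulls the prolog/epilog settings out of a slurm.conf like the one above. The function name `script_settings` is made up for illustration, and the sketch only looks at the four parameters listed in the code. In this configuration the prolog is set through PrologSlurmctld, which slurmctld runs on the controller, while Epilog is run by slurmd on the compute node.

```python
# Sketch only: extract prolog/epilog script settings from slurm.conf text.
# Assumption: each setting of interest sits alone on its own KEY=VALUE line.

SAMPLE_CONF = """SlurmUser=slurm
ReturnToService=2
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
SlurmctldTimeout=5
NodeName=cnode001-eu-west-1 Feature=eu-west-1
"""

# slurm.conf parameter names are case-insensitive, so match on lowercase.
KNOWN = {
    name.lower(): name
    for name in ("Prolog", "Epilog", "PrologSlurmctld", "EpilogSlurmctld")
}


def script_settings(conf_text):
    """Return {parameter: path} for the prolog/epilog parameters found."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition("=")
        canonical = KNOWN.get(key.strip().lower())
        if sep and canonical:
            settings[canonical] = value.strip()
    return settings


if __name__ == "__main__":
    for name, path in script_settings(SAMPLE_CONF).items():
        print(f"{name} = {path}")
```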

Best regards,
Carles Fenoy

On 24/02/2012 17:51, "Taras Shapovalov" <[email protected]> wrote:

    Hi,

    Can nobody answer my question?

    --
    Best regards,
    Taras


    On Tue, Feb 21, 2012 at 5:44 PM, Taras Shapovalov <[email protected]> wrote:


        Dear developers,

        If we set up an epilog script (the Epilog parameter in slurm.conf),
        then the epilogs of different jobs are executed serially, one by
        one, if they run on the same node.

        As I understand it, after a job finishes, its NumCPUs value is
        automatically increased to the maximum number of CPU cores on the
        node. Therefore, other epilogs have to wait. Also, the administrator
        cannot change NumCPUs with 'scontrol update' during the epilog stage
        to allow other jobs (which are in their epilog stages) to run their
        epilogs too.

        Is it possible to allow several epilogs to execute in parallel on
        the same host?

        --
        Best regards,
        Taras




--
Best regards,
  Taras
