Answers below...

On Tue, Feb 28, 2012 at 12:35 PM, Taras Shapovalov <
[email protected]> wrote:

> Hi Carles,
>
>
> On 02/27/2012 11:28 PM, Carles Fenoy wrote:
>
> We are running slurm 2.3.2 and found no issues when two epilogs run
> simultaneously on the same node.
>
> We are performing tests on 2.3.3.
>
>
>  What do you do inside your epilogs?
>
> Nothing special: database queries + several rsyncs.
>
>
>  I don't understand what you mean with increasing the NumCPUs.
>
> I mean this one:
>
> [taras@ts-sl5slurm ~]$ date
> Tue Feb 28 11:59:45 CET 2012
>
> [taras@ts-sl5slurm ~]$ scontrol show job 1587
> JobId=1587 Name=job.3-us-west-2-upload
>    UserId=taras(1002) GroupId=taras(1002)
>    Priority=4294901104 Account=(null) QOS=normal
>    JobState=PENDING Reason=Priority Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>    SubmitTime=2012-02-28T11:53:08 EligibleTime=2012-02-28T11:53:08
>    StartTime=Unknown EndTime=Unknown
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=defq AllocNode:Sid=ts-sl5slurm:25574
>    ReqNodeList=ts-sl5slurm ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:*:*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>    Command=/tmp/upload-_bF-OA.job
>    WorkDir=/home/taras/cmsub
>
> [taras@ts-sl5slurm ~]$ squeue -o '%.7i %.28j %.6t %.20R'
>   JOBID                         NAME     ST     NODELIST(REASON)
>    1586     job.2-eu-west-1-download     CG          ts-sl5slurm
>    1587       job.3-us-west-2-upload     PD          (Resources)
>
> [taras@ts-sl5slurm ~]$ date
> Tue Feb 28 12:01:30 CET 2012
>
> [taras@ts-sl5slurm ~]$ scontrol show job 1587
> JobId=1587 Name=job.3-us-west-2-upload
>    UserId=taras(1002) GroupId=taras(1002)
>    Priority=4294901104 Account=(null) QOS=normal
>    JobState=COMPLETING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>    SubmitTime=2012-02-28T11:53:08 EligibleTime=2012-02-28T11:53:08
>    StartTime=2012-02-28T12:01:18 EndTime=2012-02-28T12:01:18
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=defq AllocNode:Sid=ts-sl5slurm:25574
>    ReqNodeList=ts-sl5slurm ExcNodeList=(null)
>    NodeList=ts-sl5slurm
>    BatchHost=ts-sl5slurm
>    NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=1:*:*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>    Command=/tmp/upload-_bF-OA.job
>    WorkDir=/home/taras/cmsub
>
>
> So at first jobs 1586 and 1587 had NumCPUs=1. Then these jobs finished
> in parallel on node ts-sl5slurm, their NumCPUs was automatically increased
> to the maximum value (12 CPUs), and the state of 1587 changed to PD,
> because all CPUs are allocated to the epilog of job 1586.
>
>
As far as I can see here, job 1587 waits until job 1586 finishes. In
the squeue output you added, 1586 is completing, probably because its
epilog is running. When it finishes, job 1587 starts. As you have
Shared=NO in your partition configuration, slurm considers that the job
has used all the CPUs in the node and sets NumCPUs=12.
So the 2 jobs are NOT running in parallel, but sequentially, because of
the non-shared partition.
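
A minimal sketch for comparing the values yourself, assuming a POSIX
shell with tr and awk. get_field is a hypothetical helper name, not a
Slurm command, and the two sample lines are copied from the scontrol
output you quoted above:

```shell
# Print the value of KEY from "scontrol show job"-style KEY=VALUE text on stdin.
# get_field is a hypothetical helper for illustration, not part of Slurm.
get_field() {
    # Put every KEY=VALUE pair on its own line, then match the key exactly.
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Sample lines taken from the two scontrol outputs quoted in this thread.
before='NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:*:*'
after='NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=1:*:*'

echo "NumCPUs while pending:  $(printf '%s\n' "$before" | get_field NumCPUs)"
echo "NumCPUs once allocated: $(printf '%s\n' "$after" | get_field NumCPUs)"

# On a live system (assuming scontrol is in PATH and the job is still known):
#   scontrol show job 1587 | get_field NumCPUs
#   scontrol show partition defq | get_field Shared
```

If the second command reports Shared=NO, the jump from NumCPUs=1 to
NumCPUs=12 matches the whole-node allocation described above.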

Hope this answers your question.

Regards,
Carles Fenoy



>  Can you post your configuration?
>
> slurm.conf:
>
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/cm/shared/apps/slurm/current/cm/statesave
> SlurmdSpoolDir=/cm/local/apps/slurm/2.3.3/spool
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=2
> PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
> Epilog=/cm/local/apps/cmd/scripts/epilog
> SlurmctldTimeout=5
> SlurmdTimeout=5
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> FastSchedule=0
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurmctld
> SlurmdDebug=7
> SlurmdLogFile=/var/log/slurmd
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> AccountingStorageType=accounting_storage/slurmdbd
> GresTypes=gpu,eu-west-1,us-west-2
> SchedulerType=sched/backfill
> ControlMachine=ts-sl5slurm
> ControlAddr=ts-sl5slurm
> NodeName=node001
> NodeName=cnode001-eu-west-1 Feature=eu-west-1
> NodeName=ts-sl5slurm
> NodeName=cnode001-us-west-2 Feature=us-west-2
> PartitionName=defq Nodes=cnode001-eu-west-1,cnode001-us-west-2,node001,ts-sl5slurm Default=YES MinNodes=1 MaxNodes=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO
>
>
>  Best regards,
> Carles Fenoy
> El 24/02/2012 17:51, "Taras Shapovalov" <
> [email protected]> escribió:
>
>>  Hi,
>>
>>  Can nobody answer my question?
>>
>>  --
>>  Best regards,
>>  Taras
>>
>>
>>  On Tue, Feb 21, 2012 at 5:44 PM, Taras Shapovalov <
>> [email protected]> wrote:
>>
>>>
>>> Dear developers,
>>>
>>>  If we set up an epilog script (the Epilog parameter in slurm.conf), then
>>> the epilogs of different jobs are executed serially, one by one (if they
>>> run on the same node).
>>>
>>>  As I understand it, after a job finishes, its NumCPUs value is
>>> automatically increased to the maximum number of CPU cores on the node.
>>> Therefore, other epilogs have to wait. Also, the administrator cannot
>>> change NumCPUs with 'scontrol update' during the epilog stage to allow
>>> other jobs (which are in their epilog stages) to run their epilogs too.
>>>
>>>  Is it possible to allow several epilogs to execute in parallel on the
>>> same host?
>>>
>>> --
>>> Best regards,
>>> Taras
>>>
>>
>>
>>
> --
> Best regards,
>   Taras
>
>


-- 
Carles Fenoy
