Hello,
following an upgrade from 14.11.7 to 15.08.8, we observe what appears to be a bug.

We can reproduce the bug on clusters with both TaskPlugin=task/affinity and
TaskPlugin=task/cgroup.

The background is that one of our clusters has two groups of nodes with 
different core counts (16 and 24) so we advise users not to specify the number 
of nodes in order to make best use of the resources. The topology plugin 
ensures that jobs will run on one group or the other but never span both.
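
For reference, the separation between the two groups is done with the tree topology plugin; the sketch below shows the kind of topology.conf we mean, with each core-count group behind its own switch and no common parent (switch and node names here are invented for illustration, not our real configuration), together with TopologyPlugin=topology/tree in slurm.conf.


# topology.conf (illustrative sketch only)
# each core-count group sits behind its own switch, with no common
# parent, so an allocation never spans both groups
SwitchName=sw16 Nodes=c16-[001-120]
SwitchName=sw24 Nodes=c24-[001-120]
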

The problem is as follows:

I submit a simple job


#!/bin/sh
#SBATCH --ntasks=1536
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00


We then see


$ scontrol show job 432224
JobId=432224 JobName=run64x24.job
   UserId=eroche(141633) GroupId=scitas-ge(11902)
   Priority=105081 Nice=0 Account=scitas-ge QOS=scitas
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2016-02-24T10:04:24 EligibleTime=2016-02-24T10:04:24
   StartTime=2016-02-25T15:56:16 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=parallel AllocNode:Sid=deneb2:9853
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=64 NumCPUs=1536 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1536,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/eroche/jobs/run64x24.job
   WorkDir=/home/eroche/jobs
   StdErr=/home/eroche/jobs/slurm-432224.out
   StdIn=/dev/null
   StdOut=/home/eroche/jobs/slurm-432224.out
   Power= SICP=0


If the administrator then updates the priority of this job


scontrol update jobid=432224 priority=10000000

The job information changes and the job is held with Reason=BadConstraints.


$ scontrol show job 432224
JobId=432224 JobName=run64x24.job
   UserId=eroche(141633) GroupId=scitas-ge(11902)
   Priority=0 Nice=0 Account=scitas-ge QOS=scitas
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2016-02-24T10:04:24 EligibleTime=2016-02-24T10:04:24
   StartTime=2016-02-24T10:04:54 EndTime=2016-02-24T10:04:54
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=parallel AllocNode:Sid=deneb2:9853
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=64 NumCPUs=1536 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1536,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1536 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/eroche/jobs/run64x24.job
   WorkDir=/home/eroche/jobs
   StdErr=/home/eroche/jobs/slurm-432224.out
   StdIn=/dev/null
   StdOut=/home/eroche/jobs/slurm-432224.out
   Power= SICP=0


Looking at the scheduler logs, we see


[2016-02-24T10:04:53.742] update_job: setting pn_min_cpus from 1 to 1536 for job_id 432224
[2016-02-24T10:04:53.742] sched: update_job: setting priority to 10000000 for job_id 432224
[2016-02-24T10:04:53.742] debug2: initial priority for job 432224 is 10000000
[2016-02-24T10:04:53.743] _slurm_rpc_update_job complete JobId=432224 uid=0 usec=427
[2016-02-24T10:04:53.743] debug3: Writing job id 432224 to header record of job_state file

So for some reason pn_min_cpus gets set to the total number of tasks (1536), and as we
don't have any nodes with 1536 cores, the scheduler marks the job as non-runnable.


[2016-02-24T10:04:54.224] _build_node_list: No nodes satisfy job 432224 requirements in partition parallel
[2016-02-24T10:04:54.224] sched: schedule: JobID=432224 State=0x0 NodeCnt=0 non-runnable:Requested node configuration is not available
[2016-02-24T10:04:57.229] debug3: sched: JobId=432224. State=PENDING. Reason=BadConstraints. Priority=0.
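
For what it's worth, the per-node CPU counts in the partition can be checked with something like the following (the format string is just one possible choice):

$ sinfo -p parallel -o "%D %c %N"

which shows that our largest nodes have 24 CPUs, a long way short of the 1536 now being demanded per node.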


If we try to change this


scontrol update jobid=432224 MinCPUsNode=1

we see


[2016-02-24T10:05:32.332] debug3: JobDesc: user_id=4294967294 job_id=432224 partition=(null) name=(null)
[2016-02-24T10:05:32.332] update_job: setting pn_min_cpus to 1 for job_id 432224
[2016-02-24T10:05:32.332] update_job: setting pn_min_cpus from 1 to 1536 for job_id 432224
[2016-02-24T10:05:32.333] _slurm_rpc_update_job complete JobId=432224 uid=0 usec=505
[2016-02-24T10:05:32.333] debug3: Writing job id 432224 to header record of job_state file


So pn_min_cpus gets changed to 1 and then immediately set back to 1536.

The only way we have found to fix this is with:


scontrol update jobid=432224 NumNodes=64-64

which results in


[2016-02-24T10:05:58.476] debug3: JobDesc: user_id=4294967294 job_id=432224 partition=(null) name=(null)
[2016-02-24T10:05:58.476] update_job: setting min_nodes from 1 to 64 for job_id 432224
[2016-02-24T10:05:58.476] update_job: setting pn_min_cpus from 1536 to 24 for job_id 432224
[2016-02-24T10:05:58.476] _slurm_rpc_update_job complete JobId=432224 uid=0 usec=387
[2016-02-24T10:05:58.477] debug3: Writing job id 432224 to header record of job_state file


Does anybody know why pn_min_cpus is being “incorrectly” set and therefore blocking the job?

If the job script includes “-N 64” (i.e. an explicit node count), everything works correctly.
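
In other words, a script along these lines (the same job as above but with an explicit node count) is scheduled without any intervention:


#!/bin/sh
#SBATCH --ntasks=1536
#SBATCH --cpus-per-task=1
#SBATCH --nodes=64
#SBATCH --time=00:10:00
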


Thanks

Ewan Roche

SCITAS
Ecole Polytechnique Fédérale de Lausanne



