I've run into a somewhat odd issue.  When a QOS limit is hit, SLURM does
not seem to update the counter that tracks how many jobs are running or
queued under that QOS, and as a result new job submissions keep being denied.

Right now we have a QOS named background with MaxSubmit=3000, and all QOSes
have Flags=DenyOnLimit.  If I submit 3000 jobs to SLURM, the next job I
submit is correctly denied at submit time.  I then use "scontrol update
job=<jobid> partition=hepx qos=hepx" to move a job out, so that squeue shows
2999 jobs in the background QOS.  When I then try to submit one job via
sbatch, that submission is still denied.

Below are examples of what's happening.  Is there some configurable delay
between a job being modified and SLURM updating the counters of what's
currently allocated to a QOS?  We have users who frequently submit large
job arrays to "background" and, as slots free up in their stakeholder
partition, modify their pending jobs to run in the stakeholder partition.
This is how they have invented their own type of job routing (they were
used to having a Torque routing queue).  Unfortunately the usage counted
against the limit does not seem to be updated when jobs are modified.

$ sacctmgr show qos background format=MaxSubmit
MaxSubmit
---------
     3000

$ squeue --array --noheader --qos background | wc -l
2999

$ sbatch -p background batches/sleep.slrm
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

From slurmctld:

[2015-01-27T11:36:07.205] debug2: job submit for user treydock(1380): qos max submit job limit exceeded 3000
[2015-01-27T11:36:07.205] _job_create: exceeded association/qos's limit for user 1380
[2015-01-27T11:36:07.205] _slurm_rpc_submit_batch_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

# Update job array that contained 8 jobs
$ scontrol update job=21446  partition=hepx qos=hepx

$ squeue --array --noheader --qos background | wc -l
2991

$ sbatch -p background batches/sleep.slrm
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

[2015-01-27T11:37:45.797] debug2: job submit for user treydock(1380): qos max submit job limit exceeded 3000
[2015-01-27T11:37:45.797] _job_create: exceeded association/qos's limit for user 1380
[2015-01-27T11:37:45.797] _slurm_rpc_submit_batch_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
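
For anyone who wants to repeat the comparison, the two checks above can be
combined into one small helper.  This is only a rough sketch: the function
name qos_headroom is made up for illustration, and it assumes squeue and
sacctmgr are on the PATH and that MaxSubmit is actually set on the QOS.  It
reports what squeue suggests is free, not what slurmctld counts internally,
which is exactly the gap described above.

```shell
# Hypothetical helper (not part of SLURM): print MaxSubmit minus the number
# of jobs squeue lists for a QOS.
qos_headroom() {
    qos="$1"
    # Configured limit; --noheader and --parsable2 drop the header and padding.
    limit=$(sacctmgr --noheader --parsable2 show qos "$qos" format=MaxSubmit)
    # Jobs squeue lists under the QOS, one line per array task.
    queued=$(squeue --array --noheader --qos "$qos" | wc -l)
    # Assumption: limit is a plain integer (an unset MaxSubmit prints empty).
    echo $((limit - queued))
}
```

With the numbers shown above (MaxSubmit of 3000, 2991 jobs listed),
"qos_headroom background" would print 9 even though sbatch still refuses
the submission.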

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]
