Hi,

I've found that restarting slurmctld 2.3.3 _cancels running jobs_ whose
QoS is no longer valid for their association. Running jobs are not killed
when the QoS level is changed, or when I run reconfig instead of restart.

Our setup is as follows:

- a "normal" QoS for accounts that haven't used their monthly allocation,
with priority 1000
- a "bonus" QoS for accounts that have used up their allocation, with
priority 100
- a "disable" QoS for accounts that have used up 2x their allocation, with
priority 0 and other limits to prevent jobs from starting
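
The three tiers above amount to a simple threshold rule. A minimal Python
sketch, for illustration only (the function and parameter names are
hypothetical, and treating usage exactly at a limit as "used up" is an
assumption):

```python
def select_qos(usage, allocation):
    """Pick the QoS tier for an account from its monthly usage.

    usage and allocation must be in the same unit (e.g. core-hours).
    Hypothetical helper; the names are not taken from any real script.
    """
    if usage >= 2 * allocation:
        # Used up 2x the allocation: jobs are prevented from starting.
        return "disable"
    if usage >= allocation:
        # Allocation used up: jobs still run, at priority 100.
        return "bonus"
    # Allocation not yet used up: priority 1000.
    return "normal"
```

For example, select_qos(150, 100) gives "bonus" and select_qos(200, 100)
gives "disable".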

An hourly cron job checks usage and adjusts the QoS when necessary,
including the QoS of pending jobs. However, it doesn't touch running jobs,
since changing the QoS of a running job doesn't work. When I try, I get:

[root@control ~]# scontrol update job=19162 qos=glennbonus
slurm_update error: Requested operation is presently disabled

I've seen this discussed on the slurm-dev list previously, so we know this
is by design.

On the first of each month, the QoS levels are reset to "normal"
(including pending jobs). This is no problem for running jobs, even if
they are running under the "bonus" QoS - they keep running. They also keep
running if we reconfigure Slurm. However, if I restart the slurmctld
daemon, it checks running jobs on startup and cancels those whose QoS
their association no longer has access to. See this excerpt from ctld.log:

[2012-05-02T12:32:59] Recovered state of 351 nodes
[2012-05-02T12:32:59] recovered job step 16163.0
[2012-05-02T12:32:59] Recovered job 16163 353
[2012-05-02T12:32:59] error: This association 353(account='c3se001-12-5',
user='username', partition='glenn') does not have access to qos glennbonus
[2012-05-02T12:32:59] Cancelling job 16163 with invalid qos
[2012-05-02T12:32:59] recovered job step 16384.0
[2012-05-02T12:32:59] Recovered job 16384 353
[2012-05-02T12:32:59] error: This association 353(account='c3se001-12-5',
user='username', partition='glenn') does not have access to qos glennbonus
[2012-05-02T12:32:59] Cancelling job 16384 with invalid qos
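
To see which jobs were affected after a restart, the "Cancelling job"
lines can be pulled out of the log. A rough Python sketch, assuming the
message format is exactly as in the excerpt above (the function name is
made up, and other Slurm versions may word the message differently):

```python
import re

# Matches lines like:
# [2012-05-02T12:32:59] Cancelling job 16163 with invalid qos
CANCEL_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] Cancelling job (?P<job_id>\d+) with invalid qos"
)

def cancelled_jobs(lines):
    """Return a list of (timestamp, job_id) for each cancellation line."""
    found = []
    for line in lines:
        match = CANCEL_RE.match(line)
        if match:
            found.append((match.group("ts"), int(match.group("job_id"))))
    return found
```

It can be fed an open file directly, e.g. cancelled_jobs(open("ctld.log")),
where the path is whatever your slurmctld log file is.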

I haven't seen this behavior documented. Is it by design, or by accident?

It was a rather nasty surprise to suddenly see a lot fewer running jobs.

Regards,

Johan Alvbring, C3SE
