Hello,

I'm looking into why some jobs are getting cancelled/requeued on my
cluster. The obvious hypothesis is priority (QOS) preemption, which was
recently turned on. But it seems to be happening far more often than it
should, given how many jobs are actually being submitted to a
preemption-capable QOS. I tried looking for jobs which were in the
PREEMPTED state at some point:

$ sacct --allusers --qos=normal --state=PREEMPTED --starttime=2017-06-1 \
    --duplicates --format=jobid,elapsed,qos,user,state,exitcode

There were very few results, and none of them were jobs from the users who
recently reported lots of preemptions.

When I looked up the jobs of one of these users, many of them had been in
the REQUEUED (but not PREEMPTED) state. But what is the REQUEUED state? I
can't find any mention of it in the documentation
<https://slurm.schedmd.com/sacct.html> (I searched for 'state_list'). Does
this mean that the jobs aren't being preempted due to priority?

We're running Slurm 16.05.4.

Thanks,
Evan
