Hi all,
We've noticed an oddity when it comes to how JOBSIZE is calculated in a
priority/multifactor setup.
Here are two jobs both asking for 128 cores (the nodes have 8 cores each, so
that's 16 nodes), but one ends up with almost double the JOBSIZE value:
user1: salloc --ntasks 128 ....
user2: sbatch --ntasks 128 --nodes 16 ....
(I don't think it matters whether it's salloc or sbatch.)
The relevant values from "sprio -l":
JOBID  USER   PRIORITY    AGE  FAIRSHARE  JOBSIZE
48020  user1    138877  28989       2546     6343
48081  user2    138911  24503       1468    11940
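
To put numbers on "almost double", here is a quick sanity check in Python. It is only arithmetic on the sprio values above. The variable names are made up for illustration, and the other priority factors that sprio reports are not in this excerpt, so they are not included.

```python
# Values copied from the "sprio -l" output above; nothing here talks to Slurm.
jobs = {
    "user1": {"priority": 138877, "age": 28989, "fairshare": 2546, "jobsize": 6343},
    "user2": {"priority": 138911, "age": 24503, "fairshare": 1468, "jobsize": 11940},
}
u1, u2 = jobs["user1"], jobs["user2"]

# How much larger is user2's JOBSIZE component for the same 128-core request?
jobsize_ratio = u2["jobsize"] / u1["jobsize"]

# Per-factor difference; a positive number means user2 is ahead.
diff = {k: u2[k] - u1[k] for k in ("age", "fairshare", "jobsize", "priority")}

# Net effect of the three factors shown in the excerpt.
net_shown = diff["age"] + diff["fairshare"] + diff["jobsize"]

print(f"JOBSIZE ratio user2/user1: {jobsize_ratio:.2f}")
print(f"differences (user2 - user1): {diff}")
print(f"net of AGE + FAIRSHARE + JOBSIZE: {net_shown}")
```

So user1 is ahead by 5564 on AGE and FAIRSHARE combined, and the JOBSIZE gap of 5597 is just enough to flip the ordering. The net of 33 is one short of the 34-point PRIORITY difference, which is presumably down to rounding.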
In "scontrol show job" the relevant values are:
user1: NumNodes=16 NumCPUs=128
user2: NumNodes=16-16 NumCPUs=128
Basically it's a problem because user2 has figured this out and is using it to
game the system, and user1 is getting annoyed (their FAIRSHARE *should* win in
this case).
I've noticed this before in older versions (possibly back to 2.x), so it's not a
recent change.
Has anyone else noticed this?
We will speak to the people involved (they're both in the same group in this
instance, so we can ask them to play nicely with each other).
But it would be good if there was a way to harden the priority system against
it. I've looked in slurm.conf and can't see any parameter which might be
relevant.
Or is the current behaviour a desired feature for some reason that I'm not
seeing?
Thanks,
Paddy
--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/