Hi all,

We've noticed an oddity when it comes to how JOBSIZE is calculated in a
priority/multifactor setup.

Here are two jobs both asking for 128 cores (the nodes have 8 cores each, so
that's 16 nodes), but one ends up with almost double the JOBSIZE value:

user1: salloc --ntasks 128 ....
user2: sbatch --ntasks 128 --nodes 16 ....

(I don't think it makes a difference whether it's salloc or sbatch.)

The relevant values from "sprio -l":

          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE
          48020    user1     138877      28989       2546       6343
          48081    user2     138911      24503       1468      11940


In "scontrol show job" the relevant values are:

user1: NumNodes=16 NumCPUs=128
user2: NumNodes=16-16 NumCPUs=128


Basically it's a problem because user2 has figured this out and is using it to
game the system, and user1 is getting annoyed (their FAIRSHARE *should* win in
this case).
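
As a rough check on the numbers above, here is a small Python sketch that
subtracts JOBSIZE from each job's PRIORITY. It assumes that PRIORITY is a plain
sum of the weighted factor columns that sprio prints (that is my understanding
of the multifactor plugin); the helper name without_jobsize is made up purely
for illustration.

```python
# Values copied from the "sprio -l" output above.
jobs = {
    48020: {"user": "user1", "priority": 138877, "jobsize": 6343},
    48081: {"user": "user2", "priority": 138911, "jobsize": 11940},
}


def without_jobsize(job):
    """Priority with the JOBSIZE contribution removed.

    Assumption: PRIORITY is an additive sum of the weighted factors,
    so subtracting the JOBSIZE column isolates everything else.
    """
    return job["priority"] - job["jobsize"]


for jobid, job in jobs.items():
    print(jobid, job["user"], job["priority"], without_jobsize(job))

# Gap between the two jobs with and without the JOBSIZE factor.
gap_with = jobs[48081]["priority"] - jobs[48020]["priority"]
gap_without = without_jobsize(jobs[48020]) - without_jobsize(jobs[48081])
print("user2 ahead by", gap_with, "with JOBSIZE")
print("user1 ahead by", gap_without, "without JOBSIZE")
```

Under that assumption, user1's job would be ahead by 5563 if JOBSIZE were
ignored, whereas with JOBSIZE included user2's job is ahead by 34.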

I've noticed this before in older versions (possibly back to 2.x), so it's not a
recent change.

Has anyone else noticed this?


We will speak to the people involved (they're both in the same group in this
instance, so we can ask them to play nicely with each other).

But it would be good if there were a way to harden the priority system against
it. I've looked in slurm.conf and can't see any parameter which might be
relevant.

Or is the current behaviour a desired feature for some reason that I'm not
seeing?

Thanks,
Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
