Hi Paddy,

Paddy Doyle writes:

> Hi all,
>
> We've noticed an oddity when it comes to how JOBSIZE is calculated in a
> priority/multifactor setup.
>
> Here are two jobs both asking for 128 cores (the nodes have 8 cores each, so
> that's 16 nodes), but one ends up with almost double the JOBSIZE value:
>
> user1: salloc --ntasks 128 ....
> user2: sbatch --ntasks 128 --nodes 16 ....
>
> (I don't think that there's a difference whether it's salloc/sbatch)
>
> The relevant values from "sprio -l":
>
>           JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE
>           48020    user1     138877      28989       2546       6343
>           48081    user2     138911      24503       1468      11940
>
>
> In "scontrol show job" the relevant values are:
>
> user1: NumNodes=16 NumCPUs=128
> user2: NumNodes=16-16 NumCPUs=128
>
>
> Basically it's a problem because user2 has figured this out and is using it to
> game the system, and user1 is getting annoyed (their FAIRSHARE *should* win in
> this case).
>
> I've noticed this before in older versions (possibly back to 2.x), so it's not a
> recent change.
>
> Has anyone else noticed this?
>
>
> We will speak to the people involved (they're both in the same group in this
> instance, so we can ask them to play nicely with each other).
>
> But it would be good if there was a way to harden the priority system against
> it. I've looked in slurm.conf and can't see any parameter which might be
> relevant.
>
> Or is the current behaviour a desired feature for some reason that I'm not
> seeing?
>
> Thanks,
> Paddy

If jobs are sharing nodes, I would say it is desirable to be able to
distinguish between a job that wants just any 128 cores and one that
wants 16 complete 8-core nodes.  The former, if CPU-intensive, might
well be able to fill up cores left empty by memory-intensive jobs; the
latter requires that complete nodes be drained to make space for it,
which can be a waste of resources.

Since our nodes are shared, we usually try to talk users out of
specifying the number of nodes for MPI jobs: the reduced wait time
often makes up for any loss of efficiency due to the increased
spread across nodes and switches.
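
For what it's worth, the two JOBSIZE values you quote look consistent
with a simple model. The sketch below assumes (I have not checked this
against the source of your version) that the job-size factor is the mean
of the requested-node fraction and the requested-CPU fraction, and that
a job submitted without --nodes is counted with a minimum of one node.
The function names and the cluster sizes in the loop are hypothetical;
only the 8 cores per node, the 128 CPUs and the two sprio values come
from your mail.

```python
from fractions import Fraction

CORES_PER_NODE = 8  # from the original mail


def jobsize_factor(min_nodes, cpus, total_nodes):
    """Assumed model: mean of node fraction and CPU fraction."""
    total_cpus = total_nodes * CORES_PER_NODE
    node_part = Fraction(min_nodes, total_nodes)
    cpu_part = Fraction(cpus, total_cpus)
    return (node_part + cpu_part) / 2


def jobsize_ratio(total_nodes):
    """Ratio of user2's factor to user1's under the assumed model."""
    user1 = jobsize_factor(1, 128, total_nodes)   # --ntasks 128 (assumed min 1 node)
    user2 = jobsize_factor(16, 128, total_nodes)  # --ntasks 128 --nodes 16
    return user2 / user1


# Hypothetical cluster sizes: the ratio does not depend on them.
for total_nodes in (64, 100, 500):
    print(total_nodes, jobsize_ratio(total_nodes))

# Ratio observed in the sprio output quoted above.
print(11940 / 6343)
```

Under these assumptions the ratio is 32/17 whatever the cluster size,
which is very close to 11940/6343. If that is what is happening, the
"almost double" comes from the node term alone, since both jobs request
the same number of CPUs.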

Cheers,

Loris

-- 
This signature is currently under construction.
