Hey Don,
Thanks for getting back to me... I actually discovered the true root
cause of the issue late last night, and I apologize for not copying the
list earlier. The newly submitted jobs from the second user requested a
full node in his sbatch script, while the first user's simpler script
grabbed every slot the moment it freed up, so a full node never became
available. It seems the multifactor priority plugin, and Slurm in
general, is working perfectly.
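For anyone who hits the same thing later: the second user's actual script was not posted, so this is only a hypothetical sketch of the kind of full-node request that caused the starvation (directive names are standard sbatch options; everything else is made up for illustration):

```shell
#!/bin/bash
# Hypothetical full-node job script; the real user's script was not
# posted to the list, so the payload here is illustrative only.
#SBATCH --nodes=1           # request one whole node...
#SBATCH --exclusive         # ...and refuse to share it with other jobs
# Because the first user's per-slot jobs grab each core the instant it
# frees, a fully idle node never appears and this job pends forever.
job_requirement="one entire idle node"
echo "This job needs $job_requirement before it can start."
```

Jobs that only need a slot or two can slip into freed cores immediately; an exclusive single-node request has to wait for all cores on some node to drain at once.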
Thanks for taking a look though.
AC
On 09/20/2013 11:58 AM, Lipari, Don wrote:
Alan,
Without more info, it is hard to tell. But the first thing I would consider is
that the lower priority jobs are being backfilled. See what you've defined in
your slurm.conf:
SchedulerType=sched/backfill
Don
-----Original Message-----
From: Alan V. Cowles [mailto:[email protected]]
Sent: Thursday, September 19, 2013 1:52 PM
To: slurm-dev
Subject: [slurm-dev] Problems with priority multifactor being ignored.
Hey guys,
Hopefully this is an easy one that others may have encountered: we are
curious whether any of the multifactor priority factors trump the
others once they are maxed out.
We are running slurm 2.5.4 on a cluster with 640 available slots.
We currently have fairshare set to 5000, counting down to 0; age
starting at 0 and counting up to 3000; and partition priority set the
same for everyone on all partitions, at 8000.
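For reference, the slurm.conf lines below are a hypothetical reconstruction of those weights (the parameter names are the standard multifactor-plugin ones; the values are simply the numbers quoted above, and our actual file may differ in other details):

```
# Hypothetical slurm.conf fragment matching the weights described above
PriorityType=priority/multifactor
PriorityWeightFairshare=5000
PriorityWeightAge=3000
PriorityWeightPartition=8000
```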
In our example case we are back to our classic problem user that submits
thousands of jobs to the default partition, and walks away for a week. She
takes all of the slots immediately available, and the rest of her jobs are
queued. Her fairshare value drops, and since these are lengthy jobs, the
age factor on her queued jobs increments up...
She hits the maximum value of 11000 (8000 + 3000 + 0) for her jobs
waiting in the queue.
A new user comes in and submits to the same partition. He should come in
with a higher priority by default, simply because his summed values are
8000 for partition, 5000 for fairshare, and 0 for age, so 13000.
And yet we are seeing the jobs at 11000 still jumping ahead of the
higher-priority jobs and running...
We thought perhaps there may be something about maxed-out priority
values jumping the queue, but what exactly are we missing here?
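Just to spell out the arithmetic (these are the weighted sums from our config, nothing more):

```shell
# partition + fairshare + age, per user
waiting_user=$((8000 + 0 + 3000))   # fairshare exhausted, age maxed
new_user=$((8000 + 5000 + 0))       # full fairshare, no age yet
echo "waiting: $waiting_user  new: $new_user"
```

So the new user's pending jobs should sit 2000 points above the old ones.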
Sample output from sprio -l:
 JOBID  USER   PRIORITY   AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
202545  bem28     11000  3000          0        0       8000    0     0
202546  bem28     11000  3000          0        0       8000    0     0
202547  bem28     11000  3000          0        0       8000    0     0
202548  bem28     11000  3000          0        0       8000    0     0
202549  bem28     11000  3000          0        0       8000    0     0
202550  bem28     11000  3000          0        0       8000    0     0
202551  bem28     11000  3000          0        0       8000    0     0
202552  bem28     11000  3000          0        0       8000    0     0
202553  bem28     11000  3000          0        0       8000    0     0
202554  bem28     11000  3000          0        0       8000    0     0
202555  bem28     11000  3000          0        0       8000    0     0
202556  bem28     11000  3000          0        0       8000    0     0
202653  bem28     11000  3000          0        0       8000    0     0
203965  ter18     12862   402       4460        0       8000    0     0
203967  ter18     12862   402       4460        0       8000    0     0
203969  ter18     12861   402       4460        0       8000    0     0
203971  ter18     12861   402       4460        0       8000    0     0
203973  ter18     12861   402       4460        0       8000    0     0
203975  ter18     12861   402       4460        0       8000    0     0
203977  ter18     12861   402       4460        0       8000    0     0
203979  ter18     12861   402       4460        0       8000    0     0
203981  ter18     12861   402       4460        0       8000    0     0
In this example his jobs have been waiting for about 7 hours, so he has
an age factor in play too. But as of a few minutes ago, the first user's
jobs are still jumping ahead of the second user's. So there is something
we are missing; we just don't know what.
Sample output of squeue:
197043 lowmem full_per bem28 PD 0:00 1 (Priority)
197044 lowmem full_per bem28 PD 0:00 1 (Priority)
197045 lowmem full_per bem28 PD 0:00 1 (Priority)
197046 lowmem full_per bem28 PD 0:00 1 (Priority)
197047 lowmem full_per bem28 PD 0:00 1 (Priority)
197048 lowmem full_per bem28 PD 0:00 1 (Priority)
197049 lowmem full_per bem28 PD 0:00 1 (Priority)
197050 lowmem full_per bem28 PD 0:00 1 (Priority)
196887 lowmem full_per bem28 R 3:10 1 hardac-node01-1
196888 lowmem full_per bem28 R 3:10 1 hardac-node04-1
196886 lowmem full_per bem28 R 3:19 1 hardac-node07-2
196885 lowmem full_per bem28 R 7:04 1 hardac-node06-2
196884 lowmem full_per bem28 R 11:49 1 hardac-node06-1
196883 lowmem full_per bem28 R 13:40 1 hardac-node03-3
Thoughts from any other slurm users would be greatly appreciated.
AC