Alan,

Without more info it's hard to tell, but the first thing I would check is 
whether the lower-priority jobs are being backfilled.  See what you've 
defined in your slurm.conf:

SchedulerType=sched/backfill
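
With backfill, lower-priority jobs are started whenever they can run without 
delaying the expected start time of any higher-priority job, which in squeue 
looks exactly like queue-jumping.  If that turns out to be the cause, one 
knob worth a look (example value, tune for your site) is:

    SchedulerParameters=bf_max_job_user=10

which caps how many jobs per user the backfill scheduler will consider in 
each pass.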

Don

> -----Original Message-----
> From: Alan V. Cowles [mailto:[email protected]]
> Sent: Thursday, September 19, 2013 1:52 PM
> To: slurm-dev
> Subject: [slurm-dev] Problems with priority multifactor being ignored.
> 
> Hey guys,
> 
> Hopefully this is an easy one that others have encountered. We are curious
> whether any of the multifactor priority factors trump the others once they
> are maxed out.
> 
> We are running slurm 2.5.4 on a cluster with 640 available slots.
> 
> We currently have the fairshare factor set to 5000, counting down to 0 as
> usage accrues; age starting at 0 and counting up to a cap of 3000; and the
> partition priority set to 8000, the same for everyone on all partitions.
> 
> In our example case we are back to our classic problem user that submits
> thousands of jobs to the default partition, and walks away for a week. She
> takes all of the slots immediately available, and the rest of her jobs are
> queued. Her fairshare value drops and as these are lengthy jobs, her age
> increments up...
> 
> She hits her maxed value of 11000 (8000 + 3000 + 0) for her jobs waiting
> in the queue.
> 
> A new user comes in and submits to the same partition... their jobs should
> come in with a higher priority by default, simply because their summed
> values are 8000 for partition, 5000 for fairshare, and 0 for age:
> 13000 total.
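
[Don, inline] The sums above can be sanity-checked with a quick sketch. This is a toy illustration using the numbers in this message; the function below is mine, not Slurm's actual computation, which multiplies normalized factors by the PriorityWeight* values:

```python
# Toy sketch of the additive priority sums quoted above; illustrative
# only, not Slurm's actual weighted multifactor computation.

def job_priority(age, fairshare, partition, jobsize=0, qos=0, nice=0):
    """Sum the per-factor contributions the way sprio -l reports them."""
    return age + fairshare + partition + jobsize + qos - nice

# bem28's long-waiting jobs: age capped at 3000, fairshare exhausted
print(job_priority(age=3000, fairshare=0, partition=8000))   # 11000

# the new user: no age accrued yet, full fairshare of 5000
print(job_priority(age=0, fairshare=5000, partition=8000))   # 13000
```

By this arithmetic the new user's pending jobs do sort first, so if they are still being passed over, suspect the backfill scheduler rather than the priority calculation itself.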
> 
> And yet we are seeing the jobs at 11000 still jumping ahead of the
> higher-priority jobs and running...
> 
> We thought perhaps there is something about maxed-out priority values
> jumping the queue, but what exactly are we missing here?
> 
> Sample output from sprio -l:
> 
> JOBID     USER   PRIORITY    AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
> 202545   bem28      11000   3000          0        0       8000    0     0
> 202546   bem28      11000   3000          0        0       8000    0     0
> 202547   bem28      11000   3000          0        0       8000    0     0
> 202548   bem28      11000   3000          0        0       8000    0     0
> 202549   bem28      11000   3000          0        0       8000    0     0
> 202550   bem28      11000   3000          0        0       8000    0     0
> 202551   bem28      11000   3000          0        0       8000    0     0
> 202552   bem28      11000   3000          0        0       8000    0     0
> 202553   bem28      11000   3000          0        0       8000    0     0
> 202554   bem28      11000   3000          0        0       8000    0     0
> 202555   bem28      11000   3000          0        0       8000    0     0
> 202556   bem28      11000   3000          0        0       8000    0     0
> 202653   bem28      11000   3000          0        0       8000    0     0
> 203965   ter18      12862    402       4460        0       8000    0     0
> 203967   ter18      12862    402       4460        0       8000    0     0
> 203969   ter18      12861    402       4460        0       8000    0     0
> 203971   ter18      12861    402       4460        0       8000    0     0
> 203973   ter18      12861    402       4460        0       8000    0     0
> 203975   ter18      12861    402       4460        0       8000    0     0
> 203977   ter18      12861    402       4460        0       8000    0     0
> 203979   ter18      12861    402       4460        0       8000    0     0
> 203981   ter18      12861    402       4460        0       8000    0     0
> 
> 
> In the example above, the second user's jobs have been waiting for about
> 7 hours, so the age factor is in play for him too... but as of a few
> minutes ago, the first user's jobs are still jumping ahead of his. There
> is something we are missing; we just don't know what.
> 
> Sample output of squeue:
> 
>  197043    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197044    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197045    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197046    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197047    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197048    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197049    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  197050    lowmem full_per    bem28  PD       0:00      1 (Priority)
>  196887    lowmem full_per    bem28   R       3:10      1 hardac-node01-1
>  196888    lowmem full_per    bem28   R       3:10      1 hardac-node04-1
>  196886    lowmem full_per    bem28   R       3:19      1 hardac-node07-2
>  196885    lowmem full_per    bem28   R       7:04      1 hardac-node06-2
>  196884    lowmem full_per    bem28   R      11:49      1 hardac-node06-1
>  196883    lowmem full_per    bem28   R      13:40      1 hardac-node03-3
> 
> 
> Thoughts from any other slurm users would be greatly appreciated.
> 
> AC
