Hello Joan,

Joan Arbona <[email protected]> writes:

> Hello all,
>
> We have realized that the backfill plugin in our cluster is not working as
> we expected. When a user submits jobs using a smaller set of nodes, they
> always start running before jobs with a larger set of nodes, even if the
> latter have higher priority.
>
> Our cluster has:
>
> - 1 partition of 40 nodes called THIN. 5 of them are taken by a
> reservation every day, so they are unusable.
> - Default MaxTime of the THIN partition is 3 days (4320 minutes)
> - Fairshare priority scheme
> - Backfill scheduler
> - Backfill parameters are all set to default.
>
> Let's assume the following circumstance:
>
> 1. User A submits 10-node jobs regularly, say two or three times a day.
> Those nodes are exclusive to him. He does not specify any time limit, so
> each job's max time is 3 days.
> 2. User B submitted one 30-node job on the 26th of October. This job is
> waiting for user A's jobs to finish. B's job has higher priority than A's.
>
> The following table shows the output of smap (the digits in the node map
> are the JOBIDs listed below):
>
> .........333333333322222222221111111111.
>
> JOBID PARTITION USER NAME   ST TIME       NODES NODELIST
> 1     thin      A    gromac R  1-00:11:19 10    foner[132-141]
> 2     thin      A    gromac R  21:33:49   10    foner[122-131]
> 3     thin      A    gromac R  13:31:49   10    foner[112-121]
> 4     thin      B    DART_c PD 00:00:00   30    waiting...
> 5     thin      A    gromac PD 00:00:00   10    waiting...
>
> In theory, due to backfill, when user A finishes any of his running jobs
> (1, 2 or 3), the scheduler should not start job 5, even though job 4 does
> not yet fit in the cluster. The reason is that job 5 has lower priority
> than job 4, and backfill must not delay the start of higher-priority
> jobs. The scheduler should instead wait until more of A's jobs finish and
> then start job 4.
>
> Well, this does not happen. As user A is submitting jobs all the time,
> his new jobs keep filling the holes that his finished jobs leave, because
> job 4 doesn't fit in them (it needs 30 nodes, not 10). As a result, job 4
> will never start until user A stops sending jobs.
>
> I have reproduced this in a test environment using sleeps. I get the same
> behavior whenever jobs are submitted with a larger Slurm time limit
> (--time) than the actual duration of the command (the sleep time). I have
> also tried adjusting parameters like bf_window, which is set to one day
> by default, without luck.
>
> Does anybody know why this happens? Why does the backfill principle of
> not delaying higher-priority jobs not apply in this case? Is there a way
> to solve this?
>
> Thanks,
> Joan
>
> Attaching slurm.conf and the output of squeue:
>
> squeue --start
>
> JOBID PARTITION NAME     USER ST START_TIME          NODES NODELIST(REASON)
> 5     thin      gromacs_ A    PD 2014-11-06T12:06:19 10    (Priority)
> 4     thin      DART_cyc B    PD 2014-11-06T22:45:49 30    (Resources)
>
> In fact, job 4's START_TIME keeps changing whenever one of user A's jobs
> starts running. Maybe backfill can't calculate the start time accurately?

One thing you might need to look at is the value of the scheduler
parameter 'bf_window'.  The default value is 1440 minutes (1 day), but it
should probably be at least as large as your MaxTime, i.e.

SchedulerParameters=bf_window=4320
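
Note that SchedulerParameters is a single comma-separated list, so if you
already have other scheduler options configured, append bf_window to them
rather than replacing them. A sketch, assuming for illustration that you
also wanted bf_continue:

SchedulerParameters=bf_continue,bf_window=4320

After running 'scontrol reconfigure' you can verify the active value with:

scontrol show config | grep SchedulerParameters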

See 'man slurm.conf' for more details.
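
Apart from that, your own sleep test points at a second factor: backfill
can only plan with the time limits that jobs declare. Because user A's
jobs run under the default 3-day limit but finish much sooner, the
estimated start of job 4 is far too pessimistic; if that estimate falls
beyond bf_window, no reservation is made to protect job 4 at all, and each
new 10-node job appears to start "for free". That would also explain why
job 4's START_TIME keeps slipping in 'squeue --start'. Encouraging users
to request realistic limits should make the estimates usable. A sketch of
such a job script (the 12-hour limit and application name are just
placeholders):

#!/bin/bash
#SBATCH --nodes=10
#SBATCH --partition=thin
#SBATCH --time=12:00:00    # realistic limit so backfill can plan around it
srun ./my_gromacs_run      # hypothetical application command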

Cheers,

Loris

-- 
This signature is currently under construction.
