Thank you for the hint (the slotwise limit is configured at 84 slots per node), but as you can see, preemption seems to work (at least until I reach 42 urgent jobs).

I tried with: subordinate_list slots=12(all.q:0:sr), and it gives me exactly the same result. Do you have any suggestion?
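For reference, such a subordinate_list change can be applied with a single qconf call (a sketch, assuming the urgent.q queue from configuration [2] below):

$ qconf -mattr queue subordinate_list "slots=12(all.q:0:sr)" urgent.q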
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, 23 April 2014 18:42
To: HUMMEL Michel
Cc: [email protected]
Subject: Re: [gridengine users] slotwise preemption and requeue

Hi,

On 23.04.2014 at 15:06, HUMMEL Michel wrote:

> I'm trying to configure my OGS to allow urgent priority jobs to requeue
> low priority jobs.
> It seems to work for a limited number of urgent priority jobs, but there
> is a limit above which the system doesn't work as expected.
> Here is the configuration I used:
>
> I have 7 nodes (named OGSE1-7) of 12 slots each, which means 84 slots in
> total.
>
> I have 2 queues using slotwise preemption: all.q and urgent.q (see
> configurations [1] and [2]).
>
> To allow the requeue of jobs I have configured a checkpoint:
>
> $ qconf -sckpt Requeue
> ckpt_name        Requeue
> interface        APPLICATION-LEVEL
> ckpt_command     /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh $job_pid
> migr_command     /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh $job_pid
> restart_command  NONE
> clean_command    NONE
> ckpt_dir         /tmp
> signal           NONE
> when             xsr
>
> I manage priority levels with complexes p1, p2, p3 for "normal" jobs and
> p0 for urgent jobs:
>
> $ qconf -sc
> priority0  p0  BOOL  ==  FORCED  NO  FALSE  40
> priority1  p1  BOOL  ==  YES     NO  FALSE  30
> priority2  p2  BOOL  ==  YES     NO  FALSE  20
> priority3  p3  BOOL  ==  YES     NO  FALSE  10
>
> To limit the number of jobs running concurrently on the two queues I used
> an RQS on the all.q queue:
>
> $ qconf -srqs
> {
>    name         limit_DCH
>    description  NONE
>    enabled      TRUE
>    limit        queues {all.q} hosts {*} to slots=$num_proc
> }
>
> I submit 110 "normal" jobs and 84 of them are executed, 12 on each node:
>
> for i in $(seq 1 110); do qsub -l p1 -ckpt Requeue job.sh; done
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>      12 job.sh all.q@OGSE1
>      12 job.sh all.q@OGSE2
>      12 job.sh all.q@OGSE3
>      12 job.sh all.q@OGSE4
>      12 job.sh all.q@OGSE5
>      12 job.sh all.q@OGSE6
>      12 job.sh all.q@OGSE7
>
> Then I submit 40 urgent jobs and it works as expected: the 40 are executed
> in the urgent.q queue and 40 jobs of all.q are requeued (state Rq):
>
> for i in $(seq 1 40); do qsub -l p0 job.sh; done
> (I grep OGSE on the output to only catch jobs which are assigned to a queue)
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>       8 job.sh all.q@OGSE2
>      12 job.sh all.q@OGSE3
>      12 job.sh all.q@OGSE5
>      12 job.sh all.q@OGSE6
>      12 job.sh urgent.q@OGSE1
>       4 job.sh urgent.q@OGSE2
>      12 job.sh urgent.q@OGSE4
>      12 job.sh urgent.q@OGSE7
>
> As you can see, there are only 12 jobs running on each node.
>
> It works until I reach 42 urgent jobs (I submitted the extra ones one by
> one to find the exact limit).
> When the 42nd job starts, the requeue system doesn't work anymore: OGS
> begins to suspend other "normal" jobs, then migrates them to another node,
> then suspends another, and so on, for as long as there are 42 or more
> urgent jobs running or pending.
>
> $ qsub -l p0 job.sh
> $ qsub -l p0 job.sh
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>      12 job.sh all.q@OGSE1
>      12 job.sh all.q@OGSE2
>      11 job.sh all.q@OGSE3
>       2 job.sh all.q@OGSE4
>      11 job.sh all.q@OGSE5
>      12 job.sh all.q@OGSE6
>      12 job.sh urgent.q@OGSE1
>       4 job.sh urgent.q@OGSE2
>       1 job.sh urgent.q@OGSE3
>      12 job.sh urgent.q@OGSE4
>       1 job.sh urgent.q@OGSE5
>       1 job.sh urgent.q@OGSE6
>      12 job.sh urgent.q@OGSE7
>
> If I qdel one urgent job, the system works again as expected: only 12 jobs
> run on each node and no jobs are in the suspended state.
>
> Does someone have an idea of what's going on?
> Any help will be appreciated.
>
> Michel Hummel
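The kill_tree.sh used by the Requeue checkpoint above is not included in the thread; purely as an illustration, a hypothetical minimal version (assuming it only needs to terminate the job's process tree, using the $job_pid that SGE passes as the first argument, so the job exits and can be requeued) could look like:

#!/bin/sh
# Hypothetical sketch only -- not the kill_tree.sh from the thread.
# Terminate the job's process and all of its descendants so the job
# exits and the scheduler can requeue it.
JOB_PID="$1"
# Collect the job PID and every descendant PID (GNU pstree output).
PIDS=$(pstree -p "$JOB_PID" | grep -o '([0-9]*)' | tr -d '()')
[ -n "$PIDS" ] && kill -TERM $PIDS 2>/dev/null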
> ------------------
> [1]
> $ qconf -sq all.q
> qname                 all.q
> hostlist              OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      INFINITY
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             Requeue
> pe_list               default distribute make
> rerun                 TRUE
> slots                 84
> tmpdir                /tmp
> shell                 /bin/sh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            arusers
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        priority1=TRUE,priority2=TRUE,priority3=TRUE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> [2]
> $ qconf -sq urgent.q
> qname                 urgent.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              -20
> min_cpu_interval      INFINITY
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             Requeue
> pe_list               default make
> rerun                 FALSE
> slots                 12
> tmpdir                /tmp
> shell                 /bin/sh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         SIGINT
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            arusers
> xuser_lists           NONE
> subordinate_list      slots=84(all.q:0:sr)

One first thought: this limit is per queue instance, so it should never trigger any suspension at all unless you go above 84 slots on a node.

-- Reuti

> complex_values        priority0=True
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
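A minimal sketch of the arithmetic behind that remark (this assumes, per one reading of the slotwise preemption documentation, that the threshold is compared per host against the total slots in use across urgent.q and all.q):

    per host: 12 slots (all.q) + 12 slots (urgent.q) = at most 24 slots in use
    slots=84 threshold: 24 <= 84, so the subordinate suspension never fires
    slots=12 threshold: every urgent.q job pushes the per-host total above 12,
                        so one all.q job per urgent job should be suspended and,
                        with the Requeue checkpoint ("when xsr"), requeued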
