Hi,
On 23.04.2014 at 15:06, HUMMEL Michel wrote:
> I'm trying to configure my OGS to allow urgent priority jobs to requeue low
> priority jobs.
> It seems to work for a limited number of urgent priority jobs, but there is
> a limit above which the system doesn't work as expected.
> Here is the configuration I used:
>
> I have 7 nodes (named OGSE1-7) with 12 slots each, which means 84 slots in total.
>
> I have 2 queues using slotwise preemption :
> all.q and urgent.q (see configurations [1] and [2])
>
> To allow jobs to be requeued I have configured a checkpoint environment:
> $ qconf -sckpt Requeue
> ckpt_name Requeue
> interface APPLICATION-LEVEL
> ckpt_command
> /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
> $job_pid
> migr_command
> /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
> $job_pid
> restart_command NONE
> clean_command NONE
> ckpt_dir /tmp
> signal NONE
> when xsr
>
> I manage priority levels with complexes:
> p1, p2, p3 for "normal" jobs
> p0 for urgent jobs
> $ qconf -sc
> #name        shortcut   type   relop   requestable   consumable   default   urgency
> priority0    p0         BOOL   ==      FORCED        NO           FALSE     40
> priority1    p1         BOOL   ==      YES           NO           FALSE     30
> priority2    p2         BOOL   ==      YES           NO           FALSE     20
> priority3    p3         BOOL   ==      YES           NO           FALSE     10
>
> To limit the number of jobs running concurrently on the two queues I used an
> RQS on the all.q queue:
> $ qconf -srqs
> {
> name limit_DCH
> description NONE
> enabled TRUE
> limit queues {all.q} hosts {*} to slots=$num_proc
> }
>
> I submit 110 "normal jobs" and 84 of them are executed, 12 on each node.
>
> for i in $(seq 1 110); do qsub -l p1 -ckpt Requeue job.sh; done
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
> 12 job.sh all.q@OGSE1
> 12 job.sh all.q@OGSE2
> 12 job.sh all.q@OGSE3
> 12 job.sh all.q@OGSE4
> 12 job.sh all.q@OGSE5
> 12 job.sh all.q@OGSE6
> 12 job.sh all.q@OGSE7
>
> Then I submit 40 urgent jobs and it works as expected: all 40 are executed in
> the urgent.q queue and 40 jobs of all.q are requeued (state Rq):
> for i in $(seq 1 40); do qsub -l p0 job.sh; done
> (I grep OGSE in the output to only catch jobs which are assigned to a queue)
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
> 8 job.sh all.q@OGSE2
> 12 job.sh all.q@OGSE3
> 12 job.sh all.q@OGSE5
> 12 job.sh all.q@OGSE6
> 12 job.sh urgent.q@OGSE1
> 4 job.sh urgent.q@OGSE2
> 12 job.sh urgent.q@OGSE4
> 12 job.sh urgent.q@OGSE7
> As you can see there are only 12 jobs running on each node.
>
> It works until I reach 42 urgent jobs (I submitted the remaining ones one by
> one to find the exact limit).
> When the 42nd job starts, the requeue mechanism stops working: OGS begins to
> suspend a "normal" job, then migrates it to another node, then suspends
> another one, and so on, as long as there are 42 or more urgent jobs running
> or pending.
> $ qsub -l p0 job.sh;
> $ qsub -l p0 job.sh;
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
> 12 job.sh all.q@OGSE1
> 12 job.sh all.q@OGSE2
> 11 job.sh all.q@OGSE3
> 2 job.sh all.q@OGSE4
> 11 job.sh all.q@OGSE5
> 12 job.sh all.q@OGSE6
> 12 job.sh urgent.q@OGSE1
> 4 job.sh urgent.q@OGSE2
> 1 job.sh urgent.q@OGSE3
> 12 job.sh urgent.q@OGSE4
> 1 job.sh urgent.q@OGSE5
> 1 job.sh urgent.q@OGSE6
> 12 job.sh urgent.q@OGSE7
>
> If I qdel one urgent job, the system works as expected again: only 12 jobs
> run on each node and no jobs are left in the suspended state.
>
> Does someone have an idea of what's going on?
> Any help would be appreciated.
>
> Michel Hummel
>
> ------------------
> [1]
> $ qconf -sq all.q
> qname all.q
> hostlist OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval INFINITY
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list Requeue
> pe_list default distribute make
> rerun TRUE
> slots 84
> tmpdir /tmp
> shell /bin/sh
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists arusers
> xuser_lists NONE
> subordinate_list NONE
> complex_values priority1=TRUE,priority2=TRUE,priority3=TRUE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
> [2]
> $ qconf -sq urgent.q
> qname urgent.q
> hostlist @allhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority -20
> min_cpu_interval INFINITY
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list Requeue
> pe_list default make
> rerun FALSE
> slots 12
> tmpdir /tmp
> shell /bin/sh
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method SIGINT
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists arusers
> xuser_lists NONE
> subordinate_list slots=84(all.q:0:sr)
One thought about the subordinate_list above: this limit is per queue instance,
so it should never trigger any suspension at all unless you reach more than 84
slots per node (see the sketch below).
> complex_values priority0=True
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users