Thank you for the hint (the slotwise limit is configured at 84 slots per node), but as you can see, preemption seems to work (at least until I reach 42 urgent jobs).

I tried with: subordinate_list slots=12(all.q:0:sr), and it gives me exactly the same result. Do you have any suggestion?
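For reference, such a subordinate_list change can be applied with a single qconf call (a sketch, assuming the urgent.q queue from configuration [2] below):

$ qconf -mattr queue subordinate_list "slots=12(all.q:0:sr)" urgent.q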
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, 23 April 2014 18:42
To: HUMMEL Michel
Cc: [email protected]
Subject: Re: [gridengine users] slotwise preemption and requeue

Hi,

On 23.04.2014 at 15:06, HUMMEL Michel wrote:

> I'm trying to configure my OGS to allow urgent priority jobs to requeue
> low priority jobs.
> It seems to work for a limited number of urgent priority jobs, but there
> is a limit above which the system doesn't work as expected.
> Here is the configuration I used:
>
> I have 7 nodes (named OGSE1-7) of 12 slots each, which means 84 slots in
> total.
>
> I have 2 queues using slotwise preemption: all.q and urgent.q (see
> configurations [1] and [2]).
>
> To allow the requeue of jobs I have configured a checkpoint:
>
> $ qconf -sckpt Requeue
> ckpt_name        Requeue
> interface        APPLICATION-LEVEL
> ckpt_command     /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh $job_pid
> migr_command     /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh $job_pid
> restart_command  NONE
> clean_command    NONE
> ckpt_dir         /tmp
> signal           NONE
> when             xsr
>
> I manage priority levels with complexes p1, p2, p3 for "normal" jobs and
> p0 for urgent jobs:
>
> $ qconf -sc
> priority0  p0  BOOL  ==  FORCED  NO  FALSE  40
> priority1  p1  BOOL  ==  YES     NO  FALSE  30
> priority2  p2  BOOL  ==  YES     NO  FALSE  20
> priority3  p3  BOOL  ==  YES     NO  FALSE  10
>
> To limit the number of jobs running concurrently on the two queues I used
> an RQS on the all.q queue:
>
> $ qconf -srqs
> {
>    name         limit_DCH
>    description  NONE
>    enabled      TRUE
>    limit        queues {all.q} hosts {*} to slots=$num_proc
> }
>
> I submit 110 "normal" jobs and 84 of them are executed, 12 on each node:
>
> for i in $(seq 1 110); do qsub -l p1 -ckpt Requeue job.sh; done
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>      12 job.sh all.q@OGSE1
>      12 job.sh all.q@OGSE2
>      12 job.sh all.q@OGSE3
>      12 job.sh all.q@OGSE4
>      12 job.sh all.q@OGSE5
>      12 job.sh all.q@OGSE6
>      12 job.sh all.q@OGSE7
>
> Then I submit 40 urgent jobs and it works as expected: the 40 are executed
> in the urgent.q queue and 40 jobs of all.q are requeued (state Rq):
>
> for i in $(seq 1 40); do qsub -l p0 job.sh; done
> (I grep OGSE on the output to only catch jobs which are assigned to a queue)
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>       8 job.sh all.q@OGSE2
>      12 job.sh all.q@OGSE3
>      12 job.sh all.q@OGSE5
>      12 job.sh all.q@OGSE6
>      12 job.sh urgent.q@OGSE1
>       4 job.sh urgent.q@OGSE2
>      12 job.sh urgent.q@OGSE4
>      12 job.sh urgent.q@OGSE7
>
> As you can see, there are only 12 jobs running on each node.
>
> It works until I reach 42 urgent jobs (I submitted the extra ones one by
> one to find the exact limit).
> When the 42nd job starts, the requeue system doesn't work anymore: OGS
> begins to suspend other "normal" jobs, then migrates them to another node,
> then suspends another, and so on, for as long as there are 42 or more
> urgent jobs running or pending.
>
> $ qsub -l p0 job.sh
> $ qsub -l p0 job.sh
> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>      12 job.sh all.q@OGSE1
>      12 job.sh all.q@OGSE2
>      11 job.sh all.q@OGSE3
>       2 job.sh all.q@OGSE4
>      11 job.sh all.q@OGSE5
>      12 job.sh all.q@OGSE6
>      12 job.sh urgent.q@OGSE1
>       4 job.sh urgent.q@OGSE2
>       1 job.sh urgent.q@OGSE3
>      12 job.sh urgent.q@OGSE4
>       1 job.sh urgent.q@OGSE5
>       1 job.sh urgent.q@OGSE6
>      12 job.sh urgent.q@OGSE7
>
> If I qdel one urgent job, the system works again as expected: only 12 jobs
> run on each node and no jobs are in the suspended state.
>
> Does someone have an idea of what's going on?
> Any help will be appreciated.
>
> Michel Hummel
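The kill_tree.sh used by the Requeue checkpoint above is not included in the thread; purely as an illustration, a hypothetical minimal version (assuming it only needs to terminate the job's process tree, using the $job_pid that SGE passes as the first argument, so the job exits and can be requeued) could look like:

#!/bin/sh
# Hypothetical sketch only -- not the kill_tree.sh from the thread.
# Terminate the job's process and all of its descendants so the job
# exits and the scheduler can requeue it.
JOB_PID="$1"
# Collect the job PID and every descendant PID (GNU pstree output).
PIDS=$(pstree -p "$JOB_PID" | grep -o '([0-9]*)' | tr -d '()')
[ -n "$PIDS" ] && kill -TERM $PIDS 2>/dev/null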
> ------------------
> [1]
> $ qconf -sq all.q
> qname                 all.q
> hostlist              OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      INFINITY
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             Requeue
> pe_list               default distribute make
> rerun                 TRUE
> slots                 84
> tmpdir                /tmp
> shell                 /bin/sh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            arusers
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        priority1=TRUE,priority2=TRUE,priority3=TRUE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> [2]
> $ qconf -sq urgent.q
> qname                 urgent.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              -20
> min_cpu_interval      INFINITY
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             Requeue
> pe_list               default make
> rerun                 FALSE
> slots                 12
> tmpdir                /tmp
> shell                 /bin/sh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         SIGINT
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            arusers
> xuser_lists           NONE
> subordinate_list      slots=84(all.q:0:sr)

One first thought: this limit is per queue instance, so it should never trigger any suspension at all unless you go above 84 slots on a node.

-- Reuti

> complex_values        priority0=True
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
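A minimal sketch of the arithmetic behind that remark (this assumes, per one reading of the slotwise preemption documentation, that the threshold is compared per host against the total slots in use across urgent.q and all.q):

    per host: 12 slots (all.q) + 12 slots (urgent.q) = at most 24 slots in use
    slots=84 threshold: 24 <= 84, so the subordinate suspension never fires
    slots=12 threshold: every urgent.q job pushes the per-host total above 12,
                        so one all.q job per urgent job should be suspended and,
                        with the Requeue checkpoint ("when xsr"), requeued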
