On 24.04.2014 at 09:22, HUMMEL Michel wrote:
> Thank you for the hint (slotwise configured at 84 slots per node), but as
> you can see, preemption seems to work (at least until you reach the 42
> urgent jobs).
>
> I tried with:
>
> subordinate_list slots=12(all.q:0:sr)
>
> and it gives me the exact same result. Do you have any suggestion?
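Before changing anything else, it may help to watch the combined slot usage per host, since the slot-wise threshold is evaluated per queue instance. Here is a quick sketch built on your own qstat pipeline (the " r " match for the state column is an assumption about your qstat output layout):

for h in OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7; do
    # count jobs in state "r" on this host, across all.q and urgent.q
    echo -n "$h: "
    qstat | grep "@$h" | grep -c " r "
done

Once preemption works as intended, this should never report more than 12 per host.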
Okay. Another thing that caught my eye:

> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Wednesday, 23 April 2014 18:42
> To: HUMMEL Michel
> Cc: [email protected]
> Subject: Re: [gridengine users] slotwise preemption and requeue
>
> Hi,
>
> On 23.04.2014 at 15:06, HUMMEL Michel wrote:
>
>> I'm trying to configure my OGS to allow urgent priority jobs to requeue
>> low priority jobs.
>> It seems to work for a limited number of urgent priority jobs, but there
>> is a limit above which the system doesn't work as expected.
>> Here is the configuration I used:
>>
>> I have 7 nodes (named OGSE1-7) with 12 slots each, which means 84 slots
>> in total.
>>
>> I have 2 queues using slotwise preemption:
>> all.q and urgent.q (see configurations [1] and [2])
>>
>> To allow the requeue of jobs I have configured a checkpoint:
>>
>> $ qconf -sckpt Requeue
>> ckpt_name          Requeue
>> interface          APPLICATION-LEVEL
>> ckpt_command       /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
>>                    $job_pid
>> migr_command       /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
>>                    $job_pid
>> restart_command    NONE
>> clean_command      NONE
>> ckpt_dir           /tmp
>> signal             NONE
>> when               xsr
>>
>> I manage priority levels with complexes p1, p2, p3 for "normal" jobs and
>> p0 for urgent jobs:
>>
>> $ qconf -sc
>> #name       shortcut  type  relop  requestable  consumable  default  urgency
>> priority0   p0        BOOL  ==     FORCED       NO          FALSE    40
>> priority1   p1        BOOL  ==     YES          NO          FALSE    30
>> priority2   p2        BOOL  ==     YES          NO          FALSE    20
>> priority3   p3        BOOL  ==     YES          NO          FALSE    10
>>
>> To limit the number of jobs running concurrently across the two queues I
>> used an RQS on the all.q queue:
>>
>> $ qconf -srqs
>> {
>>    name         limit_DCH
>>    description  NONE
>>    enabled      TRUE
>>    limit        queues {all.q} hosts {*} to slots=$num_proc
>> }

As you have 12 slots per queue instance in all.q, this RQS seems not to have any effect, I think.

>> I submit 110 "normal" jobs and 84 of them are executed, 12 on each node:
>>
>> for i in $(seq 1 110); do qsub -l p1 -ckpt Requeue job.sh; done
>> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>>      12 job.sh all.q@OGSE1
>>      12 job.sh all.q@OGSE2
>>      12 job.sh all.q@OGSE3
>>      12 job.sh all.q@OGSE4
>>      12 job.sh all.q@OGSE5
>>      12 job.sh all.q@OGSE6
>>      12 job.sh all.q@OGSE7
>>
>> Then I submit 40 urgent jobs and it works as expected: the 40 are
>> executed in the urgent.q queue and 40 jobs of all.q are requeued
>> (state Rq):
>>
>> for i in $(seq 1 40); do qsub -l p0 job.sh; done
>>
>> (I grep OGSE on the output to only catch jobs which are assigned to a
>> queue)
>>
>> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>>       8 job.sh all.q@OGSE2
>>      12 job.sh all.q@OGSE3
>>      12 job.sh all.q@OGSE5
>>      12 job.sh all.q@OGSE6
>>      12 job.sh urgent.q@OGSE1
>>       4 job.sh urgent.q@OGSE2
>>      12 job.sh urgent.q@OGSE4
>>      12 job.sh urgent.q@OGSE7
>>
>> As you can see, there are only 12 jobs running on each node.
>>
>> It works until I reach 42 urgent jobs (I submitted the others one by one
>> to find the exact limit).
>> When the 42nd job starts, the requeue system doesn't work anymore: OGS
>> begins to suspend other "normal" jobs, then migrates them to another
>> node, then suspends another one, and so on, as long as there are 42 or
>> more urgent jobs running or pending.
>> $ qsub -l p0 job.sh
>> $ qsub -l p0 job.sh
>> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>>      12 job.sh all.q@OGSE1
>>      12 job.sh all.q@OGSE2
>>      11 job.sh all.q@OGSE3
>>       2 job.sh all.q@OGSE4
>>      11 job.sh all.q@OGSE5
>>      12 job.sh all.q@OGSE6
>>      12 job.sh urgent.q@OGSE1
>>       4 job.sh urgent.q@OGSE2
>>       1 job.sh urgent.q@OGSE3
>>      12 job.sh urgent.q@OGSE4
>>       1 job.sh urgent.q@OGSE5
>>       1 job.sh urgent.q@OGSE6
>>      12 job.sh urgent.q@OGSE7
>>
>> If I qdel one urgent job, the system works again as expected: only 12
>> jobs run on each node and no jobs are in the suspended state.
>>
>> Does someone have an idea of what's going on?
>> Any help will be appreciated.
>>
>> Michel Hummel
>>
>> ------------------
>> [1]
>> $ qconf -sq all.q
>> qname                 all.q
>> hostlist              OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      INFINITY
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             Requeue
>> pe_list               default distribute make
>> rerun                 TRUE
>> slots                 84

This would be the number of slots per queue instance too. Maybe this was the reason you introduced the RQS?

-- Reuti

>> tmpdir                /tmp
>> shell                 /bin/sh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            arusers
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        priority1=TRUE,priority2=TRUE,priority3=TRUE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>> [2]
>> $ qconf -sq urgent.q
>> qname                 urgent.q
>> hostlist              @allhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              -20
>> min_cpu_interval      INFINITY
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             Requeue
>> pe_list               default make
>> rerun                 FALSE
>> slots                 12
>> tmpdir                /tmp
>> shell                 /bin/sh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         SIGINT
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            arusers
>> xuser_lists           NONE
>> subordinate_list      slots=84(all.q:0:sr)
>
> A first thought: this limit is per queue instance. So it should never
> trigger any suspension at all unless you reach 84 per node.
>
> -- Reuti
>
>> complex_values        priority0=True
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
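One thing I would try (just a sketch, not tested on your cluster, and assuming the nodes really have 12 cores each): make the slots entry of all.q match the per-host core count instead of the cluster total, and keep the slot-wise threshold at the same value:

$ qconf -mattr queue slots 12 all.q
$ qconf -mattr queue subordinate_list "slots=12(all.q:0:sr)" urgent.q

With 12 slots per queue instance in all.q, the RQS limit_DCH should then be redundant and could be disabled with "qconf -mrqs limit_DCH" by setting enabled to FALSE. Whether this also removes the odd behavior at the 42nd urgent job I can't say for sure, but it takes the 84-per-instance setting out of the picture.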
