Hi,
I'm trying to configure my OGS cluster so that urgent-priority jobs can requeue low-priority jobs.
It seems to work for a limited number of urgent jobs, but there is a limit above which the system no longer behaves as expected.
Here is the configuration I used:
I have 7 nodes (named OGSE1-7) with 12 slots each, i.e. 84 slots in total.
I have 2 queues using slotwise preemption:
all.q and urgent.q (see configurations [1] and [2]).
To allow jobs to be requeued I have configured a checkpoint environment:
$ qconf -sckpt Requeue
ckpt_name Requeue
interface APPLICATION-LEVEL
ckpt_command
/data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
$job_pid
migr_command
/data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
$job_pid
restart_command NONE
clean_command NONE
ckpt_dir /tmp
signal NONE
when xsr
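For context, kill_tree.sh just terminates the whole process tree of the job so that it exits and can be requeued; it is roughly equivalent to this simplified sketch (not the actual script):

#!/bin/sh
# Simplified sketch of a kill_tree.sh, not the real file:
# recursively send TERM to the process tree rooted at the job PID so the
# job exits and, with "when xsr" and rerun enabled, goes back to pending.
kill_tree() {
    for child in $(pgrep -P "$1"); do
        kill_tree "$child"
    done
    kill -TERM "$1"
}
kill_tree "$1"    # $1 is $job_pid from the checkpoint definition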
I manage priority levels with complexes:
p1, p2, p3 for "normal" jobs
p0 for urgent jobs
$ qconf -sc
#name        shortcut   type   relop   requestable   consumable   default   urgency
priority0    p0         BOOL   ==      FORCED        NO            FALSE     40
priority1    p1         BOOL   ==      YES           NO            FALSE     30
priority2    p2         BOOL   ==      YES           NO            FALSE     20
priority3    p3         BOOL   ==      YES           NO            FALSE     10
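The idea is that priority0 is FORCED, so only jobs that explicitly request -l p0 can run in urgent.q (where priority0=True is set in complex_values), and its urgency of 40 puts those jobs ahead of the p1/p2/p3 jobs in the pending list. If it helps, the urgencies and priorities the scheduler actually assigns can be checked with, for example:
$ qstat -urg    # per-job urgency values
$ qstat -pri    # resulting dispatch priorities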
To limit the number of jobs running concurrently on the two queues I used an RQS
on the all.q queue:
$ qconf -srqs
{
name limit_DCH
description NONE
enabled TRUE
limit queues {all.q} hosts {*} to slots=$num_proc
}
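$num_proc resolves to the number of processors of each host (12 here), so the rule caps all.q at 12 slots per node. Current usage against that limit can be checked with, for example:
$ qquota -q all.q    # shows slot usage of the limit_DCH rule per host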
I submit 110 "normal jobs" and 84 of them are executed, 12 on each node.
for i in $(seq 1 110); do qsub -l p1 -ckpt Requeue job.sh; done
qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
12 job.sh all.q@OGSE1
12 job.sh all.q@OGSE2
12 job.sh all.q@OGSE3
12 job.sh all.q@OGSE4
12 job.sh all.q@OGSE5
12 job.sh all.q@OGSE6
12 job.sh all.q@OGSE7
Then I submit 40 urgent jobs and it works as expected; all 40 run in the
urgent.q queue and 40 all.q jobs are requeued (state Rq):
for i in $(seq 1 40); do qsub -l p0 job.sh; done
(I grep for OGSE in the output to catch only the jobs that are assigned to a queue.)
qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
8 job.sh all.q@OGSE2
12 job.sh all.q@OGSE3
12 job.sh all.q@OGSE5
12 job.sh all.q@OGSE6
12 job.sh urgent.q@OGSE1
4 job.sh urgent.q@OGSE2
12 job.sh urgent.q@OGSE4
12 job.sh urgent.q@OGSE7
As you can see, there are only 12 jobs running on each node.
It works until I reach 42 urgent jobs (I submitted the extra ones one by one to
find the exact limit).
Once the 42nd urgent job starts, the requeue mechanism stops working: OGS
begins to suspend a "normal" job, migrates it to another node, then suspends
another one, and so on, for as long as 42 or more urgent jobs are running or
pending.
$ qsub -l p0 job.sh
$ qsub -l p0 job.sh
qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
12 job.sh all.q@OGSE1
12 job.sh all.q@OGSE2
11 job.sh all.q@OGSE3
2 job.sh all.q@OGSE4
11 job.sh all.q@OGSE5
12 job.sh all.q@OGSE6
12 job.sh urgent.q@OGSE1
4 job.sh urgent.q@OGSE2
1 job.sh urgent.q@OGSE3
12 job.sh urgent.q@OGSE4
1 job.sh urgent.q@OGSE5
1 job.sh urgent.q@OGSE6
12 job.sh urgent.q@OGSE7
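Note that plain qstat also lists suspended jobs with their queue instance, so the per-node counts above presumably include the suspended "normal" jobs as well. The suspended ones alone can be listed with, for instance:
$ qstat -s s | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c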
If I qdel one urgent job, the system works as expected again: only 12 jobs run
on each node and no job is left in the suspended state.
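In case it helps with diagnosing this, the scheduler's reasoning for one dispatch run can be dumped like this (the path assumes the default cell name):
$ qconf -tsm    # trigger one monitored scheduling run
$ less $SGE_ROOT/default/common/schedd_runlog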
Does anyone have an idea of what's going on?
Any help would be appreciated.
Michel Hummel
------------------
[1]
$ qconf -sq all.q
qname all.q
hostlist OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval INFINITY
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list Requeue
pe_list default distribute make
rerun TRUE
slots 84
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists arusers
xuser_lists NONE
subordinate_list NONE
complex_values priority1=TRUE,priority2=TRUE,priority3=TRUE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
[2]
$ qconf -sq urgent.q
qname urgent.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority -20
min_cpu_interval INFINITY
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list Requeue
pe_list default make
rerun FALSE
slots 12
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method SIGINT
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists arusers
xuser_lists NONE
subordinate_list slots=84(all.q:0:sr)
complex_values priority0=True
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY