On 25.04.2014 at 10:39, HUMMEL Michel wrote:
> Thank you for the response,
> My objective is to allow urgent jobs to requeue jobs running in all.q without
> having jobs suspended (which still consume resources).
> Without the RQS, if all.q is full and an urgent job is submitted, a job in
> all.q is suspended and then requeued (thanks to the checkpoint properties).
> But as the queue instance of all.q then has 1 slot free, a new "normal" job
> starts in the queue, which is suspended, requeued, etc.
>
> To break this loop of "start, suspend, requeue", I use the RQS (which, I
> think, prevents oversubscription of the node only for all.q?)
Yes.
The rule:
limit queues {all.q} hosts {*} to slots=$num_proc
could also be written as:
limit queues all.q hosts {*} to slots=12
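As I understand sge_resource_quota(5), the braces are what make the rule
per-element: a list in {} creates a separate limit for each matching member,
while a bare name creates one combined limit. So:

limit queues {all.q} hosts {*} to slots=$num_proc
(one limit per host, with $num_proc resolved per host)

limit queues all.q hosts * to slots=12
(a single limit of 12 summed over all hosts)

Only the per-host form caps each node individually.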
> The number of slots declared in all.q is bigger than 12 because I saw in my
> tests that this influences the limit above which the system begins to fail.
> In my last test I set it to 96, which increased to 60 the limit below which
> it works fine (I really don't know why).
It means that the RQS is taken into account before a job is resumed, while a
limit on the queue-instance level will allow a job to start which is then
suspended instantly due to the slot-wise subordination (which can actually
happen). Is this your observation?
> To finish, I tried to increase it again, but this had no further effect.
> I found an intermediate solution, which is to add an RQS limiting the slots
> of the urgent queue to 60. This works, but I really need to allow the
> urgent.q queue to use all the slots of the cluster.
>
> Here are the RQS definitions used:
> limit queues {all.q} hosts {*} to slots=$num_proc
> limit queues {urgent.q} hosts * to slots=60
limit queues urgent.q to slots=60
The last line would be a limit across the entire cluster.
Although I have no hint for the original issue, maybe shortening the RQS can
give a clue in the output of:
$ qquota
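If the output is long, it can be narrowed down with qquota's filter options
(see its man page), e.g.:

$ qquota -q all.q     # only rules affecting queue all.q
$ qquota -h OGSE1     # only rules affecting host OGSE1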
-- Reuti
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, April 25, 2014 00:31
> To: HUMMEL Michel
> Cc: [email protected]
> Subject: Re: [gridengine users] slotwise preemption and requeue
>
> On 24.04.2014 at 09:22, HUMMEL Michel wrote:
>
>> Thank you for the hint (slotwise suspension configured at 84 slots per node),
>> but as you can see, preemption seems to work (at least until you reach 42
>> urgent jobs).
>>
>> I tried with:
>> subordinate_list slots=12(all.q:0:sr)
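>> (i.e. threshold 12 per host, subordinated queue all.q, seq_no 0 and action
>> sr, following the slots=<threshold>(<queue>:<seq_no>:<action>) syntax of
>> queue_conf(5))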
>>
>> And it gives me the exact same result. Do you have any suggestions?
>
> Okay. Another thing that caught my eye:
>
>
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Wednesday, April 23, 2014 18:42
>> To: HUMMEL Michel
>> Cc: [email protected]
>> Subject: Re: [gridengine users] slotwise preemption and requeue
>>
>> Hi,
>>
>> On 23.04.2014 at 15:06, HUMMEL Michel wrote:
>>
>>> I'm trying to configure my OGS to allow urgent-priority jobs to requeue
>>> low-priority jobs.
>>> It seems to work for a limited number of urgent jobs, but there is a limit
>>> above which the system doesn't work as expected.
>>> Here is the configuration I used:
>>>
>>> I have 7 nodes (named OGSE1-7) with 12 slots each, which means 84 slots in all.
>>>
>>> I have 2 queues using slotwise preemption:
>>> all.q and urgent.q (see configurations [1] and [2])
>>>
>>> To allow the requeue of jobs I have configured a checkpoint environment:
>>> $ qconf -sckpt Requeue
>>> ckpt_name Requeue
>>> interface APPLICATION-LEVEL
>>> ckpt_command
>>> /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
>>> $job_pid
>>> migr_command
>>> /data/module/install/OGS/2011.11p1/install/GE2011.11/kill_tree.sh \
>>> $job_pid
>>> restart_command NONE
>>> clean_command NONE
>>> ckpt_dir /tmp
>>> signal NONE
>>> when xsr
>>>
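>>> (kill_tree.sh kills the job's whole process tree so that, with rerun TRUE
>>> and "when xsr", the job ends up requeued instead of staying suspended.
>>> Roughly, it boils down to something like this minimal sketch; SGE
>>> substitutes $job_pid on the command line, so it arrives as $1:
>>>
>>> #!/bin/sh
>>> # Kill the job's process group (a simple approximation of the
>>> # full process tree); plain kill sends SIGTERM by default.
>>> PGID=$(ps -o pgid= -p "$1" | tr -d ' ')
>>> kill -- "-$PGID"
>>> )
>>>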
>>> I manage priority levels with complexes p1, p2 and p3 for "normal" jobs,
>>> and p0 for urgent jobs:
>>> $ qconf -sc
>>> #name      shortcut  type  relop  requestable  consumable  default  urgency
>>> priority0  p0        BOOL  ==     FORCED       NO          FALSE    40
>>> priority1  p1        BOOL  ==     YES          NO          FALSE    30
>>> priority2  p2        BOOL  ==     YES          NO          FALSE    20
>>> priority3  p3        BOOL  ==     YES          NO          FALSE    10
>>>
>>> To limit the number of jobs running concurrently in the two queues, I used
>>> an RQS on the all.q queue:
>>> $ qconf -srqs
>>> {
>>> name limit_DCH
>>> description NONE
>>> enabled TRUE
>>> limit queues {all.q} hosts {*} to slots=$num_proc
>>> }
>
> As you have 12 slots per queue instance in all.q, this RQS seems not to have
> any effect, I think.
>
>
>>> I submit 110 "normal jobs" and 84 of them are executed, 12 on each node.
>>>
>>> for i in $(seq 1 110); do qsub -l p1 -ckpt Requeue job.sh; done
>>> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>>> 12 job.sh all.q@OGSE1
>>> 12 job.sh all.q@OGSE2
>>> 12 job.sh all.q@OGSE3
>>> 12 job.sh all.q@OGSE4
>>> 12 job.sh all.q@OGSE5
>>> 12 job.sh all.q@OGSE6
>>> 12 job.sh all.q@OGSE7
>>>
>>> Then I submit 40 urgent jobs and it works as expected: the 40 are executed
>>> in the urgent.q queue and 40 jobs of all.q are requeued (state Rq):
>>> for i in $(seq 1 40); do qsub -l p0 job.sh; done
>>> (I grep for OGSE in the output to only catch jobs that are assigned to a queue)
>>> qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>>> 8 job.sh all.q@OGSE2
>>> 12 job.sh all.q@OGSE3
>>> 12 job.sh all.q@OGSE5
>>> 12 job.sh all.q@OGSE6
>>> 12 job.sh urgent.q@OGSE1
>>> 4 job.sh urgent.q@OGSE2
>>> 12 job.sh urgent.q@OGSE4
>>> 12 job.sh urgent.q@OGSE7
>>> As you can see, there are only 12 jobs running on each node.
>>>
>>> It works until I reach 42 urgent jobs (I submitted the others one by one
>>> to find the exact limit).
>>> When the 42nd job starts, the requeue mechanism doesn't work anymore: OGS
>>> begins to suspend other "normal" jobs, then migrates them to other nodes,
>>> then suspends more, and so on, as long as there are 42 or more urgent jobs
>>> running or pending.
>>> $ qsub -l p0 job.sh
>>> $ qsub -l p0 job.sh
>>> $ qstat | grep 'OGSE' | sort -k 8 | awk '{print $3 " " $8 }' | uniq -c
>>> 12 job.sh all.q@OGSE1
>>> 12 job.sh all.q@OGSE2
>>> 11 job.sh all.q@OGSE3
>>> 2 job.sh all.q@OGSE4
>>> 11 job.sh all.q@OGSE5
>>> 12 job.sh all.q@OGSE6
>>> 12 job.sh urgent.q@OGSE1
>>> 4 job.sh urgent.q@OGSE2
>>> 1 job.sh urgent.q@OGSE3
>>> 12 job.sh urgent.q@OGSE4
>>> 1 job.sh urgent.q@OGSE5
>>> 1 job.sh urgent.q@OGSE6
>>> 12 job.sh urgent.q@OGSE7
>>>
>>> If I qdel one urgent job, the system works again as expected: only 12 jobs
>>> run on each node and no jobs are in the suspended state.
>>>
>>> Does someone have an idea of what's going on?
>>> Any help will be appreciated.
>>>
>>> Michel Hummel
>>>
>>> ------------------
>>> [1]
>>> $ qconf -sq all.q
>>> qname all.q
>>> hostlist OGSE1 OGSE2 OGSE3 OGSE4 OGSE5 OGSE6 OGSE7
>>> seq_no 0
>>> load_thresholds np_load_avg=1.75
>>> suspend_thresholds NONE
>>> nsuspend 1
>>> suspend_interval 00:05:00
>>> priority 0
>>> min_cpu_interval INFINITY
>>> processors UNDEFINED
>>> qtype BATCH INTERACTIVE
>>> ckpt_list Requeue
>>> pe_list default distribute make
>>> rerun TRUE
>>> slots 84
>
> This would be the number of slots per queue instance too. Maybe this was the
> reason you introduced the RQS?
>
> -- Reuti
>
>
>>> tmpdir /tmp
>>> shell /bin/sh
>>> prolog NONE
>>> epilog NONE
>>> shell_start_mode posix_compliant
>>> starter_method NONE
>>> suspend_method NONE
>>> resume_method NONE
>>> terminate_method NONE
>>> notify 00:00:60
>>> owner_list NONE
>>> user_lists arusers
>>> xuser_lists NONE
>>> subordinate_list NONE
>>> complex_values priority1=TRUE,priority2=TRUE,priority3=TRUE
>>> projects NONE
>>> xprojects NONE
>>> calendar NONE
>>> initial_state default
>>> s_rt INFINITY
>>> h_rt INFINITY
>>> s_cpu INFINITY
>>> h_cpu INFINITY
>>> s_fsize INFINITY
>>> h_fsize INFINITY
>>> s_data INFINITY
>>> h_data INFINITY
>>> s_stack INFINITY
>>> h_stack INFINITY
>>> s_core INFINITY
>>> h_core INFINITY
>>> s_rss INFINITY
>>> h_rss INFINITY
>>> s_vmem INFINITY
>>> h_vmem INFINITY
>>>
>>> [2]
>>> $ qconf -sq urgent.q
>>> qname urgent.q
>>> hostlist @allhosts
>>> seq_no 0
>>> load_thresholds np_load_avg=1.75
>>> suspend_thresholds NONE
>>> nsuspend 1
>>> suspend_interval 00:05:00
>>> priority -20
>>> min_cpu_interval INFINITY
>>> processors UNDEFINED
>>> qtype BATCH INTERACTIVE
>>> ckpt_list Requeue
>>> pe_list default make
>>> rerun FALSE
>>> slots 12
>>> tmpdir /tmp
>>> shell /bin/sh
>>> prolog NONE
>>> epilog NONE
>>> shell_start_mode posix_compliant
>>> starter_method NONE
>>> suspend_method NONE
>>> resume_method SIGINT
>>> terminate_method NONE
>>> notify 00:00:60
>>> owner_list NONE
>>> user_lists arusers
>>> xuser_lists NONE
>>> subordinate_list slots=84(all.q:0:sr)
>>
>> A first thought: this limit is per queue instance, so it should never
>> trigger any suspension at all unless you exceed 84 slots per node.
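>>
>> With 12 slots per node, a threshold that actually matches the hosts would
>> rather be something like:
>>
>> subordinate_list slots=12(all.q:0:sr)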
>>
>> -- Reuti
>>
>>
>>> complex_values priority0=True
>>> projects NONE
>>> xprojects NONE
>>> calendar NONE
>>> initial_state default
>>> s_rt INFINITY
>>> h_rt INFINITY
>>> s_cpu INFINITY
>>> h_cpu INFINITY
>>> s_fsize INFINITY
>>> h_fsize INFINITY
>>> s_data INFINITY
>>> h_data INFINITY
>>> s_stack INFINITY
>>> h_stack INFINITY
>>> s_core INFINITY
>>> h_core INFINITY
>>> s_rss INFINITY
>>> h_rss INFINITY
>>> s_vmem INFINITY
>>> h_vmem INFINITY
>>>
>>>
>>>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users