Re: [gridengine users] execd load sensors timing

Reuti Mon, 09 Jul 2012 07:02:37 -0700

On 09.07.2012 at 15:54, William Hay wrote:

> On 9 July 2012 14:08, Reuti <[email protected]> wrote:
>> On 09.07.2012 at 14:51, William Hay wrote:
>> 
>>> On 9 July 2012 12:50, Reuti <[email protected]> wrote:
>>>> On 09.07.2012 at 11:42, William Hay wrote:
>>>> 
>>>>> When execd starts, is it safe to assume that the load sensors will be
>>>>> run and reported back to the qmaster/scheduler before the node is
>>>>> declared contactable/eligible for scheduling again?
>>>>> 
>>>>> I have a load sensor that reports when the node was last booted and
>>>>> would like to be sure that the time used for scheduling decisions is
>>>>> accurate.
>>>> 
>>>> No. AFAICS, when I start the execd on a particular node, the load sensor 
>>>> is only triggered at the next interval of the usual load report cycle.
>>>> 
>>>> To work around this, you could also report a BOOLEAN in the load sensor 
>>>> and use it as an entry in load_thresholds in the queue definition, so 
>>>> that the queue instance stays in alarm state (i.e. no jobs get scheduled 
>>>> to it) as long as the load sensor doesn't report TRUE to signal that the 
>>>> node is available.
>>>> 
>>> Would there not be a similar risk there, though, where the boolean is
>>> cached from before a reboot, or do load thresholds work differently?
>> 
>> If you reboot too fast: yes. The old values should first vanish from the 
>> load report.
> 
> How does one determine what is too fast?


You can tell because the values from the last run are still reported in 
`qhost -F ...`. But when the reboot takes only a few minutes, the load sensor 
would report the same value as before. Or do you upgrade the OS within a single 
load_report interval, so that the old value would be wrong?

-- Reuti


>> You can set "initial_state" to disabled in the queue configuration, so that 
>> the queue on this exechost needs to be enabled first after a reboot.
> 
> Really want to keep the initial_state at enabled.  The point of the
> exercise is to let grid engine schedule node reboots for us.  We
> currently do this by submitting jobs targeted at specific hosts but it
> can take
> a lot of time this way.  We have a lot of checks that run before
> sge_execd is started so it is safe for jobs to run immediately
> post-reboot.  This helps minimise down time of individual nodes.
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
