On 09.07.2012 at 15:54, William Hay wrote:

> On 9 July 2012 14:08, Reuti <[email protected]> wrote:
>> On 09.07.2012 at 14:51, William Hay wrote:
>>
>>> On 9 July 2012 12:50, Reuti <[email protected]> wrote:
>>>> On 09.07.2012 at 11:42, William Hay wrote:
>>>>
>>>>> When execd starts, is it safe to assume that the load sensors will be
>>>>> run and reported back to the qmaster/scheduler before the node is
>>>>> declared contactable/eligible for scheduling again?
>>>>>
>>>>> I have a load sensor that reports when the node was last booted and
>>>>> would like to be sure that the time used for scheduling decisions is
>>>>> accurate.
>>>>
>>>> No. AFAICS when I start the execd on a particular node, the load
>>>> sensor is only triggered at the next interval of its usual cycle.
>>>>
>>>> To avoid this, you could report a BOOLEAN in the load sensor too and
>>>> use it as an entry in load_thresholds in the queue definition to put
>>>> the queue instance into alarm state (i.e. no jobs get scheduled to
>>>> it) as long as the load sensor doesn't report TRUE to indicate that
>>>> the node is available.
>>>>
>>> Would there not be a similar risk there though, where the boolean is
>>> cached from before a reboot, or do load thresholds work differently?
>>
>> If you reboot too fast: yes. So the old values should first vanish from
>> the load report.
>
> How does one determine what is too fast?

Too fast means that the values from the last run are still reported in `qhost -F ...`. But if the reboot takes only a few minutes, the load sensor would report the same value as before anyway. Or do you upgrade the OS within a single load_report interval, so that the old value would be wrong?

-- Reuti

>> You can set "initial_state" to disabled in the queue configuration, so
>> that the queue on this exechost needs to be enabled first after a reboot.
>
> Really want to keep the initial_state at enabled. The point of the
> exercise is to let grid engine schedule node reboots for us. We
> currently do this by submitting jobs targeted at specific hosts, but it
> can take a lot of time this way. We have a lot of checks that run before
> sge_execd is started, so it is safe for jobs to run immediately
> post-reboot. This helps minimise down time of individual nodes.
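
For reference, the load sensor approach discussed in this thread could be sketched roughly as below. This is only a minimal sketch under several assumptions, not anyone's actual sensor: the complex names `boot_time` and `node_ready` are hypothetical and would have to be defined in the complex configuration first, reading `btime` from `/proc/stat` assumes a Linux node, reporting a BOOL as `1` is an assumption about the accepted value format, and the host name printed must match the name Grid Engine uses for the exec host. The loop follows the usual load sensor convention: wait for a line on stdin, stop on `quit`, otherwise print one `begin` ... `end` block of `host:name:value` lines.

```python
import socket
import sys


def parse_boot_time(proc_stat_text):
    """Return the boot time (seconds since the epoch) from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "btime":
            return int(fields[1])
    raise ValueError("no btime line found")


def format_report(host, values):
    """Build one load report block: begin, host:name:value lines, end."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_sensor(stdin, stdout, host, read_values):
    """Emit one report per input line until 'quit' or end of input."""
    while True:
        line = stdin.readline()
        if not line or line.strip() == "quit":
            break
        stdout.write(format_report(host, read_values()))
        stdout.flush()  # execd reads from a pipe, so do not leave data buffered


def read_values():
    """Collect the values to report (hypothetical complex names)."""
    with open("/proc/stat") as f:
        boot = parse_boot_time(f.read())
    # node_ready is reported as true once this sensor is running at all.
    return {"boot_time": boot, "node_ready": 1}


def main():
    run_sensor(sys.stdin, sys.stdout, socket.gethostname(), read_values)
```

To use it as a sensor, the script would call `main()` at the end and be registered as a load sensor for the host. Note that this does not remove the caching concern raised above: a stale `node_ready` value from before the reboot could still be visible until the first new report arrives.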
