Fritz Ferstl <[email protected]> writes:

> It should actually be quite easy. In a first implementation you'll 
> probably want to introduce a qmaster_param "black_hole_exit_rate" and 
> then keep a statistic of the exit frequency rate for each exec host near 
> the code where qmaster receives job completion information. Qmaster 
> would compare the exit frequency of the hosts against 
> black_hole_exit_rate and disable a host if its exit frequency is higher 
> than allowed by black_hole_exit_rate.
>
> A more advanced implementation would provide a black_hole_exit_rate per 
> exec host or even per cluster queue (i.e. per job class) plus per exec 
> host. The checking itself won't get much more complicated. The problem 
> with the more advanced approach is only that you'd have to modify the 
> format of the queue and host configuration. This would make that version 
> incompatible with earlier versions. So the upgrade step will get more 
> "involved".
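The first-cut approach Fritz describes could be sketched roughly like this. This is just an illustration, not qmaster code: the `ExitRateTracker` class, the sliding window, and the method names are all made up; only the `black_hole_exit_rate` parameter name comes from his mail, and here it's interpreted as "maximum failed exits per window" for concreteness.

```python
import time
from collections import defaultdict, deque

class ExitRateTracker:
    """Hypothetical per-host sliding window of failed-job exit times."""

    def __init__(self, black_hole_exit_rate, window_seconds=60.0):
        # black_hole_exit_rate: max failed exits allowed per window (assumed semantics)
        self.rate = black_hole_exit_rate
        self.window = window_seconds
        self.failures = defaultdict(deque)  # host -> timestamps of failed exits

    def record_exit(self, host, exit_status, now=None):
        """Called where qmaster receives job completion information.
        Returns True if the host's failure rate exceeds the limit,
        i.e. the host should be disabled."""
        now = time.time() if now is None else now
        if exit_status == 0:
            return False  # successful exits don't count toward the rate
        q = self.failures[host]
        q.append(now)
        # expire failures that fell outside the sliding window
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.rate
```

The per-cluster-queue variant would just key the `failures` map on a (host, queue) pair instead of the host alone.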

Understanding this might be more generally useful.

What's the reason for doing it in the qmaster?  The way I'd hope to be
able to do it would be locally in the execd, with an error state
triggered if the exit rate exceeded a threshold specified by a complex
(or more than one).  I don't know enough about the architecture,
though; for one thing, maybe the execd doesn't have access to the
relevant information.

If this were implemented, it might be useful to try to ignore classes
of error that seem to be due to the user, to balance the risk of
losing jobs against that of knocking out the whole cluster.  When
there's a large array job -- or a big batch of jobs that should be an
array job -- it's already easy to knock out the cluster with a simple
mistake, as at least one sort of user error can put the queue in an
error state (probably when the working directory disappears, but I
can't remember off-hand).
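To make that concrete, the rate check could consult a classification of failure causes before counting one against the host. The class names below are purely illustrative -- they are not real Grid Engine failure codes -- the point is just that user-attributable failures mark the job, not the host:

```python
# Hypothetical failure classes; real SGE exit/failure codes would go here.
HOST_FAULTS = {"epilog_failed", "shepherd_died", "local_spool_unwritable"}
USER_FAULTS = {"missing_cwd", "bad_shell", "nonzero_user_exit"}

def counts_toward_black_hole(failure_class):
    """A user mistake (e.g. a vanished working directory) should put the
    job in error, not feed the host's black-hole rate."""
    return failure_class in HOST_FAULTS
```

An unrecognised class probably ought to count as a host fault by default (fail safe toward disabling), but that's a policy choice.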
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users