Thanks for all the suggestions. In the end, the easiest option was to
monitor the qmaster log, run qmod -sq against the affected execution
host to suspend its queues, and then run qmod -cj to clear the job's
error state so that it gets rescheduled. This will do until sssd gets
fixed.
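
For anyone wanting to try the same thing, a minimal sketch of the log-watching step follows. The log line wording matched by FAILURE_RE is an assumption for illustration, not the exact qmaster message, so adjust it to whatever your messages file actually contains. The function names are made up; only qmod -sq and qmod -cj come from the approach described above.

```python
import re
import subprocess

# ASSUMPTION: illustrative pattern only. Check the real wording in your
# qmaster messages file and adapt the regular expression to it.
FAILURE_RE = re.compile(r"job (?P<job>\d+)\.\d+ failed on host (?P<host>\S+)")


def parse_failure(line):
    """Return (job_id, host) if the line looks like a job failure, else None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return match.group("job"), match.group("host")


def build_commands(job_id, host):
    """Build the two qmod calls: suspend the host's queues, clear the job error."""
    return [
        ["qmod", "-sq", "*@" + host],  # all queue instances on that host
        ["qmod", "-cj", job_id],       # clear the job's error state
    ]


def handle_line(line, run=subprocess.check_call):
    """Run the qmod commands for one log line; return the commands that ran."""
    hit = parse_failure(line)
    if hit is None:
        return []
    commands = build_commands(*hit)
    for command in commands:
        run(command)
    return commands
```

Passing a different `run` callable (for example `print`) gives a dry run, which is worth doing before letting it loose on a production qmaster.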
Cheers,
Campbell
On 18/03/13 13:59, William Hay wrote:
On 15 March 2013 14:36, Campbell McLeay
<[email protected]> wrote:
Hi,
We're running Grid Engine 6.2u5, and we're having an issue with jobs not
getting run because one node (out of several hundred) has a NIS error
(so can't run the job). The whole job then sits in an error state, due
presumably to the returned prolog errors. Is it possible to have the
host set to 'Errored' in case of a NIS error, so it won't accept any
more jobs? I haven't been able to find a way to do this so far.
Cheers,
Rather than an error state, what about an alarm state? Write a load
sensor that detects the problem and set an appropriate load_thresholds
entry on each queue, so that the queue goes into alarm state when the
problem is detected. There is still a risk of a race if the job starts
just as the problem manifests, but it shouldn't be too bad. As Reuti
suggested, you could have the prolog run as a local user in order to
work around the problem. If you are using 6.2u5, there are some
security issues that arise from running the prolog/epilog as something
other than the job's user (in particular, running as root). It is
possible to work around these issues, but you need to be careful.
Alternatively, an upgrade to one of the live forks should solve the
security issue.
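
The load sensor idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: PROBE_USER is a hypothetical account that exists only in NIS/sssd, and nis_broken is a hypothetical complex name that would have to be defined with qconf -mc before it can be used. The sensor follows the usual load sensor protocol: wait for a line on stdin, exit on "quit", otherwise print a begin / host:name:value / end report.

```python
#!/usr/bin/env python3
import pwd
import socket
import sys

# ASSUMPTION: hypothetical account present only in NIS/sssd, not /etc/passwd.
PROBE_USER = "nisprobe"
# ASSUMPTION: hypothetical complex name, to be defined with `qconf -mc`.
COMPLEX_NAME = "nis_broken"


def lookup_ok(user, getpwnam=pwd.getpwnam):
    """Return True if the name service can resolve `user`."""
    try:
        getpwnam(user)
        return True
    except KeyError:
        return False


def format_report(host, ok):
    """Build one load report; the value is 0 when healthy, 1 when broken."""
    value = 0 if ok else 1
    return "begin\n{}:{}:{}\nend\n".format(host, COMPLEX_NAME, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # The host name must match the one Grid Engine uses for this node;
    # gethostname() may need adjusting (short name versus FQDN).
    host = socket.gethostname()
    while True:
        line = stdin.readline()
        # An empty read means execd closed the pipe; "quit" asks us to exit.
        if not line or line.strip() == "quit":
            break
        stdout.write(format_report(host, lookup_ok(PROBE_USER)))
        stdout.flush()


if __name__ == "__main__":
    main()
```

The script would be registered through the load_sensor parameter of the host configuration (qconf -mconf), and each queue would then get something like load_thresholds nis_broken=1, assuming the complex was defined as an integer with a >= relation.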
William
Campbell
--
Campbell McLeay | Senior Systems Administrator
T: +44 797 164 1427
E: [email protected]
A: 2-4 Bucknall Street, London WC2H 8LA
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users