On 15 March 2013 14:36, Campbell McLeay <[email protected]> wrote:
> Hi,
>
> We're running Grid Engine 6.2u5, and we're having an issue with jobs not
> getting run because one node (out of several hundred) has a NIS error
> (so can't run the job). The whole job then sits in an error state, due
> presumably to the returned prolog errors. Is it possible to have the
> host set to 'Errored' in case of a NIS error, so it won't accept any
> more jobs? I haven't been able to find a way to do this so far.
>
> Cheers,
>

Rather than an error state what about an alarm state? Write a load sensor
that detects the problem and set an appropriate load_threshold on each queue
to put the queue into alarm state when the problem is detected. There is
still a risk of a race if the job starts just as the problem manifests but
it shouldn't be too bad.

As Reuti suggested you could have the prolog run as a local user in order to
work around the problem. If you are using 6.2u5 there are some security
issues that arise from running prolog/epilog as something other than the
user (in particular running as root). It is possible to work around these
issues but you need to be careful. Alternatively an upgrade to one of the
live forks should solve the security issue.
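
For the load sensor idea, something along these lines might serve as a
starting point. It is only a rough sketch, not tested against a real
cluster. It assumes the usual load sensor protocol (the execution daemon
sends a newline on stdin to request a report and "quit" to stop, and the
sensor answers with begin / host:name:value / end). The account name
"nischeck" and the complex name "nis_error" are made up for illustration,
and the sketch assumes that a failed NIS lookup shows up as a KeyError or
OSError from pwd.getpwnam rather than as a hang.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor that flags a broken NIS lookup."""
import pwd
import socket
import sys

CANARY_USER = "nischeck"    # hypothetical NIS-only account; use a real one
COMPLEX_NAME = "nis_error"  # hypothetical complex; must be defined by the admin


def nis_broken(user=CANARY_USER, lookup=pwd.getpwnam):
    """Return 1 if the canary account cannot be resolved, else 0."""
    try:
        lookup(user)
        return 0
    except (KeyError, OSError):
        return 1


def report(host, value):
    """Format one load report as begin / host:name:value / end."""
    return "begin\n{}:{}:{}\nend\n".format(host, COMPLEX_NAME, value)


def run(stdin=sys.stdin, stdout=sys.stdout, host=None, check=nis_broken):
    """Answer every input line with a report; stop on 'quit' or end of input."""
    # Assumption: the local hostname matches the name Grid Engine uses for
    # this execution host; adjust if the cluster uses a different form.
    host = host or socket.gethostname()
    for line in iter(stdin.readline, ""):
        if line.strip() == "quit":
            break
        stdout.write(report(host, check()))
        stdout.flush()  # the daemon reads from a pipe, so do not buffer


if __name__ == "__main__":
    run()
```

To use something like this you would presumably define nis_error as an
integer complex (qconf -mc), point load_sensor at the script in the host
configuration, and add nis_error=1 to load_thresholds on each queue. Check
the exact settings against the man pages for your version.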

William

> Campbell
>
> --
>
> Campbell McLeay | Senior Systems Administrator
> T: +44 797 164 1427
> E: [email protected]
> A: 2-4 Bucknall Street, London WC2H 8LA
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
