Hi,
I have a cluster with Rayson's OGE from Oct 2011.
I'm seeing an unusual issue: our queue instances don't go into the
error state when a user's job fails.
We have an underlying issue with the filesystem, and sometimes the
compute nodes lose filesystem access. When that happens, a job gets
dispatched to an affected node, errors out with
failed 26 : opening input/output file
and then lots of other jobs get dispatched to that same node and error
out as well before the filesystem comes back.
IIRC, the queue should switch to error state when the first job errors
out. But this isn't happening here. Is there some setting I can check?
I see the documentation says "A job enters the error state when Grid
Engine tried to execute a job in a queue, but it failed for a reason
that is considered specific to the job. A queue enters the error state
when Grid Engine tried to execute a job in a queue, but it failed for a
reason that is considered specific to the queue." per
http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
We also have a load sensor that checks for the presence of this
filesystem, but the load sensor only updates every few minutes, while
the filesystem tends to disappear for only about 60s.
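
For illustration, a load sensor of this kind can be sketched roughly as
follows. This is not our actual script: the mount point /data and the
complex name fs_ok are hypothetical, and the sketch assumes that
socket.gethostname() returns the same host name Grid Engine uses for the
execution host. It follows the usual load sensor protocol: wait for a
line on stdin, answer with a begin/end report, and exit on "quit".

```python
#!/usr/bin/env python3
"""Minimal Grid Engine load sensor sketch: reports whether a filesystem is mounted."""
import os
import socket
import sys

MOUNT_POINT = "/data"   # hypothetical mount point, adjust for your site
COMPLEX_NAME = "fs_ok"  # hypothetical complex name, must be defined in the complex list


def fs_available(path):
    """Return 1 if path is currently a mount point, else 0."""
    # Assumption: a plain mount-point check is good enough; a hung
    # network mount may make this call block instead of returning 0.
    return 1 if os.path.ismount(path) else 0


def build_report(host, name, value):
    """Format one load report in the begin / host:name:value / end layout."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    host = socket.gethostname()
    for line in stdin:
        # The execution daemon sends "quit" when the sensor should stop.
        if line.strip() == "quit":
            break
        # Any other line (normally an empty one) is a request for a report.
        stdout.write(build_report(host, COMPLEX_NAME, fs_available(MOUNT_POINT)))
        stdout.flush()


if __name__ == "__main__":
    main()
```

To try it by hand, run the script, press Enter to get one report, and
type quit to stop it. The sensor is only polled at the load report
interval, so an outage shorter than that interval can still be missed.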
Regards,
--
Alex Chekholko [email protected]