Hi,

I have a cluster with Rayson's OGE from Oct 2011.

I'm seeing an unusual issue: our queue instances don't go into the error state when a user's job fails.

We have an underlying issue with the filesystem, and sometimes the compute nodes lose filesystem access. A job gets dispatched, errors out with

failed       26  : opening input/output file

and then lots of other jobs go to that same node and error out before the filesystem comes back.

IIRC, the queue should switch to error state when the first job errors out. But this isn't happening here. Is there some setting I can check?

I see the documentation says "A job enters the error state when Grid Engine tried to execute a job in a queue, but it failed for a reason that is considered specific to the job. A queue enters the error state when Grid Engine tried to execute a job in a queue, but it failed for a reason that is considered specific to the queue." per
http://arc.liv.ac.uk/SGE/howto/troubleshooting.html

We also have a load sensor that checks whether this filesystem is present, but it only reports every few minutes, while the filesystem tends to disappear for only about 60 seconds at a time, so the sensor usually misses the outage.
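
For context, a load sensor of this kind can be sketched roughly as below. This is not our actual script, just a minimal Python illustration of the usual load sensor protocol: execd writes a line to the sensor's stdin each load report interval, the sensor answers with a begin/end block of host:complex:value lines, and it exits when it reads "quit". The mount point /mnt/shared and the complex name fs_ok are made up for the example, and I'm assuming socket.gethostname() returns the same host name Grid Engine uses.

```python
#!/usr/bin/env python3
import os
import socket
import sys

MOUNT_POINT = "/mnt/shared"  # hypothetical path, not our real filesystem
COMPLEX_NAME = "fs_ok"       # hypothetical complex name


def fs_available(path):
    """Return 1 if path is currently a mount point, else 0."""
    # Note: on a hung network mount this call could block instead of failing.
    return 1 if os.path.ismount(path) else 0


def build_report(host, path):
    """Build one load report block as a list of lines."""
    return ["begin", f"{host}:{COMPLEX_NAME}:{fs_available(path)}", "end"]


def main():
    host = socket.gethostname()
    # execd sends one line per load report interval; "quit" means shut down.
    for line in iter(sys.stdin.readline, ""):
        if line.strip() == "quit":
            break
        print("\n".join(build_report(host, MOUNT_POINT)), flush=True)


if __name__ == "__main__":
    main()
```

Since the sensor is only polled once per report interval, a 60-second outage that falls between two polls never shows up in the reported value.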

Regards,
--
Alex Chekholko [email protected]
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
