Hi Rayson,
OK, that is reasonable.
My real problem is the filesystem itself, which is unrelated to Grid
Engine. In the meantime, though, I am looking for a workaround for the
filesystem issue.
Is there a way to make the load sensor check run more frequently?
Regards,
Alex
On 09/04/2012 12:03 PM, Rayson Ho wrote:
Hi Alex,
That's the correct behavior (for SSTATE_OPEN_OUTPUT), or else a user
can DoS the cluster easily by pointing the input or output file to a
path that can't be opened by the user.
Rayson
On Tue, Sep 4, 2012 at 2:50 PM, Alex Chekholko <[email protected]> wrote:
Hi,
I have a cluster with Rayson's OGE from Oct 2011.
I see an unusual issue: our queue instances don't error out when a user's
job fails.
We have an underlying issue with the filesystem, and sometimes the compute
nodes lose filesystem access. A job gets dispatched, errors out with
failed 26 : opening input/output file
and then lots of other jobs go to that same node and error out before the
filesystem comes back.
IIRC, the queue should switch to error state when the first job errors out.
But this isn't happening here. Is there some setting I can check?
I see the documentation says "A job enters the error state when Grid Engine
tried to execute a job in a queue, but it failed for a reason that is
considered specific to the job. A queue enters the error state when Grid
Engine tried to execute a job in a queue, but it failed for a reason that is
considered specific to the queue." per
http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
We also have a load sensor that checks for the presence of this filesystem,
but the load sensor only updates every few minutes, while the filesystem
tends to disappear for only about 60s.
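
For reference, a filesystem check of this kind can be sketched as a small load sensor. This is a minimal illustration, not the sensor actually running on this cluster. It assumes the usual load sensor protocol: the execution daemon writes a line to the sensor's stdin for each request, the sensor answers with a begin / host:name:value / end block, and it exits on "quit". The complex name fs_ok, the mount path argument, and the use of socket.gethostname() for the host name are all hypothetical choices. How often the sensor is queried should follow the load reporting interval in the cluster configuration (load_report_time in sge_conf), which would be worth checking against the 60s outage window.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor reporting whether a filesystem is reachable."""
import os
import socket
import sys


def filesystem_available(path):
    """Return 1 if `path` can be listed, else 0."""
    try:
        os.listdir(path)
        return 1
    except OSError:
        return 0


def format_report(host, name, value):
    """Build one load report in the begin / host:name:value / end format."""
    return ["begin", f"{host}:{name}:{value}", "end"]


def run_sensor(stdin, stdout, path, host, name="fs_ok"):
    """Answer each request line until 'quit' or end of input.

    `name` is a hypothetical complex attribute; it would have to be
    defined in the complex configuration before it can be used.
    """
    while True:
        line = stdin.readline()
        if line == "" or line.strip() == "quit":
            break
        for out in format_report(host, name, filesystem_available(path)):
            stdout.write(out + "\n")
        # Flush so the execution daemon sees the report immediately.
        stdout.flush()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: fs_sensor.py /path/to/mount
    # Assumption: the local host name matches the name Grid Engine uses.
    run_sensor(sys.stdin, sys.stdout, sys.argv[1], socket.gethostname())
```

Note that a plain directory listing can itself block on a hung network mount, so a real sensor would likely need a timeout around the check.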
Regards,
--
Alex Chekholko [email protected]
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users