Hi Alex,

That's the correct behavior (for SSTATE_OPEN_OUTPUT): the queue is
deliberately not put into the error state. Otherwise a user could easily
DoS the cluster by pointing a job's input or output file at a path that
the user can't open.

Rayson



On Tue, Sep 4, 2012 at 2:50 PM, Alex Chekholko <[email protected]> wrote:
> Hi,
>
> I have a cluster with Rayson's OGE from Oct 2011.
>
> I see an unusual issue: our queue instances don't error out when a user's
> job fails.
>
> We have an underlying issue with the filesystem, and sometimes the compute
> nodes lose filesystem access.  A job gets dispatched, errors out with
>
> failed       26  : opening input/output file
>
> and then lots of other jobs go to that same node and error out before the
> filesystem comes back.
>
> IIRC, the queue should switch to error state when the first job errors out.
> But this isn't happening here.  Is there some setting I can check?
>
> I see the documentation says "A job enters the error state when Grid Engine
> tried to execute a job in a queue, but it failed for a reason that is
> considered specific to the job. A queue enters the error state when Grid
> Engine tried to execute a job in a queue, but it failed for a reason that is
> considered specific to the queue." per
> http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
>
> We also have a load sensor that checks for the presence of this filesystem,
> but the load sensor only updates every few minutes, while the filesystem
> tends to disappear for only about 60s.
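
For reference, a minimal sketch of the kind of load sensor described
above, assuming the usual Grid Engine load sensor protocol: the execd
writes a line to the sensor's stdin to request a report, the sensor
answers with a begin/end block of host:name:value lines, and a line
reading "quit" asks it to exit. The mount point /shared and the complex
name fs_ok are hypothetical placeholders; the complex would need to be
defined in the complex configuration before the value is of any use.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor that reports filesystem presence."""
import os
import socket
import sys

MOUNT_POINT = "/shared"   # hypothetical path of the shared filesystem
COMPLEX_NAME = "fs_ok"    # hypothetical complex name, must be configured


def filesystem_available(path):
    """Return True if path is a mount point whose contents can be listed."""
    try:
        return os.path.ismount(path) and os.listdir(path) is not None
    except OSError:
        return False


def format_report(host, name, value):
    """Build one load report block in host:name:value form."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def run(stdin, stdout, host, check):
    """Emit one report per request line until "quit" or end of input."""
    for line in stdin:
        if line.strip() == "quit":
            break
        value = 1 if check() else 0
        stdout.write(format_report(host, COMPLEX_NAME, value))
        stdout.flush()  # the execd reads the report from a pipe


if __name__ == "__main__":
    run(sys.stdin, sys.stdout, socket.gethostname(),
        lambda: filesystem_available(MOUNT_POINT))
```

A sensor like this does not close the timing gap on its own: it only
reports when asked, so an outage shorter than the reporting interval (as
far as I recall, load_report_time in the cluster configuration) can still
go unnoticed. A hung network mount may also block the directory listing,
so a production version would likely want a timeout around the check.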
>
> Regards,
> --
> Alex Chekholko [email protected]
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users