From Grid Engine's point of view, it cannot tell whether the failure is a filesystem issue or a bad pathname passed to qsub by the user.

Rayson

On Tue, Sep 4, 2012 at 3:27 PM, Alex Chekholko <[email protected]> wrote:
> Hi Rayson,
>
> OK, that is reasonable.
>
> My real problem is this filesystem problem, which is not related to grid
> engine. But in the meantime, I wonder if there is some workaround for my
> filesystem issue.
>
> Is there a way to make the load sensor check more frequent?
>
> Regards,
> Alex
>
>
> On 09/04/2012 12:03 PM, Rayson Ho wrote:
>>
>> Hi Alex,
>>
>> That's the correct behavior (for SSTATE_OPEN_OUTPUT), or else a user
>> can DoS the cluster easily by pointing the input or output file to a
>> path that can't be opened by the user.
>>
>> Rayson
>>
>>
>>
>> On Tue, Sep 4, 2012 at 2:50 PM, Alex Chekholko <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have a cluster with Rayson's OGE from Oct 2011.
>>>
>>> I see an unusual issue: our queue instances don't error out when a
>>> user's job fails.
>>>
>>> We have an underlying issue with the filesystem, and sometimes the
>>> compute nodes lose filesystem access. A job gets dispatched, errors
>>> out with
>>>
>>> failed 26 : opening input/output file
>>>
>>> and then lots of other jobs go to that same node and error out before
>>> the filesystem comes back.
>>>
>>> IIRC, the queue should switch to error state when the first job errors
>>> out. But this isn't happening here. Is there some setting I can check?
>>>
>>> I see the documentation says "A job enters the error state when Grid
>>> Engine tried to execute a job in a queue, but it failed for a reason
>>> that is considered specific to the job. A queue enters the error state
>>> when Grid Engine tried to execute a job in a queue, but it failed for
>>> a reason that is considered specific to the queue." per
>>> http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
>>>
>>> We also have a load sensor that checks for the presence of this
>>> filesystem, but the load sensor only updates every few minutes, while
>>> the filesystem tends to disappear for only about 60s.
>>>
>>> Regards,
>>> --
>>> Alex Chekholko [email protected]
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
>
> --
> Alex Chekholko [email protected] 347-401-4860
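
For readers following the load sensor part of this thread, here is a minimal sketch of what a filesystem-presence load sensor might look like. It is not Alex's actual sensor, which is not shown in the thread. It assumes the usual Grid Engine load sensor protocol (the execution daemon writes a newline to the sensor's stdin to request a report, the sensor answers with a begin / host:name:value / end block, and "quit" ends the sensor). The complex name `fs_ok` and the function names are hypothetical. How often the sensor is polled is, as far as I recall, tied to `load_report_time` in the cluster configuration; please check sge_conf(5) for your version before relying on that.

```python
import os
import socket
import sys


def filesystem_available(path):
    """Return 1 if `path` can be listed, else 0.

    Assumption: a failed listing means the filesystem is gone. On a hung
    (rather than missing) network mount this call may block; a real
    sensor would want a timeout around it.
    """
    try:
        os.listdir(path)
        return 1
    except OSError:
        return 0


def build_report(host, complex_name, value):
    """Build one load report block in begin / host:name:value / end form."""
    return ["begin", f"{host}:{complex_name}:{value}", "end"]


def run_sensor(path, complex_name="fs_ok", host=None,
               stdin=sys.stdin, stdout=sys.stdout):
    """Answer each request line on stdin with a report; stop on 'quit' or EOF.

    `fs_ok` is a hypothetical complex name; it would have to be defined
    in the complex configuration to be of any use to the scheduler.
    """
    host = host or socket.gethostname()
    while True:
        line = stdin.readline()
        if not line or line.strip() == "quit":
            break
        value = filesystem_available(path)
        for out in build_report(host, complex_name, value):
            stdout.write(out + "\n")
        # Flush so the daemon sees the full report immediately.
        stdout.flush()
```

To run this as a standalone sensor script, one would add a call such as `run_sensor("/path/to/mount")` at the bottom, with the path replaced by the filesystem in question. Note that a sensor like this only narrows the window in which jobs land on a node with a missing filesystem; it does not change the queue error-state behavior Rayson describes.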
