On 4 September 2012 20:27, Alex Chekholko <[email protected]> wrote:
> Hi Rayson,
>
> OK, that is reasonable.
>
> My real problem is this filesystem problem, which is not related to grid
> engine.  But in the meantime, I wonder if there is some workaround for
> my filesystem issue.
>
> Is there a way to make the load sensor check more frequent?
>

Changing load_report_time in the SGE configuration should do it.
Unfortunately, this affects all load sensors on the node. If the problem
affects all nodes simultaneously, you could have only one node run the
sensor (reporting a global rather than a per-host load value) and decrease
load_report_time on that node only.
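
For illustration only, here is a rough sketch of what such a filesystem
load sensor might look like in Python. It is not the sensor used on Alex's
cluster. It assumes the usual Grid Engine load sensor protocol (the execd
writes a line to the sensor's stdin each reporting interval, the sensor
answers with a begin/end block of host:name:value lines, and "quit" ends
it). The mount point, the complex name fs_ok, and the choice to report
under the host name "global" are all hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch of a load sensor that reports whether a filesystem is reachable."""
import os
import socket
import sys

# Hypothetical values: replace with the site's mount point and complex name.
MOUNT_POINT = "/path/to/shared/fs"
COMPLEX_NAME = "fs_ok"


def filesystem_available(path):
    """Return 1 if the directory can be listed, 0 otherwise.

    Note: on a hung mount this call may block instead of failing.
    """
    try:
        os.listdir(path)
        return 1
    except OSError:
        return 0


def format_report(host, name, value):
    """Build one load report block in begin/host:name:value/end form."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def run_sensor(stdin, stdout, host, path=MOUNT_POINT, name=COMPLEX_NAME):
    """Answer each input line with a report until "quit" or end of input."""
    for line in iter(stdin.readline, ""):
        if line.strip() == "quit":
            break
        stdout.write(format_report(host, name, filesystem_available(path)))
        stdout.flush()


def main(report_globally=False):
    """Entry point: report per host, or as "global" from a single node."""
    host = "global" if report_globally else socket.gethostname()
    run_sensor(sys.stdin, sys.stdout, host)
```

To use it as a script, call main() at the bottom of the file. With
main(report_globally=True) on a single node, the value is reported as a
global load value, which matches the one-node suggestion above. The host
name returned by socket.gethostname() is assumed to match the name Grid
Engine uses for the node.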
William

> Regards,
> Alex
>
> On 09/04/2012 12:03 PM, Rayson Ho wrote:
>> Hi Alex,
>>
>> That's the correct behavior (for SSTATE_OPEN_OUTPUT), or else a user
>> can DoS the cluster easily by pointing the input or output file to a
>> path that can't be opened by the user.
>>
>> Rayson
>>
>>
>>
>> On Tue, Sep 4, 2012 at 2:50 PM, Alex Chekholko <[email protected]> wrote:
>>> Hi,
>>>
>>> I have a cluster with Rayson's OGE from Oct 2011.
>>>
>>> I see an unusual issue: our queue instances don't error out when a user's
>>> job fails.
>>>
>>> We have an underlying issue with the filesystem, and sometimes the compute
>>> nodes lose filesystem access.  A job gets dispatched, errors out with
>>>
>>> failed 26 : opening input/output file
>>>
>>> and then lots of other jobs go to that same node and error out before the
>>> filesystem comes back.
>>>
>>> IIRC, the queue should switch to error state when the first job errors out.
>>> But this isn't happening here.  Is there some setting I can check?
>>>
>>> I see the documentation says "A job enters the error state when Grid Engine
>>> tried to execute a job in a queue, but it failed for a reason that is
>>> considered specific to the job. A queue enters the error state when Grid
>>> Engine tried to execute a job in a queue, but it failed for a reason that is
>>> considered specific to the queue." per
>>> http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
>>>
>>> We also have a load sensor that checks for the presence of this filesystem,
>>> but the load sensor only updates every few minutes, while the filesystem
>>> tends to disappear for only about 60s.
>>>
>>> Regards,
>>> --
>>> Alex Chekholko [email protected]
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
> --
> Alex Chekholko [email protected] 347-401-4860
