From Grid Engine's point of view, it cannot tell whether the failure is a
filesystem issue or a bad pathname passed via qsub by the user.
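
On the load sensor question: as far as I know, how often the execution
daemon collects load values is governed by the load_report_time parameter
in the cluster configuration (see sge_conf(5)), so lowering it should make
the sensor report more often. For reference, here is a minimal sketch of a
load sensor that follows the usual begin/end protocol. The mount point
/shared and the complex name fs_available are hypothetical placeholders,
not taken from Alex's setup, and the complex would have to be defined at
the site before the scheduler could use it.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor reporting filesystem availability."""
import os
import socket
import sys

# Hypothetical values, not from the original thread: adjust for your site.
MOUNT_POINT = "/shared"        # assumed path of the flaky filesystem
COMPLEX_NAME = "fs_available"  # assumed name of a site-defined complex


def fs_available(path):
    """Return 1 if `path` can be listed and is a mount point, else 0."""
    try:
        os.listdir(path)
    except OSError:
        return 0
    return 1 if os.path.ismount(path) else 0


def report(host, name, value):
    """Format one load report block: begin, host:name:value, end."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    host = socket.gethostname()
    # The execution daemon writes one line to stdin per requested report;
    # the line "quit" asks the sensor to exit.
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(report(host, COMPLEX_NAME, fs_available(MOUNT_POINT)))
        stdout.flush()


if __name__ == "__main__":
    main()
```

Note that a check like this can itself block if the filesystem hangs
rather than disappears, so a real sensor would probably want a timeout
around the probe.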

Rayson



On Tue, Sep 4, 2012 at 3:27 PM, Alex Chekholko <[email protected]> wrote:
> Hi Rayson,
>
> OK, that is reasonable.
>
> My real problem is this filesystem problem, which is not related to grid
> engine.  But in the meantime, I wonder if there is some workaround for my
> filesystem issue.
>
> Is there a way to make the load sensor check more frequent?
>
> Regards,
> Alex
>
>
> On 09/04/2012 12:03 PM, Rayson Ho wrote:
>>
>> Hi Alex,
>>
>> That's the correct behavior (for SSTATE_OPEN_OUTPUT), or else a user
>> can DoS the cluster easily by pointing the input or output file to a
>> path that can't be opened by the user.
>>
>> Rayson
>>
>>
>>
>> On Tue, Sep 4, 2012 at 2:50 PM, Alex Chekholko <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have a cluster with Rayson's OGE from Oct 2011.
>>>
>>> I see an unusual issue: our queue instances don't error out when a user's
>>> job fails.
>>>
>>> We have an underlying issue with the filesystem, and sometimes the compute
>>> nodes lose filesystem access.  A job gets dispatched, errors out with
>>>
>>> failed       26  : opening input/output file
>>>
>>> and then lots of other jobs go to that same node and error out before the
>>> filesystem comes back.
>>>
>>> IIRC, the queue should switch to error state when the first job errors out.
>>> But this isn't happening here.  Is there some setting I can check?
>>>
>>> I see the documentation says "A job enters the error state when Grid Engine
>>> tried to execute a job in a queue, but it failed for a reason that is
>>> considered specific to the job. A queue enters the error state when Grid
>>> Engine tried to execute a job in a queue, but it failed for a reason that is
>>> considered specific to the queue." per
>>> http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
>>>
>>> We also have a load sensor that checks for the presence of this filesystem,
>>> but the load sensor only updates every few minutes, while the filesystem
>>> tends to disappear for only about 60s.
>>>
>>> Regards,
>>> --
>>> Alex Chekholko [email protected]
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
>
> --
> Alex Chekholko [email protected] 347-401-4860