Can you set KEEP_ACTIVE=true in "execd_params" for this host? (See the
sge_conf manpage at this URL:
http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )

Then request that the job run on this queue/host again, and see why the
shepherd can't open the job_pid file.

(And remember to unset the execd_params setting afterwards, or else your
local spool dir will eventually fill up with job information.)
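
As a rough sketch, the steps above might look like the following. The
qconf and qsub lines are left as comments because they need a live
cluster. The spool path and job.sh in the usage note are placeholders,
and the two helper functions are hypothetical names introduced here, not
part of Grid Engine. The script assumes the usual layout in which the
shepherd keeps its per-job files under active_jobs/<job_id>.<task_id>
inside the host's execd spool directory.

```shell
#!/bin/sh
# Sketch only: helper names and example paths are illustrative assumptions.

# 1. Enable KEEP_ACTIVE in the host-local configuration (opens an editor):
#      qconf -mconf cs428.ftc.avagotech.net
#    then add or extend the line:
#      execd_params   KEEP_ACTIVE=true
#
# 2. Resubmit the job so that it lands on the same host, for example:
#      qsub -l hostname=cs428.ftc.avagotech.net job.sh

# 3. After the failure, locate the directory the shepherd left behind.
#    $1 = execd spool dir for this host, $2 = job id, $3 = task id
active_job_dir() {
    printf '%s/active_jobs/%s.%s\n' "$1" "$2" "$3"
}

# List the kept files (with owner and mode) and print the shepherd's
# trace and error files if they are present.
show_shepherd_files() {
    ls -la "$1"
    for f in trace error; do
        if [ -f "$1/$f" ]; then
            echo "== $f =="
            cat "$1/$f"
        fi
    done
    return 0
}
```

On the exec host this could be used as
show_shepherd_files "$(active_job_dir /path/to/execd/spool/cs428 3885 1)"
with the real spool directory substituted. The ls output shows which user
owns the directory and the job_pid file at the time the job failed.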

Rayson



On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
<[email protected]> wrote:
> On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]> wrote:
>>
>> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>> <[email protected]> wrote:
>> > From the qmaster messages file:
>> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
>> > cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37
>> > [20339:8436]: can't open file job_pid: Permission denied
>> >
>> > I checked a job_pid file for a currently running job on the system that
>> > had the above errors. Permissions down the entire tree seem fine, and
>> > here is the job_pid file:
>> >
>> > -rw-r--r-- 1 grid  grid       6 Jun 14 17:40 job_pid
>>
>> Is your execd spool dir on NFS or local?
>>
> Local.
>
>>
>> Also, does it happen to all nodes or just a node or queue?
>>
>
> Happened on 2 different nodes.   Not all jobs caused this.
>
>>
>> Rayson
>>
>>
>>
>> >
>> > Any clues?    Is the path perhaps hard coded into sge_shepherd for this
>> > file?
>> >
>> > Thanks.
>> > --
>> > -MichaelC
>> >
>> > _______________________________________________
>> > users mailing list
>> > [email protected]
>> > https://gridengine.org/mailman/listinfo/users
>> >
>
>
>
>
> --
> -MichaelC
