This accounting record may also be of interest:

==============================================================
qname        all.q
hostname     cs431.ftc.avagotech.net
group        fidlib
owner        bgp
project      NONE
department   priority
jobname      qsubcmd.21231
jobnumber    17593
taskid       undefined
account      sge
priority     0
qsub_time    Wed Dec 31 17:00:00 1969
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       11  : before job
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
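
For anyone who wants to scan many such records for the same failure, here is a minimal sketch in Python. It assumes the record above is available as plain text in the same "key value" layout; the function names and the shortened sample are made up for illustration and are not part of Grid Engine.

```python
def parse_record(text):
    """Parse the 'key   value' lines of one accounting record into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and the ===== separator
        key, _, value = line.partition(" ")
        record[key] = value.strip()
    return record


def failure_info(record):
    """Split the 'failed' field into (code, reason), e.g. (11, 'before job')."""
    code, _, reason = record.get("failed", "0").partition(":")
    return int(code.strip()), reason.strip()


# Shortened copy of the record above, used only as sample input.
SAMPLE = """\
==============================================================
qname        all.q
hostname     cs431.ftc.avagotech.net
jobnumber    17593
failed       11  : before job
exit_status  0
"""

rec = parse_record(SAMPLE)
code, reason = failure_info(rec)
print(rec["hostname"], rec["jobnumber"], code, reason)
```

A nonzero code in the failed field, together with the hostname, is enough to group failures by node.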


On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman <
[email protected]> wrote:

> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <[email protected]> wrote:
>
>> Can you set "execd_params" to KEEP_ACTIVE for this host?? (See the
>> manpage at this URL:
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
>>
>> Request the job to run in this queue/host again, and see why the
>> shepherd can't open the job_pid.
>>
>> (And remember to unset the execd_params or else you will fill up your
>> local spool dir eventually with job information.)
>>
>>
> I can't do this on my production grid, and I don't currently know how to
> replicate the problem.   I will set things up on a test setup and try to
> reproduce the issue with KEEP_ACTIVE turned on.
>
> Is it possible to set KEEP_ACTIVE per host?   I only see this in the
> output of qconf -sconf.
>
>
>> Rayson
>>
>>
>>
>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
>> <[email protected]> wrote:
>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]> wrote:
>> >>
>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>> >> <[email protected]> wrote:
>> >> > From the qmaster messages file:
>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
>> >> > cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37
>> >> > [20339:8436]: can't open file job_pid: Permission denied
>> >> >
>> >> > I checked the job_pid file of a currently running job on the system
>> >> > that had the above errors. Permissions down the entire tree seem fine,
>> >> > and here is the job_pid file:
>> >> >
>> >> > -rw-r--r-- 1 grid  grid       6 Jun 14 17:40 job_pid
>> >>
>> >> Is your execd spool dir on NFS or local??
>> >>
>> > Local.
>> >
>> >>
>> >> Also, does it happen to all nodes or just a node or queue?
>> >>
>> >
>> > Happened on 2 different nodes.   Not all jobs caused this.
>> >
>> >>
>> >> Rayson
>> >>
>> >>
>> >>
>> >> >
>> >> > Any clues?    Is the path perhaps hard coded into sge_shepherd for this
>> >> > file?
>> >> >
>> >> > Thanks.
>> >> > --
>> >> > -MichaelC
>> >> >
>> >> > _______________________________________________
>> >> > users mailing list
>> >> > [email protected]
>> >> > https://gridengine.org/mailman/listinfo/users
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > -MichaelC
>>
>
>
>
> --
> -MichaelC
>
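
The "permissions down the entire tree" check mentioned in the quoted message can be repeated on other nodes with a short script. This is only a sketch: the helper name is made up, and the demo builds a throwaway job_pid file in a temporary directory because the real spool path depends on the local installation.

```python
import os
import stat
import tempfile


def path_permissions(path):
    """Return (path, mode string, uid, gid) for every component, root first."""
    path = os.path.abspath(path)
    parts = []
    while True:
        parts.append(path)
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    rows = []
    for p in reversed(parts):
        st = os.stat(p)
        rows.append((p, stat.filemode(st.st_mode), st.st_uid, st.st_gid))
    return rows


# Demo on a temporary stand-in for a job's spool directory.
with tempfile.TemporaryDirectory() as spool:
    job_pid = os.path.join(spool, "job_pid")
    with open(job_pid, "w") as fh:
        fh.write("12345\n")
    rows = path_permissions(job_pid)
    for p, mode, uid, gid in rows:
        print(mode, uid, gid, p)
```

Pointing path_permissions() at a real job_pid file lists each directory on the way down, so a component without search (x) permission for the relevant user stands out.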



-- 
-MichaelC