On Fri, Jun 15, 2012 at 2:43 PM, Michael Coffman <[email protected]> wrote:
> 06/14/2012 08:56:49| main|cs431|E|shepherd of job 9990340.1 exited with exit status = 11

Hmm, then most likely the qmaster log won't tell you anything either, so we need the shepherd "trace" file (in the active_jobs directory) to find out what's happening.

Also, do you know if the job has any "%s" parameters passed into it? (We have received reports of this before: a highly random error that can happen depending on how the shepherd is built and the OS it is running on.)

Rayson

>
>> Rayson
>>
>> >
>> > ==============================================================
>> > qname        all.q
>> > hostname     cs431.ftc.avagotech.net
>> > group        fidlib
>> > owner        bgp
>> > project      NONE
>> > department   priority
>> > jobname      qsubcmd.21231
>> > jobnumber    17593
>> > taskid       undefined
>> > account      sge
>> > priority     0
>> > qsub_time    Wed Dec 31 17:00:00 1969
>> > start_time   -/-
>> > end_time     -/-
>> > granted_pe   NONE
>> > slots        0
>> > failed       11 : before job
>> > exit_status  0
>> > ru_wallclock 0
>> > ru_utime     0.000
>> > ru_stime     0.000
>> > ru_maxrss    0
>> > ru_ixrss     0
>> > ru_ismrss    0
>> > ru_idrss     0
>> > ru_isrss     0
>> > ru_minflt    0
>> > ru_majflt    0
>> > ru_nswap     0
>> > ru_inblock   0
>> > ru_oublock   0
>> > ru_msgsnd    0
>> > ru_msgrcv    0
>> > ru_nsignals  0
>> > ru_nvcsw     0
>> > ru_nivcsw    0
>> > cpu          0.000
>> > mem          0.000
>> > io           0.000
>> > iow          0.000
>> > maxvmem      0.000
>> > arid         undefined
>> >
>> > On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman
>> > <[email protected]> wrote:
>> >>
>> >> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <[email protected]>
>> >> wrote:
>> >>>
>> >>> Can you set "execd_params" to KEEP_ACTIVE for this host?? (See the
>> >>> manpage at this URL:
>> >>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
>> >>>
>> >>> Request the job to run in this queue/host again, and see why the
>> >>> shepherd can't open the job_pid.
>> >>>
>> >>> (And remember to unset the execd_params or else you will fill up your
>> >>> local spool dir eventually with job information.)
>> >>>
>> >>
>> >> I can't do this on my production grid. And I don't know how to replicate
>> >> the problem currently. I will set things up on a test setup and try and
>> >> reproduce the issue with KEEP_ACTIVE turned on.
>> >>
>> >> Is it possible to set the KEEP_ACTIVE per host? I only see this in the
>> >> qconf -sconf
>> >>
>> >>>
>> >>> Rayson
>> >>>
>> >>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
>> >>> <[email protected]> wrote:
>> >>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]>
>> >>> > wrote:
>> >>> >>
>> >>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
>> >>> >> <[email protected]> wrote:
>> >>> >> > From the qmaster messages file:
>> >>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
>> >>> >> > cs428.ftc.avagotech.net general before job because: 06/14/2012 21:29:37
>> >>> >> > [20339:8436]: can't open file job_pid: Permission denied
>> >>> >> >
>> >>> >> > I checked a job_pid file on a currently running job on the system that had
>> >>> >> > the above errors, permission down the entire tree seems fine and here is the
>> >>> >> > job_id file:
>> >>> >> >
>> >>> >> > -rw-r--r-- 1 grid grid 6 Jun 14 17:40 job_pid
>> >>> >>
>> >>> >> Is your execd spool dir on NFS or local??
>> >>> >>
>> >>> > Local.
>> >>> >
>> >>> >>
>> >>> >> Also, does it happen to all nodes or just a node or queue?
>> >>> >>
>> >>> >
>> >>> > Happened on 2 different nodes. Not all jobs caused this.
>> >>> >
>> >>> >>
>> >>> >> Rayson
>> >>> >>
>> >>> >> >
>> >>> >> > Any clues? Is the path perhaps hard coded into sge_shepherd for this
>> >>> >> > file?
>> >>> >> >
>> >>> >> > Thanks.
>> >>> >> > --
>> >>> >> > -MichaelC
>> >>> >> >
>> >>> >> > _______________________________________________
>> >>> >> > users mailing list
>> >>> >> > [email protected]
>> >>> >> > https://gridengine.org/mailman/listinfo/users
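
For anyone following the thread and looking for the shepherd "trace" file mentioned at the top, here is a minimal Python sketch for locating it and reading its last few lines. It assumes the usual execd spool layout of <spool_dir>/<host>/active_jobs/<job_id>.<task_id>/trace; that layout, the function names, and the default task id of 1 are assumptions for illustration, not something confirmed in this thread, so adjust them to your installation. The directory normally disappears when the job finishes unless KEEP_ACTIVE is set as discussed above.

```python
from pathlib import Path


def shepherd_trace_path(spool_dir, host, job_id, task_id=1):
    """Build the assumed location of the shepherd trace file for one task."""
    # Assumed layout (not confirmed in the thread):
    #   <spool_dir>/<host>/active_jobs/<job_id>.<task_id>/trace
    return Path(spool_dir) / host / "active_jobs" / f"{job_id}.{task_id}" / "trace"


def tail_trace(path, lines=20):
    """Return the last `lines` lines of the trace file, or [] if it is missing."""
    path = Path(path)
    if not path.is_file():
        # The job directory is gone or was never created.
        return []
    # errors="replace" keeps going even if the file has odd bytes in it.
    text = path.read_text(errors="replace")
    return text.splitlines()[-lines:]
```

With the job from the log line above, a call would look like tail_trace(shepherd_trace_path(spool_dir, "cs431", 9990340)), where spool_dir is whatever execd_spool_dir is set to on your grid.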
