On Thu, Apr 05, 2018 at 09:46:23AM +0200, Paul Paul wrote:
> Hello,
>
> We're using SGE 8.1.9 and randomly, we have jobs that finish with success
> (our jobs logs confirm this) but the master is not notified.
> On the compute, all the folders related to such a job are still here,
> correctly filled:
>
> trace file:
> ...
> 04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
> 04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh"
> "/gridware/sge/gridname/spool/server/job_scripts/1376090")
> 04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0;
> WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
> 04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
> 04/04/2018 21:50:23 [300:38327]: job exited not due to signal
> 04/04/2018 21:50:23 [300:38327]: job exited with status 0
> 04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
> 04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
> 04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
> 04/04/2018 21:50:23 [300:38327]: no epilog script to start
>
> exit_status:
> 0
>
> error:
> (empty)
>
> but the process no longer appears in the 'ps' output.
>
> On the master, doing a 'qstat -j 1376090' works and so, to get rid of such a
> job, we are performing 'qdel -f 1376090'.
>
> This happens 3 or 4 times a day (we submit more than 100k jobs per day), on
> different exec hosts.
>
> Do you know what could be the cause of this behavior?

Is there anything in the messages log?
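With this many jobs per day, one way to check is to filter a messages file for the job ID in question. Below is a minimal sketch in Python, assuming the messages file is plain text with one entry per line. `lines_for_job` is a hypothetical helper, not part of SGE, and the location of the messages file is left to the caller.

```python
import re
from typing import Iterable, List


def lines_for_job(lines: Iterable[str], job_id: int) -> List[str]:
    """Return the log lines that mention job_id as a complete number.

    Hypothetical helper: it assumes only that each log entry is one
    line of text, nothing about the exact field layout.
    """
    # Refuse matches that are part of a longer number, so that
    # 1376090 does not also select lines about job 13760901.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?!\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# Example use; the path is whatever messages file you want to inspect:
#
#     with open(path_to_messages, errors="replace") as handle:
#         for entry in lines_for_job(handle, 1376090):
#             print(entry)
```

Running this against the messages file on the exec host that ran the job, and against the one on the master, should show whether either side logged anything about that job around the time it exited.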
Alternatively, this might just be the network being less than 100% reliable, in which case tweaking gdi_timeout and gdi_retries might help.

William
