William,

Thanks for your reply.

In the 'messages' file of the exec host, there is nothing (the last message was 2 weeks ago).
In the 'messages' file of the master, there are the usual lines:

04/05/2018 06:42:58|worker|master_host|W|user forced the deletion of job 1376090
04/05/2018 06:43:20|worker|master_host|E|execd@exec_host reports running job (1376090.1/master) in queue "queue@exec_host" that was not supposed to be there - killing
04/05/2018 06:43:59|worker|master_host|E|execd@exec_host reports running job (1376090.1/master) in queue "queue@exec_host" that was not supposed to be there - killing

About 'gdi_timeout' and 'gdi_retries', we will try modifying them to see whether things improve.
We have already noticed issues when submitting jobs with 'qsub' (when the NFS is heavily loaded), such as:
"Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error)."
so it might help with this too.

Paul.

> Sent: Thursday, April 05, 2018 at 8:20 AM
> From: "William Hay" <[email protected]>
> To: "Paul Paul" <[email protected]>
> Cc: [email protected]
> Subject: Re: [gridengine users] Job finishes correctly but master is not notified
>
> On Thu, Apr 05, 2018 at 09:46:23AM +0200, Paul Paul wrote:
> > Hello,
> >
> > We're using SGE 8.1.9 and randomly, we have jobs that finish with success
> > (our jobs logs confirm this) but the master is not notified.
> > On the compute, all the folders related to such a job are still here,
> > correctly filled:
> >
> > trace file:
> > ...
> > 04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
> > 04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server/job_scripts/1376090")
> > 04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> > 04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
> > 04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
> > 04/04/2018 21:50:23 [300:38327]: job exited not due to signal
> > 04/04/2018 21:50:23 [300:38327]: job exited with status 0
> > 04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
> > 04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
> > 04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
> > 04/04/2018 21:50:23 [300:38327]: no epilog script to start
> >
> > exit_status:
> > 0
> >
> > error:
> > (empty)
> >
> > but the process no longer appears in the 'ps' output.
> >
> > On the master, doing a 'qstat -j 1376090' works and so, to get rid of such
> > a job, we are performing 'qdel -f 1376090'.
> >
> > This happens 3 or 4 times a day (we submit more than 100k jobs per day), on
> > different exec hosts.
> >
> > Do you know what could be the cause of this behavior?
> Is there anything in the messages log?
>
> Alternatively this might just be networks being less than 100% reliable.
> Possibly tweaking gdi_timeout and gdi_retries
> might help.
>
> William
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
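
For anyone wanting to count how often this happens per exec host, here is a minimal Python sketch that scans the master's 'messages' lines for the "not supposed to be there" errors quoted above. It assumes the pipe-separated layout seen in those lines (timestamp, thread, host, level, text) and a month/day/year date. The names `Entry`, `parse_line` and `orphan_reports` are made up for this example and are not part of SGE.

```python
import re
from datetime import datetime
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    when: datetime
    thread: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one 'messages' line into its five pipe-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        # Assumption: dates are month/day/year, as in the lines quoted above.
        when = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return Entry(when, *parts[1:])


# Matches the error text quoted in this thread; captures job id and queue.
ORPHAN = re.compile(
    r'reports running job \((\d+)\.\d+/\S+\) in queue "([^"]+)" '
    r"that was not supposed to be there"
)


def orphan_reports(lines: Iterable[str]) -> Dict[Tuple[int, str], int]:
    """Count 'not supposed to be there' errors per (job id, queue instance)."""
    counts: Dict[Tuple[int, str], int] = {}
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "E":
            continue
        match = ORPHAN.search(entry.text)
        if match:
            key = (int(match.group(1)), match.group(2))
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Feeding it the open 'messages' file (for example `orphan_reports(open(path))`, with `path` pointing at the master's spool directory) gives one counter per job and queue instance, which may help show whether some hosts are hit more often than others.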
