William,

Thanks for your reply.

The 'messages' file on the exec host contains nothing relevant (its last entry 
is from 2 weeks ago).
In the 'messages' file of the master, there are the usual lines:
04/05/2018 06:42:58|worker|master_host|W|user forced the deletion of job 1376090
04/05/2018 06:43:20|worker|master_host|E|execd@exec_host reports running job 
(1376090.1/master) in queue "queue@exec_host" that was not supposed to be there 
- killing
04/05/2018 06:43:59|worker|master_host|E|execd@exec_host reports running job 
(1376090.1/master) in queue "queue@exec_host" that was not supposed to be there 
- killing
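
To pull such entries out of the master's 'messages' file for a given job, 
something like the sketch below might do. It assumes each entry in the real 
file is a single line with five pipe-separated fields (timestamp, thread, 
host, level, text), as in the lines above (which are only wrapped by the mail 
client). The names 'parse_line' and 'entries_for_job' are made up for this 
example and are not part of SGE.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Sequence


class Entry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str   # e.g. "W" (warning) or "E" (error), as seen in the log above
    text: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one 'messages' line into its five fields; None if malformed."""
    # Split at most 4 times so any '|' inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*parts)


def entries_for_job(lines: Iterable[str], job_id: str,
                    levels: Sequence[str] = ("W", "E")) -> List[Entry]:
    """Return entries at the given levels whose text mentions job_id."""
    # Word boundaries keep job 1376090 from matching inside 13760901.
    pattern = re.compile(rf"\b{re.escape(job_id)}\b")
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry.level in levels and pattern.search(entry.text):
            found.append(entry)
    return found
```

For example, entries_for_job(open(path_to_messages), "1376090") would return 
the three entries quoted above, where path_to_messages is whatever location 
the qmaster spool uses on your installation.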

Regarding 'gdi_timeout' and 'gdi_retries', we will try adjusting them to see 
whether things improve.
We have already noticed issues when submitting jobs with 'qsub' (when NFS is 
heavily loaded), such as:
"Unable to run job: failed receiving gdi request response for mid=1 (got 
syncron message receive timeout error)."
so it might help with this too.

Paul.

> Sent: Thursday, April 05, 2018 at 8:20 AM
> From: "William Hay" <[email protected]>
> To: "Paul Paul" <[email protected]>
> Cc: [email protected]
> Subject: Re: [gridengine users] Job finishes correctly but master is not 
> notified
>
> On Thu, Apr 05, 2018 at 09:46:23AM +0200, Paul Paul wrote:
> > Hello,
> > 
> > We're using SGE 8.1.9 and randomly, we have jobs that finish with success 
> > (our jobs logs confirm this) but the master is not notified.
> > On the compute node, all the folders related to such a job are still 
> > there, correctly filled:
> > 
> > trace file:
> > ...
> > 04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
> > 04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" 
> > "/gridware/sge/gridname/spool/server/job_scripts/1376090")
> > 04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; 
> > WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> > 04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
> > 04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
> > 04/04/2018 21:50:23 [300:38327]: job exited not due to signal
> > 04/04/2018 21:50:23 [300:38327]: job exited with status 0
> > 04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
> > 04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
> > 04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
> > 04/04/2018 21:50:23 [300:38327]: no epilog script to start
> > 
> > exit_status:
> > 0
> > 
> > error:
> > (empty)
> > 
> > but the process no longer appears in the 'ps' output.
> > 
> > On the master, doing a 'qstat -j 1376090' works and so, to get rid of such 
> > a job, we are performing 'qdel -f 1376090'.
> > 
> > This happens 3 or 4 times a day (we submit more than 100k jobs per day), on 
> > different exec hosts.
> > 
> > Do you know what could be the cause of this behavior?
> Is there anything in the messages log?
> 
> Alternatively this might just be networks being less than 100% reliable.  
> Possibly tweaking gdi_timeout and gdi_retries 
> might help.
> 
> William
> 
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
