On Thu, Apr 05, 2018 at 09:46:23AM +0200, Paul Paul wrote:
> Hello,
> 
> We're using SGE 8.1.9 and, at random, some jobs finish successfully
> (our job logs confirm this) but the master is not notified.
> On the compute node, all the folders related to such a job are still there
> and correctly filled:
> 
> trace file:
> ...
> 04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
> 04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server/job_scripts/1376090")
> 04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
> 04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
> 04/04/2018 21:50:23 [300:38327]: job exited not due to signal
> 04/04/2018 21:50:23 [300:38327]: job exited with status 0
> 04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
> 04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
> 04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
> 04/04/2018 21:50:23 [300:38327]: no epilog script to start
> 
> exit_status:
> 0
> 
> error:
> (empty)
> 
> but the process no longer appears in the 'ps' output.
> 
> On the master, 'qstat -j 1376090' still works, so to get rid of such a
> job we run 'qdel -f 1376090'.
> 
> This happens 3 or 4 times a day (we submit more than 100k jobs per day), on 
> different exec hosts.
> 
> Do you know what could be the cause of this behavior?
Is there anything in the messages log?

Alternatively, this might just be the network being less than 100% reliable.
Tweaking gdi_timeout and gdi_retries might help.
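
To check the messages log for a stuck job, a small filter like the one below may help. It is only a sketch: the function name lines_mentioning_job and the script name find_job.py are made up for illustration, and the paths of the messages files are an assumption you pass in yourself (they depend on where your qmaster and execd spool directories live). It makes no assumption about the log line format and simply keeps lines in which the job ID appears as a standalone number.

```python
import re
import sys


def lines_mentioning_job(lines, job_id):
    """Return the lines that contain job_id as a standalone number.

    Hypothetical helper: it does not parse the SGE log format, it only
    does a text match on the job ID.
    """
    # The lookarounds keep 1376090 from matching inside 13760901.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?!\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python find_job.py <job_id> <messages file> [<messages file> ...]
    # The file paths are site-specific and must be supplied by the caller.
    job_id = sys.argv[1]
    for path in sys.argv[2:]:
        with open(path, errors="replace") as handle:
            for hit in lines_mentioning_job(handle, job_id):
                print(f"{path}: {hit}")
```

Running it against both the qmaster messages file and the messages file of the exec host that ran the job should show whether the job-finish report was ever sent or received.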

William
