Hello,

We're using SGE 8.1.9 and, at random, some jobs finish successfully (our job
logs confirm this) but the master is never notified.
On the compute node, all the directories related to such a job are still
present and correctly populated:

trace file:
...
04/04/2018 21:50:13 [300:38328]: now running with uid=300, euid=300
04/04/2018 21:50:13 [300:38328]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server/job_scripts/1376090")
04/04/2018 21:50:23 [300:38327]: wait3 returned 38328 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
04/04/2018 21:50:23 [300:38327]: job exited with exit status 0
04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328
04/04/2018 21:50:23 [300:38327]: job exited not due to signal
04/04/2018 21:50:23 [300:38327]: job exited with status 0
04/04/2018 21:50:23 [300:38327]: now sending signal KILL to pid -38328
04/04/2018 21:50:23 [300:38327]: pdc_kill_addgrpid: 20075 9
04/04/2018 21:50:23 [300:38327]: writing usage file to "usage"
04/04/2018 21:50:23 [300:38327]: no epilog script to start

exit_status:
0

error:
(empty)

However, the job's process no longer appears in the 'ps' output.
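
The check described above (the trace reports a clean exit while the master still knows the job) could be automated on the exec host. Here is a minimal Python sketch, assuming the trace line format shown in the excerpt above. The function name `finished_exit_status` and the sample text are illustrative only and are not part of SGE:

```python
import re

# Matches the shepherd trace line "job exited with exit status N".
# The line format is an assumption taken from the trace excerpt above.
EXIT_RE = re.compile(r"\]: job exited with exit status (-?\d+)\s*$")


def finished_exit_status(trace_text):
    """Return the exit status recorded in a trace, or None if no exit is recorded."""
    for line in trace_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            return int(match.group(1))
    return None


# Two lines copied from the trace above, used as sample input.
sample = (
    '04/04/2018 21:50:23 [300:38327]: job exited with exit status 0\n'
    '04/04/2018 21:50:23 [300:38327]: reaped "job" with pid 38328\n'
)
print(finished_exit_status(sample))
```

A job for which this returns a status while 'qstat -j' on the master still lists it would match the symptom described here.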

On the master, 'qstat -j 1376090' still reports the job, so to get rid of it
we run 'qdel -f 1376090'.

This happens 3 or 4 times a day (we submit more than 100k jobs per day), on 
different exec hosts.

Do you know what could be the cause of this behavior?

Thanks,

Paul.
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
