Hello,

We're using SGE 8.1.9 on Linux servers (Debian 9) and occasionally one job out of several thousand is killed. To investigate, we debugged and even recompiled the sge_shepherd binary so that it calls "waitpid" instead of "wait3", uses libexplain to get more information, and runs a 'ps axf' command to list the running processes when a kill happens. Despite this, the root cause remains unclear.

Here is the content of the 'trace' file of a killed job:

04/18/2018 09:10:38 [300:23222]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server_name/job_scripts/4056720")
04/18/2018 09:11:08 [300:23212]: waitpid returned -1
04/18/2018 09:11:08 [300:23212]: waitpid(pid = -1, status = 0x7FFEFF80093C, options = 0) failed, Interrupted system call (4, EINTR) because the process was interrupted by a signal before the waitpid was complete

But the 'ps' output that appears just after it in the 'trace' file shows:

04/18/2018 09:11:09 [300:23212]: OUTPUT: 23212 ? S 0:00 \_ sge_shepherd-4056720 -bg
04/18/2018 09:11:09 [300:23212]: OUTPUT: 23222 ? Ss 0:00 | \_ -ksh /gridware/sge/gridname/spool/server_name/job_scripts/4056720
04/18/2018 09:11:09 [300:23212]: OUTPUT: 23233 ? S 0:00 | | \_ perl ......
04/18/2018 09:11:09 [300:23212]: OUTPUT: 23427 ? R 0:23 | | \_ perl ......

So the job was running fine before being killed. This is the remaining 'trace' content:

04/18/2018 09:11:09 [300:23212]: forward_signal_to_job(): mapping signal 20 TSTP
04/18/2018 09:11:09 [300:23212]: mapped signal TSTP to signal KILL
04/18/2018 09:11:09 [300:23212]: queued signal KILL
04/18/2018 09:11:09 [300:23212]: kill(-23222, KILL)
04/18/2018 09:11:09 [300:23212]: now sending signal KILL to pid -23222
04/18/2018 09:11:09 [300:23212]: pdc_kill_addgrpid: 20029 9
04/18/2018 09:11:09 [0:23212]: killing pid 23222/4
04/18/2018 09:11:09 [0:23212]: killing pid 23427/4
04/18/2018 09:11:09 [300:23212]: waitpid(pid = -1, status = 0x7FFEFF80093C, options = 0): success
04/18/2018 09:11:09 [300:23212]: waitpid returned 23222 (status: 9; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
04/18/2018 09:11:09 [300:23212]: job exited with exit status 0
04/18/2018 09:11:09 [300:23212]: reaped "job" with pid 23222
04/18/2018 09:11:09 [300:23212]: job exited due to signal
04/18/2018 09:11:09 [300:23212]: job signaled: 9
04/18/2018 09:11:09 [300:23212]: ignored signal KILL to pid -23222
04/18/2018 09:11:09 [300:23212]: writing usage file to "usage"
04/18/2018 09:11:09 [300:23212]: no epilog script to start

So it is clearly the sge_shepherd that forwarded the signal and killed the 'ksh' script and its children. The loglevel is set to 'log_info', but there is nothing in the 'messages' file of the exec node.

On the master, when we launched a 'qdel -f 4056720' to get rid of the job, the following messages appeared:

04/18/2018 12:15:12|worker|master|W|user forced the deletion of job 4056720
04/18/2018 12:15:13|worker|master|E|execd server_name reports running state for job (4056720.1/master) in queue "queue@server_name" while job is in state 65536

As far as we can tell, a job is killed when the tag of a message processed by the 'sge_execd_process_messages' function is set to 'TAG_SIGJOB'. In this case, that would mean the master sent such a message, but there is no reason for it to do so (no 'qdel' or other commands, apart from qstat, are run). Or maybe there are other cases?

The grid (master, shadow and exec hosts) was completely restarted a few days ago.

Thanks for reading, and if you have any clue, thanks for sharing.

Regards,
Paul.
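
P.S. To make the pattern in the trace easier to discuss, here is a minimal C sketch of it. It is not the sge_shepherd source, only an illustration under my own assumptions: a parent installs a SIGTSTP handler without SA_RESTART, blocks in waitpid(), gets EINTR when SIGTSTP arrives, and then sends SIGKILL to the child's process group. The names demo_interrupted_waitpid, reap and the "sender" process are hypothetical, and the sender only stands in for whatever delivers the signal in the real setup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t received_signal = 0;

static void on_signal(int signo) { received_signal = signo; }

/* Reap one specific child, retrying if a signal interrupts the call. */
static pid_t reap(pid_t pid, int *status)
{
    pid_t r;
    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);
    return r;
}

/* Returns 0 when the EINTR-then-KILL sequence was reproduced, -1 otherwise. */
int demo_interrupted_waitpid(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART, so a blocked waitpid() fails with EINTR */
    if (sigaction(SIGTSTP, &sa, NULL) == -1) return -1;

    pid_t parent = getpid();

    /* Stand-in for the job: own process group, sleeps until it is killed. */
    pid_t job = fork();
    if (job == -1) return -1;
    if (job == 0) {
        setpgid(0, 0);
        for (;;) pause();
    }
    setpgid(job, job); /* set it from the parent too, to avoid a race */

    /* Hypothetical sender: delivers SIGTSTP to the parent every 100 ms. */
    pid_t sender = fork();
    if (sender == -1) {
        kill(-job, SIGKILL);
        reap(job, NULL);
        return -1;
    }
    if (sender == 0) {
        struct timespec delay = {0, 100 * 1000 * 1000};
        for (;;) {
            nanosleep(&delay, NULL);
            if (getppid() != parent) _exit(0); /* parent is gone, stop */
            kill(parent, SIGTSTP);
        }
    }

    int status = 0;
    pid_t r = waitpid(job, &status, 0); /* blocks: the job never exits */
    int saved_errno = errno;

    /* Stop the sender first so later waits are not interrupted again. */
    kill(sender, SIGKILL);
    reap(sender, NULL);

    int interrupted = (r == -1 && saved_errno == EINTR);
    if (interrupted) {
        assert(received_signal == SIGTSTP); /* only handler installed here */
        printf("waitpid returned -1 (EINTR) after signal %d\n",
               (int)received_signal);
    }

    /* Kill the whole process group of the job, then reap it. */
    kill(-job, SIGKILL);
    if (reap(job, &status) != job) return -1;
    printf("job signaled: %d\n", WIFSIGNALED(status) ? WTERMSIG(status) : 0);

    if (interrupted && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return 0;
    return -1;
}
```

Calling demo_interrupted_waitpid() from a small main() should be enough to try it. The sketch only shows that the shepherd's EINTR is the normal result of a caught signal; it says nothing about who sent signal 20 to the shepherd, which is the open question above.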