Hello,

We're using SGE 8.1.9 on Linux servers (Debian 9) and occasionally one job out 
of several thousand is killed.
We debugged and even recompiled the sge_shepherd binary (to call "waitpid" 
instead of "wait3", to use libexplain to get more information, and to run a 
'ps axf' command to list the running processes when a kill happens) to try to 
understand, but the root cause remains unclear.

Here is the content of the 'trace' file of a killed job:

04/18/2018 09:10:38 [300:23222]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/gridname/spool/server_name/job_scripts/4056720")
04/18/2018 09:11:08 [300:23212]: waitpid returned -1
04/18/2018 09:11:08 [300:23212]: waitpid(pid = -1, status = 0x7FFEFF80093C, options = 0) failed, Interrupted system call (4, EINTR) because the process was interrupted by a signal before the waitpid was complete

but the 'ps' output that appears just after it in the 'trace' file shows:

04/18/2018 09:11:09 [300:23212]: OUTPUT: 23212 ?        S      0:00  \_ sge_shepherd-4056720 -bg
04/18/2018 09:11:09 [300:23212]: OUTPUT: 23222 ?        Ss     0:00  |   \_ -ksh /gridware/sge/gridname/spool/server_name/job_scripts/4056720
04/18/2018 09:11:09 [300:23212]: OUTPUT: 23233 ?        S      0:00  |   |   \_ perl ......
04/18/2018 09:11:09 [300:23212]: OUTPUT: 23427 ?        R      0:23  |   |       \_ perl ......

So the job was running fine before being killed. This is the remaining 'trace' 
content:

04/18/2018 09:11:09 [300:23212]: forward_signal_to_job(): mapping signal 20 TSTP
04/18/2018 09:11:09 [300:23212]: mapped signal TSTP to signal KILL
04/18/2018 09:11:09 [300:23212]: queued signal KILL
04/18/2018 09:11:09 [300:23212]: kill(-23222, KILL)
04/18/2018 09:11:09 [300:23212]: now sending signal KILL to pid -23222
04/18/2018 09:11:09 [300:23212]: pdc_kill_addgrpid: 20029 9
04/18/2018 09:11:09 [0:23212]: killing pid 23222/4
04/18/2018 09:11:09 [0:23212]: killing pid 23427/4
04/18/2018 09:11:09 [300:23212]: waitpid(pid = -1, status = 0x7FFEFF80093C, options = 0): success
04/18/2018 09:11:09 [300:23212]: waitpid returned 23222 (status: 9; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
04/18/2018 09:11:09 [300:23212]: job exited with exit status 0
04/18/2018 09:11:09 [300:23212]: reaped "job" with pid 23222
04/18/2018 09:11:09 [300:23212]: job exited due to signal
04/18/2018 09:11:09 [300:23212]: job signaled: 9
04/18/2018 09:11:09 [300:23212]: ignored signal KILL to pid -23222
04/18/2018 09:11:09 [300:23212]: writing usage file to "usage"
04/18/2018 09:11:09 [300:23212]: no epilog script to start

So it is clearly the sge_shepherd that forwarded the signal and killed the 
'ksh' script and its children.
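
The two trace entries "kill(-23222, KILL)" and "status: 9; WIFSIGNALED: 1" can be reproduced outside SGE. The following is only a minimal Python sketch, assuming a Linux host; the function names decode_wait_status and kill_process_group are hypothetical helpers and not part of sge_shepherd. It shows how a raw wait status of 9 decodes, and that signalling a process group (a negative pid in kill) terminates the session leader with SIGKILL.

```python
import os
import signal
import subprocess


def decode_wait_status(status):
    """Decode a raw wait status with the same macros the trace prints."""
    return {
        "WIFSIGNALED": os.WIFSIGNALED(status),
        "WIFEXITED": os.WIFEXITED(status),
        "WEXITSTATUS": os.WEXITSTATUS(status),
        "WTERMSIG": os.WTERMSIG(status),
    }


def kill_process_group(cmd):
    """Start cmd as a session leader, SIGKILL its group, return the exit code."""
    # start_new_session=True runs setsid() in the child before exec, so the
    # child's pid is also its process group id (compare the 'Ss' state of ksh).
    child = subprocess.Popen(cmd, start_new_session=True)
    # killpg(pgid, sig) signals every process in the group, like kill(-pgid, sig).
    os.killpg(child.pid, signal.SIGKILL)
    # Popen reports "terminated by signal N" as a return code of -N.
    return child.wait()
```

On Linux, decode_wait_status(9) gives WIFSIGNALED true, WIFEXITED false, WEXITSTATUS 0 and WTERMSIG 9. That is consistent with the trace printing both "job exited with exit status 0" and "job signaled: 9": the exit-status field is simply zero when the process died from a signal.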

The loglevel is set to 'log_info', but there is nothing in the 'messages' file 
of the exec node. On the master, when we launched a 'qdel -f 4056720' to get 
rid of the job, the following messages appeared:

04/18/2018 12:15:12|worker|master|W|user forced the deletion of job 4056720
04/18/2018 12:15:13|worker|master|E|execd server_name reports running state for job (4056720.1/master) in queue "queue@server_name" while job is in state 65536

As far as we can tell, a job is killed when the tag of a message processed 
by the 'sge_execd_process_messages' function is set to 'TAG_SIGJOB'. In 
this case, that would mean the master sent such a message, but it had no 
reason to do so (no 'qdel' or other commands, apart from 'qstat', were run). Or 
maybe there are other cases?
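
To look for other cases across many jobs, the 'trace' files can be scanned for the signal-mapping entries shown earlier. The following Python sketch is an illustration only: the function name signal_events is hypothetical, it assumes each trace entry sits on a single unwrapped line, and it assumes the bracketed pair is "uid:pid" (the trace itself does not label it).

```python
import re

# Timestamp, bracketed pair (assumed to be uid:pid), then the message.
TRACE_LINE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<uid>\d+):(?P<pid>\d+)\]: (?P<msg>.*)$"
)
# Message shapes copied from the trace excerpt above.
MAPPING = re.compile(r"mapping signal (?P<num>\d+) (?P<name>\w+)")
MAPPED = re.compile(r"mapped signal (?P<src>\w+) to signal (?P<dst>\w+)")


def signal_events(lines):
    """Return (timestamp, pid, kind, detail) for each signal-mapping entry."""
    events = []
    for line in lines:
        entry = TRACE_LINE.match(line.strip())
        if entry is None:
            continue  # wrapped or unrelated line
        pid = int(entry["pid"])
        mapping = MAPPING.search(entry["msg"])
        mapped = MAPPED.search(entry["msg"])
        if mapping:
            events.append((entry["stamp"], pid, "mapping", mapping["name"]))
        elif mapped:
            detail = mapped["src"] + "->" + mapped["dst"]
            events.append((entry["stamp"], pid, "mapped", detail))
    return events
```

Running this over the 'trace' files of killed and healthy jobs would show how often a TSTP-to-KILL mapping appears and at what time, which can then be compared with the master's 'messages' file for the same timestamps.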

The grid (master, shadow and exec hosts) was completely restarted a few days ago.

Thanks for reading and if you have any clue, thanks for sharing.

Regards,

Paul.
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
