Hello,
 
I'm using the latest SGE version (8.1.9) and submit roughly 8k jobs per hour. Almost all of them run fine, but a few times a week (less than once a day) a job 'disappears'.
The only relevant information I can find is in the "trace" file, which always shows "wait3 returned -1" right after the "execvlp" line that names the script being run.
The job log itself contains nothing useful: the output simply stops, with no error message.
 
Here is the end of the "trace" file of a job with such a problem:
 
03/06/2018 04:20:01 [300:27675]: execvlp(/bin/ksh, "-ksh" "/gridware/sge/name_of_the_grid/spool/compute_that_launches_the_job/job_scripts/5451271")
03/06/2018 04:20:39 [300:27642]: wait3 returned -1
03/06/2018 04:20:39 [300:27642]: forward_signal_to_job(): mapping signal 20 TSTP
03/06/2018 04:20:39 [300:27642]: mapped signal TSTP to signal KILL
03/06/2018 04:20:39 [300:27642]: queued signal KILL
03/06/2018 04:20:39 [300:27642]: kill(-27675, KILL)
03/06/2018 04:20:39 [300:27642]: now sending signal KILL to pid -27675
03/06/2018 04:20:39 [300:27642]: pdc_kill_addgrpid: 20084 9
03/06/2018 04:20:39 [0:27642]: killing pid 27675/4
03/06/2018 04:20:39 [0:27642]: killing pid 27849/4
03/06/2018 04:20:39 [300:27642]: wait3 returned 27675 (status: 9; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
03/06/2018 04:20:39 [300:27642]: job exited with exit status 0
03/06/2018 04:20:39 [300:27642]: reaped "job" with pid 27675
03/06/2018 04:20:39 [300:27642]: job exited due to signal
03/06/2018 04:20:39 [300:27642]: job signaled: 9
03/06/2018 04:20:39 [300:27642]: ignored signal KILL to pid -27675
03/06/2018 04:20:39 [300:27642]: writing usage file to "usage"
03/06/2018 04:20:39 [300:27642]: no epilog script to start
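
For reference, the "status: 9" reported on the second "wait3 returned" line can be decoded with the standard POSIX wait-status macros. The following is only a minimal sketch using Python's os module on a Unix host; the helper name describe_wait_status is made up for illustration and is not part of SGE.

```python
import os
import signal


def describe_wait_status(status):
    """Turn a raw wait status (as printed in the trace) into readable text.

    Hypothetical helper for illustration; relies on the POSIX macros
    exposed by Python's os module, so it only works on Unix.
    """
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        return "killed by signal %d (%s)" % (signum, signal.Signals(signum).name)
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "other status %d" % status


# Status value taken from the trace above.
print(describe_wait_status(9))
```

This agrees with the trace: the job did not exit on its own, it was terminated by KILL.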
 
What is strange is that PID 27642 appears to have received signal 20 (TSTP according to the log), which it then mapped to KILL and sent to the job's process group (-27675).
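
Since this happens only a few times a week among many jobs, a small script can help check whether every affected trace shows the same signal. The following is a rough sketch, assuming the trace lines always have the "date time [number:pid]: message" layout seen above; the function names are hypothetical, and reading the first bracketed number as a uid is my assumption.

```python
import re

# Layout assumed from the trace excerpt, e.g.:
# 03/06/2018 04:20:39 [300:27642]: wait3 returned -1
TRACE_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<uid>\d+):(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def parse_trace_line(line):
    """Split one trace line into its fields, or return None if it does not match."""
    m = TRACE_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "ts": m.group("ts"),
        "uid": int(m.group("uid")),  # assumed to be a uid; not confirmed
        "pid": int(m.group("pid")),
        "msg": m.group("msg"),
    }


def signals_after_failed_wait(lines):
    """Collect the first 'mapping signal' message after each 'wait3 returned -1'."""
    hits = []
    failed_wait = False
    for line in lines:
        rec = parse_trace_line(line)
        if rec is None:
            continue
        if rec["msg"] == "wait3 returned -1":
            failed_wait = True
        elif failed_wait and "mapping signal" in rec["msg"]:
            hits.append((rec["ts"], rec["pid"], rec["msg"]))
            failed_wait = False
    return hits
```

Feeding it the lines of each suspicious "trace" file (for example via open(path).readlines()) gives a list of (timestamp, pid, message) tuples to compare across jobs.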
 
Has anyone seen this behavior before?
 
Thanks in advance for any help.
 
Regards,
 
Paul.