Greetings,

I am running SGE 8.1.9 on a CentOS 6.9 cluster with about 10k cores. I am seeing job failures on nodes where the node's sge_execd dies unexpectedly. I ran strace on the affected nodes' sge_execd, but it is not much help. The trace always ends with

    +++ killed by SIGKILL +++

and I cannot tell what sent the signal. dmesg shows no segfaults or memory issues. The sge_qmaster
on the head node is never affected and runs just fine. The problem is limited to sge_execd on the client nodes, and only about 20% of the nodes are affected; the other 80% are fine.

Here are some of my SGE settings:

    qmaster_params      MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
    max_aj_instances    2000

I also tried adjusting the VM settings:

    /sbin/sysctl vm.overcommit_ratio=100

but it has not helped; sge_execd keeps dying.

Any help on how I can track down what is killing sge_execd on these nodes?

Joseph