Greetings.
I am running SGE 8.1.9 on a CentOS 6.9 cluster with roughly 10k cores.
Jobs are failing on nodes where the node's sge_execd unexpectedly dies.
I ran strace on a node's sge_execd, but it was not much help. The
trace always ends with:
+++ killed by SIGKILL +++
But I cannot tell what killed it.
dmesg shows no segfaults or memory issues. The sge_qmaster on the head
node is never affected and runs fine. The problem is limited to
sge_execd on the execution nodes, and only about 20% of the nodes are
affected; the other 80% are fine.
Here are some SGE settings:
qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
                             H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
                             S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
                             H_MAXPROC=infinity,S_LOCKS=infinity, \
                             H_LOCKS=infinity,USE_SMAPS=yes,ENABLE_BINDING=TRUE
max_aj_instances             2000
max_aj_tasks                 0
max_u_jobs                   900000
max_jobs                     900000
max_advance_reservations     300
I also tried changing the vm settings to:
/sbin/sysctl vm.overcommit_ratio=100
/sbin/sysctl vm.overcommit_memory=2
But that has not helped much; sge_execd keeps dying.
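To check how close a node is to the commit limit that applies when
vm.overcommit_memory=2, something like the following could be used. It
is a minimal sketch, not part of my actual setup: the function name
commit_headroom_kb is made up for illustration, and it assumes the
kernel reports CommitLimit and Committed_AS (in kB) in /proc/meminfo.

```shell
#!/bin/sh
# Hypothetical helper: print the commit headroom in kB, i.e.
# CommitLimit minus Committed_AS. With vm.overcommit_memory=2 the
# kernel refuses new allocations once Committed_AS reaches CommitLimit.
# Takes an optional meminfo-format file; defaults to /proc/meminfo.
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END { print limit - used }' "${1:-/proc/meminfo}"
}
```

Calling commit_headroom_kb with no argument on an affected node reads
the live /proc/meminfo; a small or negative value would suggest the
node is running up against the strict overcommit limit.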
Any suggestions on how I can track down what is killing sge_execd on
the execution nodes?
Joseph