Greetings.
I am running SGE 8.1.9 on a cluster of roughly 10k cores under CentOS 6.9.
I am seeing job failures on nodes where the node's sge_execd unexpectedly
dies.

I ran strace on the affected node's sge_execd, and it is not much help.
The trace always ends with

+++ killed by SIGKILL +++

but I cannot tell what killed it. dmesg shows no segfaults or memory
issues. The sge_qmaster on the head node is never affected and runs fine.
The problem is confined to sge_execd on the execution nodes, and only
about 20% of the nodes are affected; the other 80% are fine.
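
In case the dmesg ring buffer had already wrapped by the time I looked, the syslog files on a node can be checked as well. This is only a minimal sketch of such a check: the helper name `oom_hits` is hypothetical, the log path in the usage comment assumes a stock CentOS layout, and the matched phrases are assumptions about how the kernel words its OOM-killer messages, so the patterns may need adjusting.

```shell
#!/bin/sh
# oom_hits: hypothetical helper, not part of SGE.
# Prints lines from the given log files that look like OOM-killer activity
# and also mention sge_execd. The phrases matched are assumed wording.
oom_hits() {
    grep -hE 'oom-killer|Out of memory|Killed process' "$@" | grep 'sge_execd'
}

# Example on a node (log path is an assumption):
# oom_hits /var/log/messages*
```

No output means none of the assumed phrases appeared next to sge_execd in those files. It does not rule out a SIGKILL sent by something other than the kernel.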
Here are some sge settings:
qmaster_params               MONITOR_TIME=0:1:00
                             LOG_Monitor_Message=0
execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
                             H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
                             S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
                             H_MAXPROC=infinity,S_LOCKS=infinity, \
                             H_LOCKS=infinity,
                             USE_SMAPS=yes,ENABLE_BINDING=TRUE
max_aj_instances             2000
max_aj_tasks                 0
max_u_jobs                   900000
max_jobs                     900000
max_advance_reservations     300

I also tried changing the VM overcommit settings:

/sbin/sysctl vm.overcommit_ratio=100
/sbin/sysctl vm.overcommit_memory=2

That has not helped either; sge_execd keeps dying.
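
For context, with vm.overcommit_memory=2 the kernel refuses new allocations once Committed_AS would exceed CommitLimit, and both values are reported in /proc/meminfo. The sketch below prints the remaining headroom on a node. `commit_headroom_kb` is a hypothetical helper name, and the optional file argument is there only so the function can be run against a saved copy of meminfo.

```shell
#!/bin/sh
# commit_headroom_kb: hypothetical helper, not an SGE or kernel tool.
# Prints CommitLimit minus Committed_AS, in kB, from a meminfo-style file
# (defaults to /proc/meminfo).
commit_headroom_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used  = $2 }
         END { printf "%.0f\n", limit - used }' "$meminfo"
}

# Example on a node: show the active settings, then the headroom.
# /sbin/sysctl vm.overcommit_memory vm.overcommit_ratio
# commit_headroom_kb
```

A value near zero or negative would suggest that strict overcommit accounting is already refusing allocations on that node.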

Any suggestions on how I can track down what is killing sge_execd on
these nodes?

Joseph