[gridengine users] sge_execd dies

Joseph Farran Thu, 08 Nov 2018 19:34:51 -0800

Greetings.

I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.

I am seeing job failures on nodes where the node's sge_execd unexpectedly dies.

I ran strace on the nodes' sge_execd, but it is not much help. It always ends with:

+++ killed by SIGKILL +++

But I cannot tell what killed it. Dmesg shows nothing about a segfault or memory issues. The sge_qmaster on the head node is never affected and runs just fine. The issue is on the client's sge_execd, and only some 20% of the nodes are affected; the other 80% are not.

Here are some sge settings:

qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
                             H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
                             S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
                             H_MAXPROC=infinity,S_LOCKS=infinity, \
                             H_LOCKS=infinity, USE_SMAPS=yes,ENABLE_BINDING=TRUE

max_aj_instances             2000
max_aj_tasks                 0
max_u_jobs                   900000
max_jobs                     900000
max_advance_reservations     300

I also tried changing the vm settings to:

/sbin/sysctl vm.overcommit_ratio=100
/sbin/sysctl vm.overcommit_memory=2
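
For context, here is a rough sketch of what those two settings imply on a node. With vm.overcommit_memory=2 the kernel enforces a commit limit of roughly swap plus overcommit_ratio percent of RAM. The helper name commit_limit_kb is made up for illustration, and huge pages are ignored, so treat the estimate as approximate:

```shell
#!/bin/sh
# Approximate commit limit in kB under vm.overcommit_memory=2:
#   CommitLimit ~= SwapTotal + MemTotal * overcommit_ratio / 100
# Hypothetical helper; huge pages are ignored for simplicity.
commit_limit_kb() {
    mem_kb=$1
    swap_kb=$2
    ratio=$3
    echo $(( swap_kb + mem_kb * ratio / 100 ))
}

# Compare the estimate with what the kernel itself reports on this node.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/overcommit_ratio ]; then
    mem=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    swap=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    ratio=$(cat /proc/sys/vm/overcommit_ratio)
    echo "estimated limit: $(commit_limit_kb "$mem" "$swap" "$ratio") kB"
    grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
fi
```

If Committed_AS sits close to CommitLimit on an affected node, new allocations there will start failing with ENOMEM, which may be worth comparing against the unaffected nodes.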

But it has not been of much help - sge_execd keeps dying.

Any help on how I can track down what is causing the node client sge_execd to die?

Joseph

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
