Run

    qconf -sconf | grep gid_range

and check whether any of your users have group IDs in that range. An overlap can lead to processes being killed.

Dan
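The check suggested above could be scripted roughly as follows. This is only a sketch: `in_gid_range` is a hypothetical helper (not part of SGE), and it assumes gid_range is a single "low-high" range and that your groups are visible through `getent group`. Users whose primary GID has no group entry would need a similar pass over `getent passwd` (field 4).

```shell
# Sketch: print groups whose GID falls inside a given range.
# in_gid_range is a hypothetical helper name, not an SGE command.
in_gid_range() {
    # $1    = range in the form "low-high", e.g. "20000-20100"
    # stdin = lines in /etc/group format (name:passwd:gid:members)
    lo=${1%-*}
    hi=${1#*-}
    # Force numeric comparison on the third field (the GID).
    awk -F: -v lo="$lo" -v hi="$hi" \
        '($3 + 0) >= (lo + 0) && ($3 + 0) <= (hi + 0) { print $1 ":" $3 }'
}

# Usage on an exec host (assumes gid_range holds one "low-high" range):
#   range=$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')
#   getent group | in_gid_range "$range"
```

Any line printed is a group whose ID collides with the range SGE draws its per-job additional group IDs from, which would be worth fixing before looking further.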
On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran <jfar...@uci.edu> wrote:
> Greetings.
>
> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>
> I am seeing job failures on nodes where the node's sge_execd unexpectedly
> dies.
>
> I ran strace on the node's sge_execd and it's not of much help. It
> always ends with
>
> +++ killed by SIGKILL +++
>
> But I cannot tell what killed it. dmesg shows no segfaults or
> memory issues. The sge_qmaster on the head node is never affected and it
> runs just fine. The issue is with sge_execd on the client nodes, and 80% of
> nodes are not affected, only some 20% of the nodes.
>
> Here are some sge settings:
>
> qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
> execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>                              H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
>                              S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>                              H_MAXPROC=infinity,S_LOCKS=infinity, \
>                              H_LOCKS=infinity,
>                              USE_SMAPS=yes,ENABLE_BINDING=TRUE
>
> max_aj_instances             2000
> max_aj_tasks                 0
> max_u_jobs                   900000
> max_jobs                     900000
> max_advance_reservations     300
>
> I also tried changing the vm settings to:
>
> /sbin/sysctl vm.overcommit_ratio=100
> /sbin/sysctl vm.overcommit_memory=2
>
> But it has not been of much help - sge_execd keeps dying.
>
> Any help on how I can track down what is causing the node client sge_execd
> to die?
>
> Joseph
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>