OK, well, there's your problem. Your highest gid, 3135, falls inside 200-700000. You need to raise the start of gid_range to a value larger than your largest possible 'real' group ID: for instance, 10000. The name is a little confusing. It needs to be a range that's disjoint from the group IDs your users actually have.
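To check for overlap on a node, something like the following might help. It is only a sketch: it assumes gid_range is a single m-n range (not a comma-separated list), that `getent group` sees all of your real groups, and the function name `gids_in_range` is made up for illustration.

```shell
#!/bin/sh
# Sketch: report real group IDs that fall inside an SGE gid_range.
# Assumption: gid_range is one "low-high" range, not a comma-separated list.

# gids_in_range LOW HIGH   (hypothetical helper, not part of SGE)
# Reads group-database lines (name:passwd:gid:members) on stdin and prints
# "name gid" for every group whose gid lies within LOW..HIGH inclusive.
gids_in_range() {
    awk -F: -v low="$1" -v high="$2" \
        '$3 ~ /^[0-9]+$/ && $3 + 0 >= low + 0 && $3 + 0 <= high + 0 { print $1, $3 }'
}

# Only run the live check on a host that has the SGE tools installed.
if command -v qconf >/dev/null 2>&1; then
    range=$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')
    low=${range%%-*}     # text before the dash
    high=${range##*-}    # text after the dash
    getent group | gids_in_range "$low" "$high"
fi
```

Any group it prints is a real group whose ID collides with the range. After raising the start of gid_range (for example with `qconf -mconf`), the same check should print nothing.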

On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran <jfar...@uci.edu> wrote:

> Hi Dan.
>
> Thank you for the suggestion. Here is what I have:
>
> # qconf -sconf | grep gid_range
> gid_range                    200-700000
>
> The highest gid is 3135.
> Best,
> Joseph
>
> On 11/8/2018 8:58 PM, Daniel Povey wrote:
>
> Do
> qconf -sconf | grep gid_range
> and check whether any of your users have group id's in that range. That
> can lead to things being killed.
> Dan
>
>
> On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran <jfar...@uci.edu> wrote:
>
>> Greetings.
>>
>> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>>
>> I am seeing job failures on nodes where the node's sge_execd
>> unexpectedly dies.
>>
>> I ran strace on the nodes sge_execd and it's not of much help. It
>> always end with
>>
>> +++ killed by SIGKILL +++
>>
>> But I cannot tell what killed it. Dmesg has nothing of segfault nor
>> memory issues. The sge_qmaster on the head node is never affected and
>> it runs just fine. The issue is on the client's sge_execd and 80% of nodes
>> are not affected, only some 20% of the nodes.
>>
>> Here are some sge settings:
>>
>> qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
>> execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>>                              H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
>>                              S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>>                              H_MAXPROC=infinity,S_LOCKS=infinity, \
>>                              H_LOCKS=infinity,USE_SMAPS=yes,ENABLE_BINDING=TRUE
>>
>> max_aj_instances             2000
>> max_aj_tasks                 0
>> max_u_jobs                   900000
>> max_jobs                     900000
>> max_advance_reservations     300
>>
>> I also tried playing with vm settings to:
>>
>> /sbin/sysctl vm.overcommit_ratio=100
>> /sbin/sysctl vm.overcommit_memory=2
>>
>> But it has not been of much help - sge_execd keeps dying.
>>
>> Any help on how I can track down what is causing the node client
>> sge_execd to die?
>>
>> Joseph
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>