Re: [gridengine users] sge_execd dies

Joseph Farran Fri, 09 Nov 2018 20:59:20 -0800

Yep - that's exactly what it was. Not a single sge_execd crash since the change.

Thank you! I owe you a box of beer!

Joseph

On 11/8/2018 9:17 PM, Daniel Povey wrote:

OK, well there's your problem. You need to increase the start of gid_range to a value larger than your largest possible 'real' userid: for instance, 10000.
The name is a little confusing. It needs to be a range that's disjoint from the range of possible userids.
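A minimal sketch of that rule, assuming the IDs already in use have been collected beforehand (for example from `getent passwd` and `getent group`). The function name `suggest_range_start` and the `step` parameter are made up for illustration and are not part of SGE:

```python
def suggest_range_start(ids_in_use, step=10000):
    """Return the smallest multiple of `step` that is strictly greater
    than every ID in `ids_in_use` (hypothetical helper, not an SGE tool)."""
    highest = max(ids_in_use, default=0)
    return (highest // step + 1) * step


# With the highest gid reported in this thread (3135), the suggested
# start matches the "for instance, 10000" above.
start = suggest_range_start([0, 100, 3135])
print(f"gid_range could start at {start}")
```

The new value would then go into the cluster configuration (normally edited with `qconf -mconf`), keeping the upper end of the range as it is.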

On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran <jfar...@uci.edu> wrote:

Hi Dan.

Thank you for the suggestion.   Here is what I have:

# qconf -sconf | grep gid_range
gid_range                    200-700000

The highest gid is 3135.
Best,
Joseph

On 11/8/2018 8:58 PM, Daniel Povey wrote:

Do
qconf -sconf | grep gid_range

and check whether any of your users have group IDs in that range. That can lead to things being killed.
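One possible way to do that check, as a rough sketch: it assumes the group database has been dumped as text in the usual `name:password:gid:members` format (what `getent group` prints). The helper names and the sample groups below are hypothetical.

```python
def parse_gid_range(spec):
    """Parse a 'low-high' string such as '200-700000' into two ints."""
    low, high = spec.strip().split("-")
    return int(low), int(high)


def parse_group_lines(text):
    """Turn 'name:password:gid:members' lines into (name, gid) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            pairs.append((fields[0], int(fields[2])))
    return pairs


def groups_in_range(groups, low, high):
    """Return the (name, gid) pairs whose gid lies inside [low, high]."""
    return sorted((name, gid) for name, gid in groups if low <= gid <= high)


# Hypothetical sample data; on a real node this would be the output of
# `getent group`.
sample = "root:x:0:\nusers:x:100:\nlab:x:3135:alice\n"
low, high = parse_gid_range("200-700000")
for name, gid in groups_in_range(parse_group_lines(sample), low, high):
    print(f"{name} (gid {gid}) falls inside gid_range {low}-{high}")
```

Any group this prints is a real group whose ID sits inside the range SGE hands out to jobs.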
Dan

On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran <jfar...@uci.edu> wrote:

Greetings.

I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.

I am seeing job failures on nodes where the node's sge_execd unexpectedly dies.

I ran strace on the nodes' sge_execd and it's not of much help. It always ends with

    +++ killed by SIGKILL +++

But I cannot tell what killed it. dmesg shows nothing about segfaults or memory issues. The sge_qmaster on the head node is never affected and it runs just fine. The issue is with sge_execd on the client nodes, and 80% of the nodes are not affected, only some 20% of them.

Here are some sge settings:

qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
                             H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
                             S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
                             H_MAXPROC=infinity,S_LOCKS=infinity, \
                             H_LOCKS=infinity, USE_SMAPS=yes,ENABLE_BINDING=TRUE

max_aj_instances             2000
max_aj_tasks                 0
max_u_jobs                   900000
max_jobs                     900000
max_advance_reservations     300

I also tried playing with vm settings to:

    /sbin/sysctl vm.overcommit_ratio=100
    /sbin/sysctl vm.overcommit_memory=2

But it has not been of much help - sge_execd keeps dying.

Any help on how I can track down what is causing the node client sge_execd to die?

Joseph

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
