Re: [gridengine users] sge_execd dies
Yep, that's exactly what it was. Not a single sge_execd crash since the change.

Thank you! I owe you a box of beer!

Joseph

On 11/8/2018 9:17 PM, Daniel Povey wrote:
> OK, well there's your problem. You need to increase the start of gid_range to a value larger than your largest possible 'real' userid: for instance, 1.
> The name is a little confusing. It needs to be a range that's disjoint from the range of possible userids.

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] sge_execd dies
Oh, for goodness' sake. Are you saying that the gid_range in SGE is a range of GIDs I do NOT use on the cluster? Holy mackerel, that is very misleading.

Thank you! Giving it a try.

Best,
Joseph

On 11/8/2018 9:17 PM, Daniel Povey wrote:
> OK, well there's your problem. You need to increase the start of gid_range to a value larger than your largest possible 'real' userid: for instance, 1.
> The name is a little confusing. It needs to be a range that's disjoint from the range of possible userids.
Re: [gridengine users] sge_execd dies
OK, well there's your problem. You need to increase the start of gid_range to a value larger than your largest possible 'real' userid: for instance, 1.
The name is a little confusing. It needs to be a range that's disjoint from the range of possible userids.

On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran wrote:
> Hi Dan.
>
> Thank you for the suggestion. Here is what I have:
>
> # qconf -sconf | grep gid_range
> gid_range 200-70
>
> The highest gid is 3135.
>
> Best,
> Joseph
Re: [gridengine users] sge_execd dies
Hi Dan.

Thank you for the suggestion. Here is what I have:

# qconf -sconf | grep gid_range
gid_range 200-70

The highest gid is 3135.

Best,
Joseph

On 11/8/2018 8:58 PM, Daniel Povey wrote:
> Do
> qconf -sconf | grep gid_range
> and check whether any of your users have group id's in that range. That can lead to things being killed.
> Dan
Re: [gridengine users] sge_execd dies
Do
qconf -sconf | grep gid_range
and check whether any of your users have group IDs in that range. That can lead to things being killed.

Dan

On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran wrote:
> Greetings.
>
> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>
> I am seeing job failures on nodes where the node's sge_execd unexpectedly dies.
>
> I ran strace on the node's sge_execd and it's not of much help. It always ends with
>
> +++ killed by SIGKILL +++
>
> But I cannot tell what killed it. dmesg shows no segfaults or memory issues. The sge_qmaster on the head node is never affected and it runs just fine. The issue is with sge_execd on the client nodes; about 80% of nodes are not affected, only some 20%.
>
> Here are some SGE settings:
>
> qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
> execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>                              H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
>                              S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>                              H_MAXPROC=infinity,S_LOCKS=infinity, \
>                              H_LOCKS=infinity, USE_SMAPS=yes,ENABLE_BINDING=TRUE
>
> max_aj_instances             2000
> max_aj_tasks                 0
> max_u_jobs                   90
> max_jobs                     90
> max_advance_reservations     300
>
> I also tried changing the vm settings:
>
> /sbin/sysctl vm.overcommit_ratio=100
> /sbin/sysctl vm.overcommit_memory=2
>
> But it has not been of much help: sge_execd keeps dying.
>
> Any help on how I can track down what is causing the node client sge_execd to die?
>
> Joseph
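The check suggested above can be sketched in Python. This is only an illustration under stated assumptions, not a Grid Engine tool: the function names are made up for this example, the range "20000-20100" is a hypothetical value that you would replace with whatever `qconf -sconf | grep gid_range` reports, and the script only sees the users and groups that the host it runs on can enumerate. As background, sge_conf(5) describes gid_range as the pool of additional group IDs that Grid Engine attaches to running jobs to track their processes, which is why it should not contain any group ID that real users already have.

```python
import grp
import pwd


def parse_gid_range(text):
    """Parse a 'START-END' string such as '20000-20100' into (start, end)."""
    start, end = (int(part) for part in text.strip().split("-", 1))
    if start > end:
        raise ValueError(f"range start {start} is greater than end {end}")
    return start, end


def gids_in_range(gids, gid_range):
    """Return the sorted, de-duplicated group IDs inside the inclusive range."""
    start, end = gid_range
    return sorted(g for g in set(gids) if start <= g <= end)


def host_gids():
    """Collect the group IDs this host can enumerate: every group entry
    plus every user's primary group."""
    gids = {entry.gr_gid for entry in grp.getgrall()}
    gids.update(entry.pw_gid for entry in pwd.getpwall())
    return gids


if __name__ == "__main__":
    # Hypothetical value; use the one reported by qconf on your cluster.
    configured = parse_gid_range("20000-20100")
    clashes = gids_in_range(host_gids(), configured)
    if clashes:
        print("real group IDs inside gid_range:", clashes)
    else:
        print("gid_range is disjoint from the group IDs visible on this host")
```

Run it on a node that sees the same user database as the execution hosts; if directory services do not allow enumeration, some groups may be missed and the list of GIDs would have to come from another source.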
[gridengine users] sge_execd dies
Greetings.

I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.

I am seeing job failures on nodes where the node's sge_execd unexpectedly dies.

I ran strace on the node's sge_execd and it's not of much help. It always ends with

+++ killed by SIGKILL +++

But I cannot tell what killed it. dmesg shows no segfaults or memory issues. The sge_qmaster on the head node is never affected and it runs just fine. The issue is with sge_execd on the client nodes; about 80% of nodes are not affected, only some 20%.

Here are some SGE settings:

qmaster_params               MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
                             H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
                             S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
                             H_MAXPROC=infinity,S_LOCKS=infinity, \
                             H_LOCKS=infinity, USE_SMAPS=yes,ENABLE_BINDING=TRUE

max_aj_instances             2000
max_aj_tasks                 0
max_u_jobs                   90
max_jobs                     90
max_advance_reservations     300

I also tried changing the vm settings:

/sbin/sysctl vm.overcommit_ratio=100
/sbin/sysctl vm.overcommit_memory=2

But it has not been of much help: sge_execd keeps dying.

Any help on how I can track down what is causing the node client sge_execd to die?

Joseph
Re: [gridengine users] sge_execd dies silently with 0 exit status
Reuti re...@staff.uni-marburg.de writes:

> Please have a look at your /tmp. The starting execd will write the cause of not being able to start in a file therein.

For what it's worth, that depends on the version. sge-8.0.0e+ writes to syslog, as you'd expect a daemon to. (The previous behaviour was also insecure.) Depending on how far it gets on starting, there may also be something in the messages file.

--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
[gridengine users] sge_execd dies silently with 0 exit status
I'm having a heck of a time figuring out why.

On rhel6, the /etc/init.d/sgeexecd.myclustername script is run at startup, or via sudo after startup:

sudo /etc/init.d/sgeexecd.myclustername start

It just says OK with no other output, yet the daemon isn't running.

I added the -x option ('#!/bin/sh -x') so I can debug it … I see it gets up to the exec 1>/dev/null 2>&1, which effectively eliminates any further debug output … So I comment out that line and run again. Now I can see it launches sge_execd, and the exit status is 0, so the touch on the following line does indeed create the lock file. The qping loop immediately after that in the script … exits with 0 status on the first try. And still, there is no process running at the end of that script.

I modified the startup script to perform the qping 5 times unconditionally. The first time it has exit value 0, and all subsequent times it has exit value 1. This means it is indeed running for a very short period of time, but then it dies in less than a second.

Any ideas what the problem is? This is a machine on which we recently reinstalled the OS, and we're reinstalling sgeexecd by the same process by which it was previously installed.
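The "probe five times unconditionally" experiment described above can be reproduced without editing the init script. The following is a minimal sketch, assuming the probe is passed in as a command line (for example the qping invocation copied from your own startup script; it is deliberately not hard-coded here). The function names `poll_exit_codes` and `died_after_start` are made up for this example.

```python
import subprocess
import sys
import time


def poll_exit_codes(command, attempts=5, delay=1.0):
    """Run `command` `attempts` times, `delay` seconds apart,
    and return the list of exit codes."""
    codes = []
    for attempt in range(attempts):
        if attempt:
            time.sleep(delay)
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes.append(result.returncode)
    return codes


def died_after_start(codes):
    """True if the first probe succeeded and any later probe failed."""
    return bool(codes) and codes[0] == 0 and any(c != 0 for c in codes[1:])


if __name__ == "__main__":
    probe = sys.argv[1:]
    if not probe:
        sys.exit("usage: poll.py PROBE_COMMAND [ARGS...]")
    codes = poll_exit_codes(probe)
    print("exit codes:", codes)
    if died_after_start(codes):
        print("the daemon answered once and then went away")
```

A pattern of one success followed by failures matches the behaviour reported in this message: the daemon starts, answers the first probe, and exits shortly afterwards.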
Re: [gridengine users] sge_execd dies silently with 0 exit status
On Sep 12, 2013, at 12:12 PM, Reuti re...@staff.uni-marburg.de wrote:
> Hi,
>
> Please have a look at your /tmp. The starting execd will write the cause of not being able to start in a file therein.

Nailed it. Thank you.

can't create directory /var/spool/sge

Pretty self-explanatory now.
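For a freshly reinstalled machine, a quick pre-flight check of the spool directory can catch this before the daemon is started. The sketch below is an assumption-laden illustration rather than anything shipped with Grid Engine: the helper name `spool_dir_problem` is invented, and the path /var/spool/sge is simply the one from the error message above.

```python
import os


def spool_dir_problem(path):
    """Return a short description of why `path` is unusable as a spool
    directory, or None if it exists and is writable by the current user."""
    if os.path.isdir(path):
        if os.access(path, os.W_OK | os.X_OK):
            return None
        return f"{path} exists but is not writable by the current user"
    if os.path.exists(path):
        return f"{path} exists but is not a directory"
    # Walk up to the nearest existing ancestor; that is where mkdir would fail.
    parent = os.path.dirname(os.path.abspath(path))
    while not os.path.exists(parent):
        parent = os.path.dirname(parent)
    if not os.path.isdir(parent):
        return f"{parent} is not a directory, so {path} cannot be created"
    if not os.access(parent, os.W_OK | os.X_OK):
        return f"{path} is missing and {parent} is not writable"
    return f"{path} is missing but could be created under {parent}"


if __name__ == "__main__":
    # Path taken from the error message in this thread.
    problem = spool_dir_problem("/var/spool/sge")
    print(problem or "spool directory looks usable")
```

The result depends on who runs it, so it is only meaningful when executed as the account that sge_execd uses to write its spool files.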
Re: [gridengine users] sge_execd dies silently with 0 exit status
Hi,

On 12.09.2013 at 17:50, Edward Ned Harvey wrote:
> I'm having a heck of a time figuring out why.
>
> On rhel6, /etc/init.d/sgeexecd.myclustername script is run at startup, or via sudo after startup.
>
> sudo /etc/init.d/sgeexecd.myclustername start
>
> It just says OK and no other output, yet the daemon isn't running.
>
> I added the -x option ('#!/bin/sh -x') so I can debug it … I see it gets up to the exec 1>/dev/null 2>&1, which effectively eliminates any further debug output … So I comment out that line and run again. Now I can see it launches sge_execd, and the exit status is 0, so the touch on the following line does indeed create the lock file. The qping loop immediately after that in the script … exits with 0 status, on the first try. And still, there is no process running at the end of that script.
>
> I modify the startup script to perform the qping 5 times unconditionally. I see the first time, it has exit value 0, and all subsequent times, it has exit value 1. This means it is indeed running for a very short period of time, but then it dies in less than a second.
>
> Any ideas what the problem is?

Please have a look at your /tmp. The starting execd will write the cause of not being able to start in a file therein.

-- Reuti

> This is a machine that we recently reinstalled the OS, and we're reinstalling sgeexecd by the same process it was previously installed.