Re: [gridengine users] sge_execd dies

2018-11-09 Thread Joseph Farran

  
  
Yeap - that's exactly what it was.  Not a single sge_execd crash since the
change.

Thank you!  I owe you a box of beer!

Joseph

On 11/8/2018 9:17 PM, Daniel Povey wrote:

> OK, well there's your problem.  You need to increase the start of gid_range
> to a value larger than your largest possible 'real' userid: for instance,
> 1.
> The name is a little confusing.  It needs to be a range that's disjoint
> from the range of possible userids.


  
  
  
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] sge_execd dies

2018-11-08 Thread Joseph Farran

  
  
Oh for goodness sake.

Are you saying that the gid_range in sge is a range of gids I DO NOT use on
the cluster?

Holy mackerel - that is very misleading.

Thank you!  Giving it a try.

Best,
Joseph

On 11/8/2018 9:17 PM, Daniel Povey wrote:

> OK, well there's your problem.  You need to increase the start of gid_range
> to a value larger than your largest possible 'real' userid: for instance,
> 1.
> The name is a little confusing.  It needs to be a range that's disjoint
> from the range of possible userids.


  
  
  


Re: [gridengine users] sge_execd dies

2018-11-08 Thread Daniel Povey
OK, well there's your problem.  You need to increase the start of gid_range
to a value larger than your largest possible 'real' userid: for instance,
1.
The name is a little confusing.  It needs to be a range that's disjoint
from the range of possible userids.
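
A minimal sketch of how one might find the floor for a safe gid_range start, assuming the group database can be dumped in group(5) format (for example with getent group). The helper name max_gid is hypothetical and not part of SGE.

```shell
#!/bin/sh
# Print the largest numeric GID found on stdin.
# Input lines are expected in group(5) format: name:passwd:gid:members
max_gid() {
    awk -F: '$3 + 0 > max { max = $3 + 0 } END { print max + 0 }'
}

# Example use on a cluster node (not run here):
#   getent group | max_gid
```

Any gid_range should then start above the number this prints. Some systems define a high-numbered placeholder group (such as nobody), so the result may need a manual sanity check before it is used as a floor.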


On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran  wrote:

> Hi Dan.
>
> Thank you for the suggestion.   Here is what I have:
>
> # qconf -sconf | grep gid_range
> gid_range    200-70
>
> The highest gid is 3135.
> Best,
> Joseph


Re: [gridengine users] sge_execd dies

2018-11-08 Thread Joseph Farran

  
  
Hi Dan.

Thank you for the suggestion.  Here is what I have:

# qconf -sconf | grep gid_range
gid_range    200-70

The highest gid is 3135.

Best,
Joseph

On 11/8/2018 8:58 PM, Daniel Povey wrote:

> Do
> qconf -sconf | grep gid_range
> and check whether any of your users have group id's in that range.  That
> can lead to things being killed.
> Dan
  
  

  
  
  


Re: [gridengine users] sge_execd dies

2018-11-08 Thread Daniel Povey
Do
qconf -sconf | grep gid_range
and check whether any of your users have group id's in that range.  That
can lead to things being killed.
Dan
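
The check described above could be scripted roughly as follows. This is only a sketch: it assumes gid_range holds a single start-end pair (the setting may also be a comma-separated list of ranges, which this does not handle), and the helper name gids_in_range is made up for illustration.

```shell
#!/bin/sh
# Print "name:gid" for every group whose GID lies inside START-END.
# Usage: gids_in_range START END   (group(5)-format lines on stdin)
gids_in_range() {
    awk -F: -v lo="$1" -v hi="$2" \
        '$3 + 0 >= lo + 0 && $3 + 0 <= hi + 0 { print $1 ":" $3 }'
}

# Example use on an SGE host (not run here):
#   range=$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')
#   getent group | gids_in_range "${range%-*}" "${range#*-}"
```

If this prints anything, at least one real group overlaps the range and the range should be moved.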




[gridengine users] sge_execd dies

2018-11-08 Thread Joseph Farran

  
  
Greetings.

I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.

I am seeing job failures on nodes where the node's sge_execd unexpectedly
dies.

I ran strace on the node's sge_execd and it's not of much help.  It always
ends with

    +++ killed by SIGKILL +++

But I cannot tell what killed it.  Dmesg shows no segfaults or memory
issues.  The sge_qmaster on the head node is never affected and runs just
fine.  The issue is with the clients' sge_execd: 80% of nodes are not
affected, only some 20% of the nodes.

Here are some sge settings:

qmaster_params   MONITOR_TIME=0:1:00  LOG_Monitor_Message=0
execd_params     ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
                 H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
                 S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
                 H_MAXPROC=infinity,S_LOCKS=infinity, \
                 H_LOCKS=infinity,USE_SMAPS=yes,ENABLE_BINDING=TRUE

max_aj_instances             2000
max_aj_tasks                 0
max_u_jobs                   90
max_jobs                     90
max_advance_reservations     300

I also tried changing these vm settings:

    /sbin/sysctl vm.overcommit_ratio=100
    /sbin/sysctl vm.overcommit_memory=2

But it has not been of much help - sge_execd keeps dying.

Any help on how I can track down what is causing the node client
sge_execd to die?

Joseph
  
  



Re: [gridengine users] sge_execd dies silently with 0 exit status

2013-09-16 Thread Dave Love
Reuti re...@staff.uni-marburg.de writes:

> Please have a look at your /tmp. The starting execd will write the cause of
> not being able to start in a file therein.

For what it's worth, that depends on the version.  sge-8.0.0e+ writes to
syslog, as you'd expect a daemon to.  (The previous behaviour was also
insecure.)  Depending on how far it gets on starting, there may also be
something in the messages file.
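
A rough sketch of how the /tmp check might be scripted. The file name pattern execd_messages.* is an assumption about how execd names its startup log, and the syslog path in the comment assumes a RHEL-style layout; the helper name is made up for illustration.

```shell
#!/bin/sh
# List execd startup logs in a directory (default /tmp), newest first.
# Assumption: the files are named execd_messages.<pid>.
find_execd_startup_logs() {
    dir=${1:-/tmp}
    ls -t "$dir"/execd_messages.* 2>/dev/null
}

# On versions that log to syslog instead, something like this may help
# (path is an assumption, not run here):
#   grep sge_execd /var/log/messages | tail
```

An empty result only means no such file exists in that directory; the syslog and the execd messages file in the spool directory are the next places to look.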

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/


[gridengine users] sge_execd dies silently with 0 exit status

2013-09-12 Thread Edward Ned Harvey
I'm having a heck of a time figuring out why.

On rhel6, /etc/init.d/sgeexecd.myclustername script is run at startup, or via 
sudo after startup.
sudo /etc/init.d/sgeexecd.myclustername start

It just says OK and no other output, yet the daemon isn't running.

I added the -x option to the shebang line ('#!/bin/sh -x') so I can debug it …
I see it gets up to the 'exec 1>/dev/null 2>&1' line, which effectively
eliminates any further debug output…
So I comment out that line and run again.
Now I can see it launches sge_execd, and the exit status is 0, so the touch 
on the following line does indeed create the lock file.
The qping loop immediately after that in the script … exits with 0 status, on 
the first try.

And still, there is no process running at the end of that script.

I modify the startup script to perform the qping 5 times unconditionally.  I 
see the first time, it has exit value 0, and all subsequent times, it has exit 
value 1. This means it is indeed running for a very short period of time, but 
then it dies in less than a second.
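
The repeated probe described above could look roughly like this. It is a sketch, not the actual change made to the startup script: probe_n_times and PROBE_DELAY are hypothetical names, and the qping invocation in the comment assumes SGE_EXECD_PORT is set in the environment.

```shell
#!/bin/sh
# Run a command N times and print the exit status of every attempt.
# Usage: probe_n_times N command [args...]
# PROBE_DELAY (seconds between attempts) defaults to 1.
probe_n_times() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        "$@" >/dev/null 2>&1
        echo "try $i: exit $?"
        i=$((i + 1))
        [ "$i" -le "$n" ] && sleep "${PROBE_DELAY:-1}"
    done
    return 0
}

# Example use against a local execd (not run here):
#   probe_n_times 5 qping -info "$(hostname)" "$SGE_EXECD_PORT" execd 1
```

A first attempt that exits 0 followed by attempts that exit 1 matches the pattern reported here: the daemon answers once and is gone about a second later.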

Any ideas what the problem is?

This is a machine that we recently reinstalled the OS, and we're reinstalling 
sgeexecd by the same process it was previously installed.


Re: [gridengine users] sge_execd dies silently with 0 exit status

2013-09-12 Thread Edward Ned Harvey

On Sep 12, 2013, at 12:12 PM, Reuti re...@staff.uni-marburg.de wrote:

> Hi,
>
> Please have a look at your /tmp. The starting execd will write the cause of
> not being able to start in a file therein.
 

Nailed it.  Thank you.  
can't create directory /var/spool/sge
Pretty self explanatory now.
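
For anyone hitting the same error after an OS reinstall, a quick pre-flight check along these lines might save a debugging session. It is only a sketch: check_spool_dir is a hypothetical helper, and it tests writability for whichever user runs it, so it would need to be run as the account execd uses.

```shell
#!/bin/sh
# Report whether a spool directory exists and is writable by the caller.
# Returns 0 if usable, 1 if missing, 2 if not writable.
check_spool_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 2
    fi
    echo "ok: $dir"
}

# Example use (not run here):
#   check_spool_dir /var/spool/sge
```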




Re: [gridengine users] sge_execd dies silently with 0 exit status

2013-09-12 Thread Reuti
Hi,

Am 12.09.2013 um 17:50 schrieb Edward Ned Harvey:

> I'm having a heck of a time figuring out why.
>
> On rhel6, /etc/init.d/sgeexecd.myclustername script is run at startup, or
> via sudo after startup.
> sudo /etc/init.d/sgeexecd.myclustername start
>
> It just says OK and no other output, yet the daemon isn't running.
>
> I added the -x option to the shebang line ('#!/bin/sh -x') so I can debug
> it …
> I see it gets up to the 'exec 1>/dev/null 2>&1' line, which effectively
> eliminates any further debug output…
> So I comment out that line and run again.
> Now I can see it launches sge_execd, and the exit status is 0, so the touch
> on the following line does indeed create the lock file.
> The qping loop immediately after that in the script … exits with 0 status,
> on the first try.
>
> And still, there is no process running at the end of that script.
>
> I modify the startup script to perform the qping 5 times unconditionally.  I
> see the first time, it has exit value 0, and all subsequent times, it has
> exit value 1. This means it is indeed running for a very short period of
> time, but then it dies in less than a second.
>
> Any ideas what the problem is?

Please have a look at your /tmp. The starting execd will write the cause of not 
being able to start in a file therein.

-- Reuti


> This is a machine that we recently reinstalled the OS, and we're reinstalling
> sgeexecd by the same process it was previously installed.

