On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:
>    Hello to distinguished forum members,
>
>    We use SoGE 8.1.8.
>
>    We have noticed that our sge_qmaster process fails inconsistently and
>    jumps between shadow and master servers.
> 
>    Issue occurs every 2-5 days.
One possibility that occurs to me is that you might be suffering from a
memory leak that causes the oom_killer to target the qmaster.
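
As a rough sketch of how one might check for this, the snippet below filters
kernel log lines for OOM-killer activity that names the qmaster. The message
patterns are an assumption on my part: they match wording the Linux kernel
commonly uses, but the exact text varies between kernel versions. The function
name find_oom_events is just illustrative.

```python
import re

# Assumed patterns: phrases the Linux kernel commonly logs when the OOM
# killer acts. Exact wording differs between kernel versions.
OOM_PATTERN = re.compile(r"(out of memory|oom-killer|killed process)", re.IGNORECASE)


def find_oom_events(lines, process_name="sge_qmaster"):
    """Return log lines that look like OOM-killer events naming process_name."""
    return [
        line.rstrip("\n")
        for line in lines
        if OOM_PATTERN.search(line) and process_name in line
    ]
```

You could feed it the output of dmesg or whichever file your distribution
writes kernel messages to. If it returns nothing around the failure times,
the OOM killer is probably not the culprit.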

> 
>     
> 
>    We don't understand the root cause and the qmaster messages file does not
>    indicate any issue.

I would suggest increasing the loglevel and also checking whether anything
repeatedly shows up immediately before each failure (the qmaster starting
up again should be fairly obvious).
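
To compare what precedes each restart, something along these lines might
help. It is only a sketch: it collects the few lines before every line that
contains a startup marker. The default marker text "starting up" is my
assumption about what the qmaster writes when it starts, so check your own
messages file and adjust it. The function name is illustrative too.

```python
from collections import deque


def context_before_restarts(lines, marker="starting up", context=5):
    """For each line containing marker, collect the `context` lines before it.

    marker is an assumed substring of the qmaster startup message.
    """
    recent = deque(maxlen=context)  # rolling window of the latest lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            results.append(list(recent))
        recent.append(line)
    return results
```

Run it over the qmaster messages file and look for entries that recur
across the returned groups.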

> 
>    What are the best practices debugging this issue and resolving the problem
>    without interrupting normal operation of sge_qmaster?

There is also the option of running the qmaster with debugging turned up,
but that could easily generate an excessive volume of messages, especially
if you don't know what you are looking for.

William


