On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:
> Hello to distinguished forum members,
>
> We use SoGE 8.1.8.
>
> We have noticed that our sge_qmaster process fails inconsistently and
> jumps between shadow and master servers.
>
> Issue occurs every 2-5 days.

One possibility that occurs to me is that you might be suffering from a
memory leak that causes the oom_killer to target the qmaster.

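
A rough way to check the oom_killer theory is to search the kernel log on
the master and shadow hosts for kill records that name sge_qmaster. The
sketch below assumes the usual kernel wording ("Out of memory: Kill process
<pid> (<name>)", or "Killed process" on newer kernels); the function name is
made up for illustration, and the pattern may need adjusting for your kernel.

```python
import re

# Assumed kernel log wording: older kernels print "Kill process",
# newer ones "Killed process". Adjust if your kernel differs.
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines, process_name="sge_qmaster"):
    """Return (pid, line) pairs for OOM-killer records naming process_name."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits
```

Feed it the lines of `dmesg` output or of the syslog file that receives
kernel messages on your distribution. An empty result does not rule out a
leak; it only means no matching kill record was found in what you searched.
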
> We don't understand the root cause and the qmaster messages file does not
> indicate any issue.

I would suggest increasing the loglevel and also checking whether anything
repeatedly shows up immediately before the failure (the qmaster starting up
again should be fairly obvious).

> What are the best practices debugging this issue and resolving the problem
> without interrupting normal operation of sge_qmaster?

There is also the option of running the qmaster with debugging turned up,
but that could easily generate an excessive volume of messages, especially
if you don't know what you are looking for.

William
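
As a rough illustration of the log check suggested above, the sketch below
collects the few lines that precede each qmaster startup in the messages
file and reports message texts that show up before more than one restart.
Two things are assumptions here: the marker string that identifies a startup
line (look at what your own messages file prints when the qmaster starts),
and the layout of each line as "|"-separated fields with the message text
last. All function names are hypothetical.

```python
from collections import Counter, deque


def lines_before_marker(lines, marker, context=5):
    """For each line containing marker, collect up to `context` preceding lines."""
    recent = deque(maxlen=context)
    groups = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            groups.append(list(recent))
            recent.clear()  # do not let one restart's context leak into the next
        else:
            recent.append(line)
    return groups


def message_text(line):
    # Assumption: fields are "|"-separated and the message text is the last one.
    return line.rsplit("|", 1)[-1].strip()


def recurring_messages(groups):
    """Return (message, count) for messages seen before more than one restart."""
    counts = Counter()
    for group in groups:
        counts.update({message_text(line) for line in group})
    return [(msg, n) for msg, n in counts.most_common() if n > 1]
```

Called as `recurring_messages(lines_before_marker(open(path), marker))`, with
`path` and `marker` supplied by you, this gives a short list to look at
first. It only reads the file, so it does not disturb the running qmaster.
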
_______________________________________________ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users