Normally, restarting the qmaster (e.g. `systemctl restart gridengine-qmaster`) is a routine, harmless operation that users should not notice, apart from `qstat` being briefly unavailable.
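
For the duplicate JIDs in accounting mentioned in the quoted message, here is a minimal sketch of how the reused job numbers could be listed so that monitoring can tell them apart. It assumes the colon-separated layout described in accounting(5), with job_number as the 6th field and submission_time as the 9th; please check that against the man page for 2011.11p1 before relying on it. The function name `find_reused_job_ids` is just a placeholder, not anything shipped with SGE.

```python
from collections import defaultdict

# Assumed 0-based field positions, taken from accounting(5):
# qname:hostname:group:owner:job_name:job_number:account:priority:submission_time:...
JOB_NUMBER = 5
SUBMISSION_TIME = 8


def find_reused_job_ids(lines):
    """Return {job_number: sorted submission times} for reused job numbers.

    Tasks of one array job share a submission time, so they are not
    reported as duplicates; only job numbers that appear with more than
    one distinct submission time are returned.
    """
    submissions = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip records too short to contain a submission time.
        if len(fields) <= SUBMISSION_TIME:
            continue
        submissions[fields[JOB_NUMBER]].add(fields[SUBMISSION_TIME])
    return {
        jid: sorted(times)
        for jid, times in submissions.items()
        if len(times) > 1
    }
```

To use it, open the cell's accounting file and pass the file object to `find_reused_job_ids`. Keying on the pair (job number, submission time) rather than the job number alone is one way to keep the old and new jobs distinct in reports.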

On Fri, Oct 18, 2019 at 8:35 AM WALLIS Michael <mike.wal...@ed.ac.uk> wrote:

> Hi folks,
>
> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
> at the start of the month, and then started again at 1 as expected. About
> ten days later we started the qmaster a few times (it was segfaulting,
> originally we thought that a user was using newer qstat binaries to query
> an old qmaster) with JID nearing ~20k, only after each of the restarts the
> JID started at about 1100, not the number we were expecting. Because of
> this there's duplicate JID entries in accounting and it's causing a bit of
> a problem for people who monitor for failed jobs.
>
> Because of the nature of the workload the currently-running JIDs are now
> all over the place, with some JIDs in the queue still in the 9,99n,nnn
> range and some in four figures. If we need to restart the qmaster again,
> will the jobseqnum file be overwritten with the largest JID still in the
> queue (as suggested in
> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?
>
> Am aware that this is an old version of SGE and we're in the middle of
> transitioning to a much newer one, but this is a bit of an issue while
> we're still shifting workloads over.
>
> Thanks,
> Mike
> --
> Mike Wallis x503305
> University of Edinburgh, Research Services,
> Argyle House, 3 Lady Lawson Street,
> Edinburgh, EH3 9DR
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users