Normally, restarting the qmaster (e.g. `systemctl restart gridengine-qmaster`)
is a routine, harmless operation that is invisible to users apart from
`qstat` being briefly unavailable.
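
If you want to see what a restart actually does to the counter, one option is to record the job sequence number before and after it. The following is only a minimal sketch: it assumes the counter is the plain-text `jobseqnum` file in the qmaster spool directory (often `$SGE_ROOT/$SGE_CELL/spool/qmaster`, but check your own configuration), and `read_jobseqnum` and `backup_jobseqnum` are hypothetical helper names, not part of SGE.

```shell
#!/bin/sh
# Sketch only. Assumption: the job counter is stored as plain text in a file
# named "jobseqnum" inside the qmaster spool directory passed as $1.

# Print the stored job sequence number, or fail if the file is unreadable.
read_jobseqnum() {
    spool_dir=$1
    seq_file="$spool_dir/jobseqnum"
    if [ ! -r "$seq_file" ]; then
        echo "cannot read $seq_file" >&2
        return 1
    fi
    # Strip whitespace so the value can be compared as a bare number.
    tr -d '[:space:]' < "$seq_file"
    echo
}

# Hypothetical helper: keep a timestamped copy of the counter file in $2
# so the pre-restart value can be inspected or compared later.
backup_jobseqnum() {
    spool_dir=$1
    backup_dir=$2
    cp -p "$spool_dir/jobseqnum" "$backup_dir/jobseqnum.$(date +%Y%m%d%H%M%S)"
}
```

Calling `read_jobseqnum` on the spool directory immediately before and after the restart, and comparing the two values with the highest JID shown by `qstat`, should show whether the counter was carried over or reset.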

On Fri, Oct 18, 2019 at 8:35 AM WALLIS Michael wrote:

> Hi folks,
> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
> at the start of the month, and then started again at 1 as expected. About
> ten days later we started the qmaster a few times (it was segfaulting;
> originally we thought that a user was using newer qstat binaries to query
> an old qmaster) with the JID nearing 20k, but after each of the restarts
> the JID started at about 1100, not the number we were expecting. Because
> of this there are duplicate JID entries in accounting and it's causing a
> bit of a problem for people who monitor for failed jobs.
> Because of the nature of the workload the currently-running JIDs are now
> all over the place, with some JIDs in the queue still in the 9,99n,nnn
> range and some in four figures. If we need to restart the qmaster again,
> will the jobseqnum file be overwritten with the largest JID still in the
> queue (as suggested in
> Am aware that this is an old version of SGE and we're in the middle of
> transitioning to a much newer one, but this is a bit of an issue while
> we're still shifting workloads over.
> Thanks,
> Mike
> --
> Mike Wallis x503305
> University of Edinburgh, Research Services,
> Argyle House, 3 Lady Lawson Street,
> Edinburgh, EH3 9DR
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
> _______________________________________________
> users mailing list