Re: [gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened

2019-10-28 Thread bergman
In the message dated: Fri, 18 Oct 2019 15:34:02 -,
The pithy ruminations from WALLIS Michael on 
[[gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened] were:
=> Hi folks,
=> 
=> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
=> at the start of the month, and then started again at 1 as expected. About
=> ten days later we started the qmaster a few times (it was segfaulting,
=> originally we thought that a user was using newer qstat binaries to query
=> an old qmaster) with JID nearing ~20k, only after each of the restarts the
=> JID started at about 1100, not the number we were expecting. Because of
=> this there's duplicate JID entries in accounting and it's causing a bit of
=> a problem for people who monitor for failed jobs.

We've seen that too.

Restarting the queue master doesn't rotate the accounting file, so qacct
output may be 'wrong' unless the query is restricted by a time range (i.e.,
jobID 1000 may exist from both 2017 and 2019).
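
To see which job IDs are affected, something like the following could be run
against the accounting file. It is only a rough sketch: it assumes the
colon-separated layout described in accounting(5), with job_number as the
6th field and submission_time as the 9th, and it assumes times are stored as
epoch seconds (some newer variants may use different units, so check first).
The function names are made up for illustration and are not part of SGE.

```python
from collections import defaultdict

# 0-based positions of job_number and submission_time in an accounting
# record, assuming the field order given in accounting(5).
JOB_NUMBER, SUBMISSION_TIME = 5, 8


def parse_accounting(lines):
    """Yield (job_id, submission_time) pairs; skip comments and short lines."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) <= SUBMISSION_TIME:
            continue
        yield int(fields[JOB_NUMBER]), int(fields[SUBMISSION_TIME])


def duplicate_job_ids(lines):
    """Map each reused job ID to its sorted, distinct submission times.

    Array tasks and parallel jobs can write several records that share one
    submission time, so only IDs seen with more than one distinct
    submission time are reported as duplicates.
    """
    seen = defaultdict(set)
    for job_id, submitted in parse_accounting(lines):
        seen[job_id].add(submitted)
    return {jid: sorted(times) for jid, times in seen.items() if len(times) > 1}


def records_in_window(lines, job_id, begin, end):
    """Yield records for job_id submitted in the half-open window [begin, end)."""
    for jid, submitted in parse_accounting(lines):
        if jid == job_id and begin <= submitted < end:
            yield jid, submitted
```

Typical use would be `duplicate_job_ids(open(path))` with `path` pointing at
your cell's accounting file. The window filter does roughly what restricting
a qacct query with its -b and -e options does.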

Mark

=> 
=> Because of the nature of the workload the currently-running JIDs are now
=> all over the place, with some JIDs in the queue still in the 9,99n,nnn
=> range and some in four figures. If we need to restart the qmaster again,
=> will the jobseqnum file be overwritten with the largest JID still in the
=> queue (as suggested in
=> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?
=> 
=> Am aware that this is an old version of SGE and we're in the middle of
=> transitioning to a much newer one, but this is a bit of an issue while
=> we're still shifting workloads over.
=> 
=> Thanks,
=> Mike
=> --
=> Mike Wallis x503305
=> University of Edinburgh, Research Services,
=> Argyle House, 3 Lady Lawson Street,
=> Edinburgh, EH3 9DR
=> 
=> 

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened

2019-10-18 Thread Daniel Povey
Normally restarting the qmaster (e.g. systemctl restart gridengine-qmaster)
should be a routine, harmless operation, invisible to users apart from
`qstat` being temporarily unavailable.
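
If you want to confirm that a given restart really was harmless for the job
counter, one option is to note the value in the jobseqnum file before and
after. The sketch below assumes that file holds a single decimal integer and
that the counter wraps at 10,000,000 as reported in this thread. The helper
names and the wrap_margin value are invented for illustration, and the path
to jobseqnum depends on where your qmaster spool directory lives.

```python
from pathlib import Path

MAX_SEQNUM = 10_000_000  # rollover point reported in this thread


def read_jobseqnum(path):
    """Return the integer stored in a jobseqnum-style file.

    Assumes the file holds one decimal integer, possibly followed by
    whitespace or a newline.
    """
    return int(Path(path).read_text().split()[0])


def counter_regressed(before, after, max_seqnum=MAX_SEQNUM, wrap_margin=1000):
    """Return True if the counter went backwards without a plausible rollover.

    A drop from just below max_seqnum to a small number is treated as a
    normal wrap. A drop from anywhere else (for example ~20k to ~1100)
    is flagged. wrap_margin is an arbitrary tolerance chosen for this sketch.
    """
    if after >= before:
        return False
    near_max = before > max_seqnum - wrap_margin
    return not near_max
```

For the case described in this thread, `counter_regressed(20000, 1100)` is
true, whereas a drop from 9,999,990 to 3 is treated as the expected wrap.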

On Fri, Oct 18, 2019 at 8:35 AM WALLIS Michael wrote:

> Hi folks,
>
> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
> at the start of the month, and then started again at 1 as expected. About
> ten days later we started the qmaster a few times (it was segfaulting,
> originally we thought that a user was using newer qstat binaries to query
> an old qmaster) with JID nearing ~20k, only after each of the restarts the
> JID started at about 1100, not the number we were expecting. Because of
> this there's duplicate JID entries in accounting and it's causing a bit of
> a problem for people who monitor for failed jobs.
>
> Because of the nature of the workload the currently-running JIDs are now
> all over the place, with some JIDs in the queue still in the 9,99n,nnn
> range and some in four figures. If we need to restart the qmaster again,
> will the jobseqnum file be overwritten with the largest JID still in the
> queue (as suggested in
> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?
>
> Am aware that this is an old version of SGE and we're in the middle of
> transitioning to a much newer one, but this is a bit of an issue while
> we're still shifting workloads over.
>
> Thanks,
> Mike
> --
> Mike Wallis x503305
> University of Edinburgh, Research Services,
> Argyle House, 3 Lady Lawson Street,
> Edinburgh, EH3 9DR
>
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>