Hi folks,

Our instance of SGE (quite old, 2011.11p1_155) rolled over 10,000,000
jobs at the start of the month and started again at 1, as expected.
About ten days later we restarted the qmaster a few times (it was
segfaulting; originally we thought that a user was using newer qstat
binaries to query an old qmaster) with the JID at around 20k. After each
of the restarts, however, the JID started at about 1100, not the number
we were expecting. Because of this there are duplicate JID entries in
accounting, which is causing a bit of a problem for people who monitor
for failed jobs.
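
To see which JIDs are affected, something like the following might help.
It is only a minimal sketch: it assumes the usual colon-separated
accounting record layout from accounting(5), with job_number as the 6th
field and submission_time as the 9th, and that no job name contains a
colon. The function name find_reused_jids is made up for illustration.
A JID counts as reused when it appears with more than one submission
time, so tasks of a single array job are not flagged.

```python
from collections import defaultdict

# Assumed 0-based field positions in a colon-separated accounting record
# (layout as described in accounting(5); check against your own version).
JOB_NUMBER_FIELD = 5
SUBMISSION_TIME_FIELD = 8


def find_reused_jids(lines):
    """Map each JID seen with more than one submission time to those times."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip records too short to hold the fields we need.
        if len(fields) <= SUBMISSION_TIME_FIELD:
            continue
        seen[fields[JOB_NUMBER_FIELD]].add(fields[SUBMISSION_TIME_FIELD])
    # Keep only JIDs that were submitted at more than one distinct time.
    return {jid: sorted(times) for jid, times in seen.items() if len(times) > 1}
```

It can be fed an open file handle on the cell's accounting file
(typically $SGE_ROOT/$SGE_CELL/common/accounting, though the location is
an assumption about the local install), and monitoring could then key on
the (JID, submission time) pair rather than the JID alone.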

Because of the nature of the workload, the currently running JIDs are
now all over the place, with some JIDs in the queue still in the
9,99n,nnn range and some in four figures. If we need to restart the
qmaster again, will the jobseqnum file be overwritten with the largest
JID still in the queue (as suggested in
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?

I'm aware that this is an old version of SGE, and we're in the middle of
transitioning to a much newer one, but this is a bit of an issue while
we're still shifting workloads over.

Thanks,
Mike
--
Mike Wallis x503305
University of Edinburgh, Research Services,
Argyle House, 3 Lady Lawson Street,
Edinburgh, EH3 9DR


The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.
