We had another qmaster crash over the weekend, and the jobseqnum was again 
reset incorrectly to 9636419. I tried to delete the job with an ID one lower:

$ qdel 9636418

And this crashed the qmaster!

11/23/2015 11:58:55|worker|qmaster|W|It is impossible to move task 0 of job 
9636418 to the list of finished jobs
11/23/2015 11:58:55|worker|qmaster|C|Removing element from other list !!!

Upon restart, the jobseqnum appears to have been correctly taken from the 
jobseqnum file, so maybe this issue is resolved.

Has anyone seen this before?  Could this be caused by corruption of the sge_job 
DB?  Should I do a db_verify/db_dump/db_load on the DB to flush out any other 
issues?

Thanks,
Brad

-----Original Message-----
From: Reuti [mailto:[email protected]] 
Sent: Tuesday, November 17, 2015 5:09 AM
To: Dobbie, Brad <[email protected]>
Cc: [email protected]
Subject: Re: [gridengine users] Possible BDB corruption?

Hi,

> On 17.11.2015 at 00:15, Dobbie, Brad <[email protected]> wrote:
> 
> Our cluster is 6.2u5 and uses local BDB spooling.  We recently suffered a 
> qmaster crash and after reboot we noticed some strange behavior with the 
> jobseqnum.  When we restarted the qmaster, the JOBIDs skipped up to the high 
> 9's (996xxxx range).  
> 
> From the SGE source, it appears the qmaster picks a new jobseqnum as the MAX 
> of the value in the jobseqnum file and the result of the 
> guess_highest_job_number function.
> 
>  fp = fopen(SEQ_NUM_FILE, "r");
>  fscanf(fp, sge_u32, &job_nr);
>  guess_job_nr = guess_highest_job_number();
>  job_nr = MAX(job_nr, guess_job_nr);
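
The selection described in the quoted excerpt can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions, not the SGE implementation: highest_job_id(), read_seq_num() and pick_start_job_nr() are hypothetical names, and a plain array of IDs stands in for master_job_list. It shows why a single stale high ID among the spooled objects would be enough to pull the starting number upward, although whether that is what happened here is not confirmed.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Hypothetical stand-in for guess_highest_job_number(): scans a plain
 * array of job IDs. The real qmaster walks master_job_list instead. */
uint32_t highest_job_id(const uint32_t *job_ids, size_t count)
{
    uint32_t highest = 0;
    for (size_t i = 0; i < count; i++) {
        if (job_ids[i] > highest) {
            highest = job_ids[i];
        }
    }
    return highest;
}

/* Reads one unsigned number from a sequence-number file.
 * Returns 0 if the file is missing or does not start with a number. */
uint32_t read_seq_num(const char *path)
{
    uint32_t job_nr = 0;
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return 0;
    }
    if (fscanf(fp, "%" SCNu32, &job_nr) != 1) {
        job_nr = 0;
    }
    fclose(fp);
    return job_nr;
}

/* Mirrors job_nr = MAX(job_nr, guess_job_nr) from the excerpt. */
uint32_t pick_start_job_nr(uint32_t from_file, uint32_t guessed)
{
    return from_file > guessed ? from_file : guessed;
}
```

With made-up values, a file value of 9636419 and a guessed maximum of 9961234 yield 9961234, so the file value only wins when nothing in the loaded list is higher.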

I wasn't aware that the BDB is also checked. For a fresh installation it should 
be sufficient to write the proper starting value into the jobseqnum file.

-- Reuti


> It appears the qmaster guesses the highest job number from the 
> master_job_list, which I assume is stored in the spooling database.
> 
>  lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));
> 
> When we restarted the qmaster, we attempted to keep the running jobs running. 
>  None of the previously running jobs were in the high 9's range, and the 
> jobseqnum file did not contain a value in that range.
> 
> I'm wondering how the qmaster selected this JOBID.  Could the BDB spooling 
> database be corrupted?  Is there a way to debug or clean up the spooling 
> database?  I poked around a little with the db_dump utility but wasn't able 
> to draw any conclusions.  I saw some JATASKs in the high 9's range, but no 
> JOBs.
> 
> We are looking to migrate our qmaster to a new machine, so we'd like to be 
> able to control the jobseqnum upon startup to avoid potential accounting file 
> overlaps.
> 
> Thanks,
> Brad Dobbie
> 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

