A couple of days ago, we had a power outage, and our 6.2U5 SGE qmaster would not start when the qmaster machine was rebooted. When I ran the qmaster in the foreground, I got a core dump.
I suspected that the spooldb was corrupted (we use Berkeley DB spooling because we run a very large number of jobs, mostly very small ones), so I re-created the spooldb/sge and spooldb/sge_job files using the following procedure:

1. db_dump spooldb/sge to a file.
2. Create a new grid to get empty sge and sge_job dbs.
3. Copy the empty sge and sge_job files into my old spooldb.
4. db_load the new spooldb/sge from the earlier db_dump.

With this process, the qmaster would start, and my configuration was retained from before the crash.

Now I see occasional emails from the execd clients with the following:

Job 4433950 caused action: none
User = build
Queue = (null)@(null)
Start Time = <unknown>
End Time = <unknown>
failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status

As can be seen, the queue name is invalid. Any idea what might cause this, and how to stop it?

Simon
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
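For anyone trying to reproduce the four-step rebuild described in this message, here is a rough sketch of it as a shell function. All paths are hypothetical placeholders and not the ones from my installation, and the `run` helper and `DRY_RUN` switch are only there for illustration so the commands can be previewed before anything is touched. It assumes the qmaster is stopped and that the Berkeley DB `db_dump` and `db_load` utilities are on the PATH. Step 2 (creating a fresh grid) is manual and is not scripted.

```shell
#!/bin/sh
# Sketch of the spooldb rebuild; every path below is a hypothetical placeholder.
SPOOLDB="${SPOOLDB:-/path/to/old/spooldb}"                  # damaged spool directory (assumed)
EMPTY_SPOOLDB="${EMPTY_SPOOLDB:-/path/to/new/grid/spooldb}" # spooldb of a freshly created grid (assumed)
DUMP_FILE="${DUMP_FILE:-/tmp/sge.dump}"                     # where the dump is written (assumed)
DRY_RUN="${DRY_RUN:-1}"                                     # 1 = only print the commands

# Illustrative helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

rebuild_spooldb() {
    # Step 1: dump the old sge database to a flat file.
    run db_dump -f "$DUMP_FILE" "$SPOOLDB/sge" || return 1
    # Step 2 (manual): create a new grid so that empty sge and sge_job files exist.
    # Step 3: copy the empty databases over the old ones.
    run cp "$EMPTY_SPOOLDB/sge" "$EMPTY_SPOOLDB/sge_job" "$SPOOLDB/" || return 1
    # Step 4: load the dumped records into the fresh sge database.
    run db_load -f "$DUMP_FILE" "$SPOOLDB/sge" || return 1
}

rebuild_spooldb
```

With the default `DRY_RUN=1`, the script only prints the three commands it would run; setting `DRY_RUN=0` executes them. Note that only spooldb/sge is dumped and reloaded here, as in the procedure above, so spooldb/sge_job starts out empty.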
