I was able to fix it,
although I suspect that my fix may have been disruptive to
the jobs.
Firstly, I believe the problem was that gridengine
does not handle a deleted job that is on a host that has
been deleted, and the qmaster dies when it sees one.
Presumably the real bug is in allowing the host to be
deleted in the first place.
Anyway, my fix (after backing up the directory
/var/spool/gridengine) was to move the file
/var/spool/gridengine/spooldb/sge_job to a temporary
location, restart the qmaster, add the host back with
qconf -ah, stop the qmaster, restore the old database
/var/spool/gridengine/spooldb/sge_job, and restart the
qmaster.
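For anyone hitting the same thing, the sequence above could be sketched roughly as below. This is a reconstruction under assumptions, not a tested recipe: the `HOLD` path, the `.bak` backup location and the `HOST` name are placeholders I made up, and the initial `systemctl stop` is just a precaution in case the qmaster is still up. By default it only prints the commands; set `DRY_RUN=0` to really run them.

```shell
#!/bin/sh
# Sketch of the recovery sequence. Dry run by default: commands are printed,
# not executed. Set DRY_RUN=0 to execute them for real.
SPOOL=/var/spool/gridengine
JOBDB="$SPOOL/spooldb/sge_job"
HOLD="${HOLD:-/root/sge_job.hold}"   # hypothetical temporary location
HOST="${HOST:-node01}"               # hypothetical name of the deleted host
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover() {
    run cp -a "$SPOOL" "$SPOOL.bak"          # back up the whole spool directory first
    run systemctl stop gridengine-qmaster    # precaution: make sure it is down
    run mv "$JOBDB" "$HOLD"                  # set the job database aside
    run systemctl start gridengine-qmaster   # qmaster comes up without the jobs
    run qconf -ah "$HOST"                    # add the host back
    run systemctl stop gridengine-qmaster
    run mv "$HOLD" "$JOBDB"                  # restore the old job database
    run systemctl start gridengine-qmaster
}

recover
```

Running it in dry-run mode first is a cheap way to check the paths and host name before touching the live spool.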
Before doing that whole procedure, to stop the execution
hosts from getting confused, I stopped all the
gridengine-exec services. That probably wasn't optimal,
because clients like qsub and qstat would still have been
able to reach the qmaster in the interim, which definitely
would have confused them and killed some processes.
Unfortunately I had to do this on short notice and wasn't
sure how to use iptables to close off those ports from
outside the qmaster while I did the maintenance; that
would have been a better solution.
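In hindsight, the iptables approach might look something like the sketch below. It assumes the qmaster listens on the default sge_qmaster port 6444/tcp (check `/etc/services` or `$SGE_QMASTER_PORT` on your install) and that rejecting everything not arriving on the loopback interface is acceptable. I have not run this against a live cluster; the function names are my own, and it only prints the rules unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: temporarily reject remote connections to the qmaster port.
# Assumption: default sge_qmaster port 6444/tcp unless SGE_QMASTER_PORT is set.
QMASTER_PORT="${SGE_QMASTER_PORT:-6444}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Insert a rule at the top of INPUT rejecting TCP to the qmaster port
# from any interface other than loopback.
block_qmaster() {
    run iptables -I INPUT -p tcp --dport "$QMASTER_PORT" ! -i lo -j REJECT
}

# Delete the same rule once maintenance is finished.
unblock_qmaster() {
    run iptables -D INPUT -p tcp --dport "$QMASTER_PORT" ! -i lo -j REJECT
}
```

The idea would be to call `block_qmaster` before touching the spool database and `unblock_qmaster` afterwards, so local qconf still works while remote clients get a clean connection refusal.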
Also, I encountered a hiccup: `systemctl stop
gridengine-qmaster` didn't actually work the second time.
The process was still running, with the old database, so I
had to manually kill it and retry.
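A small guard against that hiccup could be to verify the daemon is really gone before swapping the database. The sketch below assumes the process is named `sge_qmaster` and that `pgrep`/`pkill` are available; the helper names and the 10-second timeout are arbitrary choices of mine.

```shell
#!/bin/sh
# Wait up to $2 seconds (default 10) for every process named exactly $1 to exit.
# Returns 0 once none is left, 1 if one is still running at the timeout.
wait_gone() {
    name=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        pgrep -x "$name" >/dev/null || return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Stop the service, then fall back to SIGTERM if the daemon is still alive.
stop_qmaster() {
    systemctl stop gridengine-qmaster
    if ! wait_gone sge_qmaster 10; then
        echo "sge_qmaster still running after systemctl stop; sending SIGTERM" >&2
        pkill -x sge_qmaster
        wait_gone sge_qmaster 10
    fi
}
```

`stop_qmaster` returns non-zero if the process survives even the SIGTERM, so the database should only be moved when it succeeds.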
Anyway, this whole episode is making me think more
seriously about moving to Univa Grid Engine. I've known
for a long time that the free version has a lot of bugs,
and I just don't have time to deal with this type of
thing.