Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Joseph Farran
Glad you were able to fix it, Dan. I looked at Univa Grid Engine a while ago and it was super expensive. I was able to ask a lot of questions of a candidate for a position we had who was using Univa GE. His sentiments were that it was

Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
Sorry, what I wrote was confusing due to an errant paste. Edited below.

On Sat, Nov 10, 2018 at 5:03 PM Daniel Povey wrote:
> I was able to fix it, although I suspect that my fix may have been
> disruptive to the jobs.
>
> Firstly, I believe the problem was that gridengine does not handle a

Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
/var/spool/gridengine
I was able to fix it, although I suspect that my fix may have been disruptive to the jobs. Firstly, I believe the problem was that gridengine does not handle a deleted job that is on a host that has been deleted, and it dies when it sees it. Presumably the bug is in

Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Marshall2, John (SSC/SPC)
Hi,

I've never seen this, but I would start with:

1) strace qmaster during restart to try to see at which point it is dying (e.g., loading a config file)
2) look for any reference to the name of the host you deleted in the spool area and do some cleanup
3) clean out the jobs spool area

HTH,
John
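For step 2, a rough sketch of searching the spool area for a leftover hostname might look like the following. It assumes the spool lives under /var/spool/gridengine (the path that appears in this thread; the real location depends on how Grid Engine was installed) and simply reports every regular file whose raw bytes contain the hostname. That is useful with classic flat-file spooling; with Berkeley DB spooling it can only tell you which database file mentions the host, not how to edit it. The helper name `find_host_references` is hypothetical and not part of Grid Engine.

```python
import os
import sys

# Assumption: default spool root taken from this thread; real installs vary
# (e.g. $SGE_ROOT/$SGE_CELL/spool). Override it with the second argument.
DEFAULT_SPOOL_ROOT = "/var/spool/gridengine"


def find_host_references(spool_root, hostname):
    """Return sorted (path, count) pairs for regular files under spool_root
    whose raw bytes contain hostname. Hypothetical helper, not part of SGE."""
    needle = hostname.encode()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip sockets, FIFOs, etc.: opening a FIFO could block forever.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable (permissions, vanished file): skip it
            count = data.count(needle)
            if count:
                hits.append((path, count))
    return sorted(hits)


def main(argv):
    if len(argv) < 2:
        print(f"usage: {argv[0]} HOSTNAME [SPOOL_ROOT]", file=sys.stderr)
        return 2
    hostname = argv[1]
    root = argv[2] if len(argv) > 2 else DEFAULT_SPOOL_ROOT
    for path, count in find_host_references(root, hostname):
        print(f"{count:6d}  {path}")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Run it (for example as `python3 find_host_refs.py deleted-host`, file name hypothetical) with qmaster stopped, and back up the spool area before cleaning anything by hand. Because it matches raw substrings, a short name such as `node1` will also match `node17`.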

[gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
Has anyone encountered this error and managed to fix it? I am in a very difficult situation. I deleted a host (qconf -de hostname) thinking that the machine no longer existed, but it did exist, and there was a job in 'dr' state there. After I attempted to force-delete that job (qdel -f job-id), the