Univa has different pricing depending on who you are. Based on what they published, I calculated that it would be something like $5k per year for our cluster, but after talking to them they said it would be about a tenth of that for an academic institution. But thanks for the info-- if it's not that much better I may reconsider.
Dan

On Sat, Nov 10, 2018 at 6:19 PM Joseph Farran <jfar...@uci.edu> wrote:

> Glad you were able to fix it Dan.
>
> I looked at Univa Grid Engine a while ago and it was super expensive.
>
> I was able to ask lots of questions to a potential candidate for a position
> we had who was using Univa GE. His sentiments were that it was better
> than the free version BUT not that much better and still plagued with
> "weird" issues.
>
> Since we cannot afford Univa our department big-wig wants us to move to
> Slurm for our next cluster. Not sure how much better Slurm is but it does
> seem to have good support.
>
> Joseph
>
> On 11/10/2018 2:03 PM, Daniel Povey wrote:
>
> I was able to fix it, although I suspect that my fix
> may have been disruptive to the jobs.
>
> Firstly, I believe the problem was that gridengine does not handle a
> deleted job that is on a host that has been deleted, and it dies when it
> sees it. Presumably the bug is in allowing it to be deleted in the first
> place.
>
> Anyway, my fix (after backing up the directory /var/spool/gridengine) was
> to move the file /var/spool/gridengine/spooldb/sge_job to a temporary
> location, restart the qmaster, add the host back with qconf -ah, stop the
> qmaster, restore the old database /var/spool/gridengine/spooldb/sge_job,
> and restart the qmaster.
>
> Before doing that whole procedure, to stop the hosts getting confused I
> stopped all the gridengine-exec services. That probably wasn't optimal
> because clients like qsub and qstat would still have been able to access
> the queue in the interim, and it definitely would have confused them and
> killed some processes. Unfortunately I had to do this on short notice and
> wasn't sure how to use iptables to close off those ports from outside the
> qmaster while I did the maintenance-- that would have been a better
> solution.
>
> Also I encountered a hiccup that `systemctl stop gridengine-qmaster`
> didn't actually work the second time: the process was still running, with
> the old database, so I had to manually kill it and retry.
>
> Anyway this whole episode is making me think more seriously about moving
> to Univa GridEngine. I've known for a long time that the free version has
> a lot of bugs, and I just don't have time to deal with this type of thing.
>
> On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <john.marsha...@canada.ca> wrote:
>
>> Hi,
>>
>> I've never seen this but I would start with:
>> 1) strace qmaster during restart to try to see at which point it is dying
>> (e.g., loading a config file)
>> 2) look for any reference to the name of the host you deleted in the spool
>> area and do some cleanup
>> 3) clean out the jobs spool area
>>
>> HTH,
>> John
>>
>> On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
>>
>> Has anyone found this error, and managed to fix it?
>> I am in a very difficult situation.
>> I deleted a host (qconf -de hostname) thinking that the machine no longer
>> existed, but it did exist, and there was a job in 'dr' state there.
>> After I attempted to force-delete that job (qdel -f job-id), the queue
>> master died with out-of-memory, and now I can't restart qmaster.
>>
>> So now I don't know how to fix it. Am I just completely lost now?
>>
>> Dan
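
For anyone landing here with the same problem, the recovery sequence described in the quoted message can be sketched as a dry-run shell script. This is only a sketch under stated assumptions, not a tested procedure: the spool paths, the service name `gridengine-qmaster`, and `qconf -ah` are taken from the thread as written, while the backup and holding locations, the `run` helper, and the function names are hypothetical. The port 6444 is assumed to be the qmaster port (the usual default; check `$SGE_QMASTER_PORT` or `/etc/services` on your install). By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps from the thread.
# Commands are only printed unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"
SPOOL="${SPOOL:-/var/spool/gridengine}"         # path from the thread
JOB_DB="$SPOOL/spooldb/sge_job"                 # path from the thread
BACKUP="${BACKUP:-/root/gridengine-spool.bak}"  # hypothetical backup location
HOLD="${HOLD:-/root/sge_job.hold}"              # hypothetical temporary location
QMASTER_PORT="${QMASTER_PORT:-6444}"            # assumed default qmaster port

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reject non-loopback connections to the qmaster port during maintenance
# (the iptables approach the thread mentions as the better option).
block_clients() {
    run iptables -I INPUT -p tcp --dport "$QMASTER_PORT" ! -i lo -j REJECT
}

# Remove the rule added by block_clients.
unblock_clients() {
    run iptables -D INPUT -p tcp --dport "$QMASTER_PORT" ! -i lo -j REJECT
}

# $1: name of the host that was deleted and needs to be added back.
recover_qmaster() {
    host="$1"
    run cp -a "$SPOOL" "$BACKUP" &&               # back up the spool directory
    run mv "$JOB_DB" "$HOLD" &&                   # set the job database aside
    run systemctl restart gridengine-qmaster &&   # start qmaster without it
    run qconf -ah "$host" &&                      # add the host back, as in the thread
    run systemctl stop gridengine-qmaster &&      # thread: verify it really stopped
    run mv "$HOLD" "$JOB_DB" &&                   # restore the old job database
    run systemctl restart gridengine-qmaster
}
```

Sourcing the file and calling `block_clients; recover_qmaster <hostname>; unblock_clients` prints the full sequence for review. Because the thread reports that `systemctl stop gridengine-qmaster` left the old process running on the second attempt, it is worth confirming that the qmaster process has actually exited before restoring the database.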
_______________________________________________ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users