William Hay <[email protected]> writes:

> My characterisation of Grid Engine as robust was in comparison to
> Torque and Moab. Torque in particular seemed to be rather fragile and
> the combination seemed to have issues scaling to the number of jobs we
> needed (array jobs didn't appear to work properly and somewhere
> slightly north of 50000 jobs in the queue the two of them timed out
> when talking to each other).
Obviously I was wrong guessing we have similar requirements; we don't
get anything like that number of jobs, given the separate Condor pool
(sigh). I guess we'd expect people to construct array jobs if we did.
The combination of known scalability and features is what makes it
still worth working with until, perhaps, SLURM catches up or OAR
proves itself.

> we were able
> to work around them bore an uncomfortable resemblance to known bugs in
> that version and decided to switch back to 6.2u3 as that was the
> preferred version of our cluster integrator.

[I'm glad UCL have better luck. I'm not sure I should be grateful to
our "integrator" for forcing me to learn so much so quickly...]

> The main issue we
> currently have with SGE is the time a scheduling cycle takes. We're
> currently trying to tweak the configuration to minimise the work SGE
> has to do while still implementing our policy.

Perhaps you could summarize any useful results. I assume you've seen
the hints (somewhere from Dan T?), at least.

For what it's worth, at least two of the regressions causing daemon
crashes I remember were due to changes for efficiency (memory?), and
were fixed in post-6.2u5 changes from Sun. I guess a more recent
version would be worth trying for scalability improvements.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
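P.S. On the scheduling-cycle tuning mentioned above: the obvious place
to start is the scheduler configuration (edited with "qconf -msconf").
A sketch of the sched_conf(5) parameters most often adjusted to cut
cycle time -- the values below are purely illustrative, not a
recommendation for any particular site:

    # Edit with: qconf -msconf   (illustrative values only)
    schedule_interval   0:0:30   # run the scheduler less often
    flush_submit_sec    0        # don't trigger a cycle on every submit
    flush_finish_sec    0        # ...or on every job finish
    schedd_job_info     false    # stop collecting per-job "why not
                                 # scheduled" info in the qmaster

Turning off schedd_job_info in particular is the usual first
suggestion for large queues, at the cost of losing the scheduling
diagnostics normally shown by "qstat -j".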
