William Hay <[email protected]> writes:

> My characterisation of Grid Engine as robust was in comparison to
> Torque and Moab. Torque in particular seemed to be rather fragile and
> the combination seemed to have issues scaling to the number of jobs we
> needed (array jobs didn't appear to work properly and somewhere
> slightly north of 50000 jobs in the queue the two of them timed out
> when talking to each other).
Obviously I was wrong guessing we have similar requirements; we don't
get anything like that number of jobs, given the separate Condor pool
(sigh). I guess we'd expect people to construct array jobs if we did.
The combination of known scalability and features is what makes it
still worth working with until, perhaps, SLURM catches up or OAR
proves itself.

> we were able
> to work around them bore an uncomfortable resemblance to known bugs in
> that version and decided to switch back to 6.2u3 as that was the
> preferred version of our cluster integrator.

[I'm glad UCL have better luck. I'm not sure I should be grateful to
our "integrator" for forcing me to learn so much so quickly...]

> The main issue we
> currently have with SGE is the time a scheduling cycle takes. We're
> currently trying to tweak the configuration to minimise the work SGE
> has to do while still implementing our policy.

Perhaps you could summarize any useful results. I assume you've seen
the hints (somewhere from Dan T?), at least.

For what it's worth, at least two of the regressions causing daemon
crashes I remember were due to changes for efficiency (memory?), and
were fixed in post-6.2u5 changes from Sun. I guess a more recent
version would be worth trying for scalability improvements.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
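P.S. On the scheduling-cycle tuning mentioned above: the obvious place
to start is the scheduler configuration (edited with "qconf -msconf").
A sketch of the sched_conf(5) parameters most often adjusted to cut
cycle time -- the values below are purely illustrative, not a
recommendation for any particular site:

    # Edit with: qconf -msconf   (illustrative values only)
    schedule_interval   0:0:30   # run the scheduler less often
    flush_submit_sec    0        # don't trigger a cycle on every submit
    flush_finish_sec    0        # ...or on every job finish
    schedd_job_info     false    # stop collecting per-job "why not
                                 # scheduled" info in the qmaster

Turning off schedd_job_info in particular is the usual first
suggestion for large queues, at the cost of losing the scheduling
diagnostics normally shown by "qstat -j".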
