On 16 November 2011 00:10, Dave Love <[email protected]> wrote:
> William Hay <[email protected]> writes:
>
>> On 10 November 2011 03:46, Ron Chen <[email protected]> wrote:
>>
>>> 4) Fritz was telling customers (including William Hay) that open source
>>> Grid Engine is "buggy, unstable, hard to debug", and to use SGE in
>>> production customers need to buy support from Univa.
>>
>> I should point out this was in the context of me baiting him with the
>> assertion that we didn't need a support contract because his team had
>> produced such a robust product. Also I believe his remarks were
>> directed at the common Grid engine code base not specifically the open
>> source variants.
>
> I'd say the code is relatively buggy and intractable by the standards of
> (different sorts of) projects I'm used to. [I say that neutrally, and I
> haven't worked on a more-or-less equivalent system, say SLURM, to
> compare.] I don't know when most regressions in the 6.2 series
> occurred, and they're not all easy to spot in change logs, but possibly
> the version in use at UCL was in something of a sweet spot. I'd expect
> our usage to be similar to UCL's as far as showing them up. As it
> happens, I've recently been fighting a spooling regression (and cocked
> up pushing the patch -- thanks Florian).

My characterisation of Grid Engine as robust was in comparison to Torque
and Moab. Torque in particular seemed to be rather fragile, and the
combination seemed to have issues scaling to the number of jobs we
needed: array jobs didn't appear to work properly, and somewhere
slightly north of 50000 jobs in the queue the two of them timed out when
talking to each other. Possibly later versions would have resolved some
of these issues, but at the time Cluster Resources (Adaptive Computing
as it is now) wouldn't promise more than 50000 jobs, and we'd seen
fairly glaring bugs get past their regression testing.

While Grid Engine isn't bug-free, it scales to our workload, and on the
few occasions when it has thrown a wobbly the log files were very
helpful in identifying the source of the problem. We did try 6.2u5 but
encountered some issues which, although we were able to work around
them, bore an uncomfortable resemblance to known bugs in that version.
We decided to switch back to 6.2u3, as that was the preferred version of
our cluster integrator.

The main issue we currently have with SGE is the time a scheduling cycle
takes. We're currently trying to tweak the configuration to minimise the
work SGE has to do while still implementing our policy.

William
