On 16.11.2011 at 10:24, William Hay wrote:

> On 16 November 2011 00:10, Dave Love <[email protected]> wrote:
>> William Hay <[email protected]> writes:
>>
>>> On 10 November 2011 03:46, Ron Chen <[email protected]> wrote:
>>>
>>>>
>>>> 4) Fritz was telling customers (including William Hay) that open source
>>>> Grid Engine is "buggy, unstable, hard to debug", and to use SGE in
>>>> production customers need to buy support from Univa.
>>> I should point out this was in the context of me baiting him with the
>>> assertion that we didn't need a support contract because his team had
>>> produced such a robust product. Also I believe his remarks were
>>> directed at the common Grid engine code base not specifically the open
>>> source variants.
>>
>> I'd say the code is relatively buggy and intractable by the standards of
>> (different sorts of) projects I'm used to. [I say that neutrally, and I
>> haven't worked on a more-or-less equivalent system, say SLURM, to
>> compare.] I don't know when most regressions in the 6.2 series
>> occurred, and they're not all easy to spot in change logs, but possibly
>> the version in use at UCL was in something of a sweet spot. I'd expect
>> our usage to be similar to UCL's as far as showing them up. As it
>> happens, I've recently been fighting a spooling regression (and cocked
>> up pushing the patch -- thanks Florian).
>
> My characterisation of Grid Engine as robust was in comparison to
> Torque and Moab. Torque in particular seemed to be rather fragile and
> the combination seemed to have issues scaling to the number of jobs we
> needed (array jobs didn't appear to work properly and somewhere
> slightly north of 50000 jobs in the queue the two of them timed out
> when talking to each other). Possibly later versions would have
> resolved some of these issues but at the time Cluster Resources
> (Adaptive computing as it is now) wouldn't promise more than 50000 jobs
> and we'd seen fairly glaring bugs get past their

While I use SGE myself on all machines I set up, we also have access to
a cluster running Torque, and I noticed something similar there. Besides
that, we need tight integration of parallel jobs that use the Linda
library (i.e. Gaussian). As there is nothing like `qrsh -inherit` in
Torque, any setup there would need helper cron jobs to remove old
processes, which I wouldn't like. Removing old processes is one of the
tasks of a queuing system, IMO.

-- Reuti

> regression testing. While Grid Engine isn't bug free it scales to our
> workload and on the few occasions when it has thrown a wobbly the log
> files were very helpful in identifying the source of the problem. We
> did try 6.2u5 but encountered some issues that although we were able
> to work around them bore an uncomfortable resemblance to known bugs in
> that version and decided to switch back to 6.2u3 as that was the
> preferred version of our cluster integrator. The main issue we
> currently have with SGE is the time a scheduling cycle takes. We're
> currently trying to tweak the configuration to minimise the work SGE
> has to do while still implementing our policy.
>
> William
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
