On 16 November 2011 09:38, Reuti <[email protected]> wrote:
> On 16.11.2011 at 10:24, William Hay wrote:
>
>> On 16 November 2011 00:10, Dave Love <[email protected]> wrote:
>>> William Hay <[email protected]> writes:
>>>
>>>> On 10 November 2011 03:46, Ron Chen <[email protected]> wrote:
>>>>
>>>>>
>>>>> 4) Fritz was telling customers (including William Hay) that open source
>>>>> Grid Engine is "buggy, unstable, hard to debug", and to use SGE in
>>>>> production customers need to buy support from Univa.
>>>> I should point out this was in the context of me baiting him with the
>>>> assertion that we didn't need a support contract because his team had
>>>> produced such a robust product. Also I believe his remarks were
>>>> directed at the common Grid engine code base not specifically the open
>>>> source variants.
>>>
>>> I'd say the code is relatively buggy and intractable by the standards of
>>> (different sorts of) projects I'm used to. [I say that neutrally, and I
>>> haven't worked on a more-or-less equivalent system, say SLURM, to
>>> compare.] I don't know when most regressions in the 6.2 series
>>> occurred, and they're not all easy to spot in change logs, but possibly
>>> the version in use at UCL was in something of a sweet spot. I'd expect
>>> our usage to be similar to UCL's as far as showing them up. As it
>>> happens, I've recently been fighting a spooling regression (and cocked
>>> up pushing the patch -- thanks Florian).
>>
>> My characterisation of Grid Engine as robust was in comparison to
>> Torque and Moab. Torque in particular seemed to be rather fragile and
>> the combination seemed to have issues scaling to the number of jobs we
>> needed (array jobs didn't appear to work properly and somewhere
>> slightly north of 50000 jobs in the queue the two of them timed out
>> when talking to each other).
>> Possibly later versions would have
>> resolved some of these issues but at the time Cluster Resources
>> (Adaptive Computing as it is now) wouldn't promise more than 50000 jobs
>> and we'd seen fairly glaring bugs get past their
>
> While I on my own use SGE on all machines I set up, we have access to a
> cluster using Torque and I noticed something similar. Besides that, we need a
> tight integration of parallel jobs using the Linda library (i.e. Gaussian),
> and as there is nothing like `qrsh -inherit` in Torque, any setup in
> Torque would need some helper cron jobs to remove old processes, which I
> wouldn't like. Removing old processes is one of the tasks of a queuing
> system IMO.
>
> -- Reuti

Is Linda/SGE tight integration documented somewhere? I googled and found
some old messages detailing one but judging by the configs listed they
seem to be for a rather retro version of SGE.

William

>
>
>> regression testing. While Grid Engine isn't bug free it scales to our
>> workload and on the few occasions when it has thrown a wobbly the log
>> files were very helpful in identifying the source of the problem. We
>> did try 6.2u5 but encountered some issues that although we were able
>> to work around them bore an uncomfortable resemblance to known bugs in
>> that version and decided to switch back to 6.2u3 as that was the
>> preferred version of our cluster integrator. The main issue we
>> currently have with SGE is the time a scheduling cycle takes. We're
>> currently trying to tweak the configuration to minimise the work SGE
>> has to do while still implementing our policy.
>>
>> William
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
