On 16.11.2011 at 10:24, William Hay wrote:

> On 16 November 2011 00:10, Dave Love <[email protected]> wrote:
>> William Hay <[email protected]> writes:
>> 
>>> On 10 November 2011 03:46, Ron Chen <[email protected]> wrote:
>>> 
>>>> 
>>>> 4) Fritz was telling customers (including William Hay) that open source 
>>>> Grid Engine is "buggy, unstable, hard to debug", and to use SGE in 
>>>> production customers need to buy support from Univa.
>>> I should point out this was in the context of me baiting him with the
>>> assertion that we didn't need a support contract because his team had
>>> produced such a robust product.  Also I believe his remarks were
>>> directed at the common Grid Engine code base, not specifically the open
>>> source variants.
>> 
>> I'd say the code is relatively buggy and intractable by the standards of
>> (different sorts of) projects I'm used to.  [I say that neutrally, and I
>> haven't worked on a more-or-less equivalent system, say SLURM, to
>> compare.]  I don't know when most regressions in the 6.2 series
>> occurred, and they're not all easy to spot in change logs, but possibly
>> the version in use at UCL was in something of a sweet spot.  I'd expect
>> our usage to be similar to UCL's as far as showing them up.  As it
>> happens, I've recently been fighting a spooling regression (and cocked
>> up pushing the patch -- thanks Florian).
> 
> My characterisation of Grid Engine as robust was in comparison to
> Torque and Moab.  Torque in particular seemed to be rather fragile and
> the combination seemed to have issues scaling to the number of jobs we
> needed (array jobs didn't appear to work properly and somewhere
> slightly north of 50000 jobs in the queue the two of them timed out
> when talking to each other).  Possibly later versions would have
> resolved some of these issues but at the time Cluster Resources
> (Adaptive Computing as it is now) wouldn't promise more than 50000 jobs
> and we'd seen fairly glaring bugs get past their

While I myself use SGE on all machines I set up, we have access to a cluster 
using Torque, and I noticed something similar. Besides that, we need tight 
integration of parallel jobs using the Linda library (i.e. Gaussian), and as 
there is nothing like `qrsh -inherit` in Torque, any setup in Torque would 
need some helper cron jobs to remove old processes, which I wouldn't like. 
Removing old processes is one of the tasks of a queuing system IMO.

-- Reuti


> regression testing.  While Grid Engine isn't bug free it scales to our
> workload and on the few occasions when it has thrown a wobbly the log
> files were very helpful in identifying the source of the problem.  We
> did try 6.2u5 but encountered some issues that, although we were able
> to work around them, bore an uncomfortable resemblance to known bugs in
> that version, and decided to switch back to 6.2u3 as that was the
> preferred version of our cluster integrator.  The main issue we
> currently have with SGE is the time a scheduling cycle takes.  We're
> currently trying to tweak the configuration to minimise the work SGE
> has to do while still implementing our policy.
> 
> William
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
> 

