On 16 November 2011 09:38, Reuti <[email protected]> wrote:
> On 16.11.2011 at 10:24, William Hay wrote:
>
>> On 16 November 2011 00:10, Dave Love <[email protected]> wrote:
>>> William Hay <[email protected]> writes:
>>>
>>>> On 10 November 2011 03:46, Ron Chen <[email protected]> wrote:
>>>>
>>>>>
>>>>> 4) Fritz was telling customers (including William Hay) that open source 
>>>>> Grid Engine is "buggy, unstable, hard to debug", and to use SGE in 
>>>>> production customers need to buy support from Univa.
>>>> I should point out this was in the context of me baiting him with the
>>>> assertion that we didn't need a support contract because his team had
>>>> produced such a robust product.  Also I believe his remarks were
>>>> directed at the common Grid engine code base not specifically the open
>>>> source variants.
>>>
>>> I'd say the code is relatively buggy and intractable by the standards of
>>> (different sorts of) projects I'm used to.  [I say that neutrally, and I
>>> haven't worked on a more-or-less equivalent system, say SLURM, to
>>> compare.]  I don't know when most regressions in the 6.2 series
>>> occurred, and they're not all easy to spot in change logs, but possibly
>>> the version in use at UCL was in something of a sweet spot.  I'd expect
>>> our usage to be similar to UCL's as far as showing them up.  As it
>>> happens, I've recently been fighting a spooling regression (and cocked
>>> up pushing the patch -- thanks Florian).
>>
>> My characterisation of Grid Engine as robust was in comparison to
>> Torque and Moab.  Torque in particular seemed to be rather fragile and
>> the combination seemed to have issues scaling to the number of jobs we
>> needed (array jobs didn't appear to work properly and somewhere
>> slightly north of 50000 jobs in the queue the two of them timed out
>> when talking to each other).  Possibly later versions would have
>> resolved some of these issues but at the time Cluster Resources
>> (Adaptive Computing as it is now) wouldn't promise more than 50000 jobs
>> and we'd seen fairly glaring bugs get past their
>
> While I use SGE myself on all machines I set up, we have access to a
> cluster using Torque and I noticed something similar. Besides that, we need
> tight integration of parallel jobs that use the Linda library (i.e. Gaussian).
> As there is nothing like `qrsh -inherit` in Torque, any setup in Torque
> would need some helper cron jobs to remove old processes, which I
> wouldn't like. Removing old processes is one of the tasks of a queuing
> system, IMO.
>
> -- Reuti
Is Linda/SGE tight integration documented somewhere?  I googled and
found some old messages detailing one but judging by the configs
listed they seem to be for a rather retro version of SGE.

William
>
>
>> regression testing.  While Grid Engine isn't bug free it scales to our
>> workload and on the few occasions when it has thrown a wobbly the log
>> files were very helpful in identifying the source of the problem.  We
>> did try 6.2u5 but encountered some issues that, although we were able
>> to work around them, bore an uncomfortable resemblance to known bugs in
>> that version, so we decided to switch back to 6.2u3 as that was the
>> preferred version of our cluster integrator.  The main issue we
>> currently have with SGE is the time a scheduling cycle takes.  We're
>> currently trying to tweak the configuration to minimise the work SGE
>> has to do while still implementing our policy.
>>
>> William
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>