On Tue, Nov 6, 2012 at 5:47 AM, Paul Simpson <p...@realisestudio.com> wrote:
> The child process does not get killed, which leaves an overstressed
> machine, which leads to knock-on errors.

Are these parallel (MPI?) jobs?


> From reading this list, we
> are not alone in suffering from this. Can anyone shed light on this in
> either positive or negative ways? I.e., can/should this work, or is this a
> known bug/'feature'?

The common way to fix this is to enable "ENABLE_ADDGRP_KILL" (an
execd_params setting in the cluster configuration).
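
For anyone wondering why this helps: per sge_conf, each job is tagged
with an additional group ID taken from gid_range, and with
ENABLE_ADDGRP_KILL set that ID is used to find the job's processes at
cleanup time. The Python sketch below only illustrates the idea on Linux
by scanning the "Groups:" line of /proc/<pid>/status. It is not Grid
Engine's code, and the function name pids_with_group is made up for the
example.

```python
import os


def pids_with_group(gid, proc_root="/proc"):
    """Return sorted PIDs whose supplementary groups include gid.

    Hypothetical helper for illustration; Grid Engine does this internally.
    """
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path) as handle:
                for line in handle:
                    if line.startswith("Groups:"):
                        groups = {int(g) for g in line.split()[1:]}
                        if gid in groups:
                            matches.append(int(entry))
                        break  # only one Groups: line per process
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(matches)
```

Calling pids_with_group(20001) would list every process still carrying
GID 20001, where 20001 is just an assumed value inside your gid_range. A
child that daemonizes or starts a new session keeps that group, which is
why this catches processes that a plain process-group kill misses.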

For more recent kernels, we use cgroups to track job membership, and
no process can escape from a cgroup (since the cgroup data structures
are managed by the Linux kernel):

http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html
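
To make the membership tracking concrete, here is a minimal Python
sketch of reading a cgroup's member list. It assumes only the standard
kernel interface (a "cgroup.procs" file inside the cgroup's directory,
one PID per line). The function name is hypothetical, and where a given
Grid Engine build places its per-job cgroups is not something this
sketch knows; you would have to pass that directory in yourself.

```python
import os


def cgroup_pids(cgroup_dir):
    """Return the PIDs listed in <cgroup_dir>/cgroup.procs.

    cgroup_dir is whatever directory holds the job's cgroup; the caller
    must supply it, since its location depends on the local setup.
    """
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as handle:
        # One PID per line; ignore any blank lines.
        return [int(line) for line in handle if line.strip()]
```

Because the kernel maintains that file, children forked by a job land in
the same cgroup automatically, so the list is complete no matter how the
job's processes re-parent or change session.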

Rayson



> - we are currently using 6.2u5 but this is rather old. Would anyone
> recommend upgrading and if so why?
>
> Again, many thanks for your collective help.
>
> Regards,
>
> Paul
>
> On 5 Nov 2012 13:25, "Paul Simpson" <p...@realisestudio.com> wrote:
>>
>> many thanks all - we're wading through this now.  what a great community!
>> :)
>>
>>
>>
>> On 5 November 2012 13:18, MacMullan, Hugh <hugh...@wharton.upenn.edu>
>> wrote:
>>>
>>> Version control: definitely THE way to go Tina! (Adding to my task list).
>>> :)
>>>
>>> On Nov 5, 2012, at 8:02 AM, "Tina Friedrich"
>>> <tina.friedr...@diamond.ac.uk> wrote:
>>>
>>> > Hi Paul,
>>> >
>>> > don't know about everything, but e.g. for complexes - have a look in
>>> > the spool directory, there's a 'centry' subdirectory
>>> > ("$SGE_ROOT/$SGE_CELL/spool/qmaster/centry" for me). That has an ASCII file
>>> > for every complex with all the configuration for it.
>>> >
>>> > There's likewise a subdirectory 'pe' with the PE configuration,
>>> > hostgroups, ...
>>> >
>>> > Tina
>>> >
>>> > PS: ...I do all my configuration from files that I keep in subversion
>>> > (especially queue config, complex config). I find it makes this sort of
>>> > thing lots easier ;)
>>> >
>>> > On 05/11/12 12:07, Paul Simpson wrote:
>>> >> hi grid gurus,
>>> >>
>>> >> i've had a bad w/end where the disk which stored the db filled up. the
>>> >> grid came down and i couldn't fix the db using db_recover -c - which
>>> >> meant no grid engine (6.2u5).
>>> >>
>>> >> we need to get the system back up asap (like yesterday). so, we've
>>> >> installed a fresh version which is coming up. however, we've got a load
>>> >> of complexes, host groups, share-trees, parallel envs, etc. etc. that i
>>> >> can't seem to recover from the old system.
>>> >>
>>> >> i've looked through all the old dirs - but can't find any text files.
>>> >> can anyone suggest how this config information could possibly be
>>> >> recovered? typically, this has happened a day before a huge deadline -
>>> >> so time is not on our side.
>>> >>
>>> >> -paul
>>> >>
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> users mailing list
>>> >> users@gridengine.org
>>> >> https://gridengine.org/mailman/listinfo/users
>>> >
>>> >
>>> > --
>>> > Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
>>> > Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
>>> >
>>
>>
>
