On Tue, Nov 6, 2012 at 5:47 AM, Paul Simpson <p...@realisestudio.com> wrote:
> The child process does not get killed which leaves an over
> stressed machine which leads to knock on errors.

Are these parallel (MPI?) jobs?

> From reading this list, we
> are not alone in suffering from this. Can anyone shed light on this in
> either positive or negative ways? Ie, can/should this work ie is this a
> known bug/'feature'

The common way to fix this is to enable "ENABLE_ADDGRP_KILL".

For more recent kernels, we use cgroups to track job membership, and there
is no way any process can escape from a cgroup (since the cgroup data
structures are managed by the Linux kernel):

http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html

Rayson

> - we are currently using 6.2u5 but this is rather old. Would anyone
> recommend upgrading and if so why?
>
> Again, many thanks for your collective help.
>
> Regards,
>
> Paul
>
> On 5 Nov 2012 13:25, "Paul Simpson" <p...@realisestudio.com> wrote:
>>
>> many thanks all - we're wading through this now. what a great community!
>> :)
>>
>> On 5 November 2012 13:18, MacMullan, Hugh <hugh...@wharton.upenn.edu>
>> wrote:
>>>
>>> Version control: definitely THE way to go Tina! (Adding to my task list).
>>> :)
>>>
>>> On Nov 5, 2012, at 8:02 AM, "Tina Friedrich"
>>> <tina.friedr...@diamond.ac.uk> wrote:
>>>
>>> > Hi Paul,
>>> >
>>> > don't know about everything, but e.g. for complexes - have a look in
>>> > the spool directory, there's a 'centry' subdirectory
>>> > ("$SGE_ROOT/$SGE_CELL/spool/qmaster/centry" for me). That has an ASCII
>>> > file for every complex with all the configuration for it.
>>> >
>>> > There's likewise a subdirectory 'pe' with the PE configuration,
>>> > hostgroups, ...
>>> >
>>> > Tina
>>> >
>>> > PS: ...I do all my configuration from files that I keep in subversion
>>> > (especially queue config, complex config). I find it makes this sort of
>>> > thing lots easier ;)
>>> >
>>> > On 05/11/12 12:07, Paul Simpson wrote:
>>> >> hi grid gurus,
>>> >>
>>> >> i've had a bad w/end where the disk which stored the db filled up. the
>>> >> grid came down and i couldn't fix the db using db_recover -c - which
>>> >> meant no grid engine (6.2u5).
>>> >>
>>> >> we need to get the system back up asap (like yesterday). so, we've
>>> >> installed a fresh version which is coming up. however, we've got a
>>> >> load of complexes, host groups, share-trees, parallel envs, etc. etc.
>>> >> that i can't seem to recover from the old system.
>>> >>
>>> >> i've looked through all the old dirs - but can't find any text files.
>>> >> can anyone suggest how this config information could possibly be
>>> >> recovered? typically, this has happened a day before a huge deadline -
>>> >> so time is not on our side.
>>> >>
>>> >> -paul
>>> >>
>>> >> _______________________________________________
>>> >> users mailing list
>>> >> users@gridengine.org
>>> >> https://gridengine.org/mailman/listinfo/users
>>> >
>>> > --
>>> > Tina Friedrich, Computer Systems Administrator, Diamond Light Source
>>> > Ltd
>>> > Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
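
Following up on the suggestion in the thread to keep the configuration as text files under version control: below is a minimal sketch of dumping the live configuration with qconf, so that there is something to restore from if the spool database is lost again. The names dump_config and run_qconf, the directory layout, and the choice of objects are assumptions made for illustration and do not come from the thread. The qconf options used (-sc, -sstree, -sql/-sq, -spl/-sp, -shgrpl/-shgrp) are the usual show and list options, but check them against the qconf man page of your version.

```python
import subprocess
from pathlib import Path

# Objects that a single qconf call prints in full: (file name, qconf options).
SINGLE = [
    ("complexes", ["-sc"]),      # all complex entries
    ("sharetree", ["-sstree"]),  # share tree
]

# Objects that are listed first, then shown one at a time:
# (subdirectory, list option, show option).
LISTED = [
    ("queues", "-sql", "-sq"),
    ("pe", "-spl", "-sp"),
    ("hostgroups", "-shgrpl", "-shgrp"),
]


def run_qconf(args):
    """Run qconf with the given options and return its standard output."""
    result = subprocess.run(
        ["qconf", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def dump_config(outdir, run=run_qconf):
    """Write the configuration as text files under outdir.

    Returns the relative names of the files written, in order.
    Objects for which qconf exits with an error (for example when no
    share tree is defined) are skipped.
    """
    outdir = Path(outdir)
    written = []

    def save(name, args):
        try:
            text = run(args)
        except subprocess.CalledProcessError:
            return  # nothing of this kind configured, or not readable
        path = outdir / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
        written.append(name)

    for name, args in SINGLE:
        save(name, args)

    for subdir, list_opt, show_opt in LISTED:
        try:
            names = run([list_opt]).split()
        except subprocess.CalledProcessError:
            continue  # no objects of this kind
        for obj in names:
            save(f"{subdir}/{obj}", [show_opt, obj])

    return written
```

Calling dump_config("sge-config") on a host where qconf works writes one file per queue, PE and host group plus the complex list, and the resulting directory can then be committed to subversion or any other version control system. For restoring, qconf has matching file-based options (-Mc, -Aq, -Ap, -Ahgrp, -Astree), but it is worth trying the round trip on a test cell before relying on it.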