On 06.11.2017 at 00:42, Derrick Lin wrote:

> Hi Reuti,
> 
> Before I make the change, I want to check this is the one I look at:
> 
> gid_range                    20000-20100
> 
> gid_range
> 
>        The gid_range is a comma-separated list of range expressions of the
>        form m-n, where m and n are integer numbers greater than 99, and m is
>        an abbreviation for m-m. These numbers are used in sge_execd(8) to
>        identify processes belonging to the same job.
> 
>        Each sge_execd(8) may use a separate set of group ids for this purpose.
>        All numbers in the group id range have to be unused supplementary group
>        ids on the system, where the sge_execd(8) is started.
> 
>        Changing gid_range will take immediate effect. There is no default for
>        gid_range. The administrator will have to assign a value for gid_range
>        during installation of Grid Engine.
> 
>        The global configuration entry for this value may be overwritten by the
>        execution host local configuration.
> 
> 
> It is true that the problematic hosts all seem to be busy with other jobs. 
> Also, array jobs are very commonly run on these hosts, and it is common to have 
> more than 100 subprocesses on each host. 
> 
> Is it safe to set it to something like 20000-20500?

Yes, unless your users' groups fall into this range.
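
As a quick sanity check before widening the range, something along these lines
will list any existing groups whose GID falls into it (just a sketch; adjust the
bounds to whatever range you settle on):

  getent group | awk -F: '$3 >= 20000 && $3 <= 20500 {print $1, $3}'

If that prints nothing, you can widen gid_range globally via `qconf -mconf`, or
per execution host via `qconf -mconf <hostname>` if only some hosts need the
larger range.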

-- Reuti


> 
> Cheers,
> Derrick
> 
> On Mon, Nov 6, 2017 at 9:57 AM, Reuti <[email protected]> wrote:
> Hi,
> 
> On 02.11.2017 at 11:39, Derrick Lin wrote:
> 
> > Hi Reuti,
> >
> > One of the users indicated that -S was used in his job:
> >
> > qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S 
> > /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G 
> > ./cheat_script_0.sge
> >
> > I have set up my own test that just does a simple dd on a local disk:
> >
> > #!/bin/bash
> > #
> > #$ -j y
> > #$ -cwd
> > #$ -N bigtmpfile
> > #$ -l h_vmem=32G
> > #
> >
> > echo "$HOST $tmp_requested $TMPDIR"
> >
> > dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200
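> > # the dd above writes 200 x 512M = 100G into the job's $TMPDIR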
> >
> > Our SGE has h_vmem=8gb as the default for any job that does not specify h_vmem 
> > explicitly. With h_vmem=8gb, some of my test jobs finished OK and some failed. I 
> > raised h_vmem to 32gb and re-launched 10 jobs; all of them completed 
> > successfully. But I found something interesting in the maxvmem values from the 
> > qacct -j results, such as:
> >
> > ru_nvcsw     46651
> > ru_nivcsw    1355
> > cpu          146.611
> > mem          87.885
> > io           199.501
> > iow          0.000
> > maxvmem      736.727M
> > arid         undefined
> >
> > The maxvmem values for those 10 jobs are:
> >
> > 1 x 9.920G
> > 1 x 5.540G
> > 8 x 736.727M
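> > 
> > A quick way to compare them side by side is a small loop like this (rough
> > sketch; JOBID1, JOBID2, ... are placeholders for the actual job numbers):
> > 
> > for j in JOBID1 JOBID2; do qacct -j $j | awk '/^jobnumber|^maxvmem/'; done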
> 
> Is anything else running on the nodes which by accident has the same 
> additional group ID (from the range you defined in `qconf -mconf`)? This additional 
> group ID is used to allow SGE to keep track of each job's resource 
> consumption. I vaguely remember an issue where former additional group IDs 
> were reused although they were still in use.
> 
> Can you please try to extend the range for the additional group IDs and check 
> whether the problem persists? Or, on the other hand, shrink the range and check 
> whether it happens more often.
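> 
> On the execution host you can also check which additional group ID a job's
> processes actually carry, by looking at their supplementary groups (rough
> sketch; replace <pid> with the PID of one of the job's processes):
> 
> grep Groups /proc/<pid>/status
> 
> If a GID from the gid_range shows up for the processes of two different jobs
> at the same time, that would confirm such a reuse.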
> 
> -- Reuti
> 
> 
> >
> > So this explains why my tests can fail if the default h_vmem=8gb is used. I have 
> > to confess that I don't have a full understanding of maxvmem inside SGE. Why do 
> > a few of the 10 jobs running the same command have a much higher maxvmem value?
> >
> > Regards,
> > Derrick
> >
> > On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > > On 02.11.2017 at 04:54, Derrick Lin <[email protected]> wrote:
> > >
> > > Dear all,
> > >
> > > Recently, users have reported that some of their jobs failed silently. I 
> > > picked one up to check and found:
> > >
> > > 11/02/2017 05:30:18|  main|delta-5-3|W|job 610608 exceeds job hard limit 
> > > "h_vmem" of queue "[email protected]" (8942456832.00000 > 
> > > limit:8589934592.00000) - sending SIGKILL
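> > >
> > > (8589934592 bytes is exactly 8 GiB. The queue's configured h_vmem limit
> > > can be double-checked with something like: qconf -sq short.q | grep h_vmem)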
> > >
> > > [root@alpha00 rocks_ansible]# qacct -j 610608
> > > ==============================================================
> > > qname        short.q
> > > hostname     xxxxxx.local
> > > group        g_xxxxxxx
> > > owner        glsai
> > > project      NONE
> > > department   xxxxxxx
> > > jobname      .name.out
> > > jobnumber    610608
> > > taskid       undefined
> > > account      sge
> > > priority     0
> > > qsub_time    Thu Nov  2 05:30:15 2017
> > > start_time   Thu Nov  2 05:30:17 2017
> > > end_time     Thu Nov  2 05:30:18 2017
> > > granted_pe   NONE
> > > slots        1
> > > failed       100 : assumedly after job
> > > exit_status  137
> > > ru_wallclock 1
> > > ru_utime     0.007
> > > ru_stime     0.006
> > > ru_maxrss    1388
> > > ru_ixrss     0
> > > ru_ismrss    0
> > > ru_idrss     0
> > > ru_isrss     0
> > > ru_minflt    640
> > > ru_majflt    0
> > > ru_nswap     0
> > > ru_inblock   0
> > > ru_oublock   16
> > > ru_msgsnd    0
> > > ru_msgrcv    0
> > > ru_nsignals  0
> > > ru_nvcsw     15
> > > ru_nivcsw    3
> > > cpu          0.013
> > > mem          0.000
> > > io           0.000
> > > iow          0.000
> > > maxvmem      8.328G
> > > arid         undefined
> > >
> > > So of course, it was killed for exceeding the h_vmem limit (exit status 
> > > 137 = 128 + 9). A few things are on my mind:
> > >
> > > 1) The same jobs have been running fine for a long time; they started failing 
> > > two weeks ago (nothing has been changed, since I was on holiday).
> > >
> > > 2) The job fails almost instantly (after about 1 second). It seems to fail 
> > > on the very first command, which is a "cd" to a directory followed by printing 
> > > some output. There is no way a "cd" command can consume 8GB of memory, right?
> >
> > Depends on the command interpreter. Maybe it's a huge bash version. Is bash 
> > addressed in the #! line of the script, and do any #$ lines for SGE have the 
> > proper format? Or do you use the -S option to SGE?
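> >
> > E.g. the usual layout at the top of the script would be something like
> > (just a sketch):
> >
> > #!/bin/bash
> > #$ -S /bin/bash
> > #$ -cwd
> >
> > With -S the named shell is what actually interprets the job script, rather
> > than whatever the queue's shell setting would otherwise pick.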
> >
> > -- Reuti
> >
> >
> > > 3) The same job will likely run successfully after re-submission. So 
> > > currently our users just keep re-submitting the failed jobs until they 
> > > run successfully.
> > >
> > > 4) This happens on multiple execution hosts and multiple queues, so it 
> > > seems not to be host- or queue-specific.
> > >
> > > I am wondering whether this could possibly be caused by the qmaster?
> > >
> > > Regards,
> > > Derrick
> >
> >
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
