If you don't need the scheduler info, you can turn off "schedd_job_info".
http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html

SGE 6.2 sets this parameter to false by default, BTW (see page 29):
http://www.oracle.com/technetwork/oem/host-server-mgmt/twp-gridengine-beginner-167116.pdf

Last year, a client's SGE 6.0 cluster was having the same issue with
memory, and I turned off schedd_job_info. The cluster runs tens of
thousands of jobs per day, and the qmaster & schedd were using no more
than 200 MB after the change.
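
For anyone who wants to script the change rather than edit it interactively
with qconf -msconf, here is a minimal sketch. It assumes the output of
qconf -ssconf can be fed back unchanged to qconf -Msconf on your version;
the helper name and the /tmp path are only illustrative, and the qconf calls
are left as comments because they need manager rights on the qmaster host.

```shell
# Read a scheduler configuration on stdin (in the format printed by
# `qconf -ssconf`) and write it to stdout with schedd_job_info set to false.
# All other lines pass through unchanged.
disable_schedd_job_info() {
    sed -E 's/^(schedd_job_info[[:space:]]+).*/\1false/'
}

# Intended use (assumption: run as an SGE manager on the qmaster host):
#   qconf -ssconf | disable_schedd_job_info > /tmp/sched_conf.new
#   qconf -Msconf /tmp/sched_conf.new
#   qconf -ssconf | grep schedd_job_info
```

Keep in mind that with schedd_job_info set to false, qstat -j no longer
reports why a pending job has not been scheduled.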

Rayson



On Tue, Oct 25, 2011 at 1:12 PM, SLIM H.A. <[email protected]> wrote:
>
> Dear GridEngine users
>
> After using GridEngine 6.1u6 for more than a year, a problem has
> suddenly cropped up with the scheduler. The scheduler rapidly uses all
> the available memory in the system and can ultimately crash the server.
> If I stop qmaster, wait until top shows normal memory usage, and
> restart it, sge_schedd immediately claims all the memory again. I have
> tried setting params to profile=1 with qconf -msconf to monitor the
> scheduler message file; the output after restarting qmaster is below.
> I cannot see anything relevant, but maybe someone else has better
> insight.
>
> Does anyone know another way to investigate this "memory leak"?
>
> We are still stuck with 6.1u6 because 6.2u5 has a bug when compiling on
> SUSE.
>
> Many thanks
>
> Henk
>
> 10/24/2011 14:48:38|schedd|ham4in|E|callback function for event "106590.
> EVENT DEL JOB 239618.1" failed
> 10/24/2011 15:02:51|schedd|ham4in|E|could not find job "239622" in
> master list
> 10/24/2011 15:02:51|schedd|ham4in|E|callback function for event "113617.
> EVENT DEL JOB 239622.1" failed
> 10/24/2011 15:04:32|schedd|ham4in|E|could not find job "239624" in
> master list
> 10/24/2011 15:04:32|schedd|ham4in|E|callback function for event "114422.
> EVENT DEL JOB 239624.1" failed
> 10/25/2011 15:00:37|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 16:14:11|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 16:26:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:13:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:24:54|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:34:16|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in
> master list
> 10/25/2011 17:34:33|schedd|ham4in|E|callback function for event "80.
> EVENT DEL JOB 240019.1" failed
> 10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: static urgency took 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.010, calc: 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: normalizing job tickets took
> 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: create active job orders:
> 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job-order calculation took
> 0.020 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job sorting took 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job dispatching took 0.010 s
> (0 fast, 0 comp, 4 pe, 0 res)
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: create pending job orders:
> 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: scheduled in 0.090 (u 0.050 +
> s 0.020 = 0.070): 0 sequential, 4 parallel, 134 orders, 232 H, 0 Q, 1425
> QA, 49 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 13 J(x), 132 J(all), 47
> C, 20 ACL, 4 PE, 14 U, 7 D, 18 PRJ, 1 ST, 0 CKPT, 0 RU, 1425 gMes, 7
> jMes, 79/1 pre-send, -80/-90/-242 pe-alg
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: send orders and cleanup took:
> 0.310 (u 0.010,s 0.000) s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s
> (init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132,
> categories: 26/26)
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): profiling summary:
>
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): other          : wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): packing        : wc
> =      0.000s, utime =      0.010s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): eventclient    : wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): mirror         : wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): gdi            : wc
> =      0.310s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): ht-resize      : wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler      : wc
> =      0.060s, utime =      0.040s, stime =      0.010s, utilization =
> 83%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): pending ticket : wc
> =      0.010s, utime =      0.000s, stime =      0.010s, utilization =
> 100%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job sorting    : wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job dispatching: wc
> =      0.010s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): send orders    : wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler event: wc
> =      0.000s, utime =      0.000s, stime =      0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): copy lists     : wc
> =      0.040s, utime =      0.040s, stime =      0.000s, utilization =
> 100%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): total          : wc
> =      0.440s, utime =      0.100s, stime =      0.020s, utilization =
> 27%
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: sge_mirror processed 1370
> events in 0.010 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: static urgency took 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: normalizing job tickets took
> 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: create active job orders:
> 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job-order calculation took
> 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job sorting took 0.000 s
> 10/25/2011 17:56:12|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: static urgency took 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: normalizing job tickets took
> 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: create active job orders:
> 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job-order calculation took
> 0.030 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job sorting took 0.000 s
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
