If you don't need the scheduler info (the per-job scheduling messages that qstat -j reports), you can turn off "schedd_job_info": http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
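For what it's worth, here is a minimal sketch of making that change non-interactively. It assumes qconf is on the PATH and that it is run as an SGE manager; the function names (disable_job_info, apply_with_qconf) are purely illustrative and not part of SGE. It relies on qconf -ssconf to print the scheduler configuration and on qconf -Msconf to load a modified copy from a file. Editing by hand with qconf -msconf works just as well.

```python
import subprocess
import tempfile


def disable_job_info(sconf_text: str) -> str:
    """Return scheduler config text with schedd_job_info set to false.

    Every other line is passed through unchanged.
    """
    out = []
    for line in sconf_text.splitlines():
        fields = line.split(None, 1)
        if fields and fields[0] == "schedd_job_info":
            # Replace the whole value (true, or a job_list setting).
            line = "schedd_job_info".ljust(34) + "false"
        out.append(line)
    return "\n".join(out) + "\n"


def apply_with_qconf() -> None:
    """Dump the scheduler config, flip the setting, and load it back.

    Assumption: qconf is on PATH and the caller has manager rights.
    """
    current = subprocess.run(
        ["qconf", "-ssconf"], check=True, capture_output=True, text=True
    ).stdout
    with tempfile.NamedTemporaryFile("w", suffix=".sconf", delete=False) as tmp:
        tmp.write(disable_job_info(current))
    # The temporary file is closed here, so qconf can read it.
    subprocess.run(["qconf", "-Msconf", tmp.name], check=True)
```

Calling apply_with_qconf() on the qmaster host should be enough under those assumptions; checking the result afterwards with qconf -ssconf is a good idea.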
SGE 6.2 sets this parameter to false by default, BTW (see page 29):
http://www.oracle.com/technetwork/oem/host-server-mgmt/twp-gridengine-beginner-167116.pdf

Last year, a client's SGE 6.0 cluster was having the same issue with memory, and I turned off schedd_job_info. The cluster runs tens of thousands of jobs per day, and the qmaster & schedd were using no more than 200 MB after the change.

Rayson

On Tue, Oct 25, 2011 at 1:12 PM, SLIM H.A. <[email protected]> wrote:
>
> Dear GridEngine users
>
> After using GridEngine 6.1u6 for more than a year a problem has cropped
> up suddenly with the scheduler. The scheduler uses rapidly all the
> available memory in the system and can ultimately crash the server.
> Stopping qmaster, waiting until top shows a normal memory usage and
> restarting it, immediately all memory is claimed by sge_schedd. I have
> tried setting the params profile=1 setting with qconf -msconf to
> monitor the scheduler message file, the output after restarting qmaster
> is below. I cannot see anything relevant but maybe someone else has a
> better insight.
>
> Does anyone know another way to investigate this "memory leak"?
>
> We are still stuck with 6.1u6 because 6.2u5 has a bug when compiling on
> SUSE.
>
> Many thanks
>
> Henk
>
> 10/24/2011 14:48:38|schedd|ham4in|E|callback function for event "106590.
> EVENT DEL JOB 239618.1" failed
> 10/24/2011 15:02:51|schedd|ham4in|E|could not find job "239622" in
> master list
> 10/24/2011 15:02:51|schedd|ham4in|E|callback function for event "113617.
> EVENT DEL JOB 239622.1" failed
> 10/24/2011 15:04:32|schedd|ham4in|E|could not find job "239624" in
> master list
> 10/24/2011 15:04:32|schedd|ham4in|E|callback function for event "114422.
> EVENT DEL JOB 239624.1" failed
> 10/25/2011 15:00:37|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 16:14:11|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 16:26:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:13:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:24:54|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:34:16|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in
> master list
> 10/25/2011 17:34:33|schedd|ham4in|E|callback function for event "80.
> EVENT DEL JOB 240019.1" failed
> 10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: static urgency took 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.010, calc: 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: normalizing job tickets took
> 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: create active job orders:
> 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job-order calculation took
> 0.020 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job sorting took 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: job dispatching took 0.010 s
> (0 fast, 0 comp, 4 pe, 0 res)
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: create pending job orders:
> 0.000 s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: scheduled in 0.090 (u 0.050 +
> s 0.020 = 0.070): 0 sequential, 4 parallel, 134 orders, 232 H, 0 Q, 1425
> QA, 49 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 13 J(x), 132 J(all), 47
> C, 20 ACL, 4 PE, 14 U, 7 D, 18 PRJ, 1 ST, 0 CKPT, 0 RU, 1425 gMes, 7
> jMes, 79/1 pre-send, -80/-90/-242 pe-alg
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: send orders and cleanup took:
> 0.310 (u 0.010,s 0.000) s
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s
> (init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132,
> categories: 26/26)
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): profiling summary:
>
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): other : wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): packing : wc
> = 0.000s, utime = 0.010s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): eventclient : wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): mirror : wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): gdi : wc
> = 0.310s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): ht-resize : wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler : wc
> = 0.060s, utime = 0.040s, stime = 0.010s, utilization =
> 83%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): pending ticket : wc
> = 0.010s, utime = 0.000s, stime = 0.010s, utilization =
> 100%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job sorting : wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job dispatching: wc
> = 0.010s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): send orders : wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler event: wc
> = 0.000s, utime = 0.000s, stime = 0.000s, utilization =
> 0%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): copy lists : wc
> = 0.040s, utime = 0.040s, stime = 0.000s, utilization =
> 100%
> 10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): total : wc
> = 0.440s, utime = 0.100s, stime = 0.020s, utilization =
> 27%
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: sge_mirror processed 1370
> events in 0.010 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: static urgency took 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: normalizing job tickets took
> 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: create active job orders:
> 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job-order calculation took
> 0.000 s
> 10/25/2011 17:42:04|schedd|ham4in|P|PROF: job sorting took 0.000 s
> 10/25/2011 17:56:12|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: static urgency took 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
> 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: normalizing job tickets took
> 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: create active job orders:
> 0.000 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job-order calculation took
> 0.030 s
> 10/25/2011 17:56:14|schedd|ham4in|P|PROF: job sorting took 0.000 s
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
