Dear GridEngine users,

After more than a year of running GridEngine 6.1u6, a problem has
suddenly cropped up with the scheduler. It rapidly uses all the
available memory in the system and can ultimately crash the server.
If I stop qmaster, wait until top shows normal memory usage and
restart it, sge_schedd immediately claims all the memory again. I have
set params profile=1 with qconf -msconf to get profiling output in the
scheduler messages file; the output after restarting qmaster is below.
I cannot see anything relevant in it, but maybe someone else has
better insight.

Does anyone know another way to investigate this "memory leak"?
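
One low-tech way to at least put numbers on the growth is to sample the
resident memory of sge_schedd straight after a restart. The following is
only a minimal sketch, assuming a Linux host with /proc mounted; the
function names (parse_status, find_pid, watch_rss) are made up for
illustration and are not part of GridEngine.

```python
import re
import time
from pathlib import Path


def parse_status(status_text):
    """Extract (process name, resident set size in kB) from /proc/<pid>/status text.

    The size is None for entries without a VmRSS line (kernel threads).
    """
    name = re.search(r"^Name:\s+(\S+)", status_text, re.MULTILINE)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return (name.group(1) if name else None,
            int(rss.group(1)) if rss else None)


def find_pid(process_name="sge_schedd"):
    """Return the PID of the first process with this name, or None."""
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            name, _ = parse_status(status.read_text(errors="replace"))
        except OSError:
            continue  # the process exited while we were scanning
        if name == process_name:
            return int(status.parent.name)
    return None


def watch_rss(pid, interval=1.0, samples=60):
    """Print a timestamped VmRSS reading every `interval` seconds."""
    for _ in range(samples):
        try:
            text = Path(f"/proc/{pid}/status").read_text(errors="replace")
        except OSError:
            print("process is gone")
            return
        _, rss_kb = parse_status(text)
        print(time.strftime("%H:%M:%S"), rss_kb, "kB")
        time.sleep(interval)
```

Calling watch_rss(find_pid()) right after restarting qmaster would give
timestamps that can be lined up against the PROF lines in the scheduler
messages file, which may show which scheduling step the growth
coincides with.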

We are still stuck with 6.1u6 because 6.2u5 has a bug when compiling on
SUSE. 

Many thanks

Henk

10/24/2011 14:48:38|schedd|ham4in|E|callback function for event "106590.
EVENT DEL JOB 239618.1" failed
10/24/2011 15:02:51|schedd|ham4in|E|could not find job "239622" in
master list
10/24/2011 15:02:51|schedd|ham4in|E|callback function for event "113617.
EVENT DEL JOB 239622.1" failed
10/24/2011 15:04:32|schedd|ham4in|E|could not find job "239624" in
master list
10/24/2011 15:04:32|schedd|ham4in|E|callback function for event "114422.
EVENT DEL JOB 239624.1" failed
10/25/2011 15:00:37|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:14:11|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:26:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:13:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:24:54|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:16|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in
master list
10/25/2011 17:34:33|schedd|ham4in|E|callback function for event "80.
EVENT DEL JOB 240019.1" failed
10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.010, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: normalizing job tickets took
0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create active job orders:
0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job-order calculation took
0.020 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job dispatching took 0.010 s
(0 fast, 0 comp, 4 pe, 0 res)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create pending job orders:
0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: scheduled in 0.090 (u 0.050 +
s 0.020 = 0.070): 0 sequential, 4 parallel, 134 orders, 232 H, 0 Q, 1425
QA, 49 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 13 J(x), 132 J(all), 47
C, 20 ACL, 4 PE, 14 U, 7 D, 18 PRJ, 1 ST, 0 CKPT, 0 RU, 1425 gMes, 7
jMes, 79/1 pre-send, -80/-90/-242 pe-alg
10/25/2011 17:41:49|schedd|ham4in|P|PROF: send orders and cleanup took:
0.310 (u 0.010,s 0.000) s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s
(init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132,
categories: 26/26)
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): profiling summary:

10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): other          : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): packing        : wc
=      0.000s, utime =      0.010s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): eventclient    : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): mirror         : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): gdi            : wc
=      0.310s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): ht-resize      : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler      : wc
=      0.060s, utime =      0.040s, stime =      0.010s, utilization =
83%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): pending ticket : wc
=      0.010s, utime =      0.000s, stime =      0.010s, utilization =
100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job sorting    : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job dispatching: wc
=      0.010s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): send orders    : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler event: wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): copy lists     : wc
=      0.040s, utime =      0.040s, stime =      0.000s, utilization =
100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): total          : wc
=      0.440s, utime =      0.100s, stime =      0.020s, utilization =
27%
10/25/2011 17:42:04|schedd|ham4in|P|PROF: sge_mirror processed 1370
events in 0.010 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: normalizing job tickets took
0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: create active job orders:
0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job-order calculation took
0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:56:12|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:56:14|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: normalizing job tickets took
0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: create active job orders:
0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job-order calculation took
0.030 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job sorting took 0.000 s
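
For anyone who wants to go through a longer messages file than the
excerpt above, here is a rough sketch that splits each entry on the "|"
separators and reports the last message logged before every scheduler
restart. It assumes the file on disk has one entry per line (the
wrapping above comes from the mail client), and the names parse_line
and summarize are hypothetical helpers, not GridEngine tools.

```python
from collections import Counter


def parse_line(line):
    """Split 'timestamp|daemon|host|level|message' into a dict, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # continuation line or anything else that is not an entry
    timestamp, daemon, host, level, message = parts
    return {"timestamp": timestamp, "daemon": daemon, "host": host,
            "level": level, "message": message}


def summarize(lines):
    """Count entries per level and collect the message preceding each startup."""
    levels = Counter()
    last_before_restart = []
    previous = None
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        levels[record["level"]] += 1
        if record["message"].startswith("starting up") and previous is not None:
            last_before_restart.append(previous["message"])
        previous = record
    return levels, last_before_restart
```

Passing an open file object to summarize works, since it only iterates
over lines. The second return value lists what the scheduler last
reported before each "starting up" entry.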
