Dear GridEngine users,

After using GridEngine 6.1u6 for more than a year, a problem has suddenly cropped up with the scheduler. The scheduler rapidly uses all the available memory in the system and can ultimately crash the server. If I stop qmaster, wait until top shows normal memory usage and restart it, sge_schedd immediately claims all the memory again. I have tried setting params to profile=1 with qconf -msconf to get profiling output in the scheduler messages file; the output after restarting qmaster is below. I cannot see anything relevant, but maybe someone else has better insight.

Does anyone know another way to investigate this "memory leak"? We are still stuck with 6.1u6 because 6.2u5 has a bug when compiling on SUSE.

Many thanks,
Henk

10/24/2011 14:48:38|schedd|ham4in|E|callback function for event "106590. EVENT DEL JOB 239618.1" failed
10/24/2011 15:02:51|schedd|ham4in|E|could not find job "239622" in master list
10/24/2011 15:02:51|schedd|ham4in|E|callback function for event "113617. EVENT DEL JOB 239622.1" failed
10/24/2011 15:04:32|schedd|ham4in|E|could not find job "239624" in master list
10/24/2011 15:04:32|schedd|ham4in|E|callback function for event "114422. EVENT DEL JOB 239624.1" failed
10/25/2011 15:00:37|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:14:11|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:26:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:13:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:24:54|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:16|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in master list
10/25/2011 17:34:33|schedd|ham4in|E|callback function for event "80. EVENT DEL JOB 240019.1" failed
10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.010, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: normalizing job tickets took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create active job orders: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job-order calculation took 0.020 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job dispatching took 0.010 s (0 fast, 0 comp, 4 pe, 0 res)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create pending job orders: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: scheduled in 0.090 (u 0.050 + s 0.020 = 0.070): 0 sequential, 4 parallel, 134 orders, 232 H, 0 Q, 1425 QA, 49 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 13 J(x), 132 J(all), 47 C, 20 ACL, 4 PE, 14 U, 7 D, 18 PRJ, 1 ST, 0 CKPT, 0 RU, 1425 gMes, 7 jMes, 79/1 pre-send, -80/-90/-242 pe-alg
10/25/2011 17:41:49|schedd|ham4in|P|PROF: send orders and cleanup took: 0.310 (u 0.010,s 0.000) s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s (init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132, categories: 26/26)
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): profiling summary:
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): other : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): packing : wc = 0.000s, utime = 0.010s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): eventclient : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): mirror : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): gdi : wc = 0.310s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): ht-resize : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler : wc = 0.060s, utime = 0.040s, stime = 0.010s, utilization = 83%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): pending ticket : wc = 0.010s, utime = 0.000s, stime = 0.010s, utilization = 100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job sorting : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job dispatching: wc = 0.010s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): send orders : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler event: wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): copy lists : wc = 0.040s, utime = 0.040s, stime = 0.000s, utilization = 100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): total : wc = 0.440s, utime = 0.100s, stime = 0.020s, utilization = 27%
10/25/2011 17:42:04|schedd|ham4in|P|PROF: sge_mirror processed 1370 events in 0.010 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: normalizing job tickets took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: create active job orders: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job-order calculation took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:56:12|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:56:14|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: normalizing job tickets took 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: create active job orders: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job-order calculation took 0.030 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job sorting took 0.000 s
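
A long messages file like this is easier to scan once summarized. The following is only a rough sketch in Python, not a GridEngine tool: it assumes every entry follows the "date time|daemon|host|level|message" layout seen in the output above, and the names summarize, LINE_RE, RUN_RE, MISSING_JOB_RE and SAMPLE are made up for illustration. It counts entries per level, collects scheduler restarts, the "schedd run took" timings, and the job IDs reported as missing from the master list.

```python
import re
from collections import Counter

# Assumed entry layout, inferred from the log output in this post:
#   MM/DD/YYYY HH:MM:SS|daemon|host|level|message
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)
# Matches e.g. "PROF: schedd run took: 0.440 s (..., jobs: 132, ...)"
RUN_RE = re.compile(r"schedd run took: ([\d.]+) s .*jobs: (\d+)")
# Matches e.g. 'could not find job "240019" in master list'
MISSING_JOB_RE = re.compile(r'could not find job "(\d+)" in master list')


def summarize(lines):
    """Summarize scheduler log entries; lines that do not match are skipped."""
    levels = Counter()   # number of entries per level letter (E, I, P, ...)
    startups = []        # timestamps of "starting up" entries
    runs = []            # (timestamp, run seconds, job count) per scheduler run
    missing_jobs = []    # job IDs the scheduler could not find
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, _daemon, _host, level, message = match.groups()
        levels[level] += 1
        if message.startswith("starting up"):
            startups.append(stamp)
        run = RUN_RE.search(message)
        if run:
            runs.append((stamp, float(run.group(1)), int(run.group(2))))
        job = MISSING_JOB_RE.search(message)
        if job:
            missing_jobs.append(job.group(1))
    return {
        "levels": dict(levels),
        "startups": startups,
        "runs": runs,
        "missing_jobs": missing_jobs,
    }


# Three entries copied from the output above, used as a small demonstration.
SAMPLE = [
    '10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in master list',
    "10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)",
    "10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s "
    "(init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132, categories: 26/26)",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it on a real file, pass an open file object instead of SAMPLE, for example summarize(open(path)) with path pointing at the scheduler messages file; where that file lives depends on the local spool setup. The summary only shows how often the scheduler restarted and how long each run took, it does not measure memory use itself.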
