If turning off schedd_job_info has no effect, check whether a corrupted job is causing the problem. You can use the techniques Joshua suggested or, if you like, send SIGSTOP to schedd so that qmaster runs by itself (they are separate processes in SGE 6.1), and then use qstat to look for a big job array or anything else that looks different or wrong.
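As a rough sketch of the SIGSTOP idea, something like the following could be used on the qmaster host. The helper names pause_proc and resume_proc are made up for illustration, and the process name sge_schedd is taken from the original report; the SGE commands are left commented out so nothing runs by accident.

```shell
#!/bin/sh
# Sketch only: freeze a daemon, inspect the cluster, then let it continue.
# pause_proc and resume_proc are hypothetical helper names, not SGE tools.

pause_proc() {
    # SIGSTOP cannot be caught or ignored, so the process stops where it is.
    kill -STOP "$1"
}

resume_proc() {
    # SIGCONT lets a stopped process carry on.
    kill -CONT "$1"
}

# Possible use (assumes a single scheduler process named sge_schedd):
#   pid=$(pgrep -x sge_schedd)
#   pause_proc "$pid"
#   qstat -u '*'        # list jobs of all users while only qmaster is active
#   resume_proc "$pid"
```

While schedd is stopped it cannot grow any further, so qmaster stays responsive long enough to look at the job list.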
If everything looks normal, and if you are using classic spooling, then you can dig into the individual jobs to find out which one is wrong:

(Assume $SGE_CELL == default)

0) first, make a backup of the spooling directory (default/spool/qmaster)
1) then, remove all the jobs from the original spooling directory
2) start qmaster+schedd; if schedd is already using lots of memory with no jobs loaded == configuration issue, a problem with glibc, or other system issues, so not an SGE issue (look elsewhere).
3) else, copy half of the jobs from the backup to the original jobs directory.
4) start qmaster+schedd; if schedd is using lots of memory == problem with one of the jobs in that half, so you keep splitting it (use binary search) and repeat this step.
5) else, the bad job should be in the other half of the jobs, so load that half and repeat steps 3 & 4.

I believe similar techniques work with Berkeley DB spooling, but you will need to use db_dump & db_load to load portions of the jobs into qmaster. (Or, you can keep a backup of the "sge_job" DB (in default/spool/spooldb/) and use qdel to remove jobs, eliminating good jobs in order to find the bad ones.)

AGAIN, have a backup of the spool directory when you are debugging and/or fooling around with the job information. Unless you don't care about the submitted jobs, or your *career*!

-Ron

----- Original Message -----
From: SLIM H.A. <[email protected]>
To: [email protected]
Cc:
Sent: Tuesday, October 25, 2011 1:12 PM
Subject: [gridengine users] sge_schedd exhausts all memory

Dear GridEngine users

After using GridEngine 6.1u6 for more than a year, a problem has suddenly cropped up with the scheduler. The scheduler rapidly uses all the available memory in the system and can ultimately crash the server. When I stop qmaster, wait until top shows normal memory usage and restart it, all memory is immediately claimed by sge_schedd again.
I have tried setting params profile=1 with qconf -msconf to monitor the scheduler message file; the output after restarting qmaster is below.
I cannot see anything relevant but maybe someone else has a better insight.
Does anyone know another way to investigate this "memory leak"?
We are still stuck with 6.1u6 because 6.2u5 has a bug when compiling on SUSE.

Many thanks

Henk

10/24/2011 14:48:38|schedd|ham4in|E|callback function for event "106590. EVENT DEL JOB 239618.1" failed
10/24/2011 15:02:51|schedd|ham4in|E|could not find job "239622" in master list
10/24/2011 15:02:51|schedd|ham4in|E|callback function for event "113617. EVENT DEL JOB 239622.1" failed
10/24/2011 15:04:32|schedd|ham4in|E|could not find job "239624" in master list
10/24/2011 15:04:32|schedd|ham4in|E|callback function for event "114422. EVENT DEL JOB 239624.1" failed
10/25/2011 15:00:37|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:14:11|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:26:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:13:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:24:54|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:16|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in master list
10/25/2011 17:34:33|schedd|ham4in|E|callback function for event "80. EVENT DEL JOB 240019.1" failed
10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.010, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: normalizing job tickets took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create active job orders: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job-order calculation took 0.020 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job dispatching took 0.010 s (0 fast, 0 comp, 4 pe, 0 res)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create pending job orders: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: scheduled in 0.090 (u 0.050 + s 0.020 = 0.070): 0 sequential, 4 parallel, 134 orders, 232 H, 0 Q, 1425 QA, 49 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 13 J(x), 132 J(all), 47 C, 20 ACL, 4 PE, 14 U, 7 D, 18 PRJ, 1 ST, 0 CKPT, 0 RU, 1425 gMes, 7 jMes, 79/1 pre-send, -80/-90/-242 pe-alg
10/25/2011 17:41:49|schedd|ham4in|P|PROF: send orders and cleanup took: 0.310 (u 0.010,s 0.000) s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s (init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132, categories: 26/26)
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): profiling summary:
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): other : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): packing : wc = 0.000s, utime = 0.010s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): eventclient : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): mirror : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): gdi : wc = 0.310s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): ht-resize : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler : wc = 0.060s, utime = 0.040s, stime = 0.010s, utilization = 83%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): pending ticket : wc = 0.010s, utime = 0.000s, stime = 0.010s, utilization = 100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job sorting : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job dispatching: wc = 0.010s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): send orders : wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler event: wc = 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): copy lists : wc = 0.040s, utime = 0.040s, stime = 0.000s, utilization = 100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): total : wc = 0.440s, utime = 0.100s, stime = 0.020s, utilization = 27%
10/25/2011 17:42:04|schedd|ham4in|P|PROF: sge_mirror processed 1370 events in 0.010 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: normalizing job tickets took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: create active job orders: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job-order calculation took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:56:12|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:56:14|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: normalizing job tickets took 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: create active job orders: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job-order calculation took 0.030 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job sorting took 0.000 s

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
