Re: [gridengine users] sge_schedd exhausts all memory

Ron Chen Tue, 25 Oct 2011 11:21:41 -0700

If turning off schedd_job_info has no effect, check whether a corrupted job is 
causing the problem. You can use the techniques suggested by Joshua or, if you 
prefer, send SIGSTOP to schedd and let qmaster run by itself (they are 
separate processes in SGE 6.1), then use qstat to look for a big job array or 
anything else that looks unusual or wrong.
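
As a minimal sketch of the pause/resume idea (the function names are mine, and 
you have to find the PID of sge_schedd yourself, e.g. with ps), something like 
this would do it from Python. While schedd is stopped no jobs get scheduled, 
so remember to resume it afterwards:

```python
import os
import signal


def pause_process(pid: int) -> None:
    """Suspend a process with SIGSTOP (it stays in memory but stops running)."""
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid: int) -> None:
    """Let a previously suspended process continue with SIGCONT."""
    os.kill(pid, signal.SIGCONT)


# Hypothetical usage; 12345 stands in for the real sge_schedd PID:
#   pause_process(12345)    # schedd frozen, qmaster keeps answering qstat
#   ... inspect the job list with qstat ...
#   resume_process(12345)
```

You need to run this as root or as the user that owns the schedd process.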


If everything looks normal, and if you are using classic spooling, then you can 
dig into the individual jobs to find which one is causing the problem:

(Assume $SGE_CELL == default)


0) First, make a backup of the spooling directory (default/spool/qmaster).

1) Then remove all the jobs from the original spooling directory.
2) Start qmaster+schedd. If schedd is still using lots of memory, the cause is 
a configuration issue, a problem with glibc, or some other system issue, so 
not an SGE issue (look elsewhere).

3) Otherwise, copy half of the jobs from the backup to the original jobs 
directory.
4) Start qmaster+schedd. If schedd is using lots of memory, one of the jobs in 
that half is the problem, so keep splitting that half (binary search) and 
repeat this step.

5) Otherwise, the bad job should be in the other half; load that half instead 
and repeat steps 3 & 4.
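
The bisection in steps 1-5 can be sketched roughly as below. This is only an 
illustration, not an SGE tool: find_bad_job and triggers_problem are names I 
made up, and it assumes exactly one bad job. triggers_problem stands for the 
manual part (load the given jobs into the spool directory, start 
qmaster+schedd, watch the memory usage) and returns True if schedd blows up.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def find_bad_job(
    jobs: Sequence[T],
    triggers_problem: Callable[[List[T]], bool],
) -> T:
    """Binary-search for the single job that makes the problem appear."""
    candidates = list(jobs)
    if not candidates:
        raise ValueError("no jobs to search")

    # Steps 1-2: with no jobs loaded at all the problem must disappear,
    # otherwise the jobs are not the cause.
    if triggers_problem([]):
        raise RuntimeError("problem occurs without any jobs; look elsewhere")

    # Steps 3-5: test one half; keep it if it is bad, else keep the other half.
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if triggers_problem(first_half):
            candidates = first_half
        else:
            candidates = candidates[middle:]

    # The last remaining candidate is not re-tested here; under the
    # single-bad-job assumption it has to be the culprit.
    return candidates[0]
```

Each call to triggers_problem costs one qmaster+schedd restart, so for N jobs 
you need about log2(N) restarts plus the initial empty run. If more than one 
job is bad, this finds only one of them; remove it and search again.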


I believe similar techniques work with Berkeley DB spooling, but you will need 
to use db_dump & db_load to load portions of the jobs into qmaster. (Or, you 
can keep a backup of the "sge_job" DB (in default/spool/spooldb/) and use qdel 
to remove jobs, eliminating good jobs in order to find the bad ones.)

AGAIN, keep a backup of the spool directory whenever you are debugging and/or 
fooling around with the job information. Unless you don't care about the 
submitted jobs, or your *career*!


 -Ron




----- Original Message -----
From: SLIM H.A. <[email protected]>
To: [email protected]
Cc: 
Sent: Tuesday, October 25, 2011 1:12 PM
Subject: [gridengine users] sge_schedd exhausts all memory


Dear GridEngine users

After using GridEngine 6.1u6 for more than a year, a problem has suddenly
cropped up with the scheduler. The scheduler rapidly uses all the
available memory in the system and can ultimately crash the server.
After stopping qmaster, waiting until top shows normal memory usage and
restarting it, all memory is immediately claimed by sge_schedd again. I
have tried setting params profile=1 with qconf -msconf to monitor the
scheduler message file; the output after restarting qmaster is below. I
cannot see anything relevant, but maybe someone else has better insight.

Does anyone know another way to investigate this "memory leak"?

We are still stuck with 6.1u6 because 6.2u5 has a bug when compiling on
SUSE. 

Many thanks

Henk

10/24/2011 14:48:38|schedd|ham4in|E|callback function for event "106590.
EVENT DEL JOB 239618.1" failed
10/24/2011 15:02:51|schedd|ham4in|E|could not find job "239622" in
master list
10/24/2011 15:02:51|schedd|ham4in|E|callback function for event "113617.
EVENT DEL JOB 239622.1" failed
10/24/2011 15:04:32|schedd|ham4in|E|could not find job "239624" in
master list
10/24/2011 15:04:32|schedd|ham4in|E|callback function for event "114422.
EVENT DEL JOB 239624.1" failed
10/25/2011 15:00:37|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:14:11|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 16:26:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:13:31|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:24:54|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:16|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:34:33|schedd|ham4in|E|could not find job "240019" in
master list
10/25/2011 17:34:33|schedd|ham4in|E|callback function for event "80.
EVENT DEL JOB 240019.1" failed
10/25/2011 17:41:47|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.010, calc: 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: normalizing job tickets took
0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create active job orders:
0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job-order calculation took
0.020 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: job dispatching took 0.010 s
(0 fast, 0 comp, 4 pe, 0 res)
10/25/2011 17:41:49|schedd|ham4in|P|PROF: create pending job orders:
0.000 s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: scheduled in 0.090 (u 0.050 +
s 0.020 = 0.070): 0 sequential, 4 parallel, 134 orders, 232 H, 0 Q, 1425
QA, 49 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 13 J(x), 132 J(all), 47
C, 20 ACL, 4 PE, 14 U, 7 D, 18 PRJ, 1 ST, 0 CKPT, 0 RU, 1425 gMes, 7
jMes, 79/1 pre-send, -80/-90/-242 pe-alg
10/25/2011 17:41:49|schedd|ham4in|P|PROF: send orders and cleanup took:
0.310 (u 0.010,s 0.000) s
10/25/2011 17:41:49|schedd|ham4in|P|PROF: schedd run took: 0.440 s
(init: 0.000 s, copy: 0.030 s, run:0.400, free: 0.010 s, jobs: 132,
categories: 26/26)
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): profiling summary:

10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): other          : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): packing        : wc
=      0.000s, utime =      0.010s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): eventclient    : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): mirror         : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): gdi            : wc
=      0.310s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): ht-resize      : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler      : wc
=      0.060s, utime =      0.040s, stime =      0.010s, utilization =
83%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): pending ticket : wc
=      0.010s, utime =      0.000s, stime =      0.010s, utilization =
100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job sorting    : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): job dispatching: wc
=      0.010s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): send orders    : wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): scheduler event: wc
=      0.000s, utime =      0.000s, stime =      0.000s, utilization =
0%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): copy lists     : wc
=      0.040s, utime =      0.040s, stime =      0.000s, utilization =
100%
10/25/2011 17:41:49|schedd|ham4in|P|PROF(-37096352): total          : wc
=      0.440s, utime =      0.100s, stime =      0.020s, utilization =
27%
10/25/2011 17:42:04|schedd|ham4in|P|PROF: sge_mirror processed 1370
events in 0.010 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: normalizing job tickets took
0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: create active job orders:
0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job-order calculation took
0.000 s
10/25/2011 17:42:04|schedd|ham4in|P|PROF: job sorting took 0.000 s
10/25/2011 17:56:12|schedd|ham4in|I|starting up GE 6.1u6 (lx24-amd64)
10/25/2011 17:56:14|schedd|ham4in|P|PROF: static urgency took 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: normalizing job tickets took
0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: create active job orders:
0.000 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job-order calculation took
0.030 s
10/25/2011 17:56:14|schedd|ham4in|P|PROF: job sorting took 0.000 s

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

