Hi All.
We are running SGE 8.1.8 under CentOS 6.6.
Everything was running just fine, and then all of a sudden sge_qmaster went to 100% CPU and SGE
became *VERY* slow and sluggish, especially qrsh. Everything still works, just very, VERY
slowly.
We have around 100 nodes and some 400 jobs waiting to run, most of them job arrays. The number of jobs is normal to low for our cluster, and all has been great until now. When we run qrsh, we usually get
a node in 20 seconds or less. Now it can take 10-15 minutes for qrsh to respond. I have looked through all the logs and there is nothing obviously wrong.
I ran qping from a compute node and got the following data:
[root@compute-7-7 ~]# qping -info hpc-s $SGE_QMASTER_PORT qmaster 1
05/22/2015 10:54:31:
SIRM version: 0.1
SIRM message id: 1
start time: 05/22/2015 10:50:47 (1432317047)
run time [s]: 224
messages in read buffer: 0
messages in write buffer: 0
no. of connected clients: 103
status: 1
info: MAIN: R (224.43) | signaler000: R (223.48) | event_master000: R (0.70) | timer000: R (3.70) | worker000: R (3.23) | worker001: R (1.21) | listener000: R (1.21) | listener001: R (4.24) | scheduler000: W (132.70) | WARNING
malloc: arena(51183616) |ordblks(5826) | smblks(27120) | hblksr(1) | hblhkd(8908800) usmblks(0) | fsmblks(1207424) | uordblks(39004960) | fordblks(12178656) | keepcost(68192)
Monitor:
05/22/2015 10:50:47 | MAIN: no monitoring data available
05/22/2015 10:50:48 | signaler000: no monitoring data available
05/22/2015 10:53:48 | event_master000: runs: 1.00r/s (clients: 1.00 mod: 0.00/s ack: 0.00/s blocked: 0.00 busy: 0.47 | events: 0.00/s added: 0.00/s skipt: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% wait: 0.00% time: 60.00s
05/22/2015 10:53:48 | timer000: runs: 0.45r/s (pending: 12.00 executed: 0.45/s) out: 0.00m/s APT: 0.0001s/m idle: 99.99% wait: 0.00% time: 60.00s
05/22/2015 10:53:52 | worker000: runs: 0.27r/s (EXECD (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.00,g:2.71,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.27m/s APT: 0.0025s/m idle: 99.93% wait: 0.00% time: 62.63s
05/22/2015 10:53:54 | worker001: runs: 0.27r/s (EXECD (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s GDI (a:0.00,g:2.71,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.27m/s APT: 0.0027s/m idle: 99.93% wait: 0.00% time: 62.63s
05/22/2015 10:53:54 | listener000: runs: 0.31r/s (in (g:0.20 a:0.00 e:0.00 r:0.00)/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% wait: 0.00% time: 60.62s
05/22/2015 10:53:49 | listener001: runs: 0.40r/s (in (g:0.39 a:0.00 e:0.00 r:0.00)/s) out: 0.00m/s APT: 0.0001s/m idle: 100.00% wait: 0.00% time: 59.60s
05/22/2015 10:50:48 | scheduler000: no monitoring data available
However, I am not sure how to read it. Nothing stands out that I can see.
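To make the "info:" line of the qping output easier to compare thread by thread, here is a minimal sketch that splits it into one row per thread. The helper name parse_thread_info is made up for illustration and is not part of SGE. It assumes each entry has the form "name: STATE (number)"; my guess, which I have not confirmed, is that the letter is a per-thread state and the number is the seconds since that thread last reported in.

```python
import re

# Hypothetical helper, not part of SGE: splits the "info:" line printed by
# `qping -info` into (thread, state, value) tuples.
# Assumption: every entry looks like "name: STATE (number)".
THREAD_RE = re.compile(r"(\w+):\s*([A-Z])\s*\(([\d.]+)\)")


def parse_thread_info(info_line):
    """Return a list of (thread_name, state_letter, value) tuples."""
    return [
        (name, state, float(value))
        for name, state, value in THREAD_RE.findall(info_line)
    ]


# The info line from the qping output above, with the mail line wrap removed.
info = (
    "MAIN: R (224.43) | signaler000: R (223.48) | event_master000: R (0.70) | "
    "timer000: R (3.70) | worker000: R (3.23) | worker001: R (1.21) | "
    "listener000: R (1.21) | listener001: R (4.24) | "
    "scheduler000: W (132.70) | WARNING"
)

threads = parse_thread_info(info)
for name, state, value in threads:
    # Mark every thread whose state letter differs from "R".
    marker = "" if state == "R" else "  <-- state is not R"
    print(f"{name:<16} {state} {value:>8.2f}{marker}")
```

Run against the line above, this lists nine threads and marks scheduler000 as the only one whose state letter is not R.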
Any help would be greatly appreciated.
Thanks,
Joseph