Very tempting, but the user would have my head :-)

Regarding my previous post: I think I may have found the issue. I turned max_reservation down from 100 to 10, and sge_qmaster now seems to be taking a breather instead of staying at 100% CPU utilization. Time will tell.
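
For anyone who wants to script the same change, here is a minimal sketch. It assumes the scheduler configuration has been dumped to text with `qconf -ssconf` and will be loaded back with `qconf -Msconf <file>`; the helper name `set_sched_param` and the column width are my own choices for illustration, not part of Grid Engine.

```python
def set_sched_param(conf_text: str, name: str, value: str) -> str:
    """Return conf_text with the line for parameter `name` set to `value`.

    conf_text is assumed to be "name <whitespace> value" lines, which is
    how `qconf -ssconf` prints the scheduler configuration.
    """
    out = []
    found = False
    for line in conf_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == name:
            # Pad the name so the value lands in a readable column
            # (the width of 33 is an arbitrary choice for this sketch).
            out.append(f"{name:<33}{value}")
            found = True
        else:
            out.append(line)
    if not found:
        raise KeyError(name)
    return "\n".join(out) + "\n"


# Intended use (not executed here):
#   qconf -ssconf > sched.conf
#   write set_sched_param(open("sched.conf").read(), "max_reservation", "10")
#     to sched.new
#   qconf -Msconf sched.new
```

Editing the value interactively with `qconf -msconf` achieves the same thing; the file-based route is only handy if you want to keep the old and new settings around for comparison.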

Thank you all,

Joseph


On 1/26/2019 11:26 AM, Daniel Povey wrote:
It may depend on specific features of those large job arrays.  You could try 
deleting them and see if the problem disappears.

On Sat, Jan 26, 2019 at 2:23 PM Joseph Farran <jfar...@uci.edu> wrote:

    Hi Daniel.

    Yes, I do have large job arrays of around 7k tasks, BUT I have had larger job arrays of 500k without seeing this kind of slowdown.

    Joseph


    On 1/26/2019 10:16 AM, Daniel Povey wrote:
    > Check if there are any huge jobs in the queue. Sometimes very large task ranges, or large numbers of jobs, can make it slow.
    >
    > On Sat, Jan 26, 2019 at 7:05 AM Reuti <re...@staff.uni-marburg.de> wrote:
    >
    >     Hi,
    >
    >     > On 26.01.2019 at 10:20, Joseph Farran <jfar...@uci.edu> wrote:
    >     >
    >     > Hi.
    >     > Our Grid Engine is running very sluggishly all of a sudden. sge_qmaster stays at 100% CPU all the time, where it used to be at 100% for a few seconds every 30 seconds or so.
    >     > I ran the qping command but am not sure how to read it. Any helpful insight is much appreciated.
    >
    >     Did you try to stop and start the qmaster?
    >
    >     -- Reuti
    >
    >
    >     > qping -i 5 -info hpc-s 6444 qmaster 1
    >     > 01/26/2019 01:12:18:
    >     > SIRM version:             0.1
    >     > SIRM message id:          1
    >     > start time:               01/26/2019 01:10:13 (1548493813)
    >     > run time [s]:             125
    >     > messages in read buffer:  0
    >     > messages in write buffer: 0
    >     > no. of connected clients: 296
    >     > status:                   0
    >     > info:                     MAIN: R (125.20) | signaler000: R (123.69) | event_master000: R (0.14) | timer000: R (4.52) | worker000: R (0.14) | worker001: R (3.44) | worker002: R (7.33) | worker003: R (3.43) | worker004: R (3.08) | worker005: R (1.42) | OK
    >     > malloc:                   arena(34410496) |ordblks(9370) | smblks(164269) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(7726000) | uordblks(24248176) | fordblks(10162320) | keepcost(119856)
    >     > Monitor:
    >     > 01/26/2019 01:10:13 | MAIN: no monitoring data available
    >     > 01/26/2019 01:10:14 | signaler000: no monitoring data available
    >     > 01/26/2019 01:12:14 | event_master000: runs: 4.82r/s (clients: 1.00 mod: 0.02/s ack: 0.02/s blocked: 0.00 busy: 0.81 | events: 5.52/s added: 5.47/s skipt: 0.05/s) out: 0.00m/s APT: 0.0002s/m idle: 99.89% wait: 0.00% time: 60.00s
    >     > 01/26/2019 01:12:14 | timer000: runs: 0.47r/s (pending: 12.00 executed: 0.45/s) out: 0.00m/s APT: 0.0002s/m idle: 99.99% wait: 0.00% time: 60.00s
    >     > 01/26/2019 01:11:19 | worker000: runs: 0.68r/s (EXECD (l:0.32,j:0.28,c:0.32,p:0.00,a:0.00)/s GDI (a:0.25,g:1.08,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.82m/s APT: 0.0036s/m idle: 99.75% wait: 0.00% time: 64.96s
    >     > 01/26/2019 01:12:15 | worker001: runs: 0.81r/s (EXECD (l:0.02,j:0.02,c:0.02,p:0.00,a:0.00)/s GDI (a:0.00,g:1.92,m:0.08,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.81m/s APT: 0.0008s/m idle: 99.93% wait: 0.00% time: 59.27s
    >     > 01/26/2019 01:11:16 | worker002: runs: 0.73r/s (EXECD (l:0.28,j:0.23,c:0.26,p:0.00,a:0.00)/s GDI (a:0.34,g:1.13,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.71m/s APT: 0.0030s/m idle: 99.78% wait: 0.17% time: 61.75s
    >     > 01/26/2019 01:12:15 | worker003: runs: 0.75r/s (EXECD (l:0.03,j:0.02,c:0.03,p:0.00,a:0.00)/s GDI (a:0.02,g:1.23,m:0.07,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.73m/s APT: 0.0008s/m idle: 99.94% wait: 0.02% time: 60.40s
    >     > 01/26/2019 01:11:26 | worker004: runs: 0.68r/s (EXECD (l:0.23,j:0.21,c:0.23,p:0.00,a:0.00)/s GDI (a:0.27,g:1.69,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.65m/s APT: 0.0012s/m idle: 99.92% wait: 0.00% time: 71.11s
    >     > 01/26/2019 01:11:31 | worker005: runs: 0.56r/s (EXECD (l:0.25,j:0.24,c:0.25,p:0.00,a:0.00)/s GDI (a:0.20,g:1.05,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 0.55m/s APT: 0.0011s/m idle: 99.94% wait: 0.00% time: 76.48s
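
One way to skim a monitor dump like the one above is to pull out just the `idle:` figure for each thread. The sketch below is only that, a sketch: the function name `thread_idle` and the regular expression are my own, and it assumes the `<thread>: runs: ... idle: NN.NN%` layout seen in this particular output rather than any documented qping format. It tolerates mail quote markers and wrapped lines.

```python
import re

# Matches "| <thread>: runs: ... idle: NN.NN%" once wrapping is undone.
_THREAD = re.compile(r"\|\s*(\w+): runs: .*?idle: ([\d.]+)%")


def thread_idle(qping_text: str) -> dict:
    """Map each thread name in a qping monitor dump to its idle percentage."""
    # Drop leading whitespace and '>' quote markers, then rejoin the
    # wrapped lines into one long string so each record is contiguous.
    lines = [re.sub(r"^[\s>]+", "", ln) for ln in qping_text.splitlines()]
    flat = " ".join(lines)
    return {name: float(pct) for name, pct in _THREAD.findall(flat)}
```

Run against the output above, every thread that reports monitoring data comes back at 99.75% idle or higher, so the sustained CPU load does not appear to come from the event master, timer, or worker threads listed there. That is only a reading of these numbers, not a diagnosis.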
    >     >
    >     > Joseph
    >     >
    >     >
    >     > _______________________________________________
    >     > users mailing list
    >     > users@gridengine.org
    >     > https://gridengine.org/mailman/listinfo/users
    >     >
    >
    >
    >

