Re: [gridengine users] cluster utilization

RDlab Thu, 25 Feb 2016 00:54:13 -0800

Hello,

I would suggest that you take a look at S-GAE. It gathers data from qacct and 
displays it as eye-candy graphics per user, per queue and for the whole cluster. 
It shows process memory usage, averages, queue wait time, and so on.
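
To give an idea of the kind of per-job figure that can be derived from qacct
data, here is a minimal sketch in Python. It is not how S-GAE itself is
implemented. It assumes the text comes from `qacct -j`, with records separated
by a line of "=" characters and one "key value" pair per line; the exact value
formats (for example a trailing "s" on times) vary between Grid Engine
versions, so the helper below only reads the leading number. The function
names are hypothetical.

```python
import re

def parse_qacct(text):
    """Split `qacct -j`-style text into one dict per job record."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.startswith("====="):      # rule line separates records
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)       # "key   value"
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if current:
        records.append(current)
    return records

def seconds(value):
    """Read a number of seconds, tolerating a trailing unit such as 's'."""
    match = re.match(r"\d+(?:\.\d+)?", value)
    return float(match.group()) if match else 0.0

def cpu_efficiency(record):
    """CPU time / (wallclock * slots); close to 1.0 means all slots were busy."""
    wall = seconds(record.get("ru_wallclock", "0"))
    slots = int(record.get("slots", "1"))
    if wall <= 0 or slots <= 0:
        return None                       # no usable runtime recorded
    return seconds(record.get("cpu", "0")) / (wall * slots)

# Made-up sample records, for illustration only.
SAMPLE = """\
==============================================================
owner        alice
jobnumber    101
slots        4
ru_wallclock 1000
cpu          950.0
==============================================================
owner        bob
jobnumber    102
slots        1
ru_wallclock 200
cpu          190.0
"""

for rec in parse_qacct(SAMPLE):
    print(rec["jobnumber"], rec["owner"], cpu_efficiency(rec))
```

A job that requested 4 slots but scores around 0.25 was most likely running
single-threaded.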


By the way, it is free software under GNU license and we are really happy with 
it :)

http://rdlab.cs.upc.edu/s-gae


Best regards,

Gabriel

-- 
RDlab (Campus Nord - UPC)  --  http://rdlab.cs.upc.edu
C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
08034 Barcelona

Telf:   +34 93 413 78 20

> On 24 Feb 2016, at 21:22, [email protected] wrote:
> 
> Is anyone monitoring cluster utilization with a higher-level view than
> simply job (qacct) statistics and CPU-seconds used/available?
> 
> I'm running SoGE 8.1.6  on a cluster with ~70 nodes, ~1400 cores and
> 200~350K jobs/month and I'm seeking ways to understand the utilization &
> resource constraints in our cluster overall.
> 
> The 'jobstats' script is fine for giving feedback to users, looking at
> things like avg/high/low job runtime, wait time, etc., but it doesn't
> give good information about overall cluster utilization.
> 
> 
> I'd like to see these kinds of metrics on cluster use:
> 
>       histogram of CPU utilization, ie:
>               Utilization     Time
>               100%            5%
>                90%            20%
> 
>       histogram of overall memory use, ie:
>               Utilization     Time
>               100%            0%
>                90%            60%
> 
>       correlation between jobs waiting (CPUs idle) and available memory, as
>       in:
>               Jan 1   14:00 - 20:00
>                       avg 4GB free/node
>                       avg 50% CPU-slots used
>                       avg 12GB RAM request for jobs in 'qw'
>                               memory is the constraint, cluster is fully
>                               utilized but CPUs are idle
> 
>               Jan 8   08:00 - 14:00
>                       avg 32GB free/node
>                       avg 98% CPU-slots used
>                       avg 2GB RAM request for jobs in 'qw'
>                               CPU is the constraint, cluster is fully
>                               utilized but memory is unused
> 
> 
>       number of jobs queued/waiting (excluding 'hold' jobs)
> 
>       number of CPUs requested vs [CPU time/wallclock time]
>               (useful for detecting if users are requesting multiple
>               cores in the 'threaded' PE but running single-threaded
>               jobs)
> 
>       amount of memory used per job as a function of request, ie:
>               requested       used avg
>               =========       ========
>               4GB             2.1GB
>               12GB            9GB
>               20GB            17GB
> 
>       average duration job spends in 'qw' state
> 
>       duration of queue time as a function of number of CPUs requested, ie
>               1CPU    1hr avg in 'qw'
>               2CPU    2hr avg in 'qw'
>               4CPU    12hr avg in 'qw'
> 
>       duration of queue time as a function of amount of RAM requested
>               4GB     1hr avg in 'qw'
>               12GB    2hr avg in 'qw'
>               20GB    12hr avg in 'qw'
> 
> I think that the only way to get this information would be to run 'qstat'
> periodically, capture & process that data... any better suggestions or
> scripts that anyone can share?
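
For the time-based histograms described above, the processing step could look
something like the following Python sketch. It assumes you already have
periodic samples as (used, total) pairs taken at a fixed interval, for example
slot counts parsed from `qstat -g c`; collecting and parsing the samples is not
shown, and the function name and sample values are made up for illustration.

```python
from collections import Counter

def utilization_histogram(samples, bucket=10):
    """Return {bucket_floor_percent: percent_of_samples}.

    samples: iterable of (used, total) pairs taken at a fixed interval,
    so every sample stands for the same amount of time.
    """
    counts = Counter()
    n = 0
    for used, total in samples:
        if total <= 0:
            continue                      # skip unusable samples
        pct = 100.0 * used / total
        # Round down to the bucket floor, e.g. 95% -> 90 with bucket=10.
        floor = min(int(pct // bucket) * bucket, 100)
        counts[floor] += 1
        n += 1
    if n == 0:
        return {}
    return {b: 100.0 * c / n for b, c in sorted(counts.items(), reverse=True)}

# Four made-up samples of slots in use on a 1400-slot cluster.
slot_samples = [(1400, 1400), (1330, 1400), (1300, 1400), (700, 1400)]
for floor, share in utilization_histogram(slot_samples).items():
    print(f"{floor:3d}%  {share:5.1f}% of samples")
```

The same function works for memory if the pairs are (used, total) memory
figures instead of slots.
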
> 
> Thanks,
> 
> Mark
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


