Thanks, some comments/questions inline.
On Tue, Mar 03, 2015 at 07:16:16AM -0500, Chris Dagdigian wrote:
> - It's a good basic reporting tool for monthly metrics.
Is that the smallest resolution it supports? XDMoD can drill down to
daily, which we find very useful.
> - Job count shown as a percentage of success/failed jobs (job success
> % is a great top-line metric)
That's actually a really nice metric. I don't know if XDMoD supports
that out of the box. That said, the charts it makes are nice, and
there's a custom reporting system.
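Out of curiosity, the success-% metric can be computed straight off the raw accounting file. A purely hypothetical sketch — the `success_rate` name is mine, and the field positions are an assumption based on the accounting(5) layout, where "failed" is field 12 and "exit_status" is field 13 (1-based):

```python
# Hypothetical sketch (not S-GAE's or XDMoD's actual code): compute the
# job-success top-line metric from a raw, colon-delimited SGE accounting
# file.  Assumes accounting(5) field order: "failed" at index 11 and
# "exit_status" at index 12 (0-based).
def success_rate(lines):
    ok = total = 0
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue                    # skip header comments and blanks
        f = line.rstrip('\n').split(':')
        total += 1
        if int(f[11]) == 0 and int(f[12]) == 0:
            ok += 1                     # failed == 0 and exit_status == 0
    return 100.0 * ok / total if total else 0.0
```

In practice you'd feed it `open('accounting')`; one pass, no per-job forks.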
> - Cluster exec time (bar graph showing longest / shortest / avg job info)
> - Slots per job graph (great way to show that only 1% of jobs use
> MPI or threaded PE hack)
> - Top ten users by memory consumption
> - Top ten users by raw job count
> - Top ten users by absolute exec time
XDMoD has similar stuff.
> Generic observations:
> - It's not super fast at ingest; it does a qacct on every job in the
> accounting file, parses the data and loads into db; I usually let it
> cook overnight on ingest
Seriously? A full "qacct -j <jobid>" for each job? That's got to be
slow. *MoD at least groks the raw accounting logs.
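To illustrate the difference, here's a hedged sketch (field positions assumed from accounting(5): job number in $6, start/end times in $10/$11; the sample data is fake):

```shell
# Tiny fake accounting sample, colon-delimited per accounting(5).
# Real logs live under $SGE_ROOT/<cell>/common/accounting.
printf 'all.q:n1:g:u:j:1:sge:0:0:100:160:0:0\nall.q:n1:g:u:j:2:sge:0:0:200:230:0:0\n' > accounting

# Slow path (roughly what S-GAE reportedly does): one qacct fork per job:
#   for id in $(awk -F: '!/^#/ {print $6}' accounting); do qacct -j "$id"; done

# Single pass over the raw log instead: job count and summed wallclock.
awk -F: '!/^#/ { jobs++; wall += $11 - $10 } END { print jobs, wall }' accounting
```

Same information, one process instead of one fork per job.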
> - It can be tuned for ingest with various memory, mysql and ramdisk
> methods
Handy.
> - It's not fast at viewing - tons of temporary mysql tables are made
> in $TMP just to show the front cluster view page
We run XDMoD on a small VM and it works just fine. To be fair, the mysql
server that stores the data is a relatively large physical box.
> - It can take 10 minutes just to render the HTML main page after
> we've loaded metrics for the month; lots of action in /tmp with
> temporary mysql files
Yeah, nothing like that here, even for the worst case graphs.
> - By default it will reject jobs for which the username does not
> exist on localhost - this is crappy for situations where I take
> someone's accounting file and run it through my own S-GAE server
> running on AWS cloud or elsewhere. I had to make scripts that parse
> the accounting file for usernames, generate a uniq list and then make
> fake dummy accounts on the local system. This is problematic if you
> don't pay attention to the logs
XDMoD deals with that issue as well. One thing that you cannot do, yet,
is cleanly map all of XDMoD's organizational hierarchies directly into
SGE's. For example, we would really like to map SGE's
Division/Project/User into XDMoD, but it's not perfect. Projects are
most important to us, but to get them to show up in the charts, we have
to map them to PIs.
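The dummy-account workaround Chris describes can be scripted in a few lines. A hypothetical sketch, assuming owner names sit in field 4 per accounting(5) (the sample data is fake, and the useradd step is root-only, so it's left commented):

```shell
# Fake accounting snippet; owner is field 4 per accounting(5).
printf 'q:h:g:alice:j:1:a:0:0:0:0:0:0\nq:h:g:bob:j:2:a:0:0:0:0:0:0\nq:h:g:alice:j:3:a:0:0:0:0:0:0\n' > accounting

# Unique owners -> stub list; feed to useradd on a disposable box only.
awk -F: '!/^#/ { print $4 }' accounting | sort -u > owners.txt
# xargs -n1 useradd --no-create-home < owners.txt   # root-only, hence commented
cat owners.txt
```

Run on a throwaway analysis host, this keeps ingest from silently dropping jobs for unknown users.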
> - Errors in the logs about being unable to ingest or create summary
> views may make you think at first about SQL or database problems but
> 99% of the time it means that the system ran /tmp to 100% full and
> just bombed out trying to execute a procedure
Sometimes we get funky logs, but given that we push somewhere around 1.5
million jobs a week, losing a few is not a big deal.
> - There are certain things that can ONLY be done in the web interface
> that kill me when I repeatedly set up and rebuild a metrics
> system. You can't configure the known queues or other parameters via a
> script or a config file. Each time you install or reinstall you need
> to step through the web page. There are multiple point-and-click
> events required to register each cluster queue, which is painful on big
> systems where I may be destroying and rebuilding the S-GAE system
> multiple times. It's a human interaction / UI hassle, basically
A lot of the group/queue/user stuff is auto-generated from the logs, so
that's a good thing.
There are some "admin"-type things that are UI-only, but I do most of the
setup via puppet pushing RPMs and json files around. It's not 100%
automated, but about as close as I care to make it.
Authentication on the web pages is a bit odd: it's not basic HTTP auth,
it doesn't tie in to Kerberos, and there are some odd hooks in place that
assume you're part of the U. Buffalo system. However, it works.
> Tuning:
> - S-GAE needs huge /tmp space and may fail subtly unless you are
> careful about watching the logs
> - For a cluster that does between 1-2 million jobs a month we need a
> 100GB /tmp partition to run metrics
XDMoD does all of this in the database. It "shreds" the raw log files,
and stuffs them into the DB directly. Those records are then ingested
from one DB to another DB for aggregation and storage. (There are 6
different databases for XDMoD, which is a bit odd. The schemas are
fairly sane though.)
> When running on the Amazon cloud doing a one-off analysis on an
> accounting file from a client, I've found that I could make things go
> far, far faster by:
> - Running on a spot node with lots of memory
> - Carving a ramdisk out of some of the RAM and mounting it as /ramdisk
> - Relocating the mysql database data/table files into /ramdisk
> - Applying some of the mysql tuning advice from google to the
> mysql.conf file
> - Keeping the accounting file under the /ramdisk/ path
Probably all useful for XDMoD as well, if it fits your environment. I
pull the accounting logs off a moderately powerful, moderately
overloaded NetApp. For running a nightly ingest, it's certainly fast
enough.
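FWIW, the ramdisk recipe above boils down to something like this. This is an untested ops sketch rather than anyone's actual script — it assumes root on a throwaway node, a stock MySQL datadir under /var/lib/mysql, and a systemd host; the tmpfs size is a guess, so adjust to taste:

```shell
mkdir -p /ramdisk
mount -t tmpfs -o size=64g tmpfs /ramdisk       # carve the ramdisk out of RAM
systemctl stop mysqld
rsync -a /var/lib/mysql/ /ramdisk/mysql/        # relocate the data/table files
sed -i 's|^datadir=.*|datadir=/ramdisk/mysql|' /etc/my.cnf
systemctl start mysqld
cp accounting /ramdisk/                         # keep the accounting file in RAM too
```

Everything in /ramdisk evaporates on reboot, which is fine for a one-off analysis but obviously not for durable metrics.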
--
Jesse Becker (Contractor)
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users