I'll give some impressions of S-GAE since I have it installed in a lot of places ...

- It's a good basic reporting tool for monthly metrics.
- I don't use all of the features; mainly the full cluster "view"
- In the full cluster view there are 4-6 PNG graphics that I just generate and copy/embed into a written document

The basic metrics that I like are:

- Job count shown as a percentage of success/failed jobs (job success % is a great top-line metric)
- Cluster exec time (bar graph showing longest / shortest / avg job info)
- Slots per job graph (great way to show that only 1% of jobs use MPI or threaded PE hack)
- Top ten users by memory consumption
- Top ten users by raw job count
- Top ten users by absolute exec time
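
If you want to sanity-check the top-line success number outside of S-GAE, here is a rough Python sketch that computes it straight from the accounting file. Assumptions on my part: records are colon-separated with "failed" and "exit_status" as the 12th and 13th fields (the layout described in accounting(5)), and a job counts as a success only when both are 0. S-GAE's own definition may differ, and the function name is made up.

```python
def job_success_percent(lines, failed_idx=11, exit_idx=12):
    """Return the percentage of accounting records that look successful.

    Field positions are assumptions taken from the accounting(5) layout.
    """
    total = 0
    ok = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Ignore truncated records that lack the fields we need.
        if len(fields) <= max(failed_idx, exit_idx):
            continue
        total += 1
        if fields[failed_idx] == "0" and fields[exit_idx] == "0":
            ok += 1
    return 100.0 * ok / total if total else 0.0
```

Any iterable of lines works, so an open file handle on the accounting file can be passed in directly.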

Generic observations:

- It's not super fast at ingest; it does a qacct on every job in the accounting file, parses the data and loads into db; I usually let it cook overnight on ingest

- It can be tuned for ingest with various memory, mysql and ramdisk methods

- It's not fast at viewing - tons of temporary mysql tables are made in $TMP just to show the front cluster view page

- It can take 10 minutes just to render the HTML main page after we've loaded metrics for the month; lots of action in /tmp with temporary mysql files

- By default it will reject jobs for which the username does not exist on localhost. This is crappy for situations where I take someone's accounting file and run it through my own S-GAE server running on AWS cloud or elsewhere. I had to make scripts that parse the accounting file for usernames, generate a unique list and then make fake dummy accounts on the local system. The rejected jobs are easy to miss if you don't pay attention to the logs

- Errors in the logs about being unable to ingest or create summary views may make you think at first about SQL or database problems, but 99% of the time it means that /tmp ran to 100% full and the system just bombed out trying to execute a procedure

- There are certain things that can ONLY be done in the web interface, and that kills me when I set up or repeatedly rebuild a metric system. You can't configure the known queues or other parameters via a script or a config file. Each time you install or reinstall you need to step through the web page. There are multiple point and click events required to register each cluster queue, which is painful on big systems where I may be destroying and rebuilding the S-GAE system multiple times. It's a human interaction / UI hassle basically
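
For the missing-username problem above, this is roughly what my helper scripts boil down to, rewritten as a Python sketch rather than the originals. Assumptions: the job owner is the 4th colon-separated field of each accounting record (per accounting(5)), and the dummy accounts get no home directory and a nologin shell. The function names are made up, and the nologin path differs between distros. It only builds the useradd commands so they can be reviewed before anything is run as root.

```python
import shlex


def accounting_owners(lines, owner_idx=3):
    """Return the sorted unique job owners found in accounting records."""
    owners = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) > owner_idx and fields[owner_idx]:
            owners.add(fields[owner_idx])
    return sorted(owners)


def useradd_commands(owners, existing=(), shell="/sbin/nologin"):
    """Build (but do not run) useradd commands for owners missing locally."""
    known = set(existing)
    return [
        "useradd -M -s {} {}".format(shlex.quote(shell), shlex.quote(name))
        for name in owners
        if name not in known
    ]
```

The list of existing local accounts can come from something like `[p.pw_name for p in pwd.getpwall()]` in the standard library.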


Tuning:

- S-GAE needs huge /tmp space and may fail subtly unless you are careful about watching the logs. For a cluster that does between 1 and 2 million jobs a month we need a 100GB /tmp partition to run metrics
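
Since a full /tmp is the usual culprit, a small watchdog along these lines can be run from cron during ingest. This is only a sketch: the 90% threshold and the function names are my own choices, not anything S-GAE provides, and the percentage can differ slightly from what df reports because of reserved blocks.

```python
import shutil


def percent_used(path="/tmp"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def check_tmp(path="/tmp", warn_at=90.0):
    """Return (percent_used, should_warn) for the given path."""
    pct = percent_used(path)
    return pct, pct >= warn_at


if __name__ == "__main__":
    pct, warn = check_tmp()
    if warn:
        print("WARNING: /tmp is {:.1f}% full".format(pct))
```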


For fixed installs that run metrics monthly I just configure the server to use a big /tmp partition and decide if I can get away with turning on the in-memory accounting file handling on a given system.

When running on the Amazon cloud doing a one-off analysis of an accounting file from a client I've found that I could make things go far faster by:

- Running on a spot node with lots of memory
- Carving a ramdisk out of some of the RAM and mounting it as /ramdisk
- Relocating the mysql database data/table files into /ramdisk
- Applying some of the mysql tuning advice from google to the mysql.conf file
- Keeping the accounting file in the /ramdisk/ path
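
The ramdisk part of that list looks roughly like the sketch below. Assumptions: I'm showing tmpfs as the ramdisk type, the 64 GB size is just an example, and /ramdisk/mysql is a hypothetical subdirectory for the relocated database files. The function names are made up. The code only builds the commands and the config fragment; nothing is executed, and anything kept on a ramdisk is gone after a reboot or spot termination.

```python
def ramdisk_commands(size_gb, mount_point="/ramdisk"):
    """Return the argv lists that create and mount a tmpfs ramdisk."""
    return [
        ["mkdir", "-p", mount_point],
        ["mount", "-t", "tmpfs", "-o", "size={}g".format(size_gb),
         "tmpfs", mount_point],
    ]


def mysqld_fragment(datadir="/ramdisk/mysql"):
    """Return a config fragment pointing mysqld's datadir at the ramdisk."""
    return "[mysqld]\ndatadir={}\n".format(datadir)


if __name__ == "__main__":
    for argv in ramdisk_commands(64):
        print(" ".join(argv))
    print(mysqld_fragment())
```

Stop mysqld and copy the existing data directory over, keeping ownership intact, before changing datadir.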




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
