I'll give some impressions of S-GAE since I have it installed in a lot
of places ...
- It's a good basic reporting tool for monthly metrics.
- I don't use all of the features; mainly the full cluster "view"
- In the full cluster view there are 4-6 PNG graphics that I just
generate and copy/embed into a written document
The basic metrics that I like are:
- Job count shown as a percentage of successful vs. failed jobs (job success
% is a great top-line metric)
- Cluster exec time (bar graph showing longest / shortest / avg job info)
- Slots per job graph (great way to show that only 1% of jobs use MPI
or threaded PE hack)
- Top ten users by memory consumption
- Top ten users by raw job count
- Top ten users by absolute exec time
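For anyone who wants the job success % without the full S-GAE stack, here is a minimal sketch of computing it straight from the accounting file. This is not how S-GAE does it internally. It assumes the usual colon-separated Grid Engine accounting format, with "failed" as the 12th field and "exit_status" as the 13th, and it counts a record as a success only when both are 0. The function name job_success_percent is made up for illustration.

```python
FAILED_IDX = 11       # assumed 0-based position of the "failed" field
EXIT_STATUS_IDX = 12  # assumed 0-based position of the "exit_status" field


def job_success_percent(lines):
    """Return the percentage of accounting records that finished cleanly."""
    total = 0
    ok = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split(":")
        if len(fields) <= EXIT_STATUS_IDX:
            continue  # skip truncated records
        total += 1
        # Assumption: a clean job has failed == 0 and exit_status == 0.
        if fields[FAILED_IDX] == "0" and fields[EXIT_STATUS_IDX] == "0":
            ok += 1
    return 100.0 * ok / total if total else 0.0
```

It takes any iterable of lines, so an open file handle on the accounting file works directly.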
Generic observations:
- It's not super fast at ingest; it does a qacct on every job in the
accounting file, parses the data and loads into db; I usually let it
cook overnight on ingest
- It can be tuned for ingest with various memory, mysql and ramdisk
methods
- It's not fast at viewing - tons of temporary mysql tables are made
in $TMP just to show the front cluster view page
- It can take 10 minutes just to render the HTML main page after we've
loaded metrics for the month; lots of action in /tmp with temporary
mysql files
- By default it will reject jobs for which the username does not exist
on localhost - this is crappy for situations where I take someone's
accounting file and run it through my own S-GAE server running on AWS
or elsewhere. I had to write scripts that parse the accounting file
for usernames, generate a unique list and then create dummy accounts on
the local system. The rejected jobs are easy to miss if you don't pay
attention to the logs
- Errors in the logs about being unable to ingest or create summary
views may make you think at first about SQL or database problems but 99%
of the time it means that the system ran /tmp to 100% full and just
bombed out trying to execute a procedure
- There are certain things that can ONLY be done in the web interface
that kill me when I repeatedly set up and rebuild a metrics
system. You can't configure the known queues or other parameters via a
script or a config file. Each time you install or reinstall you need to
step through the web page. Multiple point-and-click steps are
required to register each cluster queue, which is painful on big systems
where I may be destroying and rebuilding the S-GAE system multiple
times. It's basically a human interaction / UI hassle
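The username workaround mentioned above can be sketched roughly like this. It is not my actual script, just an illustration under a few assumptions: the accounting file is the usual colon-separated format with the job owner as the 4th field, and placeholder accounts are created with useradd using /sbin/nologin as the shell (that path differs between distros). The names unique_owners and useradd_commands are hypothetical.

```python
OWNER_IDX = 3  # assumed 0-based position of the "owner" field


def unique_owners(lines):
    """Return the sorted, de-duplicated job owners found in accounting records."""
    owners = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split(":")
        if len(fields) > OWNER_IDX and fields[OWNER_IDX]:
            owners.add(fields[OWNER_IDX])
    return sorted(owners)


def useradd_commands(owners):
    """Build (but do not run) one placeholder-account command per owner."""
    # -M: no home directory; -s: login shell. The nologin path is an
    # assumption and may be /usr/sbin/nologin on some systems.
    return [["useradd", "-M", "-s", "/sbin/nologin", name] for name in owners]
```

The commands are returned as argument lists rather than executed, so they can be reviewed first and then passed to subprocess.run as root.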
Tuning:
- S-GAE needs huge /tmp space and may fail subtly unless you are
careful about watching the logs
- For a cluster that does between 1 and 2 million jobs a month we need a
100GB /tmp partition to run metrics
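Since a full /tmp looks like a database error in the logs, a pre-flight check before ingest can save an overnight run. A minimal sketch, assuming plain Python is available on the server. The 100GB threshold is just the figure quoted above, and the function name tmp_has_headroom is made up; nothing here is part of S-GAE.

```python
import shutil

REQUIRED_TMP_BYTES = 100 * 1024**3  # the 100GB figure quoted above


def tmp_has_headroom(path="/tmp", required_bytes=REQUIRED_TMP_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes
```

This only checks free space at the moment it runs, so it won't catch /tmp filling up halfway through a run; watching the logs is still needed.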
For fixed installs that run metrics monthly I just configure the server
to use a big /tmp partition and decide if I can get away with turning on
the in-memory accounting file handling on a given system.
When running on the Amazon cloud doing a one-off analysis on an accounting
file from a client I've found that I could make things go far faster by:
- Running on a spot node with lots of memory
- Carving a ramdisk out of some of the RAM and mounting it as /ramdisk
- Relocating the mysql database data/table files into /ramdisk
- Applying some of the mysql tuning advice found via Google to the
mysql config file (usually my.cnf)
- Keeping the accounting file in /ramdisk/ path
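The ramdisk step can be sketched as below. The list above doesn't specify the ramdisk type, so tmpfs is an assumption here, as are the 50% default and the helper name ramdisk_mount_command. The function only builds the mount command so it can be checked before anything is run as root.

```python
def ramdisk_mount_command(total_ram_gb, fraction=0.5, mount_point="/ramdisk"):
    """Build (but do not run) a tmpfs mount command sized from total RAM."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    size_gb = int(total_ram_gb * fraction)  # round down to whole gigabytes
    if size_gb < 1:
        raise ValueError("not enough RAM for a ramdisk of at least 1G")
    # Assumption: tmpfs is the ramdisk type; size= accepts a G suffix.
    return ["mount", "-t", "tmpfs", "-o", f"size={size_gb}G", "tmpfs", mount_point]
```

Relocating the mysql data files would then be a matter of pointing the datadir setting at a directory under the mount point. Anything on tmpfs is gone after a reboot, which is fine for a one-off analysis but not for a fixed install.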