Re: [gridengine users] Anyone using S-GAE reporting app with Univa grid engine?

Chris Dagdigian Tue, 03 Mar 2015 04:19:46 -0800

I'll give some impressions of S-GAE since I have it installed in a lot of places ...


- It's a good basic reporting tool for monthly metrics.
- I don't use all of the features; mainly the full cluster "view"

- In the full cluster view there are 4-6 PNG graphics that I just generate and copy/embed into a written document


The basic metrics that I like are:

- Job count shown as a percentage of successful/failed jobs (job success % is a great top-line metric)

 - Cluster exec time (bar graph showing longest / shortest / avg job info)

- Slots per job graph (great way to show that only 1% of jobs use MPI or threaded PE hack)

 - Top ten users by memory consumption
 - Top ten users by raw job count
 - Top ten users by absolute exec time
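
To make two of these metrics concrete, here is a minimal sketch that computes job success % and the top ten users by raw job count straight from an accounting file. It is not how S-GAE does it. The column positions follow the colon-separated layout described in the Grid Engine accounting(5) man page, and the definition of "success" (failed and exit_status both 0) is an assumption; `summarize` is a hypothetical helper name.

```python
from collections import Counter

# Column positions in a colon-separated accounting record, as described
# in accounting(5). Verify these against your Grid Engine version.
OWNER, FAILED, EXIT_STATUS = 3, 11, 12


def summarize(lines):
    """Return (success_percent, top ten (user, job_count) pairs)."""
    jobs_per_user = Counter()
    total = 0
    succeeded = 0
    for line in lines:
        # Skip blank lines and the comment header of the accounting file.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) <= EXIT_STATUS:
            continue  # truncated or malformed record
        total += 1
        jobs_per_user[fields[OWNER]] += 1
        # Assumption: a job "succeeded" if failed == 0 and exit_status == 0.
        if fields[FAILED] == "0" and fields[EXIT_STATUS] == "0":
            succeeded += 1
    percent = 100.0 * succeeded / total if total else 0.0
    return percent, jobs_per_user.most_common(10)
```

Usage would be along the lines of `with open(path) as fh: summarize(fh)`, where `path` points at the cluster's accounting file.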

Generic observations:

- It's not super fast at ingest; it does a qacct on every job in the accounting file, parses the data and loads into db; I usually let it cook overnight on ingest

- It can be tuned for ingest with various memory, mysql and ramdisk methods

- It's not fast at viewing - tons of temporary mysql tables are made in $TMP just to show the front cluster view page

- It can take 10 minutes just to render the HTML main page after we've loaded metrics for the month; lots of action in /tmp with temporary mysql files

- By default it will reject jobs for which the username does not exist on localhost - this is crappy for situations where I take someone's accounting file and run it through my own S-GAE server running on AWS cloud or elsewhere. I had to make scripts that parse the accounting file for usernames, generate a uniq list and then make fake dummy accounts on the local system. This is problematic if you don't pay attention to the logs

- Errors in the logs about being unable to ingest or create summary views may make you think at first about SQL or database problems but 99% of the time it means that the system ran /tmp to 100% full and just bombed out trying to execute a procedure

- There are certain things that can ONLY be done in the web interface that kill me when I set up, or repeatedly set up and rebuild, a metric system. You can't configure the known queues or other parameters via a script or a config file. Each time you install or reinstall you need to step through the web page. There are multiple point and click events required to register each cluster queue, which is painful on big systems where I may be destroying and rebuilding the S-GAE system multiple times. It's a human interaction / UI hassle basically
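
The username workaround described above (parse the accounting file for usernames, build a unique list, create dummy local accounts) can be sketched roughly as follows. This is not the original script. It assumes the job owner is the fourth colon-separated field, as in accounting(5), and that `useradd` with `--no-create-home` and `--shell` is available; the `/sbin/nologin` path and all function names are assumptions and may differ on your distribution.

```python
import pwd
import subprocess

OWNER = 3  # fourth colon-separated field in accounting(5) records


def owners_in(lines):
    """Return the sorted unique job owners found in accounting records."""
    owners = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and the comment header
        fields = line.rstrip("\n").split(":")
        if len(fields) > OWNER:
            owners.add(fields[OWNER])
    return sorted(owners)


def missing_users(owners):
    """Return the owners that have no account on this host."""
    missing = []
    for name in owners:
        try:
            pwd.getpwnam(name)
        except KeyError:
            missing.append(name)
    return missing


def create_dummy_accounts(names, dry_run=True):
    """Build (and optionally run) one useradd command per missing user."""
    commands = []
    for name in names:
        # Assumption: a login-less account with no home directory is enough.
        cmd = ["useradd", "--no-create-home", "--shell", "/sbin/nologin", name]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # needs root privileges
    return commands
```

With `dry_run=True` nothing is changed on the system; the returned command list can be reviewed before running it for real as root.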



Tuning:

- S-GAE needs huge /tmp space and may fail subtly unless you are careful about watching the logs

- For a cluster that does between 1-2 million jobs a month we need a 100GB /tmp partition to run metrics

For fixed installs that run metrics monthly I just configure the server to use a big /tmp partition and decide if I can get away with turning on the in-memory accounting file handling on a given system.
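
Since a full /tmp is the usual cause of failures, one option is to check free space before starting an ingest or report run. This is a minimal sketch and not part of S-GAE. The 100 GB default is borrowed from the figure above for a cluster doing 1-2 million jobs a month and is only an assumed threshold; both function names are hypothetical.

```python
import shutil


def free_gib(path="/tmp"):
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def enough_space(path="/tmp", required_gib=100.0):
    """True if `path` has at least `required_gib` GiB free.

    The 100 GiB default is an assumed threshold, not an S-GAE setting.
    """
    return free_gib(path) >= required_gib
```

A wrapper script could call `enough_space()` and refuse to start the monthly run when it returns False, rather than letting the job die partway through.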

When running on the Amazon cloud doing a 1-off analysis on an accounting file from a client I've found that I could make things go far far faster by:


 - Running on a spot node with lots of memory
 - Carving a ramdisk out of some of the RAM and mounting it as /ramdisk
 - Relocating the mysql database data/table files into /ramdisk

 - Applying some of the mysql tuning advice from google to the mysql.conf file

 - Keeping the accounting file in /ramdisk/ path
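
The ramdisk steps above could look roughly like the command plan below. This is a sketch under assumptions, not the exact procedure used: it assumes a tmpfs mount, a MySQL data directory of `/var/lib/mysql`, and a hypothetical helper `ramdisk_plan` that only builds the command strings so they can be reviewed before anything is run.

```python
def ramdisk_plan(size_gib, mount_point="/ramdisk",
                 mysql_datadir="/var/lib/mysql",
                 accounting_file="accounting"):
    """Return shell commands (as strings) for the ramdisk setup.

    All paths and the tmpfs choice are assumptions. Stop MySQL before
    copying its data directory; the service name varies by distribution.
    """
    new_datadir = f"{mount_point}/mysql"
    return [
        f"mkdir -p {mount_point}",
        # Carve the ramdisk out of RAM and mount it.
        f"mount -t tmpfs -o size={size_gib}g tmpfs {mount_point}",
        # Relocate the MySQL data/table files, preserving ownership.
        f"cp -a {mysql_datadir} {new_datadir}",
        # Keep the accounting file on the ramdisk as well.
        f"cp {accounting_file} {mount_point}/",
    ]


if __name__ == "__main__":
    for command in ramdisk_plan(64):
        print(command)
```

After the copy, the MySQL `datadir` setting would need to point at the new location before the server is restarted. Anything on a tmpfs mount is lost on reboot, so this only suits throwaway one-off analysis nodes.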




