Thanks, some comments/questions inline.
On Tue, Mar 03, 2015 at 07:16:16AM -0500, Chris Dagdigian wrote:
> - It's a good basic reporting tool for monthly metrics.
Is that the smallest resolution it supports? XDMoD can drill down to
daily, which we find very useful.
> - Job count shown as a percentage of success/failed jobs (job success
> % is a great top-line metric)
That's actually a really nice metric. I don't know if XDMoD supports
that out of the box. That said, the charts it makes are nice, and
there's a custom reporting system.
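Out of curiosity, the success-% metric can be computed straight off the raw accounting file. A purely hypothetical sketch — the `success_rate` name is mine, and the field positions are an assumption based on the accounting(5) layout, where "failed" is field 12 and "exit_status" is field 13 (1-based):

```python
# Hypothetical sketch (not S-GAE's or XDMoD's actual code): compute the
# job-success top-line metric from a raw, colon-delimited SGE accounting
# file.  Assumes accounting(5) field order: "failed" at index 11 and
# "exit_status" at index 12 (0-based).
def success_rate(lines):
    ok = total = 0
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue                    # skip header comments and blanks
        f = line.rstrip('\n').split(':')
        total += 1
        if int(f[11]) == 0 and int(f[12]) == 0:
            ok += 1                     # failed == 0 and exit_status == 0
    return 100.0 * ok / total if total else 0.0
```

In practice you'd feed it `open('accounting')`; one pass, no per-job forks.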
> - Cluster exec time (bar graph showing longest / shortest / avg job info)
> - Slots per job graph (great way to show that only 1% of jobs use
> MPI or threaded PE hack)
> - Top ten users by memory consumption
> - Top ten users by raw job count
> - Top ten users by absolute exec time
XDMoD has similar stuff.
> Generic observations:
> - It's not super fast at ingest; it does a qacct on every job in the
> accounting file, parses the data and loads into db; I usually let it
> cook overnight on ingest
Seriously? A full "qacct -j <jobid>" for each job? That's got to be
slow. *MoD at least groks the raw accounting logs.
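To illustrate the difference, here's a hedged sketch (field positions assumed from accounting(5): job number in $6, start/end times in $10/$11; the sample data is fake):

```shell
# Tiny fake accounting sample, colon-delimited per accounting(5).
# Real logs live under $SGE_ROOT/<cell>/common/accounting.
printf 'all.q:n1:g:u:j:1:sge:0:0:100:160:0:0\nall.q:n1:g:u:j:2:sge:0:0:200:230:0:0\n' > accounting

# Slow path (roughly what S-GAE reportedly does): one qacct fork per job:
#   for id in $(awk -F: '!/^#/ {print $6}' accounting); do qacct -j "$id"; done

# Single pass over the raw log instead: job count and summed wallclock.
awk -F: '!/^#/ { jobs++; wall += $11 - $10 } END { print jobs, wall }' accounting
```

Same information, one process instead of one fork per job.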
> - It can be tuned for ingest with various memory, mysql and ramdisk
> methods
Handy.
> - It's not fast at viewing - tons of temporary mysql tables are made
> in $TMP just to show the front cluster view page
We run XDMoD on a small VM and it works just fine. To be fair, the mysql
server that stores the data is a relatively large physical box.
> - It can take 10 minutes just to render the HTML main page after
> we've loaded metrics for the month; lots of action in /tmp with
> temporary mysql files
Yeah, nothing like that here, even for the worst case graphs.
> - By default it will reject jobs for which the username does not
> exist on localhost - this is crappy for situations where I take
> someone's accounting file and run it through my own S-GAE server
> running on AWS cloud or elsewhere. I had to make scripts that parse
> the accounting file for usernames, generate a uniq list and then make
> fake dummy accounts on the local system. This is problematic if you
> don't pay attention to the logs
XDMoD deals with that issue as well. One thing that you cannot do, yet,
is cleanly map all of XDMoD's organizational hierarchies directly into
SGE's. For example, we would really like to map SGE's
Division/Project/User into XDMoD, but it's not perfect. Projects are
most important to us, but to get them to show up in the charts, we have
to map them to PIs.
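The dummy-account workaround Chris describes can be scripted in a few lines. A hypothetical sketch, assuming owner names sit in field 4 per accounting(5) (the sample data is fake, and the useradd step is root-only, so it's left commented):

```shell
# Fake accounting snippet; owner is field 4 per accounting(5).
printf 'q:h:g:alice:j:1:a:0:0:0:0:0:0\nq:h:g:bob:j:2:a:0:0:0:0:0:0\nq:h:g:alice:j:3:a:0:0:0:0:0:0\n' > accounting

# Unique owners -> stub list; feed to useradd on a disposable box only.
awk -F: '!/^#/ { print $4 }' accounting | sort -u > owners.txt
# xargs -n1 useradd --no-create-home < owners.txt   # root-only, hence commented
cat owners.txt
```

Run on a throwaway analysis host, this keeps ingest from silently dropping jobs for unknown users.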
> - Errors in the logs about being unable to ingest or create summary
> views may make you think at first about SQL or database problems but
> 99% of the time it means that the system ran /tmp to 100% full and
> just bombed out trying to execute a procedure
Sometimes we get funky logs, but given that we push somewhere around 1.5
million jobs a week, losing a few is not a big deal.
> - There are certain things that can ONLY be done in the web interface
> that kill me when I repeatedly set up and rebuild a metrics
> system. You can't configure the known queues or other parameters via a
> script or a config file. Each time you install or reinstall you need
> to step through the web page. There are multiple point-and-click
> events required to register each cluster queue, which is painful on big
> systems where I may be destroying and rebuilding the S-GAE system
> multiple times. It's a human interaction / UI hassle, basically
A lot of the group/queue/user stuff is auto-generated from the logs, so
that's a good thing.
There are some "admin"-type things that are UI-only, but I do most of the
setup via puppet pushing RPMs and json files around. It's not 100%
automated, but about as close as I care to make it.
Authentication on the web pages is a bit odd: it's not basic HTTP auth,
it doesn't tie in to Kerberos, and there are some odd hooks in place that
assume you're part of the U. Buffalo system. However, it works.
> Tuning:
> - S-GAE needs huge /tmp space and may fail subtly unless you are
> careful about watching the logs
> - For a cluster that does between 1-2 million jobs a month we need a
> 100GB /tmp partition to run metrics
XDMoD does all of this in the database. It "shreds" the raw log files,
and stuffs them into the DB directly. Those records are then ingested
from one DB to another DB for aggregation and storage. (There are 6
different databases for XDMoD, which is a bit odd. The schemas are
fairly sane though.)
> When running on the Amazon cloud doing a one-off analysis on an
> accounting file from a client, I've found that I could make things go
> far, far faster by:
> - Running on a spot node with lots of memory
> - Carving a ramdisk out of some of the RAM and mounting it as /ramdisk
> - Relocating the mysql database data/table files into /ramdisk
> - Applying some of the mysql tuning advice from google to the
> mysql.conf file
> - Keeping the accounting file under the /ramdisk/ path
Probably all useful for XDMoD as well, if it fits your environment. I
pull the accounting logs off a moderately powerful, moderately
overloaded NetApp. For running a nightly ingest, it's certainly fast
enough.
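FWIW, the ramdisk recipe above boils down to something like this. This is an untested ops sketch rather than anyone's actual script — it assumes root on a throwaway node, a stock MySQL datadir under /var/lib/mysql, and a systemd host; the tmpfs size is a guess, so adjust to taste:

```shell
mkdir -p /ramdisk
mount -t tmpfs -o size=64g tmpfs /ramdisk       # carve the ramdisk out of RAM
systemctl stop mysqld
rsync -a /var/lib/mysql/ /ramdisk/mysql/        # relocate the data/table files
sed -i 's|^datadir=.*|datadir=/ramdisk/mysql|' /etc/my.cnf
systemctl start mysqld
cp accounting /ramdisk/                         # keep the accounting file in RAM too
```

Everything in /ramdisk evaporates on reboot, which is fine for a one-off analysis but obviously not for durable metrics.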
--
Jesse Becker (Contractor)
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users