On Tue, Sep 11, 2012 at 06:01:43PM -0400, Brodie, Kent wrote:
Like I hinted at, I am starting to figure out what I can use to intelligently tell me 
what's going on inside the switch.   It's the classic problem of, "I'm the only unix 
sysadmin here and I really don't have time!".   I appreciate the tips about the 
switch monitoring; I think I'll start there.

I completely understand, and you probably have the sympathy of the
majority of the denizens of the mailing list.

(I'm already monitoring the NODES closely.   Have not seen any real issue or 
resource shortage or errors or...   ).

The per-node monitoring is really useful, as you can use it to get
high-level in/out metrics.  Make sure you are collecting network
traffic metrics (both packets/sec and bytes/sec), and also make sure
that you get per-interface breakouts, if possible.  E.g. you don't
want backup traffic going out of your ethernet interfaces to
interfere with IB traffic metrics.
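As a concrete sketch of what "bytes/sec per interface" means in
practice: you sample a raw byte counter twice and divide by the
interval.  The counter values and interval below are made up for
illustration; on a real node you'd read them from /proc/net/dev (or
per-port SNMP counters on the switch) at a fixed polling interval:

```shell
#!/bin/sh
# Sketch: turn two raw byte-counter samples into a bandwidth figure.
# BYTES_T0/BYTES_T1 are hypothetical readings of one interface's
# rx-bytes counter, taken INTERVAL seconds apart.
BYTES_T0=123456789      # counter at time t0 (hypothetical)
BYTES_T1=128456789      # same counter 10 seconds later (hypothetical)
INTERVAL=10             # seconds between samples

DELTA=$((BYTES_T1 - BYTES_T0))
BPS=$((DELTA / INTERVAL))            # bytes/sec
MBITS=$((BPS * 8 / 1000000))         # megabits/sec, rounded down
echo "${BPS} bytes/sec (~${MBITS} Mbit/s)"
```

Tools like collectl, Ganglia, Cacti, and MRTG are all doing this same
delta-over-interval computation under the hood.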

If you don't already know about 'collectl', take a look.  It works on
its own, and with Ganglia as well.
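For what it's worth, the invocations I'd start with look something
like the following.  The subsystem flags are from collectl's man page
as I remember it, so double-check them against your installed version;
the guard just keeps the snippet harmless on a box without collectl:

```shell
#!/bin/sh
# Hedged sketch: a first look at network traffic with collectl.
# -s selects subsystems (n = network summary, N = per-interface
# detail, x = interconnect such as InfiniBand); -i is the sample
# interval in seconds, -c the number of samples.  Verify these
# flags against your collectl version before relying on them.
if command -v collectl >/dev/null 2>&1; then
    HAVE_COLLECTL=yes
    collectl -sn -i 1 -c 1      # one quick network-summary sample
else
    HAVE_COLLECTL=no
    echo "collectl not installed on this host"
fi
```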

For the switches, if you don't already have something in place, take a
look at either Cacti or MRTG (with the "routers2.cgi" front end).  Both
will slurp up SNMP data from your switch and make useful charts out of
the data.
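One thing worth knowing about SNMP charting: the classic 32-bit
ifInOctets/ifOutOctets counters wrap at 2^32 octets, which a busy
10GbE (or IB) port can do in a few seconds, so high-speed ports want
the 64-bit ifHC* counters.  Cacti and MRTG handle the wrap for you,
but the arithmetic is simple enough to sketch (the counter values
below are hypothetical):

```shell
#!/bin/sh
# Sketch: delta between two successive readings of a 32-bit SNMP
# octet counter, accounting for one wraparound at 2^32.
PREV=4294900000     # previous poll (hypothetical reading)
CURR=100000         # current poll -- smaller, so the counter wrapped
MAX=4294967296      # 2^32, where 32-bit counters roll over

if [ "$CURR" -lt "$PREV" ]; then
    # Counter wrapped: count up to the rollover, then from zero.
    DELTA=$((MAX - PREV + CURR))
else
    DELTA=$((CURR - PREV))
fi
echo "delta=${DELTA} octets"
```

This is also why your polling interval matters: poll too slowly and a
fast port can wrap a 32-bit counter more than once between samples,
which no tool can recover from.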

The followup question I have is, my qmaster host is the NFS server for the SGE 
management.     The DATA the jobs require are all on our nfs appliance (Isilon, 
in our case).     Would you recommend I re-do my cluster to store ALL sge 
goodies on the nfs appliance?   I didn't think grid engine beat up the NFS 
qmaster too much but then again.......

We use Isilons as well (and love them).  We actually run our SGE files
off an ancient, and (now) stupidly underpowered and overworked Netapp
3050.  We don't have any performance issues with SGE, and the entire thing
is shared across our cluster (~50 compute nodes, ~1000 cores).  While we
don't push massive numbers of jobs through the system, it isn't trivial
either--although many of our larger jobs are moderately sized array jobs.
Our large Illumina pipeline jobs run 32-way on a single compute node.


--
Jesse Becker
NHGRI Linux support (Digicon Contractor)
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
