On Tue, Sep 11, 2012 at 06:01:43PM -0400, Brodie, Kent wrote:
Like I hinted at, I'm still figuring out what I can use to intelligently tell me what's going on inside the switch. It's the classic problem of "I'm the only unix sysadmin here and I really don't have time!" I appreciate the tips about switch monitoring; I think I'll start there.
I completely understand, and you probably have the sympathy of the majority of the denizens of the mailing list.
(I'm already monitoring the NODES closely. Have not seen any real issue or resource shortage or errors or... ).
The per-node monitoring is really useful, as you can use it to get high-level in/out metrics. Make sure you are collecting network traffic metrics (both packets/sec and bytes/sec), and also make sure that you get per-interface breakouts if possible; i.e., you don't want backup traffic going out of your ethernet interfaces mixed into your IB traffic metrics. If you don't already know about 'collectl', take a look. It works on its own, and with Ganglia as well.

For the switches, if you don't already have something in place, take a look at either Cacti or MRTG (with the "routers2.cgi" front end). Both will slurp up SNMP data from your switch and make useful charts out of the data.
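For a quick sanity check before the full collectl/Cacti setup, something like the sketch below gives per-second, per-interface byte rates so ethernet and IB traffic stay separate. It assumes a typical Linux node with sysfs counters; the interface names eth0/ib0 are only examples, not anything from your site.

```shell
#!/bin/sh
# Rough sketch: sample per-interface byte counters twice and print
# per-second rates, keeping ethernet (e.g. backup) and IB traffic apart.
# Assumes Linux sysfs counters under /sys/class/net; eth0/ib0 are
# example interface names.

# per_sec OLD NEW SECONDS -> counter delta divided by the interval
per_sec() {
    echo $(( ($2 - $1) / $3 ))
}

# rate IFACE SECONDS -> print rx/tx bytes-per-second for one interface
rate() {
    iface=$1; interval=$2
    rx1=$(cat "/sys/class/net/$iface/statistics/rx_bytes")
    tx1=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
    sleep "$interval"
    rx2=$(cat "/sys/class/net/$iface/statistics/rx_bytes")
    tx2=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
    echo "$iface rx $(per_sec "$rx1" "$rx2" "$interval") B/s" \
         "tx $(per_sec "$tx1" "$tx2" "$interval") B/s"
}

# Only sample interfaces that actually exist on this node.
for iface in eth0 ib0; do
    if [ -d "/sys/class/net/$iface" ]; then
        rate "$iface" 2
    fi
done
```

Cacti and MRTG do the same kind of delta math against the switch side, polling the 64-bit SNMP counters (IF-MIB ifHCInOctets/ifHCOutOctets) per port, which is how you get the per-port breakouts on the switch itself.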
The followup question I have is: my qmaster host is also the NFS server for the SGE management files. The DATA the jobs require are all on our NFS appliance (Isilon, in our case). Would you recommend I re-do my cluster to store ALL the sge goodies on the NFS appliance? I didn't think grid engine beat up the NFS qmaster too much, but then again.......
We use Isilons as well (and love them). We actually run our SGE files off an ancient, and (now) stupidly underpowered and overworked NetApp 3050. We don't have any performance issues with SGE, and the entire thing is shared across our cluster (~50 compute nodes, ~1000 cores). While we don't push massive numbers of jobs through the system, it isn't trivial either--although many of our larger jobs are moderately sized array jobs. Our large Illumina pipeline jobs run 32-way on a single compute node.

--
Jesse Becker
NHGRI Linux support (Digicon Contractor)

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
