On Tue, Sep 11, 2012 at 05:30:14PM -0400, Brodie, Kent wrote:
> Hi all.   OK, so in the time I disappeared for days to check on why I was
> getting very frustrating “commlib” errors, I *think* I finally have an answer-
> sort of.   But behind this is a critical question I’m seeking an opinion on.
>
> In my case, the software (CASAVA from Illumina, Inc.) runs under SGE.  Swell.
> So far, so good.   But, after several hours, frequent “commlib” errors with
> really NO other errors or info.  Just a connection reset.

We run much the same thing, plus random other jobs thrown in for good
measure.

> Well, when I finally had our researcher “scale things back” a bit, the job
> actually ran to completion.    <-- light bulb went off here.
>
> The difference?    Number of simultaneous jobs PER HOST.   At least, that’s my
> working theory.

It's easy to make a single node fall over with these jobs.  We pretty
closely control what jobs run where to avoid it.
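
The main tool we use for that control is a resource quota set.  A sketch
along these lines (the rule name and the slot cap are made up -- tune to
your nodes) caps concurrent slots on any one exec host:

```
# qconf -arqs max_slots_per_host   (opens an editor on a rule like this)
{
   name         max_slots_per_host
   description  "Cap simultaneous job slots on any one exec host"
   enabled      TRUE
   limit        hosts {*} to slots=8
}
```

You can also set per-host `complex_values slots=N` on each exec host, but
one quota rule is easier to keep consistent across a mixed 8/12/24-core
cluster.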

> So, we have a dedicated 10-gig network behind our cluster, but I think I’m
> running into some potential network limitations.   I am working on ways to
> confirm that.

We have a mixed 1 and 10G network on our cluster.  It works fine.  It
also worked fine (albeit more slowly) with only 1G networking.

> *Here’s my question*:   In our case, we have nodes that have 8, 12 and even 24
> cores.     A much different world than not too long ago when everything was
> single or dual or (max) quad core.   And while I have no doubt the boxes can
> handle the cpu load of many jobs, I think I’m hitting network limitations and
> stuff is getting dropped.    Can anyone here speak to opinions, experiences,
> etc- when it comes to “max simultaneous jobs per execution host” as relates to
> networking?
>
> I’d love to hear any insight on this.

It sounds like you need to start watching metrics on your cluster.  I
suggest Ganglia for the compute nodes, and then either Cacti or MRTG for
looking at the switch itself (or, if your switch can emit sFlow packets,
recent versions of Ganglia can collect those too...).

This will tell you if there really are packets getting dropped or not.

The other usual suggestion applies as well: check the various kernel
counters on the nodes and on the NFS server.
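
Of those kernel counters, TCP retransmits are the first number I'd watch
for a "stuff is getting dropped" theory.  A sketch (the function name is
mine) that pulls RetransSegs out of /proc/net/snmp, where the first
`Tcp:` line names the fields and the second carries the values:

```shell
#!/bin/sh
# tcp_retrans FILE -- report the TCP RetransSegs counter from a
# /proc/net/snmp-style dump; a steadily climbing value suggests the
# network, not the application, is losing segments.
tcp_retrans() {
    awk '/^Tcp:/ {
        if (!hdr) {           # header line: find the RetransSegs column
            for (i = 2; i <= NF; i++) if ($i == "RetransSegs") col = i
            hdr = 1
        } else {              # value line: print that column
            print "RetransSegs=" $col
        }
    }' "$1"
}

# Demo: dump the live counter if we're on a Linux box.
[ -r /proc/net/snmp ] && tcp_retrans /proc/net/snmp
```

On the NFS side, `nfsstat -c` on the clients and `nfsstat -s` on the
server give you the equivalent RPC retransmit and badcall counters.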

When we have had nodes blow up, it tends to be for one of two main
reasons:
        1) Memory overuse, and the node blows up because the OOM killer
misfired.

        2) The load becomes so high that we get into an NFS deadlock of some
sort.  This usually happens when the NFS *server* (another Linux box)
in question is hammered, and needs a break (and reboot).  We've never
seen this with our appliance systems (NetApp, Isilon, Sun).
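
On point 1, it's worth confirming whether the OOM killer actually fired
before blaming the network.  A rough sketch (the log path is just an
example -- yours may be /var/log/syslog, and recent events also show up
in `dmesg`):

```shell
#!/bin/sh
# oom_check LOGFILE -- grep a kernel/syslog file for OOM-killer
# activity and say whether anything was found.
oom_check() {
    if grep -E 'invoked oom-killer|Out of memory' "$1"; then
        echo "OOM killer fired: suspect memory overuse on this node"
    else
        echo "no OOM events in $1"
    fi
}

# Demo against a common (but distro-dependent) syslog location.
[ -r /var/log/messages ] && oom_check /var/log/messages
```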


> Thanks all, --Kent



--
Jesse Becker
NHGRI Linux support (Digicon Contractor)
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
