On Tue, Sep 11, 2012 at 05:30:14PM -0400, Brodie, Kent wrote:
> Hi all. OK, so in the time I disappeared for days to check on why I was
> getting very frustrating “commlib” errors, I *think* I finally have an
> answer, sort of. But behind this is a critical question I’m seeking an
> opinion on. In my case, the software (CASAVA from Illumina, Inc.) runs
> under SGE. Swell. So far, so good. But after several hours, frequent
> “commlib” errors with really no other errors or info. Just a connection
> reset.
We run much the same thing, plus random other jobs thrown in for good
measure.
> Well, when I finally had our researcher “scale things back” a bit, the
> job actually ran to completion. <-- light bulb went off here.
> The difference? The number of simultaneous jobs PER HOST. At least,
> that’s my working theory.
It's easy to make a single node fall over with these jobs. We pretty
closely control what jobs run where to avoid it.
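For the "control what runs where" part, a minimal sketch of the usual Grid
Engine approach: cap the slot count per execution host. The host name
(node01) and the limit of 8 are placeholders, not a recommendation.

```shell
# Cap a single execution host at 8 slots, regardless of core count
# (node01 is a placeholder host name):
qconf -mattr exechost complex_values slots=8 node01

# Or cap every host at once with a resource quota set; `qconf -arqs`
# opens an editor on a rule shaped like this:
#   {
#     name         per_host_slots
#     enabled      TRUE
#     limit        hosts {*} to slots=8
#   }
```

The slots complex is consumed by every running job on the host, so the
scheduler simply stops dispatching there once the cap is reached.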
> So, we have a dedicated 10-gig network behind our cluster, but I think
> I’m running into some potential network limitations. I am working on
> ways to confirm that.
We have a mixed 1 and 10G network on our cluster. It works fine. It
also worked fine (albeit more slowly) with only 1G networking.
> *Here’s my question*: In our case, we have nodes that have 8, 12, and
> even 24 cores. A much different world than not too long ago, when
> everything was single-, dual-, or (at most) quad-core. And while I have
> no doubt the boxes can handle the CPU load of many jobs, I think I’m
> hitting network limitations and stuff is getting dropped. Can anyone
> here share opinions or experiences when it comes to “max simultaneous
> jobs per execution host” as it relates to networking?
> I’d love to hear any insight on this.
It sounds like you need to start watching metrics on your cluster. I
suggest Ganglia for the compute nodes, and then either Cacti or MRTG for
looking at the switch itself (or, if you can export sFlow packets, recent
versions of Ganglia can handle those too...).
This will tell you whether packets really are getting dropped or not.
The other usual suggestion is to check the various kernel counters on
the nodes and on the NFS system as well.
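For those kernel counters, a quick sketch of where to look on a Linux
node (the interface name eth0 is a placeholder; adjust for your hardware):

```shell
# Interface-level drop counters straight from the kernel.
# In /proc/net/dev, after the interface name, fields 2-9 are RX stats
# (rx_drop is the 4th) and fields 10-17 are TX (tx_drop is the 4th).
awk 'NR > 2 { gsub(":", ""); print $1, "rx_drop=" $5, "tx_drop=" $13 }' /proc/net/dev

# TCP-level trouble: retransmits, listen-queue overflows, resets.
netstat -s | egrep -i 'retrans|overflow|reset'

# NIC/driver counters, if the driver exposes them (eth0 is a placeholder):
ethtool -S eth0 | egrep -i 'drop|discard|err'

# NFS client retransmissions point at a struggling server or network:
nfsstat -rc
```

Nonzero drop or retransmit counters that climb while the jobs run are the
smoking gun; flat counters mean the problem is probably elsewhere.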
When we have had nodes blow up, it tends to be for one of two main
reasons:
1) Memory overuse, and the node blows up because the OOM killer
misfired.
2) The load becomes so high that we get into an NFS deadlock of some
sort. This usually happens when the NFS *server* (another Linux box)
in question is hammered, and needs a break (and reboot). We've never
seen this with our appliance systems (NetApp, Isilon, Sun).
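For case 1, the kernel ring buffer is the first place to check whether the
OOM killer fired; a minimal sketch (log path may differ by distro):

```shell
# The OOM killer logs its kills to the kernel ring buffer:
dmesg | egrep -i 'out of memory|oom[-_ ]kill'

# The same lines usually persist in syslog across reboots
# (/var/log/messages is a placeholder; some distros use /var/log/syslog):
grep -i 'out of memory' /var/log/messages
```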
> Thanks all, --Kent
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
--
Jesse Becker
NHGRI Linux support (Digicon Contractor)
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users