Hi all.   OK, so after disappearing for days to figure out why I was 
getting very frustrating "commlib" errors, I *think* I finally have an answer, 
sort of.   But behind it is a critical question I'd like opinions on.

In my case, the software (CASAVA from Illumina, Inc.) runs under SGE.  Swell.   
So far, so good.   But after several hours, I get frequent "commlib" errors 
with really NO other errors or info.  Just a connection reset.

Well, when I finally had our researcher "scale things back" a bit, the job 
actually ran to completion.    <-- light bulb went off here.

The difference?    Number of simultaneous jobs PER HOST.   At least, that's my 
working theory.

So, we have a dedicated 10-gig network behind our cluster, but I think I'm 
running into network limitations anyway.   I am working on ways to 
confirm that.
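
For what it's worth, here is a rough sketch of one check I could run on a 
Linux exec host: read the per-interface error and drop counters from 
/proc/net/dev before and after a run and compare them.  The function names 
are mine, purely for illustration, and the idea that these counters will show 
the problem is an assumption, not a confirmed diagnosis.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: error/drop counters}."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 12:
            continue
        # Receive columns come first (bytes, packets, errs, drop, ...),
        # transmit columns start at index 8.
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters


def read_net_dev(path="/proc/net/dev"):
    """Read and parse the live counters on a Linux host."""
    with open(path) as fh:
        return parse_net_dev(fh.read())


def diff_counters(before, after):
    """Return only the counters that changed between two snapshots."""
    changes = {}
    for iface, now in after.items():
        prev = before.get(iface)
        if prev is None:
            continue
        delta = {k: now[k] - prev[k] for k in now if now[k] != prev[k]}
        if delta:
            changes[iface] = delta
    return changes
```

The plan would be to call read_net_dev() on a node before the jobs start and 
again after a failure, then pass both snapshots to diff_counters().  Growing 
drop counts on the cluster-facing interface would at least be consistent with 
my theory; flat counters would suggest looking elsewhere.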

*Here's my question*:   In our case, we have nodes with 8, 12 and even 24 
cores.     That's a much different world than not too long ago, when everything 
was single, dual or (max) quad core.   And while I have no doubt the boxes can 
handle the CPU load of many jobs, I think I'm hitting network limitations and 
stuff is getting dropped.    Can anyone here share opinions, experiences, 
etc. when it comes to "max simultaneous jobs per execution host" as it relates 
to networking?

I'd love to hear any insight on this.

Thanks all, --Kent

