Re: [gridengine users] Intermittent commlib errors with MPI jobs

Reuti Wed, 07 Nov 2012 23:48:19 -0800

Hi,

On 08.11.2012 at 05:11, Brendan Moloney wrote:


> Hello,
> 
> I have MPICH2 tightly

Which version? It should work out-of-the-box with SGE.


> integrated with OGS 2011.11.  Everything is working great in general.  I have 
> noticed when I submit a moderate number of small MPI jobs (e.g. 100 jobs each 
> using two cores) that I will get intermittent commlib errors like:
> commlib error: got select error (Broken pipe)
> executing task of job 138060 failed: failed sending task to 
> [email protected]: can't find connection

This sounds like a network problem unrelated to SGE. Do you use a private 
network inside the cluster? Can you outline the network configuration, e.g. do 
you have a dedicated switch for the cluster?


> Sometimes I get "Connection reset by peer"

Which startup method do you use for the slave tasks, i.e. what is set in:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

Judging by the output you quoted above, it sounds like an SSH problem, so your 
settings could be different from these.
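
To compare quickly against the listing above, the output of `qconf -sconf` can be filtered for the six startup parameters. The following is only a minimal sketch: the function name `non_builtin_startup` is made up for illustration, and it assumes the parameter name is the first whitespace-separated field on each line, as in the listing above.

```shell
# Hypothetical helper (not part of SGE): reads `qconf -sconf` output on stdin
# and prints every remote-startup parameter whose value is not "builtin".
# Only the first word of the value is printed.
non_builtin_startup() {
    awk '$1 ~ /^(qlogin|rlogin|rsh)_(command|daemon)$/ && $2 != "builtin" {
        print $1 ": " $2
    }'
}
```

Usage would be `qconf -sconf | non_builtin_startup`. No output means all six entries are `builtin`. Any line printed names a parameter that points to an external command, such as an SSH binary.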


> instead of "Broken pipe". I have the allocation rule set to round robin, so 
> the second process is always spawned on a remote host.

For small jobs I would configure it to run on only one machine - unless they 
create large scratch files.
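
The allocation rule is part of the parallel environment (PE) definition, so it can be read from `qconf -sp`. A minimal sketch follows. The PE name `mpich2` in the usage line is an assumption (`qconf -spl` lists the names actually defined), and the helper name `pe_allocation_rule` is made up for illustration.

```shell
# Hypothetical helper (not part of SGE): reads `qconf -sp <pe_name>` output on
# stdin and prints the value of the allocation_rule attribute.
pe_allocation_rule() {
    awk '$1 == "allocation_rule" { print $2 }'
}
```

Usage would be `qconf -sp mpich2 | pe_allocation_rule`, which should print `$round_robin` for the setup described. To keep all slots of a job on one host, the PE can be edited with `qconf -mp mpich2` and the line changed to `allocation_rule $pe_slots`.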

-- Reuti


> The cluster is small, just four servers (72 cores) on gigabit ethernet. The 
> master spool is on NFS while the local spool is on a local drive. 
> 
> Any advice on how to debug this would be greatly appreciated.
> 
> Thanks!
> Brendan
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


