Hi,

On 08.11.2012 at 05:11, Brendan Moloney wrote:

> Hello,
>
> I have MPICH2 tightly

Which version? It should work out-of-the-box with SGE.

> integrated with OGS 2011.11. Everything is working great in general. I have
> noticed when I submit a moderate number of small MPI jobs (e.g. 100 jobs each
> using two cores) that I will get intermittent commlib errors like:
> commlib error: got select error (Broken pipe)
> executing task of job 138060 failed: failed sending task to
> [email protected]: can't find connection

This sounds like a network problem unrelated to SGE. Do you use a private network inside the cluster, or can you outline the network configuration? Do you have a dedicated switch for the cluster?

> Sometimes I get "Connection reset by peer"

Which startup method do you use for the slave tasks, i.e.:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

Judging by the output you mentioned above, it sounds like an SSH problem, so your settings could be different from these.

> instead of "Broken pipe". I have the allocation rule set to round robin, so
> the second process is always spawned on a remote host.

For small jobs I would configure it to run on only one machine, unless they create large scratch files.

-- Reuti

> The cluster is small, just four servers (72 cores) on gigabit ethernet. The
> master spool is on NFS while the local spool is on a local drive.
>
> Any advice on how to debug this would be greatly appreciated.
>
> Thanks!
> Brendan
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
