Hello, I have MPICH2 tightly integrated with OGS 2011.11. Everything is working great in general. However, I have noticed that when I submit a moderate number of small MPI jobs (e.g. 100 jobs, each using two cores), I get intermittent commlib errors like:
commlib error: got select error (Broken pipe)
executing task of job 138060 failed: failed sending task to [email protected]: can't find connection

Sometimes I get "Connection reset by peer" instead of "Broken pipe". I have the allocation rule set to round robin, so the second process is always spawned on a remote host. The cluster is small, just four servers (72 cores) on gigabit ethernet. The master spool is on NFS while the local spool is on a local drive.

Any advice on how to debug this would be greatly appreciated.

Thanks!
Brendan
