Hi Ralph,

if what you say is true, I don't understand why a job on grid01 and grid03 runs properly. They are on different networks, just like grid03 and grid04. But if I run the same job on grid03 and grid04, it fails.
If it is a network problem, as you say, I don't think it is about reachability, because I can trace the network traffic and see that grid03 and grid04 communicate when I run the job.

Alex

On Mar 26, 2009, at 10:59 AM, Alessandro Surace wrote:

> Hi Ralph,
> what do you mean by creating/defining a direct interface?
>
> The 3 hosts are network connected and ssh pub key enabled. Every
> host can see the others, but they are not all on the same directly
> connected network. More in detail:
> grid01 and grid04 are in the same network
> grid03 is on a different network.

This is the problem. If grid03 is on a different network, then there is no way that an MPI process on that node can directly communicate with one on grid04 or grid01.

Grid03 must have a common network interface with each of the machines, though it can be different for each one. In other words, grid03 and grid01 -must- have at least one network in common. And grid04 and grid03 must also share at least one network, though it can be different from the one that grid03 and grid01 share.

Does that help clarify?
Ralph
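
For anyone who wants to check the pairwise rule described above, here is a minimal Python sketch. The addresses in `HOST_INTERFACES` are made up for illustration (the thread does not give the real ones), and the function names are hypothetical. It only compares directly attached subnets; it knows nothing about routing, so it cannot tell whether two hosts can reach each other through a gateway.

```python
import ipaddress
from itertools import combinations

# Hypothetical addresses for illustration only; the real ones are not given in the thread.
# They mirror the layout described: grid01 and grid04 share a network, grid03 does not.
HOST_INTERFACES = {
    "grid01": ["192.168.1.11/24"],
    "grid03": ["10.0.0.13/24"],
    "grid04": ["192.168.1.14/24"],
}

def networks(cidrs):
    """Return the set of subnets that a host's interfaces are attached to."""
    return {ipaddress.ip_interface(c).network for c in cidrs}

def pairs_without_shared_network(hosts):
    """List every pair of hosts that has no directly attached subnet in common."""
    missing = []
    for a, b in combinations(sorted(hosts), 2):
        if not (networks(hosts[a]) & networks(hosts[b])):
            missing.append((a, b))
    return missing

if __name__ == "__main__":
    for a, b in pairs_without_shared_network(HOST_INTERFACES):
        print(f"{a} and {b} share no directly attached network")
```

With these made-up addresses, both pairs that include grid03 are reported. Giving grid03 a second interface on the 192.168.1.0/24 network would clear both, since the shared network may differ from pair to pair but each pair needs at least one.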