Hi Ralph,
if what you say is true, I don't understand why a job run on grid01 and
grid03 works properly. They are on different networks, just like grid03 and
grid04.
But if I run the same job on grid03 and grid04, it fails.

If it is a network problem as you say, I don't think it is about
reachability, because I can trace the network traffic and see that grid03 and
grid04 communicate when I run the job.

Alex

On Mar 26, 2009, at 10:59 AM, Alessandro Surace wrote:

> Hi Ralph,
> what do you mean by creating/defining a direct interface?
>
> The 3 hosts are network connected and ssh public-key enabled. Every
> host can see the others, but they are not all on the same directly
> connected network. In more detail:
> grid01 and grid04 are on the same network
> grid03 is on a different network.

This is the problem. If grid03 is on a different network, then there
is no way that an MPI process on that node can directly communicate
with one on grid04 or grid01. Grid03 must have a common network
interface with each of the machines, though it can be different for
each one.

In other words, grid03 and grid01 -must- have at least one network in
common. And grid04 and grid03 must also share at least one network,
though it can be different from the one that grid03 and grid01 share.
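
As a rough illustration of the "network in common" condition, here is a
minimal Python sketch. The IP addresses and prefix lengths are hypothetical
(the thread does not give the real ones), and shared_networks is a helper
made up for this example. It only compares subnets and is not a description
of what Open MPI checks internally.

```python
import ipaddress

def shared_networks(ifaces_a, ifaces_b):
    """Return the IP networks that appear in both hosts' interface lists."""
    nets_a = {ipaddress.ip_interface(i).network for i in ifaces_a}
    nets_b = {ipaddress.ip_interface(i).network for i in ifaces_b}
    return nets_a & nets_b

# Hypothetical interface addresses, chosen only to match the layout
# described in the thread: grid01 and grid04 share a subnet, grid03 does not.
hosts = {
    "grid01": ["192.168.1.11/24"],
    "grid03": ["10.0.0.13/24"],
    "grid04": ["192.168.1.14/24"],
}

# Report, for every pair of hosts, whether they share at least one subnet.
names = sorted(hosts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        common = shared_networks(hosts[a], hosts[b])
        status = "share " + ", ".join(map(str, common)) if common else "share no network"
        print(f"{a} and {b} {status}")
```

With these made-up addresses, only grid01 and grid04 have a subnet in
common. Giving grid03 a second interface on that subnet would satisfy the
condition for both pairs.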

Does that help clarify?

Ralph
