Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Ethan Deneault

Prentice Bisbal wrote:

Ashley Pittman wrote:

This smacks of a firewall issue. I thought you'd said you weren't using one, but 
reading back through your emails I can't see anywhere that you say so.  Are you 
running a firewall or any iptables rules on any of the nodes?  It looks to me 
like you may have some firewall setup on the worker nodes.

Ashley.



I agree with Ashley. To make sure it's not an iptables or SELinux
problem on one of the nodes, run these two commands on all the nodes and
then try again:

service iptables stop
setenforce 0




This fix worked. Delving deeper, it turns out that there was a typo in the iptables file for the 
nodes: they were accepting all traffic on eth1 instead of eth0. Only the master has an eth1 port. 
When I checked the tables earlier, I didn't notice the discrepancy.
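
For the record, the offending line in /etc/sysconfig/iptables on the node image looked roughly 
like this (chain name is just the RHEL 5 default; I'm paraphrasing the rule):

-A RH-Firewall-1-INPUT -i eth1 -j ACCEPT

when it should have matched the interface the nodes actually have:

-A RH-Firewall-1-INPUT -i eth0 -j ACCEPT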


Thank you all so much!

Cheers,
Ethan



--
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555


Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Prentice Bisbal
Ashley Pittman wrote:
> This smacks of a firewall issue. I thought you'd said you weren't using one, 
> but reading back through your emails I can't see anywhere that you say so.  Are 
> you running a firewall or any iptables rules on any of the nodes?  It looks 
> to me like you may have some firewall setup on the worker nodes.
> 
> Ashley.
> 

I agree with Ashley. To make sure it's not an iptables or SELinux
problem on one of the nodes, run these two commands on all the nodes and
then try again:

service iptables stop
setenforce 0
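
(Both of those only last until the next reboot, which is enough for a quick test. If the
firewall does turn out to be the culprit, the usual RHEL/SL follow-up would be something like:

chkconfig iptables off     # or better: fix the rules in /etc/sysconfig/iptables
setenforce 1               # re-enable SELinux once you've ruled it out

but adjust to your own setup.)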


-- 
Prentice


Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Ashley Pittman

This smacks of a firewall issue. I thought you'd said you weren't using one, but 
reading back through your emails I can't see anywhere that you say so.  Are you 
running a firewall or any iptables rules on any of the nodes?  It looks to me 
like you may have some firewall setup on the worker nodes.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Ethan Deneault

Rolf vandeVaart wrote:

Ethan:

Can you run just "hostname" successfully?  In other words, a non-MPI program.
If that does not work, then we know the problem is in the runtime.  If it does
work, then there is something wrong with the way the MPI library is setting up
its connections.


Interesting. I did not try this.

From the master:
$ mpirun -debug-daemons -host merope,asterope -np 2 hostname
asterope
merope

$ mpirun -host merope,asterope,electra -np 3 hostname
asterope
merope

(hangs)

$ mpirun -host electra,asterope,merope -np 3 hostname
asterope
electra

(hangs)

I cannot get 3 nodes to work together. Each node does work when paired with one other node. I can 
get three -processes- to work if I include the master:


$ mpirun -host pleiades,electra,asterope -np 3 hostname
pleiades
electra
asterope

But 4 processes do not:

$ mpirun -host pleiades,electra,asterope,merope -np 4 hostname
pleiades
electra
asterope

(hangs)


Is there more than one interface on the nodes?


Each node only has eth0, and a static DHCP address.

Is there something in the way that I have the nodes set up? They boot via PXE from an image on the 
master, so they should all have the same basic filesystem.
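
One more thing I can try, in case interface selection is part of the problem: pin Open MPI 
to eth0 explicitly (MCA parameter names as given in the Open MPI FAQ), e.g.

$ mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
      -host merope,asterope,electra -np 3 hostname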


Cheers,
Ethan









Rolf

On 09/21/10 14:41, Ethan Deneault wrote:

Prentice Bisbal wrote:



I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)


Yes. I am able to log in remotely to all nodes from the master, and to 
each node from each node without a password. Each node mounts the same 
/home directory from the master, so they have the same copy of all the 
ssh and rsh keys.



This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but that
whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.

> 3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program, run from the master (pleiades) with any combination 
of 3 other nodes, hangs during communication. This happens whether I use 
--machinefile or -host; e.g.:


$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node   1 : Hello world
 node   0 : Hello world
 node   2 : Hello world


2. Run the mpirun command from a different host. I'd try running it from
several different hosts.


The mpirun command does not seem to work when launched from one of the 
nodes. As an example:


Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)


I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.


The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope



Cheers,
Ethan







--
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555


Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Rolf vandeVaart

Ethan:

Can you run just "hostname" successfully?  In other words, a non-MPI program.
If that does not work, then we know the problem is in the runtime.  If it does
work, then there is something wrong with the way the MPI library is setting up
its connections.


Is there more than one interface on the nodes?

Rolf

On 09/21/10 14:41, Ethan Deneault wrote:

Prentice Bisbal wrote:



I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)


Yes. I am able to log in remotely to all nodes from the master, and to 
each node from each node without a password. Each node mounts the same 
/home directory from the master, so they have the same copy of all the 
ssh and rsh keys.



This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but that
whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.

> 3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program, run from the master (pleiades) with any combination 
of 3 other nodes, hangs during communication. This happens whether I use 
--machinefile or -host; e.g.:


$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node   1 : Hello world
 node   0 : Hello world
 node   2 : Hello world


2. Run the mpirun command from a different host. I'd try running it from
several different hosts.


The mpirun command does not seem to work when launched from one of the 
nodes. As an example:


Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)


I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.


The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope



Cheers,
Ethan





Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Ethan Deneault

Prentice Bisbal wrote:



I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)


Yes. I am able to log in remotely to all nodes from the master, and to each node from each node 
without a password. Each node mounts the same /home directory from the master, so they have the same 
copy of all the ssh and rsh keys.



This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but that
whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.

> 3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program, run from the master (pleiades) with any combination of 3 other nodes, hangs during 
communication. This happens whether I use --machinefile or -host; e.g.:


$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node   1 : Hello world
 node   0 : Hello world
 node   2 : Hello world


2. Run the mpirun command from a different host. I'd try running it from
several different hosts.


The mpirun command does not seem to work when launched from one of the nodes. 
As an example:

Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)


I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.


The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope



Cheers,
Ethan

--
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555


Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Gus Correa


Prentice Bisbal wrote:

Ethan Deneault wrote:

All,

I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
/usr/lib/openmpi/1.4-gcc/ directory. I know this is typically
/opt/openmpi, but Red Hat does things differently. I have my PATH and
LD_LIBRARY_PATH set correctly, since the test program does compile and
run.

The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is
an AMD x86_64 machine which serves the diskless node images and /home as
an NFS mount. I compile all of my programs as 32-bit.

My code is a simple hello world:
$ more test.f
  program test

  include 'mpif.h'
  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  print*, 'node', rank, ': Hello world'
  call MPI_FINALIZE(ierror)
  end

If I run this program with:

$ mpirun --machinefile testfile ./test.out
 node   0 : Hello world
 node   2 : Hello world
 node   1 : Hello world

This is the expected output. Here, testfile contains the master node:
'pleiades', and two slave nodes: 'taygeta' and 'm43'

If I add another machine to testfile, say 'asterope', it hangs until I
ctrl-c it. I have tried every machine, and as long as I do not include
more than 3 hosts, the program will not hang.

I have run the debug-daemons flag with it as well, and I don't see what
is wrong specifically.



I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)

This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but that
whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.

2. Run the mpirun command from a different host. I'd try running it from
several different hosts.

3. Change your machinefile to include 4 completely different hosts.

I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.



Hi Ethan

What your program prints is the process rank, not the host name.
To make sure all nodes are responding, you can try this:

http://www.open-mpi.org/faq/?category=running#mpirun-host
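
For example (host names taken from your earlier mails), something like

mpirun -host pleiades,taygeta,m43,asterope -np 4 hostname

will show which machines actually answer.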

For the hostfile/machinefile structure,
including the number of slots/cores/processors, see "man mpiexec".
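
A machinefile with explicit slot counts looks roughly like this (your host names, counts
purely illustrative):

pleiades slots=2
taygeta slots=1
m43 slots=1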

The Open MPI FAQ has answers for many of these initial setup questions.
It is worth taking a look.

I hope it helps,
Gus Correa



Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Prentice Bisbal
Ethan Deneault wrote:
> All,
> 
> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically
> /opt/openmpi, but Red Hat does things differently. I have my PATH and
> LD_LIBRARY_PATH set correctly, since the test program does compile and
> run.
> 
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is
> an AMD x86_64 machine which serves the diskless node images and /home as
> an NFS mount. I compile all of my programs as 32-bit.
> 
> My code is a simple hello world:
> $ more test.f
>   program test
> 
>   include 'mpif.h'
>   integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
> 
>   call MPI_INIT(ierror)
>   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>   print*, 'node', rank, ': Hello world'
>   call MPI_FINALIZE(ierror)
>   end
> 
> If I run this program with:
> 
> $ mpirun --machinefile testfile ./test.out
>  node   0 : Hello world
>  node   2 : Hello world
>  node   1 : Hello world
> 
> This is the expected output. Here, testfile contains the master node:
> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
> 
> If I add another machine to testfile, say 'asterope', it hangs until I
> ctrl-c it. I have tried every machine, and as long as I do not include
> more than 3 hosts, the program will not hang.
> 
> I have run the debug-daemons flag with it as well, and I don't see what
> is wrong specifically.
> 

I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)

This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but that
whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.

2. Run the mpirun command from a different host. I'd try running it from
several different hosts.

3. Change your machinefile to include 4 completely different hosts.

I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.

-- 
Prentice


Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread ETHAN DENEAULT
David, 

I did try that after I sent the original mail, but the -np 4 flag doesn't fix 
the problem; the program still hangs. I've also double-checked the iptables rules for 
the image and for the master node, and all ports are set to accept. 
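
For what it's worth, I'm checking the rules on each image with "iptables -L -n"; I can also 
try the verbose listing, which shows the interface each rule matches:

iptables -L -n -v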

Cheers, 
Ethan

--
Dr. Ethan Deneault
Assistant Professor of Physics
SC 234
University of Tampa
Tampa, FL 33606



-Original Message-
From: users-boun...@open-mpi.org on behalf of David Zhang
Sent: Mon 9/20/2010 9:58 PM
To: Open MPI Users
Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or 
more nodes.
 
I don't know if this will help, but try
mpirun --machinefile testfile -np 4 ./test.out
for running 4 processes

On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault  wrote:

> All,
>
> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi,
> but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set
> correctly, since the test program does compile and run.
>
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an
> AMD x86_64 machine which serves the diskless node images and /home as an NFS
> mount. I compile all of my programs as 32-bit.
>
> My code is a simple hello world:
> $ more test.f
>  program test
>
>  include 'mpif.h'
>  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>
>  call MPI_INIT(ierror)
>  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>  print*, 'node', rank, ': Hello world'
>  call MPI_FINALIZE(ierror)
>  end
>
> If I run this program with:
>
> $ mpirun --machinefile testfile ./test.out
>  node   0 : Hello world
>  node   2 : Hello world
>  node   1 : Hello world
>
> This is the expected output. Here, testfile contains the master node:
> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
>
> If I add another machine to testfile, say 'asterope', it hangs until I
> ctrl-c it. I have tried every machine, and as long as I do not include more
> than 3 hosts, the program will not hang.
>
> I have run the debug-daemons flag with it as well, and I don't see what is
> wrong specifically.
>
> Working output: pleiades (master) and 2 nodes.
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon [[46344,0],2] checking in as pid 2140 on host m43
> Daemon [[46344,0],2] not using static ports
> [m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
> [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
> [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
> Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
> Daemon [[46344,0],1] not using static ports
> [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
> [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local
> proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc
> [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local
> proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: 

Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-20 Thread David Zhang
I don't know if this will help, but try
mpirun --machinefile testfile -np 4 ./test.out
for running 4 processes

On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault  wrote:

> All,
>
> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi,
> but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set
> correctly, since the test program does compile and run.
>
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an
> AMD x86_64 machine which serves the diskless node images and /home as an NFS
> mount. I compile all of my programs as 32-bit.
>
> My code is a simple hello world:
> $ more test.f
>  program test
>
>  include 'mpif.h'
>  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>
>  call MPI_INIT(ierror)
>  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>  print*, 'node', rank, ': Hello world'
>  call MPI_FINALIZE(ierror)
>  end
>
> If I run this program with:
>
> $ mpirun --machinefile testfile ./test.out
>  node   0 : Hello world
>  node   2 : Hello world
>  node   1 : Hello world
>
> This is the expected output. Here, testfile contains the master node:
> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
>
> If I add another machine to testfile, say 'asterope', it hangs until I
> ctrl-c it. I have tried every machine, and as long as I do not include more
> than 3 hosts, the program will not hang.
>
> I have run the debug-daemons flag with it as well, and I don't see what is
> wrong specifically.
>
> Working output: pleiades (master) and 2 nodes.
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon [[46344,0],2] checking in as pid 2140 on host m43
> Daemon [[46344,0],2] not using static ports
> [m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
> [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
> [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
> Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
> Daemon [[46344,0],1] not using static ports
> [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
> [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local
> proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc
> [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local
> proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
>  node   0 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
>  node   2 : Hello world
>  node   1 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] 

[OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-20 Thread Ethan Deneault

All,

I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the /usr/lib/openmpi/1.4-gcc/ 
directory. I know this is typically /opt/openmpi, but Red Hat does things differently. I have my 
PATH and LD_LIBRARY_PATH set correctly, since the test program does compile and run.
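
Roughly, the relevant environment looks like this (bin/ and lib/ under the install prefix; 
exact subdirectories may differ):

export PATH=/usr/lib/openmpi/1.4-gcc/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib/openmpi/1.4-gcc/lib:$LD_LIBRARY_PATH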


The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an AMD x86_64 machine which 
serves the diskless node images and /home as an NFS mount. I compile all of my programs as 32-bit.
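
The build step is essentially (compiler wrapper name from memory; any of the Fortran wrappers 
would do, with whatever 32-bit flags your setup needs):

$ mpif77 -m32 -o test.out test.f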


My code is a simple hello world:
$ more test.f
  program test

  include 'mpif.h'
  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  print*, 'node', rank, ': Hello world'
  call MPI_FINALIZE(ierror)
  end

If I run this program with:

$ mpirun --machinefile testfile ./test.out
 node   0 : Hello world
 node   2 : Hello world
 node   1 : Hello world

This is the expected output. Here, testfile contains the master node: 'pleiades', and two slave 
nodes: 'taygeta' and 'm43'


If I add another machine to testfile, say 'asterope', it hangs until I ctrl-c it. I have tried every 
machine, and as long as I do not include more than 3 hosts, the program will not hang.


I have run the debug-daemons flag with it as well, and I don't see what is 
wrong specifically.

Working output: pleiades (master) and 2 nodes.

$ mpirun --debug-daemons --machinefile testfile ./test.out
Daemon was launched on m43 - beginning to initialize
Daemon was launched on taygeta - beginning to initialize
Daemon [[46344,0],2] checking in as pid 2140 on host m43
Daemon [[46344,0],2] not using static ports
[m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
[pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
[pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
[pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
[pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
[m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
[m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
[m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
[m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
Daemon [[46344,0],1] not using static ports
[taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
[taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
[taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
[taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
[taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
[pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local proc 
[[46344,1],0]
[m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc 
[[46344,1],2]
[taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local proc 
[[46344,1],1]
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
[taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
[m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
[taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
[m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
 node   0 : Hello world
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
 node   2 : Hello world
 node   1 : Hello world
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
[taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
[m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
[m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
[pleiades:19178] [[46344,0],0] orted_recv: received sync from local proc