All,

I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into /usr/lib/openmpi/1.4-gcc/. I know this is typically /opt/openmpi, but Red Hat does things differently. My PATH and LD_LIBRARY_PATH are set correctly, since the test program compiles and runs.
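
For reference, my shell environment looks roughly like this (assuming the standard bin/ and lib/ layout under that prefix):

$ export PATH=/usr/lib/openmpi/1.4-gcc/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/lib/openmpi/1.4-gcc/lib:$LD_LIBRARY_PATH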

The cluster consists of 10 diskless Intel Pentium 4 nodes. The master is an AMD x86_64 machine that serves the diskless node images and exports /home over NFS. I compile all of my programs as 32-bit.

My code is a simple hello world:
$ more test.f
      program test

      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print*, 'node', rank, ': Hello world'
      call MPI_FINALIZE(ierror)
      end
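
I compile it with something along these lines (the exact wrapper name may vary on other setups; the important part is the 32-bit build):

$ mpif77 -m32 -o test.out test.f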

If I run this program with:

$ mpirun --machinefile testfile ./test.out
 node           0 : Hello world
 node           2 : Hello world
 node           1 : Hello world

This is the expected output. Here, testfile contains the master node, 'pleiades', and two slave nodes, 'taygeta' and 'm43'.
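
For clarity, testfile is nothing exotic; it is just one hostname per line, along the lines of:

$ cat testfile
pleiades
taygeta
m43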

If I add a fourth machine to testfile, say 'asterope', the run hangs until I ctrl-c it. I have tried every machine: any combination of up to three hosts runs fine, and anything beyond three hangs.

I have also run it with the --debug-daemons flag, and I don't see specifically what is wrong.

Working output: pleiades (master) and 2 nodes:

$ mpirun --debug-daemons --machinefile testfile ./test.out
Daemon was launched on m43 - beginning to initialize
Daemon was launched on taygeta - beginning to initialize
Daemon [[46344,0],2] checking in as pid 2140 on host m43
Daemon [[46344,0],2] not using static ports
[m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
[pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
[pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
[pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
[pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
[m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
[m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
[m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
[m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
Daemon [[46344,0],1] not using static ports
[taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
[taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
[taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
[taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
[taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
[pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local proc [[46344,1],0]
[m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc [[46344,1],2]
[taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local proc [[46344,1],1]
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
[taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
[m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
[taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
[m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
 node           0 : Hello world
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
 node           2 : Hello world
 node           1 : Hello world
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
[taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
[m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
[m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
[pleiades:19178] [[46344,0],0] orted_recv: received sync from local proc [[46344,1],0]
[m43:02140] [[46344,0],2] orted_recv: received sync from local proc [[46344,1],2]
[taygeta:02317] [[46344,0],1] orted_recv: received sync from local proc [[46344,1],1]
[pleiades:19178] [[46344,0],0] orted_cmd: received waitpid_fired cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received iof_complete cmd
[m43:02140] [[46344,0],2] orted_cmd: received waitpid_fired cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received waitpid_fired cmd
[m43:02140] [[46344,0],2] orted_cmd: received iof_complete cmd
[taygeta:02317] [[46344,0],1] orted_cmd: received iof_complete cmd
[pleiades:19178] [[46344,0],0] orted_cmd: received exit
[taygeta:02317] [[46344,0],1] orted_cmd: received exit
[taygeta:02317] [[46344,0],1] orted: finalizing
[m43:02140] [[46344,0],2] orted_cmd: received exit
[m43:02140] [[46344,0],2] orted: finalizing

Not working output: pleiades (master) and 3 nodes:

$ mpirun --debug-daemons --machinefile testfile ./test.out
Daemon was launched on m43 - beginning to initialize
Daemon was launched on taygeta - beginning to initialize
Daemon was launched on asterope - beginning to initialize
Daemon [[46357,0],2] checking in as pid 2181 on host m43
Daemon [[46357,0],2] not using static ports
[m43:02181] [[46357,0],2] orted: up and running - waiting for commands!
Daemon [[46357,0],1] checking in as pid 2358 on host taygeta
Daemon [[46357,0],1] not using static ports
[taygeta:02358] [[46357,0],1] orted: up and running - waiting for commands!
[pleiades:19191] [[46357,0],0] node[0].name pleiades daemon 0 arch ffca0200
[pleiades:19191] [[46357,0],0] node[1].name taygeta daemon 1 arch ffca0200
[pleiades:19191] [[46357,0],0] node[2].name m43 daemon 2 arch ffca0200
[pleiades:19191] [[46357,0],0] node[3].name asterope daemon 3 arch ffca0200
[pleiades:19191] [[46357,0],0] orted_cmd: received add_local_procs
[taygeta:02358] [[46357,0],1] node[0].name pleiades daemon 0 arch ffca0200
[taygeta:02358] [[46357,0],1] node[1].name taygeta daemon 1 arch ffca0200
[m43:02181] [[46357,0],2] node[0].name pleiades daemon 0 arch ffca0200
[taygeta:02358] [[46357,0],1] node[2].name m43 daemon 2 arch ffca0200
[m43:02181] [[46357,0],2] node[1].name taygeta daemon 1 arch ffca0200
[m43:02181] [[46357,0],2] node[2].name m43 daemon 2 arch ffca0200
[m43:02181] [[46357,0],2] node[3].name asterope daemon 3 arch ffca0200
[m43:02181] [[46357,0],2] orted_cmd: received add_local_procs
[taygeta:02358] [[46357,0],1] node[3].name asterope daemon 3 arch ffca0200
[taygeta:02358] [[46357,0],1] orted_cmd: received add_local_procs
Daemon [[46357,0],3] checking in as pid 1965 on host asterope
Daemon [[46357,0],3] not using static ports
[asterope:01965] [[46357,0],3] orted: up and running - waiting for commands!
[pleiades:19191] [[46357,0],0] orted_recv: received sync+nidmap from local proc [[46357,1],0]
[m43:02181] [[46357,0],2] orted_recv: received sync+nidmap from local proc [[46357,1],2]
[pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
[m43:02181] [[46357,0],2] orted_cmd: received collective data cmd
[pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd

------------------
The output stops here and the job hangs.

After I kill the process, I get the following output:
------------------

Killed by signal 2.
Killed by signal 2.
--------------------------------------------------------------------------
A daemon (pid 19194) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

Killed by signal 2.
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
[pleiades:19191] [[46357,0],0] orted_cmd: received waitpid_fired cmd
[pleiades:19191] [[46357,0],0] orted_cmd: received iof_complete cmd
[pleiades:19191] [[46357,0],0] orted_cmd: received exit
mpirun: clean termination accomplished

I know that LD_LIBRARY_PATH is -not- to blame. /home/<user> is exported from the master to every machine, and every node boots from the same image (and therefore has identical paths). If the library path were the problem, the program would not run at all.
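
For what it's worth, this is the kind of quick check I would use to confirm that a node can resolve the daemon and the shared libraries (hostname and path here are only illustrative):

$ ssh asterope which orted       # confirm the orted daemon is on the remote PATH
$ ssh asterope ldd ./test.out    # confirm the MPI shared libraries resolve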

Any insight would be appreciated.

Thank you,
Ethan



--
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555
