I don't know if this will help, but try "mpirun --machinefile testfile -np 4 ./test.out" to explicitly request 4 processes.
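In case it matters, the hostfile and command I have in mind would look roughly like this (hostnames taken from your message below; the slots=1 entries are just a guess for your single-processor nodes, so adjust as needed):

  $ cat testfile
  pleiades slots=1
  taygeta slots=1
  m43 slots=1
  asterope slots=1

  $ mpirun --machinefile testfile -np 4 ./test.out

If it still hangs with -np given explicitly, at least we'll know the process count isn't the issue.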
On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault <edenea...@ut.edu> wrote:
> All,
>
> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi, but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set correctly, because the test program does compile and run.
>
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an AMD x86_64 machine which serves the diskless node images and /home as an NFS mount. I compile all of my programs as 32-bit.
>
> My code is a simple hello world:
>
> $ more test.f
>       program test
>
>       include 'mpif.h'
>       integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>
>       call MPI_INIT(ierror)
>       call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>       print*, 'node', rank, ': Hello world'
>       call MPI_FINALIZE(ierror)
>       end
>
> If I run this program with:
>
> $ mpirun --machinefile testfile ./test.out
> node 0 : Hello world
> node 2 : Hello world
> node 1 : Hello world
>
> This is the expected output. Here, testfile contains the master node, 'pleiades', and two slave nodes, 'taygeta' and 'm43'.
>
> If I add another machine to testfile, say 'asterope', it hangs until I ctrl-c it. I have tried every machine, and as long as I do not include more than 3 hosts, the program will not hang.
>
> I have run with the --debug-daemons flag as well, and I don't see what is wrong specifically.
>
> Working output: pleiades (master) and 2 nodes.
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon [[46344,0],2] checking in as pid 2140 on host m43
> Daemon [[46344,0],2] not using static ports
> [m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
> [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
> [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
> Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
> Daemon [[46344,0],1] not using static ports
> [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
> [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> node 0 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> node 2 : Hello world
> node 1 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync from local proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync from local proc [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync from local proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received waitpid_fired cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received iof_complete cmd
> [m43:02140] [[46344,0],2] orted_cmd: received waitpid_fired cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received waitpid_fired cmd
> [m43:02140] [[46344,0],2] orted_cmd: received iof_complete cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received iof_complete cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received exit
> [taygeta:02317] [[46344,0],1] orted_cmd: received exit
> [taygeta:02317] [[46344,0],1] orted: finalizing
> [m43:02140] [[46344,0],2] orted_cmd: received exit
> [m43:02140] [[46344,0],2] orted: finalizing
>
> Not working output: pleiades (master) and 3 nodes:
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon was launched on asterope - beginning to initialize
> Daemon [[46357,0],2] checking in as pid 2181 on host m43
> Daemon [[46357,0],2] not using static ports
> [m43:02181] [[46357,0],2] orted: up and running - waiting for commands!
> Daemon [[46357,0],1] checking in as pid 2358 on host taygeta
> Daemon [[46357,0],1] not using static ports
> [taygeta:02358] [[46357,0],1] orted: up and running - waiting for commands!
> [pleiades:19191] [[46357,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[3].name asterope daemon 3 arch ffca0200
> [pleiades:19191] [[46357,0],0] orted_cmd: received add_local_procs
> [taygeta:02358] [[46357,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02358] [[46357,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02181] [[46357,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02358] [[46357,0],1] node[2].name m43 daemon 2 arch ffca0200
> [m43:02181] [[46357,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02181] [[46357,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02181] [[46357,0],2] node[3].name asterope daemon 3 arch ffca0200
> [m43:02181] [[46357,0],2] orted_cmd: received add_local_procs
> [taygeta:02358] [[46357,0],1] node[3].name asterope daemon 3 arch ffca0200
> [taygeta:02358] [[46357,0],1] orted_cmd: received add_local_procs
> Daemon [[46357,0],3] checking in as pid 1965 on host asterope
> Daemon [[46357,0],3] not using static ports
> [asterope:01965] [[46357,0],3] orted: up and running - waiting for commands!
> [pleiades:19191] [[46357,0],0] orted_recv: received sync+nidmap from local proc [[46357,1],0]
> [m43:02181] [[46357,0],2] orted_recv: received sync+nidmap from local proc [[46357,1],2]
> [pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
> [m43:02181] [[46357,0],2] orted_cmd: received collective data cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
>
> ------------------
> The output hangs here.
>
> After I kill the process, I get the following output:
> ------------------
>
> Killed by signal 2.
> Killed by signal 2.
> --------------------------------------------------------------------------
> A daemon (pid 19194) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> Killed by signal 2.
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> [pleiades:19191] [[46357,0],0] orted_cmd: received waitpid_fired cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received iof_complete cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received exit
> mpirun: clean termination accomplished
>
> I know that LD_LIBRARY_PATH is -not- to blame. /home/<user> is exported to each machine from the master, and each machine uses the same image (and thus the same paths). If there was a problem with the path, it would not run.
>
> Any insight would be appreciated.
>
> Thank you,
> Ethan
>
> --
> Dr. Ethan Deneault
> Assistant Professor of Physics
> SC-234
> University of Tampa
> Tampa, FL 33615
> Office: (813) 257-3555
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
David Zhang
University of California, San Diego