This worked for me, although I am not sure how extensive our 32/64-bit interoperability support is. I tested on Solaris over the TCP interconnect with a 1.2.5 version of Open MPI. Note that we configure with the --enable-heterogeneous flag, which may make a difference here. It did not work for me over the sm btl.
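For reference, a heterogeneous-capable build is typically configured along these lines (the prefix is only an illustration, and whether ompi_info reports a "Heterogeneous support" line depends on the version):

  ./configure --prefix=/opt/openmpi-1.2.5 --enable-heterogeneous
  make all install
  # check whether the installed build reports heterogeneous support
  ompi_info | grep -i hetero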

By the way, can you run a simple /bin/hostname across the two nodes?
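For example, something along these lines (using the host names from your mails) should print both host names:

  mpirun -np 2 --host aim-plankton,aim-fanta4 /bin/hostname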


burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
[burl-ct-v20z-4]I am #0/6 before the barrier
[burl-ct-v20z-5]I am #3/6 before the barrier
[burl-ct-v20z-5]I am #4/6 before the barrier
[burl-ct-v20z-4]I am #1/6 before the barrier
[burl-ct-v20z-4]I am #2/6 before the barrier
[burl-ct-v20z-5]I am #5/6 before the barrier
[burl-ct-v20z-5]I am #3/6 after the barrier
[burl-ct-v20z-4]I am #1/6 after the barrier
[burl-ct-v20z-5]I am #5/6 after the barrier
[burl-ct-v20z-5]I am #4/6 after the barrier
[burl-ct-v20z-4]I am #2/6 after the barrier
[burl-ct-v20z-4]I am #0/6 after the barrier
burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
mpirun (Open MPI) 1.2.5r16572

Report bugs to http://www.open-mpi.org/community/help/
burl-ct-v20z-4 65 =>


jody wrote:
I narrowed it down: the majority of the processes get stuck in MPI_Barrier.
My test application looks like this:

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iRank1;
    int iNum1;

    char sName[256];
    gethostname(sName, 255);

    MPI_Init(&iArgC, &apArgV);

    MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
    MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

    printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

    MPI_Finalize();

    return iResult;
}


If I make this call:

mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : \
       -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in an xterm for each process; a sketch of such a wrapper follows.)
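A minimal wrapper of that kind might look like the following; this is an assumed reconstruction for illustration, not necessarily the actual script:

  #!/bin/sh
  # hypothetical run_gdb.sh: mpirun starts this once per rank; each instance
  # opens an xterm and runs the given program (plus its arguments) under gdb
  exec xterm -e gdb --args "$@"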
Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize;
all other processes get stuck in PMPI_Barrier.
Process 1 (on aim-plankton) displays the message

[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113

(errno 113 is EHOSTUNREACH, "No route to host", on Linux.)
Process 2 (on aim-plankton) displays the same message twice.

Any ideas?

  Thanks Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
Hi
 Using a more realistic application than a simple "Hello, world",
 even the --host version doesn't work correctly.
 Called this way:

 mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : \
        -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt

 the application starts but seems to hang after a while.

 Running the application in gdb:

 mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : \
        -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg \
        -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim

 I can see that the processes on aim-fanta4 have indeed gotten stuck
 after a few initial outputs,
 and the processes on aim-plankton all show the message:

 
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
 connect() failed with errno=113

 If I use only aim-plankton or only aim-fanta4, everything runs as expected.

 BTW: I'm using Open MPI 1.2.2.

 Thanks
  Jody


On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
 >  Hi
 >  In my network I have some 32-bit machines and some 64-bit machines.
 >  With --host I successfully call my application:
 >   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
 >  -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
 >  (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine)
 >
 >  But when I use hostfiles:
 >   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
 >  -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
 >  all 6 processes are started on the 64-bit machine aim-fanta4.
 >
 >  hosts32:
 >    aim-plankton slots=3
 >  hosts64:
 >   aim-fanta4 slots
 >
 >  Is this a bug or a feature?  ;)
 >
 >  Jody
 >

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
