Hi, I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.
I have a Linux machine and a SunOS machine in this cluster.

linux$ uname -a
Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux

OpenMPI-1.0.1 is installed using:
  ./configure --prefix=...
  make all install

sunos$ uname -a
SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10

OpenMPI-1.0.1 is installed using:
  ./configure --prefix=...
  make all install

I use ssh. Both nodes are accessible without password prompts.

I use the following simple application:

------------------------------------------------------------------------
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init( &argc, &argv );
    rc = MPI_Comm_rank( MPI_COMM_WORLD, &me );
    if (rc != MPI_SUCCESS) {
        return rc;
    }
    MPI_Get_processor_name( pname, &plen );
    printf("%s:Hello world from %d\n", pname, me);
    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------------

It is compiled as follows:

linux$ mpicc -o mpiinit_linux mpiinit.c
sunos$ mpicc -o mpiinit_sunos mpiinit.c

My hosts file is:

hosts.txt
---------
pg1cluster01 slots=2
csultra01 slots=1

My app file is:

mpiinit_appfile
---------------
-np 2 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_linux
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos

When I run it, I get a relocation error on the SunOS node:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos: fatal: relocation error: file /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0: symbol nanosleep: referenced symbol not found
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos: fatal: relocation error: file /home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0: symbol nanosleep: referenced symbol not found

I fixed this by passing the "-lrt" option to the linker:

sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt

However, when I run again, I get this error:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start as expected.
[pg1cluster01:19858] ERROR: There may be more information available from
[pg1cluster01:19858] ERROR: the remote shell (see above).
[pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
2 processes killed (possibly by Open MPI)

Sometimes I get this error instead:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
[csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

Please let me know how to resolve this problem. I can provide more details if needed.

Regards,
Ravi.
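
P.S. In case it helps with interpreting the second failure: the "ftruncate failed with errno=28" value can be decoded with a tiny C program like the one below. This is just a quick sketch, not anything Open MPI specific; my assumption is that csultra01 uses the usual UNIX errno numbering, where 28 is ENOSPC ("No space left on device").

------------------------------------------------------------------------
/* decode_errno.c: print the system's text for errno 28, the value
   reported by mca_common_sm_mmap_init on csultra01 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("errno 28 means: %s\n", strerror(28));
    return 0;
}
------------------------------------------------------------------------

If it really is ENOSPC, my guess is that the shared-memory backing file created by the sm component does not fit on whatever filesystem it is placed on (presumably /tmp) on csultra01, but I would appreciate confirmation.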