Hi,

I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.

I have a Linux machine and a SunOS machine in this cluster.

linux$ uname -a
Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004
i686 i686 i386 GNU/Linux

OpenMPI-1.0.1 is installed using

./configure --prefix=...
make all install

sunos$ uname -a
SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10

OpenMPI-1.0.1 is installed using

./configure --prefix=...
make all install


I use ssh, and both nodes are accessible without a password prompt.

I use the following simple application:

------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init(
       &argc,
       &argv
    );

    rc = MPI_Comm_rank(
            MPI_COMM_WORLD,
            &me
    );

    if (rc != MPI_SUCCESS)
    {
       return rc;
    }

    MPI_Get_processor_name(
       pname,
       &plen
    );

    printf("%s:Hello world from %d\n", pname, me);

    MPI_Finalize();

    return 0;
}
------------------------------------------------------------------------

It is compiled as follows:

linux$ mpicc -o mpiinit_linux mpiinit.c
sunos$ mpicc -o mpiinit_sunos mpiinit.c

My hosts file is

hosts.txt
---------
pg1cluster01 slots=2
csultra01 slots=1

My app file is

mpiinit_appfile
---------------
-np 2 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_linux
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found

I fixed this by passing the "-lrt" option to the linker when compiling:

sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt

However, when I run this again, I get the following error:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
as expected.
[pg1cluster01:19858] ERROR: There may be more information available from
[pg1cluster01:19858] ERROR: the remote shell (see above).
[pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
2 processes killed (possibly by Open MPI)

Sometimes I get this error instead:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
[csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

Please let me know how to resolve this problem, and tell me if you need
more details.

Regards,
Ravi.
