Re: [OMPI users] Segmentation fault / Address not mapped (1) with 2-node job on Rocks 5.2

2010-06-22 Thread Ralph Castain
Sorry for the problem - the issue is a bug in the handling of the pernode 
option in 1.4.2. This has been fixed and awaits release in 1.4.3.
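
Until 1.4.3 is out, a possible workaround (untested, and it assumes the crash is
specific to the --pernode mapping path) is to drop --pernode and request the
process count explicitly; with each host listed once, -np 2 will place one
process per node:

  mpirun --host idgc3grid01,compute-0-11 -np 2 hello_mpi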


On Jun 21, 2010, at 5:27 PM, Riccardo Murri wrote:

> Hello,
> 
> I'm using OpenMPI 1.4.2 on a Rocks 5.2 cluster.  I compiled it on my
> own to have a thread-enabled MPI (the OMPI coming with Rocks 5.2
> apparently only supports MPI_THREAD_SINGLE), and installed into ~/sw.
> 
> To test the newly installed library I compiled a simple "hello world"
> that comes with Rocks::
> 
>  [murri@idgc3grid01 hello_mpi.d]$ cat hello_mpi.c
>  #include <stdio.h>
>  #include <sys/utsname.h>
> 
>  #include <mpi.h>
> 
>  int main(int argc, char **argv) {
>    int myrank;
>    struct utsname unam;
> 
>    MPI_Init(&argc, &argv);
> 
>    uname(&unam);
>    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>    printf("Hello from rank %d on host %s\n", myrank, unam.nodename);
> 
>    MPI_Finalize();
>  }
> 
> The program runs fine as long as it only uses ranks on localhost::
> 
>  [murri@idgc3grid01 hello_mpi.d]$ mpirun --host localhost -np 2 hello_mpi
>  Hello from rank 1 on host idgc3grid01.uzh.ch
>  Hello from rank 0 on host idgc3grid01.uzh.ch
> 
> However, as soon as I try to run on more than one host, I get a
> segfault::
> 
>  [murri@idgc3grid01 hello_mpi.d]$ mpirun --host idgc3grid01,compute-0-11 --pernode hello_mpi
>  [idgc3grid01:13006] *** Process received signal ***
>  [idgc3grid01:13006] Signal: Segmentation fault (11)
>  [idgc3grid01:13006] Signal code: Address not mapped (1)
>  [idgc3grid01:13006] Failing at address: 0x50
>  [idgc3grid01:13006] [ 0] /lib64/libpthread.so.0 [0x359420e4c0]
>  [idgc3grid01:13006] [ 1] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b352d00265b]
>  [idgc3grid01:13006] [ 2] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x676) [0x2b352d00e0e6]
>  [idgc3grid01:13006] [ 3] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xb8) [0x2b352d015358]
>  [idgc3grid01:13006] [ 4] /home/oci/murri/sw/lib/openmpi/mca_plm_rsh.so [0x2b352dcb9a80]
>  [idgc3grid01:13006] [ 5] mpirun [0x40345a]
>  [idgc3grid01:13006] [ 6] mpirun [0x402af3]
>  [idgc3grid01:13006] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x359361d974]
>  [idgc3grid01:13006] [ 8] mpirun [0x402a29]
>  [idgc3grid01:13006] *** End of error message ***
>  Segmentation fault
> 
> I've already tried the suggestions posted to similar messages on the
> list: "ldd" reports that the executable is linked with the libraries
> in my home, not the system-wide OMPI::
> 
>  [murri@idgc3grid01 hello_mpi.d]$ ldd hello_mpi
>  libmpi.so.0 => /home/oci/murri/sw/lib/libmpi.so.0 (0x2ad2bd6f2000)
>  libopen-rte.so.0 => /home/oci/murri/sw/lib/libopen-rte.so.0 (0x2ad2bd997000)
>  libopen-pal.so.0 => /home/oci/murri/sw/lib/libopen-pal.so.0 (0x2ad2bdbe3000)
>  libdl.so.2 => /lib64/libdl.so.2 (0x003593e0)
>  libnsl.so.1 => /lib64/libnsl.so.1 (0x003596a0)
>  libutil.so.1 => /lib64/libutil.so.1 (0x0035a100)
>  libm.so.6 => /lib64/libm.so.6 (0x003593a0)
>  libpthread.so.0 => /lib64/libpthread.so.0 (0x00359420)
>  libc.so.6 => /lib64/libc.so.6 (0x00359360)
>  /lib64/ld-linux-x86-64.so.2 (0x00359320)
> 
> I've also checked with "strace" that the "mpi.h" file used during
> compile is the one in ~/sw/include and that all ".so" files being
> loaded from OMPI are the ones in ~/sw/lib.  I can ssh without password
> to the target compute node. The "mpirun" and "mpicc" are the correct ones:
> 
>  [murri@idgc3grid01 hello_mpi.d]$ which mpirun
>  ~/sw/bin/mpirun
> 
>  [murri@idgc3grid01 hello_mpi.d]$ which mpicc
>  ~/sw/bin/mpicc
> 
> 
> I'm pretty stuck now; can anybody give me a hint?
> 
> Thanks a lot for any help!
> 
> Best regards,
> Riccardo




[OMPI users] Segmentation fault / Address not mapped (1) with 2-node job on Rocks 5.2

2010-06-21 Thread Riccardo Murri
Hello,

I'm using OpenMPI 1.4.2 on a Rocks 5.2 cluster.  I compiled it on my
own to have a thread-enabled MPI (the OMPI coming with Rocks 5.2
apparently only supports MPI_THREAD_SINGLE), and installed into ~/sw.
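
For reference, the build was essentially the following (I'm reconstructing the
options here, so treat --enable-mpi-threads as an assumption; check
"./configure --help" for the exact flag name in 1.4.2)::

  ./configure --prefix=$HOME/sw --enable-mpi-threads
  make all install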

To test the newly installed library I compiled a simple "hello world"
that comes with Rocks::

  [murri@idgc3grid01 hello_mpi.d]$ cat hello_mpi.c
  #include <stdio.h>
  #include <sys/utsname.h>

  #include <mpi.h>

  int main(int argc, char **argv) {
    int myrank;
    struct utsname unam;

    MPI_Init(&argc, &argv);

    uname(&unam);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("Hello from rank %d on host %s\n", myrank, unam.nodename);

    MPI_Finalize();
  }
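
The binary was built with the mpicc from ~/sw (just a sketch of the command;
the exact flags don't matter here)::

  mpicc -o hello_mpi hello_mpi.c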

The program runs fine as long as it only uses ranks on localhost::

  [murri@idgc3grid01 hello_mpi.d]$ mpirun --host localhost -np 2 hello_mpi
  Hello from rank 1 on host idgc3grid01.uzh.ch
  Hello from rank 0 on host idgc3grid01.uzh.ch

However, as soon as I try to run on more than one host, I get a
segfault::

  [murri@idgc3grid01 hello_mpi.d]$ mpirun --host idgc3grid01,compute-0-11 --pernode hello_mpi
  [idgc3grid01:13006] *** Process received signal ***
  [idgc3grid01:13006] Signal: Segmentation fault (11)
  [idgc3grid01:13006] Signal code: Address not mapped (1)
  [idgc3grid01:13006] Failing at address: 0x50
  [idgc3grid01:13006] [ 0] /lib64/libpthread.so.0 [0x359420e4c0]
  [idgc3grid01:13006] [ 1] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b352d00265b]
  [idgc3grid01:13006] [ 2] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x676) [0x2b352d00e0e6]
  [idgc3grid01:13006] [ 3] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xb8) [0x2b352d015358]
  [idgc3grid01:13006] [ 4] /home/oci/murri/sw/lib/openmpi/mca_plm_rsh.so [0x2b352dcb9a80]
  [idgc3grid01:13006] [ 5] mpirun [0x40345a]
  [idgc3grid01:13006] [ 6] mpirun [0x402af3]
  [idgc3grid01:13006] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x359361d974]
  [idgc3grid01:13006] [ 8] mpirun [0x402a29]
  [idgc3grid01:13006] *** End of error message ***
  Segmentation fault

I've already tried the suggestions posted to similar messages on the
list: "ldd" reports that the executable is linked with the libraries
in my home, not the system-wide OMPI::

  [murri@idgc3grid01 hello_mpi.d]$ ldd hello_mpi
  libmpi.so.0 => /home/oci/murri/sw/lib/libmpi.so.0 (0x2ad2bd6f2000)
  libopen-rte.so.0 => /home/oci/murri/sw/lib/libopen-rte.so.0 (0x2ad2bd997000)
  libopen-pal.so.0 => /home/oci/murri/sw/lib/libopen-pal.so.0 (0x2ad2bdbe3000)
  libdl.so.2 => /lib64/libdl.so.2 (0x003593e0)
  libnsl.so.1 => /lib64/libnsl.so.1 (0x003596a0)
  libutil.so.1 => /lib64/libutil.so.1 (0x0035a100)
  libm.so.6 => /lib64/libm.so.6 (0x003593a0)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00359420)
  libc.so.6 => /lib64/libc.so.6 (0x00359360)
  /lib64/ld-linux-x86-64.so.2 (0x00359320)

I've also checked with "strace" that the "mpi.h" file used during
compile is the one in ~/sw/include and that all ".so" files being
loaded from OMPI are the ones in ~/sw/lib.  I can ssh without password
to the target compute node. The "mpirun" and "mpicc" are the correct ones:

  [murri@idgc3grid01 hello_mpi.d]$ which mpirun
  ~/sw/bin/mpirun

  [murri@idgc3grid01 hello_mpi.d]$ which mpicc
  ~/sw/bin/mpicc
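
One check I have not done yet (so this is only a guess at a further step, not
something already verified above) is what a non-interactive ssh session on the
compute node actually picks up, e.g.::

  [murri@idgc3grid01 hello_mpi.d]$ ssh compute-0-11 'which mpirun; echo $LD_LIBRARY_PATH'

If that resolved to the system-wide Rocks OMPI instead of ~/sw, the two nodes
would be running different installations.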


I'm pretty stuck now; can anybody give me a hint?

Thanks a lot for any help!

Best regards,
Riccardo