Hello,

I am running Open MPI 1.10.1 on CentOS 6.7.

Running mpi_hello_world fails on the remote node:

mpirun --hostfile hostfile.txt -np 8 hello.mpi | sort

[1,1]<stderr>:[fupone4][[1657,1],1][btl_tcp_proc.c:132:mca_btl_tcp_proc_create] ompi_modex_recv: failed with return value=-48

Running on just the controller node works:

mpirun --hostfile hostfile.txt --tag-output -np 2 hello.mpi
[1,0]<stdout>:Hello world from processor scogrid01, rank 0 out of 2 processors
[1,1]<stdout>:Hello world from processor scogrid01, rank 1 out of 2 processors

Any suggestions on debugging would be appreciated.
In particular, what do the return codes from ompi_modex_recv mean?

Additional error log below:

[1,1]<stderr>:[fupone4][[1657,1],1][btl_tcp_proc.c:132:mca_btl_tcp_proc_create] ompi_modex_recv: failed with return value=-48
[1,1]<stderr>:[fupone4:31860] *** Process received signal ***
[1,1]<stderr>:[fupone4:31860] Signal: Segmentation fault (11)
[1,1]<stderr>:[fupone4:31860] Signal code: Address not mapped (1)
[1,1]<stderr>:[fupone4:31860] Failing at address: 0x8f
[1,1]<stderr>:[fupone4:31860] [ 0]
[1,1]<stderr>:/lib64/libpthread.so.0[0x35b580eca0]
[1,1]<stderr>:[fupone4:31860] [ 1]
/home/brian/openmpi-1.10.1/build/lib/openmpi/mca_bml_r2.so[0x2afe783c57b1]
[1,1]<stderr>:[fupone4:31860] [ 2]
/home/brian/openmpi-1.10.1/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xce)[0x2afe78e69aee]
[1,1]<stderr>:[fupone4:31860] [ 3]
[1,1]<stderr>:/home/brian/openmpi-1.10.1/build/lib/libmpi.so.12(ompi_mpi_init+0x7c4)[0x2afe74b13504]
[1,1]<stderr>:[fupone4:31860] [ 4]
[1,1]<stderr>:/home/brian/openmpi-1.10.1/build/lib/libmpi.so.12(MPI_Init+0x189)[0x2afe74b31099]
[1,1]<stderr>:[fupone4:31860] [ 5] hello.mpi[0x4007cf]
[1,1]<stderr>:[fupone4:31860] [ 6]
[1,1]<stderr>:/lib64/libc.so.6(__libc_start_main+0xf4)[0x35b481d9c4]
[1,1]<stderr>:[fupone4:31860] [ 7] hello.mpi[0x4006f9]
[1,1]<stderr>:[fupone4:31860] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node fupone4 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
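
For reading the tagged output above, here is a minimal Python sketch that groups lines by rank and stream. It assumes the `[jobid,rank]<stream>:` prefix produced by mpirun's --tag-output, and it assumes that any untagged line is a wrapped continuation of the previous tagged line. The function name group_by_rank and the sample lines are only illustrative.

```python
import re
from collections import defaultdict

# Prefix added by mpirun --tag-output, e.g. "[1,1]<stderr>:".
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")

def group_by_rank(lines):
    """Return {(rank, stream): [messages]}.

    Assumption: a non-empty line without a tag is a wrapped continuation
    of the most recent tagged line and is appended to it.
    """
    grouped = defaultdict(list)
    last_key = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = TAG.match(line)
        if match:
            last_key = (int(match.group(2)), match.group(3))
            grouped[last_key].append(match.group(4))
        elif last_key is not None and line.strip():
            grouped[last_key][-1] += " " + line.strip()
    return dict(grouped)

# Illustrative input copied from the output in this message.
sample = [
    "[1,0]<stdout>:Hello world from processor scogrid01, rank 0 out of 2",
    "processors",
    "[1,1]<stderr>:[fupone4:31860] Signal: Segmentation fault (11)",
]
result = group_by_rank(sample)
for (rank, stream), messages in sorted(result.items()):
    for message in messages:
        print(f"rank {rank} {stream}: {message}")
```

Saving the mpirun output to a file and passing its lines to group_by_rank makes it easier to see which rank, and therefore which host, produced each message.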



Brian
