Hello,

I am running Open MPI 1.10.1 on CentOS 6.7.

Running mpi_hello_world fails on the remote node:

mpirun --hostfile hostfile.txt -np 8 hello.mpi | sort
[1,1]<stderr>:[fupone4][[1657,1],1][btl_tcp_proc.c:132:mca_btl_tcp_proc_create] ompi_modex_recv: failed with return value=-48

Running on just the controller node works:

mpirun --hostfile hostfile.txt --tag-output -np 2 hello.mpi
[1,0]<stdout>:Hello world from processor scogrid01, rank 0 out of 2 processors
[1,1]<stdout>:Hello world from processor scogrid01, rank 1 out of 2 processors

Any suggestions on debugging would be appreciated. What are the return codes from ompi_modex_recv?

Additional error log below:

[1,1]<stderr>:[fupone4][[1657,1],1][btl_tcp_proc.c:132:mca_btl_tcp_proc_create] ompi_modex_recv: failed with return value=-48
[1,1]<stderr>:[fupone4:31860] *** Process received signal ***
[1,1]<stderr>:[fupone4:31860] Signal: Segmentation fault (11)
[1,1]<stderr>:[fupone4:31860] Signal code: Address not mapped (1)
[1,1]<stderr>:[fupone4:31860] Failing at address: 0x8f
[1,1]<stderr>:[fupone4:31860] [ 0]
[1,1]<stderr>:/lib64/libpthread.so.0[0x35b580eca0]
[1,1]<stderr>:[fupone4:31860] [ 1] /home/brian/openmpi-1.10.1/build/lib/openmpi/mca_bml_r2.so[0x2afe783c57b1]
[1,1]<stderr>:[fupone4:31860] [ 2] /home/brian/openmpi-1.10.1/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xce)[0x2afe78e69aee]
[1,1]<stderr>:[fupone4:31860] [ 3]
[1,1]<stderr>:/home/brian/openmpi-1.10.1/build/lib/libmpi.so.12(ompi_mpi_init+0x7c4)[0x2afe74b13504]
[1,1]<stderr>:[fupone4:31860] [ 4]
[1,1]<stderr>:/home/brian/openmpi-1.10.1/build/lib/libmpi.so.12(MPI_Init+0x189)[0x2afe74b31099]
[1,1]<stderr>:[fupone4:31860] [ 5] hello.mpi[0x4007cf]
[1,1]<stderr>:[fupone4:31860] [ 6]
[1,1]<stderr>:/lib64/libc.so.6(__libc_start_main+0xf4)[0x35b481d9c4]
[1,1]<stderr>:[fupone4:31860] [ 7] hello.mpi[0x4006f9]
[1,1]<stderr>:[fupone4:31860] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node fupone4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Brian
