Are you running the same OS version and the same Open MPI version on the head node and on the regular nodes?

On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:

> Hey all,
> I’ve been racking my brains over this for several days and was hoping anyone
> could enlighten me. I’ll describe only the relevant parts of the
> network/computer systems. There is one head node and a multitude of regular
> nodes. The regular nodes are all identical to each other. If I run an mpi
> program from one of the regular nodes to any other regular nodes, everything
> works. If I include the head node in the hosts file, I get segfaults which
> I’ll paste below along with sample code. The machines are all networked via
> infiniband and Ethernet. The issue only arises when mpi communication occurs.
> By this I mean, MPi_Init might succeed but the segfault always occurs on
> MPI_Barrier or MPI_send/recv. I found a work around by disabling the openib
> btl and enforcing that communications go over infiniband(if I don’t force
> infiniband, it’ll go over Ethernet). This command works when the head node is
> included in the hosts file:
> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>
> Sample Code:
> #include "mpi.h"
> #include <stdio.h>
> int main(int argc, char *argv[])
> {
>     int rank, nprocs;
>     char* name[20];
>     int maxlen = 20;
>     MPI_Init(&argc,&argv);
>     MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>     MPI_Barrier(MPI_COMM_WORLD);
>     gethostname(name,maxlen);
>     printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs,name);
>     fflush(stdout);
>     MPI_Finalize();
>     return 0;
>
> }
>
> Segfault:
> [pastec:19917] *** Process received signal ***
> [pastec:19917] Signal: Segmentation fault (11)
> [pastec:19917] Signal code: Address not mapped (1)
> [pastec:19917] Failing at address: 0x8
> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
> [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
> [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
> [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
> [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
> [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
> [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
> [pastec:19917] [13] ./b.out() [0x400919]
> [pastec:19917] *** End of error message ***
> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 19917 on node
> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/