Hmm.  It's not immediately clear to me what's going wrong here.

I hate to ask, but could you install a debugging version of Open MPI and 
capture a proper stack trace of the segv?
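
If it helps, here's roughly what I have in mind (the prefix and file names 
below are just placeholders -- adjust them for your setup):

    # build a debug-enabled Open MPI into a scratch prefix
    ./configure --prefix=$HOME/ompi-debug --enable-debug
    make -j4 install

    # recompile the test program against the debug install, allow core
    # dumps, and re-run the failing case
    $HOME/ompi-debug/bin/mpicc b.c -o b.out
    ulimit -c unlimited
    $HOME/ompi-debug/bin/mpirun --hostfile hostfile -np 2 ./b.out

    # then pull a backtrace out of the core file left on the node that
    # segfaulted ("bt" at the gdb prompt)
    gdb ./b.out core.<pid>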

Also, could you try the 1.4.4 rc and see if that magically fixes the problem? 
(I'm about to post a new 1.4.4 rc later this morning; either the current one 
or the one from later today would be a good data point.)
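
For the rc, something like the following should be enough (assuming you've 
grabbed the rc tarball from the Open MPI download page; installing into its 
own prefix keeps it from disturbing the system 1.4.3):

    tar xjf openmpi-1.4.4rc*.tar.bz2 && cd openmpi-1.4.4rc*/
    ./configure --prefix=$HOME/ompi-1.4.4rc
    make -j4 install

    # run the reproducer with the rc's mpicc/mpirun
    $HOME/ompi-1.4.4rc/bin/mpicc b.c -o b.out
    $HOME/ompi-1.4.4rc/bin/mpirun --hostfile hostfile -np 2 ./b.out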


On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:

> Yep, Fedora Core 14 and OpenMPI 1.4.3
> 
> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>> Are you running the same OS version and Open MPI version between the head 
>> node and regular nodes?
>> 
>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>> 
>>> Hey all,
>>> I’ve been racking my brains over this for several days and was hoping 
>>> someone could enlighten me. I’ll describe only the relevant parts of the 
>>> network/computer systems. There is one head node and a multitude of regular 
>>> nodes. The regular nodes are all identical to each other. If I run an MPI 
>>> program across any of the regular nodes, everything works. If I include the 
>>> head node in the hosts file, I get segfaults, which I’ll paste below along 
>>> with sample code. The machines are all networked via InfiniBand and 
>>> Ethernet. The issue only arises when MPI communication occurs; by this I 
>>> mean that MPI_Init might succeed, but the segfault always occurs on 
>>> MPI_Barrier or MPI_Send/MPI_Recv. I found a workaround: disable the openib 
>>> btl and force communication to go over InfiniBand (if I don’t force 
>>> InfiniBand, it goes over Ethernet). This command works when the head node 
>>> is included in the hosts file:
>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0  
>>> -np 2 ./b.out
>>> 
>>> Sample Code:
>>> #include "mpi.h"
>>> #include<stdio.h>
>>> int main(int argc, char *argv[])
>>> {
>>>    int rank, nprocs;
>>>     char* name[20];
>>>     int maxlen = 20;
>>>     MPI_Init(&argc,&argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>>>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>     gethostname(name,maxlen);
>>>     printf("Hello, world.  I am %d of %d and host %s \n", rank, 
>>> nprocs,name);
>>>     fflush(stdout);
>>>     MPI_Finalize();
>>>     return 0;
>>> 
>>> }
>>> 
>>> Segfault:
>>> [pastec:19917] *** Process received signal ***
>>> [pastec:19917] Signal: Segmentation fault (11)
>>> [pastec:19917] Signal code: Address not mapped (1)
>>> [pastec:19917] Failing at address: 0x8
>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
>>> [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
>>> [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
>>> [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
>>> [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
>>> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
>>> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
>>> [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
>>> [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
>>> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
>>> [pastec:19917] [13] ./b.out() [0x400919]
>>> [pastec:19917] *** End of error message ***
>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 19917 on node pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> 
>>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

