Hmmm... I'm not sure how ring_c is going to run with only one proc (I don't 
know whether the program is protected against that scenario). If you run with 
-np 2 -mca btl openib,sm,self, is it happy?
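
In other words, something along these lines (a minimal sanity check; this 
assumes ring_c is on your PATH or in the working directory, as in your output):

  mpirun -np 2 -mca btl openib,sm,self ring_c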


On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> Here’s the command I’m invoking and the terminal output.  (Some of this 
> information doesn’t appear to be captured in the backtrace.)
>  
> [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
> ring_c: 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
>  udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [binf316:04549] *** Process received signal ***
> [binf316:04549] Signal: Aborted (6)
> [binf316:04549] Signal code:  (-6)
> [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
> [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
> [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
> [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
> [binf316:04549] [ 4] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
> [binf316:04549] [ 5] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
> [binf316:04549] [ 6] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
> [binf316:04549] [ 7] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
> [binf316:04549] [ 8] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
> [binf316:04549] [ 9] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
> [binf316:04549] [10] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
> [binf316:04549] [11] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
> [binf316:04549] [12] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
> [binf316:04549] [13] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
> [binf316:04549] [14] 
> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
> [binf316:04549] [15] ring_c[0x40096f]
> [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
> [binf316:04549] [17] ring_c[0x400889]
> [binf316:04549] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 4549 on node xxxx316 exited on 
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>  
> From: Fischer, Greg A. 
> Sent: Thursday, June 05, 2014 5:10 PM
> To: us...@open-mpi.org
> Cc: Fischer, Greg A.
> Subject: openib segfaults with Torque
>  
> OpenMPI Users,
>  
> After encountering difficulty with the Intel compilers (see the “intermittent 
> segfaults with openib on ring_c.c” thread), I installed GCC-4.8.3 and 
> recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib 
> BTL in a typical BASH environment. Everything appeared to work fine, so I 
> went on my merry way compiling the rest of my dependencies.
>  
> After getting my dependencies and applications compiled, I began observing 
> segfaults when submitting the applications through Torque. I recompiled 
> OpenMPI with debug options, ran “ring_c” over the openib BTL in an 
> interactive Torque session (“qsub -I”), and got the backtrace below. All 
> other system settings described in the previous thread are the same. Any 
> thoughts on how to resolve this issue?
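> 
> For reference, the debug rebuild and test went roughly as follows 
> (illustrative only; the exact configure options may have differed, but 
> --enable-debug is the relevant switch, and the prefix matches the paths in 
> the backtrace):
> 
>   ./configure --prefix=/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug \
>       --enable-debug CC=gcc CXX=g++
>   make all install
>   qsub -I
>   mpirun -np 1 -mca btl openib,self ring_c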
>  
> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
> #1  0x00007f7f5920c0c5 in abort () from /lib64/libc.so.6
> #2  0x00007f7f59203a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x00007f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
> #4  0x00007f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) 
> at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
> #5  0x00007f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
> (btl=0x716680) at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  0x00007f7f54885817 in btl_openib_component_init 
> (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, 
> enable_mpi_threads=false)
>     at 
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
> #7  0x00007f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false) at 
> ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
> #8  0x00007f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, 
> enable_progress_threads=false, enable_mpi_threads=false) at 
> ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
> #9  0x00007f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, 
> enable_mpi_threads=false) at 
> ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x00007f7f539ed739 in mca_pml_ob1_component_init 
> (priority=0x7fff906aa630, enable_progress_threads=false, 
> enable_mpi_threads=false)
>     at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x00007f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false) at 
> ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x00007f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, 
> requested=0, provided=0x7fff906aa7d8) at 
> ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
> #13 0x00007f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, 
> argv=0x7fff906aa820) at pinit.c:84
> #14 0x000000000040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19
>  
> Greg
