Hello George,

Adding --mca pml ob1 does make the program run. I just wanted to make sure
that was the expected behaviour (as opposed to a bug in mpirun).
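For the record, the slave() fix that Gilles pointed out below amounts to
something like the sketch that follows. The real code is in the attachment
from the quoted thread, so the buffer name and size here are only
illustrative:

#include <mpi.h>
#include <stdio.h>

/* Sketch of the corrected receive in slave(); illustrative only -- the
 * actual function lives in the attached mpi_hello_master_slave.c. */
void slave(void)
{
    char msg[256];
    MPI_Status status;

    /* Per Gilles's comment: receive with MPI_ANY_TAG instead of a fixed
     * tag, so the slave accepts whichever tag the master sent with. */
    MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    printf("Greetings received with tag %d: %s\n", status.MPI_TAG, msg);
}

With that fix applied, the invocation that now completes for me is the
Scenario 4 one from the quoted thread:

mpirun -np 2 -mca btl self,sm -mca pml ob1 ./mpi_hello_master_slave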
Thanks
Durga

1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!

On Sun, Apr 24, 2016 at 9:43 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Add --mca pml ob1 to your mpirun command.
>
>   George
>
> On Sunday, April 24, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>
>> Hello Gilles,
>>
>> Thank you for finding the bug; it was not there in the original code; I
>> added it while trying to 'simplify' the code.
>>
>> With the bug fixed, the code now runs in the last scenario. But it still
>> hangs with the following command line (even after updating to the latest
>> git tree, rebuilding and reinstalling):
>>
>> mpirun -np 2 -mca btl self,sm ./mpi_hello_master_slave
>>
>> and the stack is still as before:
>>
>> (gdb) bt
>> #0  0x00007f4e4bd60117 in sched_yield () from /lib64/libc.so.6
>> #1  0x00007f4e4ba3d875 in amsh_ep_connreq_wrap () from /lib64/libpsm_infinipath.so.1
>> #2  0x00007f4e4ba3e254 in amsh_ep_connect () from /lib64/libpsm_infinipath.so.1
>> #3  0x00007f4e4ba470df in psm_ep_connect () from /lib64/libpsm_infinipath.so.1
>> #4  0x00007f4e4c4c8975 in ompi_mtl_psm_add_procs (mtl=0x7f4e4c846500 <ompi_mtl_psm>, nprocs=2, procs=0x23bb420) at mtl_psm.c:312
>> #5  0x00007f4e4c52ef6b in mca_pml_cm_add_procs (procs=0x23bb420, nprocs=2) at pml_cm.c:134
>> #6  0x00007f4e4c2e7d0f in ompi_mpi_init (argc=1, argv=0x7fffe930f9b8, requested=0, provided=0x7fffe930f78c) at runtime/ompi_mpi_init.c:770
>> #7  0x00007f4e4c324aff in PMPI_Init (argc=0x7fffe930f7bc, argv=0x7fffe930f7b0) at pinit.c:66
>> #8  0x000000000040101f in main (argc=1, argv=0x7fffe930f9b8) at mpi_hello_master_slave.c:94
>>
>> As you can see, OMPI is trying the PSM link to communicate, even though
>> the link is down and it is not mentioned in the arguments to mpirun.
>> (There are not even multiple nodes mentioned in the arguments.)
>>
>> Is this the expected behaviour or is it a bug?
>>
>> Thanks
>> Durga
>>
>> 1% of the executables have 99% of CPU privilege!
>> Userspace code! Unite!! Occupy the kernel!!!
>>
>> On Sun, Apr 24, 2016 at 8:12 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>>> Two comments:
>>>
>>> - The program is incorrect: slave() should MPI_Recv(..., MPI_ANY_TAG, ...).
>>>
>>> - Current master uses pmix114, and your traces mention pmix120, so your
>>>   master is out of sync, or pmix120 is an old module that was not
>>>   manually removed.
>>>   FWIW, once in a while I
>>>   rm -rf /.../ompi_install_dir/lib/openmpi
>>>   to get rid of the removed modules.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 4/25/2016 7:34 AM, dpchoudh . wrote:
>>>
>>> Hello all,
>>>
>>> Attached is a simple MPI program (a modified version of a similar
>>> program that was posted by another user).
>>> This program, when run on a single-node machine, hangs most of the time,
>>> as follows (in all cases, the OS was CentOS 7):
>>>
>>> Scenario 1: OMPI v1.10, single-socket quad-core machine, with Chelsio T3
>>> card, link down, and GigE, link up.
>>>
>>> mpirun -np 2 <progname>
>>>
>>> Backtraces of the two spawned processes are as follows:
>>>
>>> (gdb) bt
>>> #0  0x00007f6471647aba in mca_btl_vader_component_progress () at btl_vader_component.c:708
>>> #1  0x00007f6475c6722a in opal_progress () at runtime/opal_progress.c:187
>>> #2  0x00007f64767b7685 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../opal/threads/condition.h:78
>>> #3  ompi_request_default_wait_all (count=2, requests=0x7ffd1d921530, statuses=0x7ffd1d921540) at request/req_wait.c:281
>>> #4  0x00007f64709dd591 in ompi_coll_tuned_sendrecv_zero (stag=-16, rtag=-16, comm=<optimized out>, source=1, dest=1) at coll_tuned_barrier.c:78
>>> #5  ompi_coll_tuned_barrier_intra_two_procs (comm=0x6022c0 <ompi_mpi_comm_world>, module=<optimized out>) at coll_tuned_barrier.c:324
>>> #6  0x00007f64767c92e6 in PMPI_Barrier (comm=0x6022c0 <ompi_mpi_comm_world>) at pbarrier.c:70
>>> #7  0x00000000004010bd in main (argc=1, argv=0x7ffd1d9217d8) at mpi_hello_master_slave.c:115
>>> (gdb)
>>>
>>> (gdb) bt
>>> #0  mca_pml_ob1_progress () at pml_ob1_progress.c:45
>>> #1  0x00007feeae7dc22a in opal_progress () at runtime/opal_progress.c:187
>>> #2  0x00007feea9e125c5 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../../../../opal/threads/condition.h:78
>>> #3  ompi_request_wait_completion (req=0xe55200) at ../../../../ompi/request/request.h:381
>>> #4  mca_pml_ob1_recv (addr=<optimized out>, count=255, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fff4a618000) at pml_ob1_irecv.c:118
>>> #5  0x00007feeaf35068f in PMPI_Recv (buf=0x7fff4a618020, count=255, type=0x6020c0 <ompi_mpi_char>, source=<optimized out>, tag=<optimized out>, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fff4a618000) at precv.c:78
>>> #6  0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
>>> #7  0x00000000004010b3 in main (argc=1, argv=0x7fff4a6184d8) at mpi_hello_master_slave.c:113
>>> (gdb)
>>>
>>> Scenario 2:
>>> Dual-socket hex-core machine with QLogic IB, Chelsio iWARP and Fibre
>>> Channel, all links down, GigE, link up, Open MPI compiled from the master
>>> branch. This crashes as follows:
>>>
>>> [durga@smallMPI Desktop]$ mpirun -np 2 ./mpi_hello_master_slave
>>>
>>> mpi_hello_master_slave:39570 terminated with signal 11 at PC=20 SP=7ffd438c00b8. Backtrace:
>>>
>>> mpi_hello_master_slave:39571 terminated with signal 11 at PC=20 SP=7ffee5903e08. Backtrace:
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node smallMPI exited on
>>> signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Scenario 3:
>>> Exactly the same as scenario 2, but with a more explicit command line, as
>>> follows:
>>>
>>> [durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm ./mpi_hello_master_slave
>>>
>>> This hangs with the following backtraces:
>>>
>>> (gdb) bt
>>> #0  0x00007ff6639f049d in nanosleep () from /lib64/libc.so.6
>>> #1  0x00007ff663a210d4 in usleep () from /lib64/libc.so.6
>>> #2  0x00007ff662f72796 in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0, nprocs=0, info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100
>>> #3  0x00007ff662f4f0bc in pmix120_fence (procs=0x0, collect_data=0) at pmix120_client.c:255
>>> #4  0x00007ff663f941af in ompi_mpi_init (argc=1, argv=0x7ffc18c9afd8, requested=0, provided=0x7ffc18c9adac) at runtime/ompi_mpi_init.c:813
>>> #5  0x00007ff663fc9c33 in PMPI_Init (argc=0x7ffc18c9addc, argv=0x7ffc18c9add0) at pinit.c:66
>>> #6  0x000000000040101f in main (argc=1, argv=0x7ffc18c9afd8) at mpi_hello_master_slave.c:94
>>> (gdb) q
>>>
>>> (gdb) bt
>>> #0  0x00007f5af7646117 in sched_yield () from /lib64/libc.so.6
>>> #1  0x00007f5af7323875 in amsh_ep_connreq_wrap () from /lib64/libpsm_infinipath.so.1
>>> #2  0x00007f5af7324254 in amsh_ep_connect () from /lib64/libpsm_infinipath.so.1
>>> #3  0x00007f5af732d0df in psm_ep_connect () from /lib64/libpsm_infinipath.so.1
>>> #4  0x00007f5af7d94a69 in ompi_mtl_psm_add_procs (mtl=0x7f5af80f8500 <ompi_mtl_psm>, nprocs=2, procs=0xf53e60) at mtl_psm.c:312
>>> #5  0x00007f5af7df3630 in mca_pml_cm_add_procs (procs=0xf53e60, nprocs=2) at pml_cm.c:134
>>> #6  0x00007f5af7bcc0d1 in ompi_mpi_init (argc=1, argv=0x7ffc485a2f98, requested=0, provided=0x7ffc485a2d6c) at runtime/ompi_mpi_init.c:777
>>> #7  0x00007f5af7c01c33 in PMPI_Init (argc=0x7ffc485a2d9c, argv=0x7ffc485a2d90) at pinit.c:66
>>> #8  0x000000000040101f in main (argc=1, argv=0x7ffc485a2f98) at mpi_hello_master_slave.c:94
>>>
>>> This seems to suggest that it is trying PSM to connect even when the link
>>> was down and it was not mentioned in the command line. Is this behavior
>>> expected?
>>>
>>> Scenario 4:
>>> Exactly the same as scenario 3, but with an even more explicit command line:
>>>
>>> [durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm -mca pml ob1 ./mpi_hello_master_slave
>>>
>>> This hangs towards the end, after printing the output (as opposed to
>>> scenario 3, where it hangs at the connection-setup stage, without printing
>>> anything):
>>>
>>> Process 0 of 2 running on host smallMPI
>>>
>>> Now 1 slave tasks are sending greetings.
>>>
>>> Process 1 of 2 running on host smallMPI
>>> Greetings from task 1:
>>> message type: 3
>>> msg length: 141 characters
>>> message:
>>> hostname: smallMPI
>>> operating system: Linux
>>> release: 3.10.0-327.13.1.el7.x86_64
>>> processor: x86_64
>>>
>>> Backtraces of the two processes are as follows:
>>>
>>> (gdb) bt
>>> #0  opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:180
>>> #1  0x00007f10f46e50e4 in opal_progress () at runtime/opal_progress.c:161
>>> #2  0x00007f10f58a9d8b in opal_condition_wait (c=0x7f10f5df3c40 <ompi_request_cond>, m=0x7f10f5df3bc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
>>> #3  0x00007f10f58aa31b in ompi_request_default_wait_all (count=2, requests=0x7ffe7edd5a80, statuses=0x7ffe7edd5a50) at request/req_wait.c:287
>>> #4  0x00007f10f596f225 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0x6022c0 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
>>> #5  0x00007f10f596f92a in ompi_coll_base_barrier_intra_two_procs (comm=0x6022c0 <ompi_mpi_comm_world>, module=0xd5a7f0) at base/coll_base_barrier.c:308
>>> #6  0x00007f10f599ffec in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6022c0 <ompi_mpi_comm_world>, module=0xd5a7f0) at coll_tuned_decision_fixed.c:196
>>> #7  0x00007f10f58c86fd in PMPI_Barrier (comm=0x6022c0 <ompi_mpi_comm_world>) at pbarrier.c:63
>>> #8  0x00000000004010bd in main (argc=1, argv=0x7ffe7edd5d48) at mpi_hello_master_slave.c:115
>>>
>>> (gdb) bt
>>> #0  0x00007fffe9d6a988 in clock_gettime ()
>>> #1  0x00007f704bf64edd in clock_gettime () from /lib64/libc.so.6
>>> #2  0x00007f704b4deea5 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:183
>>> #3  0x00007f704b2f50e4 in opal_progress () at runtime/opal_progress.c:161
>>> #4  0x00007f704c6cc39c in opal_condition_wait (c=0x7f704ca03c40 <ompi_request_cond>, m=0x7f704ca03bc0 <ompi_request_lock>) at ../../../../opal/threads/condition.h:76
>>> #5  0x00007f704c6cc560 in ompi_request_wait_completion (req=0x165e580) at ../../../../ompi/request/request.h:383
>>> #6  0x00007f704c6cd724 in mca_pml_ob1_recv (addr=0x7fffe9cafa10, count=255, datatype=0x6020c0 <ompi_mpi_char>, src=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0) at pml_ob1_irecv.c:123
>>> #7  0x00007f704c4ff434 in PMPI_Recv (buf=0x7fffe9cafa10, count=255, type=0x6020c0 <ompi_mpi_char>, source=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0) at precv.c:79
>>> #8  0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
>>> #9  0x00000000004010b3 in main (argc=1, argv=0x7fffe9cafec8) at mpi_hello_master_slave.c:113
>>> (gdb) q
>>>
>>> I am going to try the tarball shortly, but hopefully someone can get some
>>> insight out of this much information. BTW, the code was compiled with the
>>> following flags:
>>>
>>> -Wall -Wextra -g3 -O0
>>>
>>> Let me reiterate that NO network communication was involved in any of
>>> these experiments; they were all single-node shared-memory (sm btl) jobs.
>>>
>>> Thanks
>>> Durga
>>>
>>> 1% of the executables have 99% of CPU privilege!
>>> Userspace code! Unite!! Occupy the kernel!!!
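P.S. On the question raised twice in the quoted thread -- why Open MPI keeps
trying PSM even though the link is down and PSM is never mentioned on the
command line: my reading of the backtraces (which may well be wrong) is that
the cm PML gets selected and then initializes the PSM MTL during MPI_Init
(mca_pml_cm_add_procs -> ompi_mtl_psm_add_procs), which would explain why
forcing -mca pml ob1 avoids the hang. If that is right, excluding the MTL
directly should also work, e.g. something like:

mpirun -np 2 -mca btl self,sm -mca mtl ^psm ./mpi_hello_master_slave

I have not tried that variant, though, so please treat it as a guess rather
than a confirmed workaround.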