Hello George

Adding --mca pml ob1 does make the program run. I just wanted to make sure
that was the expected behaviour (as opposed to a bug in mpirun).
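
For reference, the full command line that now completes here is simply the
scenario 4 command from the thread below, run against the corrected test
program:

mpirun -np 2 -mca btl self,sm -mca pml ob1 ./mpi_hello_master_slave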

Thanks
Durga

1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!

On Sun, Apr 24, 2016 at 9:43 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Add --mca pml ob1 to your mpirun command.
>
> George
>
>
> On Sunday, April 24, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>
>> Hello Gilles
>>
>> Thank you for finding the bug; it was not in the original code; I
>> introduced it while trying to 'simplify' the code.
>>
>> With the bug fixed, the code now runs in the last scenario. But it still
>> hangs with the following command line (even after updating to the latest
>> git tree, rebuilding, and reinstalling):
>>
>> mpirun -np 2 -mca btl self,sm ./mpi_hello_master_slave
>>
>> and the stack trace is still the same as before:
>>
>> (gdb) bt
>> #0  0x00007f4e4bd60117 in sched_yield () from /lib64/libc.so.6
>> #1  0x00007f4e4ba3d875 in amsh_ep_connreq_wrap () from
>> /lib64/libpsm_infinipath.so.1
>> #2  0x00007f4e4ba3e254 in amsh_ep_connect () from
>> /lib64/libpsm_infinipath.so.1
>> #3  0x00007f4e4ba470df in psm_ep_connect () from
>> /lib64/libpsm_infinipath.so.1
>> #4  0x00007f4e4c4c8975 in ompi_mtl_psm_add_procs (mtl=0x7f4e4c846500
>> <ompi_mtl_psm>, nprocs=2, procs=0x23bb420)
>>     at mtl_psm.c:312
>> #5  0x00007f4e4c52ef6b in mca_pml_cm_add_procs (procs=0x23bb420,
>> nprocs=2) at pml_cm.c:134
>> #6  0x00007f4e4c2e7d0f in ompi_mpi_init (argc=1, argv=0x7fffe930f9b8,
>> requested=0, provided=0x7fffe930f78c)
>>     at runtime/ompi_mpi_init.c:770
>> #7  0x00007f4e4c324aff in PMPI_Init (argc=0x7fffe930f7bc,
>> argv=0x7fffe930f7b0) at pinit.c:66
>> #8  0x000000000040101f in main (argc=1, argv=0x7fffe930f9b8) at
>> mpi_hello_master_slave.c:94
>>
>> As you can see, OMPI is trying to communicate over the PSM link, even
>> though that link is down and PSM is not mentioned in the arguments to
>> mpirun. (No remote nodes are mentioned in the arguments, either.)
>>
>> Is this the expected behaviour or is it a bug?
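>>
>> As a test-time workaround, and assuming the standard MCA "^" exclusion
>> syntax, something like the following should keep the PSM MTL out of the
>> selection ("psm" is the component name I am inferring from the
>> ompi_mtl_psm frames above):
>>
>> mpirun -np 2 -mca btl self,sm -mca mtl ^psm ./mpi_hello_master_slave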
>>
>> Thanks
>> Durga
>>
>> 1% of the executables have 99% of CPU privilege!
>> Userspace code! Unite!! Occupy the kernel!!!
>>
>> On Sun, Apr 24, 2016 at 8:12 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> Two comments:
>>>
>>> - the program is incorrect: slave() should MPI_Recv(..., MPI_ANY_TAG,
>>> ...) (a corrected sketch follows after the second comment)
>>>
>>> - current master uses pmix114, and your traces mention pmix120,
>>>   so your master is out of sync, or pmix120 is an old module that was
>>>   not manually removed.
>>>   FWIW, once in a while I
>>>   rm -rf /.../ompi_install_dir/lib/openmpi
>>>   to get rid of the removed modules.
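>>>
>>> A minimal sketch of the corrected receive in slave(), assuming the master
>>> is rank 0 and a char buffer (the names and sizes here are illustrative,
>>> not taken from your source):
>>>
>>>   #include <mpi.h>
>>>
>>>   static void slave(void)
>>>   {
>>>       char buf[256];
>>>       MPI_Status status;
>>>
>>>       /* match the master's send regardless of which tag it used */
>>>       MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, MPI_ANY_TAG,
>>>                MPI_COMM_WORLD, &status);
>>>
>>>       /* the tag that was actually used is then in status.MPI_TAG */
>>>   }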
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 4/25/2016 7:34 AM, dpchoudh . wrote:
>>>
>>> Hello all
>>>
>>> Attached is a simple MPI program (a modified version of a similar
>>> program that was posted by another user). This program, when run on a
>>> single node, hangs most of the time, as follows (in all cases, the OS
>>> was CentOS 7):
>>>
>>> Scenario 1: OMPI v1.10, single-socket quad-core machine with a Chelsio
>>> T3 card (link down) and GigE (link up)
>>>
>>> mpirun -np 2 <progname>
>>> Backtraces of the two spawned processes are as follows:
>>>
>>> (gdb) bt
>>> #0  0x00007f6471647aba in mca_btl_vader_component_progress () at
>>> btl_vader_component.c:708
>>> #1  0x00007f6475c6722a in opal_progress () at runtime/opal_progress.c:187
>>> #2  0x00007f64767b7685 in opal_condition_wait (c=<optimized out>,
>>> m=<optimized out>)
>>>     at ../opal/threads/condition.h:78
>>> #3  ompi_request_default_wait_all (count=2, requests=0x7ffd1d921530,
>>> statuses=0x7ffd1d921540)
>>>     at request/req_wait.c:281
>>> #4  0x00007f64709dd591 in ompi_coll_tuned_sendrecv_zero (stag=-16,
>>> rtag=-16,
>>>     comm=<optimized out>, source=1, dest=1) at coll_tuned_barrier.c:78
>>> #5  ompi_coll_tuned_barrier_intra_two_procs (comm=0x6022c0
>>> <ompi_mpi_comm_world>,
>>>     module=<optimized out>) at coll_tuned_barrier.c:324
>>> #6  0x00007f64767c92e6 in PMPI_Barrier (comm=0x6022c0
>>> <ompi_mpi_comm_world>) at pbarrier.c:70
>>> #7  0x00000000004010bd in main (argc=1, argv=0x7ffd1d9217d8) at
>>> mpi_hello_master_slave.c:115
>>> (gdb)
>>>
>>>
>>> (gdb) bt
>>> #0  mca_pml_ob1_progress () at pml_ob1_progress.c:45
>>> #1  0x00007feeae7dc22a in opal_progress () at runtime/opal_progress.c:187
>>> #2  0x00007feea9e125c5 in opal_condition_wait (c=<optimized out>,
>>> m=<optimized out>)
>>>     at ../../../../opal/threads/condition.h:78
>>> #3  ompi_request_wait_completion (req=0xe55200) at
>>> ../../../../ompi/request/request.h:381
>>> #4  mca_pml_ob1_recv (addr=<optimized out>, count=255,
>>> datatype=<optimized out>,
>>>     src=<optimized out>, tag=<optimized out>, comm=<optimized out>,
>>> status=0x7fff4a618000)
>>>     at pml_ob1_irecv.c:118
>>> #5  0x00007feeaf35068f in PMPI_Recv (buf=0x7fff4a618020, count=255,
>>>     type=0x6020c0 <ompi_mpi_char>, source=<optimized out>,
>>> tag=<optimized out>,
>>>     comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fff4a618000) at
>>> precv.c:78
>>> #6  0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
>>> #7  0x00000000004010b3 in main (argc=1, argv=0x7fff4a6184d8) at
>>> mpi_hello_master_slave.c:113
>>> (gdb)
>>>
>>>
>>> Scenario 2:
>>> Dual-socket hex-core machine with QLogic IB, Chelsio iWARP, and Fibre
>>> Channel (all links down) and GigE (link up); OpenMPI compiled from the
>>> master branch. It crashes as follows:
>>>
>>> [durga@smallMPI Desktop]$ mpirun -np 2 ./mpi_hello_master_slave
>>>
>>> mpi_hello_master_slave:39570 terminated with signal 11 at PC=20
>>> SP=7ffd438c00b8.  Backtrace:
>>>
>>> mpi_hello_master_slave:39571 terminated with signal 11 at PC=20
>>> SP=7ffee5903e08.  Backtrace:
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node smallMPI exited on
>>> signal 11 (Segmentation fault).
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Scenario 3:
>>> Exactly the same as scenario 2, but with a more explicit command line, as
>>> follows:
>>>
>>> [durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm
>>> ./mpi_hello_master_slave
>>> This hangs with the following backtraces (one per process):
>>>
>>> (gdb) bt
>>> #0  0x00007ff6639f049d in nanosleep () from /lib64/libc.so.6
>>> #1  0x00007ff663a210d4 in usleep () from /lib64/libc.so.6
>>> #2  0x00007ff662f72796 in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0,
>>> nprocs=0, info=0x0, ninfo=0)
>>>     at src/client/pmix_client_fence.c:100
>>> #3  0x00007ff662f4f0bc in pmix120_fence (procs=0x0, collect_data=0) at
>>> pmix120_client.c:255
>>> #4  0x00007ff663f941af in ompi_mpi_init (argc=1, argv=0x7ffc18c9afd8,
>>> requested=0, provided=0x7ffc18c9adac)
>>>     at runtime/ompi_mpi_init.c:813
>>> #5  0x00007ff663fc9c33 in PMPI_Init (argc=0x7ffc18c9addc,
>>> argv=0x7ffc18c9add0) at pinit.c:66
>>> #6  0x000000000040101f in main (argc=1, argv=0x7ffc18c9afd8) at
>>> mpi_hello_master_slave.c:94
>>> (gdb) q
>>>
>>> (gdb) bt
>>> #0  0x00007f5af7646117 in sched_yield () from /lib64/libc.so.6
>>> #1  0x00007f5af7323875 in amsh_ep_connreq_wrap () from
>>> /lib64/libpsm_infinipath.so.1
>>> #2  0x00007f5af7324254 in amsh_ep_connect () from
>>> /lib64/libpsm_infinipath.so.1
>>> #3  0x00007f5af732d0df in psm_ep_connect () from
>>> /lib64/libpsm_infinipath.so.1
>>> #4  0x00007f5af7d94a69 in ompi_mtl_psm_add_procs (mtl=0x7f5af80f8500
>>> <ompi_mtl_psm>, nprocs=2, procs=0xf53e60)
>>>     at mtl_psm.c:312
>>> #5  0x00007f5af7df3630 in mca_pml_cm_add_procs (procs=0xf53e60,
>>> nprocs=2) at pml_cm.c:134
>>> #6  0x00007f5af7bcc0d1 in ompi_mpi_init (argc=1, argv=0x7ffc485a2f98,
>>> requested=0, provided=0x7ffc485a2d6c)
>>>     at runtime/ompi_mpi_init.c:777
>>> #7  0x00007f5af7c01c33 in PMPI_Init (argc=0x7ffc485a2d9c,
>>> argv=0x7ffc485a2d90) at pinit.c:66
>>> #8  0x000000000040101f in main (argc=1, argv=0x7ffc485a2f98) at
>>> mpi_hello_master_slave.c:94
>>>
>>> This seems to suggest that it is trying to connect via PSM even though the
>>> link is down and PSM was not mentioned on the command line. Is this
>>> behavior expected?
>>>
>>>
>>> Scenario 4:
>>> Exactly the same as scenario 3, but with an even more explicit command line:
>>>
>>> [durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm -mca pml ob1
>>> ./mpi_hello_master_slave
>>>
>>> This hangs towards the end, after printing the output (as opposed to
>>> scenario 3, where it hangs at the connection-setup stage without printing
>>> anything).
>>>
>>> Process 0 of 2 running on host smallMPI
>>>
>>>
>>> Now 1 slave tasks are sending greetings.
>>>
>>> Process 1 of 2 running on host smallMPI
>>> Greetings from task 1:
>>>   message type:        3
>>>   msg length:          141 characters
>>>   message:
>>>     hostname:          smallMPI
>>>     operating system:  Linux
>>>     release:           3.10.0-327.13.1.el7.x86_64
>>>     processor:         x86_64
>>>
>>>
>>> Backtraces of the two processes are as follows:
>>>
>>> (gdb) bt
>>> #0  opal_timer_base_get_usec_clock_gettime () at
>>> timer_linux_component.c:180
>>> #1  0x00007f10f46e50e4 in opal_progress () at runtime/opal_progress.c:161
>>> #2  0x00007f10f58a9d8b in opal_condition_wait (c=0x7f10f5df3c40
>>> <ompi_request_cond>,
>>>     m=0x7f10f5df3bc0 <ompi_request_lock>) at
>>> ../opal/threads/condition.h:76
>>> #3  0x00007f10f58aa31b in ompi_request_default_wait_all (count=2,
>>> requests=0x7ffe7edd5a80,
>>>     statuses=0x7ffe7edd5a50) at request/req_wait.c:287
>>> #4  0x00007f10f596f225 in ompi_coll_base_sendrecv_zero (dest=1,
>>> stag=-16, source=1, rtag=-16,
>>>     comm=0x6022c0 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
>>> #5  0x00007f10f596f92a in ompi_coll_base_barrier_intra_two_procs
>>> (comm=0x6022c0 <ompi_mpi_comm_world>,
>>>     module=0xd5a7f0) at base/coll_base_barrier.c:308
>>> #6  0x00007f10f599ffec in ompi_coll_tuned_barrier_intra_dec_fixed
>>> (comm=0x6022c0 <ompi_mpi_comm_world>,
>>>     module=0xd5a7f0) at coll_tuned_decision_fixed.c:196
>>> #7  0x00007f10f58c86fd in PMPI_Barrier (comm=0x6022c0
>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>> #8  0x00000000004010bd in main (argc=1, argv=0x7ffe7edd5d48) at
>>> mpi_hello_master_slave.c:115
>>>
>>>
>>> (gdb) bt
>>> #0  0x00007fffe9d6a988 in clock_gettime ()
>>> #1  0x00007f704bf64edd in clock_gettime () from /lib64/libc.so.6
>>> #2  0x00007f704b4deea5 in opal_timer_base_get_usec_clock_gettime () at
>>> timer_linux_component.c:183
>>> #3  0x00007f704b2f50e4 in opal_progress () at runtime/opal_progress.c:161
>>> #4  0x00007f704c6cc39c in opal_condition_wait (c=0x7f704ca03c40
>>> <ompi_request_cond>,
>>>     m=0x7f704ca03bc0 <ompi_request_lock>) at
>>> ../../../../opal/threads/condition.h:76
>>> #5  0x00007f704c6cc560 in ompi_request_wait_completion (req=0x165e580)
>>> at ../../../../ompi/request/request.h:383
>>> #6  0x00007f704c6cd724 in mca_pml_ob1_recv (addr=0x7fffe9cafa10,
>>> count=255, datatype=0x6020c0 <ompi_mpi_char>,
>>>     src=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>,
>>> status=0x7fffe9caf9f0) at pml_ob1_irecv.c:123
>>> #7  0x00007f704c4ff434 in PMPI_Recv (buf=0x7fffe9cafa10, count=255,
>>> type=0x6020c0 <ompi_mpi_char>, source=0,
>>>     tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0)
>>> at precv.c:79
>>> #8  0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
>>> #9  0x00000000004010b3 in main (argc=1, argv=0x7fffe9cafec8) at
>>> mpi_hello_master_slave.c:113
>>> (gdb) q
>>>
>>> I am going to try the tarball shortly, but hopefully someone can glean
>>> some insight from this information. BTW, the code was compiled with
>>> the following flags:
>>>
>>> -Wall -Wextra -g3 -O0
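>>>
>>> That is, the compile command was along the lines of the one below (mpicc
>>> being the standard Open MPI wrapper; the source file name is the one that
>>> appears in the backtraces):
>>>
>>> mpicc -Wall -Wextra -g3 -O0 -o mpi_hello_master_slave mpi_hello_master_slave.c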
>>>
>>> Let me reiterate that NO network communication was involved in any of these
>>> experiments; they were all single-node, shared-memory (sm btl) jobs.
>>>
>>> Thanks
>>> Durga
>>>
>>>
>>>
>>> 1% of the executables have 99% of CPU privilege!
>>> Userspace code! Unite!! Occupy the kernel!!!
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2016/04/29018.php
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/04/29019.php
>>>
>>
>>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/04/29021.php
>
