Dear developers,

Thank you all for jumping in to help. Here is what I have found so far:

1. Running Netpipe (NPmpi) between the two nodes (in either order) was
successful, but following this test, my original code still hung.
2. Following Gilles's advice, I then added an MPI_Barrier() at the end of
the code, just before MPI_Finalize(), and, to my surprise, the code ran to
completion! (A rough sketch of this variant appears right after this list.)
3. Then, I took out the barrier, leaving the code the way it was before,
and it still ran to completion!
4. I tried several variations of call sequence, and all of them ran
successfully.
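
For reference, here is roughly what the variant from item 2 looked like (a
minimal sketch only, not my exact test code: the buffer contents, tag, and
the assumption of exactly two ranks are illustrative):

#include <string.h>
#include "mpi.h"

/* sketch only: intended to be launched with mpirun -np 2 */
int main(int argc, char *argv[])
{
    char buf[8] = "ha!";
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    else
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);  /* the extra barrier, just before Finalize */
    MPI_Finalize();
    return 0;
}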

I can't explain why the runtime behavior seems to depend on the phase of
the moon, but, although I cannot prove it, I have a gut feeling that there
is a bug somewhere in the development branch. I never run into this issue
when running the release branch. (I sometimes work as an MPI application
developer, when I use the release branch, and sometimes as an MPI
developer, when I use the master branch.)

Thank you all, again.

Durga

1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!

On Mon, Apr 18, 2016 at 8:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Durga,
>
> Can you run a simple netpipe over TCP using any of the two interfaces you
> mentioned?
>
> George
> On Apr 18, 2016 11:08 AM, "Gilles Gouaillardet" <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Another test is to swap the hostnames.
>> If the single-barrier test fails, that can hint at a firewall.
>>
>> Cheers,
>>
>> Gilles
>>
>> Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> sudo make uninstall
>> will not remove modules that are no longer built;
>> sudo rm -rf /usr/local/lib/openmpi
>> is safe, though.
>>
>> I confirm I did not see any issue on a system with two networks.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 4/18/2016 2:53 PM, dpchoudh . wrote:
>>
>> Hello Gilles
>>
>> I did a
>> sudo make uninstall
>> followed by a
>> sudo make install
>> on both nodes. But that did not make a difference. I will try your
>> tarball build suggestion a bit later.
>>
>> What I find a bit strange is that I seem to be the only one running into
>> this issue. What could I be doing wrong? Or have I discovered an obscure bug?
>>
>> Thanks
>> Durga
>>
>> 1% of the executables have 99% of CPU privilege!
>> Userspace code! Unite!! Occupy the kernel!!!
>>
>> On Mon, Apr 18, 2016 at 1:21 AM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> so you might want to
>>> rm -rf /usr/local/lib/openmpi
>>> and run
>>> make install
>>> again, just to make sure old stuff does not get in the way
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 4/18/2016 2:12 PM, dpchoudh . wrote:
>>>
>>> Hello Gilles
>>>
>>> Thank you very much for your feedback. You are right that my original
>>> stack trace was on code that was several weeks behind, but updating it just
>>> now did not seem to make a difference: I am copying the stack from the
>>> latest code below:
>>>
>>> On the master node:
>>>
>>> (gdb) bt
>>> #0  0x00007fc0524cbb7d in poll () from /lib64/libc.so.6
>>> #1  0x00007fc051e53116 in poll_dispatch (base=0x1aabbe0,
>>> tv=0x7fff29fcb240) at poll.c:165
>>> #2  0x00007fc051e4adb0 in opal_libevent2022_event_base_loop
>>> (base=0x1aabbe0, flags=2) at event.c:1630
>>> #3  0x00007fc051de9a00 in opal_progress () at runtime/opal_progress.c:171
>>> #4  0x00007fc04ce46b0b in opal_condition_wait (c=0x7fc052d3cde0
>>> <ompi_request_cond>,
>>>     m=0x7fc052d3cd60 <ompi_request_lock>) at
>>> ../../../../opal/threads/condition.h:76
>>> #5  0x00007fc04ce46cec in ompi_request_wait_completion (req=0x1b7b580)
>>>     at ../../../../ompi/request/request.h:383
>>> #6  0x00007fc04ce48d4f in mca_pml_ob1_send (buf=0x7fff29fcb480, count=4,
>>>     datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
>>> sendmode=MCA_PML_BASE_SEND_STANDARD,
>>>     comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259
>>> #7  0x00007fc052a62d73 in PMPI_Send (buf=0x7fff29fcb480, count=4,
>>> type=0x601080 <ompi_mpi_char>, dest=1,
>>>     tag=1, comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78
>>> #8  0x0000000000400afa in main (argc=1, argv=0x7fff29fcb5e8) at
>>> mpitest.c:19
>>> (gdb)
>>>
>>> And on the non-master node:
>>>
>>> (gdb) bt
>>> #0  0x00007fad2c32148d in nanosleep () from /lib64/libc.so.6
>>> #1  0x00007fad2c352014 in usleep () from /lib64/libc.so.6
>>> #2  0x00007fad296412de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0,
>>> nprocs=0, info=0x0, ninfo=0)
>>>     at src/client/pmix_client_fence.c:100
>>> #3  0x00007fad2960e1a6 in pmix120_fence (procs=0x0, collect_data=0) at
>>> pmix120_client.c:258
>>> #4  0x00007fad2c89b2da in ompi_mpi_finalize () at
>>> runtime/ompi_mpi_finalize.c:242
>>> #5  0x00007fad2c8c5849 in PMPI_Finalize () at pfinalize.c:47
>>> #6  0x0000000000400958 in main (argc=1, argv=0x7fff163879c8) at
>>> mpitest.c:30
>>> (gdb)
>>>
>>> And my configuration was done as follows:
>>>
>>>  $ ./configure --enable-debug --enable-debug-symbols
>>>
>>> I double-checked that there is no older installation of Open MPI getting
>>> mixed up with the master branch:
>>> sudo yum list installed | grep -i mpi
>>> shows nothing on either node, and pmap -p <pid> shows that all the
>>> libraries are coming from /usr/local/lib, which seems correct. I am also
>>> quite sure there is no firewall issue. I will try your suggestion of
>>> installing from a tarball and see how it goes.
>>>
>>> Thanks
>>> Durga
>>>
>>> 1% of the executables have 99% of CPU privilege!
>>> Userspace code! Unite!! Occupy the kernel!!!
>>>
>>> On Mon, Apr 18, 2016 at 12:47 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>>> here is your stack trace
>>>>
>>>> #6  0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0, count=4,
>>>>     datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
>>>>     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280
>>>> <ompi_mpi_comm_world>)
>>>>
>>>> at line 251
>>>>
>>>>
>>>> that would be line 259 in current master, and this file was updated 21
>>>> days ago, which suggests your master is not quite up to date.
>>>>
>>>> even if the message is sent eagerly, the ob1 pml does use an internal
>>>> request that it will wait on.
>>>>
>>>> btw, did you configure with --enable-mpi-thread-multiple ?
>>>> did you configure with --enable-mpirun-prefix-by-default ?
>>>> did you configure with --disable-dlopen ?
>>>>
>>>> as a first step, I'd recommend you download a tarball from
>>>> https://www.open-mpi.org/nightly/master,
>>>> run configure && make && make install
>>>> into a new install dir, and check whether the issue is still there.
>>>>
>>>> there could be some side effects if some old modules were not removed
>>>> and/or if you are not using the modules you expect.
>>>> /* when it hangs, you can pmap <pid> and check that the paths of the
>>>> openmpi libraries are the ones you expect */
>>>>
>>>> what if you do not send/recv but invoke MPI_Barrier multiple times ?
>>>> what if you send/recv a one byte message instead ?
>>>> did you double check there is no firewall running on your nodes ?
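>>>>
>>>> e.g. something along these lines (an untested sketch for a 2-rank run;
>>>> the loop count and tag are arbitrary):
>>>>
>>>> #include "mpi.h"
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     char c = 0;
>>>>     int i, rank;
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     /* barriers only, no send/recv */
>>>>     for (i = 0; i < 4; i++)
>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>     /* then a one-byte message */
>>>>     if (rank == 0)
>>>>         MPI_Send(&c, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>>>>     else if (rank == 1)
>>>>         MPI_Recv(&c, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }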
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 4/18/2016 1:06 PM, dpchoudh . wrote:
>>>>
>>>> Thank you for your suggestion, Ralph. But it did not make any
>>>> difference.
>>>>
>>>> Let me say that my code is about a week stale. I just did a git pull
>>>> and am building it right now. The build takes quite a bit of time, so I
>>>> avoid doing that unless there is a reason. But what I am trying out is the
>>>> most basic functionality, so I'd think a week or so of lag would not make a
>>>> difference.
>>>>
>>>> Does the stack trace suggest anything to you? It seems that the send
>>>> hangs, but a 4-byte message should be sent eagerly.
>>>>
>>>> Best regards
>>>> Durga
>>>>
>>>> 1% of the executables have 99% of CPU privilege!
>>>> Userspace code! Unite!! Occupy the kernel!!!
>>>>
>>>> On Sun, Apr 17, 2016 at 11:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> Try adding -mca oob_tcp_if_include eno1 to your cmd line and see if
>>>>> that makes a difference
>>>>>
>>>>> On Apr 17, 2016, at 8:43 PM, dpchoudh . <dpcho...@gmail.com> wrote:
>>>>>
>>>>> Hello Gilles and all
>>>>>
>>>>> I am sorry to be bugging the developers, but this issue keeps nagging
>>>>> me, and I am surprised that it does not seem to affect anybody else. But
>>>>> then again, I am using the master branch, and most users are probably
>>>>> using a released version.
>>>>>
>>>>> This time I am using a totally different cluster. It has NO verbs-capable
>>>>> interface; just two Ethernet interfaces (one of which has no IP address
>>>>> and hence is unusable) plus one proprietary interface that currently
>>>>> supports only IP traffic. The two usable IP interfaces (Ethernet and
>>>>> proprietary) are on different IP subnets.
>>>>>
>>>>> My test program is as follows:
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <string.h>
>>>>> #include "mpi.h"
>>>>>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>>     char host[128];
>>>>>     int n;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Get_processor_name(host, &n);
>>>>>     printf("Hello from %s\n", host);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &n);
>>>>>     printf("The world has %d nodes\n", n);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &n);
>>>>>     printf("My rank is %d\n", n);
>>>>> //#if 0
>>>>>     if (n == 0)
>>>>>     {
>>>>>         strcpy(host, "ha!");
>>>>>         MPI_Send(host, strlen(host) + 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
>>>>>         printf("sent %s\n", host);
>>>>>     }
>>>>>     else
>>>>>     {
>>>>>         //int len = strlen(host) + 1;
>>>>>         bzero(host, 128);
>>>>>         MPI_Recv(host, 4, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>         printf("Received %s from rank 0\n", host);
>>>>>     }
>>>>> //#endif
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> This program, when run between two nodes, hangs. The command was:
>>>>> [durga@b-1 ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
>>>>> -mca pml ob1 -mca btl_tcp_if_include eno1 ./mpitest
>>>>>
>>>>> And the hang occurs with the following output (eno1 is one of the gigabit
>>>>> Ethernet interfaces, which carries OOB traffic as well):
>>>>>
>>>>> Hello from b-1
>>>>> The world has 2 nodes
>>>>> My rank is 0
>>>>> Hello from b-2
>>>>> The world has 2 nodes
>>>>> My rank is 1
>>>>>
>>>>> Note that if I uncomment the #if 0 / #endif pair (i.e., comment out the
>>>>> MPI_Send()/MPI_Recv() part), the program runs to completion. Also note that
>>>>> the printfs following MPI_Send()/MPI_Recv() do not show up on the console.
>>>>>
>>>>> Upon attaching gdb, the stack trace from the master node is as follows:
>>>>>
>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>> glibc-2.17-78.el7.x86_64 libpciaccess-0.13.4-2.el7.x86_64
>>>>> (gdb) bt
>>>>> #0  0x00007f72a533eb7d in poll () from /lib64/libc.so.6
>>>>> #1  0x00007f72a4cb7146 in poll_dispatch (base=0xee33d0,
>>>>> tv=0x7fff81057b70)
>>>>>     at poll.c:165
>>>>> #2  0x00007f72a4caede0 in opal_libevent2022_event_base_loop
>>>>> (base=0xee33d0,
>>>>>     flags=2) at event.c:1630
>>>>> #3  0x00007f72a4c4e692 in opal_progress () at
>>>>> runtime/opal_progress.c:171
>>>>> #4  0x00007f72a0d07ac1 in opal_condition_wait (
>>>>>     c=0x7f72a5bb1e00 <ompi_request_cond>, m=0x7f72a5bb1d80
>>>>> <ompi_request_lock>)
>>>>>     at ../../../../opal/threads/condition.h:76
>>>>> #5  0x00007f72a0d07ca2 in ompi_request_wait_completion (req=0x113eb80)
>>>>>     at ../../../../ompi/request/request.h:383
>>>>> #6  0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0,
>>>>> count=4,
>>>>>     datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
>>>>>     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280
>>>>> <ompi_mpi_comm_world>)
>>>>>     at pml_ob1_isend.c:251
>>>>> #7  0x00007f72a58d6be3 in PMPI_Send (buf=0x7fff81057db0, count=4,
>>>>>     type=0x601080 <ompi_mpi_char>, dest=1, tag=1,
>>>>>     comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78
>>>>> #8  0x0000000000400afa in main (argc=1, argv=0x7fff81057f18) at
>>>>> mpitest.c:19
>>>>> (gdb)
>>>>>
>>>>> And the backtrace on the non-master node is:
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x00007ff3b377e48d in nanosleep () from /lib64/libc.so.6
>>>>> #1  0x00007ff3b37af014 in usleep () from /lib64/libc.so.6
>>>>> #2  0x00007ff3b0c922de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0,
>>>>> nprocs=0,
>>>>>     info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100
>>>>> #3  0x00007ff3b0c5f1a6 in pmix120_fence (procs=0x0, collect_data=0)
>>>>>     at pmix120_client.c:258
>>>>> #4  0x00007ff3b3cf8f4b in ompi_mpi_finalize ()
>>>>>     at runtime/ompi_mpi_finalize.c:242
>>>>> #5  0x00007ff3b3d23295 in PMPI_Finalize () at pfinalize.c:47
>>>>> #6  0x0000000000400958 in main (argc=1, argv=0x7fff785e8788) at
>>>>> mpitest.c:30
>>>>> (gdb)
>>>>>
>>>>> The hostfile is as follows:
>>>>>
>>>>> [durga@b-1 ~]$ cat hostfile
>>>>> 10.4.70.10 slots=1
>>>>> 10.4.70.11 slots=1
>>>>> #10.4.70.12 slots=1
>>>>>
>>>>> And the ifconfig output from the master node is as follows (the other
>>>>> node is similar; all the IP interfaces are in their respective subnets):
>>>>>
>>>>> [durga@b-1 ~]$ ifconfig
>>>>> eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>>>>>         inet 10.4.70.10  netmask 255.255.255.0  broadcast 10.4.70.255
>>>>>         inet6 fe80::21e:c9ff:fefe:13df  prefixlen 64  scopeid
>>>>> 0x20<link>
>>>>>         ether 00:1e:c9:fe:13:df  txqueuelen 1000  (Ethernet)
>>>>>         RX packets 48215  bytes 27842846 (26.5 MiB)
>>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>>         TX packets 52746  bytes 7817568 (7.4 MiB)
>>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>         device interrupt 16
>>>>>
>>>>> eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
>>>>>         ether 00:1e:c9:fe:13:e0  txqueuelen 1000  (Ethernet)
>>>>>         RX packets 0  bytes 0 (0.0 B)
>>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>>         TX packets 0  bytes 0 (0.0 B)
>>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>         device interrupt 17
>>>>>
>>>>> lf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2016
>>>>>         inet 192.168.1.2  netmask 255.255.255.0  broadcast
>>>>> 192.168.1.255
>>>>>         inet6 fe80::3002:ff:fe33:3333  prefixlen 64  scopeid 0x20<link>
>>>>>         ether 32:02:00:33:33:33  txqueuelen 1000  (Ethernet)
>>>>>         RX packets 10  bytes 512 (512.0 B)
>>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>>         TX packets 22  bytes 1536 (1.5 KiB)
>>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>
>>>>> lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
>>>>>         inet 127.0.0.1  netmask 255.0.0.0
>>>>>         inet6 ::1  prefixlen 128  scopeid 0x10<host>
>>>>>         loop  txqueuelen 0  (Local Loopback)
>>>>>         RX packets 26  bytes 1378 (1.3 KiB)
>>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>>         TX packets 26  bytes 1378 (1.3 KiB)
>>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>
>>>>> Please help me with this. I am stuck with the TCP transport, which is
>>>>> the most basic of all transports.
>>>>>
>>>>> Thanks in advance
>>>>> Durga
>>>>>
>>>>>
>>>>> 1% of the executables have 99% of CPU privilege!
>>>>> Userspace code! Unite!! Occupy the kernel!!!
>>>>>
>>>>> On Tue, Apr 12, 2016 at 9:32 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>
>>>>>> This is quite unlikely, and fwiw, your test program works for me.
>>>>>>
>>>>>> I suggest you check that your 3 TCP networks are usable, for example:
>>>>>>
>>>>>> $ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1
>>>>>> --mca btl_tcp_if_include xxx ./mpitest
>>>>>>
>>>>>> in which xxx is a [list of] interface name(s):
>>>>>> eth0
>>>>>> eth1
>>>>>> ib0
>>>>>> eth0,eth1
>>>>>> eth0,ib0
>>>>>> ...
>>>>>> eth0,eth1,ib0
>>>>>>
>>>>>> and see where the problem starts occurring.
>>>>>>
>>>>>> btw, are your 3 interfaces in 3 different subnets? is routing
>>>>>> required between two interfaces of the same type?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 4/13/2016 7:15 AM, dpchoudh . wrote:
>>>>>>
>>>>>> Hi all
>>>>>>
>>>>>> I have reported this issue before, but at the time I brushed it off as
>>>>>> something caused by my modifications to the source tree. It looks like
>>>>>> that is not the case.
>>>>>>
>>>>>> Just now, I did the following:
>>>>>>
>>>>>> 1. Cloned a fresh copy from master.
>>>>>> 2. Configured with the following flags, built, and installed it on my
>>>>>> two-node "cluster":
>>>>>> --enable-debug --enable-debug-symbols --disable-dlopen
>>>>>> 3. Compiled the following program, mpitest.c, with these flags: -g3
>>>>>> -Wall -Wextra
>>>>>> 4. Ran it like this:
>>>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl
>>>>>> self,tcp -mca pml ob1 ./mpitest
>>>>>>
>>>>>> With this, the code hangs at MPI_Barrier() on both nodes, after
>>>>>> generating the following output:
>>>>>>
>>>>>> Hello world from processor smallMPI, rank 0 out of 2 processors
>>>>>> Hello world from processor bigMPI, rank 1 out of 2 processors
>>>>>> smallMPI sent haha!
>>>>>> bigMPI received haha!
>>>>>> <Hangs until killed by ^C>
>>>>>> Attaching to the hung process at one node gives the following
>>>>>> backtrace:
>>>>>>
>>>>>> (gdb) bt
>>>>>> #0  0x00007f55b0f41c3d in poll () from /lib64/libc.so.6
>>>>>> #1  0x00007f55b03ccde6 in poll_dispatch (base=0x70e7b0,
>>>>>> tv=0x7ffd1bb551c0) at poll.c:165
>>>>>> #2  0x00007f55b03c4a90 in opal_libevent2022_event_base_loop
>>>>>> (base=0x70e7b0, flags=2) at event.c:1630
>>>>>> #3  0x00007f55b02f0144 in opal_progress () at
>>>>>> runtime/opal_progress.c:171
>>>>>> #4  0x00007f55b14b4d8b in opal_condition_wait (c=0x7f55b19fec40
>>>>>> <ompi_request_cond>, m=0x7f55b19febc0 <ompi_request_lock>) at
>>>>>> ../opal/threads/condition.h:76
>>>>>> #5  0x00007f55b14b531b in ompi_request_default_wait_all (count=2,
>>>>>> requests=0x7ffd1bb55370, statuses=0x7ffd1bb55340) at 
>>>>>> request/req_wait.c:287
>>>>>> #6  0x00007f55b157a225 in ompi_coll_base_sendrecv_zero (dest=1,
>>>>>> stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>>>>>>     at base/coll_base_barrier.c:63
>>>>>> #7  0x00007f55b157a92a in ompi_coll_base_barrier_intra_two_procs
>>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at
>>>>>> base/coll_base_barrier.c:308
>>>>>> #8  0x00007f55b15aafec in ompi_coll_tuned_barrier_intra_dec_fixed
>>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at
>>>>>> coll_tuned_decision_fixed.c:196
>>>>>> #9  0x00007f55b14d36fd in PMPI_Barrier (comm=0x601280
>>>>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>>>>> #10 0x0000000000400b0b in main (argc=1, argv=0x7ffd1bb55658) at
>>>>>> mpitest.c:26
>>>>>> (gdb)
>>>>>>
>>>>>> Thinking that this might be a bug in the tuned collectives, since that is
>>>>>> what the stack shows, I ran the program like this (basically adding the
>>>>>> ^tuned part):
>>>>>>
>>>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl
>>>>>> self,tcp -mca pml ob1 -mca coll ^tuned ./mpitest
>>>>>>
>>>>>> It still hangs, but now with a different stack trace:
>>>>>> (gdb) bt
>>>>>> #0  0x00007f910d38ac3d in poll () from /lib64/libc.so.6
>>>>>> #1  0x00007f910c815de6 in poll_dispatch (base=0x1a317b0,
>>>>>> tv=0x7fff43ee3610) at poll.c:165
>>>>>> #2  0x00007f910c80da90 in opal_libevent2022_event_base_loop
>>>>>> (base=0x1a317b0, flags=2) at event.c:1630
>>>>>> #3  0x00007f910c739144 in opal_progress () at
>>>>>> runtime/opal_progress.c:171
>>>>>> #4  0x00007f910db130f7 in opal_condition_wait (c=0x7f910de47c40
>>>>>> <ompi_request_cond>, m=0x7f910de47bc0 <ompi_request_lock>)
>>>>>>     at ../../../../opal/threads/condition.h:76
>>>>>> #5  0x00007f910db132d8 in ompi_request_wait_completion
>>>>>> (req=0x1b07680) at ../../../../ompi/request/request.h:383
>>>>>> #6  0x00007f910db1533b in mca_pml_ob1_send (buf=0x0, count=0,
>>>>>> datatype=0x7f910de1e340 <ompi_mpi_byte>, dst=1, tag=-16,
>>>>>> sendmode=MCA_PML_BASE_SEND_STANDARD,
>>>>>>     comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259
>>>>>> #7  0x00007f910d9c3b38 in ompi_coll_base_barrier_intra_basic_linear
>>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1b092c0) at
>>>>>> base/coll_base_barrier.c:368
>>>>>> #8  0x00007f910d91c6fd in PMPI_Barrier (comm=0x601280
>>>>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>>>>> #9  0x0000000000400b0b in main (argc=1, argv=0x7fff43ee3a58) at
>>>>>> mpitest.c:26
>>>>>> (gdb)
>>>>>>
>>>>>> The mpitest.c program is as follows:
>>>>>> #include <mpi.h>
>>>>>> #include <stdio.h>
>>>>>> #include <string.h>
>>>>>>
>>>>>> int main(int argc, char** argv)
>>>>>> {
>>>>>>     int world_size, world_rank, name_len;
>>>>>>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>>>>>     MPI_Get_processor_name(hostname, &name_len);
>>>>>>     printf("Hello world from processor %s, rank %d out of %d
>>>>>> processors\n", hostname, world_rank, world_size);
>>>>>>     if (world_rank == 1)
>>>>>>     {
>>>>>>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD,
>>>>>>                  MPI_STATUS_IGNORE);
>>>>>>         printf("%s received %s\n", hostname, buf);
>>>>>>     }
>>>>>>     else
>>>>>>     {
>>>>>>         strcpy(buf, "haha!");
>>>>>>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>>>>>         printf("%s sent %s\n", hostname, buf);
>>>>>>     }
>>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }
>>>>>>
>>>>>> The hostfile is as follows:
>>>>>> 10.10.10.10 slots=1
>>>>>> 10.10.10.11 slots=1
>>>>>>
>>>>>> The two nodes are connected by three physical and three logical networks:
>>>>>> Physical: Gigabit Ethernet, 10G iWARP, 20G Infiniband
>>>>>> Logical: IP (all 3), PSM (Qlogic Infiniband), Verbs (iWARP and
>>>>>> Infiniband)
>>>>>>
>>>>>> Please note again that this is a fresh, brand new clone.
>>>>>>
>>>>>> Is this a bug (perhaps a side effect of --disable-dlopen) or
>>>>>> something I am doing wrong?
>>>>>>
>>>>>> Thanks
>>>>>> Durga
>>>>>>
>>>>>> We learn from history that we never learn from history.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>> ...
>
>
>
