Matus,

This has very likely been fixed by
https://github.com/open-mpi/ompi/pull/2259
Can you download the patch at
https://github.com/open-mpi/ompi/pull/2259.patch and apply it manually to
v1.10?

Cheers,

Gilles


On Monday, August 29, 2016, M. D. <matus.dobro...@gmail.com> wrote:

>
> Hi,
>
> I would like to ask - are there any new findings or solutions for this
> problem?
>
> Cheers,
>
> Matus Dobrotka
>
> 2016-07-19 15:23 GMT+02:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
>
>> my bad for the confusion,
>>
>> I misread you and miswrote my reply.
>>
>> I will investigate this again.
>>
>> strictly speaking, the clients can only start after the server has first
>> written the port info to a file.
>> if you start a client right after the server starts, it might use
>> incorrect/outdated info and cause the whole test to hang.
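>>
>> for illustration, this is roughly the handshake the test relies on (a
>> minimal sketch, not the exact code of singleton_client_server.c; the
>> file name used here is illustrative):
>>
>>     /* server side: open a port and publish it in a file the clients read */
>>     char port_name[MPI_MAX_PORT_NAME];
>>     MPI_Open_port(MPI_INFO_NULL, port_name);
>>     FILE *f = fopen("server_port_name.txt", "w");
>>     fprintf(f, "%s\n", port_name);
>>     fclose(f);
>>
>>     /* client side: read the published port, then connect */
>>     char port_name[MPI_MAX_PORT_NAME];
>>     MPI_Comm intercomm;
>>     FILE *f = fopen("server_port_name.txt", "r");
>>     fscanf(f, "%s", port_name);
>>     fclose(f);
>>     MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
>>
>> if a client reads that file before the server has (re)written it, the
>> port name can be stale, which would match the hang described here.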
>>
>> I will start reproducing the hang
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Tuesday, July 19, 2016, M. D. <matus.dobro...@gmail.com> wrote:
>>
>>> Yes, I understand that, but I think this is exactly the situation you
>>> are talking about. In my opinion, the test does exactly what you said -
>>> when a new player is willing to join, all *other* players invoke
>>> MPI_Comm_accept(). Only the last client (in this case the last player
>>> that wants to join) does not invoke MPI_Comm_accept(), because this
>>> client only invokes MPI_Comm_connect(). It is connecting to the
>>> communicator in which all the other players are already involved, and
>>> therefore this last client does not have to invoke MPI_Comm_accept().
>>>
>>> Am I still missing something in this reasoning?
>>>
>>> Matus
>>>
>>> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>
>>>> here is what the client is doing:
>>>>
>>>>     printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, rank) ;
>>>>
>>>>     for (i = rank ; i < num_clients ; i++)
>>>>     {
>>>>       /* client performs a collective accept */
>>>>       CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;
>>>>
>>>>       printf("CLIENT: connected to server on port\n") ;
>>>>       [...]
>>>>     }
>>>>
>>>> 2) has rank 1, and 3) has rank 2.
>>>>
>>>> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
>>>> called, hence my analysis of the crash/hang.
>>>>
>>>>
>>>> I understand what you are trying to achieve. Keep in mind that
>>>> MPI_Comm_accept() is a collective call, so when a new player is willing
>>>> to join, all the other players must invoke MPI_Comm_accept(), and it is
>>>> up to you to make sure that happens.
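>>>>
>>>> concretely, when a new player joins, the pattern is roughly this (a
>>>> minimal sketch, not the exact test code; server_port_name and intracomm
>>>> are the same names used in the test excerpt above):
>>>>
>>>>     MPI_Comm intercomm, newintracomm;
>>>>
>>>>     /* every task already in the game (server + already joined clients): */
>>>>     MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm);
>>>>     MPI_Intercomm_merge(intercomm, /* high = */ 0, &newintracomm);
>>>>
>>>>     /* the joining task only: */
>>>>     MPI_Comm_connect(server_port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
>>>>     MPI_Intercomm_merge(intercomm, /* high = */ 1, &newintracomm);
>>>>
>>>> if any already-joined task skips the MPI_Comm_accept() call, the
>>>> accept/connect pair never matches and everybody hangs.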
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>> On 7/19/2016 5:48 PM, M. D. wrote:
>>>>
>>>>
>>>>
>>>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>>
>>>>> MPI_Comm_accept must be called by all the tasks of the local
>>>>> communicator.
>>>>>
>>>> Yes, that's how I understand it. In the source code of the test, all
>>>> the tasks call MPI_Comm_accept - the server and also the relevant clients.
>>>>
>>>>> so if you
>>>>>
>>>>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>>>>
>>>>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>>>>
>>>>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>>>>
>>>>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
>>>>> and an exited task (2)
>>>>>
>>>> In my opinion this is not true - because of the above-mentioned fact
>>>> that MPI_Comm_accept is called by all the tasks of the local communicator.
>>>>
>>>>> /*
>>>>>
>>>>> strictly speaking, there is a race condition: if 2) has exited, then
>>>>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>>>>
>>>>> if 2) has not yet exited, then the test will hang because 2) does not
>>>>> invoke MPI_Comm_accept
>>>>>
>>>>> */
>>>>>
>>>> Task 2) does not exit, because MPI_Comm_accept is a blocking call.
>>>>
>>>>>
>>>>>
>>>>
>>>>> there are different ways of seeing things:
>>>>>
>>>>> 1) this is an incorrect usage of the test, the number of clients
>>>>> should be the same everywhere
>>>>>
>>>>> 2) task 2) should not exit (because it did not call
>>>>> MPI_Comm_disconnect()) and the test should hang when starting task 3),
>>>>> because task 2) does not call MPI_Comm_accept()
>>>>>
>>>>>
>>>> ad 1) I am sorry, but maybe I do not understand what you mean - in my
>>>> previous post I wrote that the number of clients is the same in every
>>>> mpirun instance.
>>>> ad 2) the same as above
>>>>
>>>>> i do not know how you want to spawn your tasks.
>>>>>
>>>>> if 2) and 3) do not need to communicate with each other (they only
>>>>> communicate with 1)), then you can simply MPI_Comm_accept() on
>>>>> MPI_COMM_WORLD in 1)
>>>>>
>>>>> if 2) and 3) need to communicate with each other, it would be much
>>>>> easier to call MPI_Comm_spawn() or MPI_Comm_spawn_multiple() only once
>>>>> in 1), so there is only one intercommunicator with all the tasks.
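>>>>>
>>>>> for example, something along these lines (a minimal sketch of the
>>>>> spawn-once idea; the "./client" executable name and the number of
>>>>> clients are illustrative, not part of the test):
>>>>>
>>>>>     #include <mpi.h>
>>>>>
>>>>>     int main(int argc, char *argv[])
>>>>>     {
>>>>>         MPI_Comm intercomm, allcomm;
>>>>>         int nclients = 3;
>>>>>
>>>>>         MPI_Init(&argc, &argv);
>>>>>         /* 1) launches every client at once -> a single intercommunicator */
>>>>>         MPI_Comm_spawn("./client", MPI_ARGV_NULL, nclients, MPI_INFO_NULL,
>>>>>                        0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>>>>>         /* merge so that all tasks share one intracommunicator */
>>>>>         MPI_Intercomm_merge(intercomm, 0, &allcomm);
>>>>>         /* ... the game is played over allcomm ... */
>>>>>         MPI_Finalize();
>>>>>         return 0;
>>>>>     }
>>>>>
>>>>> the spawned clients would call MPI_Comm_get_parent() and merge with
>>>>> high=1 to obtain the same intracommunicator.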
>>>>>
>>>> My aim is that all the tasks can communicate with each other. I am
>>>> implementing a distributed application - a game with several players
>>>> communicating with each other via MPI. It should work as follows - the
>>>> first player creates a game and waits for other players to connect to
>>>> it. The other players can then join this game from different computers
>>>> (in the same network). Once they are connected, they should be able to
>>>> play the game together.
>>>> I hope it is clear what my idea is. If it is not, please just ask me.
>>>>
>>>>>
>>>>> The current test program grows the intercomm incrementally, which
>>>>> does require extra synchronization steps.
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> Gilles
>>>>>
>>>> Cheers,
>>>>
>>>> Matus
>>>>
>>>>> On 7/19/2016 4:37 PM, M. D. wrote:
>>>>>
>>>>> Hi,
>>>>> thank you for your interest in this topic.
>>>>>
>>>>> So, I normally run the test as follows:
>>>>> Firstly, I run "server" (second parameter is 1):
>>>>> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>>>>>
>>>>> Secondly, I run the corresponding number of "clients" via the following
>>>>> command:
>>>>> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>>>>>
>>>>> So, for example with 3 clients I do:
>>>>> mpirun -np 1 ./singleton_client_server 3 1
>>>>> mpirun -np 1 ./singleton_client_server 3 0
>>>>> mpirun -np 1 ./singleton_client_server 3 0
>>>>> mpirun -np 1 ./singleton_client_server 3 0
>>>>>
>>>>> It means you are right - there should be the same number of clients
>>>>> in each mpirun instance.
>>>>>
>>>>> The test does not involve MPI_Comm_disconnect(), but the problem occurs
>>>>> earlier: some of the clients (in most cases the last one) sometimes
>>>>> cannot connect to the server, and therefore all the clients and the
>>>>> server hang (waiting for the connection with the last client(s)).
>>>>>
>>>>> So the behaviour of the accept/connect methods is a bit confusing to me.
>>>>> If I understand your post correctly, the problem is not in the timeout
>>>>> value, is it?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Matus
>>>>>
>>>>> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>>>
>>>>>> How do you run the test?
>>>>>>
>>>>>> you should have the same number of clients in each mpirun instance;
>>>>>> the following simple shell script starts the test as I think it is
>>>>>> supposed to be run.
>>>>>>
>>>>>> note the test itself is arguable since MPI_Comm_disconnect() is never
>>>>>> invoked (and you will observe some related dpm_base_disconnect_init
>>>>>> errors)
>>>>>>
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> clients=3
>>>>>>
>>>>>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
>>>>>>
>>>>>> for i in $(seq $clients); do
>>>>>>     sleep 1
>>>>>>     screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
>>>>>> done
>>>>>>
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>>
>>>>>> this test fails with master.
>>>>>>
>>>>>> when run as the "server" (second parameter is 1), MPI_Comm_accept()
>>>>>> fails with a timeout.
>>>>>>
>>>>>> in ompi/dpm/dpm.c, there is a hard-coded 60-second timeout:
>>>>>>
>>>>>> OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
>>>>>>
>>>>>> but this is not the timeout that is triggered ...
>>>>>>
>>>>>> the eviction_cbfunc timeout function is invoked, and it has been set
>>>>>> when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c
>>>>>>
>>>>>>
>>>>>> the default timeout is 2 seconds, but in this case (the user invokes
>>>>>> MPI_Comm_accept), I guess the timeout should be infinite, or 60 seconds
>>>>>> (the hard-coded value described above)
>>>>>>
>>>>>> sadly, if I set a higher timeout value (mpirun --mca
>>>>>> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return
>>>>>> when the client invokes MPI_Comm_connect()
>>>>>>
>>>>>>
>>>>>> could you please have a look at this?
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/15/2016 9:20 PM, M. D. wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a problem with a basic client - server application. I tried to
>>>>>> run the C program from
>>>>>> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
>>>>>> I saw this program mentioned in many discussions on your website, so I
>>>>>> expected it to work properly, but after more testing I found out that
>>>>>> there is probably an error somewhere in the connect/accept methods. I
>>>>>> have read many discussions and threads on your website, but I have not
>>>>>> found a problem similar to the one I am facing. It seems that nobody
>>>>>> has had a problem like mine. When I run this app with one server and
>>>>>> more clients (3, 4, 5, 6, ...), the app sometimes hangs. It hangs when
>>>>>> the second or a later client wants to connect to the server (it
>>>>>> depends - sometimes the third client hangs, sometimes the fourth,
>>>>>> sometimes the second, and so on).
>>>>>> So the app starts to hang where the server waits in accept and the
>>>>>> client waits in connect, and it is not possible to continue, because
>>>>>> this client cannot connect to the server. It is strange, because I
>>>>>> observed this behaviour only in some cases... Sometimes it works
>>>>>> without any problems, sometimes it does not. The behaviour is
>>>>>> unpredictable and not stable.
>>>>>>
>>>>>> I have installed Open MPI 1.10.2 on my Fedora 19 machine. I have the
>>>>>> same problem with the Java version of this application - it also hangs
>>>>>> sometimes... I need this app in Java, but first it must work properly
>>>>>> in the C implementation. Because of this strange behaviour I assume
>>>>>> there may be an error inside the Open MPI implementation of the
>>>>>> connect/accept methods. I also tried another version of Open MPI -
>>>>>> 1.8.1 - but the problem did not disappear.
>>>>>>
>>>>>> Could you help me figure out what might cause the problem? Maybe I
>>>>>> misunderstood something about Open MPI (or connect/accept) and the
>>>>>> problem is on my side... I will appreciate any help, support, or
>>>>>> interest in this topic.
>>>>>>
>>>>>> Best regards,
>>>>>> Matus Dobrotka
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
>