Yes, I understand that, but I think this is exactly the situation you are
talking about. In my opinion, the test does exactly what you said - when a
new player wants to join, the other players must invoke MPI_Comm_accept().
All *other* players must invoke MPI_Comm_accept(). Only the last client (in
this case the last player that wants to join) does not invoke
MPI_Comm_accept(), because this client only invokes MPI_Comm_connect(). It
connects to the communicator in which all the other players are already
involved, and therefore this last client does not have to invoke
MPI_Comm_accept().
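
Just to make sure we mean the same thing, here is a minimal sketch of the
join step as I picture it (only a sketch - the names intracomm and port_name
are placeholders, and the port publishing/lookup as well as all error
handling are left out):

    #include <mpi.h>

    /* called by every player that is already in the game;
       collective over the current intracomm */
    MPI_Comm accept_new_player(MPI_Comm intracomm, const char *port_name)
    {
        MPI_Comm intercomm, merged;
        MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, intracomm, &intercomm);
        /* fold the new player into a single intracommunicator */
        MPI_Intercomm_merge(intercomm, 0 /* existing players ordered first */, &merged);
        MPI_Comm_free(&intercomm);
        return merged;
    }

    /* called only by the player that wants to join (a singleton task) */
    MPI_Comm join_game(const char *port_name)
    {
        MPI_Comm intercomm, merged;
        MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        /* the joining player gets the highest rank after the merge */
        MPI_Intercomm_merge(intercomm, 1, &merged);
        MPI_Comm_free(&intercomm);
        return merged;
    }

In my application I would additionally have the first player tell the
already-connected players (for example with a broadcast over the current
intracomm) that another join round is starting, so that all of them enter
MPI_Comm_accept() together.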

Am I still missing something in this reasoning?

Matus

2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:

> here is what the client is doing
>
>     printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, rank) ;
>
>     for (i = rank ; i < num_clients ; i++)
>     {
>       /* client performs a collective accept */
>       CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;
>
>       printf("CLIENT: connected to server on port\n") ;
>       [...]
>
>     }
>
> 2) has rank 1
>
> /* and 3) has rank 2 */
>
> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
> called, hence my analysis of the crash/hang
>
>
> I understand what you are trying to achieve; keep in mind that
> MPI_Comm_accept() is a collective call, so when a new player
> is willing to join, the other players must invoke MPI_Comm_accept(),
> and it is up to you to make sure that happens.
>
>
> Cheers,
>
>
> Gilles
>
> On 7/19/2016 5:48 PM, M. D. wrote:
>
>
>
> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>
>> MPI_Comm_accept must be called by all the tasks of the local communicator.
>>
> Yes, that's how I understand it. In the source code of the test, all the
> tasks call MPI_Comm_accept - the server and also the relevant clients.
>
>> so if you
>>
>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>
>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>
>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>
>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and
>> an exited task (2)
>>
> This is not true in my opinion - because of the above-mentioned fact that
> MPI_Comm_accept is called by all the tasks of the local communicator.
>
>> /*
>>
>> strictly speaking, there is a race condition, if 2) has exited, then
>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>
>> if 2) has not yet exited, then the test will hang because 2) does not
>> invoke MPI_Comm_accept
>>
>> */
>>
> Task 2) does not exit, because of the blocking call to MPI_Comm_accept.
>
>>
>>
>
>> there are different ways of seeing things:
>>
>> 1) this is an incorrect usage of the test, the number of clients should
>> be the same everywhere
>>
>> 2) task 2) should not exit (because it did not call
>> MPI_Comm_disconnect()) and the test should hang when
>>
>> starting task 3) because task 2) does not call MPI_Comm_accept()
>>
>>
> ad 1) I am sorry, but maybe I do not understand what you mean - in my
> previous post I wrote that the number of clients is the same in every
> mpirun instance.
> ad 2) it is the same as above
>
>> i do not know how you want to spawn your tasks.
>>
>> if 2) and 3) do not need to communicate with each other (they only
>> communicate with 1)), then
>>
>> you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
>>
>> if 2) and 3) need to communicate with each other, it would be much easier
>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
>>
>> so there is only one intercommunicator with all the tasks.
>>
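> If I understand the spawn-based alternative correctly, it would look
> roughly like this (just a sketch - "./player" is a hypothetical executable
> name, each player runs as a single task, and error handling is omitted;
> MPI_Comm_spawn_multiple would be the variant for starting different
> executables in one call):
>
>     #include <mpi.h>
>
>     int main(int argc, char *argv[])
>     {
>         MPI_Comm intercomm;
>
>         MPI_Init(&argc, &argv);
>
>         /* one call starts all three players and yields a single
>            intercommunicator containing all of them */
>         MPI_Comm_spawn("./player", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
>                        0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>
>         /* ... play the game over intercomm (or MPI_Intercomm_merge it first) ... */
>
>         MPI_Finalize();
>         return 0;
>     }
>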
> My aim is that all the tasks need to communicate with each other. I am
> implementing a distributed application - a game with multiple players
> communicating with each other via MPI. It should work as follows - the
> first player creates a game and waits for other players to connect to it.
> The other players can then join this game from different computers (in the
> same network). Once they are connected, they should be able to play the
> game together.
> I hope it is clear what my idea is. If it is not, please just ask me.
>
>>
>> The current test program grows the intercomm incrementally, which
>> does require extra steps for synchronization.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
> Cheers,
>
> Matus
>
>> On 7/19/2016 4:37 PM, M. D. wrote:
>>
>> Hi,
>> thank you for your interest in this topic.
>>
>> So, I normally run the test as follows:
>> Firstly, I run the "server" (the second parameter is 1):
>> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>>
>> Secondly, I run the corresponding number of "clients" via the following command:
>> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>>
>> So, for example with 3 clients I do:
>> mpirun -np 1 ./singleton_client_server 3 1
>> mpirun -np 1 ./singleton_client_server 3 0
>> mpirun -np 1 ./singleton_client_server 3 0
>> mpirun -np 1 ./singleton_client_server 3 0
>>
>> It means you are right - there should be the same number of clients in
>> each mpirun instance.
>>
>> The test does not involve MPI_Comm_disconnect(), but the problem in the
>> test occurs earlier, because some clients (in most cases actually the last
>> client) sometimes cannot connect to the server, and therefore all the
>> clients together with the server hang (waiting for the connection with the
>> last client(s)).
>>
>> So, the behaviour of the accept/connect methods is a bit confusing to me.
>> If I understand your post correctly, the problem is not in the timeout
>> value, is it?
>>
>> Cheers,
>>
>> Matus
>>
>> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>
>>> How do you run the test?
>>>
>>> you should have the same number of clients in each mpirun instance; the
>>> following simple shell script starts the test as i think it is supposed to be run
>>>
>>> note the test itself is arguable since MPI_Comm_disconnect() is never
>>> invoked
>>>
>>> (and you will observe some related dpm_base_disconnect_init errors)
>>>
>>>
>>> #!/bin/sh
>>>
>>> clients=3
>>>
>>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
>>>
>>> for i in $(seq $clients); do
>>>     sleep 1
>>>     screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
>>> done
>>>
>>>
>>> Ralph,
>>>
>>>
>>> this test fails with master.
>>>
>>> when the "server" (second parameter is 1), MPI_Comm_accept() fails with
>>> a timeout.
>>>
>>> i ompi/dpm/dpm.c, there is a hard coded 60 seconds timeout
>>>
>>> OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
>>>
>>> but this is not the timeout that is triggered ...
>>>
>>> the eviction_cbfunc timeout function is invoked, and it has been set
>>> when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c
>>>
>>>
>>> the default timeout is 2 seconds, but in this case (the user invokes
>>> MPI_Comm_accept), i guess the timeout should be infinite or 60 seconds
>>> (the hard coded value described above)
>>>
>>> sadly, if i set a higher timeout value (mpirun --mca
>>> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when
>>> the client invokes MPI_Comm_connect()
>>>
>>>
>>> could you please have a look at this?
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 7/15/2016 9:20 PM, M. D. wrote:
>>>
>>> Hello,
>>>
>>> I have a problem with a basic client - server application. I tried to run
>>> the C program from this website:
>>> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
>>> I saw this program mentioned in many discussions on your website, so I
>>> expected that it should work properly, but after more testing I found out
>>> that there is probably an error somewhere in the connect/accept methods. I
>>> have read many discussions and threads on your website, but I have not
>>> found a problem similar to the one I am facing; it seems nobody has had a
>>> similar problem. When I run this app with one server and several clients
>>> (3, 4, 5, 6, ...), the app sometimes hangs. It hangs when the second or a
>>> later client wants to connect to the server (it varies - sometimes the
>>> third client hangs, sometimes the fourth, sometimes the second, and so on).
>>> So the app starts to hang where the server waits in accept and the client
>>> waits in connect, and it is not possible to continue, because this client
>>> cannot connect to the server. It is strange, because I observed this
>>> behaviour only in some cases... Sometimes it works without any problems,
>>> sometimes it does not. The behaviour is unpredictable and not stable.
>>>
>>> I have installed Open MPI 1.10.2 on my Fedora 19. I have the same problem
>>> with the Java alternative of this application - it also hangs sometimes...
>>> I need this app in Java, but first it must work properly in the C
>>> implementation. Because of this strange behaviour I assume there may be an
>>> error inside the Open MPI implementation of the connect/accept methods. I
>>> also tried it with another version of Open MPI - 1.8.1. However, the
>>> problem did not disappear.
>>>
>>> Could you help me figure out what can cause the problem? Maybe I did not
>>> get something about Open MPI (or connect/accept) and the problem is on my
>>> side... I will appreciate any help, support, or interest in this topic.
>>>
>>> Best regards,
>>> Matus Dobrotka
>>>
>>>