Matus,
This has very likely been fixed by https://github.com/open-mpi/ompi/pull/2259
Can you download the patch at https://github.com/open-mpi/ompi/pull/2259.patch and apply it manually on v1.10?

Cheers,

Gilles

On Monday, August 29, 2016, M. D. <matus.dobro...@gmail.com> wrote:
>
> Hi,
>
> I would like to ask - are there any new solutions to or investigations of this problem?
>
> Cheers,
>
> Matus Dobrotka
>
> 2016-07-19 15:23 GMT+02:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
>
>> My bad for the confusion,
>>
>> I misread you and miswrote my reply.
>>
>> I will investigate this again.
>>
>> Strictly speaking, the clients can only start after the server has first written the port info to a file.
>> If you start a client right after the server starts, it might use incorrect/outdated info and cause the whole test to hang.
>>
>> I will start reproducing the hang.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tuesday, July 19, 2016, M. D. <matus.dobro...@gmail.com> wrote:
>>
>>> Yes, I understand that, but I think this is exactly the situation you are talking about. In my opinion, the test does exactly what you said - when a new player is willing to join, the other players must invoke MPI_Comm_accept().
>>> All *other* players must invoke MPI_Comm_accept(). Only the last client (in this case, the last player that wants to join) does not invoke MPI_Comm_accept(), because that client invokes only MPI_Comm_connect(). It is connecting to a communicator in which all the other players are already involved, and therefore this last client does not have to invoke MPI_Comm_accept().
>>>
>>> Am I still missing something in this reasoning?
>>>
>>> Matus
>>>
>>> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>
>>>> Here is what the client is doing:
>>>>
>>>>     printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, rank);
>>>>
>>>>     for (i = rank; i < num_clients; i++)
>>>>     {
>>>>         /* client performs a collective accept */
>>>>         CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm));
>>>>
>>>>         printf("CLIENT: connected to server on port\n");
>>>>         [...]
>>>>     }
>>>>
>>>> 2) has rank 1 /* and 3) has rank 2 */, so unless you run 2) with num_clients=2, MPI_Comm_accept() is never called, hence my analysis of the crash/hang.
>>>>
>>>> I understand what you are trying to achieve; keep in mind that MPI_Comm_accept() is a collective call, so when a new player is willing to join, the other players must invoke MPI_Comm_accept(),
>>>> and it is up to you to make sure that happens.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 7/19/2016 5:48 PM, M. D. wrote:
>>>>
>>>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>>
>>>>> MPI_Comm_accept must be called by all the tasks of the local communicator.
>>>>>
>>>> Yes, that's how I understand it. In the source code of the test, all the tasks call MPI_Comm_accept - the server and also the relevant clients.
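>>>> To be sure we mean the same pattern, here is a minimal sketch of one join step as I understand the test (the function name, the is_newcomer flag and the simplifications are mine; error handling omitted):
>>>>
>>>>     #include <mpi.h>
>>>>
>>>>     /* One join step: all current members call MPI_Comm_accept()
>>>>      * collectively over their shared intracomm, while the newcomer
>>>>      * (a singleton) calls MPI_Comm_connect(); both sides then merge
>>>>      * the resulting intercomm into one bigger intracomm. */
>>>>     static void join_step(int is_newcomer, const char *port, MPI_Comm *intracomm)
>>>>     {
>>>>         MPI_Comm intercomm;
>>>>
>>>>         if (is_newcomer) {
>>>>             /* the joining player knows nobody yet, so it connects alone */
>>>>             MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
>>>>         } else {
>>>>             /* server and already-joined players: collective over intracomm */
>>>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, *intracomm, &intercomm);
>>>>         }
>>>>
>>>>         /* merge so everybody ends up in one intracomm; the newcomer goes
>>>>          * "high" so the existing members keep their ranks */
>>>>         MPI_Intercomm_merge(intercomm, is_newcomer ? 1 : 0, intracomm);
>>>>         MPI_Comm_free(&intercomm);
>>>>     }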
>>>>
>>>>> So if you
>>>>>
>>>>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>>>>
>>>>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>>>>
>>>>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>>>>
>>>>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and an exited task (2).
>>>>>
>>>> This is not true in my opinion - because of the above-mentioned fact that MPI_Comm_accept is called by all the tasks of the local communicator.
>>>>
>>>>> /*
>>>>> strictly speaking, there is a race condition: if 2) has exited, then MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>>>> if 2) has not yet exited, then the test will hang because 2) does not invoke MPI_Comm_accept.
>>>>> */
>>>>>
>>>> Task 2) does not exit, because of the blocking call to MPI_Comm_accept.
>>>>
>>>>> There are different ways of seeing things:
>>>>>
>>>>> 1) this is an incorrect usage of the test; the number of clients should be the same everywhere
>>>>>
>>>>> 2) task 2) should not exit (because it did not call MPI_Comm_disconnect()) and the test should hang when starting task 3), because task 2) does not call MPI_Comm_accept()
>>>>>
>>>> ad 1) I am sorry, but maybe I do not understand what you mean - in my previous post I wrote that the number of clients is the same in every mpirun instance.
>>>> ad 2) the same as above.
>>>>
>>>>> I do not know how you want to spawn your tasks.
>>>>>
>>>>> If 2) and 3) do not need to communicate with each other (they only communicate with 1)), then you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1).
>>>>>
>>>>> If 2) and 3) need to communicate with each other, it would be much easier to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1), so there is only one intercommunicator with all the tasks.
>>>>>
>>>> My aim is that all the tasks need to communicate with each other. I am implementing a distributed application - a game with several players communicating with each other via MPI. It should work as follows: the first player creates a game and waits for other players to connect to it. From different computers (in the same network) the other players can join this game. Once they are connected, they should be able to play the game together.
>>>> I hope it is clear what my idea is. If it is not, just ask me, please.
>>>>
>>>>> The current test program grows the intercomm incrementally, which does require extra synchronization steps.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>> Cheers,
>>>>
>>>> Matus
>>>>
>>>>> On 7/19/2016 4:37 PM, M. D. wrote:
>>>>>
>>>>> Hi,
>>>>> thank you for your interest in this topic.
>>>>>
>>>>> So, I normally run the test as follows:
>>>>> Firstly, I run the "server" (second parameter is 1):
>>>>> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>>>>>
>>>>> Secondly, I run the corresponding number of "clients" via the following command:
>>>>> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>>>>>
>>>>> So, for example with 3 clients I do:
>>>>> mpirun -np 1 ./singleton_client_server 3 1
>>>>> mpirun -np 1 ./singleton_client_server 3 0
>>>>> mpirun -np 1 ./singleton_client_server 3 0
>>>>> mpirun -np 1 ./singleton_client_server 3 0
>>>>>
>>>>> It means you are right - there should be the same number of clients in each mpirun instance.
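>>>>>
>>>>> Spelled out for 3 clients, as I understand the test's for (i = rank; i < num_clients; i++) loop, the accepts line up like this:
>>>>>
>>>>>     join of client 1: server accepts alone           -> merged comm {server, c1}
>>>>>     join of client 2: server and c1 (rank 1) accept  -> {server, c1, c2}
>>>>>     join of client 3: server, c1 and c2 accept       -> {server, c1, c2, c3}
>>>>>
>>>>> The last client, c3 (rank 3), never accepts; it only connects.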
>>>>>
>>>>> The test does not involve MPI_Comm_disconnect(), but the problem occurs earlier in the test: one of the clients (in most cases actually the last client) sometimes cannot connect to the server, and therefore all the clients and the server hang (waiting for the connection with the last client(s)).
>>>>>
>>>>> So, the behaviour of the accept/connect methods is a bit confusing to me.
>>>>> If I understand your post correctly, the problem is not in the timeout value, is it?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Matus
>>>>>
>>>>> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>>>
>>>>>> How do you run the test?
>>>>>>
>>>>>> You should have the same number of clients in each mpirun instance; the following simple shell script starts the test as I think it is supposed to be started.
>>>>>>
>>>>>> Note the test itself is arguable, since MPI_Comm_disconnect() is never invoked
>>>>>> (and you will observe some related dpm_base_disconnect_init errors).
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> clients=3
>>>>>>
>>>>>> # server in a detached screen session, output tee'd to a log
>>>>>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
>>>>>> for i in $(seq $clients); do
>>>>>>     # give the server (and each previous client) a head start
>>>>>>     sleep 1
>>>>>>     screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
>>>>>> done
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> this test fails with master.
>>>>>>
>>>>>> When run as the "server" (second parameter is 1), MPI_Comm_accept() fails with a timeout.
>>>>>>
>>>>>> In ompi/dpm/dpm.c, there is a hard-coded 60-second timeout:
>>>>>>
>>>>>>     OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
>>>>>>
>>>>>> but this is not the timeout that is triggered ...
>>>>>>
>>>>>> The eviction_cbfunc timeout function is invoked, and it was set when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c.
>>>>>>
>>>>>> The default timeout is 2 seconds, but in this case (the user invokes MPI_Comm_accept), I guess the timeout should be infinite or 60 seconds (the hard-coded value described above).
>>>>>>
>>>>>> Sadly, if I set a higher timeout value (mpirun --mca orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when the client invokes MPI_Comm_connect().
>>>>>>
>>>>>> Could you please have a look at this?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/15/2016 9:20 PM, M. D. wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a problem with a basic client-server application. I tried to run the C program from
>>>>>> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
>>>>>> I saw this program mentioned in many discussions on your website, so I expected it to work properly, but after more testing I found out that there is probably an error somewhere in the connect/accept methods. I have read many discussions and threads on your website, but I have not found a problem similar to the one I am facing. It seems that nobody has had a similar problem. When I run this app with one server and several clients (3, 4, 5, 6, ...), the app sometimes hangs.
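>>>>>>
>>>>>> For context, the rendezvous in the test is file-based: the server opens a port and writes it to a file that the clients read back. A minimal sketch of that server-side handoff as I read the linked source (the file path and the omitted error handling are illustrative, not the test's exact code):
>>>>>>
>>>>>>     #include <mpi.h>
>>>>>>     #include <stdio.h>
>>>>>>
>>>>>>     int main(int argc, char **argv)
>>>>>>     {
>>>>>>         char port_name[MPI_MAX_PORT_NAME];
>>>>>>
>>>>>>         MPI_Init(&argc, &argv);                  /* singleton init */
>>>>>>         MPI_Open_port(MPI_INFO_NULL, port_name); /* obtain a connectable port */
>>>>>>
>>>>>>         /* publish the port so the clients can find it; the clients
>>>>>>          * must not be started before this file has been written */
>>>>>>         FILE *fp = fopen("/tmp/server_port.txt", "w"); /* illustrative path */
>>>>>>         fprintf(fp, "%s\n", port_name);
>>>>>>         fclose(fp);
>>>>>>
>>>>>>         /* ... accept/merge loop as in the test ... */
>>>>>>
>>>>>>         MPI_Close_port(port_name);
>>>>>>         MPI_Finalize();
>>>>>>         return 0;
>>>>>>     }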
>>>>>> To describe the hang more precisely: it hangs when the second or a later client wants to connect to the server (it varies - sometimes the third client hangs, sometimes the fourth, sometimes the second, and so on).
>>>>>> So the app starts to hang at the point where the server waits in accept and the client waits in connect, and it is not possible to continue, because this client cannot connect to the server. It is strange, because I observe this behaviour only in some cases... Sometimes it works without any problems, sometimes it does not. The behaviour is unpredictable and not stable.
>>>>>>
>>>>>> I have Open MPI 1.10.2 installed on Fedora 19. I have the same problem with the Java version of this application - it also sometimes hangs. I need this app in Java, but first it must work properly in the C implementation. Because of this strange behaviour, I assume there may be an error inside the Open MPI implementation of the connect/accept methods. I also tried it with another version of Open MPI - 1.8.1 - but the problem did not disappear.
>>>>>>
>>>>>> Could you help me find what can cause the problem? Maybe I did not understand something about Open MPI (or connect/accept) and the problem is on my side... I will appreciate any help, support, or interest in this topic.
>>>>>>
>>>>>> Best regards,
>>>>>> Matus Dobrotka
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users