Here is what the client is doing:

printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, rank) ;

    for (i = rank ; i < num_clients ; i++)
    {
      /* client performs a collective accept */
CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm)) ;

      printf("CLIENT: connected to server on port\n") ;
      [...]

    }

2) has rank 1, and 3) has rank 2, so unless you run 2) with num_clients=2, MPI_Comm_accept() is never called by 2); hence my analysis of the crash/hang.


I understand what you are trying to achieve. Keep in mind that MPI_Comm_accept() is a collective call, so when a new player wants to join, the other players must also invoke MPI_Comm_accept(), and it is up to you to make sure that happens.
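
For illustration only, a minimal sketch of what that means in code could look like the following (the helper name and the 'cur'/'port' parameters are placeholders, not taken from the test, and error handling is omitted): every task already in the intracommunicator, server and clients alike, enters the same accept whenever a newcomer joins, then merges the resulting intercommunicator back into a grown intracommunicator.

#include <mpi.h>

/* sketch: must be called collectively by EVERY task already in 'cur'
   (server and clients alike) each time one new player joins */
static MPI_Comm accept_one_player(MPI_Comm cur, const char *port)
{
    MPI_Comm inter, merged;

    MPI_Comm_accept(port, MPI_INFO_NULL, 0, cur, &inter);  /* collective over 'cur' */
    MPI_Intercomm_merge(inter, 0, &merged);                /* existing group on the low side */
    MPI_Comm_free(&inter);
    return merged;                                         /* the enlarged intracommunicator */
}

The client loop quoted at the top of this mail is what makes each client do exactly this for every player that joins after it.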


Cheers,


Gilles


On 7/19/2016 5:48 PM, M. D. wrote:


2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:

    MPI_Comm_accept must be called by all the tasks of the local
    communicator.

Yes, that's how I understand it. In the source code of the test, all the tasks call MPI_Comm_accept, the server as well as the relevant clients.

    so if you

    1) mpirun -np 1 ./singleton_client_server 2 1

    2) mpirun -np 1 ./singleton_client_server 2 0

    3) mpirun -np 1 ./singleton_client_server 2 0

    then 3) starts after 2) has exited, so on 1), intracomm is made of
    1) and an exited task (2)

In my opinion this is not true, because of the above-mentioned fact that MPI_Comm_accept is called by all the tasks of the local communicator.

    /*

    strictly speaking, there is a race condition, if 2) has exited,
    then MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.

    if 2) has not yet exited, then the test will hang because 2) does
    not invoke MPI_Comm_accept

    */

Task 2) does not exit, because it is blocked in its call to MPI_Comm_accept.


    there are different ways of seeing things :

    1) this is an incorrect usage of the test, the number of clients
    should be the same everywhere

    2) task 2) should not exit (because it did not call
    MPI_Comm_disconnect()) and the test should hang when

    starting task 3) because task 2) does not call MPI_Comm_accept()


Regarding 1): I am sorry, but maybe I do not understand what you mean. In my previous post I wrote that the number of clients is the same in every mpirun instance.
Regarding 2): the same applies as above.

    I do not know how you want to spawn your tasks.

    If 2) and 3) do not need to communicate with each other (they only
    communicate with 1)), then you can simply use MPI_COMM_WORLD as the
    local communicator in MPI_Comm_accept() in 1).

    If 2) and 3) need to communicate with each other, it would be much
    easier to call MPI_Comm_spawn or MPI_Comm_spawn_multiple only once
    in 1), so there is only one intercommunicator with all the tasks.
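
    For what it is worth, a minimal sketch of that spawn-once alternative could
    look like the following (the player count and the reuse of argv[0] as the
    client binary are only illustrative choices, and error handling is omitted):
    1) spawns all the players in a single call, so one intercommunicator covers
    every task, and one MPI_Intercomm_merge on each side then yields a single
    intracommunicator.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, inter, everyone;
        int nplayers = 3;                           /* placeholder value */

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);
        if (parent == MPI_COMM_NULL) {
            /* this is 1): spawn all players at once -> one intercommunicator */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, nplayers, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
            MPI_Intercomm_merge(inter, 0, &everyone);   /* parent on the low side */
        } else {
            /* this is a spawned player */
            MPI_Intercomm_merge(parent, 1, &everyone);  /* children on the high side */
        }
        /* 'everyone' now contains all the tasks */
        MPI_Comm_free(&everyone);
        MPI_Finalize();
        return 0;
    }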

My aim is that all the tasks are able to communicate with each other. I am implementing a distributed application, a game in which several players communicate with each other via MPI. It should work as follows: the first player creates a game and waits for the other players to connect to it. The other players, on different computers in the same network, can then join this game. Once they are connected, they should be able to play the game together.
I hope it is clear what my idea is. If it is not, just ask me, please.
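
For illustration, the joining player's side of that flow could look roughly like the sketch below, assuming the first player has published its port under a service name such as "game" with MPI_Publish_name (the service name and the helper name are only placeholders, and error handling is omitted):

#include <mpi.h>

/* sketch: a newcomer joins as a singleton and returns the merged
   intracommunicator containing everyone connected so far */
static MPI_Comm join_game(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, game;

    MPI_Lookup_name("game", MPI_INFO_NULL, port);                     /* published by the first player */
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);  /* matches the collective accept */
    MPI_Intercomm_merge(inter, 1, &game);                             /* newcomer on the high side */
    MPI_Comm_free(&inter);
    return game;
}

After joining, this task must then take part in the collective MPI_Comm_accept for every later player, exactly as in the client loop quoted at the top of the thread.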


    The current test program grows the intercommunicator incrementally,
    which does require extra synchronization steps.


    Cheers,


    Gilles

Cheers,

Matus

    On 7/19/2016 4:37 PM, M. D. wrote:
    Hi,
    thank you for your interest in this topic.

    So, I normally run the test as follows:
    Firstly, I run "server" (second parameter is 1):
    *mpirun -np 1 ./singleton_client_server number_of_clients 1*
    *
    *
    Secondly, I run corresponding number of "clients" via following
    command:
    *mpirun -np 1 ./singleton_client_server number_of_clients 0*
    *
    *
    So, for example with 3 clients I do:
    mpirun -np 1 ./singleton_client_server 3 1
    mpirun -np 1 ./singleton_client_server 3 0
    mpirun -np 1 ./singleton_client_server 3 0
    mpirun -np 1 ./singleton_client_server 3 0

    It means you are right - there should be the same number of
    clients in each mpirun instance.

    The test does not involve MPI_Comm_disconnect(), but the problem
    in the test occurs earlier: some of the clients (in most cases
    actually the last client) sometimes cannot connect to the server,
    and therefore all the clients and the server hang, waiting for the
    connection with the last client(s).

    So, the behaviour of the accept/connect methods is a bit confusing
    to me.
    If I understand your post correctly, the problem is not in the
    timeout value, is it?

    Cheers,

    Matus

    2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:

        How do you run the test?

        You should have the same number of clients in each mpirun
        instance; the following simple shell script starts the test the
        way I think it is supposed to be run.

        Note the test itself is arguable since MPI_Comm_disconnect()
        is never invoked (and you will observe some related
        dpm_base_disconnect_init errors).


        #!/bin/sh

        clients=3

        screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
        for i in $(seq $clients); do
            sleep 1
            screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
        done


        Ralph,


        this test fails with master.

        when the "server" (second parameter is 1), MPI_Comm_accept()
        fails with a timeout.

        i ompi/dpm/dpm.c, there is a hard coded 60 seconds timeout

        OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);

        but this is not the timeout that is triggered ...

        the eviction_cbfunc timeout function is invoked, and it has
        been set when opal_hotel_init() was invoked in
        orte/orted/pmix/pmix_server.c


        The default timeout is 2 seconds, but in this case (the user
        invokes MPI_Comm_accept), I guess the timeout should be infinite
        or 60 seconds (the hard-coded value described above).

        Sadly, if I set a higher timeout value (mpirun --mca
        orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does
        not return when the client invokes MPI_Comm_connect().


        Could you please have a look at this?


        Cheers,


        Gilles


        On 7/15/2016 9:20 PM, M. D. wrote:
        Hello,

        I have a problem with a basic client-server application. I
        tried to run the C program from
        https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
        I saw this program mentioned in many discussions on your
        website, so I expected it to work properly, but after more
        testing I found out that there is probably an error somewhere
        in the connect/accept methods. I have read many discussions
        and threads on your website, but I have not found the problem
        I am facing; it seems that nobody has had a similar problem.
        When I run this app with one server and several clients
        (3, 4, 5, 6, ...), the app sometimes hangs. It hangs when the
        second or a later client wants to connect to the server (it
        varies: sometimes the third client hangs, sometimes the
        fourth, sometimes the second, and so on).
        So the app starts to hang where the server waits in accept and
        the client waits in connect, and it is not possible to
        continue, because that client cannot connect to the server. It
        is strange, because I observed this behaviour only in some
        cases... Sometimes it works without any problems, sometimes it
        does not. The behaviour is unpredictable and not stable.

        I have installed Open MPI 1.10.2 on my Fedora 19. I have the
        same problem with the Java version of this application; it
        also hangs sometimes... I need this app in Java, but first it
        must work properly in the C implementation. Because of this
        strange behaviour I assume there may be an error inside the
        Open MPI implementation of the connect/accept methods. I also
        tried another version of Open MPI, 1.8.1, but the problem did
        not disappear.

        Could you help me find out what is causing the problem? Maybe
        I misunderstood something about Open MPI (or connect/accept)
        and the problem is on my side... I will appreciate any help,
        support, or interest in this topic.

        Best regards,
        Matus Dobrotka














