It was easier for me to just construct this module than to explain how to do so :-)

I will commit it this evening (a couple of hours from now), as that is our standard practice. You'll need to use the developer's trunk to pick it up, though. Here are the envars you'll need to provide.

Each process needs to get the same values for the following:

* OMPI_MCA_ess=generic
* OMPI_MCA_orte_num_procs=<number of MPI procs>
* OMPI_MCA_orte_nodes=<comma-separated list of nodenames where the MPI procs reside>
* OMPI_MCA_orte_ppn=<number of procs per node>

Note that I have assumed this last value is a constant for simplicity. If that isn't the case, let me know - you could instead provide it as a comma-separated list of values, with an entry for each node.

In addition, you need to provide the following value, which is unique to each process:

* OMPI_MCA_orte_rank=<MPI rank>

Finally, you have to provide a range of static TCP ports for the processes to use. Pick any range that you know will be available across all the nodes, and then ensure that each process sees the following envar:

* OMPI_MCA_oob_tcp_static_ports=6000-6010   <== obviously, replace this with your range

The port range must be at least as large as the ppn for the job, since each proc on a node will take one of the provided ports.
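For illustration only, here is a minimal sketch of what each launched process could do with these envars, assuming a hypothetical 4-process job spread across two nodes (node001 and node002) with 2 procs per node, and assuming the launcher passes the rank as the first command-line argument. The launcher could just as well export the same variables itself before exec'ing the process instead of calling setenv:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* values that must be identical for every process in the job
       (the node names, proc count, and ppn are made up for this example) */
    setenv("OMPI_MCA_ess", "generic", 1);
    setenv("OMPI_MCA_orte_num_procs", "4", 1);
    setenv("OMPI_MCA_orte_nodes", "node001,node002", 1);
    setenv("OMPI_MCA_orte_ppn", "2", 1);
    setenv("OMPI_MCA_oob_tcp_static_ports", "6000-6010", 1);

    /* value that is unique to each process: its MPI rank
       (assumed here to arrive as the first command-line argument) */
    setenv("OMPI_MCA_orte_rank", (argc > 1) ? argv[1] : "0", 1);

    /* every process blocks here until the whole job has called init */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d is up\n", rank, size);

    MPI_Finalize();
    return 0;
}

Start one copy per rank (ranks 0-3 in this example) and they should wire up into a single MPI job inside MPI_Init, with no mpiexec and no ompi-server rendezvous.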
That should do it - I compute everything else I need from those values. Does that work for you?

Ralph


On Jul 22, 2010, at 6:48 AM, Philippe wrote:

> On Wed, Jul 21, 2010 at 10:44 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Jul 21, 2010, at 7:44 AM, Philippe wrote:
>>
>>> Ralph,
>>>
>>> Sorry for the late reply -- I was away on vacation.
>>
>> No problem at all!
>>
>>> Regarding your earlier question about how many processes were involved when the memory was entirely allocated: it was only two, a sender and a receiver. I'm still trying to pinpoint what can be different between the standalone case and the "integrated" case. I will try to find out what part of the code is allocating memory in a loop.
>>
>> Hmmm... that sounds like a bug in your program. Let me know what you find.
>>
>>> On Tue, Jul 20, 2010 at 12:51 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Well, I finally managed to make this work without the required ompi-server rendezvous point. The fix is only in the devel trunk right now - I'll have to ask the release managers for 1.5 and 1.4 if they want it ported to those series.
>>>
>>> Great -- I'll give it a try.
>>>
>>>> On the notion of integrating OMPI into your launch environment: remember that we don't necessarily require that you use mpiexec for that purpose. If your launch environment provides just a little info in the environment of the launched procs, we can usually devise a method that allows the procs to perform an MPI_Init as a single job without all this work you are doing.
>>>
>>> I'm working on creating operators using MPI for the IBM product "InfoSphere Streams". It has its own launching mechanism to start the processes. However, I can pass some information to the processes that belong to the same job (a Streams job -- which should map neatly to an MPI job).
>>>
>>>> The only difference is that your procs will all block in MPI_Init until they -all- have executed that function. If that isn't a problem, this would be a much more scalable and reliable method than doing it through massive calls to MPI_Comm_connect.
>>>
>>> In the general case that would be a problem, but for my prototype this is acceptable.
>>>
>>> In general, each process is composed of operators; some may be MPI-related and some may not. But in my case, I know ahead of time which processes will be part of the MPI job, so I can easily deal with the fact that they would block on MPI_Init (actually MPI_Init_thread, since it's using a lot of threads).
>>
>> We have talked in the past about creating a non-blocking MPI_Init as an extension to the standard. It would lock you to Open MPI, though...
>>
>> Regardless, at some point you would have to know how many processes are going to be part of the job so you can know when MPI_Init is complete. I would think you would require that info for the singleton wireup anyway - yes? Otherwise, how would you know when to quit running connect-accept?
>>
>
> The short answer is yes... although the longer answer is a bit more complicated. Currently I do know the number of connects I need to do on a per-port basis. A job can contain an arbitrary number of MPI processes, each opening one or more ports, so I know the count port by port, but I don't need to worry about how many MPI processes there are globally. To make things a bit more complicated, each MPI operator can be "fused" with other operators to make a process, and each fused operator may or may not require MPI. The bottom line is that, to get the total number of processes to calculate rank and size, I need to reverse-engineer the fusing that the compiler may do.
>
> But that's OK, I'm willing to do that for our prototype :-)
>
>>>
>>> Is there any documentation or example I can use to see what information I can pass to the processes to enable that? Is it just environment variables?
>>
>> No real documentation - a lack I should probably fill. At the moment we don't have a "generic" module for standalone launch, but I can create one, as it is pretty trivial. I would then need you to pass each process envars telling it the total number of processes in the MPI job, its rank within that job, and a file where some rendezvous process (can be rank=0) has provided that port string. Armed with that info, I can wire up the job.
>>
>> It won't be as scalable as an mpirun-initiated startup, but it will be much better than doing it from singletons.
>
> That would be great. I can definitely pass environment variables to each process.
>
>>
>> Or, if you prefer, we could set up an "infosphere" module that we could customize for this system. The main thing here would be to provide us with some kind of regex (or access to a file containing the info) that describes the map of rank to node, so we can construct the wireup communication pattern.
>
> I think for our prototype we are fine with the first method. I'd leave the cleaner implementation as a task for the product team ;-)
>
> Regarding the "generic" module, is that something you can put together quickly? Can I help in any way?
>
> Thanks!
> p
>
>> Either way would work. The second is more scalable, but I don't know if you have (or can construct) the map info.
>>
>>>
>>> Many thanks!
>>> p.
>>>
>>>> On Jul 18, 2010, at 4:09 PM, Philippe wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> Thanks for investigating.
>>>>>
>>>>> I've applied the two patches you mentioned earlier and ran with the ompi-server. Although I was able to run our standalone test, when I integrated the changes into our code, the processes entered a crazy loop and allocated all the available memory when calling MPI_Comm_connect.
>>>>>
>>>>> I was not able to identify why it works standalone but not integrated with our code. If I find out why, I'll let you know.
>>>>>
>>>>> Looking forward to your findings. We'll be happy to test any patches if you have some!
>>>>>
>>>>> p.
>>>>>
>>>>> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Okay, I can reproduce this problem. Frankly, I don't think this ever worked with OMPI, and I'm not sure how the choice of BTL makes a difference.
>>>>>>
>>>>>> The program is crashing in the communicator definition, which involves a communication over our internal out-of-band messaging system. That system has zero connection to any BTL, so it should crash either way.
>>>>>>
>>>>>> Regardless, I will play with this a little as time allows. Thanks for the reproducer!
>>>>>>
>>>>>>
>>>>>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to run a test program which consists of a server creating a port using MPI_Open_port and N clients using MPI_Comm_connect to connect to the server.
>>>>>>>
>>>>>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3 clients, I get the following error message:
>>>>>>>
>>>>>>> [node003:32274] [[37084,0],0]:route_callback tried routing message from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>>>>>>
>>>>>>> This only happens with the openib BTL. With the tcp BTL it works perfectly fine (ofud also works, as a matter of fact...). This has been tested on two completely different clusters, with identical results. In either case, the IB fabric works normally.
>>>>>>>
>>>>>>> Any help would be greatly appreciated! Several people on my team have looked at the problem. Google and the mailing list archive did not provide any clue. I believe that, from an MPI standpoint, my test program is valid (and it works with TCP, which makes me feel better about the sequence of MPI calls).
>>>>>>>
>>>>>>> Regards,
>>>>>>> Philippe.
>>>>>>>
>>>>>>>
>>>>>>> Background:
>>>>>>>
>>>>>>> I intend to use Open MPI to transport data inside a much larger application. Because of that, I cannot use mpiexec. Each process is started by our own "job management" and uses a name server to find out about the others. Once all the clients are connected, I would like the server to do MPI_Recv to get the data from all the clients. I don't care about the order or about which clients are sending data, as long as I can receive it with one call. To do that, the clients and the server go through a series of Comm_accept/Comm_connect/Intercomm_merge calls so that, at the end, all the clients and the server are inside the same intracomm.
>>>>>>>
>>>>>>> Steps:
>>>>>>>
>>>>>>> I have a sample program that shows the issue. I tried to make it as short as possible. It needs to be executed on a shared file system like NFS, because the server writes the port info to a file that the clients will read. To reproduce the issue, the following steps should be performed:
>>>>>>>
>>>>>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>>>>>> 1. ssh to the machine that will be the server
>>>>>>> 2. run ./ben12 3 1
>>>>>>> 3. ssh to the machine that will be client #1
>>>>>>> 4. run ./ben12 3 0
>>>>>>> 5. repeat steps 3-4 for clients #2 and #3
>>>>>>>
>>>>>>> The server accepts the connection from client #1 and merges it into a new intracomm. It then accepts the connection from client #2 and merges it. When client #3 arrives, the server accepts the connection, but that causes clients #1 and #2 to die with the error above (see the complete trace in the tarball).
>>>>>>>
>>>>>>> The exact steps are:
>>>>>>>
>>>>>>> - server opens a port
>>>>>>> - server does accept
>>>>>>> - client #1 does connect
>>>>>>> - server and client #1 do merge
>>>>>>> - server does accept
>>>>>>> - client #2 does connect
>>>>>>> - server, client #1 and client #2 do merge
>>>>>>> - server does accept
>>>>>>> - client #3 does connect
>>>>>>> - server, client #1, client #2 and client #3 do merge
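(For illustration only: the accept/connect/merge sequence above boils down to something like the sketch below. This is not the actual ben12.c from the tarball -- the shared port-file path, the client count, and the command-line convention are invented here, and error checking is omitted. The server has to be started first, so the port file exists before the clients read it.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define PORT_FILE "port.txt"   /* assumed to live on the shared NFS mount */
#define N_CLIENTS 3            /* the 1-server + 3-client scenario above  */

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intra, inter, merged;
    int rank, size, next;
    int my_index = (argc > 1) ? atoi(argv[1]) : 0;   /* 0 = server, 1..N = clients */

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_SELF, &intra);             /* every process starts alone */

    if (my_index == 0) {
        /* server: open a port and publish it through the shared file */
        MPI_Open_port(MPI_INFO_NULL, port);
        FILE *f = fopen(PORT_FILE, "w");
        fprintf(f, "%s\n", port);
        fclose(f);
    } else {
        /* client: read the port string and connect to the existing group */
        FILE *f = fopen(PORT_FILE, "r");
        fgets(port, sizeof(port), f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';

        MPI_Comm_connect(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 1, &merged);      /* join as the "high" group */
        MPI_Comm_free(&inter);
        MPI_Comm_free(&intra);
        intra = merged;
    }

    /* every process already in the intracomm must take part in each later
       accept/merge: the server accepts all three clients, client #1 helps
       accept #2 and #3, and so on -- which is why the whole group is
       involved by the time client #3 arrives */
    for (next = my_index + 1; next <= N_CLIENTS; next++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 0, &merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&intra);
        intra = merged;
    }

    MPI_Comm_rank(intra, &rank);
    MPI_Comm_size(intra, &size);
    printf("process %d: rank %d of %d in the final intracomm\n", my_index, rank, size);

    if (my_index == 0)
        MPI_Close_port(port);
    MPI_Comm_free(&intra);
    MPI_Finalize();
    return 0;
}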
>>>>>>>
>>>>>>> My InfiniBand network works normally with other test programs and applications (MPI or others, like verbs).
>>>>>>>
>>>>>>> Info about my setup:
>>>>>>>
>>>>>>> Open MPI version = 1.4.1 (I also tried 1.4.2, a nightly snapshot of 1.4.3, and a nightly snapshot of 1.5 --- all show the same error)
>>>>>>> config.log in the tarball
>>>>>>> "ompi_info --all" in the tarball
>>>>>>> OFED version = 1.3 installed from RHEL 5.3
>>>>>>> Distro = Red Hat Enterprise Linux 5.3
>>>>>>> Kernel = 2.6.18-128.4.1.el5 x86_64
>>>>>>> subnet manager = built-in SM from the Cisco/Topspin switch
>>>>>>> output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>>>>>>> "ulimit -l" says "unlimited"
>>>>>>>
>>>>>>> The tarball contains:
>>>>>>>
>>>>>>> - ben12.c: my test program showing the behavior
>>>>>>> - config.log / config.out / make.out / make-install.out / ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>>>>>>> - trace-tcp.txt: output of the server and each client when it works with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
>>>>>>> - trace-ib.txt: output of the server and each client when it fails with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
>>>>>>>
>>>>>>> I hope I provided enough info for somebody to reproduce the problem...
>>>>>>>
>>>>>>> <ompi-output.tar.bz2>