Re: [OMPI devel] SOS... help needed :(
Are both the IB HCA and the Ethernet interfaces on the same physical bus? If they're not, the need for multiplexing them is diminished (but, of course, it depends on what you're trying to do -- if everything is using huge memory transfers, then your bottleneck will be RAM, not the bus that the NICs reside on).

That being said, something we have not explored at all is the idea of multiplexing at the MPI layer. Perhaps something like "this is a low-priority communicator; I want you to use only the 'tcp' BTL on it" and "this is a high-priority communicator; I want you to use only the 'openib' BTL on it". I haven't thought at all about whether that is possible; it would probably take some mucking around in both the bml and the ob1 pml. Hmm. It may or may not be worth it, but I raise the possibility...

On Apr 19, 2007, at 9:18 PM, po...@cc.gatech.edu wrote:

> Hi,
>
> Some of our clusters use Gigabit Ethernet and InfiniBand, so we are
> trying to multiplex them.
>
> Thanks and regards,
> Pooja
>
>> On Thu, Apr 19, 2007 at 06:58:37PM -0400, po...@cc.gatech.edu wrote:
>>
>>> I am Pooja, working with Chaitali on this project.
>>> The idea behind this is that while running parallelized code, if a
>>> huge chunk of serial computation is encountered, the underlying
>>> network infrastructure can be used for some other data transfer at
>>> that time. This increases network utilization. But this (non-MPI)
>>> data transfer should not keep MPI calls blocking, so we need to give
>>> them priorities. We are also trying to predict the behavior of the
>>> code (e.g., whether more MPI calls are coming at short intervals or
>>> after a large interval) based on previous calls, so we can make this
>>> mechanism more efficient.
>>
>> Ok, so you have a cluster with InfiniBand, and while the network
>> traffic is low you want to utilize the InfiniBand network for other
>> data transfers with a lower priority?
>>
>> What does this have to do with TCP, or are you using TCP over
>> InfiniBand?
>> Regards,
>> Christian Leber
>>
>> --
>> http://rettetdieti.vde-uni-mannheim.de/

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
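Jeff's per-communicator idea does not exist in Open MPI, but per-*job* BTL selection does, via MCA parameters. As a rough sketch (component names depend on how your Open MPI was built; `low_priority_app` and `high_priority_app` are made-up program names), you could run two separate jobs pinned to different transports:

```shell
# Restrict one job to the TCP BTL (plus loopback/shared-memory components):
mpirun --mca btl tcp,self,sm -np 16 ./low_priority_app

# Restrict another job to the InfiniBand (openib) BTL:
mpirun --mca btl openib,self,sm -np 16 ./high_priority_app
```

This only partitions traffic between jobs; doing it per communicator inside one job would need the bml/pml changes Jeff mentions.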
Re: [OMPI devel] SOS... help needed :(
Hi,

Some of our clusters use Gigabit Ethernet and InfiniBand, so we are trying to multiplex them.

Thanks and regards,
Pooja

> On Thu, Apr 19, 2007 at 06:58:37PM -0400, po...@cc.gatech.edu wrote:
>
>> I am Pooja, working with Chaitali on this project.
>> The idea behind this is that while running parallelized code, if a
>> huge chunk of serial computation is encountered, the underlying
>> network infrastructure can be used for some other data transfer at
>> that time. This increases network utilization. But this (non-MPI)
>> data transfer should not keep MPI calls blocking, so we need to give
>> them priorities. We are also trying to predict the behavior of the
>> code (e.g., whether more MPI calls are coming at short intervals or
>> after a large interval) based on previous calls, so we can make this
>> mechanism more efficient.
>
> Ok, so you have a cluster with InfiniBand, and while the network
> traffic is low you want to utilize the InfiniBand network for other
> data transfers with a lower priority?
>
> What does this have to do with TCP, or are you using TCP over
> InfiniBand?
>
> Regards,
> Christian Leber
>
> --
> http://rettetdieti.vde-uni-mannheim.de/
Re: [OMPI devel] SOS... help needed :(
On Thu, Apr 19, 2007 at 06:58:37PM -0400, po...@cc.gatech.edu wrote:

> I am Pooja, working with Chaitali on this project.
> The idea behind this is that while running parallelized code, if a
> huge chunk of serial computation is encountered, the underlying
> network infrastructure can be used for some other data transfer at
> that time. This increases network utilization. But this (non-MPI)
> data transfer should not keep MPI calls blocking, so we need to give
> them priorities. We are also trying to predict the behavior of the
> code (e.g., whether more MPI calls are coming at short intervals or
> after a large interval) based on previous calls, so we can make this
> mechanism more efficient.

Ok, so you have a cluster with InfiniBand, and while the network traffic is low you want to utilize the InfiniBand network for other data transfers with a lower priority?

What does this have to do with TCP, or are you using TCP over InfiniBand?

Regards,
Christian Leber

--
http://rettetdieti.vde-uni-mannheim.de/
Re: [OMPI devel] SOS... help needed :(
On Apr 15, 2007, at 10:25 PM, chaitali dherange wrote:

> To make things simple, we are making this scheduling static to some
> extent. By static I mean: we know that our clusters use InfiniBand for
> MPI (from our study of the Open MPI source code, this precisely uses
> mca_btl_openib_send() from the ompi/mca/btl/openib/btl_openib.c file),
> so all the non-MPI communication can be assumed to be TCP
> communication using mca_btl_tcp_send() from the
> ompi/mca/btl/tcp/btl_tcp.c file.
>
> To implement this, we plan the following simple algorithm:
>
> - before calling mca_btl_openib_send(): lock0(x);
> - before calling mca_btl_tcp_send(): lock1(x);
>
> 1. Allow lock0(x) -> lock0(x), meaning lock0(x) may be followed by
>    lock0(x).
> 2. Allow lock1(x) -> lock1(x).
> 3. Do not allow lock0(x) -> lock1(x).
> 4. If lock1(x) -> lock0(x): since MPI calls are to have higher
>    priority than the non-MPI ones, the non-MPI communication should be
>    paused and all the related data, of course, needs to be put into a
>    queue (meaning the status of this transfer should be saved in a
>    queue). All other non-MPI communications newer than this one should
>    also be added to the same queue. The MPI process trying to perform
>    lock0(x) should then be allowed to complete, and only when all the
>    MPI communications are complete should the non-MPI communication be
>    allowed to proceed.
>
> Currently we are working on a simple scheduling algorithm without
> giving any priorities to the MPI_Send calls. However, to implement the
> project fully, we have the following queries :(
>
> - Can we abort or pause the non-MPI/TCP communication in any way?

Not really; the BTL interface was not really designed for that. Indeed, even if you wrote your own socket code to use TCP sockets outside of MPI / BTL / etc., you don't have full control of exactly what is sent (or when). For example, if you write(fd, ...) and then decide you want to pause it, how would you do so? You can stop calling write(), but that's not enough.
The kernel may have copied your buffer to a lower level and may be progressing the actual send behind the scenes. So you haven't *guaranteed* that only one network interface is utilizing the host's resources (RAM, kernel, memory buses, etc.) at one time.

Indeed, the BTL interface is designed to acknowledge this asynchronicity -- it *assumes* that all network actions are non-blocking, such that a "send" action only *begins* the send; completion occurs later. So even if you use the TCP BTL to queue up a bunch of writes, if you then get an IB BTL send request, there isn't a good way to tell the TCP BTL "stop doing anything until I tell you otherwise" (i.e., don't process incoming reads and don't progress any further writes). :-\

> - Given the assumption that the non-MPI communication is TCP, can we
>   make use of the built-in structures (I mean the buffer already used)
>   in mca_btl_tcp_send() for the implementation of point 4 in the
>   above-mentioned algorithm? And, more importantly, how?

Not really :-(. The BTLs, by design, are mutually unaware of each other. In fact, the BTLs are quite dumb (as intended): the design was to have the caller perform any higher-level coordination, while the BTLs are simple bit-movers between processes. Using the BTLs directly, the best you might be able to do is to stop queuing up new messages to a secondary BTL until you have completions for all pending traffic on a primary BTL.
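That gating idea -- hold back secondary-BTL traffic until all pending primary-BTL sends have completed -- can be sketched outside Open MPI with a counter and a condition variable. This is a minimal model, not BTL code; all function names here (`primary_send_begin`, `secondary_send`, etc.) are invented for illustration:

```c
/* Sketch: low-priority ("secondary BTL") sends wait until no
 * high-priority ("primary BTL") send is pending.  Not Open MPI code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
static int primary_pending = 0;   /* outstanding high-priority sends */

/* Called when a high-priority send is posted. */
static void primary_send_begin(void) {
    pthread_mutex_lock(&gate_lock);
    primary_pending++;
    pthread_mutex_unlock(&gate_lock);
}

/* Called from the completion callback of a high-priority send. */
static void primary_send_complete(void) {
    pthread_mutex_lock(&gate_lock);
    if (--primary_pending == 0)
        pthread_cond_broadcast(&gate_cond);
    pthread_mutex_unlock(&gate_lock);
}

/* A low-priority send blocks at the gate while any high-priority
 * send is still pending. */
static void secondary_send(const char *msg) {
    pthread_mutex_lock(&gate_lock);
    while (primary_pending > 0)
        pthread_cond_wait(&gate_cond, &gate_lock);
    pthread_mutex_unlock(&gate_lock);
    printf("secondary send: %s\n", msg);
}

static void *secondary_thread(void *arg) {
    secondary_send("bulk transfer");
    *(int *)arg = 1;              /* record that we got through the gate */
    return NULL;
}

/* Exercise the gate; returns 0 if the low-priority send was held
 * back until the high-priority send completed. */
int demo(void) {
    int done = 0;
    pthread_t t;
    primary_send_begin();
    pthread_create(&t, NULL, secondary_thread, &done);
    usleep(100000);               /* give the thread a chance to run */
    if (done != 0) return 1;      /* it must still be blocked here */
    primary_send_complete();
    pthread_join(t, NULL);
    return done == 1 ? 0 : 1;
}
```

Note that this only delays *posting* new low-priority sends; as Jeff explains above, bytes already handed to the kernel keep moving regardless.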
That might still be interesting, but it may not give you everything that you want -- especially since a) I'm guessing that your ultimate goal may be to multi-schedule multiple communication libraries across the *same* interconnect, and b) given the asynchronous nature of parallel computing, you might be able to do a half-decent job of *send* scheduling, but you may not be able to predict the behavior of *receive* scheduling (e.g., how can you predict/schedule that a low-priority receive will not occur at the same time, on the same node, as a high-priority send?).

> Regards,
> Chaitali

--
Jeff Squyres
Cisco Systems
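Jeff's earlier point about write() can be demonstrated without any MPI at all: once write() returns, the bytes live in a kernel buffer and the application can no longer "pause" that transfer. A small sketch using a local socket pair (no network involved; buffer sizes assumed to be at their usual defaults):

```c
/* Sketch: data handed to write(2) is already out of the application's
 * control -- the peer receives it even though we never "sent" again. */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int demo(void) {
    int sv[2];
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    /* write() returns as soon as the kernel has copied the data into
     * its socket buffer; nothing has been read on sv[1] yet, so the
     * transfer is already progressing entirely outside our control. */
    ssize_t n = write(sv[0], buf, sizeof(buf));
    close(sv[0]);                 /* "pausing" now changes nothing */

    /* The peer still receives every byte we wrote. */
    ssize_t got = 0, r;
    while ((r = read(sv[1], buf, sizeof(buf))) > 0)
        got += r;
    close(sv[1]);
    return (n == (ssize_t)sizeof(buf) && got == (ssize_t)sizeof(buf)) ? 0 : 1;
}
```

The same applies a fortiori to a real NIC, where the hardware may be DMA-ing the buffer while the CPU does something else.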
Re: [OMPI devel] SOS... help needed :(
Hi!

I am Pooja; I am working with Chaitali on this project.

What we meant by BTL/TCP is a call to btl_send that our program will make directly from the higher levels. In short, we want to call the BTL transport from the higher levels, so we have configured Open MPI with all the development header files installed (so that we can include btl.h directly and use btl_tcp_send). As a result, it will not be an MPI call but a direct call from our own code.

So we just want to know whether this can be done and, if yes, whether what we are planning is right and doable.

Thanks and regards,
Pooja

> On Sun, Apr 15, 2007 at 10:25:06PM -0400, chaitali dherange wrote:
>
>> Hi,
>
> Hi!
>
>> giving more priority to the MPI calls over the non MPI ones.
>
>> static I mean.. we know that our clusters use Infiniband for MPI ...
>> so all the non MPI communication can be assumed to be TCP
>> communication using the 'mca_btl_tcp_send()' from the
>> ompi/mca/btl/tcp/btl_tcp.c file.
>
> I don't see why you call BTL/IB an MPI call, but BTL/TCP non-MPI.
>
> The BTL components are used to provide MPI data transport. Depending
> on your installed hardware, this transport can be done via IB, Myrinet,
> or at least TCP. Open MPI is even able to mix multiple transports and
> do message striping.
>
> I suggest you read the comments in pml.h to make things clear. Don't
> get confused: they still use the old terminology 'PTL' instead of
> 'BTL'; just consider them to be equal.
>
> --
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany
>
> private: http://adi.thur.de
Re: [OMPI devel] SOS... help needed :(
On Sun, Apr 15, 2007 at 10:25:06PM -0400, chaitali dherange wrote:

> Hi,

Hi!

> giving more priority to the MPI calls over the non MPI ones.

> static I mean.. we know that our clusters use Infiniband for MPI ...
> so all the non MPI communication can be assumed to be TCP
> communication using the 'mca_btl_tcp_send()' from the
> ompi/mca/btl/tcp/btl_tcp.c file.

I don't see why you call BTL/IB an MPI call, but BTL/TCP non-MPI.

The BTL components are used to provide MPI data transport. Depending on your installed hardware, this transport can be done via IB, Myrinet, or at least TCP. Open MPI is even able to mix multiple transports and do message striping.

I suggest you read the comments in pml.h to make things clear. Don't get confused: they still use the old terminology 'PTL' instead of 'BTL'; just consider them to be equal.

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
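As a quick way to see which BTL transport components a given Open MPI installation provides (the exact component names in the output depend on the hardware and build options):

```shell
# List the BTL components this Open MPI build knows about:
ompi_info | grep "MCA btl"
```

Each line of output names one BTL component (e.g., tcp, openib, sm, self); all of them carry MPI traffic, which is Adrian's point above.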