Re: [OMPI devel] SOS... help needed :(

2007-04-19 Thread Jeff Squyres
Are both the IB HCA and the ethernet interfaces on the same physical  
bus?


If they're not, the need for multiplexing them is diminished (but, of  
course, it depends on what you're trying to do -- if everything is  
using huge memory transfers, then your bottleneck will be RAM, not  
the bus that the NICs reside on).


That being said, something we have not explored at all is the idea of  
multiplexing at the MPI layer.  Perhaps something like "this is a low  
priority communicator; I want you to only use the 'tcp' BTL on it"  
and "this is a high priority communicator; I want you to only use the  
'openib' BTL on it".


I haven't thought at all about whether that is possible.  It would  
probably take some mucking around in both the bml and the ob1 pml.   
Hmm.  It may or may not be worth it, but I raise the possibility...
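
To make the idea concrete, here is a minimal sketch of a per-communicator transport policy. Everything in it is hypothetical: Open MPI has no such priority attribute or policy table, and only the BTL component names ('tcp', 'openib', 'self') are real. Today the closest real mechanism is job-wide selection through the "btl" MCA parameter (e.g., mpirun --mca btl tcp,self ...), which applies to every communicator in the job.

```python
# Hypothetical sketch: none of these names exist in Open MPI.
from enum import Enum


class Priority(Enum):
    LOW = "low"
    HIGH = "high"


# Which BTL components a communicator of a given priority may use.
# The component names are the ones mentioned in this thread.
BTL_POLICY = {
    Priority.LOW: ("tcp",),
    Priority.HIGH: ("openib",),
}


def eligible_btls(priority, available):
    """Return the available BTLs that the communicator's priority allows."""
    allowed = BTL_POLICY[priority]
    return [name for name in available if name in allowed]


available = ["openib", "tcp", "self"]
print(eligible_btls(Priority.LOW, available))   # ['tcp']
print(eligible_btls(Priority.HIGH, available))  # ['openib']
```

A real implementation would have to apply such a filter wherever the bml/pml choose a BTL for a message, which is the "mucking around" mentioned above.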



On Apr 19, 2007, at 9:18 PM, po...@cc.gatech.edu wrote:


> Hi,
>
> Some of our clusters use both Gigabit Ethernet and Infiniband,
> so we are trying to multiplex them.
>
> Thanks and Regards
> Pooja
>
>> On Thu, Apr 19, 2007 at 06:58:37PM -0400, po...@cc.gatech.edu wrote:
>>
>>> I am Pooja, working with Chaitali on this project.
>>> The idea is that while running parallelized code, if a huge chunk of
>>> serial computation is encountered, the underlying network
>>> infrastructure can be used for some other data transfer in the
>>> meantime.
>>> This increases network utilization.
>>> But this (non-MPI) data transfer should not keep MPI calls blocked.
>>> So we need to give them priorities.
>>> We are also trying to predict the behavior of the code (e.g., whether
>>> MPI calls arrive at short intervals or after long intervals) based on
>>> previous calls.
>>> That way we can make this mechanism more efficient.
>>
>> Ok, so you have a cluster with Infiniband, and while the network
>> traffic is low you want to utilize the Infiniband network for other
>> data transfers with a lower priority?
>>
>> What does this have to do with TCP? Or are you using TCP over
>> Infiniband?
>>
>> Regards
>> Christian Leber
>>
>> --
>> http://rettetdieti.vde-uni-mannheim.de/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel






--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SOS... help needed :(

2007-04-19 Thread pooja
Hi,

Some of our clusters use both Gigabit Ethernet and Infiniband,
so we are trying to multiplex them.

Thanks and Regards
Pooja


> On Thu, Apr 19, 2007 at 06:58:37PM -0400, po...@cc.gatech.edu wrote:
>
>> I am Pooja, working with Chaitali on this project.
>> The idea is that while running parallelized code, if a huge chunk of
>> serial computation is encountered, the underlying network
>> infrastructure can be used for some other data transfer in the
>> meantime.
>> This increases network utilization.
>> But this (non-MPI) data transfer should not keep MPI calls blocked.
>> So we need to give them priorities.
>> We are also trying to predict the behavior of the code (e.g., whether
>> MPI calls arrive at short intervals or after long intervals) based on
>> previous calls.
>> That way we can make this mechanism more efficient.
>
> Ok, so you have a cluster with Infiniband, and while the network
> traffic is low you want to utilize the Infiniband network for other
> data transfers with a lower priority?
>
> What does this have to do with TCP? Or are you using TCP over
> Infiniband?
>
> Regards
> Christian Leber
>
> --
> http://rettetdieti.vde-uni-mannheim.de/
>



Re: [OMPI devel] SOS... help needed :(

2007-04-19 Thread Christian Leber
On Thu, Apr 19, 2007 at 06:58:37PM -0400, po...@cc.gatech.edu wrote:

> I am Pooja, working with Chaitali on this project.
> The idea is that while running parallelized code, if a huge chunk of
> serial computation is encountered, the underlying network
> infrastructure can be used for some other data transfer in the
> meantime.
> This increases network utilization.
> But this (non-MPI) data transfer should not keep MPI calls blocked.
> So we need to give them priorities.
> We are also trying to predict the behavior of the code (e.g., whether
> MPI calls arrive at short intervals or after long intervals) based on
> previous calls.
> That way we can make this mechanism more efficient.

Ok, so you have a cluster with Infiniband, and while the network
traffic is low you want to utilize the Infiniband network for other
data transfers with a lower priority?

What does this have to do with TCP? Or are you using TCP over
Infiniband?

Regards
Christian Leber

-- 
http://rettetdieti.vde-uni-mannheim.de/



Re: [OMPI devel] SOS... help needed :(

2007-04-16 Thread Jeff Squyres

On Apr 15, 2007, at 10:25 PM, chaitali dherange wrote:

> To make things simple, we are making this scheduling static to some
> extent. By static I mean: we know that our clusters use Infiniband
> for MPI (from our study of the Open MPI source code, this
> specifically uses 'mca_btl_openib_send()' from
> ompi/mca/btl/openib/btl_openib.c), so all the non-MPI communication
> can be assumed to be TCP communication using 'mca_btl_tcp_send()'
> from ompi/mca/btl/tcp/btl_tcp.c.
>
> To implement this, we plan to use the following simple algorithm:
>
> - before calling 'mca_btl_openib_send()': Lock0(x);
> - before calling 'mca_btl_tcp_send()': Lock1(x);
>
> Algorithm:
>
> 1. Allow Lock0(x) -> Lock0(x), meaning Lock0(x) may be followed by
>    Lock0(x).
> 2. Allow Lock1(x) -> Lock1(x).
> 3. Do not allow Lock0(x) -> Lock1(x).
> 4. If Lock1(x) -> Lock0(x): since MPI calls are to have higher
>    priority than the non-MPI ones, the non-MPI communication should
>    be paused, and its state of course needs to be saved in a queue.
>    All other non-MPI communications newer than this should also be
>    added to the same queue. The MPI process trying to perform
>    Lock0(x) should then be allowed to complete, and only when all
>    the MPI communications are complete should the non-MPI
>    communication be allowed to resume.


> Currently we are working on a simple scheduling algorithm without
> giving any priorities to the 'MPI_Send' calls.
>
> However, to implement the project fully, we have the following
> queries :(
>
> - Can we abort or pause the non-MPI/TCP communication in any way?


Not really; the BTL interface was not really designed for that.   
Indeed, even if you wrote your own socket code to use TCP sockets  
outside of MPI / BTL / etc., you don't have full control of exactly  
what is sent (or when).  For example, if you write(fd, ...) and then  
decide you want to pause it, how would you do so?  You can stop  
calling write(), but that's not enough.  The kernel may have copied  
your buffer to a lower level and may be progressing the actual send  
behind the scenes.  So you haven't *guaranteed* that only one network  
interface is utilizing the host's resources (RAM, kernel, memory  
busses, etc.) at one time.
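
The following is a small sketch of that point, assuming a local socketpair() as a stand-in for a real TCP connection (so it demonstrates only the kernel buffering, not anything about a NIC or about Open MPI). The sender hands bytes to the kernel, then "pauses" by closing its end without writing anything more; the data is delivered anyway.

```python
import socket

# A connected pair of local stream sockets stands in for a TCP connection.
sender, receiver = socket.socketpair()

payload = b"x" * 1024

# send() returns as soon as the kernel has accepted the bytes. The peer
# has not called recv() yet, so the application on the other side has
# consumed nothing at this point.
accepted = sender.send(payload)

# "Pausing" at the application level: stop writing, and even close our
# end. The bytes already accepted are out of our hands.
sender.close()

# The data still arrives, because the kernel holds it and delivers it.
received = b""
while len(received) < accepted:
    chunk = receiver.recv(4096)
    if not chunk:
        break
    received += chunk
receiver.close()

print(accepted, len(received))
```

On a real network the same thing happens at a larger scale: whatever the kernel has already accepted may be transmitted at a time the application does not control.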


Indeed, the BTL interface is designed to acknowledge this  
asynchronicity -- it *assumes* that all network actions are non- 
blocking such that a "Send" action only *begins* the send; completion  
occurs later.
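
As a toy illustration of "send only begins the send", here is a sketch of a transport whose send() returns immediately and whose completion callback fires later from a progress call. The class and method names are hypothetical and are not the BTL interface; they only model the begin/complete split described above.

```python
from collections import deque


class ToyTransport:
    """Hypothetical stand-in for a BTL-like transport; not Open MPI's API."""

    def __init__(self):
        self._pending = deque()

    def send(self, data, on_complete):
        # Only *begins* the send: record it and return immediately.
        self._pending.append((data, on_complete))

    def progress(self):
        # Called later (e.g., from a progress loop); completes one send.
        if self._pending:
            data, on_complete = self._pending.popleft()
            on_complete(data)
            return True
        return False


completed = []
transport = ToyTransport()
transport.send(b"hello", completed.append)
started_but_not_done = list(completed)   # snapshot taken before progress()
transport.progress()
print(started_but_not_done, completed)   # [] [b'hello']
```

Any scheduler built on top of such an interface can only react to completions; it cannot assume that a send is finished when the call returns.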


So even if you use the TCP BTL to queue up a bunch of writes, if you  
then get an IB BTL send request, there isn't a good way to tell the  
TCP BTL "stop doing anything until I tell you otherwise" (i.e., don't  
process incoming reads and don't progress any further writes).  :-\



> - Given the assumption that the non-MPI communication is TCP, can we
>   make use of the built-in structures (I mean the buffer already
>   used) in mca_btl_tcp_send() to implement point 4 of the algorithm
>   mentioned above? And more importantly, how?


Not really :-(.  The BTLs, by design, are mutually unaware of each  
other.  In fact, the BTLs are quite dumb (as intended).  The design  
was to have the caller perform any higher-level coordination, while  
the BTLs are simple bit-movers between processes.


Using the BTLs directly, the best you might be able to do is to stop  
queuing up new messages to a secondary BTL until you have completions  
from all pending traffic on a primary BTL.  That might still be  
interesting, but it may not give you everything that you want --  
especially since a) I'm guessing that your ultimate goal may be to  
multi-schedule multiple communication libraries across the *same*  
interconnect, and b) given the asynchronous nature of parallel  
computing, you might be able to do a half-decent job of *sending*  
scheduling, but you may not be able to predict the behavior of  
*receive* scheduling (e.g., how can you predict/schedule that a low  
priority receive would not be occurring at the same time on the same  
node as a high priority send?).
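
A minimal sketch of that "best you might be able to do" approach follows, under loud assumptions: the gate class, its method names, and the send_low callable are all hypothetical and are not part of Open MPI. It holds back new low-priority sends while any high-priority send is still awaiting completion, and releases them in FIFO order afterwards.

```python
from collections import deque


class PriorityGate:
    """Hypothetical gate: holds low-priority sends while high-priority
    sends are still pending. Names are illustrative, not Open MPI's."""

    def __init__(self, send_low):
        self._send_low = send_low   # callable that starts a low-priority send
        self._high_pending = 0      # high-priority sends awaiting completion
        self._held = deque()        # low-priority messages not yet started

    def high_send_started(self):
        self._high_pending += 1

    def high_send_completed(self):
        self._high_pending -= 1
        if self._high_pending == 0:
            # Release held messages in FIFO order.
            while self._held:
                self._send_low(self._held.popleft())

    def submit_low(self, msg):
        if self._high_pending > 0:
            self._held.append(msg)  # do not start it yet
        else:
            self._send_low(msg)


wire = []
gate = PriorityGate(wire.append)
gate.submit_low("bulk-1")       # nothing pending: starts immediately
gate.high_send_started()
gate.submit_low("bulk-2")       # held
gate.submit_low("bulk-3")       # held
gate.high_send_completed()      # releases bulk-2 and bulk-3
print(wire)                     # ['bulk-1', 'bulk-2', 'bulk-3']
```

Note what the sketch does not do: "bulk-1" was already started before the high-priority send arrived and cannot be paused, and nothing here controls what the node is receiving at the same time. Those are exactly the two limitations described above.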



> Regards,
> Chaitali



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SOS... help needed :(

2007-04-16 Thread pooja
Hi!

I am Pooja, and I am working with Chaitali on this project.
What we meant by BTL/TCP is a call to btl_send that our program makes
directly from the higher levels. In short, we want to call the BTL
transport from the higher levels, so we have configured Open MPI with
all of the development header files installed (so that we can include
btl.h directly and use btl_tcp_send).
As a result, it will not be an MPI call but a direct call from our own
code.

So we just want to know whether this can be done and, if so, whether
what we are planning is right and doable.

Thanks and Regards
Pooja







> On Sun, Apr 15, 2007 at 10:25:06PM -0400, chaitali dherange wrote:
>
>> Hi,
>
> Hi!
>
>> giving more priority to the MPI calls over the non MPI ones.
>
>> static I mean.. we know that our clusters use Infiniband for MPI ...
>> so all the non MPI communication can be assumed to be TCP
>> communication using the 'mca_btl_tcp_send()' from the
>> ompi/mca/btl/tcp/btl_tcp.c file.
>
> I don't see why you call BTL/IB an MPI call but BTL/TCP a non-MPI one.
>
> The BTL components are used to provide MPI data transport. Depending on
> your installed hardware, this transport can be done via IB, Myrinet or
> at least TCP. Open MPI is even able to mix multiple transports and do
> message striping.
>
> I suggest you read the comments in pml.h to make things clear. Don't get
> confused, they still use the old terminology 'PTL' instead of 'BTL', but
> just consider them to be equal.
>
>
> --
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany
>
> private: http://adi.thur.de
>
>



Re: [OMPI devel] SOS... help needed :(

2007-04-16 Thread Adrian Knoth
On Sun, Apr 15, 2007 at 10:25:06PM -0400, chaitali dherange wrote:

> Hi,

Hi!

> giving more priority to the MPI calls over the non MPI ones.

> static I mean.. we know that our clusters use Infiniband for MPI ...
> so all the non MPI communication can be assumed to be TCP
> communication using the 'mca_btl_tcp_send()' from the
> ompi/mca/btl/tcp/btl_tcp.c file.

I don't see why you call BTL/IB an MPI call but BTL/TCP a non-MPI one.

The BTL components are used to provide MPI data transport. Depending on
your installed hardware, this transport can be done via IB, Myrinet or
at least TCP. Open MPI is even able to mix multiple transports and do
message striping.

I suggest you read the comments in pml.h to make things clear. Don't get
confused, they still use the old terminology 'PTL' instead of 'BTL', but
just consider them to be equal.


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de