Re: [OMPI devel] TCP BTL routability (was: ticket #972)

2008-07-29 Thread Adrian Knoth
On Tue, Jul 29, 2008 at 03:25:00PM -0400, Jeff Squyres wrote:

> For reference, the FAQ entry is here:
> 
> http://www.open-mpi.org/faq/?category=tcp#tcp-routability
> 
> It looks like we now *always* assume that two TCP peers are routable.   

As long as they share the same address family (IPv4 or IPv6).

> The code in question is in btl_tcp_proc.c with the loop starting at  
> line 413.

Yes. The FAQ is outdated; the new code is very different.

We now use graph theory: imagine a bipartite graph in which each interface
is a vertex, with one peer's interfaces on the left and the other's on the
right. There are no edges inside either peer, only from left to right,
hence a bipartite graph.

Every edge in this graph is given a weight depending on its quality. The
quality is "defined" in btl_tcp_proc.h:

enum mca_btl_tcp_connection_quality {
    CQ_NO_CONNECTION,
    CQ_PRIVATE_DIFFERENT_NETWORK,
    CQ_PRIVATE_SAME_NETWORK,
    CQ_PUBLIC_DIFFERENT_NETWORK,
    CQ_PUBLIC_SAME_NETWORK
};

CQ_NO_CONNECTION (weight 0) is for different address families, so we
don't try to connect from IPv6 to IPv4 or vice versa. The more likely a
connection is to be established, the higher the weight. So public
addresses on the same network (read: very close, probably sharing the
same link) are the best one can get, while private addresses on different
networks have the lowest probability of a successful connection.
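
For illustration only - this is a simplified sketch, not the actual code in
btl_tcp_proc.c, and the iface_t type and helper names are invented here -
the weight of one edge could be computed roughly like this (IPv4 only),
using the enum values shown above:

#include <stdint.h>
#include <sys/socket.h>

/* Simplified stand-in for the per-interface address information. */
typedef struct {
    int      family;   /* AF_INET or AF_INET6 */
    uint32_t addr;     /* IPv4 address, host byte order */
    uint32_t netmask;  /* IPv4 netmask, host byte order */
} iface_t;

/* RFC 1918 private ranges: 10/8, 172.16/12, 192.168/16 */
static int is_private(uint32_t a)
{
    return ((a & 0xff000000u) == 0x0a000000u) ||
           ((a & 0xfff00000u) == 0xac100000u) ||
           ((a & 0xffff0000u) == 0xc0a80000u);
}

static int same_network(const iface_t *l, const iface_t *r)
{
    uint32_t mask = l->netmask & r->netmask;   /* be conservative */
    return (l->addr & mask) == (r->addr & mask);
}

/* Weight of the edge between one local and one remote interface,
 * using the quality values from the enum above (CQ_NO_CONNECTION = 0). */
static int edge_quality(const iface_t *local, const iface_t *remote)
{
    if (local->family != remote->family) {
        return CQ_NO_CONNECTION;             /* never mix IPv4 and IPv6 */
    }
    if (is_private(local->addr) || is_private(remote->addr)) {
        return same_network(local, remote) ? CQ_PRIVATE_SAME_NETWORK
                                           : CQ_PRIVATE_DIFFERENT_NETWORK;
    }
    return same_network(local, remote) ? CQ_PUBLIC_SAME_NETWORK
                                       : CQ_PUBLIC_DIFFERENT_NETWORK;
}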

We then try to find a matching in this graph, i.e., a set of edges in
which no two edges share an endpoint on either side, thus avoiding
oversubscription of any interface.

In order to support striping, we look for a largest matching (read: one
that selects as many edges, i.e. links, as possible).

In order to ensure connectivity, we then choose, among all largest
matchings, the one with the highest total weight. These edges denote the
address pairs with the best probability of a successful connection.

In terms of graph theory, this is called a maximum cardinality maximum
weight matching.
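
Again just a sketch (the real implementation in btl_tcp_proc.c is organized
differently; NLOCAL, NREMOTE and the recursion here are made up for
illustration): with the handful of interfaces a node typically exports,
even an exhaustive search over all matchings is cheap, keeping the best
(cardinality, weight) pair:

#include <string.h>

#define NLOCAL  3                 /* local interfaces  (left side)  */
#define NREMOTE 3                 /* remote interfaces (right side) */

/* weight[i][j]: quality of connecting local i to remote j, 0 = no edge */
static int weight[NLOCAL][NREMOTE];

static int best_cardinality, best_weight;
static int best_match[NLOCAL];    /* chosen remote per local, -1 = unmatched */

/* Try every way of matching local interfaces i..NLOCAL-1. */
static void search(int i, int used_remote[NREMOTE], int match[NLOCAL],
                   int cardinality, int total_weight)
{
    if (i == NLOCAL) {
        /* maximum cardinality first, maximum weight as tie-breaker */
        if (cardinality > best_cardinality ||
            (cardinality == best_cardinality && total_weight > best_weight)) {
            best_cardinality = cardinality;
            best_weight      = total_weight;
            memcpy(best_match, match, sizeof(best_match));
        }
        return;
    }

    match[i] = -1;                           /* leave local i unmatched */
    search(i + 1, used_remote, match, cardinality, total_weight);

    for (int j = 0; j < NREMOTE; ++j) {
        if (!used_remote[j] && weight[i][j] > 0) {
            used_remote[j] = 1;
            match[i] = j;
            search(i + 1, used_remote, match,
                   cardinality + 1, total_weight + weight[i][j]);
            used_remote[j] = 0;
            match[i] = -1;
        }
    }
}

Filling weight[][] from the edge qualities and calling search(0, ...) with
zeroed used_remote[] and match[] arrays leaves, in best_match[], the address
pairs the BTL should actually try to connect.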


You can find the whole background story in Chapter 3:

   http://cluster.inf-ra.uni-jena.de/~adi/peiselt-thesis.pdf


We also have a brief IEEE paper on this:

   
http://www.ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=4476518=4476565=56=46


In other words: #972 is somewhat obsolete, and the FAQ entry should surely
be removed or updated. I don't know to what extent, but if you want me to
write some lines, I could probably come up with a not-so-scientific
description.


HTH

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


[OMPI devel] TCP BTL routability (was: ticket #972)

2008-07-29 Thread Jeff Squyres

On Jul 29, 2008, at 3:20 PM, Terry Dontje wrote:

So, we've pinged ticket #972 several times to see if the issue it
covers has been fixed and have not really gotten a response in the
last few months.  While talking with Jeff about a recent thread on
the users list about this issue, he found that the code in
btl_tcp_proc.c that determines whether the tcp btl can be used has
changed significantly between 1.2 and 1.3.  So the question is: have
these changes changed the rules for whether connections between two
nodes should use the tcp btl as described in the FAQ?  If so, we
should update the FAQ.



For reference, the FAQ entry is here:

http://www.open-mpi.org/faq/?category=tcp#tcp-routability

It looks like we now *always* assume that two TCP peers are routable.   
The code in question is in btl_tcp_proc.c with the loop starting at  
line 413.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Jeff Squyres

On Jul 29, 2008, at 9:47 AM, Jeff Squyres wrote:

Ok.  FWIW, Pasha and I think that openib has supported "send-to-self"
for a while (we don't know exactly when, but Pasha thinks the code that
doesn't check for self in add_procs is very old).  But it only broke
recently.



More in the FWIW category -- we just checked, and OMPI v1.2 supported  
"--mca btl openib" (note the lack of ",self").  So the openib BTL has,  
indeed, supported send-to-self for quite a while.


This should help narrow where to start looking for the problem:  
changes within the last few weeks.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Jeff Squyres
Ok.  FWIW, Pasha and I think that openib has supported "send-to-self"
for a while (we don't know exactly when, but Pasha thinks the code that
doesn't check for self in add_procs is very old).  But it only broke
recently.



On Jul 29, 2008, at 9:31 AM, George Bosilca wrote:

I ran a few tests and the only combination leading to a deadlock is
openib and self.  As openib is the only BTL (besides self, of course)
supporting self communications, I guess it interferes with self in some
more or less strange way.  I haven't had the time to dig deeper yet to
see what exactly happens there; I'll schedule this for later today.


 george.

On Jul 29, 2008, at 8:52 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago  
(Pasha: do you remember?) because Mellanox HCAs are capable of  
send-to-self (process) and there were no code changes necessary to  
enable it.  So it allowed a slightly simpler command line.  This  
was quite a while ago, IIRC.

Yep, correct.

FYI. In my MTT testing I also see a lot of killed tests.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Change in hostfile behavior

2008-07-29 Thread Ralph Castain
Lenny's point is true - except for the danger of setting that mca param
and its possible impact on ORTE daemons+mpirun - see other note in that
regard.  However, it would only be useful if the same user were doing
both runs.


I believe Tim was concerned about the case where two users are sharing
nodes.  There is no good solution for that case.  Two mpiruns by
different users that share a node, with no knowledge of each other's
actions, will collide.


We should probably warn about that in our FAQ or something, since that
is a fairly common use case - the only thing I can think of is to
recommend people default to running without affinity and only set it
when they -know- they have sole use of their nodes.



On Jul 29, 2008, at 12:17 AM, Lenny Verkhovsky wrote:

For two separate runs we can use the slot_list parameter
(opal_paffinity_base_slot_list) to have paffinity:

1: mpirun -mca opal_paffinity_base_slot_list "0-1"

2: mpirun -mca opal_paffinity_base_slot_list "2-3"


On 7/28/08, Ralph Castain  wrote:
Actually, this is true today regardless of this change. If two  
separate mpirun invocations share a node and attempt to use  
paffinity, they will conflict with each other. The problem isn't  
caused by the hostfile sub-allocation. The problem is that the two  
mpiruns have no knowledge of each other's actions, and hence assign  
node ranks to each process independently.


Thus, we would have two procs that think they are node rank=0 and  
should therefore bind to the 0 processor, and so on up the line.


Obviously, if you run within one mpirun and have two app_contexts,  
the hostfile sub-allocation is fine - mpirun will track node rank  
across the app_contexts. It is only the use of multiple mpiruns that  
share nodes that causes the problem.


Several of us have discussed this problem and have a proposed  
solution for 1.4. Once we get past 1.3 (someday!), we'll bring it to  
the group.




On Jul 28, 2008, at 10:44 AM, Tim Mattox wrote:

My only concern is how this will interact with PLPA.
Say two Open MPI jobs each use "half" the cores (slots) on a
particular node... how would they be able to bind themselves to a
disjoint set of cores?  I'm not asking you to solve this, Ralph; I'm
just pointing it out so we can maybe warn users that if both jobs
sharing a node try to use processor affinity, we don't make that
magically work well, and that we would expect it to do quite poorly.

I could see disabling paffinity and/or warning if it was enabled for
one of these "fractional" nodes.

On Mon, Jul 28, 2008 at 11:43 AM, Ralph Castain  wrote:
Per an earlier telecon, I have modified the hostfile behavior slightly
to allow hostfiles to subdivide allocations.

Briefly: given an allocation, we allow users to specify --hostfile on a
per-app_context basis.  In this mode, the hostfile info is used to
filter the nodes that will be used for that app_context.  However, the
prior implementation only filtered the nodes themselves - i.e., it was
a binary filter that allowed you to include or exclude an entire node.

The change now allows you to include a specified #slots for a given
node as opposed to -all- slots from that node.  You are limited to the
#slots included in the original allocation.  I just realized that I
hadn't output a warning if you attempt to violate this condition - will
do so shortly.  Rather than just abort if this happens, I set the
allocation to that of the original - please let me know if you would
prefer it to abort.
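
A purely hypothetical example (host names, slot counts and option
placement are invented here, so take it as a sketch of the idea rather
than exact syntax): given an allocation of 4 slots on each of n01 and
n02, two per-app_context hostfiles could split that allocation:

# hostfile-a: use only 2 of n01's 4 allocated slots
n01 slots=2

# hostfile-b: the rest of the allocation
n01 slots=2
n02 slots=4

mpirun --hostfile hostfile-a -np 2 ./app_a : --hostfile hostfile-b -np 6 ./app_b

Each app_context would then be restricted to the slots its hostfile
names, and asking for more slots than the original allocation provides
would fall back to the original allocation (with, eventually, the
warning mentioned above).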

If you have interest in this behavior, please check it out and let me
know if it meets your needs.

Ralph





--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/




Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago (Pasha: do 
you remember?) because Mellanox HCAs are capable of send-to-self 
(process) and there were no code changes necessary to enable it.  So 
it allowed a slightly simpler command line.  This was quite a while 
ago, IIRC.

Yep, correct.

FYI. In my MTT testing I also see a lot of killed tests.


Re: [OMPI devel] Change in hostfile behavior

2008-07-29 Thread Lenny Verkhovsky
For two separate runs we can use the slot_list parameter
(opal_paffinity_base_slot_list) to have paffinity:

1: mpirun -mca opal_paffinity_base_slot_list "0-1"

2: mpirun -mca opal_paffinity_base_slot_list "2-3"

On 7/28/08, Ralph Castain  wrote:
>
> Actually, this is true today regardless of this change. If two separate
> mpirun invocations share a node and attempt to use paffinity, they will
> conflict with each other. The problem isn't caused by the hostfile
> sub-allocation. The problem is that the two mpiruns have no knowledge of
> each other's actions, and hence assign node ranks to each process
> independently.
>
> Thus, we would have two procs that think they are node rank=0 and should
> therefore bind to the 0 processor, and so on up the line.
>
> Obviously, if you run within one mpirun and have two app_contexts, the
> hostfile sub-allocation is fine - mpirun will track node rank across the
> app_contexts. It is only the use of multiple mpiruns that share nodes that
> causes the problem.
>
> Several of us have discussed this problem and have a proposed solution for
> 1.4. Once we get past 1.3 (someday!), we'll bring it to the group.
>
>
> On Jul 28, 2008, at 10:44 AM, Tim Mattox wrote:
>
>> My only concern is how this will interact with PLPA.
>> Say two Open MPI jobs each use "half" the cores (slots) on a
>> particular node...  how would they be able to bind themselves to
>> a disjoint set of cores?  I'm not asking you to solve this Ralph, I'm
>> just pointing it out so we can maybe warn users that if both jobs sharing
>> a node try to use processor affinity, we don't make that magically work
>> well,
>> and that we would expect it to do quite poorly.
>>
>> I could see disabling paffinity and/or warning if it was enabled for
>> one of these "fractional" nodes.
>>
>> On Mon, Jul 28, 2008 at 11:43 AM, Ralph Castain  wrote:
>>
>>> Per an earlier telecon, I have modified the hostfile behavior slightly to
>>> allow hostfiles to subdivide allocations.
>>>
>>> Briefly: given an allocation, we allow users to specify --hostfile on a
>>> per-app_context basis. In this mode, the hostfile info is used to filter
>>> the
>>> nodes that will be used for that app_context. However, the prior
>>> implementation only filtered the nodes themselves - i.e., it was a binary
>>> filter that allowed you to include or exclude an entire node.
>>>
>>> The change now allows you to include a specified #slots for a given node
>>> as
>>> opposed to -all- slots from that node. You are limited to the #slots
>>> included in the original allocation. I just realized that I hadn't output
>>> a
>>> warning if you attempt to violate this condition - will do so shortly.
>>> Rather than just abort if this happens, I set the allocation to that of
>>> the
>>> original - please let me know if you would prefer it to abort.
>>>
>>> If you have interest in this behavior, please check it out and let me
>>> know
>>> if this meets needs.
>>>
>>> Ralph
>>>
>>>
>>>
>>
>>
>> --
>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>> tmat...@gmail.com || timat...@open-mpi.org
>> I'm a bright... http://www.the-brights.net/
>>
>
>