Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-14 Thread Jeff Squyres (jsquyres)
On Nov 14, 2014, at 10:52 AM, Reuti  wrote:

> I appreciate your replies and will read them thoroughly. I think it's best to 
> continue with the discussion after SC14. I don't want to put any burden on 
> anyone when time is tight.

Cool; many thanks.  This is complicated stuff; we might not have it exactly 
right in the 1.8.x series.  Let's figure it out in December.

Thanks for your patience!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-14 Thread Reuti
Jeff, Gus, Gilles,

Am 14.11.2014 um 15:56 schrieb Jeff Squyres (jsquyres):

> I lurked on this thread for a while, but I have some thoughts on the many 
> issues that were discussed here (sorry, I'm still pretty under 
> water trying to get ready for SC next week...).

I appreciate your replies and will read them thoroughly. I think it's best to 
continue with the discussion after SC14. I don't want to put any burden on 
anyone when time is tight.

-- Reuti


>  These points are in no particular order...
> 
> 0. Two fundamental points have been missed in this thread:
> 
>   - A hostname technically has nothing to do with the resolvable name of an 
> IP interface.  By convention, many people set the hostname to be the same as 
> some "primary" IP interface (for some definition of "primary", e.g., eth0).  
> But they are actually unrelated concepts.
> 
>   - Open MPI uses host specifications only to specify a remote server, *NOT* 
> an interface.  E.g., when you list names in a hostfile or the --host CLI 
> option, those only specify the server -- not the interface(s).  This was an 
> intentional design choice because there tends to be confusion and different 
> schools of thought about the question "What's the [resolvable] name of that 
> remote server?"  Hence, OMPI will take any old name you throw at it to 
> identify that remote server, but then we have separate controls for 
> specifying which interface(s) to use to communicate with that server.
> 
> 1. Remember that there are at least one, and possibly two, uses of TCP 
> communications in Open MPI -- and they are used differently:
> 
>   - Command/control (sometimes referred to as "oob"): used for things like 
> mpirun control messages, shuttling IO from remote processes back to mpirun, 
> etc.  Generally, unless you have a mountain of stdout/stderr from your 
> launched processes, this isn't a huge amount of traffic.
> 
>   - MPI messages: kernel-based TCP is the fallback if you don't have some 
> kind of faster off-server network -- i.e., the TCP BTL.  Like all BTLs, the 
> TCP BTL carries all MPI traffic when it is used.  How much traffic is 
> sent/received depends on your application.
> 
> 2. For OOB, I believe that the current ORTE mechanism is that it will try all 
> available IP interfaces and use the *first* one that succeeds.  Meaning: 
> after some negotiation, only one IP interface will be used to communicate 
> with a given peer.
> 
> 3. The TCP BTL will examine all local IP interfaces and determine all that 
> can be used to reach each peer according to the algorithm described here: 
> http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3.  It will use 
> *all* IP interfaces to reach a given peer in order to maximize the available 
> bandwidth.
> 
> 4. The usNIC BTL uses UDP as its wire transport, and therefore has the same 
> reachability issues as both the TCP OOB and BTL.  However, we use a different 
> mechanism than the algorithm described in the above-cited FAQ item: we simply 
> query the Linux routing table.  This can cause ARP requests, but the kernel 
> caches them (e.g., for multiple MPI procs on the same server making the 
> same/similar requests), and for a properly-segmented L3 network, each MPI 
> process will effectively end up querying about its local gateway (vs. the 
> actual peer), and therefore the chances of having that ARP already cached are 
> quite high.
> 
> --> I want to make this clear: there's nothing magic about the 
> usNIC/check-the-routing-table approach.  It's actually a very standard 
> IP/datacenter method.  With a proper routing table, you can know fairly 
> quickly whether local IP interface X can reach remote IP interface Y.
> 
> 5. The original problem cited in this thread was about the TCP OOB, not the 
> TCP BTL.  It's important to keep straight that the OOB, with no guidance from 
> the user, was trying to probe the different IP interfaces and find one that 
> would reach a peer.  Using the check-the-routing-table approach cited in #4, 
> we might be able to make this better (that's what Ralph and I are going to 
> talk about in December / post-SC / post-US Thanksgiving holiday).
> 
> 6. As a sidenote to #5, the TCP OOB and TCP BTL determine reachability in 
> different ways.  Remember that the TCP BTL has the benefit of having all the 
> ORTE infrastructure up and running.  Meaning: MPI processes can exchange IP 
> interface information and then use that information to compute which peer IP 
> interfaces can be reached.  The TCP OOB doesn't have this benefit -- it's 
> being used to establish initial connectivity.  Hence, it probes each IP 
> interface to see if it can reach a given peer.
> 
> --> We apparently need to do that probe better (vs. blocking in a serial 
> fashion, and eventually timing out on "bad" interfaces and then trying the 
> next one). 
> 
> Having a bad route or gateway listed in a server's IP setup, however, will 
> make the process take an artificially long time.

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-14 Thread Jeff Squyres (jsquyres)
I lurked on this thread for a while, but I have some thoughts on the many 
issues that were discussed here (sorry, I'm still pretty under water 
trying to get ready for SC next week...).  These points are in no particular 
order...

0. Two fundamental points have been missed in this thread:

   - A hostname technically has nothing to do with the resolvable name of an IP 
interface.  By convention, many people set the hostname to be the same as some 
"primary" IP interface (for some definition of "primary", e.g., eth0).  But 
they are actually unrelated concepts.

   - Open MPI uses host specifications only to specify a remote server, *NOT* 
an interface.  E.g., when you list names in a hostfile or the --host CLI option, 
those only specify the server -- not the interface(s).  This was an intentional 
design choice because there tends to be confusion and different schools of 
thought about the question "What's the [resolvable] name of that remote 
server?"  Hence, OMPI will take any old name you throw at it to identify that 
remote server, but then we have separate controls for specifying which 
interface(s) to use to communicate with that server.
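
As a small illustration of that separation (hostnames and interface names here 
are only examples, not a recommendation): the name given to --host or in a 
hostfile identifies the server, while the MCA parameters pick the interfaces:

   # "node01"/"node02" only name the servers; the if_include parameters
   # choose which interfaces are used to talk to them
   mpirun --host node01,node02 -np 2 \
          --mca oob_tcp_if_include eth0 \
          --mca btl_tcp_if_include eth0 \
          ./mpihello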

1. Remember that there are at least one, and possibly two, uses of TCP 
communications in Open MPI -- and they are used differently:

   - Command/control (sometimes referred to as "oob"): used for things like 
mpirun control messages, shuttling IO from remote processes back to mpirun, 
etc.  Generally, unless you have a mountain of stdout/stderr from your launched 
processes, this isn't a huge amount of traffic.

   - MPI messages: kernel-based TCP is the fallback if you don't have some kind 
of faster off-server network -- i.e., the TCP BTL.  Like all BTLs, the TCP BTL 
carries all MPI traffic when it is used.  How much traffic is sent/received 
depends on your application.

2. For OOB, I believe that the current ORTE mechanism is that it will try all 
available IP interfaces and use the *first* one that succeeds.  Meaning: after 
some negotiation, only one IP interface will be used to communicate with a 
given peer.

3. The TCP BTL will examine all local IP interfaces and determine all that can 
be used to reach each peer according to the algorithm described here: 
http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3.  It will use 
*all* IP interfaces to reach a given peer in order to maximize the available 
bandwidth.

4. The usNIC BTL uses UDP as its wire transport, and therefore has the same 
reachability issues as both the TCP OOB and BTL.  However, we use a different 
mechanism than the algorithm described in the above-cited FAQ item: we simply 
query the Linux routing table.  This can cause ARP requests, but the kernel 
caches them (e.g., for multiple MPI procs on the same server making the 
same/similar requests), and for a properly-segmented L3 network, each MPI 
process will effectively end up querying about its local gateway (vs. the 
actual peer), and therefore the chances of having that ARP already cached are 
quite high.

--> I want to make this clear: there's nothing magic about the 
usNIC/check-the-routing-table approach.  It's actually a very standard 
IP/datacenter method.  With a proper routing table, you can know fairly quickly 
whether local IP interface X can reach remote IP interface Y.
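
As a concrete (hedged) illustration of that routing-table lookup, using plain 
Linux tooling and an address taken from the logs later in this thread:

   # ask the kernel which local interface/source address it would pick for a peer
   ip route get 192.168.154.28
   # e.g.:  192.168.154.28 dev eth0  src 192.168.154.30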

5. The original problem cited in this thread was about the TCP OOB, not the TCP 
BTL.  It's important to keep straight that the OOB, with no guidance from the 
user, was trying to probe the different IP interfaces and find one that would 
reach a peer.  Using the check-the-routing-table approach cited in #4, we might 
be able to make this better (that's what Ralph and I are going to talk about in 
December / post-SC / post-US Thanksgiving holiday).

6. As a sidenote to #5, the TCP OOB and TCP BTL determine reachability in 
different ways.  Remember that the TCP BTL has the benefit of having all the 
ORTE infrastructure up and running.  Meaning: MPI processes can exchange IP 
interface information and then use that information to compute which peer IP 
interfaces can be reached.  The TCP OOB doesn't have this benefit -- it's being 
used to establish initial connectivity.  Hence, it probes each IP interface to 
see if it can reach a given peer.

--> We apparently need to do that probe better (vs. blocking in a serial 
fashion, and eventually timing out on "bad" interfaces and then trying the next 
one). 

Having a bad route or gateway listed in a server's IP setup, however, will make 
the process take an artificially long time.  This is a user error that Open MPI 
cannot compensate for.  If prior versions of OMPI tried interfaces in a 
different order that luckily worked nicely, cool.  But as Gilles mentioned, 
that was luck -- there was still a user config error that was the real 
underlying issue.
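
A sketch of what that user-side check might look like (standard Linux tooling, 
not an Open MPI command; the gateway address below is only a placeholder):

   ip route show                        # look for a default route via an unreachable gateway
   ip route del default via 192.0.2.1   # remove the bogus entry by hand, as Reuti did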

7. Someone asked: does it matter in which order you specify interfaces in 
btl_tcp_if_include?  No, it effectively does not.  Open MPI will use 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Gilles Gouaillardet
My 0.02 US$

first, the root cause of the problem was that a default gateway was
configured on the node,
but this gateway was unreachable.
imho, this is an incorrect system setting that can lead to unpredictable
results:
- openmpi 1.8.1 works (you are lucky, good for you)
- openmpi 1.8.3 fails (no luck this time, too bad)
so i believe it is incorrect to blame openmpi for this.

that being said, you raise some good points about how to improve user
friendliness for end users
who have limited skills and/or interest in OpenMPI and system
administration.

basically, i agree with Gus. HPC is complex, not every cluster is the same,
and imho some minimal config/tuning may be unavoidable to get OpenMPI
working,
or operating at full speed.


let me give a few examples:

you recommend that OpenMPI use only the interfaces that match the
hostnames in the machinefile.
what if you submit from the head node? should you use the interface
that matches the hostname?
what if this interface is the public interface, there is a firewall
and/or the compute nodes have no default gateway?
that will simply not work ...
so mpirun needs to pass orted all its interfaces.
which one should be picked by orted?
- the first one? it might be the unreachable public interface ...
- the one on the same subnet? what if none is on the same subnet?
  on the cluster i am working on, the eth0 interfaces are in different subnets, ib0 is on
a single subnet
  and i do *not* want to use ib0. but on some other clusters, the
ethernet network is so cheap
  they *want* to use ib0.

on your cluster, you want to use eth0 for oob and mpi, and eth1 for NFS.
that is legitimate.
in my case, i want to use eth0 (gigE) for oob and eth2 (10gigE) for MPI.
that is legitimate too.

we both want OpenMPI to work *and* with the best performance out of the box.
it is a good thing to have high expectations, but they might not all be met.

i'd rather implement some pre-defined policies that rule how ethernet
interfaces should be picked,
and add a FAQ that mentions: if it does not work (or does not work as
fast as expected) out of the box, you should
first try another policy.

then the next legitimate question will be "what is the default policy"?
regardless of the answer, it will be good for some and bad for others.


imho, posting a mail to the OMPI users mailing list was the right thing
to do:
- you got help on how to troubleshoot and fix the issue
- we got some valuable feedback on end users' expectations.

Cheers,

Gilles

On 2014/11/14 3:36, Gus Correa wrote:
> On 11/13/2014 11:14 AM, Ralph Castain wrote:
>> Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
>> assign different hostnames to their interfaces - I’ve seen it in the
>> Hadoop world, but not in HPC. Still, no law against it.
>
> No, not so unusual.
> I have clusters from respectable vendors that come with
> /etc/hosts for name resolution of the various interfaces.
> If I remember right, Rocks clusters also do that (or actually
> allow the sysadmin to set up additional networks and at that point
> will append /etc/hosts with the additional names, or perhaps put those
> names in DHCP).
> I am not so familiar with xCAT, but I think it has similar DHCP
> functionality, or maybe DNS on the head node.
>
> Having said that, I don't think this is an obstacle to setting up the
> right "if_include/if_exlculde" choices (along with the btl, oob, etc),
> for each particular cluster in the mca parameter configuration file.
> That is what my parallel conversation with Reuti was about.
>
> I believe the current approach w.r.t. interfaces:
> "use everythint, let the sysadmin/user restrict as
> (s)he sees fit" is both a wise and flexible way to do it.
> Guessing the "right interface to use" sounds risky to me (wrong
> choices may happen), and a bit of a cast.
>
>>
>> This will take a little thought to figure out a solution. One problem
>> that immediately occurs is if someone includes a hostfile that has lines
>> which refer to the same physical server, but using different interface
>> names. We’ll think those are completely distinct servers, and so the
>> process placement will be totally messed up.
>>
>
> Sure, and besides this, there will be machines with
> inconsistent/wrong/conflicting name resolution schemes
> that the current OMPI approach simply (and wisely) ignores.
>
>
>> We’ll also encounter issues with the daemon when it reports back, as the
>> hostname it gets will almost certainly differ from the hostname we were
>> expecting. Not as critical, but need to check to see where that will
>> impact the code base
>>
>
> I'm sure that will happen.
> Torque uses hostname by default for several things, and it can be a
> configuration nightmare to work around that when what hostname reports
> is not what you want.
>
> IMHO, you may face a daunting guesswork task to get this right,
> to pick the
> interfaces that are best for a particular computer or cluster.
> It is so much easier to let the sysadmin/user, who presumably knows
> his/her machine, write an MCA parameter config file, as it is now in OMPI.

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Gus Correa

On 11/13/2014 11:14 AM, Ralph Castain wrote:

Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
assign different hostnames to their interfaces - I’ve seen it in the
Hadoop world, but not in HPC. Still, no law against it.


No, not so unusual.
I have clusters from respectable vendors that come with
/etc/hosts for name resolution of the various interfaces.
If I remember right, Rocks clusters also do that (or actually
allow the sysadmin to set up additional networks and at that point
will append /etc/hosts with the additional names, or perhaps put those
names in DHCP).
I am not so familiar with xCAT, but I think it has similar DHCP 
functionality, or maybe DNS on the head node.


Having said that, I don't think this is an obstacle to setting up the 
right "if_include/if_exlculde" choices (along with the btl, oob, etc),

for each particular cluster in the mca parameter configuration file.
That is what my parallel conversation with Reuti was about.

I believe the current approach w.r.t. interfaces:
"use everythint, let the sysadmin/user restrict as
(s)he sees fit" is both a wise and flexible way to do it.
Guessing the "right interface to use" sounds risky to me (wrong choices 
may happen), and a bit of a cast.




This will take a little thought to figure out a solution. One problem
that immediately occurs is if someone includes a hostfile that has lines
which refer to the same physical server, but using different interface
names. We’ll think those are completely distinct servers, and so the
process placement will be totally messed up.



Sure, and besides this, there will be machines with
inconsistent/wrong/conflicting name resolution schemes
that the current OMPI approach simply (and wisely) ignores.



We’ll also encounter issues with the daemon when it reports back, as the
hostname it gets will almost certainly differ from the hostname we were
expecting. Not as critical, but need to check to see where that will
impact the code base



I'm sure that will happen.
Torque uses hostname by default for several things, and it can be a 
configuration nightmare to work around that when what hostname reports is 
not what you want.


IMHO, you may face a daunting guesswork task to get this right,
to pick the
interfaces that are best for a particular computer or cluster.
It is so much easier to let the sysadmin/user, who presumably knows 
his/her machine, write an MCA parameter config file, as it is now in OMPI.


We can look at the hostfile changes at that time - no real objection to
them, but would need to figure out how to pass that info to the
appropriate subsystems. I assume you want this to apply to both the oob
and tcp/btl?

Obviously, this won’t make it for 1.8 as it is going to be fairly
intrusive, but we can probably do something for 1.9



The status quo is good.
Long live the OMPI status quo.
(You don't know how reluctant I am to support the status quo, any status 
quo.  :) )
My vote (... well, I don't have voting rights on that, but I'll vote 
anyway ...) is to keep the current approach.
It is wise and flexible, and easy to adjust and configure for specific 
machines with their own oddities, via MCA parameters, as I tried to 
explain in previous postings.


My two cents,
Gus Correa




On Nov 13, 2014, at 4:23 AM, Reuti wrote:

Am 13.11.2014 um 00:34 schrieb Ralph Castain:


On Nov 12, 2014, at 2:45 PM, Reuti wrote:

Am 12.11.2014 um 17:27 schrieb Reuti:


Am 11.11.2014 um 02:25 schrieb Ralph Castain:


Another thing you can do is (a) ensure you built with
--enable-debug, and then (b) run it with -mca oob_base_verbose 100
 (without the tcp_if_include option) so we can watch the
connection handshake and see what it is doing. The --hetero-nodes
will have no effect here and can be ignored.


Done. It really tries to connect to the outside interface of the
headnode. But firewall or not: the nodes have no clue
how to reach 137.248.0.0 - they have no gateway to this network at all.


I have to revert this. They think that there is a gateway although
there isn't one. When I remove the entry by hand for the gateway in the
routing table it starts up instantly too.

While I can do this on my own cluster, I still have the 30-second
delay on a cluster where I'm not root, though this may be because of
the firewall there. The gateway on this cluster is indeed going to
the outside world.

Personally I find this behavior of using all interfaces a little bit
too aggressive. If you don't check this carefully beforehand and
start a long-running application, one might not even notice the delay
during startup.


Agreed - do you have any suggestions on how we should choose the
order in which to try them? I haven’t been able to come up with
anything yet. Jeff has some fancy algo in his usnic BTL that we are
going to discuss after SC that I'm hoping will help, but I'd be open to
doing something better in the interim for 1.8.4

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Ralph Castain
Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to assign 
different hostnames to their interfaces - I’ve seen it in the Hadoop world, but 
not in HPC. Still, no law against it.

This will take a little thought to figure out a solution. One problem that 
immediately occurs is if someone includes a hostfile that has lines which refer 
to the same physical server, but using different interface names. We’ll think 
those are completely distinct servers, and so the process placement will be 
totally messed up.

We’ll also encounter issues with the daemon when it reports back, as the 
hostname it gets will almost certainly differ from the hostname we were 
expecting. Not as critical, but need to check to see where that will impact the 
code base

We can look at the hostfile changes at that time - no real objection to them, 
but would need to figure out how to pass that info to the appropriate 
subsystems. I assume you want this to apply to both the oob and tcp/btl?

Obviously, this won’t make it for 1.8 as it is going to be fairly intrusive, 
but we can probably do something for 1.9

 
> On Nov 13, 2014, at 4:23 AM, Reuti  wrote:
> 
> Am 13.11.2014 um 00:34 schrieb Ralph Castain:
> 
>>> On Nov 12, 2014, at 2:45 PM, Reuti  wrote:
>>> 
>>> Am 12.11.2014 um 17:27 schrieb Reuti:
>>> 
 Am 11.11.2014 um 02:25 schrieb Ralph Castain:
 
> Another thing you can do is (a) ensure you built with --enable-debug, and 
> then (b) run it with -mca oob_base_verbose 100  (without the 
> tcp_if_include option) so we can watch the connection handshake and see 
> what it is doing. The --hetero-nodes will have no effect here and can be 
> ignored.
 
 Done. It really tries to connect to the outside interface of the headnode. 
 But firewall or not: the nodes have no clue how to reach 
 137.248.0.0 - they have no gateway to this network at all.
>>> 
>>> I have to revert this. They think that there is a gateway although there 
>>> isn't one. When I remove the entry by hand for the gateway in the routing table 
>>> it starts up instantly too.
>>> 
>>> While I can do this on my own cluster, I still have the 30-second delay on 
>>> a cluster where I'm not root, though this may be because of the firewall 
>>> there. The gateway on this cluster is indeed going to the outside world.
>>> 
>>> Personally I find this behavior of using all interfaces a little bit too 
>>> aggressive. If you don't check this carefully beforehand and start a 
>>> long-running application, one might not even notice the delay during startup.
>> 
>> Agreed - do you have any suggestions on how we should choose the order in 
>> which to try them? I haven’t been able to come up with anything yet. Jeff 
>> has some fancy algo in his usnic BTL that we are going to discuss after SC 
>> that I’m hoping will help, but I’d be open to doing something better in the 
>> interim for 1.8.4
> 
> The plain `mpiexec` should just use the specified interface it finds in the 
> hostfile, be it hand-crafted or prepared by any queuing system.
> 
> 
> Option: could a single entry for a machine in the hostfile contain a list of 
> interfaces? I mean something like:
> 
> node01,node01-extra-eth1,node01-extra-eth2 slots=4
> 
> or
> 
> node01* slots=4
> 
> Means: use exactly these interfaces or even try to find all available 
> interfaces on/between the machines.
> 
> In case all interfaces have the same name, then it's up to the admin to 
> correct this.
> 
> -- Reuti
> 
> 
>>> -- Reuti
>>> 
>>> 
 It tries this regardless of whether the internal or external name of the headnode 
 is given in the machinefile - I hit ^C then. I attached the output of Open 
 MPI 1.8.1 for this setup too.
 
 -- Reuti
 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Am 13.11.2014 um 00:34 schrieb Ralph Castain:

>> On Nov 12, 2014, at 2:45 PM, Reuti  wrote:
>> 
>> Am 12.11.2014 um 17:27 schrieb Reuti:
>> 
>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>> 
 Another thing you can do is (a) ensure you built with --enable-debug, and 
 then (b) run it with -mca oob_base_verbose 100  (without the 
 tcp_if_include option) so we can watch the connection handshake and see 
 what it is doing. The --hetero-nodes will have no effect here and can be 
 ignored.
>>> 
>>> Done. It really tries to connect to the outside interface of the headnode. 
>>> But firewall or not: the nodes have no clue how to reach 
>>> 137.248.0.0 - they have no gateway to this network at all.
>> 
>> I have to revert this. They think that there is a gateway although there 
>> isn't one. When I remove the entry by hand for the gateway in the routing table it 
>> starts up instantly too.
>> 
>> While I can do this on my own cluster, I still have the 30-second delay on a 
>> cluster where I'm not root, though this may be because of the firewall there. 
>> The gateway on this cluster is indeed going to the outside world.
>> 
>> Personally I find this behavior of using all interfaces a little bit too 
>> aggressive. If you don't check this carefully beforehand and start a 
>> long-running application, one might not even notice the delay during startup.
> 
> Agreed - do you have any suggestions on how we should choose the order in 
> which to try them? I haven’t been able to come up with anything yet. Jeff has 
> some fancy algo in his usnic BTL that we are going to discuss after SC that 
> I’m hoping will help, but I’d be open to doing something better in the 
> interim for 1.8.4

The plain `mpiexec` should just use the specified interface it finds in the 
hostfile, be it hand-crafted or prepared by any queuing system.


Option: could a single entry for a machine in the hostfile contain a list of 
interfaces? I mean something like:

node01,node01-extra-eth1,node01-extra-eth2 slots=4

or

node01* slots=4

Means: use exactly these interfaces or even try to find all available 
interfaces on/between the machines.

In case all interfaces have the same name, then it's up to the admin to correct 
this.

-- Reuti


>> -- Reuti
>> 
>> 
>>> It tries this regardless of whether the internal or external name of the headnode 
>>> is given in the machinefile - I hit ^C then. I attached the output of Open MPI 
>>> 1.8.1 for this setup too.
>>> 
>>> -- Reuti
>>> 



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Gus,

Am 13.11.2014 um 02:59 schrieb Gus Correa:

> On 11/12/2014 05:45 PM, Reuti wrote:
>> Am 12.11.2014 um 17:27 schrieb Reuti:
>> 
>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>> 
 Another thing you can do is (a) ensure you built with --enable-debug,
>> and then (b) run it with -mca oob_base_verbose 100
>> (without the tcp_if_include option) so we can watch
>> the connection handshake and see what it is doing.
>> The --hetero-nodes will have no effect here and can be ignored.
>>> 
>>> Done. It really tries to connect to the outside
>> interface of the headnode. But firewall or not:
>> the nodes have no clue how to reach 137.248.0.0 -
>> they have no gateway to this network at all.
>> 
>> I have to revert this.
>> They think that there is a gateway although there isn't one.
>> When I remove the entry by hand for the gateway in the
>> routing table it starts up instantly too.
>> 
>> While I can do this on my own cluster, I still have the
>> 30-second delay on a cluster where I'm not root,
>> though this may be because of the firewall there.
>> The gateway on this cluster is indeed going
>> to the outside world.
>> 
>> Personally I find this behavior of using all interfaces
>> a little bit too aggressive. If you don't check this carefully
>> beforehand and start a long-running application, one might
>> not even notice the delay during startup.
>> 
>> -- Reuti
>> 
> 
> Hi Reuti
> 
> You could use the mca parameter file
> (say, $prefix/etc/openmpi-mca-params.conf) to configure cluster-wide
> the oob (and btl) interfaces to be used.
> The users can still override your choices if they want.
> 
> Just put a line like this in openmpi-mca-params.conf :
> oob_tcp_if_include=192.168.154.0/26
> 
> (and similar for btl_tcp_if_include, btl_openib_if_include).
> 
> Get a full list from "ompi_info --all --all |grep if_include".
> 
> See these FAQ:
> 
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
> 
> Compute nodes tend to be multi-homed, so what criterion would OMPI use
> to select one interface among many,

My compute nodes have two interfaces: one for MPI (and the low-volume ssh/SGE 
traffic to start processes somewhere) and one for NFS to transfer files from/to 
the file server. So: Open MPI may use both interfaces without telling me 
anything about it? How will it split the traffic? 50%/50%? When there is a 
heavy file transfer on the NFS interface: might it hurt Open MPI's 
communication or will it balance the usage on-the-fly?

When I prepare a machinefile with the names of the interfaces (or get the names 
from SGE's PE_HOSTFILE) it should use just this (except native IB), and not 
look around for other paths to the other machine(s) (IMO). Therefore 
different interfaces have different names in my setup. "node01" is just eth0 
and different from "node01-nfs" for eth1.
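
A sketch of that naming scheme in /etc/hosts (the addresses are made up to match 
the subnets mentioned later in this thread):

   192.168.154.28   node01        # eth0: MPI plus the low-volume ssh/SGE traffic
   192.168.154.92   node01-nfs    # eth1: NFS traffic to the file server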


> not knowing beforehand what exists in a particular computer?
> There would be a risk to make a bad choice.
> The current approach gives you everything, and you
> pick/select/restrict what you want to fit your needs,
> with mca parameters (which can be set in several
> ways and with various scopes).
> 
> I don't think this is bad.
> However, I am biased about this.
> I like and use the openmpi-mca-params.conf file
> to setup sensible defaults.
> At least I think they are sensible. :)

I see that this can be prepared for all users this way. Whenever they use my 
installed version it will work - maybe I'll have to investigate what to enter 
there on some other clusters where I'm not root, but it can be done for sure.

BUT: it may be a rare situation that a group for quantum chemistry has a 
sysadmin of their own taking care of the clusters and the smooth 
operation of the installed software, be it applications or libraries. Often 
some PhD student in another group will get a side project: please install 
software XY for the group. They are chemists and want to get the software 
running - they are not experts in Open MPI*. They don't care about a tight 
integration or using the correct interfaces as long as the application delivers 
the results in the end. For example: ORCA**. The users of the software need to 
install a shared library build of Open MPI in a specific version. I 
see in the ORCA*** forum that many struggle to compile a shared-library 
version of Open MPI and to have access to it during execution, i.e. how to set 
LD_LIBRARY_PATH so that it's known on the slaves. The cluster admins are in 
another department and sometimes refuse to make any special arrangements for a 
single group. And as ORCA calls `mpiexec` several times during one job, the 
delay could occur several times.

On some other clusters that we have access to, the admins prepare Open MPI 
installations accessible by `modules`. But often not for the required 
combination of Open MPI and compiler type and version which is needed. If a 
software vendor suggests to 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Gus Correa

On 11/12/2014 05:45 PM, Reuti wrote:

Am 12.11.2014 um 17:27 schrieb Reuti:


Am 11.11.2014 um 02:25 schrieb Ralph Castain:


Another thing you can do is (a) ensure you built with --enable-debug,

and then (b) run it with -mca oob_base_verbose 100
(without the tcp_if_include option) so we can watch
the connection handshake and see what it is doing.
The --hetero-nodes will have no effect here and can be ignored.


Done. It really tries to connect to the outside

interface of the headnode. But firewall or not:
the nodes have no clue how to reach 137.248.0.0 -
they have no gateway to this network at all.

I have to revert this.
They think that there is a gateway although there isn't one.
When I remove the entry by hand for the gateway in the
routing table it starts up instantly too.

While I can do this on my own cluster, I still have the
30-second delay on a cluster where I'm not root,
though this may be because of the firewall there.
The gateway on this cluster is indeed going
to the outside world.

Personally I find this behavior of using all interfaces
a little bit too aggressive. If you don't check this carefully
beforehand and start a long-running application, one might
not even notice the delay during startup.

-- Reuti



Hi Reuti

You could use the mca parameter file
(say, $prefix/etc/openmpi-mca-params.conf) to configure cluster-wide
the oob (and btl) interfaces to be used.
The users can still override your choices if they want.

Just put a line like this in openmpi-mca-params.conf :
oob_tcp_if_include=192.168.154.0/26

(and similar for btl_tcp_if_include, btl_openib_if_include).

Get a full list from "ompi_info --all --all |grep if_include".

See these FAQ:

http://www.open-mpi.org/faq/?category=tcp#tcp-selection
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
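
Putting that together, a hedged sketch of such a file (the subnet is the one 
from this thread and would of course differ per site):

   # $prefix/etc/openmpi-mca-params.conf -- cluster-wide defaults,
   # still overridable per user or per mpirun invocation
   oob_tcp_if_include=192.168.154.0/26
   btl_tcp_if_include=192.168.154.0/26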

Compute nodes tend to be multi-homed, so what criterion would OMPI use
to select one interface among many,
not knowing beforehand what exists in a particular computer?
There would be a risk to make a bad choice.
The current approach gives you everything, and you
pick/select/restrict what you want to fit your needs,
with mca parameters (which can be set in several
ways and with various scopes).

I don't think this is bad.
However, I am biased about this.
I like and use the openmpi-mca-params.conf file
to setup sensible defaults.
At least I think they are sensible. :)

Cheers,
Gus Correa




It tries this regardless of whether the internal or external name of the headnode

is given in the machinefile - I hit ^C then.
I attached the output of Open MPI 1.8.1 for this setup too.


-- Reuti






Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Ralph Castain

> On Nov 12, 2014, at 2:45 PM, Reuti  wrote:
> 
> Am 12.11.2014 um 17:27 schrieb Reuti:
> 
>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>> 
>>> Another thing you can do is (a) ensure you built with --enable-debug, and 
>>> then (b) run it with -mca oob_base_verbose 100  (without the tcp_if_include 
>>> option) so we can watch the connection handshake and see what it is doing. 
>>> The --hetero-nodes will have no effect here and can be ignored.
>> 
>> Done. It really tries to connect to the outside interface of the headnode. 
>> But firewall or not: the nodes have no clue how to reach 
>> 137.248.0.0 - they have no gateway to this network at all.
> 
> I have to revert this. They think that there is a gateway although there 
> isn't one. When I remove the entry by hand for the gateway in the routing table it 
> starts up instantly too.
> 
> While I can do this on my own cluster, I still have the 30-second delay on a 
> cluster where I'm not root, though this may be because of the firewall there. 
> The gateway on this cluster is indeed going to the outside world.
> 
> Personally I find this behavior of using all interfaces a little bit too 
> aggressive. If you don't check this carefully beforehand and start a 
> long-running application, one might not even notice the delay during startup.

Agreed - do you have any suggestions on how we should choose the order in which 
to try them? I haven’t been able to come up with anything yet. Jeff has some 
fancy algo in his usnic BTL that we are going to discuss after SC that I’m 
hoping will help, but I’d be open to doing something better in the interim for 
1.8.4

> 
> -- Reuti
> 
> 
>> It tries this regardless of whether the internal or external name of the headnode 
>> is given in the machinefile - I hit ^C then. I attached the output of Open MPI 
>> 1.8.1 for this setup too.
>> 
>> -- Reuti
>> 



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Reuti
Am 12.11.2014 um 17:27 schrieb Reuti:

> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
> 
>> Another thing you can do is (a) ensure you built with --enable-debug, and 
>> then (b) run it with -mca oob_base_verbose 100  (without the tcp_if_include 
>> option) so we can watch the connection handshake and see what it is doing. 
>> The --hetero-nodes will have no effect here and can be ignored.
> 
> Done. It really tries to connect to the outside interface of the headnode. 
> But firewall or not: the nodes have no clue how to reach 
> 137.248.0.0 - they have no gateway to this network at all.

I have to revert this. They think that there is a gateway although there isn't 
one. When I remove the entry by hand for the gateway in the routing table it starts 
up instantly too.

While I can do this on my own cluster, I still have the 30-second delay on a 
cluster where I'm not root, though this may be because of the firewall there. 
The gateway on this cluster is indeed going to the outside world.

Personally I find this behavior of using all interfaces a little bit too 
aggressive. If you don't check this carefully beforehand and start a 
long-running application, one might not even notice the delay during startup.

-- Reuti


> It tries this regardless of whether the internal or external name of the headnode 
> is given in the machinefile - I hit ^C then. I attached the output of Open MPI 
> 1.8.1 for this setup too.
> 
> -- Reuti
> 



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Reuti
Am 11.11.2014 um 02:25 schrieb Ralph Castain:

> Another thing you can do is (a) ensure you built with --enable-debug, and then 
> (b) run it with -mca oob_base_verbose 100  (without the tcp_if_include 
> option) so we can watch the connection handshake and see what it is doing. 
> The --hetero-nodes will have no effect here and can be ignored.

Done. It really tries to connect to the outside interface of the headnode. But 
firewall or not: the nodes have no clue how to reach 137.248.0.0 
- they have no gateway to this network at all.

It tries this regardless of whether the internal or external name of the headnode 
is given in the machinefile - I hit ^C then. I attached the output of Open MPI 
1.8.1 for this setup too.

-- Reuti

Wed Nov 12 16:43:12 CET 2014
[annemarie:01246] mca: base: components_register: registering oob components
[annemarie:01246] mca: base: components_register: found loaded component tcp
[annemarie:01246] mca: base: components_register: component tcp register 
function successful
[annemarie:01246] mca: base: components_open: opening oob components
[annemarie:01246] mca: base: components_open: found loaded component tcp
[annemarie:01246] mca: base: components_open: component tcp open function 
successful
[annemarie:01246] mca:oob:select: checking available component tcp
[annemarie:01246] mca:oob:select: Querying component [tcp]
[annemarie:01246] oob:tcp: component_available called
[annemarie:01246] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init rejecting loopback interface lo
[annemarie:01246] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init adding 137.248.x.y to our list of 
V4 connections
[annemarie:01246] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init adding 192.168.154.30 to our list 
of V4 connections
[annemarie:01246] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init adding 192.168.154.187 to our list 
of V4 connections
[annemarie:01246] [[37241,0],0] TCP STARTUP
[annemarie:01246] [[37241,0],0] attempting to bind to IPv4 port 0
[annemarie:01246] [[37241,0],0] assigned IPv4 port 53661
[annemarie:01246] mca:oob:select: Adding component to end
[annemarie:01246] mca:oob:select: Found 1 active transports
[node28:05663] mca: base: components_register: registering oob components
[node28:05663] mca: base: components_register: found loaded component tcp
[node28:05663] mca: base: components_register: component tcp register function 
successful
[node28:05663] mca: base: components_open: opening oob components
[node28:05663] mca: base: components_open: found loaded component tcp
[node28:05663] mca: base: components_open: component tcp open function 
successful
[node28:05663] mca:oob:select: checking available component tcp
[node28:05663] mca:oob:select: Querying component [tcp]
[node28:05663] oob:tcp: component_available called
[node28:05663] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[node28:05663] [[37241,0],1] oob:tcp:init rejecting loopback interface lo
[node28:05663] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[node28:05663] [[37241,0],1] oob:tcp:init adding 192.168.154.28 to our list of 
V4 connections
[node28:05663] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[node28:05663] [[37241,0],1] oob:tcp:init adding 192.168.154.98 to our list of 
V4 connections
[node28:05663] [[37241,0],1] TCP STARTUP
[node28:05663] [[37241,0],1] attempting to bind to IPv4 port 0
[node28:05663] [[37241,0],1] assigned IPv4 port 45802
[node28:05663] mca:oob:select: Adding component to end
[node28:05663] mca:oob:select: Found 1 active transports
[node28:05663] [[37241,0],1]: set_addr to uri 
2440626176.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:53661
[node28:05663] [[37241,0],1]:set_addr checking if peer [[37241,0],0] is 
reachable via component tcp
[node28:05663] [[37241,0],1] oob:tcp: working peer [[37241,0],0] address 
tcp://137.248.x.y,192.168.154.30,192.168.154.187:53661
[node28:05663] [[37241,0],1] PASSING ADDR 137.248.x.y TO MODULE
[node28:05663] [[37241,0],1]:tcp set addr for peer [[37241,0],0]
[node28:05663] [[37241,0],1] PASSING ADDR 192.168.154.30 TO MODULE
[node28:05663] [[37241,0],1]:tcp set addr for peer [[37241,0],0]
[node28:05663] [[37241,0],1] PASSING ADDR 192.168.154.187 TO MODULE
[node28:05663] [[37241,0],1]:tcp set addr for peer [[37241,0],0]
[node28:05663] [[37241,0],1]: peer [[37241,0],0] is reachable via component tcp
[node28:05663] [[37241,0],1] OOB_SEND: rml_oob_send.c:199
[node28:05663] [[37241,0],1]:tcp:processing set_peer cmd
[node28:05663] [[37241,0],1] SET_PEER ADDING PEER [[37241,0],0]
[node28:05663] [[37241,0],1] set_peer: peer [[37241,0],0] is listening on net 
137.248.x.y port 53661
[node28:05663] [[37241,0],1]:tcp:processing set_peer cmd
[node28:05663] [[37241,0],1] set_peer: peer [[37241,0],0] is listening on net 
192.168.154.30 port 53661
[node28:05663] [[37241,0],1]:tcp:processing 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Reuti

Am 11.11.2014 um 02:12 schrieb Gilles Gouaillardet:

> Hi,
> 
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use 
> all the published interfaces.
> 
> by any chance, are you running a firewall on your head node?

Yes, but only for the interface to the outside world. Nevertheless I switched 
it off and the result was the same 2-minute delay during startup.


> one possible explanation is the compute node tries to access the public 
> interface of the head node, and packets get dropped by the firewall.
> 
> if you are running a firewall, can you make a test without it ?
> /* if you do need NAT, then just remove the DROP and REJECT rules */
> 
> another possible explanation is the compute node is doing (reverse) dns 
> requests with the public name and/or ip of the head node and that takes some 
> time to complete (success or failure, this does not really matter here)

In the machinefile I tried both the internal and the external name of the headnode, 
i.e. different names for different interfaces. The result is the same.


> /* a simple test is to make sure all the hosts/ip of the head node are in the 
> /etc/hosts of the compute node */
> 
> could you check your network config (firewall and dns) ?
> 
> can you reproduce the delay when running mpirun on the head node and with one 
> mpi task on the compute node ?

You mean one on the head node and one on the compute node, as opposed to two + two 
in my initial test?

Sure, but with 1+1 I get the same result.


> if yes, then the hard way to trace the delay issue would be to strace -ttt 
> both orted and mpi task that are launched on the compute node and see where 
> the time is lost.
> /* at this stage, i would suspect orted ... */

As the `ssh` on the headnode hangs for a while, I suspect it's something on the 
compute node. I see there during the startup:

orted -mca ess env -mca orte_ess_jobid 2412773376 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 2 -mca orte_hnp_uri 
2412773376.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:58782 
--tree-spawn -mca plm rsh

===
Only the subnet 192.168.154.0/26 (yes, 26) is used to access the nodes from the 
master i.e. login machine. As an additional information: the nodes have two 
network interfaces: one in 192.168.154.0/26 and one in 192.168.154.64/26 to 
reach a file server.
===


Falling back to 1.8.1 I see:

bash -c  orted -mca ess env -mca orte_ess_jobid 3182034944 -mca orte_ess_vpid 1 
-mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"3182034944.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:54436" 
--tree-spawn -mca plm rsh -mca hwloc_base_binding_policy none

So, the bash was removed. But I don't think that this causes anything.

-- Reuti


> Cheers,
> 
> Gilles
> 
> On Mon, Nov 10, 2014 at 5:56 PM, Reuti  wrote:
> Hi,
> 
> Am 10.11.2014 um 16:39 schrieb Ralph Castain:
> 
> > That is indeed bizarre - we haven’t heard of anything similar from other 
> > users. What is your network configuration? If you use oob_tcp_if_include or 
> > exclude, can you resolve the problem?
> 
> Thx - this option helped to get it working.
> 
> These tests were made for the sake of simplicity between the headnode of the 
> cluster and one (idle) compute node. I tried then between the (identical) 
> compute nodes and this worked fine. The headnode of the cluster and the 
> compute node are slightly different though (e.g. number of cores), and use 
> eth1 and eth0, respectively, for the internal network of the cluster.
> 
> I tried --hetero-nodes with no change.
> 
> Then I turned to:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
> 192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date
> 
> and the application started instantly. On another cluster, where the headnode 
> is identical to the compute nodes but with the same network setup as above, I 
> observed a delay of "only" 30 seconds. Nevertheless, also on this cluster the 
> working addition was the correct "oob_tcp_if_include" to solve the issue.
> 
> The questions which remain: a) is this intended behavior, b) what changed 
> in this scope between 1.8.1 and 1.8.2?
> 
> -- Reuti
> 
> 
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
> >>
> >> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
> >>> BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Ralph Castain
Another thing you can do is (a) ensure you built with --enable-debug, and then 
(b) run it with -mca oob_base_verbose 100 (without the tcp_if_include option) 
so we can watch the connection handshake and see what it is doing. The 
--hetero-nodes will have no effect here and can be ignored.
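
A minimal sketch of that debug run (the hostfile and test program names are the 
ones used elsewhere in this thread):

   # (a) rebuild Open MPI with debug support
   ./configure --enable-debug ... && make && make install
   # (b) rerun with verbose OOB output, without any tcp_if_include setting
   mpiexec -mca oob_base_verbose 100 -n 4 --hostfile machines ./mpihello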

Ralph

> On Nov 10, 2014, at 5:12 PM, Gilles Gouaillardet wrote:
> 
> Hi,
> 
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use 
> all the published interfaces.
> 
> by any chance, are you running a firewall on your head node?
> one possible explanation is the compute node tries to access the public 
> interface of the head node, and packets get dropped by the firewall.
> 
> if you are running a firewall, can you make a test without it ?
> /* if you do need NAT, then just remove the DROP and REJECT rules */
> 
> another possible explanation is the compute node is doing (reverse) dns 
> requests with the public name and/or ip of the head node and that takes some 
> time to complete (success or failure, this does not really matter here)
> 
> /* a simple test is to make sure all the hosts/ip of the head node are in the 
> /etc/hosts of the compute node */
> 
> could you check your network config (firewall and dns) ?
> 
> can you reproduce the delay when running mpirun on the head node and with one 
> mpi task on the compute node ?
> 
> if yes, then the hard way to trace the delay issue would be to strace -ttt 
> both orted and mpi task that are launched on the compute node and see where 
> the time is lost.
> /* at this stage, i would suspect orted ... */
> 
> Cheers,
> 
> Gilles
> 
> On Mon, Nov 10, 2014 at 5:56 PM, Reuti wrote:
> Hi,
> 
> Am 10.11.2014 um 16:39 schrieb Ralph Castain:
> 
> > That is indeed bizarre - we haven’t heard of anything similar from other 
> > users. What is your network configuration? If you use oob_tcp_if_include or 
> > exclude, can you resolve the problem?
> 
> Thx - this option helped to get it working.
> 
> These tests were made for the sake of simplicity between the headnode of the 
> cluster and one (idle) compute node. I tried then between the (identical) 
> compute nodes and this worked fine. The headnode of the cluster and the 
> compute node are slightly different though (e.g. number of cores), and use 
> eth1 and eth0, respectively, for the internal network of the cluster.
> 
> I tried --hetero-nodes with no change.
> 
> Then I turned to:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
> 192.168.154.0/26  -n 4 --hetero-nodes --hostfile 
> machines ./mpihello; date
> 
> and the application started instantly. On another cluster, where the headnode 
> is identical to the compute nodes but with the same network setup as above, I 
> observed a delay of "only" 30 seconds. Nevertheless, also on this cluster the 
> working addition was the correct "oob_tcp_if_include" to solve the issue.
> 
> The questions which remain: a) is this intended behavior, b) what changed 
> in this scope between 1.8.1 and 1.8.2?
> 
> -- Reuti
> 
> 
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti wrote:
> >>
> >> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
> >>> BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
> >>>
>  On Nov 10, 2014, at 6:42 AM, Reuti wrote:
> 
> > Am 10.11.2014 um 12:24 schrieb Reuti:
> >
> > Hi,
> >
> >> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
> >>
> >> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
> >> Each process receives a complete map of that info for every process in 
> >> the job. So when the TCP btl sets itself up, it attempts to connect 
> >> across -all- the interfaces published by the other end.

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Gilles Gouaillardet
Hi,

IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really
use all the published interfaces.

by any chance, are you running a firewall on your head node?
one possible explanation is the compute node tries to access the public
interface of the head node, and packets get dropped by the firewall.

if you are running a firewall, can you make a test without it ?
/* if you do need NAT, then just remove the DROP and REJECT rules */

another possible explanation is the compute node is doing (reverse) dns
requests with the public name and/or ip of the head node and that takes
some time to complete (success or failure, this does not really matter here)

/* a simple test is to make sure all the hosts/ip of the head node are in
the /etc/hosts of the compute node */

could you check your network config (firewall and dns) ?

can you reproduce the delay when running mpirun on the head node and with
one mpi task on the compute node ?

if yes, then the hard way to trace the delay would be to strace -ttt both the
orted and the MPI task that are launched on the compute node and see where
the time is lost, for instance along the lines sketched below.
/* at this stage, i would suspect orted ... */
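
A minimal sketch of that (assuming strace and pgrep are available on the compute node):

# on the compute node, right after mpiexec has been started on the head node:
strace -ttt -f -p $(pgrep -n orted) -o /tmp/orted.strace
# afterwards look for large jumps between consecutive timestamps:
less /tmp/orted.strace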

Cheers,

Gilles

On Mon, Nov 10, 2014 at 5:56 PM, Reuti  wrote:

> Hi,
>
> On 10.11.2014 at 16:39, Ralph Castain wrote:
>
> > That is indeed bizarre - we haven’t heard of anything similar from other
> users. What is your network configuration? If you use oob_tcp_if_include or
> exclude, can you resolve the problem?
>
> Thx - this option helped to get it working.
>
> These tests were made for sake of simplicity between the headnode of the
> cluster and one (idle) compute node. I tried then between the (identical)
> compute nodes and this worked fine. The headnode of the cluster and the
> compute node are slightly different though (i.e. number of cores), and
> using eth1 resp. eth0 for the internal network of the cluster.
>
> I tried --hetero-nodes with no change.
>
> Then I turned to:
>
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca
> oob_tcp_if_include 192.168.154.0/26 -n 4 --hetero-nodes --hostfile
> machines ./mpihello; date
>
> and the application started instantly. On another cluster, where the
> headnode is identical to the compute nodes but with the same network setup
> as above, I observed a delay of "only" 30 seconds. Nevertheless, also on
> this cluster the working addition was the correct "oob_tcp_if_include" to
> solve the issue.
>
> The questions which remain: a) is this a targeted behavior, b) what
> changed in this scope between 1.8.1 and 1.8.2?
>
> -- Reuti
>
>
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
> >>
> >> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use
> certain BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile
> machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile
> machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
> >>>
>  On Nov 10, 2014, at 6:42 AM, Reuti 
> wrote:
> 
> > On 10.11.2014 at 12:24, Reuti wrote:
> >
> > Hi,
> >
> >> On 09.11.2014 at 05:38, Ralph Castain wrote:
> >>
> >> FWIW: during MPI_Init, each process “publishes” all of its
> interfaces. Each process receives a complete map of that info for every
> process in the job. So when the TCP btl sets itself up, it attempts to
> connect across -all- the interfaces published by the other end.
> >>
> >> So it doesn’t matter what hostname is provided by the RM. We
> discover and “share” all of the interface info for every node, and then use
> them for loadbalancing.
> >
> > does this lead to any time delay when starting up? I stayed with
> Open MPI 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there
> is a delay when the applications starts in my first compilation of 1.8.3 I
> disregarded even all my extra options and run it outside of any
> queuingsystem - the delay remains - on two different clusters.
> 
>  I forgot to mention: the delay is more or less exactly 2 minutes from the 
>  time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>  for the initial `ssh` to reach the other node though).

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Hi,

On 10.11.2014 at 16:39, Ralph Castain wrote:

> That is indeed bizarre - we haven’t heard of anything similar from other 
> users. What is your network configuration? If you use oob_tcp_if_include or 
> exclude, can you resolve the problem?

Thx - this option helped to get it working.

These tests were made, for the sake of simplicity, between the headnode of the 
cluster and one (idle) compute node. I then tried it between the (identical) 
compute nodes and this worked fine. The headnode of the cluster and the compute 
node are slightly different though (e.g. number of cores), and they use eth1 and 
eth0, respectively, for the internal network of the cluster.

I tried --hetero-nodes with no change.

Then I turned to:

reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date

and the application started instantly. On another cluster, where the headnode 
is identical to the compute nodes but which has the same network setup as above, 
I observed a delay of "only" 30 seconds. Nevertheless, on this cluster too the 
working addition was the correct "oob_tcp_if_include" setting that solved the issue.
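
In case this turns out to be the permanent fix, the same setting can be kept out
of the command line by putting it into Open MPI's per-user parameter file (a
sketch; the subnet is the one from the command above):

# $HOME/.openmpi/mca-params.conf
oob_tcp_if_include = 192.168.154.0/26
btl_tcp_if_include = 192.168.154.0/26   # optional: restrict the MPI traffic the same way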

The questions which remain: a) is this intended behavior, and b) what changed in 
this regard between 1.8.1 and 1.8.2?

-- Reuti


> 
>> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
>> 
>> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
>> 
>>> Wow, that's pretty terrible!  :(
>>> 
>>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
>>> BTLs, does the delay disappear?
>> 
>> You mean something like:
>> 
>> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
>> ./mpihello; date
>> Mon Nov 10 13:44:34 CET 2014
>> Hello World from Node 1.
>> Total: 4
>> Universe: 4
>> Hello World from Node 0.
>> Hello World from Node 3.
>> Hello World from Node 2.
>> Mon Nov 10 13:46:42 CET 2014
>> 
>> (the above was even the latest v1.8.3-186-g978f61d)
>> 
>> Falling back to 1.8.1 gives (as expected):
>> 
>> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
>> ./mpihello; date
>> Mon Nov 10 13:49:51 CET 2014
>> Hello World from Node 1.
>> Total: 4
>> Universe: 4
>> Hello World from Node 0.
>> Hello World from Node 2.
>> Hello World from Node 3.
>> Mon Nov 10 13:49:53 CET 2014
>> 
>> 
>> -- Reuti
>> 
>>> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
>>> 
>>> Sent from my phone. No type good. 
>>> 
 On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
 
> On 10.11.2014 at 12:24, Reuti wrote:
> 
> Hi,
> 
>> On 09.11.2014 at 05:38, Ralph Castain wrote:
>> 
>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
>> Each process receives a complete map of that info for every process in 
>> the job. So when the TCP btl sets itself up, it attempts to connect 
>> across -all- the interfaces published by the other end.
>> 
>> So it doesn’t matter what hostname is provided by the RM. We discover 
>> and “share” all of the interface info for every node, and then use them 
>> for loadbalancing.
> 
> does this lead to any time delay when starting up? I stayed with Open MPI 
> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
> delay when the applications starts in my first compilation of 1.8.3 I 
> disregarded even all my extra options and run it outside of any 
> queuingsystem - the delay remains - on two different clusters.
 
 I forgot to mention: the delay is more or less exactly 2 minutes from the 
 time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
 for the initial `ssh` to reach the other node though).
 
 -- Reuti
 
 
> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
> creates this delay when starting up a simple mpihello. I assume it may 
> lay in the way how to reach other machines, as with one single machine 
> there is no delay. But using one (and only one - no tree spawn involved) 
> additional machine already triggers this delay.
> 
> Did anyone else notice it?
> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>>> 
>>> Ok I figured, i'm going to have to read some more for my own curiosity. 
>>> The reason I mention the Resource Manager we use, and that the 
>>> hostnames given but PBS/Torque match the 1gig-e interfaces, i'm curious 
>>> what path it would take to get to a peer node when the node list given 
>>> all match the 1gig interfaces but yet data is being sent out the 10gig 
>>> eoib0/ib0 interfaces.  
>>> 
>>> I'll go do some measurements and see.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Ralph Castain
That is indeed bizarre - we haven’t heard of anything similar from other users. 
What is your network configuration? If you use oob_tcp_if_include or exclude, 
can you resolve the problem?


> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
> 
> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> 
>> Wow, that's pretty terrible!  :(
>> 
>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
>> BTLs, does the delay disappear?
> 
> You mean something like:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
> ./mpihello; date
> Mon Nov 10 13:44:34 CET 2014
> Hello World from Node 1.
> Total: 4
> Universe: 4
> Hello World from Node 0.
> Hello World from Node 3.
> Hello World from Node 2.
> Mon Nov 10 13:46:42 CET 2014
> 
> (the above was even the latest v1.8.3-186-g978f61d)
> 
> Falling back to 1.8.1 gives (as expected):
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
> ./mpihello; date
> Mon Nov 10 13:49:51 CET 2014
> Hello World from Node 1.
> Total: 4
> Universe: 4
> Hello World from Node 0.
> Hello World from Node 2.
> Hello World from Node 3.
> Mon Nov 10 13:49:53 CET 2014
> 
> 
> -- Reuti
> 
>> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
>> 
>> Sent from my phone. No type good. 
>> 
>>> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
>>> 
 On 10.11.2014 at 12:24, Reuti wrote:
 
 Hi,
 
> On 09.11.2014 at 05:38, Ralph Castain wrote:
> 
> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
> Each process receives a complete map of that info for every process in 
> the job. So when the TCP btl sets itself up, it attempts to connect 
> across -all- the interfaces published by the other end.
> 
> So it doesn’t matter what hostname is provided by the RM. We discover and 
> “share” all of the interface info for every node, and then use them for 
> loadbalancing.
 
 does this lead to any time delay when starting up? I stayed with Open MPI 
 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
 delay when the applications starts in my first compilation of 1.8.3 I 
 disregarded even all my extra options and run it outside of any 
 queuingsystem - the delay remains - on two different clusters.
>>> 
>>> I forgot to mention: the delay is more or less exactly 2 minutes from the 
>>> time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>>> for the initial `ssh` to reach the other node though).
>>> 
>>> -- Reuti
>>> 
>>> 
 I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
 creates this delay when starting up a simple mpihello. I assume it may lay 
 in the way how to reach other machines, as with one single machine there 
 is no delay. But using one (and only one - no tree spawn involved) 
 additional machine already triggers this delay.
 
 Did anyone else notice it?
 
 -- Reuti
 
 
> HTH
> Ralph
> 
> 
>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>> 
>> Ok I figured, i'm going to have to read some more for my own curiosity. 
>> The reason I mention the Resource Manager we use, and that the hostnames 
>> given but PBS/Torque match the 1gig-e interfaces, i'm curious what path 
>> it would take to get to a peer node when the node list given all match 
>> the 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
>> interfaces.  
>> 
>> I'll go do some measurements and see.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres) 
>>>  wrote:
>>> 
>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
>>> default.  
>>> 
>>> This short FAQ has links to 2 other FAQs that provide detailed 
>>> information about reachability:
>>> 
>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>> 
>>> The usNIC BTL uses UDP for its wire transport and actually does a much 
>>> more standards-conformant peer reachability determination (i.e., it 
>>> actually checks routing tables to see if it can reach a given peer 
>>> which has all kinds of caching benefits, kernel controls if you want 
>>> them, etc.).  We haven't back-ported this to the TCP BTL because a) 
>>> most people who use TCP for MPI still use a single L2 address space, 
>>> and b) no one has asked for it.  :-)
>>> 
>>> As for the round robin scheduling, there's no indication from the Linux 
>>> TCP stack what the bandwidth is on a given IP interface.  So unless you 
>>> use the btl_tcp_bandwidth_ (e.g., 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:

> Wow, that's pretty terrible!  :(
> 
> Is the behavior BTL-specific, perchance?  E.G., if you only use certain BTLs, 
> does the delay disappear?

You mean something like:

reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
./mpihello; date
Mon Nov 10 13:44:34 CET 2014
Hello World from Node 1.
Total: 4
Universe: 4
Hello World from Node 0.
Hello World from Node 3.
Hello World from Node 2.
Mon Nov 10 13:46:42 CET 2014

(the above was even the latest v1.8.3-186-g978f61d)

Falling back to 1.8.1 gives (as expected):

reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
./mpihello; date
Mon Nov 10 13:49:51 CET 2014
Hello World from Node 1.
Total: 4
Universe: 4
Hello World from Node 0.
Hello World from Node 2.
Hello World from Node 3.
Mon Nov 10 13:49:53 CET 2014


-- Reuti

> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
> 
> Sent from my phone. No type good. 
> 
>> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
>> 
>>> On 10.11.2014 at 12:24, Reuti wrote:
>>> 
>>> Hi,
>>> 
 On 09.11.2014 at 05:38, Ralph Castain wrote:
 
 FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
 Each process receives a complete map of that info for every process in the 
 job. So when the TCP btl sets itself up, it attempts to connect across 
 -all- the interfaces published by the other end.
 
 So it doesn’t matter what hostname is provided by the RM. We discover and 
 “share” all of the interface info for every node, and then use them for 
 loadbalancing.
>>> 
>>> does this lead to any time delay when starting up? I stayed with Open MPI 
>>> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
>>> delay when the applications starts in my first compilation of 1.8.3 I 
>>> disregarded even all my extra options and run it outside of any 
>>> queuingsystem - the delay remains - on two different clusters.
>> 
>> I forgot to mention: the delay is more or less exactly 2 minutes from the 
>> time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>> for the initial `ssh` to reach the other node though).
>> 
>> -- Reuti
>> 
>> 
>>> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
>>> creates this delay when starting up a simple mpihello. I assume it may lay 
>>> in the way how to reach other machines, as with one single machine there is 
>>> no delay. But using one (and only one - no tree spawn involved) additional 
>>> machine already triggers this delay.
>>> 
>>> Did anyone else notice it?
>>> 
>>> -- Reuti
>>> 
>>> 
 HTH
 Ralph
 
 
> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
> 
> Ok I figured, i'm going to have to read some more for my own curiosity. 
> The reason I mention the Resource Manager we use, and that the hostnames 
> given but PBS/Torque match the 1gig-e interfaces, i'm curious what path 
> it would take to get to a peer node when the node list given all match 
> the 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
> interfaces.  
> 
> I'll go do some measurements and see.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
>> default.  
>> 
>> This short FAQ has links to 2 other FAQs that provide detailed 
>> information about reachability:
>> 
>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>> 
>> The usNIC BTL uses UDP for its wire transport and actually does a much 
>> more standards-conformant peer reachability determination (i.e., it 
>> actually checks routing tables to see if it can reach a given peer which 
>> has all kinds of caching benefits, kernel controls if you want them, 
>> etc.).  We haven't back-ported this to the TCP BTL because a) most 
>> people who use TCP for MPI still use a single L2 address space, and b) 
>> no one has asked for it.  :-)
>> 
>> As for the round robin scheduling, there's no indication from the Linux 
>> TCP stack what the bandwidth is on a given IP interface.  So unless you 
>> use the btl_tcp_bandwidth_ (e.g., 
>> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
>> equally.
>> 
>> If you have multiple IP interfaces sharing a single physical link, there 
>> will likely be no benefit from having Open MPI use more than one of 
>> them.  You should probably use btl_tcp_if_include / btl_tcp_if_exclude 
>> to select just one.
>> 
>> 
>> 
>> 
>>> On Nov 7, 2014, at 2:53 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Jeff Squyres (jsquyres)
Wow, that's pretty terrible!  :(

Is the behavior BTL-specific, perchance?  E.G., if you only use certain BTLs, 
does the delay disappear?

FWIW: the use-all-IP interfaces approach has been in OMPI forever. 

Sent from my phone. No type good. 

> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
> 
>> On 10.11.2014 at 12:24, Reuti wrote:
>> 
>> Hi,
>> 
>>> On 09.11.2014 at 05:38, Ralph Castain wrote:
>>> 
>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
>>> process receives a complete map of that info for every process in the job. 
>>> So when the TCP btl sets itself up, it attempts to connect across -all- the 
>>> interfaces published by the other end.
>>> 
>>> So it doesn’t matter what hostname is provided by the RM. We discover and 
>>> “share” all of the interface info for every node, and then use them for 
>>> loadbalancing.
>> 
>> does this lead to any time delay when starting up? I stayed with Open MPI 
>> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a delay 
>> when the applications starts in my first compilation of 1.8.3 I disregarded 
>> even all my extra options and run it outside of any queuingsystem - the 
>> delay remains - on two different clusters.
> 
> I forgot to mention: the delay is more or less exactly 2 minutes from the 
> time I issued `mpiexec` until the `mpihello` starts up (there is no delay for 
> the initial `ssh` to reach the other node though).
> 
> -- Reuti
> 
> 
>> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
>> creates this delay when starting up a simple mpihello. I assume it may lay 
>> in the way how to reach other machines, as with one single machine there is 
>> no delay. But using one (and only one - no tree spawn involved) additional 
>> machine already triggers this delay.
>> 
>> Did anyone else notice it?
>> 
>> -- Reuti
>> 
>> 
>>> HTH
>>> Ralph
>>> 
>>> 
 On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
 
 Ok I figured, i'm going to have to read some more for my own curiosity. 
 The reason I mention the Resource Manager we use, and that the hostnames 
 given but PBS/Torque match the 1gig-e interfaces, i'm curious what path it 
 would take to get to a peer node when the node list given all match the 
 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
 interfaces.  
 
 I'll go do some measurements and see.
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
> default.  
> 
> This short FAQ has links to 2 other FAQs that provide detailed 
> information about reachability:
> 
> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> 
> The usNIC BTL uses UDP for its wire transport and actually does a much 
> more standards-conformant peer reachability determination (i.e., it 
> actually checks routing tables to see if it can reach a given peer which 
> has all kinds of caching benefits, kernel controls if you want them, 
> etc.).  We haven't back-ported this to the TCP BTL because a) most people 
> who use TCP for MPI still use a single L2 address space, and b) no one 
> has asked for it.  :-)
> 
> As for the round robin scheduling, there's no indication from the Linux 
> TCP stack what the bandwidth is on a given IP interface.  So unless you 
> use the btl_tcp_bandwidth_ (e.g., 
> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
> equally.
> 
> If you have multiple IP interfaces sharing a single physical link, there 
> will likely be no benefit from having Open MPI use more than one of them. 
>  You should probably use btl_tcp_if_include / btl_tcp_if_exclude to 
> select just one.
> 
> 
> 
> 
>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>> 
>> I was doing a test on our IB based cluster, where I was diabling IB
>> 
>> --mca btl ^openib --mca mtl ^mxm
>> 
>> I was sending very large messages >1GB  and I was surppised by the speed.
>> 
>> I noticed then that of all our ethernet interfaces
>> 
>> eth0  (1gig-e)
>> ib0  (ip over ib, for lustre configuration at vendor request)
>> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
>> extrnal storage support at >1Gig speed
>> 
>> I saw all three were getting traffic.
>> 
>> We use torque for our Resource Manager and use TM support, the hostnames 
>> given by torque match the eth0 interfaces.
>> 
>> How does OMPI figure out that it can also talk over the others?  How 
>> does it chose to 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
On 10.11.2014 at 12:24, Reuti wrote:

> Hi,
> 
> On 09.11.2014 at 05:38, Ralph Castain wrote:
> 
>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
>> process receives a complete map of that info for every process in the job. 
>> So when the TCP btl sets itself up, it attempts to connect across -all- the 
>> interfaces published by the other end.
>> 
>> So it doesn’t matter what hostname is provided by the RM. We discover and 
>> “share” all of the interface info for every node, and then use them for 
>> loadbalancing.
> 
> does this lead to any time delay when starting up? I stayed with Open MPI 
> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a delay 
> when the applications starts in my first compilation of 1.8.3 I disregarded 
> even all my extra options and run it outside of any queuingsystem - the delay 
> remains - on two different clusters.

I forgot to mention: the delay is more or less exactly 2 minutes from the time 
I issued `mpiexec` until the `mpihello` starts up (there is no delay for the 
initial `ssh` to reach the other node though).
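
One way to narrow down where those two minutes are spent is to raise the verbosity
of the frameworks involved (a sketch; the verbosity levels are just generous values):

date; mpiexec --mca oob_base_verbose 10 --mca btl_base_verbose 10 \
  -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date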

-- Reuti


> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
> creates this delay when starting up a simple mpihello. I assume it may lay in 
> the way how to reach other machines, as with one single machine there is no 
> delay. But using one (and only one - no tree spawn involved) additional 
> machine already triggers this delay.
> 
> Did anyone else notice it?
> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>>> 
>>> Ok I figured, i'm going to have to read some more for my own curiosity. The 
>>> reason I mention the Resource Manager we use, and that the hostnames given 
>>> but PBS/Torque match the 1gig-e interfaces, i'm curious what path it would 
>>> take to get to a peer node when the node list given all match the 1gig 
>>> interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.  
>>> 
>>> I'll go do some measurements and see.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
 On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
 wrote:
 
 Ralph is right: OMPI aggressively uses all Ethernet interfaces by default. 
  
 
 This short FAQ has links to 2 other FAQs that provide detailed information 
 about reachability:
 
 http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
 
 The usNIC BTL uses UDP for its wire transport and actually does a much 
 more standards-conformant peer reachability determination (i.e., it 
 actually checks routing tables to see if it can reach a given peer which 
 has all kinds of caching benefits, kernel controls if you want them, 
 etc.).  We haven't back-ported this to the TCP BTL because a) most people 
 who use TCP for MPI still use a single L2 address space, and b) no one has 
 asked for it.  :-)
 
 As for the round robin scheduling, there's no indication from the Linux 
 TCP stack what the bandwidth is on a given IP interface.  So unless you 
 use the btl_tcp_bandwidth_ (e.g., 
 btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
 equally.
 
 If you have multiple IP interfaces sharing a single physical link, there 
 will likely be no benefit from having Open MPI use more than one of them.  
 You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
 just one.
 
 
 
 
 On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
 
> I was doing a test on our IB based cluster, where I was diabling IB
> 
> --mca btl ^openib --mca mtl ^mxm
> 
> I was sending very large messages >1GB  and I was surppised by the speed.
> 
> I noticed then that of all our ethernet interfaces
> 
> eth0  (1gig-e)
> ib0  (ip over ib, for lustre configuration at vendor request)
> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
> extrnal storage support at >1Gig speed
> 
> I saw all three were getting traffic.
> 
> We use torque for our Resource Manager and use TM support, the hostnames 
> given by torque match the eth0 interfaces.
> 
> How does OMPI figure out that it can also talk over the others?  How does 
> it chose to load balance?
> 
> BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
> and eoib0  are the same physical device and may screw with load balancing 
> if anyone ver falls back to TCP.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> ___
> users 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Hi,

On 09.11.2014 at 05:38, Ralph Castain wrote:

> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
> process receives a complete map of that info for every process in the job. So 
> when the TCP btl sets itself up, it attempts to connect across -all- the 
> interfaces published by the other end.
> 
> So it doesn’t matter what hostname is provided by the RM. We discover and 
> “share” all of the interface info for every node, and then use them for 
> loadbalancing.

does this lead to any time delay when starting up? I stayed with Open MPI 1.6.5 
for some time and tried to use Open MPI 1.8.3 now. As there is a delay when the 
application starts in my first compilation of 1.8.3, I disregarded even all my 
extra options and ran it outside of any queuing system - the delay remains - on 
two different clusters.

I tracked it down: up to 1.8.1 it is working fine, but 1.8.2 already creates 
this delay when starting up a simple mpihello. I assume it may lie in the way 
other machines are reached, as with one single machine there is no delay. But 
using one (and only one - no tree spawn involved) additional machine already 
triggers this delay.

Did anyone else notice it?

-- Reuti


> HTH
> Ralph
> 
> 
>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>> 
>> Ok I figured, i'm going to have to read some more for my own curiosity. The 
>> reason I mention the Resource Manager we use, and that the hostnames given 
>> but PBS/Torque match the 1gig-e interfaces, i'm curious what path it would 
>> take to get to a peer node when the node list given all match the 1gig 
>> interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.  
>> 
>> I'll go do some measurements and see.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.  
>>> 
>>> This short FAQ has links to 2 other FAQs that provide detailed information 
>>> about reachability:
>>> 
>>>  http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>> 
>>> The usNIC BTL uses UDP for its wire transport and actually does a much more 
>>> standards-conformant peer reachability determination (i.e., it actually 
>>> checks routing tables to see if it can reach a given peer which has all 
>>> kinds of caching benefits, kernel controls if you want them, etc.).  We 
>>> haven't back-ported this to the TCP BTL because a) most people who use TCP 
>>> for MPI still use a single L2 address space, and b) no one has asked for 
>>> it.  :-)
>>> 
>>> As for the round robin scheduling, there's no indication from the Linux TCP 
>>> stack what the bandwidth is on a given IP interface.  So unless you use the 
>>> btl_tcp_bandwidth_ (e.g., btl_tcp_bandwidth_eth0) MCA 
>>> params, OMPI will round-robin across them equally.
>>> 
>>> If you have multiple IP interfaces sharing a single physical link, there 
>>> will likely be no benefit from having Open MPI use more than one of them.  
>>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
>>> just one.
>>> 
>>> 
>>> 
>>> 
>>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>>> 
 I was doing a test on our IB based cluster, where I was diabling IB
 
 --mca btl ^openib --mca mtl ^mxm
 
 I was sending very large messages >1GB  and I was surppised by the speed.
 
 I noticed then that of all our ethernet interfaces
 
 eth0  (1gig-e)
 ib0  (ip over ib, for lustre configuration at vendor request)
 eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
 extrnal storage support at >1Gig speed
 
 I saw all three were getting traffic.
 
 We use torque for our Resource Manager and use TM support, the hostnames 
 given by torque match the eth0 interfaces.
 
 How does OMPI figure out that it can also talk over the others?  How does 
 it chose to load balance?
 
 BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
 and eoib0  are the same physical device and may screw with load balancing 
 if anyone ver falls back to TCP.
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2014/11/25709.php
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-08 Thread Ralph Castain
FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
process receives a complete map of that info for every process in the job. So 
when the TCP btl sets itself up, it attempts to connect across -all- the 
interfaces published by the other end.

So it doesn’t matter what hostname is provided by the RM. We discover and 
“share” all of the interface info for every node, and then use them for 
loadbalancing.
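
Since the runtime messaging (oob) and the MPI traffic (tcp BTL) do this discovery
independently, they can also be restricted independently; a sketch using interface
names from this thread (eth0 is assumed to be the wanted network):

mpiexec --mca oob_tcp_if_include eth0 \
        --mca btl_tcp_if_include eth0 \
        -n 4 --hostfile machines ./mpihello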

HTH
Ralph


> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
> 
> Ok I figured, i'm going to have to read some more for my own curiosity. The 
> reason I mention the Resource Manager we use, and that the hostnames given 
> but PBS/Torque match the 1gig-e interfaces, i'm curious what path it would 
> take to get to a peer node when the node list given all match the 1gig 
> interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.  
> 
> I'll go do some measurements and see.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.  
>> 
>> This short FAQ has links to 2 other FAQs that provide detailed information 
>> about reachability:
>> 
>>   http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>> 
>> The usNIC BTL uses UDP for its wire transport and actually does a much more 
>> standards-conformant peer reachability determination (i.e., it actually 
>> checks routing tables to see if it can reach a given peer which has all 
>> kinds of caching benefits, kernel controls if you want them, etc.).  We 
>> haven't back-ported this to the TCP BTL because a) most people who use TCP 
>> for MPI still use a single L2 address space, and b) no one has asked for it. 
>>  :-)
>> 
>> As for the round robin scheduling, there's no indication from the Linux TCP 
>> stack what the bandwidth is on a given IP interface.  So unless you use the 
>> btl_tcp_bandwidth_ (e.g., btl_tcp_bandwidth_eth0) MCA 
>> params, OMPI will round-robin across them equally.
>> 
>> If you have multiple IP interfaces sharing a single physical link, there 
>> will likely be no benefit from having Open MPI use more than one of them.  
>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
>> just one.
>> 
>> 
>> 
>> 
>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>> 
>>> I was doing a test on our IB based cluster, where I was diabling IB
>>> 
>>> --mca btl ^openib --mca mtl ^mxm
>>> 
>>> I was sending very large messages >1GB  and I was surppised by the speed.
>>> 
>>> I noticed then that of all our ethernet interfaces
>>> 
>>> eth0  (1gig-e)
>>> ib0  (ip over ib, for lustre configuration at vendor request)
>>> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
>>> extrnal storage support at >1Gig speed
>>> 
>>> I saw all three were getting traffic.
>>> 
>>> We use torque for our Resource Manager and use TM support, the hostnames 
>>> given by torque match the eth0 interfaces.
>>> 
>>> How does OMPI figure out that it can also talk over the others?  How does 
>>> it chose to load balance?
>>> 
>>> BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
>>> and eoib0  are the same physical device and may screw with load balancing 
>>> if anyone ver falls back to TCP.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/11/25709.php
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/11/25713.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25715.php



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-08 Thread Brock Palen
Ok I figured; I'm going to have to read some more for my own curiosity. The 
reason I mention the resource manager we use, and that the hostnames given by 
PBS/Torque match the 1gig-e interfaces, is that I'm curious what path it would 
take to get to a peer node when the node list given all matches the 1gig 
interfaces, yet data is being sent out the 10gig eoib0/ib0 interfaces.  

I'll go do some measurements and see.
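
A simple way to do such a measurement on Linux is to watch the per-interface byte
counters while the job runs (a sketch; sar is part of the sysstat package and may
not be installed):

# snapshot the kernel counters before and after the run:
cat /proc/net/dev
# or watch them live, one sample per second:
sar -n DEV 1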

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.  
> 
> This short FAQ has links to 2 other FAQs that provide detailed information 
> about reachability:
> 
>http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> 
> The usNIC BTL uses UDP for its wire transport and actually does a much more 
> standards-conformant peer reachability determination (i.e., it actually 
> checks routing tables to see if it can reach a given peer which has all kinds 
> of caching benefits, kernel controls if you want them, etc.).  We haven't 
> back-ported this to the TCP BTL because a) most people who use TCP for MPI 
> still use a single L2 address space, and b) no one has asked for it.  :-)
> 
> As for the round robin scheduling, there's no indication from the Linux TCP 
> stack what the bandwidth is on a given IP interface.  So unless you use the 
> btl_tcp_bandwidth_ (e.g., btl_tcp_bandwidth_eth0) MCA 
> params, OMPI will round-robin across them equally.
> 
> If you have multiple IP interfaces sharing a single physical link, there will 
> likely be no benefit from having Open MPI use more than one of them.  You 
> should probably use btl_tcp_if_include / btl_tcp_if_exclude to select just 
> one.
> 
> 
> 
> 
> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
> 
>> I was doing a test on our IB based cluster, where I was diabling IB
>> 
>> --mca btl ^openib --mca mtl ^mxm
>> 
>> I was sending very large messages >1GB  and I was surppised by the speed.
>> 
>> I noticed then that of all our ethernet interfaces
>> 
>> eth0  (1gig-e)
>> ib0  (ip over ib, for lustre configuration at vendor request)
>> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
>> extrnal storage support at >1Gig speed
>> 
>> I saw all three were getting traffic.
>> 
>> We use torque for our Resource Manager and use TM support, the hostnames 
>> given by torque match the eth0 interfaces.
>> 
>> How does OMPI figure out that it can also talk over the others?  How does it 
>> chose to load balance?
>> 
>> BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
>> and eoib0  are the same physical device and may screw with load balancing if 
>> anyone ver falls back to TCP.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/11/25709.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25713.php



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-08 Thread Jeff Squyres (jsquyres)
Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.  

This short FAQ has links to 2 other FAQs that provide detailed information 
about reachability:

http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network

The usNIC BTL uses UDP for its wire transport and actually does a much more 
standards-conformant peer reachability determination (i.e., it actually checks 
routing tables to see if it can reach a given peer which has all kinds of 
caching benefits, kernel controls if you want them, etc.).  We haven't 
back-ported this to the TCP BTL because a) most people who use TCP for MPI 
still use a single L2 address space, and b) no one has asked for it.  :-)
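
For a plain TCP/IP setup the same kind of check can be done by hand by asking the
kernel for its routing decision (a sketch; the peer address is made up):

ip route get 192.168.154.23   # shows which interface and source address the kernel would use for that peer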

As for the round robin scheduling, there's no indication from the Linux TCP 
stack what the bandwidth is on a given IP interface.  So unless you use the 
btl_tcp_bandwidth_ (e.g., btl_tcp_bandwidth_eth0) MCA 
params, OMPI will round-robin across them equally.

If you have multiple IP interfaces sharing a single physical link, there will 
likely be no benefit from having Open MPI use more than one of them.  You 
should probably use btl_tcp_if_include / btl_tcp_if_exclude to select just one.
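
A sketch of both knobs, using interface names mentioned in this thread (the
bandwidth values are only illustrative; check ompi_info for the exact semantics):

# pin the MPI traffic to one interface:
mpiexec --mca btl_tcp_if_include eth0 -n 4 --hostfile machines ./mpihello
# or weight the round robin instead of excluding interfaces:
mpiexec --mca btl_tcp_bandwidth_eth0 1000 --mca btl_tcp_bandwidth_eoib0 10000 \
        -n 4 --hostfile machines ./mpihello
# note: btl_tcp_if_exclude replaces the default exclude list, so loopback should
# be listed again when using it, e.g. --mca btl_tcp_if_exclude lo,ib0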




On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:

> I was doing a test on our IB based cluster, where I was diabling IB
> 
> --mca btl ^openib --mca mtl ^mxm
> 
> I was sending very large messages >1GB  and I was surppised by the speed.
> 
> I noticed then that of all our ethernet interfaces
> 
> eth0  (1gig-e)
> ib0  (ip over ib, for lustre configuration at vendor request)
> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
> extrnal storage support at >1Gig speed
> 
> I saw all three were getting traffic.
> 
> We use torque for our Resource Manager and use TM support, the hostnames 
> given by torque match the eth0 interfaces.
> 
> How does OMPI figure out that it can also talk over the others?  How does it 
> chose to load balance?
> 
> BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 and 
> eoib0  are the same physical device and may screw with load balancing if 
> anyone ver falls back to TCP.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25709.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-07 Thread Ralph Castain
OMPI discovers all active interfaces and automatically considers them available 
for its use unless instructed otherwise via the params. I’d have to look at the 
TCP BTL code to see the loadbalancing algo - I thought we didn’t have that “on” 
by default across BTLs, but I don’t know if the TCP one automatically uses all 
available Ethernet interfaces by default. Sounds like it must.
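
The TCP BTL's parameters (and their defaults) can be inspected without running a
job; a sketch for the 1.8 series, where --level is needed to show more than the
basic parameters:

ompi_info --param btl tcp --level 9 | grep -E 'if_include|if_exclude|bandwidth|links'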


> On Nov 7, 2014, at 11:53 AM, Brock Palen  wrote:
> 
> I was doing a test on our IB based cluster, where I was disabling IB
> 
> --mca btl ^openib --mca mtl ^mxm
> 
> I was sending very large messages >1GB and I was surprised by the speed.
> 
> I noticed then that of all our ethernet interfaces
> 
> eth0  (1gig-e)
> ib0  (ip over ib, for lustre configuration at vendor request)
> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
> external storage support at >1Gig speed)
> 
> I saw all three were getting traffic.
> 
> We use torque for our Resource Manager and use TM support, the hostnames 
> given by torque match the eth0 interfaces.
> 
> How does OMPI figure out that it can also talk over the others?  How does it 
> choose to load balance?
> 
> BTW that is fine, but we will use if_exclude on one of the IB ones, as ib0 and 
> eoib0 are the same physical device and may screw with load balancing if 
> anyone ever falls back to TCP.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25709.php