Re: [OMPI devel] [RFC] mca_base_select()

2008-05-06 Thread Ralph Castain
Excellent! Thanks Josh - both for the original work/commit and for the quick
fix!

Ralph


On 5/6/08 3:58 PM, "Josh Hursey"  wrote:

> Sorry about that. Looking back at the filem logic it seems that I
> returned success even if select failed (and just used the 'none'
> passthrough component). I committed a patch in r18389 that fixes this
> problem.
> 
> This commit now has a warning that prints on the filem verbose stream
> so if a user hits something like this in the wild unexpectedly then
> we can help them debug it a bit.
> 
> Cheers,
> Josh
> 
> 
> On May 6, 2008, at 2:56 PM, Ralph H Castain wrote:
> 
>> Hmmm... well, I hit a problem (of course!). I have mca-no-build on
>> the filem
>> framework on my Mac. If I just mpirun -n 3 ./hello, I get the
>> following
>> error:
>> 
>> --
>> 
>> It looks like orte_init failed for some reason; your parallel
>> process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>   orte_filem_base_select failed
>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>> 
>> --
>> 
>> 
>> After looking at the source code for filem_select, I can run just
>> fine if I
>> specify -mca filem none on the cmd line. Otherwise, it looks like your
>> select logic insists that at least one component must be built and
>> selectable?
>> 
>> Is that generally true, or is your filem framework the exception? I
>> think
>> this would not be a good general requirement - frankly, I don't
>> think it is
>> good for any framework to have such a requirement.
>> 
>> Ralph
>> 
>> 
>> 
>> On 5/6/08 12:09 PM, "Josh Hursey"  wrote:
>> 
>>> This has been committed in r18381
>>> 
>>> Please let me know if you have any problems with this commit.
>>> 
>>> Cheers,
>>> Josh
>>> 
>>> On May 5, 2008, at 10:41 AM, Josh Hursey wrote:
>>> 
 Awesome.
 
 The branch is updated to the latest trunk head. I encourage folks to
 check out this repository and make sure that it builds on their
 system. A normal build of the branch should be enough to find out if
 there are any cut-n-paste problems (though I tried to be careful,
 mistakes do happen).
 
 I haven't heard any problems so this is looking like it will come in
 tomorrow after the teleconf. I'll ask again there to see if there
 are
 any voices of concern.
 
 Cheers,
 Josh
 
 On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:
 
> This all sounds good to me!
> 
> On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:
> 
>> What:  Add mca_base_select() and adjust frameworks & components to
>> use
>> it.
>> Why:   Consolidation of code for general goodness.
>> Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
>> When:  Code ready now. Documentation ready soon.
>> Timeout: May 6, 2008 (After teleconf) [1 week]
>> 
>> Discussion:
>> ---
>> For a number of years a few developers have been talking about
>> creating a MCA base component selection function. For various
>> reasons
>> this was never implemented. Recently I decided to give it a try.
>> 
>> A base select function will allow Open MPI to provide completely
>> consistent selection behavior for many of its frameworks (18 of 31
>> to
>> be exact at the moment). The primary goal of this work is to
>> improve
>> code maintainability through code reuse. Other benefits also
>> result
>> such as a slightly smaller memory footprint.
>> 
>> The mca_base_select() function represented the most commonly used
>> logic for component selection: Select the one component with the
>> highest priority and close all of the non-selected components.
>> This
>> function can be found at the path below in the branch:
>> opal/mca/base/mca_base_components_select.c
>> 
>> To support this I had to formalize a query() function in the
>> mca_base_component_t of the form:
>> int mca_base_query_component_fn(mca_base_module_t **module, int
>> *priority);
>> 
>> This function is specified after the open and close component
>> functions in this structure so as to allow compatibility with
>> frameworks
>> that do not use the base selection logic. Frameworks that do *not*
>> use
>> this function are *not* affected by this commit. However, every
>> component in the frameworks that use the mca_base_select function
>> must
>> adjust their component query function to fit that specified above.
>> 
>> 18 frameworks in 

Re: [OMPI devel] [RFC] mca_base_select()

2008-05-06 Thread Josh Hursey
Sorry about that. Looking back at the filem logic it seems that I  
returned success even if select failed (and just used the 'none'
passthrough component). I committed a patch in r18389 that fixes this  
problem.


This commit now has a warning that prints on the filem verbose stream  
so if a user hits something like this in the wild unexpectedly then  
we can help them debug it a bit.
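
To see that warning, raise the filem verbose level. Assuming the usual
<framework>_base_verbose parameter naming applies here, that would be
something like:

    mpirun -mca filem_base_verbose 10 -n 3 ./hello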


Cheers,
Josh


On May 6, 2008, at 2:56 PM, Ralph H Castain wrote:

Hmmmwell, I hit a problem (of course!). I have mca-no-build on  
the filem
framework on my Mac. If I just mpriun -n 3 ./hello, I get the  
following

error:

-- 

It looks like orte_init failed for some reason; your parallel  
process is

likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_filem_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

-- 



After looking at the source code for filem_select, I can run just  
fine if I

specify -mca filem none on the cmd line. Otherwise, it looks like your
select logic insists that at least one component must be built and
selectable?

Is that generally true, or is your filem framework the exception? I  
think
this would not be a good general requirement - frankly, I don't  
think it is

good for any framework to have such a requirement.

Ralph



On 5/6/08 12:09 PM, "Josh Hursey"  wrote:


This has been committed in r18381

Please let me know if you have any problems with this commit.

Cheers,
Josh

On May 5, 2008, at 10:41 AM, Josh Hursey wrote:


Awesome.

The branch is updated to the latest trunk head. I encourage folks to
check out this repository and make sure that it builds on their
system. A normal build of the branch should be enough to find out if
there are any cut-n-paste problems (though I tried to be careful,
mistakes do happen).

I haven't heard any problems so this is looking like it will come in
tomorrow after the teleconf. I'll ask again there to see if there  
are

any voices of concern.

Cheers,
Josh

On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:


This all sounds good to me!

On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:


What:  Add mca_base_select() and adjust frameworks & components to
use
it.
Why:   Consolidation of code for general goodness.
Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
When:  Code ready now. Documentation ready soon.
Timeout: May 6, 2008 (After teleconf) [1 week]

Discussion:
---
For a number of years a few developers have been talking about
creating a MCA base component selection function. For various
reasons
this was never implemented. Recently I decided to give it a try.

A base select function will allow Open MPI to provide completely
consistent selection behavior for many of its frameworks (18 of 31
to
be exact at the moment). The primary goal of this work is to
improve
code maintainability through code reuse. Other benefits also  
result

such as a slightly smaller memory footprint.

The mca_base_select() function represented the most commonly used
logic for component selection: Select the one component with the
highest priority and close all of the non-selected components.
This

function can be found at the path below in the branch:
opal/mca/base/mca_base_components_select.c

To support this I had to formalize a query() function in the
mca_base_component_t of the form:
int mca_base_query_component_fn(mca_base_module_t **module, int
*priority);

This function is specified after the open and close component
functions in this structure so as to allow compatibility with
frameworks
that do not use the base selection logic. Frameworks that do *not*
use
this function are *not* affected by this commit. However, every
component in the frameworks that use the mca_base_select function
must
adjust their component query function to fit that specified above.

18 frameworks in Open MPI have been changed. I have updated all of
the
components in the 18 frameworks available in the trunk on my  
branch.

The affected frameworks are:
- OPAL Carto
- OPAL crs
- OPAL maffinity
- OPAL memchecker
- OPAL paffinity
- ORTE errmgr
- ORTE ess
- ORTE Filem
- ORTE grpcomm
- ORTE odls
- ORTE plm
- ORTE ras
- ORTE rmaps
- ORTE routed
- ORTE snapc
- OMPI crcp
- OMPI dpm
- OMPI pubsub

There was a question of the memory footprint change as a result of
this commit. I used 'pmap' to determine process memory footprint
of a
hello world MPI program. Static and Shared build numbers are below
along with variations on launching locally and to a single node
allocated by SLURM. All of this was on Indiana University's Odin
machine. We 

Re: [OMPI devel] [RFC] mca_base_select()

2008-05-06 Thread Ralph H Castain
Hmmmwell, I hit a problem (of course!). I have mca-no-build on the filem
framework on my Mac. If I just mpriun -n 3 ./hello, I get the following
error:

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_filem_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

--

After looking at the source code for filem_select, I can run just fine if I
specify -mca filem none on the cmd line. Otherwise, it looks like your
select logic insists that at least one component must be built and
selectable?
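
For reference, the combination that runs cleanly here is along the lines
of:

    mpirun -mca filem none -n 3 ./hello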

Is that generally true, or is your filem framework the exception? I think
this would not be a good general requirement - frankly, I don't think it is
good for any framework to have such a requirement.

Ralph



On 5/6/08 12:09 PM, "Josh Hursey"  wrote:

> This has been committed in r18381
> 
> Please let me know if you have any problems with this commit.
> 
> Cheers,
> Josh
> 
> On May 5, 2008, at 10:41 AM, Josh Hursey wrote:
> 
>> Awesome.
>> 
>> The branch is updated to the latest trunk head. I encourage folks to
>> check out this repository and make sure that it builds on their
>> system. A normal build of the branch should be enough to find out if
>> there are any cut-n-paste problems (though I tried to be careful,
>> mistakes do happen).
>> 
>> I haven't heard any problems so this is looking like it will come in
>> tomorrow after the teleconf. I'll ask again there to see if there are
>> any voices of concern.
>> 
>> Cheers,
>> Josh
>> 
>> On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:
>> 
>>> This all sounds good to me!
>>> 
>>> On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:
>>> 
 What:  Add mca_base_select() and adjust frameworks & components to
 use
 it.
 Why:   Consolidation of code for general goodness.
 Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
 When:  Code ready now. Documentation ready soon.
 Timeout: May 6, 2008 (After teleconf) [1 week]
 
 Discussion:
 ---
 For a number of years a few developers have been talking about
 creating a MCA base component selection function. For various
 reasons
 this was never implemented. Recently I decided to give it a try.
 
 A base select function will allow Open MPI to provide completely
 consistent selection behavior for many of its frameworks (18 of 31
 to
 be exact at the moment). The primary goal of this work is to
improve
 code maintainability through code reuse. Other benefits also result
 such as a slightly smaller memory footprint.
 
 The mca_base_select() function represented the most commonly used
 logic for component selection: Select the one component with the
highest priority and close all of the non-selected components. This
 function can be found at the path below in the branch:
 opal/mca/base/mca_base_components_select.c
 
 To support this I had to formalize a query() function in the
 mca_base_component_t of the form:
 int mca_base_query_component_fn(mca_base_module_t **module, int
 *priority);
 
 This function is specified after the open and close component
functions in this structure so as to allow compatibility with
 frameworks
 that do not use the base selection logic. Frameworks that do *not*
 use
this function are *not* affected by this commit. However, every
 component in the frameworks that use the mca_base_select function
 must
 adjust their component query function to fit that specified above.
 
 18 frameworks in Open MPI have been changed. I have updated all of
 the
 components in the 18 frameworks available in the trunk on my branch.
The affected frameworks are:
 - OPAL Carto
 - OPAL crs
 - OPAL maffinity
 - OPAL memchecker
 - OPAL paffinity
 - ORTE errmgr
 - ORTE ess
 - ORTE Filem
 - ORTE grpcomm
 - ORTE odls
- ORTE plm
 - ORTE ras
 - ORTE rmaps
 - ORTE routed
 - ORTE snapc
 - OMPI crcp
 - OMPI dpm
 - OMPI pubsub
 
 There was a question of the memory footprint change as a result of
 this commit. I used 'pmap' to determine process memory footprint
 of a
 hello world MPI program. Static and Shared build numbers are below
 along with variations on launching locally and to a single node
 allocated by SLURM. All of this was on Indiana University's Odin
 machine. We compare against the trunk (r18276) 

Re: [OMPI devel] [RFC] mca_base_select()

2008-05-06 Thread Josh Hursey

This has been committed in r18381

Please let me know if you have any problems with this commit.

Cheers,
Josh

On May 5, 2008, at 10:41 AM, Josh Hursey wrote:


Awesome.

The branch is updated to the latest trunk head. I encourage folks to
check out this repository and make sure that it builds on their
system. A normal build of the branch should be enough to find out if
there are any cut-n-paste problems (though I tried to be careful,
mistakes do happen).

I haven't heard any problems so this is looking like it will come in
tomorrow after the teleconf. I'll ask again there to see if there are
any voices of concern.

Cheers,
Josh

On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:


This all sounds good to me!

On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:


What:  Add mca_base_select() and adjust frameworks & components to
use
it.
Why:   Consolidation of code for general goodness.
Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
When:  Code ready now. Documentation ready soon.
Timeout: May 6, 2008 (After teleconf) [1 week]

Discussion:
---
For a number of years a few developers have been talking about
creating a MCA base component selection function. For various  
reasons

this was never implemented. Recently I decided to give it a try.

A base select function will allow Open MPI to provide completely
consistent selection behavior for many of its frameworks (18 of 31  
to

be exact at the moment). The primary goal of this work is to
improve
code maintainability through code reuse. Other benefits also result
such as a slightly smaller memory footprint.

The mca_base_select() function represented the most commonly used
logic for component selection: Select the one component with the
highest priority and close all of the non-selected components. This
function can be found at the path below in the branch:
opal/mca/base/mca_base_components_select.c
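
In outline, the pattern this centralizes looks something like the sketch
below. It is an illustration only, with made-up names; the real code is
the file named above.

    /* Illustrative sketch of "query every opened component, keep the one
     * with the highest priority, close the rest".  All names here are
     * invented for the example. */
    #include <stddef.h>

    struct toy_component {
        const char *name;
        /* mirrors the formalized query function described below */
        int (*query)(void **module, int *priority);
        void (*close)(void);
    };

    int toy_select(struct toy_component *comps, size_t n,
                   struct toy_component **best, void **best_module)
    {
        int best_priority = -1;
        *best = NULL;
        *best_module = NULL;

        for (size_t i = 0; i < n; ++i) {
            void *module = NULL;
            int priority = -1;
            if (0 != comps[i].query(&module, &priority) || NULL == module) {
                continue;               /* component declined or is unusable */
            }
            if (priority > best_priority) {
                best_priority = priority;
                *best = &comps[i];
                *best_module = module;
            }
        }

        /* Close everything that was not selected. */
        for (size_t i = 0; i < n; ++i) {
            if (&comps[i] != *best) {
                comps[i].close();
            }
        }
        return (NULL != *best) ? 0 : -1;   /* -1 ~ "no component selectable" */
    }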

To support this I had to formalize a query() function in the
mca_base_component_t of the form:
int mca_base_query_component_fn(mca_base_module_t **module, int
*priority);

This function is specified after the open and close component
functions in this structure so as to allow compatibility with
frameworks

that do not use the base selection logic. Frameworks that do *not*
use
this function are *not* affected by this commit. However, every
component in the frameworks that use the mca_base_select function
must
adjust their component query function to fit that specified above.
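
For a concrete idea of what that looks like on the component side, a
hypothetical query function might be written as follows. Only the
signature matches the one above; everything named example_* is invented
for illustration.

    /* Hypothetical component query function using the formalized signature
     * above.  OMPI_SUCCESS / OMPI_ERROR are the usual return codes. */
    extern mca_base_module_t example_module;       /* the module we would export */
    static int example_runtime_is_supported(void); /* hypothetical availability check */

    static int example_component_query(mca_base_module_t **module, int *priority)
    {
        if (!example_runtime_is_supported()) {
            *module = NULL;
            return OMPI_ERROR;   /* decline; the base logic skips this component */
        }
        *priority = 20;          /* relative to sibling components of the framework */
        *module = &example_module;
        return OMPI_SUCCESS;
    }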

18 frameworks in Open MPI have been changed. I have updated all of
the
components in the 18 frameworks available in the trunk on my branch.
The affected frameworks are:
- OPAL Carto
- OPAL crs
- OPAL maffinity
- OPAL memchecker
- OPAL paffinity
- ORTE errmgr
- ORTE ess
- ORTE Filem
- ORTE grpcomm
- ORTE odls
- ORTE plm
- ORTE ras
- ORTE rmaps
- ORTE routed
- ORTE snapc
- OMPI crcp
- OMPI dpm
- OMPI pubsub

There was a question of the memory footprint change as a result of
this commit. I used 'pmap' to determine process memory footprint  
of a

hello world MPI program. Static and Shared build numbers are below
along with variations on launching locally and to a single node
allocated by SLURM. All of this was on Indiana University's Odin
machine. We compare against the trunk (r18276) representing the last
SVN sync point of the branch.

 Process(shared)| Trunk| Branch  | Diff (Improvement)
 ---+--+-+---
 mpirun (orted) |   39976K |  36828K | 3148K
 hello (0)  |  229288K | 229268K |   20K
 hello (1)  |  229288K | 229268K |   20K
 ---+--+-+---
 mpirun |   40032K |  37924K | 2108K
 orted  |   34720K |  34660K |   60K
 hello (0)  |  228404K | 228384K |   20K
 hello (1)  |  228404K | 228384K |   20K

 Process(static)| Trunk| Branch  | Diff (Improvement)
 ---+--+-+---
 mpirun (orted) |   21384K |  21372K |  12K
 hello (0)  |  194000K | 193980K |  20K
 hello (1)  |  194000K | 193980K |  20K
 ---+--+-+---
 mpirun |   21384K |  21372K |  12K
 orted  |   21208K |  21196K |  12K
 hello (0)  |  193116K | 193096K |  20K
 hello (1)  |  193116K | 193096K |  20K

As you can see there are some small memory footprint improvements on
my branch that result from this work. The size of the Open MPI
project
shrinks a bit as well. This commit cuts between 2,000 and 3,500 lines
of code (depending on how you count), so about a ~1% code shrink.
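
For anyone who wants to reproduce the totals above: on Linux, the summary
line that pmap prints per process gives comparable numbers, e.g. something
like:

    pmap <pid of mpirun, orted, or hello> | tail -1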

The branch is stable in all of the testing I have done, but there  
are

some platforms on which I cannot test. So please give this branch a
try and let me know if you find any problems.

Cheers,
Josh




--
Jeff Squyres
Cisco Systems


[OMPI devel] [RFC] mca_base_open() NULL

2008-05-06 Thread Josh Hursey

What:  Add a MCA-NULL option to open no components in mca_base_open()
Why:   Sometimes we do not want to open or select any components of a  
framework.

Where: patch attached for current trunk.
When:  Needs further discussion.
Timeout: Unknown. [May 13, 2008 (After teleconf)?]


Short Version:
--
This RFC is intended to continue discussion on the thread started here:
 http://www.open-mpi.org/community/lists/devel/2008/05/3793.php

Discussion should occur on list, but maybe try to come to some  
settlement on this RFC in the next week or two.


Longer Version:
---
Currently there is no way to express to the MCA system that absolutely  
no components of a framework are needed and therefore nothing should  
be opened. The addition of a sentinel value is needed to explicitly  
express this intention. It was suggested that if a 'MCA-NULL' value is  
passed as an argument for a framework then this should be taken to  
indicate such an intention.
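
To make the intent concrete, the user-visible usage would presumably look
something like:

    mpirun -mca filem MCA-NULL -n 3 ./hello

and inside the MCA base the open path would short-circuit on the sentinel.
The fragment below is only a sketch of the idea; the exact sentinel
spelling and placement are whatever the attached patch defines.

    /* Sketch only, not the contents of mca-null.diff: if the user
     * requested the MCA-NULL sentinel for this framework, open nothing. */
    if (NULL != requested_components &&
        0 == strcmp(requested_components, "MCA-NULL")) {
        return OPAL_SUCCESS;  /* no components opened, none will be selected */
    }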




mca-null.diff
Description: Binary data




Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown

2008-05-06 Thread Jeff Squyres
In addition to Steve's comments, we discussed this on the call today  
and decided that the patch is fine.


Jon and I will discuss further because this is the first instance where
calling some form of "disconnect" on one side causes events to occur
on the other side without involvement from the remote OMPI (e.g.,
the remote side's OMPI layer simply hasn't called its "disconnect"
flavor yet, but the kernel-level transport/network stack will cause
things to happen on the remote side anyway).
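
For anyone following along, the gist of the workaround is to treat
flush-status completions as benign when the device transport is iWARP.
A minimal sketch against plain libibverbs (not the actual openib BTL
change) might look like:

    #include <infiniband/verbs.h>

    /* Sketch only: poll a CQ once and ignore flush-status completions on
     * iWARP devices, where a disconnect legitimately flushes pending WRs. */
    static int poll_once(struct ibv_cq *cq, struct ibv_context *ctx)
    {
        struct ibv_wc wc;
        int n = ibv_poll_cq(cq, 1, &wc);

        if (n <= 0) {
            return n;                    /* nothing completed, or poll error */
        }
        if (IBV_WC_SUCCESS != wc.status) {
            if (IBV_WC_WR_FLUSH_ERR == wc.status &&
                IBV_TRANSPORT_IWARP == ctx->device->transport_type) {
                return 0;                /* expected during teardown, not fatal */
            }
            return -1;                   /* a genuine completion error */
        }
        /* ... handle the successful completion ... */
        return 1;
    }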



On May 6, 2008, at 11:45 AM, Steve Wise wrote:


Jeff Squyres wrote:

On May 5, 2008, at 6:27 PM, Steve Wise wrote:



I am seeing some unusual behavior during the shutdown phase of ompi
at the end of my testcase.  While running an IMB pingpong test over
the rdmacm on openib, I get cq flush errors on my iWARP adapters.

This error is happening because the remote node is still polling
the endpoint while the other one shut down.  This occurs because
iWARP puts the qps in error state when the channel is disconnected
(IB does not do this).  Since the cq is still being polled when the
event is received on the remote node, ompi thinks it hit an error
and kills the run.  Since this is expected behavior on iWARP, this
is not really an error case.


The key here, I think is that when an iWARP QP moves out of RTS, all
the
RECVs and any pending SQ WRs get flushed.  Further, disconnecting  
the
iwarp connection forces the QP out of RTS.  This is probably  
different

than the way IB works.  I.e., "disconnecting" in IB is an out-of-band
exchange done by the IBCM.  For iWARP, "disconnecting" is an in-band
operation (a TCP close or abort) so the QP cannot remain in RTS  
during

this process.



Let me make sure I understand:

- proc A calls del_procs on proc B
- proc A calls ibv_destroy_qp() on QP to proc B



Actually proc A calls rdma_disconnect() on QP to proc B

- this causes a local (proc A) flush on all pending receives and SQ  
WRs

- this then causes a FLUSH event to show up *in proc B*
  --> I'm not clear on this point from Jon's/Steve's text



Yes.  Once the connection is torn down the iwarp QPs will be flushed  
on

both ends.

- OMPI [currently] treats the FLUSH in proc B as an error

Is that right?

What is the purpose of the FLUSH event?




In general, I think it is to allow the application to recover any
resources that are allocated and cannot be touched until the WRs
complete.  For example, the buffers that were described in all the  
RECV

WRs.  If the app is going to exit, this isn't very interesting since
everything will get cleaned up in the exit path.  But if the process  
is

long lived and setting up/tearing down connections, then these pending
RECV buffers need to be reclaimed and put back into the buffer pool,
as

an example...


There is a larger question regarding why the remote node is still
polling the hca and not shutting down, but my immediate question is
if it is an acceptable fix to simply disregard this "error" if it
is an iWARP adapter.



If proc B is still polling the hca, it is likely because it simply  
has

not yet stopped doing it.  I.e., a big problem in MPI implementations
is that not all actions are exactly synchronous.  MPI disconnects are
*effectively* synchronous, but we probably didn't *guarantee*
synchronicity in this case because we didn't need it (perhaps until
now).



Yes.


Steve.




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown

2008-05-06 Thread Steve Wise

Jeff Squyres wrote:

On May 5, 2008, at 6:27 PM, Steve Wise wrote:

  
I am seeing some unusual behavior during the shutdown phase of ompi  
at the end of my testcase.  While running an IMB pingpong test over
the rdmacm on openib, I get cq flush errors on my iWARP adapters.


This error is happening because the remote node is still polling  
the endpoint while the other one shut down.  This occurs because
iWARP puts the qps in error state when the channel is disconnected  
(IB does not do this).  Since the cq is still being polled when the  
event is received on the remote node, ompi thinks it hit an error  
and kills the run.  Since this is expected behavior on iWARP, this  
is not really an error case.
  
The key here, I think is that when an iWARP QP moves out of RTS, all  
the

RECVs and any pending SQ WRs get flushed.  Further, disconnecting the
iwarp connection forces the QP out of RTS.  This is probably different
than the way IB works.  I.e., "disconnecting" in IB is an out-of-band
exchange done by the IBCM.  For iWARP, "disconnecting" is an in-band
operation (a TCP close or abort) so the QP cannot remain in RTS during
this process.



Let me make sure I understand:

- proc A calls del_procs on proc B
- proc A calls ibv_destroy_qp() on QP to proc B
  


Actually proc A calls rdma_disconnect() on QP to proc B


- this causes a local (proc A) flush on all pending receives and SQ WRs
- this then causes a FLUSH event to show up *in proc B*
   --> I'm not clear on this point from Jon's/Steve's text
  


Yes.  Once the connection is torn down the iwarp QPs will be flushed on 
both ends.

- OMPI [currently] treats the FLUSH in proc B as an error

Is that right?

What is the purpose of the FLUSH event?

  


In general, I think it is to allow the application to recover any 
resources that are allocated and cannot be touched until the WRs 
complete.  For example, the buffers that were described in all the RECV 
WRs.  If the app is going to exit, this isn't very interesting since 
everything will get cleaned up in the exit path.  But if the process is 
long lived and setting up/tearing down connections, then these pending 
RECV buffers need to be reclaimed and put back into the buffer pool, as
an example...


There is a larger question regarding why the remote node is still  
polling the hca and not shutting down, but my immediate question is  
if it is an acceptable fix to simply disregard this "error" if it  
is an iWARP adapter.
  


If proc B is still polling the hca, it is likely because it simply has  
not yet stopped doing it.  I.e., a big problem in MPI implementations  
is that not all actions are exactly synchronous.  MPI disconnects are  
*effectively* synchronous, but we probably didn't *guarantee*  
synchronicity in this case because we didn't need it (perhaps until  
now).
  


Yes.


Steve.



Re: [OMPI devel] NO IP address found

2008-05-06 Thread Jeff Squyres
I think the larger issue, though, is whether rdmacm will work properly  
for the LMC>0 case over IB, right?


The fact that it shouldn't be displaying this error message now  
because RDMA CM is not the default is one issue, but it's not the  
*real* issue...



On May 6, 2008, at 11:00 AM, Jon Mason wrote:


On Tuesday 06 May 2008 09:41:53 am Jeff Squyres wrote:

I actually don't know what the RDMA CM requires for the LMC>0 case --
does it require a unique IP address for every LID?


It requires a unique IP address for every hca/port in use by rdmacm.

I see the bug in rdmacm (since I don't believe you were trying to  
use rdmacm), and will have a patch out shortly.




On May 6, 2008, at 5:09 AM, Lenny Verkhovsky wrote:


Hi,

running the BW benchmark with btl_openib_max_lmc >= 2 causes a warning
( MPI from the TRUNK ) 


#mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca
btl_openib_max_lmc 2  ./mpi_p_LMC  -t bw -s 40
BW (40) (size min max avg)  40  321.493757
342.972837  329.493715

#mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca
btl_openib_max_lmc 3  ./mpi_p_LMC  -t bw -s 40
[witch9][[7493,1],7][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],0][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],9][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],4][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],2][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],5][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],10][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch9][[7493,1],17][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],3][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch8][[7493,1],6][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],14][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],19][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],13][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],12][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch9][[7493,1],27][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],23][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],20][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch9][[7493,1],37][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],35][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],32][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],22][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],33][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],30][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch8][[7493,1],16][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],15][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],39][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],25][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],29][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],34][../../../../../ompi/mca/btl/openib/connect/
btl_openib_connect_rdmacm.c:989:create_message] 

Re: [OMPI devel] NO IP address found

2008-05-06 Thread Jon Mason
On Tuesday 06 May 2008 09:41:53 am Jeff Squyres wrote:
> I actually don't know what the RDMA CM requires for the LMC>0 case --  
> does it require a unique IP address for every LID?

It requires a unique IP address for every hca/port in use by rdmacm.

I see the bug in rdmacm (since I don't believe you were trying to use rdmacm), 
and will have a patch out shortly.

> 
> On May 6, 2008, at 5:09 AM, Lenny Verkhovsky wrote:
> 
> > Hi,
> >
> > running the BW benchmark with btl_openib_max_lmc >= 2 causes a warning
> > ( MPI from the TRUNK ) 
> >
> >
> >  #mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca  
> > btl_openib_max_lmc 2  ./mpi_p_LMC  -t bw -s 40
> > BW (40) (size min max avg)  40  321.493757   
> > 342.972837  329.493715
> >
> >  #mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca  
> > btl_openib_max_lmc 3  ./mpi_p_LMC  -t bw -s 40
> > [witch9][[7493,1],7][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch2][[7493,1],0][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch10][[7493,1],9][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch6][[7493,1],4][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch4][[7493,1],2][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch7][[7493,1],5][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch2][[7493,1],10][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch9][[7493,1],17][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch5][[7493,1],3][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch8][[7493,1],6][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch6][[7493,1],14][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch10][[7493,1],19][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch5][[7493,1],13][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch4][[7493,1],12][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch9][[7493,1],27][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch5][[7493,1],23][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch2][[7493,1],20][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch9][[7493,1],37][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch7][[7493,1],35][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch4][[7493,1],32][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch4][[7493,1],22][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch5][[7493,1],33][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch2][[7493,1],30][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch8][[7493,1],16][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch7][[7493,1],15][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch10][[7493,1],39][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch7][[7493,1],25][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch10][[7493,1],29][../../../../../ompi/mca/btl/openib/connect/ 
> > btl_openib_connect_rdmacm.c:989:create_message] No IP address found
> > [witch6][[7493,1],34][../../../../../ompi/mca/btl/openib/connect/ 
> > 

Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown

2008-05-06 Thread Brian W. Barrett

On Tue, 6 May 2008, Jeff Squyres wrote:


On May 5, 2008, at 6:27 PM, Steve Wise wrote:


There is a larger question regarding why the remote node is still
polling the hca and not shutting down, but my immediate question is
if it is an acceptable fix to simply disregard this "error" if it
is an iWARP adapter.


If proc B is still polling the hca, it is likely because it simply has
not yet stopped doing it.  I.e., a big problem in MPI implementations
is that not all actions are exactly synchronous.  MPI disconnects are
*effectively* synchronous, but we probably didn't *guarantee*
synchronicity in this case because we didn't need it (perhaps until
now).


Not to mention...  The BTL has to be able to handle a shutdown from one 
proc while still running its progression engine, as that's a normal 
sequence of events when dynamic processes are involved.  Because of that, 
there wasn't too much care taken to ensure that everyone stopped polling, 
then everyone did del_procs.


Brian


Re: [OMPI devel] NO IP address found

2008-05-06 Thread Jeff Squyres
I actually don't know what the RDMA CM requires for the LMC>0 case --  
does it require a unique IP address for every LID?



On May 6, 2008, at 5:09 AM, Lenny Verkhovsky wrote:


Hi,

running the BW benchmark with btl_openib_max_lmc >= 2 causes a warning
( MPI from the TRUNK ) 



 #mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca  
btl_openib_max_lmc 2  ./mpi_p_LMC  -t bw -s 40
BW (40) (size min max avg)  40  321.493757   
342.972837  329.493715


 #mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca  
btl_openib_max_lmc 3  ./mpi_p_LMC  -t bw -s 40
[witch9][[7493,1],7][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],0][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],9][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],4][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],2][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],5][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],10][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch9][[7493,1],17][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],3][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch8][[7493,1],6][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],14][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],19][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],13][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],12][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch9][[7493,1],27][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],23][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],20][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch9][[7493,1],37][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],35][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],32][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch4][[7493,1],22][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch5][[7493,1],33][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch2][[7493,1],30][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch8][[7493,1],16][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],15][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],39][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch7][[7493,1],25][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch10][[7493,1],29][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],34][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch8][[7493,1],26][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch6][[7493,1],24][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
[witch8][[7493,1],36][../../../../../ompi/mca/btl/openib/connect/ 
btl_openib_connect_rdmacm.c:989:create_message] No IP address found
BW (40) (size min max avg)  40  312.622582   
334.037277  324.014814

Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown

2008-05-06 Thread Jeff Squyres

On May 5, 2008, at 6:27 PM, Steve Wise wrote:

I am seeing some unusual behavior during the shutdown phase of ompi  
at the end of my testcase.  While running an IMB pingpong test over
the rdmacm on openib, I get cq flush errors on my iWARP adapters.


This error is happening because the remote node is still polling  
the endpoint while the other one shut down.  This occurs because
iWARP puts the qps in error state when the channel is disconnected  
(IB does not do this).  Since the cq is still being polled when the  
event is received on the remote node, ompi thinks it hit an error  
and kills the run.  Since this is expected behavior on iWARP, this  
is not really an error case.


The key here, I think is that when an iWARP QP moves out of RTS, all  
the

RECVs and any pending SQ WRs get flushed.  Further, disconnecting the
iwarp connection forces the QP out of RTS.  This is probably different
than the way IB works.  I.e., "disconnecting" in IB is an out-of-band
exchange done by the IBCM.  For iWARP, "disconnecting" is an in-band
operation (a TCP close or abort) so the QP cannot remain in RTS during
this process.


Let me make sure I understand:

- proc A calls del_procs on proc B
- proc A calls ibv_destroy_qp() on QP to proc B
- this causes a local (proc A) flush on all pending receives and SQ WRs
- this then causes a FLUSH event to show up *in proc B*
  --> I'm not clear on this point from Jon's/Steve's text
- OMPI [currently] treats the FLUSH in proc B as an error

Is that right?

What is the purpose of the FLUSH event?

There is a larger question regarding why the remote node is still  
polling the hca and not shutting down, but my immediate question is  
if it is an acceptable fix to simply disregard this "error" if it  
is an iWARP adapter.


If proc B is still polling the hca, it is likely because it simply has  
not yet stopped doing it.  I.e., a big problem in MPI implementations  
is that not all actions are exactly synchronous.  MPI disconnects are  
*effectively* synchronous, but we probably didn't *guarantee*  
synchronicity in this case because we didn't need it (perhaps until  
now).



Opinions?



If the openib btl (or the layers above) assume the "disconnect" will
notify the remote rank that the connection should be finalized, then  
we

must deal with FLUSHED WRs for the iwarp case.  If some sort of
"finalizing" is done by OMPI and then the connections disconnected,  
then
that "finalizing" should include not polling the CQ anymore.  But  
that's

not what we observe.



I'd have to check the exact shutdown sequence...

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Intel MPI Benchmark(IMB) using OpenMPI - Segmentation-fault error message.

2008-05-06 Thread Jeff Squyres

On May 1, 2008, at 10:43 AM, Lenny Verkhovsky wrote:

(a) I modified the make_mpich makefile in the IMB-3.1/src folder,
giving it the path to Open MPI. I am using the same mpirun built
from Open MPI (v1.2.5), and it is also set in PATH & LD_LIBRARY_PATH.


That should be fine.
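
For completeness, the usual sequence is roughly the following (paths are
examples only, and MPI_HOME is the variable make_mpich typically uses for
the MPI install prefix):

    export PATH=/opt/openmpi-1.2.5/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-1.2.5/lib:$LD_LIBRARY_PATH
    cd IMB-3.1/src
    # edit make_mpich so MPI_HOME points at the same Open MPI install
    make -f make_mpich
    mpirun -np 2 ./IMB-MPI1 PingPong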

(b) What is the command on console to run any new additional file  
with MPI API contents call. Do I need to add in Makefile.base of  
IMB-3.1/src folder or mentioning in console as a command it takes  
care alongwith "$mpirun IMB-MPI1"


I don't understand this question...  What exactly are you trying to  
do; modify the IMB benchmarks or write your own/new MPI application?


(c) Does IMB-3.1 need IB (InfiniBand) or TCP support to complete
its benchmark routines? That is, do I need to configure and build
Open MPI with the InfiniBand stack too?


IMB is a set of benchmarks that can be run on one or more machines.
It calls the MPI API, which does all the communication;
MPI decides how to run (IB, TCP, or shared memory) according to
priorities and all the possible ways to connect to another host.


Lenny is right; in general Open MPI will decide what is the best  
network stack to use to communicate with a peer MPI process.  So  
whether you build Open MPI with IB support and/or TCP support is up to  
you.  Generally, you want to build Open MPI with support for your high  
speed network (e.g., IB) and let Open MPI use it for off-node  
communication (OMPI will usually use shared memory for communication  
between processes on the same node).


--
Jeff Squyres
Cisco Systems



[OMPI devel] NO IP address found

2008-05-06 Thread Lenny Verkhovsky
Hi,

running the BW benchmark with btl_openib_max_lmc >= 2 causes a warning ( MPI from
the TRUNK ) 


 #mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca
btl_openib_max_lmc 2  ./mpi_p_LMC  -t bw -s 40
BW (40) (size min max avg)  40  321.493757  342.972837
329.493715

 #mpirun --bynode -np 40 -hostfile hostfile_ompi_arbel  -mca
btl_openib_max_lmc 3  ./mpi_p_LMC  -t bw -s 40
[witch9][[7493,1],7][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch2][[7493,1],0][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch10][[7493,1],9][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch6][[7493,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch4][[7493,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch7][[7493,1],5][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch2][[7493,1],10][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch9][[7493,1],17][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch5][[7493,1],3][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch8][[7493,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch6][[7493,1],14][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch10][[7493,1],19][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch5][[7493,1],13][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch4][[7493,1],12][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch9][[7493,1],27][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch5][[7493,1],23][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch2][[7493,1],20][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch9][[7493,1],37][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch7][[7493,1],35][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch4][[7493,1],32][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch4][[7493,1],22][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch5][[7493,1],33][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch2][[7493,1],30][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch8][[7493,1],16][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch7][[7493,1],15][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch10][[7493,1],39][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch7][[7493,1],25][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch10][[7493,1],29][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch6][[7493,1],34][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch8][[7493,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch6][[7493,1],24][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
[witch8][[7493,1],36][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:989:create_message]
No IP address found
BW (40) (size min max avg)  40  312.622582  334.037277
324.014814

using -mca btl openib,self causes a warning with LMC >= 10


Best regards
Lenny.